Music’s power to unite people is a profound human experience. Imagine harnessing this power through machines to bridge language barriers and bring people even closer together. Part 1 of this blog post series explored the creation of a Machine Learning (ML) pipeline able to translate songs from English to Mandarin to do just that.
This blog post will explore the challenges of porting such a complex pipeline to Android, with insight into the key design choices that facilitate the process. Here you will learn all the steps required to convert and connect five different ML models to run on Android, so you can watch your mobile sing in another language.
The pipeline in its current state requires certain modifications before it can run on an Android phone. First, the code is written in Python, which cannot run natively in an Android app. Frameworks such as Chaquopy can bridge this gap, but for an example with this much pre- and post-processing the speed of C++ is preferred. Second, the ML models must be converted to a framework that is supported on Android. These frameworks optimize the size and latency of the model by using a FlatBuffers format, helping them run on devices with restricted memory and compute resources such as mobile phones.
The following graphic shows the final pipeline developed in Part 1 along with the framework of each model. To port the pipeline to Android, TensorFlow models require conversion to the TFLite framework and PyTorch models must be converted to ExecuTorch. The ggml framework is implemented in plain C, which allows it to run directly on edge devices. The Marian framework is more complex and beyond the scope of this blog post, as it is a C++ library that is not currently optimized for mobile; it does have TensorFlow and PyTorch implementations which could potentially be converted to TFLite/ExecuTorch.
Figure 1: Final Pipeline illustrating input and output songs
Due to the large number of frameworks, it is beneficial to call each model from a unified structure. Whisper.cpp (the chosen speech-to-text model) is written in C++ and called from a Java Native Interface (JNI) bridge, a way of interfacing C++ code with Java. This bridge allows precompiled C++ functions to be called from the Java or Kotlin code present in an Android app. The following graphic shows how the JNI bridge works. The bridge is formed by a jni.c file which lists the declarations of the C++ functions that will be visible to the Java or Kotlin code. Passing large amounts of data over this bridge is slow, so it is beneficial to perform all processing on one side of it. In this case, calling all the models from C++ and reading and saving the audio files directly on the device, so that no audio data is passed over the bridge, is the most efficient solution. This also has the added benefit of portability and improved speed.

Figure 2: JNI Bridge implementation
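To make Figure 2 concrete, below is a minimal sketch of what the bridge boundary can look like. The package, class and function names (PipelineBridge, runFullPipeline, songpipeline) and the run_pipeline helper are placeholders for illustration, not the exact code used in this project.

```cpp
#include <jni.h>
#include <string>

// Hypothetical parent function implemented in the native library: it reads the
// input song from device storage, runs every model and writes the result.
int run_pipeline(const std::string& input_path, const std::string& output_path);

// Exported with C linkage so the JVM can resolve it by name; the
// Java_<package>_<class>_<method> name must match the Kotlin/Java declaration.
extern "C" JNIEXPORT jint JNICALL
Java_com_example_songtranslator_PipelineBridge_runFullPipeline(
        JNIEnv* env, jobject /* this */, jstring inputPath, jstring outputPath) {
    const char* in = env->GetStringUTFChars(inputPath, nullptr);
    const char* out = env->GetStringUTFChars(outputPath, nullptr);

    // Only two short path strings cross the bridge; all audio stays native.
    const int status = run_pipeline(in, out);

    env->ReleaseStringUTFChars(inputPath, in);
    env->ReleaseStringUTFChars(outputPath, out);
    return status;
}

// Matching Kotlin side (for reference):
//   class PipelineBridge {
//       external fun runFullPipeline(inputPath: String, outputPath: String): Int
//       companion object { init { System.loadLibrary("songpipeline") } }
//   }
```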
To implement the JNI bridge, several steps are needed; these are explored in the following sections.
The only model in TensorFlow is basic-pitch, the song-to-MIDI model. Basic-pitch consists of a pretrained model followed by Python post-processing code. The repository contains three pretrained models, including both TensorFlow and TFLite versions, and the TFLite model is easily selected by changing a flag in the code.
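In the Android port, the TFLite model itself is invoked from the C++ side of the bridge. The sketch below shows the standard TensorFlow Lite C++ interpreter flow under simplified assumptions; the model filename is a placeholder, and in practice the interpreter would be created once and reused rather than rebuilt per call.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Runs one window of audio through the basic-pitch TFLite model and returns
// its output tensors for post-processing. "basic_pitch.tflite" is a placeholder path.
std::vector<std::vector<float>> run_basic_pitch(const std::vector<float>& audio_window) {
    auto model = tflite::FlatBufferModel::BuildFromFile("basic_pitch.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);  // model must outlive interpreter
    interpreter->AllocateTensors();

    // Copy the audio window into the model's single input tensor.
    std::copy(audio_window.begin(), audio_window.end(),
              interpreter->typed_input_tensor<float>(0));
    interpreter->Invoke();

    // Collect the output posteriorgrams for the C++ post-processing step.
    std::vector<std::vector<float>> outputs;
    for (size_t i = 0; i < interpreter->outputs().size(); ++i) {
        const TfLiteTensor* t = interpreter->output_tensor(i);
        const float* data = interpreter->typed_output_tensor<float>(i);
        outputs.emplace_back(data, data + t->bytes / sizeof(float));
    }
    return outputs;
}
```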
The post-processing must be translated to C++. This processing takes the three outputs generated by the pretrained model and creates a list of notes with their start and end times in MIDI format. Instead of creating a MIDI file with the final result, which would require extra libraries, the notes and their timings were kept in a vector of tuples containing the note information. This data can be manipulated directly to obtain the non-overlapping list of filtered notes required by the Singing Voice Synthesis (SVS) model.
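To illustrate, here is a simplified sketch of how such a vector of note tuples can be filtered into a non-overlapping list. The greedy truncation shown is an assumption for illustration only; the actual basic-pitch post-processing also applies onset and frame thresholds before this step.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// (start_time_s, end_time_s, midi_pitch) — a simplified stand-in for the
// note information kept in a vector of tuples instead of a MIDI file.
using Note = std::tuple<double, double, int>;

// Greedy overlap removal: sort by start time and truncate any note that
// runs past the start of the next one.
std::vector<Note> remove_overlaps(std::vector<Note> notes) {
    std::sort(notes.begin(), notes.end(),
              [](const Note& a, const Note& b) { return std::get<0>(a) < std::get<0>(b); });
    for (size_t i = 0; i + 1 < notes.size(); ++i) {
        const double next_start = std::get<0>(notes[i + 1]);
        if (std::get<1>(notes[i]) > next_start) {
            std::get<1>(notes[i]) = next_start;  // truncate the overlapping note
        }
    }
    return notes;
}
```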
Both the vocal separation (vocal remover) and Singing Voice Synthesis (SVS) models are in PyTorch with neither containing a lowered ExecuTorch model. The vocal-remover repository contains a pretrained PyTorch model as well as Python code to pre-process the input audio file and post-process the single output of the model into two audio files.
The conversion of the PyTorch model is achieved following this tutorial. Unlike TFLite, ExecuTorch allows you to specify a delegate, for example XNNPACK, during the conversion. XNNPACK is a library of highly optimized neural network operators that speeds up inference. Note that lowering the model to a delegate increases the memory and time required for conversion.
Some minor changes were needed to allow the model to fully convert. For example, to resolve an unsupported operator error, the AdaptiveAvgPool2d operator was replaced with torch.mean(). Additionally, to improve inference speed, the to_edge_transform_and_lower() method was used in the conversion script following this example as the ExecuTorch tutorial is slightly outdated.
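The conversion script itself is written in Python; on the device, the resulting .pte file is then loaded and run from C++. Below is a minimal sketch using ExecuTorch's Module extension, with the file path and input shape as placeholders (the exact API may vary between ExecuTorch versions).

```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using executorch::extension::Module;
using executorch::extension::from_blob;

// Run one spectrogram through the lowered vocal-remover model.
// The .pte path and the {1, channels, bins, frames} shape are placeholders.
void run_vocal_remover(float* spectrogram, int channels, int bins, int frames) {
    Module module("/data/local/tmp/vocal_remover_xnnpack.pte");

    // Wrap the existing buffer without copying it.
    auto input = from_blob(spectrogram, {1, channels, bins, frames});

    // Execute the model's forward method.
    auto result = module.forward(input);
    if (result.ok()) {
        // The first output tensor holds the model's prediction, which the C++
        // post-processing turns into the vocal and instrumental signals.
        const float* out = result->at(0).toTensor().const_data_ptr<float>();
        (void)out;
    }
}
```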
Translating the pre- and post-processing code to C++ was quite complex due to the manipulation of large tensors. Reading the sound file requires extra libraries, which causes some variation between the C++ and Python outputs; these libraries are incomplete, so some functions had to be written manually.
The second PyTorch model to convert is for SVS. This model consists of ESPnet, a framework that can train models for a range of audio tasks, and opencpop_visinger2, a file containing the pretrained weights for the model. The versatility of the framework unfortunately means that there is a lot of redundant code. The framework expects to parse a large dataset of text files containing the training data, which it manipulates into separate files. This can be simplified and translated to C++ along with the preprocessing for the model.
The model conversion follows the same script as before. However, the model is very complex, with many layers and dynamic inputs. The dynamic inputs cause many errors during conversion because the export tool cannot guard data-dependent expressions; these errors can often be fixed by adding size checks to the model.
The whisper.cpp model for speech-to-text is easy to move to Android, as the repository's examples folder contains whisper.android, the code for a basic Kotlin app. By adding the whisper.cpp repository as a package to your app, you can then call functions from the jni.c file that is provided. The fullTranscribe method allows you to provide an array of audio samples to transcribe. One thing to note is that whisper expects audio sampled at 16 kHz, which requires resampling the input audio files, as most music is sampled at 44.1 kHz.
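As an illustration, the resampling step can be done with a naive linear-interpolation resampler like the sketch below. This is an assumption for clarity rather than the exact code used; a production app would typically use a band-limited resampler or a dedicated library to avoid aliasing.

```cpp
#include <cstddef>
#include <vector>

// Naive linear-interpolation resampler, e.g. 44100 Hz -> 16000 Hz for whisper.
std::vector<float> resample(const std::vector<float>& in, double src_rate, double dst_rate) {
    if (in.empty() || src_rate <= 0.0 || dst_rate <= 0.0) return {};
    const double ratio = src_rate / dst_rate;
    const size_t out_len = static_cast<size_t>(in.size() / ratio);
    std::vector<float> out(out_len);
    for (size_t i = 0; i < out_len; ++i) {
        const double pos = i * ratio;           // fractional position in the input
        const size_t idx = static_cast<size_t>(pos);
        const double frac = pos - idx;
        const float a = in[idx];
        const float b = (idx + 1 < in.size()) ? in[idx + 1] : in[idx];
        out[i] = static_cast<float>(a * (1.0 - frac) + b * frac);
    }
    return out;
}

// Usage: auto mono16k = resample(mono44k, 44100.0, 16000.0);
```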
To join all the models together, a parent C++ function is required that calls each model and passes data between them. The structure of the jni.c file provided by whisper.cpp can be reproduced to call this parent function. A CMake file is needed to pull together all the libraries and source files as well as the jni.c file. The parent function can then be called from anywhere in the app, for example when a button is pressed.
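A simplified sketch of what such a parent function can look like is shown below. The helper functions and their signatures are hypothetical, and the ordering simply follows the pipeline from Part 1; the key point is that every intermediate result stays on the native side, so nothing large ever crosses the JNI bridge.

```cpp
#include <string>
#include <tuple>
#include <vector>

using Note = std::tuple<double, double, int>;  // as in the earlier post-processing sketch

// Hypothetical helpers, one per stage; each wraps one of the converted models
// (TFLite, ExecuTorch or ggml) together with its C++ pre/post-processing.
std::vector<float> load_audio(const std::string& path);
void separate_vocals(const std::vector<float>& mix, std::vector<float>& vocals,
                     std::vector<float>& accompaniment);               // vocal remover
std::string transcribe_lyrics(const std::vector<float>& vocals);       // whisper.cpp
std::string translate_lyrics(const std::string& english_lyrics);       // translation model
std::vector<Note> extract_notes(const std::vector<float>& vocals);     // basic-pitch + post-processing
std::vector<float> synthesize_vocals(const std::string& mandarin_lyrics,
                                     const std::vector<Note>& notes);  // SVS model
void mix_and_save(const std::vector<float>& vocals,
                  const std::vector<float>& accompaniment,
                  const std::string& output_path);

// Parent function called from the JNI bridge: runs every stage in sequence,
// reading the input song from disk and saving the result, so only the two
// file paths ever cross the bridge.
int run_pipeline(const std::string& input_path, const std::string& output_path) {
    auto mix = load_audio(input_path);

    std::vector<float> vocals, accompaniment;
    separate_vocals(mix, vocals, accompaniment);

    auto lyrics_en  = transcribe_lyrics(vocals);
    auto lyrics_zh  = translate_lyrics(lyrics_en);
    auto notes      = extract_notes(vocals);

    auto new_vocals = synthesize_vocals(lyrics_zh, notes);
    mix_and_save(new_vocals, accompaniment, output_path);
    return 0;
}
```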
Figure 3: Song translation App UI
All apps need a fun UI. Figure 3 shows one of the fragments created to allow the user to select the song and the input/output languages. Currently the user only has one option, but the fragment is useful for displaying this information and is reusable as new language support is added. The other fragment follows a similar style and allows the user to play back the generated song. When the generate button is pressed the pipeline is called, reading the input song that has already been stored on the device and then saving the new file for playback.
There is a lot to be learned when converting a pipeline as complex as this to Android. The use of multiple frameworks increases the learning curve: time must be spent learning how to use each one and ensuring their packages are compatible during compilation. With five models, finding high-quality models small enough to run on Android is challenging, and it cannot be assumed that all will use the same framework; however, it is a good idea to keep this in mind when a choice is available. Additionally, ExecuTorch is still changing frequently as it is in beta, which means there are fewer online resources. More resources will become available as ExecuTorch is adopted more widely.
Another design choice to evaluate is the use of C++ to improve performance. C++ lacks many of the libraries that Java and Python have, which makes the translation process more challenging. Additionally, there is less documentation for interfacing with TFLite and ExecuTorch models in C++, making failures harder to diagnose.
This blog post outlines the steps required to port an innovative song translation ML pipeline, introduced in Part 1, to Android. Based on this work, several key considerations are shared for selecting ML frameworks, enabling readers to make informed design choices. The use of ML models in music production is increasing, though not without controversy. The challenges faced in using and porting these models are highlighted in both blogs, with the expectation that ongoing advancements in models, frameworks, and tools will soon simplify the creation of new applications to meet the high demand and interest in music.
Now that you know how to develop apps to make the most of ML, learn about how to make your app’s memory secure with the Memory Tagging Extension.