Music is often called the universal language, uniting people regardless of the language they speak. However, the meaning of song lyrics does not always carry across language barriers. AI can help: live speech translation is already facilitating tourism and business, and extending AI's capability to recreate songs with lyrics in another language would carry the power of those songs even further.
Song translation is a complex and challenging process, traditionally done manually. Nuances include matching the length of the translated phrase and ensuring pauses in the song do not fall in the middle of a sentence. For example, Disney avoids direct translations, often changing the meaning entirely so that songs sound appealing in the other language. In Figure 1, the Italian translation is very distant from the English. Achieving something similar with AI is far from easy. This two-part blog investigates how to reproduce the process using open-source ML models available today. Part 2 explores the challenges of moving a complex pipeline to Android.
The blog also details the path taken to compile a pipeline of machine learning models that translates songs from English to Mandarin. The aim is to port the pipeline to Android with all computation performed on device, so, where possible, the size and speed of the models are prioritised.
Figure 1: English to Italian translation of Let It Go from Frozen
Song translation is a hot topic, with singers reaching wider audiences by appealing to listeners in their native languages. Both Lauv and Westlife have released songs in languages they are not natively fluent in, using AI in the process. AI is used in different ways, from producing models trained on the cadence of the singer's voice to annotating the original song with the timings of each word. Although human intervention is still required to produce a good quality song, AI facilitates and speeds up the process, and as ML models improve, the need for human involvement should shrink. This highlights that the creation of translated songs can be broken down into clear steps: identifying the notes and lyrics of the original song, finding an accurate translation that matches its tone, and finally reproducing the singer's voice in a way that matches the cadence of the new language.
A collection of models used for Singing Voice Synthesis (SVS) and related tasks, many of them proprietary, is explored below.
Basic-pitch is a model developed by Spotify which converts songs into a list of notes in the MIDI format. The frequency range it infers, along with several other parameters, is configurable. The model detects notes well, though the output can be noisy, with many short notes or fluctuations in pitch.
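As a rough illustration, the snippet below is a minimal sketch of running basic-pitch from Python. The predict() helper, ICASSP_2022_MODEL_PATH and the threshold parameters come from the basic-pitch package; the file name and parameter values are only examples and would need tuning per song.

```python
# Minimal sketch of note extraction with Spotify's basic-pitch package.
# The file name and threshold values below are illustrative only.
from basic_pitch.inference import predict
from basic_pitch import ICASSP_2022_MODEL_PATH

model_output, midi_data, note_events = predict(
    "vocals.wav",              # hypothetical vocal stem
    ICASSP_2022_MODEL_PATH,
    onset_threshold=0.5,       # raising this suppresses spurious short notes
    frame_threshold=0.3,
    minimum_note_length=127.7, # milliseconds
)

# note_events is a list of (start_s, end_s, midi_pitch, amplitude, pitch_bends)
for start, end, pitch, amplitude, _ in note_events[:5]:
    print(f"{start:.2f}-{end:.2f}s  MIDI {pitch}  amplitude {amplitude:.2f}")
```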
LLark acts as an LLM tied to a song input: it takes a song and a text prompt and returns information about the song, for example a description or the tempo. The model can be overconfident and produce hallucinations; however, it is more performant than similar open-source alternatives such as ImageBind-LLM and Listen, Think and Understand.
SongSensAI is skilled at both annotating music and translating, providing both word-for-word and phrase translation. This shows the model understands both individual words and the nuances of the whole phrase.
SunoAI (Figure 2) is used to generate songs from a prompt. The user can specify a genre and whether they require a sung or instrumental version. SunoAI generates songs in multiple languages; the singing may sound slightly robotic but is overall very clear. It is one of several proprietary models with this ability; another popular choice is UdioAI.
Figure 2: SunoAI application
JukeboxAI generates singing when given a genre, artist and lyrics. The tune is generated at random, and even with known lyrics it avoids copying the original song's notes, so the required note sequence cannot be specified. Inference takes around 9 hours per minute of song, and the model only generates singing in English.
The research above shows that the input required for SVS depends on the chosen model. The first step in creating a song translation pipeline is therefore selecting the model that recreates the singing, as this determines the remaining models needed to generate its input.
Due to the inference time of JukeboxAI and the proprietary nature of SunoAI, a different model is required for this application. SunoAI provides an open-source model called Bark, a text-to-speech model which can sing and reproduce non-verbal sounds. The model supports 13 languages and comes in small and large versions, which is ideal for an Android application. However, in testing, the model performed better at generating speech than singing: it was reluctant to sing, and prompt engineering offered little success.
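For context, the sketch below shows the kind of Bark prompting that was tried. generate_audio, preload_models and SAMPLE_RATE are part of the bark package, and wrapping lyrics in ♪ markers is Bark's documented hint for song-like output; the lyric and the speaker preset here are only examples.

```python
# Sketch of prompting Bark to sing; in practice it often falls back to speech.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads the text/coarse/fine models on first use

# ♪ ... ♪ nudges Bark towards singing rather than speaking the lyric.
prompt = "♪ La la la, the sun will rise again ♪"
audio = generate_audio(prompt, history_prompt="v2/en_speaker_6")

write_wav("bark_out.wav", SAMPLE_RATE, audio)
```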
A proposed solution to encourage singing is voice cloning: passing a vocal sample to the model for it to mimic (bark-with-voice-clone), which may also improve the model's accent in the new language. Some success was achieved; nevertheless, the model is still better suited to clips of less than 10 seconds and offers no control over the pitch or timing of the singing.
As an alternative, there is a framework for Mandarin SVS called VISinger2. This allows the user to train an SVS model from an input comprised of the lyrics (as Chinese characters), their phonemes, the note pitches, the note durations, the phoneme durations and slur indicators. This input reflects the samples in the Opencpop dataset, which annotates 100 songs with the above data, allowing full control of the exact pitch of the singing.
Opencpop_visinger2 is a pretrained model that uses the ESPnet framework and builds on VISinger2. The model produced accurate and pleasant-sounding Mandarin which clearly reflected the notes it was given. The results are repeatable, and it supports singing of longer durations without hallucinating. The main limitations are that it produces only one note per phoneme and does not take the tones of the characters into account; this pipeline also does not feed the input characters or slur notes to the model.
As this model requires a list of notes and their durations, a model is needed to infer the pitch of a sung extract. Basic-pitch is highly configurable and provides an online tool illustrating its output. Post-processing is required to select a set of discrete, non-overlapping notes to pass to the SVS model. This is performed in Python, which assigns a note and duration to each phoneme generated by the translation model before the result is passed to the SVS model; a simplified sketch follows.
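The sketch below is a simplified, assumption-laden version of that post-processing step: it assumes note events in the (start, end, MIDI pitch, ...) form returned by basic-pitch and a flat list of pinyin phonemes, and simply pairs them one-to-one, since the SVS model only supports one note per phoneme.

```python
# Simplified post-processing sketch: keep non-overlapping notes and pair
# each one with a phoneme. The real pipeline's heuristics may differ.
def notes_to_phonemes(note_events, phonemes):
    # Sort by onset and drop any note that starts before the previous
    # kept note has finished, leaving a monotonic, non-overlapping melody.
    kept, last_end = [], 0.0
    for start, end, pitch, *_ in sorted(note_events, key=lambda n: n[0]):
        if start >= last_end:
            kept.append((start, end, pitch))
            last_end = end

    # Pair phonemes and notes one-to-one; any surplus on either side is
    # dropped, which is one source of the truncation mentioned later.
    return [
        {"phoneme": ph, "midi": pitch, "duration": end - start}
        for ph, (start, end, pitch) in zip(phonemes, kept)
    ]
```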
The quality of the MIDI generation and post-processing is highly dependent on the input song. Missed or extra notes often make the tune hard to recognise and are hard to filter out. Modifying the input parameters improves the output, but the best values also depend on the input song; there is no parameter selection that fits all songs.
Few multilingual translation models support Mandarin; thus, a model dedicated to a single language pair is required. Helsinki University has created models that translate between many combinations of languages, and to achieve the highest performance the English-to-Mandarin-only model is selected. There is a variant that supports 100 languages, which could easily be swapped in if an SVS model able to sing in multiple languages is found. The model translates most phrases; however, its translations are very literal and may sound strange to a native speaker. Furthermore, the model struggles with repetition, hallucinating when too much repetition is present.
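A minimal sketch of this translation step is shown below, using the Helsinki-NLP OPUS-MT English-to-Chinese checkpoint via the Hugging Face transformers library; the example lyric line is illustrative.

```python
# Sketch of the English-to-Mandarin translation step with OPUS-MT.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

lines = ["And at last I see the light"]      # example lyric line
inputs = tokenizer(lines, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```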
The output of the model is in Chinese characters; thus, a library is required to convert them to pinyin (the pronunciation of Mandarin characters written in the Latin alphabet). The Python library aptly named pinyin achieves this and also strips the tone marks, as the SVS model does not support them. A lookup table then splits the pinyin into phonemes.
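A small sketch of this conversion is shown below; pinyin.get() with format="strip" drops the tone marks, and the example characters and phoneme split are illustrative.

```python
# Sketch of the character-to-pinyin conversion used before phoneme splitting.
import pinyin

hanzi = "我终于看见了光"   # example output from the translation model
romanised = pinyin.get(hanzi, format="strip", delimiter=" ")
print(romanised)          # expected along the lines of "wo zhong yu kan jian le guang"

# Each syllable is then split into initial/final phonemes with a lookup
# table, e.g. "guang" -> ["g", "uang"], before being aligned with notes.
```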
As the chosen translation model is text-to-text, Whisper performs the speech-to-text, providing multiple models in a range of sizes. This allows the smallest model with acceptable accuracy to be selected. Whisper is supported in many frameworks; whisper.cpp is compatible with Android and offers high-performance inference. The model transcribes vocals with little degradation in quality, and the best performance was achieved with the small English-only model.
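On the laptop, the transcription step can be sketched with the reference openai-whisper package as below; "small.en" is the English-only small checkpoint mentioned above, and the file name assumes the vocals have already been separated.

```python
# Sketch of transcribing the separated vocals with openai-whisper.
import whisper

model = whisper.load_model("small.en")       # English-only small checkpoint
result = model.transcribe("vocals.wav")      # hypothetical separated vocal stem

print(result["text"])                        # full transcription
for segment in result["segments"]:           # per-segment timings
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}s {segment["text"]}')
```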
To improve the accuracy of the speech-to-text and remove harmonies from the MIDI generation, the vocals must be separated from the instruments. The instruments are later layered over the SVS output to recreate the original song. Vocal-remover was chosen for its ease of use and its ability to keep both outputs. The output quality is high, with little of the vocals present in the instrumental track, and the performance of the speech-to-text and song-to-MIDI models increases significantly when they are used in conjunction with vocal-remover.
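Vocal-remover is driven as a command-line tool; the sketch below assumes the interface described in its repository (an inference.py script that writes separate vocal and instrumental files next to the input), and the paths and flags may differ between versions.

```python
# Sketch of invoking vocal-remover from the pipeline (assumed CLI).
import subprocess

subprocess.run(
    ["python", "inference.py", "--input", "song.wav"],
    cwd="vocal-remover",   # hypothetical checkout of the repository
    check=True,
)
# Downstream steps then read the separated vocal file for transcription and
# MIDI extraction, and the instrumental file for the final mix.
```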
To layer the instruments behind the SVS output, a library is required to equalise the levels of the singing and instruments so that neither overpowers the other. The pydub Python library is used for this when running inference on a laptop.
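A minimal sketch of this mixing step with pydub is shown below; matching the average dBFS of the two tracks is one simple way to balance them, and the file names are illustrative.

```python
# Sketch of level-matching and mixing the SVS output with the instrumental.
from pydub import AudioSegment

vocals = AudioSegment.from_wav("svs_output.wav")          # synthesised singing
instruments = AudioSegment.from_wav("instrumental.wav")   # from vocal-remover

# Bring the singing to the same average loudness as the backing track.
vocals = vocals.apply_gain(instruments.dBFS - vocals.dBFS)

# Overlay the vocals on the instrumental and export the final song.
mixed = instruments.overlay(vocals)
mixed.export("translated_song.wav", format="wav")
```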
To illustrate how the models are coupled together, the pipeline is shown in Figure 3.
Figure 3: Final Pipeline illustrating input and output songs
Figure 3 also specifies the framework each model uses. Satisfying the library dependencies is complicated when running on both a laptop and Android, but with a careful balance of libraries the pipeline translates and sings a 30 second clip of the Disney song I See the Light.
The singing is clear and the results are reproducible. However, the pipeline cannot translate songs with multiple singers or harmonies, and it cannot modulate the pitch within a word, which limits the translation of songs where a word is carried over two notes. There are also some limitations in sound quality: tones are not represented in the input, and the translation can sometimes sound strange to a native speaker. The length of the translation often differs from the original, leading to some words being truncated if there are not enough notes, and pauses in the music may fall in the middle of a word. Nevertheless, the pipeline is a success.
Some preliminary timing results are shown in Table 1. The total pipeline time is 84 seconds, of which 83 seconds are spent performing inference. Over half of the inference time is spent separating the vocals and instruments, while the MIDI generation takes the least time.
Table 1: Results from translating a 30 second clip of music on a laptop
Improvements to the pipeline fall into three areas: quality, inference time and scope. Further post-processing could improve quality, for example by equalising the number of phonemes to the number of notes and avoiding pauses in the middle of two-character words, and a translation model better at handling full phrases could make the lyrics sound more natural. Inference time could be improved by increasing the batch size of the vocal separation model or by finding a smaller model. Finally, the scope could be widened by exploiting whisper.cpp's ability to translate speech in other languages to English, and by adopting BISinger, an SVS model supporting both English and Mandarin.
The next step is to transfer the pipeline to an Android app where inference happens on device, letting users translate songs from anywhere. This will be explored in Part 2 of this blog series.
This blog investigates state-of-the-art ML models powering the future of generative audio. Reviewing this work aided in building an open-source pipeline that translates 30 second clips of music from English to Mandarin. With the full pipeline working on a laptop, Part 2 will explore the steps taken to deploy it on Android, overcoming the challenges of working with ExecuTorch since its release last year.
Explore available resources in the Arm Community. The Automatic Speech Recognition with Wav2Letter using PyArmNN and Debian Packages Tutorial looks at how audio development platforms take advantage of accelerated ML processing. The On-device speaker identification for Digital Television (DTV) blog outlines how to choose the best processor for your audio DSP application.