Text-to-music generators are an important step towards user-friendly music creation: as with ChatGPT, simple text commands (prompts) tell the AI what music to produce. AIs are not only churning out new pieces of music en masse, but also voice clones of superstars that are indistinguishable from the originals. The most popular tools, MusicLM, Stable Audio, Riffusion and MusicGen, are discussed in more detail in this part of the blog series.

AI in the Music Industry – Part 13: Text-to-Music Generators: MusicLM, Stable Audio, Riffusion and MusicGen

Google took the next step in AI music creation when one of its research teams presented the MusicLM text-to-music generator in a scientific paper in January 2023.[1] The AI system is able to convert a text instruction (prompt) such as “Write me a calm violin melody accompanied by a distorted guitar riff” into sound. Melodies can also be whistled or hummed, and the AI recognises them and turns them into a piece of music. The paper uses numerous music examples to demonstrate the power of MusicLM: it outperforms existing text-to-music generators such as Riffusion or Dance Diffusion because Google trained its system on a much larger amount of data, and the generated music is considerably more complex in sound and of higher sonic quality. The system is even capable of translating images, such as a painting by Salvador Dalí, into music.[2] However, MusicLM was not immediately released to the public, because the 280,000 hours of music it was trained on raised copyright concerns: researchers found that around 1 per cent of the music generated by MusicLM was copied directly from existing, copyrighted songs.[3] On 10 May 2023, MusicLM was made publicly available in limited form as an app in Google’s AI Test Kitchen, but users must register, join a waiting list and agree that the data they generate may be used for further training of the AI.[4]

Another AI company went one step further with its text-to-music AI in 2023: London-based Stability AI, which had attracted a lot of public attention with its AI image generator Stable Diffusion, unveiled Stable Audio in September 2023. A year earlier, with little fanfare, Stability AI had released the text-to-music AI Dance Diffusion, one of the first systems to create short pieces of music from prompts.[5] Its performance was limited, however, so Stability AI followed up a year later with Stable Audio. Unlike its predecessor, Stable Audio is based on around 1.2 billion parameters and produces high-quality audio files using latent diffusion technology, which is explained in detail on the company’s website.[6] The system consists of a variational autoencoder (VAE) for audio, a text encoder and the diffusion model itself.[7] It does not work on raw audio samples but on latently encoded audio files, which allows for faster training. A convolutional neural network (CNN) is fed with a dataset of more than 800,000 audio files containing pieces of music, sound effects and individual instrument tracks, as well as text metadata. The music data, totalling 19,500 hours, comes from the data provider AudioSparx and is available royalty-free. The website also explains the training process and the operation of the text input system, the so-called CLAP model: without going into the technical details, CLAP ensures that the text features of a prompt carry the information needed to translate it into sound or music. The whole process is illustrated graphically on the website (fig. 1), and a simplified code sketch follows below the figure.

Figure 1: How Stable Audio Works

Source: Stability AI, “Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion”, September 13, 2023, accessed: 2024-04-29.
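
To make the architecture described above more tangible, here is a deliberately simplified Python sketch of a latent-diffusion text-to-audio pipeline. It is not Stability AI’s code: every function name, dimension and the toy “denoising” rule are illustrative assumptions that merely mirror the three stages described above (text encoder, diffusion in latent space, audio decoder).

```python
# Illustrative sketch of a latent-diffusion text-to-audio pipeline.
# All names, sizes and the toy denoising rule are assumptions for
# demonstration purposes, not Stability AI's implementation.
import numpy as np

LATENT_DIM = 64        # size of the compressed (latent) audio representation
AUDIO_LEN = 16_000     # one second of audio at 16 kHz in this toy example
DENOISE_STEPS = 50     # number of diffusion denoising iterations

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for a CLAP-style text encoder: maps a prompt to a vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(LATENT_DIM)

def denoise_step(latent: np.ndarray, text_emb: np.ndarray, step: int) -> np.ndarray:
    """Toy 'denoiser' that nudges the noisy latent towards the text embedding.
    A real system would use a trained neural network conditioned on the text."""
    weight = (step + 1) / DENOISE_STEPS
    return (1 - 0.1 * weight) * latent + 0.1 * weight * text_emb

def decode_audio(latent: np.ndarray) -> np.ndarray:
    """Stand-in for the audio decoder: expands the latent into raw samples."""
    t = np.linspace(0.0, 1.0, AUDIO_LEN)
    freqs = 110.0 * (1.0 + np.abs(latent[:8]))   # a few latent-driven partials
    return sum(np.sin(2 * np.pi * f * t) for f in freqs) / len(freqs)

def generate(prompt: str) -> np.ndarray:
    text_emb = encode_text(prompt)
    latent = np.random.default_rng(0).standard_normal(LATENT_DIM)  # start from pure noise
    for step in range(DENOISE_STEPS):
        latent = denoise_step(latent, text_emb, step)
    return decode_audio(latent)

audio = generate("calm piano with soft strings")
print(audio.shape)   # (16000,) -- one second of toy audio
```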

Stability AI wants to play in the same league as Google and Microsoft (OpenAI) with its AI systems, but needs venture capital to do so. Its founder and CEO, Emad Mostaque, is a former hedge fund manager who studied mathematics and computer science at Oxford University. In 2020, Mostaque set up Stability AI as an open-access project, owned and run by its employees. In October 2022, a consortium led by Coatue and Lightspeed Venture Partners invested US $101 million to expand the technical infrastructure and advance the AI projects. The backers, however, have limited influence, and Mostaque continues to control the board.[8]

Another AI start-up taking on the tech giants is Riffusion, which started in 2022 as a fun project by two software developers, Seth Forsgren and Hayk Martiros. Rather than turning text directly into audio, they built an AI system that generates music by way of images: a fine-tuned version of Stable Diffusion produces spectrograms from text prompts, which are then converted into sound.[9] The fun soon turned serious when investors came knocking to commercialise Riffusion. An investment consortium consisting of Greycroft, South Park Commons and Sky9 put up US $4 million to develop the text-to-music AI further. In an interview with TechCrunch, Seth Forsgren described how it works: “Users simply describe the lyrics and a musical style, and our model generates riffs complete with singing and custom artwork in a few seconds.”[10] However, Riffusion still lacks the capital it needs to grow further and compete with the tech giants.
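
The spectrogram detour behind Riffusion can be illustrated with standard audio tooling. The short Python sketch below (assuming the open-source librosa library is installed) turns a synthetic tone into a mel spectrogram “image” and reconstructs audio from it with the Griffin-Lim algorithm; Riffusion’s actual contribution, generating such spectrograms from text prompts with a fine-tuned Stable Diffusion model, is not reproduced here.

```python
# Round trip behind Riffusion's idea: audio -> spectrogram "image" -> audio.
# The generative step (producing spectrograms from text) is not shown.
import numpy as np
import librosa

sr = 22_050                                    # sample rate in Hz
t = np.linspace(0.0, 2.0, int(sr * 2.0), endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)     # two seconds of a 220 Hz tone

# Forward: raw audio to a mel spectrogram, a 2-D array that can be stored
# and manipulated exactly like an image.
mel = librosa.feature.melspectrogram(y=tone, sr=sr, n_mels=128)
print(mel.shape)                               # (n_mels, time_frames)

# Backward: approximate reconstruction of the waveform from the spectrogram
# using Griffin-Lim phase estimation.
reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
print(reconstructed.shape)
```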

Meta, Facebook’s parent company, on the other hand, has capital in abundance and presented the text-to-music generator MusicGen to the public in June 2023. Unlike Google’s MusicLM, MusicGen’s source code has been made public, allowing AI developers to extend and modify it. Like MusicLM and Stable Audio, MusicGen is a text-to-music generator, capable of producing approximately 12 seconds of music from a prompt. It has been trained on around 20,000 hours of music and sound files licensed from the media database operators Shutterstock and Pond5.[11] MusicGen is less powerful than the competing systems from Google and Stability AI, but it provides a good working basis for musicians and producers. All in all, we can expect MusicGen to become one of the features of the Metaverse.
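
Because MusicGen’s code and model weights are openly available, it can be tried out directly on one’s own machine. The following sketch assumes the Hugging Face transformers integration and the published facebook/musicgen-small checkpoint; prompt, token count and output handling are only examples.

```python
# Minimal MusicGen example via the Hugging Face transformers integration.
# Assumes: pip install transformers torch scipy, and that the
# facebook/musicgen-small checkpoint can be downloaded from the Hub.
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# A prompt in the same spirit as the examples discussed in this post.
inputs = processor(
    text=["calm violin melody accompanied by a distorted guitar riff"],
    padding=True,
    return_tensors="pt",
)

# Roughly 5 seconds of audio; more tokens yield longer clips.
audio_values = model.generate(**inputs, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate  # 32 kHz for MusicGen
scipy.io.wavfile.write(
    "musicgen_out.wav",
    rate=sampling_rate,
    data=audio_values[0, 0].cpu().numpy(),
)
```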

In summary, the generative music AI market is dominated by large tech companies such as Google (Magenta Studio, WaveNet, MusicLM), Microsoft/OpenAI (MuseNet) and Meta (MusicGen). However, AI start-ups such as Stability AI and Riffusion have also been able to position themselves in the market, challenging the technology giants with innovative approaches. The question is whether these small companies will be able to raise enough capital to compete with the billion-dollar corporations. They could well suffer the same fate as the AI pioneers DeepMind and OpenAI: DeepMind was eventually bought by Google, and OpenAI has become closely tied to Microsoft through multi-billion-dollar investments.


Endnotes

[1] Andrea Agostinelli et al., “MusicLM: Generating Music From Text”, github.io, January 28, 2023, accessed: 2024-04-29.

[2] TechCrunch, “Google created an AI that can generate music from text descriptions, but won’t release it”, January 27, 2023, accessed: 2024-04-29.

[3] Ibid.

[4] TechCrunch, “Google makes its text-to-music AI public”, May 10, 2023, accessed: 2024-04-29.

[5] TechCrunch, “AI music generators could be a boon for artists – but also problematic”, October 7, 2022, accessed: 2024-04-29.

[6] Stability AI, “Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion”, September 13, 2023, accessed: 2024-04-29.

[7] In this context, encoders are not hardware devices but neural-network components that compress their input (audio or text) into a compact numerical representation, a so-called latent code, which the diffusion model can process; a decoder reverses this step to reconstruct audio.

[8] TechCrunch, “Stability AI, the startup behind Stable Diffusion, raises $101M”, October 17, 2022, accessed: 2024-04-29.

[9] TechCrunch, “Try ‘Riffusion’, an AI model that composes music by visualizing it”, December 16, 2022, accessed: 2024-04-29.

[10] TechCrunch, “AI-generating music app Riffusion turns viral success into $4M in funding”, October 17, 2023, accessed: 2024-04-29.

[11] TechCrunch, “Meta open sources an AI-powered music generator”, June 12, 2023, accessed: 2024-04-29.
