Stability AI unveils ‘Stable Audio’ model for controllable audio generation

Stability AI has introduced “Stable Audio,” a latent diffusion model designed to revolutionise audio generation.

The model combines text metadata, audio duration, and start time conditioning to give users control over both the content and the length of generated audio, to the point of creating complete songs.

Audio diffusion models have traditionally been limited to generating audio of a fixed duration, which often leads to abrupt and incomplete musical phrases. This is primarily because the models are trained on random audio chunks cropped from longer files and then forced into predetermined lengths.

Stable Audio effectively tackles this historic challenge, enabling the generation of audio with specified lengths, up to the training window size.

One of the standout features of Stable Audio is its use of a heavily downsampled latent representation of audio, which makes inference much faster than working on raw audio. Using recent diffusion sampling techniques, the flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under a second on an NVIDIA A100 GPU.
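To get a feel for why a downsampled latent helps, the short sketch below counts the raw samples in a 95-second stereo clip at 44.1 kHz and compares that with the length of a latent sequence. The downsampling factor of 1024 is an assumption chosen for illustration only; the article does not state the actual ratio.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the article
DURATION_S = 95           # from the article
CHANNELS = 2              # stereo

# Hypothetical value: the article only says "heavily downsampled".
ASSUMED_DOWNSAMPLE_FACTOR = 1024


def frames_per_channel(duration_s, sample_rate_hz):
    """Number of audio frames (time steps) in one channel."""
    return duration_s * sample_rate_hz


def latent_length(num_frames, downsample_factor):
    """Time steps left after downsampling, rounded up."""
    return math.ceil(num_frames / downsample_factor)


frames = frames_per_channel(DURATION_S, SAMPLE_RATE_HZ)
raw_samples = frames * CHANNELS
latent_steps = latent_length(frames, ASSUMED_DOWNSAMPLE_FACTOR)

print(f"raw frames per channel: {frames}")
print(f"raw samples (stereo):   {raw_samples}")
print(f"latent time steps:      {latent_steps}")
```

Under that assumed factor, the diffusion model would work on a sequence of a few thousand time steps rather than several million raw samples.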

A sound foundation

The core architecture of Stable Audio comprises a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model.

The VAE plays a pivotal role by compressing stereo audio into a noise-resistant, lossy latent encoding that significantly expedites both generation and training processes. This approach, based on the Descript Audio Codec encoder and decoder architectures, facilitates encoding and decoding of arbitrary-length audio while ensuring high-fidelity output.

To harness the influence of text prompts, Stability AI utilises a text encoder derived from a CLAP model specially trained on their dataset. This enables the model to imbue text features with information about the relationships between words and sounds. These text features, extracted from the penultimate layer of the CLAP text encoder, are integrated into the diffusion U-Net through cross-attention layers.
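As a rough illustration of how cross-attention lets audio positions draw on text features, here is a minimal single-head scaled dot-product sketch in plain Python. It is not Stability AI's implementation: the learned query, key, and value projections and the multi-head structure of a real layer are omitted, and all names and vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted mix of values.

    queries: one vector per audio latent position
    keys, values: one vector per text feature
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy inputs: two audio positions attending over three text features.
audio_queries = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(audio_queries, text_features, text_features)
print(attended)
```

Each output row is a blend of the text features, weighted by how well they match the corresponding audio position.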

During training, the model learns to incorporate two key properties from audio chunks: the starting second (“seconds_start”) and the total duration of the original audio file (“seconds_total”). These properties are transformed into discrete learned embeddings per second, which are then concatenated with the text prompt tokens. This unique conditioning allows users to specify the desired length of the generated audio during inference.
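The timing conditioning described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the actual training code: the embedding tables are random stand-ins for learned ones, and `WINDOW_SECONDS`, `MAX_SECONDS`, `EMBED_DIM`, and the helper function names are hypothetical. Only `seconds_start` and `seconds_total` come from the article.

```python
import random

WINDOW_SECONDS = 95   # assumed training window, borrowed from the 95 s figure
MAX_SECONDS = 512     # hypothetical size of each per-second embedding table
EMBED_DIM = 4         # toy embedding width


def make_embedding_table(num_entries, dim, rng):
    """Stand-in for a learned table: one vector per whole second."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_entries)]


def sample_chunk_timing(file_seconds, window_seconds, rng):
    """Choose a random crop and report (seconds_start, seconds_total)."""
    latest_start = max(0, file_seconds - window_seconds)
    seconds_start = rng.randint(0, latest_start)
    return seconds_start, file_seconds


def build_conditioning(text_tokens, seconds_start, seconds_total,
                       start_table, total_table):
    """Append the two timing embeddings to the text token features."""
    start_vec = start_table[min(seconds_start, len(start_table) - 1)]
    total_vec = total_table[min(seconds_total, len(total_table) - 1)]
    return text_tokens + [start_vec, total_vec]


rng = random.Random(0)
start_table = make_embedding_table(MAX_SECONDS, EMBED_DIM, rng)
total_table = make_embedding_table(MAX_SECONDS, EMBED_DIM, rng)
text_tokens = make_embedding_table(3, EMBED_DIM, rng)  # pretend text features

# Training: timing values come from where the chunk was cropped.
seconds_start, seconds_total = sample_chunk_timing(180, WINDOW_SECONDS, rng)
train_cond = build_conditioning(text_tokens, seconds_start, seconds_total,
                                start_table, total_table)

# Inference: the user supplies the values, e.g. a 30-second clip from 0 s.
infer_cond = build_conditioning(text_tokens, 0, 30, start_table, total_table)

print(len(train_cond), seconds_start, seconds_total)
```

Because the model sees where each training chunk sat within its source file, a user can ask for a given length at inference time simply by choosing the two timing values.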

The diffusion model at the heart of Stable Audio boasts a staggering 907 million parameters and leverages a sophisticated blend of residual layers, self-attention layers, and cross-attention layers to denoise the input while considering text and timing embeddings. To enhance memory efficiency and scalability for longer sequence lengths, the model incorporates memory-efficient implementations of attention.

To train the flagship Stable Audio model, Stability AI curated an extensive dataset comprising over 800,000 audio files encompassing music, sound effects, and single-instrument stems. This dataset, furnished in partnership with AudioSparx, a prominent stock music provider, amounts to 19,500 hours of audio.
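A quick back-of-the-envelope calculation puts those dataset figures in perspective. It uses only the two numbers quoted above and treats "over 800,000" as exactly 800,000, so the result is an upper estimate of the mean file length and says nothing about how lengths are distributed.

```python
TOTAL_HOURS = 19_500
NUM_FILES = 800_000  # "over 800,000", so the true mean is somewhat lower


def mean_file_seconds(total_hours, num_files):
    """Average duration per file, in seconds."""
    return total_hours * 3600 / num_files


mean_seconds = mean_file_seconds(TOTAL_HOURS, NUM_FILES)
print(f"mean file length: about {mean_seconds} seconds")
```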

Stable Audio represents the vanguard of audio generation research, emerging from Stability AI’s generative audio research lab, Harmonai. The team remains dedicated to advancing model architectures, refining datasets, and enhancing training procedures. Their pursuit encompasses elevating output quality, fine-tuning controllability, optimising inference speed, and expanding the range of achievable output lengths.

Stability AI has hinted at forthcoming releases from Harmonai, teasing the possibility of open-source models based on Stable Audio and accessible training code.

This announcement follows a string of noteworthy stories about Stability. Earlier this week, Stability joined seven other prominent AI companies in signing the White House's voluntary AI safety pledge as part of the pledge's second round.

You can try Stable Audio for yourself here.



Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it’s geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)
