Sunday, May 18, 2025
LBNN
  • Business
  • Markets
  • Politics
  • Crypto
  • Finance
  • Energy
  • Technology
  • Taxes
  • Creator Economy
  • Wealth Management
  • Documentaries
No Result
View All Result
LBNN

Stability AI unveils ‘Stable Audio’ model for controllable audio generation

Simon Osuji by Simon Osuji
September 14, 2023
in Artificial Intelligence
0
Stability AI unveils ‘Stable Audio’ model for controllable audio generation
0
SHARES
0
VIEWS
Share on FacebookShare on Twitter

Related posts

21 Best High School Graduation Gifts (2025)

21 Best High School Graduation Gifts (2025)

May 18, 2025
How the Signal Knockoff App TeleMessage Got Hacked in 20 Minutes

How the Signal Knockoff App TeleMessage Got Hacked in 20 Minutes

May 18, 2025


Stability AI has introduced “Stable Audio,” a latent diffusion model designed to revolutionise audio generation.

This breakthrough promises to be another leap forward for generative AI and combines text metadata, audio duration, and start time conditioning to offer unprecedented control over the content and length of generated audio—even enabling the creation of complete songs.

Audio diffusion models traditionally faced a significant limitation in generating audio of fixed durations, often leading to abrupt and incomplete musical phrases. This was primarily due to the models being trained on random audio chunks cropped from longer files and then forced into predetermined lengths.

Stable Audio effectively tackles this historic challenge, enabling the generation of audio with specified lengths, up to the training window size.

One of the standout features of Stable Audio is its use of a heavily downsampled latent representation of audio, resulting in vastly accelerated inference times compared to raw audio. Through cutting-edge diffusion sampling techniques, the flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under a second utilising the power of an NVIDIA A100 GPU.

A sound foundation

The core architecture of Stable Audio comprises a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model.

The VAE plays a pivotal role by compressing stereo audio into a noise-resistant, lossy latent encoding that significantly expedites both generation and training processes. This approach, based on the Descript Audio Codec encoder and decoder architectures, facilitates encoding and decoding of arbitrary-length audio while ensuring high-fidelity output.

To harness the influence of text prompts, Stability AI utilises a text encoder derived from a CLAP model specially trained on their dataset. This enables the model to imbue text features with information about the relationships between words and sounds. These text features, extracted from the penultimate layer of the CLAP text encoder, are integrated into the diffusion U-Net through cross-attention layers.

During training, the model learns to incorporate two key properties from audio chunks: the starting second (“seconds_start”) and the total duration of the original audio file (“seconds_total”). These properties are transformed into discrete learned embeddings per second, which are then concatenated with the text prompt tokens. This unique conditioning allows users to specify the desired length of the generated audio during inference.

The diffusion model at the heart of Stable Audio boasts a staggering 907 million parameters and leverages a sophisticated blend of residual layers, self-attention layers, and cross-attention layers to denoise the input while considering text and timing embeddings. To enhance memory efficiency and scalability for longer sequence lengths, the model incorporates memory-efficient implementations of attention.

To train the flagship Stable Audio model, Stability AI curated an extensive dataset comprising over 800,000 audio files encompassing music, sound effects, and single-instrument stems. This rich dataset, furnished in partnership with AudioSparx – a prominent stock music provider – amounts to a staggering 19,500 hours of audio.

Stable Audio represents the vanguard of audio generation research, emerging from Stability AI’s generative audio research lab, Harmonai. The team remains dedicated to advancing model architectures, refining datasets, and enhancing training procedures. Their pursuit encompasses elevating output quality, fine-tuning controllability, optimising inference speed, and expanding the range of achievable output lengths.

Stability AI has hinted at forthcoming releases from Harmonai, teasing the possibility of open-source models based on Stable Audio and accessible training code.

This latest groundbreaking announcement follows a string of noteworthy stories about Stability. Earlier this week, Stability joined seven other prominent AI companies that signed the White House’s voluntary AI safety pledge as part of its second round.

You can try Stable Audio for yourself here.

(Photo by Eric Nopanen on Unsplash)

Want to learn more about AI and big data from industry leaders? Check out AI & Big Data Expo taking place in Amsterdam, California, and London. The comprehensive event is co-located with Digital Transformation Week.

Explore other upcoming enterprise technology events and webinars powered by TechForge here.

  • Ryan Daws

    Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it’s geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

    View all posts

Tags: ai, artificial intelligence, audio generation, clap model, generative ai, harmonai, latent diffusion, Model, stability ai, stable audio



Source link

Previous Post

How to Destroy Metal Credit Card?

Next Post

I’m Thrilled to Relocate to America for my Master’s with My Family

Next Post
I’m Thrilled to Relocate to America for my Master’s with My Family

I'm Thrilled to Relocate to America for my Master's with My Family

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

RECOMMENDED NEWS

Women’s health faces growing headwinds, despite jump in venture investment

Women’s health faces growing headwinds, despite jump in venture investment

5 days ago
Nigerian entrepreneur adapting to a tough business environment

Nigerian entrepreneur adapting to a tough business environment

2 weeks ago
Shibarium Reaches Utility Milestone as Launch Nears

Shibarium Reaches Utility Milestone as Launch Nears

2 years ago
Municipalities enter the climate crisis chat at built environment indaba

Municipalities enter the climate crisis chat at built environment indaba

5 months ago

POPULAR NEWS

  • Ghana to build three oil refineries, five petrochemical plants in energy sector overhaul

    Ghana to build three oil refineries, five petrochemical plants in energy sector overhaul

    0 shares
    Share 0 Tweet 0
  • When Will SHIB Reach $1? Here’s What ChatGPT Says

    0 shares
    Share 0 Tweet 0
  • Matthew Slater, son of Jackson State great, happy to see HBCUs back at the forefront

    0 shares
    Share 0 Tweet 0
  • Dolly Varden Focuses on Adding Ounces the Remainder of 2023

    0 shares
    Share 0 Tweet 0
  • US Dollar Might Fall To 96-97 Range in March 2024

    0 shares
    Share 0 Tweet 0
  • Privacy Policy
  • Contact

© 2023 LBNN - All rights reserved.

No Result
View All Result
  • Home
  • Business
  • Politics
  • Markets
  • Crypto
  • Economics
    • Manufacturing
    • Real Estate
    • Infrastructure
  • Finance
  • Energy
  • Creator Economy
  • Wealth Management
  • Taxes
  • Telecoms
  • Military & Defense
  • Careers
  • Technology
  • Artificial Intelligence
  • Investigative journalism
  • Art & Culture
  • Documentaries
  • Quizzes
    • Enneagram quiz
  • Newsletters
    • LBNN Newsletter
    • Divergent Capitalist

© 2023 LBNN - All rights reserved.