Silent No More: OpenAI Sora Gets a Voice with ElevenLabs

When watching videos generated by OpenAI’s Sora, do you feel like something is missing?

Don’t these videos resemble early silent films?

Yet even silent films were never completely silent. During screenings, a band or pianist in the cinema played music that followed the plot, heightening the emotion and carrying the story forward.

Now, AI voice cloning startup ElevenLabs has filled this gap by adding realistic background sounds to Sora videos.

ElevenLabs releases AI Sound Effects trailer

In its AI Sound Effects trailer, ElevenLabs used several popular Sora videos to showcase the capabilities of the new model.

Every sound in this one-minute video is generated from text prompts: footsteps on a busy city street, crashing waves, the rhythmic clicking of a moving train, a lively New Year’s Day crowd, the mechanical noises of a futuristic robot, and human voices in a Hollywood-style promotional video.

ElevenLabs says it is developing a new product that generates sounds from scene descriptions supplied by users, adding sound effects to originally silent video clips. Adding sound to Sora videos was just a test run. The trailer drew plenty of praise after its release, though some critics said the AI-synthesized sounds lacked “love” and “detail.”

Application scenarios of AI sound effects

Fully AI-generated video, from tools such as Sora, Runway, and Pika, is on the rise. The clips look realistic but lack background audio. This is where ElevenLabs’ new model comes in: it lets users create sound effects for video content by describing the sounds they want.

ElevenLabs says its text-to-sound-effects model is not yet ready for release, but once it launches, it will help content creators produce immersive audio, including footsteps, waves, and ambient sounds.

Some text-to-sound-effects models are already on the market, but they are usually built around music AI models, such as myEdit, AudioGen, and Stability AI’s Stable Audio. Beyond AI-generated content, sounds from ElevenLabs’ new model could be applied to any video that needs background sound effects, such as Instagram videos, commercials, or video game trailers.

Challenges of generating AI sound effects

Although the sound effects are all generated from text prompts, producing a convincing simulation is not easy. The system needs to learn from both the text and the video pixels at the same time.

NVIDIA AI scientist Jim Fan also noticed ElevenLabs’ new product. He pointed out that an end-to-end Transformer needs to figure out a lot of things to correctly simulate sound effects, such as:

  • The category, material, and spatial location of each object
  • Whether an impact lands on a wooden or metal surface, and how fast
  • What kind of spatial environment the scene is in

He wrote: “We don’t have such a high-quality AI audio engine yet.”

ElevenLabs: A unicorn in the AI voice field

ElevenLabs was founded in 2022 by former Google machine learning engineer Piotr Dabkowski and former Palantir deployment strategist Mati Staniszewski. Since then, the company has launched AI-powered text-to-speech software and AI dubbing tools that automatically translate speech in videos into over 20 languages, “preserving the original tone and style.” Earlier this year, the company became an AI unicorn after raising $80 million in Series B funding.

The future of AI sound effects

ElevenLabs’ new model may give the company a first-mover advantage, but several other companies active in the AI voice space could also enter this field, including well-known players such as MURF.AI, Play.ht, and WellSaid Labs.

In the future, we should see more and more tools that can analyze video content and automatically add fitting sound effects.

One of the dreams of generative AI is to be able to create complete and comprehensive content from a single prompt. With the advancement of technologies such as text-to-sound effects, AI video, and synthetic speech, we are gradually approaching this dream.

