Stable Diffusion 3 and Stable Video Diffusion: Major Announcements from Stability AI

Overnight, Stability AI made two major announcements. First, it released an early preview of Stable Diffusion 3, the next generation of its text-to-image model. At the same time, its video generation platform, Stable Video Diffusion (SVD), officially entered public beta, positioning it as a likely competitor to Sora.

Stable Diffusion 3 Released

Stability AI has unveiled an early preview of Stable Diffusion 3 (SD3), the next generation of its text-to-image model. With model sizes ranging from 800M to 8B parameters, SD3 is claimed to be the company’s “most powerful text-to-image model to date.”

A detailed technical report is not yet available. According to the official release, SD3 adopts a new Diffusion Transformer (DiT) architecture, similar to Sora, combined with Flow Matching (proposed by researchers at Meta AI and the Weizmann Institute of Science in 2022), which greatly improves performance on multi-subject prompts, image quality, and spelling ability.

Official announcement: “Announcing Stable Diffusion 3” (Stability AI)

Key Technical Innovations and Improvements

  • New Transformer Architecture: SD3 adopts a diffusion transformer (DiT) architecture, similar to Sora, which gives the model stronger image generation capabilities.
  • Utilizing Transformer Improvements: SD3 takes advantage of recent advances in Transformer architectures, allowing it to handle more complex and diverse data and to accept multimodal inputs (video, images), providing greater flexibility and accuracy in understanding and generating image content.
  • Flow Matching and Other Enhancements: The model also incorporates Flow Matching, a simulation-free training objective that learns a continuous path from noise to data rather than the standard diffusion objective, along with other technical improvements that further enhance the quality and diversity of generated images (a minimal sketch of the objective follows this list).
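
Stability AI has not yet published SD3’s exact formulation, but the basic flow-matching objective is public. Below is a minimal sketch of one training step in PyTorch, using the linear (rectified-flow style) interpolation path; `velocity_model` is a hypothetical stand-in for whatever denoising network (e.g., a diffusion transformer) is being trained, not SD3’s actual code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x_data):
    """One training step of conditional flow matching with a linear
    (rectified-flow style) path between data and noise.

    velocity_model(x_t, t) is assumed to predict the velocity dx_t/dt.
    This is an illustrative sketch, not SD3's actual training code.
    """
    batch = x_data.shape[0]
    x_noise = torch.randn_like(x_data)                   # x_1 ~ N(0, I)
    t = torch.rand(batch, device=x_data.device)          # t ~ U(0, 1)
    t_exp = t.view(batch, *([1] * (x_data.dim() - 1)))   # broadcast over image dims

    # Linear path from data (t=0) to noise (t=1); its velocity is constant.
    x_t = (1.0 - t_exp) * x_data + t_exp * x_noise
    target_velocity = x_noise - x_data

    pred_velocity = velocity_model(x_t, t)
    return F.mse_loss(pred_velocity, target_velocity)
```

Compared with the usual noise-prediction diffusion loss, the network here regresses a constant velocity along a straight path between data and noise, which is one reason flow-matching models can often sample well in fewer steps.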

Performance Enhancements

  • Multi-Subject Prompts: The new model better understands prompts that contain multiple subjects or elements, so users can describe more complex scenes in a single prompt and the model can generate images that more accurately match those descriptions (a hedged usage sketch follows the examples below).
[SD3 sample image: an astronaut riding a pig, with a pink umbrella and a robin]

Prompt: a painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner are the words “stable diffusion”.

  • Image Accuracy and Quality: SD3 has significantly improved image quality, including finer detail representation, more accurate color matching, and more natural light and shadow processing. These improvements make the generated images more realistic and better capture the user’s creative intent.
[SD3 sample image: a chameleon on a black background]

Prompt: studio photograph closeup of a chameleon over a black background

  • Spelling and Text Rendering: This version is better at handling text that appears directly in images (such as slogans, numbers, and labels), more accurately recognizing and rendering the wording given in user prompts, even against complex visual backgrounds.
[SD3 sample image: an embroidered “good night” cloth with a baby tiger]

Prompt: Resting on the kitchen table is an embroidered cloth with the text ‘good night’ and an embroidered baby tiger. Next to the cloth there is a lit candle. The lighting is dim and dramatic.
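
SD3 is only an early preview, and no weights or public API have been released alongside this announcement. Assuming it eventually ships through the diffusers library like earlier Stable Diffusion models, running the prompts above might look roughly like the sketch below; the pipeline class and checkpoint name here are assumptions, not something confirmed by Stability AI in this release.

```python
import torch
from diffusers import StableDiffusion3Pipeline  # assumed pipeline class

# Checkpoint name is an assumption; use whatever SD3 weights Stability AI publishes.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = (
    "a painting of an astronaut riding a pig wearing a tutu holding a pink "
    "umbrella, on the ground next to the pig is a robin bird wearing a top hat, "
    'in the corner are the words "stable diffusion"'
)

# Step count and guidance scale are illustrative defaults, not official settings.
image = pipe(prompt, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("astronaut_pig.png")
```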

Stable Video Diffusion

In addition to SD3, Stability AI has also launched a public beta of its Stable Video Diffusion (SVD) platform, which generates videos from uploaded images and text prompts (a local-inference sketch using the open-source SVD model follows the feature list below).

Key Features

  • High-quality video generation
  • Support for camera movement control
  • 150 free credits per day, allowing for 15 video generations
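
The beta itself is a web platform, and this announcement does not describe a programmatic interface. The underlying open-source SVD checkpoints can already be run locally through the diffusers library, though; a minimal image-to-video sketch is shown below. Note that the open model is image-conditioned only, so the text prompting and camera-motion control offered by the hosted platform are not reproduced here.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Open-source SVD checkpoint (image-to-video); the hosted beta may behave differently.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = pipe.to("cuda")

# Conditioning image; SVD expects roughly 1024x576 input.
image = load_image("input_frame.png").resize((1024, 576))

frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```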

Community Involvement in Data Labeling

An interesting aspect of SVD is how the platform collects labeled data from the community. While a video is being generated, a pop-up shows two community-generated videos and asks the user to pick which one is better. In addition, after a video is generated, a rating bubble lets users upvote or downvote the result.
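
Stability AI has not said how this feedback is stored or used, but pairwise comparisons and up/down votes like these are the standard raw material for preference-based fine-tuning. Purely as an illustration, one feedback event might be recorded with a schema like the following (all field names are hypothetical, not Stability AI’s actual design):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceRecord:
    """Illustrative schema for one piece of community feedback.

    Pairwise choices ("which of these two videos is better?") and
    single-video up/down votes share the same record type.
    """
    user_id: str
    prompt: str
    video_a_id: str
    video_b_id: Optional[str] = None   # None for single-video ratings
    preferred: Optional[str] = None    # "a" or "b" for pairwise choices
    vote: Optional[int] = None         # +1 / -1 for up/down votes

# Example: a user preferred video "vid_123" over "vid_456" for the same prompt.
record = PreferenceRecord(
    user_id="u_001",
    prompt="a cat surfing a wave at sunset",
    video_a_id="vid_123",
    video_b_id="vid_456",
    preferred="a",
)
```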

[Screenshot: SVD prompting users to compare and rate generated videos]

Conclusion

Both SD3 and SVD represent significant advancements in the field of AI image and video generation. With continuous model iteration and community involvement, it is expected that the quality and capabilities of these models will continue to improve, leading to even more exciting possibilities in the future.

