If you upload a picture or type a few words into an AI program, it can give you back a whole video with sound, a song, or even a detailed story. You probably thought we weren't there yet. But it looks like we are, because Microsoft has introduced a groundbreaking new AI model that takes content generation to the next level, and they call it CoDi.
What is CoDi?
CoDi stands for Composable Diffusion. It's a multimodal AI model that can simultaneously process and generate content across multiple modalities, such as language, image, video, or audio. Unlike existing generative AI systems, it can generate multiple modalities in parallel, and its input is not limited to a subset of modalities like text or image. CoDi can take any combination of input modalities and generate any combination of output modalities, even when those combinations are not present in the training data.
Imagine you can just type in a few words, or upload an image or a sound clip, and get a whole video, a song, or a story out of it. You can mix and match different inputs and outputs to create something completely new and original. There are so many possibilities here. CoDi is the latest work of Microsoft's Project i-Code, which aims to develop integrative and composable multimodal AI. It's a groundbreaking achievement that could transform the way humans interact with computers on various tasks, including:
1. Assistive technology
2. Custom learning tools
3. Ambient computing
4. Content generation
How does CoDi do these tasks?
It's not magic. It's a pretty complicated science. So I'll try to explain it in a simple way. CoDi uses a technique called diffusion models to generate content.
Diffusion models are a type of generative model that learn to reverse a diffusion process that gradually adds noise to the data until it becomes random. For example, if you have an image of a cat, you can add some noise to it until it becomes unrecognizable. Then you can train a model to remove the noise and reconstruct the original image. Diffusion models have been shown to be very effective for generating high quality images. But CoDi takes diffusion models to the next level by extending them to multiple modalities and making them composable.
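The cat-image example above can be sketched in a few lines. This is a toy illustration of the diffusion math, not CoDi's actual code: the noise schedule, step count, and array shapes are all assumptions, and a real model would learn to predict the noise rather than reuse the true noise as we do here.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, t, num_steps=1000):
    """Add Gaussian noise to clean data x0 at step t.

    Uses the standard closed form: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps,
    where a_bar is the cumulative product of the noise schedule.
    """
    betas = np.linspace(1e-4, 0.02, num_steps)    # assumed linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def reconstruct(xt, eps_pred, t, num_steps=1000):
    """Invert the forward step given a (predicted) noise sample."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    return (xt - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

x0 = rng.standard_normal((8, 8))      # stand-in for the cat image
xt, eps = forward_diffuse(x0, t=999)  # by the last step, almost pure noise
# A trained model would predict eps from (xt, t); here we reuse the true eps
# to show that a perfect noise prediction recovers the original data.
x0_hat = reconstruct(xt, eps, t=999)
print(np.allclose(x0, x0_hat))        # True
```

In practice the denoiser is a large neural network trained to predict `eps` from the noisy input, and generation runs the reverse process step by step from pure noise.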
What does composable mean?
Here, composable means that CoDi can combine different diffusion models for different modalities into one unified model that can generate any-to-any outputs. For example, CoDi can combine a diffusion model for text with a diffusion model for image and a diffusion model for audio into one model that can generate text from image, image from text, audio from text, text from audio, image from audio, audio from image, or any combination of these. It does this by learning a shared diverse space for all modalities. This means that CoDi maps all modalities into a common representation that preserves their diversity and uniqueness. For example, it can map an image of a cat and a sentence describing the cat into the same space, but also keep them distinct from each other. To learn this shared diverse space, CoDi uses two components:
1. Latent diffusion models (LDMs)
2. Many to many generation techniques
Latent diffusion models (LDMs)
LDMs are diffusion models that learn to map each modality into a latent space that is independent of the modality type. For example, an LDM can map an image of a cat into a vector of numbers that represents its features. LDMs allow CoDi to handle different modalities in a consistent way.
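As a rough sketch of that idea (not CoDi's real encoders): below, random linear projections stand in for learned LDM encoders, showing how raw inputs of very different sizes can all land in one shared latent space. The latent width and input dimensions are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM = 16  # assumed shared latent width; the real model's size differs

def make_encoder(input_dim):
    """Stand-in 'encoder': a fixed random projection into the shared space.

    A real LDM encoder is a learned neural network (e.g. a VAE encoder
    for images); the point here is only that every modality ends up as a
    vector of the same size.
    """
    W = rng.standard_normal((input_dim, LATENT_DIM)) / np.sqrt(input_dim)
    return lambda x: x @ W

encode_image = make_encoder(64 * 64)  # flattened 64x64 image
encode_text  = make_encoder(300)      # e.g. a 300-d text embedding
encode_audio = make_encoder(128)      # e.g. a 128-bin spectrogram frame

z_img = encode_image(rng.standard_normal(64 * 64))
z_txt = encode_text(rng.standard_normal(300))

# Both latents live in the same 16-d space, so one diffusion model can
# denoise them regardless of which modality they came from.
print(z_img.shape, z_txt.shape)  # (16,) (16,)
```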
Many to many generation techniques
Many-to-many generation techniques enable CoDi to generate any output modality from any input modality. For example, CoDi can use cross-attention generators to generate text from image, or image from text, by attending to the relevant features in both modalities. It can also use environment translators to generate video from text or audio by translating the input modality into an environment representation that captures its dynamics. By combining LDMs and many-to-many generation techniques, CoDi can learn a shared diverse space that enables composable generation across multiple modalities.
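Cross-attention itself is a standard, well-documented mechanism, and a minimal version shows how features of one modality attend to another. The feature sizes below are arbitrary, and real models add learned projections and multiple heads; this is a sketch of the mechanism, not CoDi's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values from another (e.g. text tokens attending to image patches)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (n_q, n_kv) similarities
    weights = softmax(scores, axis=-1)      # each query's mix over the other modality
    return weights @ values                 # fused features, one row per query

rng = np.random.default_rng(2)
text_feats  = rng.standard_normal((5, 16))  # 5 text tokens, 16-d features
image_feats = rng.standard_normal((9, 16))  # 9 image patches, 16-d features

fused = cross_attention(text_feats, image_feats, image_feats)
print(fused.shape)  # (5, 16): each text token now carries image information
```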
What can CoDi do with these abilities?
CoDi can take in single or multiple prompts, including video, image, text, or audio, and generate multiple aligned outputs, like video with accompanying sound.
What can CoDi generate from different inputs?
For example, it can take text, image, and audio input and provide video and audio output. Imagine you have a text prompt that says "teddy bear on a skateboard, 4K, high resolution", and you also have an image of a teddy bear and a sound of a skateboard. What do you think CoDi will generate from these inputs?
It will generate a video of a teddy bear on a skateboard with the sound of the skateboard and the video will be in 4K resolution and high quality. Then it can take text as an input prompt to produce video and audio output.
Let's say you have a text prompt that says "fireworks in the sky". CoDi will generate a video and an audio output that match the text input: it might generate a video of fireworks in the sky with the sound of fireworks. Now let's try something more complex. Imagine you only have a text input, but you want to generate three outputs: text, audio, and image. For a text prompt that says "sea shore sound ambiance", CoDi will generate three outputs related to the text input. It might generate another text description that says "waves crash the shore, seagulls", an audio output with the sound of the sea shore, and an image output that shows the sea shore.
Why is an AI model like CoDi important?
It's because it breaks the boundaries between modalities and enables more natural and holistic human-computer interaction. CoDi can help us create dynamic and engaging content that appeals to multiple senses and emotions. It can also help us access information and express ourselves in different ways that suit our preferences and needs. For instance, it can help create accessible technology for people with limitations, such as generating captions for videos or images for people who are deaf or hard of hearing. It can also generate audio descriptions or text summaries for people who are blind or visually impaired. It can even generate sign language videos or images for people who use sign language as their primary mode of communication.
Another example is education. It can create special tools to help different types of learners, adjusting content to match what a learner knows and wants to achieve. Moreover, it can make content that suits learners' interests and likes.
CoDi is also affordable and easy to access. It does not require expensive hardware or software to run. It is available as an Azure cognitive service that anyone can use through an API or a web interface. It is also scalable and flexible, and can handle any combination of modalities and generate any-to-any outputs. You can also adjust and tweak CoDi to suit particular areas and tasks. To sum up, CoDi is a revolutionary AI model that can generate anything from anything, all at once, through composable diffusion. It is leading us into a new era of generative AI that can enrich our lives and experiences.

Source: Microsoft Research Blog