If you upload a picture or type a few words into an AI program, it can give you back a whole video with sound, a song, or even a detailed story. You probably thought we were not there yet. But it looks like we are, because Microsoft has introduced a groundbreaking new AI model that takes content generation to the next level, and they call it CoDi.
What is CoDi?
CoDi stands for Composable Diffusion. It's a multimodal AI model that can simultaneously process and generate content
across multiple modalities, such as language, image, video, or audio. Unlike existing
generative AI systems, it can generate multiple modalities in parallel, and its
input is not limited to a subset of modalities like text or image. CoDi can
take any combination of input modalities and generate any combination of output modalities even if they are not
present in the training data.
Imagine you can just type in a few words, or upload an image or a sound clip, and get a whole video, a song, or a story out of it. You can mix and match different inputs and outputs to create something completely new and original. The possibilities are enormous. CoDi is the latest work of Microsoft's Project i-Code, which aims to develop integrative and composable multimodal AI. It's a groundbreaking achievement that could transform the way humans interact with computers on various tasks, including-
1. Assistive technology
2. Custom learning tools
3. Ambient computing
4. Content generation.
How does CoDi do these tasks?
It's not magic. It's a pretty
complicated science. So I'll try to explain it in a simple way. CoDi uses a technique called diffusion
models to generate content.
Diffusion models are a type of generative model
that learn to reverse a diffusion process that gradually adds noise to the data
until it becomes random. For example, if you have an image of a cat, you can
add some noise to it until it becomes unrecognizable. Then you can train a
model to remove the noise and reconstruct the original image. Diffusion models
have been shown to be very effective for generating high quality images. But CoDi
takes diffusion models to the next level by extending them to multiple
modalities and making them composable.
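The noising-and-denoising idea above can be illustrated with a tiny sketch. Everything here is invented for illustration (the linear noise schedule, the 8x8 "image"); real diffusion models use carefully tuned schedules and a trained neural network for the reverse step:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(x0, t, T=1000):
    """Mix the clean data x0 with Gaussian noise.
    At t=0 the data is untouched; near t=T it is almost pure noise."""
    alpha_bar = 1.0 - t / T                      # toy linear schedule
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# An 8x8 array of pixel values stands in for the cat image.
image = rng.uniform(0.0, 1.0, size=(8, 8))

slightly_noisy = forward_noise(image, t=100)     # still mostly the image
almost_random = forward_noise(image, t=990)      # mostly noise
```

A trained diffusion model learns to run this process backwards: starting from something like `almost_random`, it predicts and removes a little noise at each step until an image re-emerges.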
What does composable mean?
Here, composable means that CoDi can
combine different diffusion models for different modalities into one unified
model that can generate any-to-any outputs.
For example, CoDi can combine a diffusion model for text with a diffusion model
for image and a diffusion model for audio into one model that can generate text from image, image from text, audio from text, text from audio,
image from audio, audio from image or any combination of these. It does this by learning a shared
diverse space for all modalities. This means that CoDi maps all modalities into
a common representation that preserves their diversity and uniqueness. For
example, it can map an image of a cat and a sentence describing the cat into
the same space, while keeping them distinct from each other. To learn this shared diverse space, CoDi
uses two components-
1. Latent diffusion models (LDMs)
2. Many-to-many generation techniques
Latent diffusion models (LDMs)
LDMs are diffusion models that learn to map each modality into a
latent space that is independent of the modality type. For example, an LDM can map an
image of a cat into a vector of numbers that represents its features. LDMs
allow CoDi to handle different modalities in a consistent way.
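As a rough sketch of that idea, here is a toy version in which two invented "encoders" (just fixed random projections; real LDM encoders are trained neural networks) map an image and an audio clip into the same latent dimensionality:

```python
import numpy as np

rng = np.random.default_rng(42)
LATENT_DIM = 16   # all modalities share this latent size (toy choice)

# Hypothetical stand-in encoders: each flattens its input and applies a
# fixed random projection into the shared latent dimensionality.
image_proj = rng.standard_normal((64, LATENT_DIM))   # 8x8 image -> 16-d
audio_proj = rng.standard_normal((100, LATENT_DIM))  # 100-sample clip -> 16-d

def encode_image(img):
    return img.reshape(-1) @ image_proj

def encode_audio(clip):
    return clip @ audio_proj

cat_image = rng.uniform(size=(8, 8))
meow_clip = rng.uniform(size=100)

z_img = encode_image(cat_image)   # latent vector, shape (16,)
z_aud = encode_audio(meow_clip)   # latent vector, shape (16,)
```

Because both latents have the same shape, downstream components can treat an encoded image and an encoded audio clip uniformly, which is the consistency the text describes.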
Many-to-many generation techniques
These techniques enable CoDi to generate any output modality from any
input modality. For example, CoDi can use cross-attention generators to generate text from image or image
from text by attending to the relevant features in both modalities. It can also use
environment translators to generate video from text or audio by translating the
input modality into an environment representation that captures its dynamics. By
combining LDMs and many-to-many generation techniques, CoDi can learn a shared
diverse space that enables composable generation across multiple modalities.
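The cross-attention mechanism mentioned above can be sketched in plain numpy. The feature matrices here are random stand-ins; in the real model, the queries, keys, and values come from trained encoders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Queries from one modality attend over features of another.
    Shapes: queries (Tq, d), keys (Tk, d), values (Tk, dv)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (Tq, dv) blended features

rng = np.random.default_rng(1)
text_feats = rng.standard_normal((5, 32))    # 5 text tokens as queries
image_feats = rng.standard_normal((49, 32))  # 7x7 grid of image patches
out = cross_attention(text_feats, image_feats, image_feats)
# out: one image-informed vector per text token, shape (5, 32)
```

Each text token ends up with a mixture of the image features it found most relevant, which is what lets one modality condition generation in another.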
What can CoDi do with these abilities?
CoDi can take in single or multiple prompts
including video, image, text, or audio, and generate multiple aligned outputs like
video with accompanying sound.
What can CoDi generate from different inputs?
For example, it can take text, image, and audio input and produce video and audio output. Imagine you have a text
prompt that says "teddy bear on a skateboard, 4K, high resolution", and you also
have an image of a teddy bear and a sound of a skateboard. What do you think CoDi
will generate from these inputs?
It will generate a video of a teddy bear on a skateboard, with the sound of the skateboard, in 4K resolution and high quality. CoDi can also take text as the only input prompt and produce video and audio output.
Let's say you have a text prompt that
says "fireworks in the sky". CoDi will generate a video and an audio output that
match the text input. For example, it might generate a video of fireworks in the sky with
the sound of fireworks. Now let's try something more complex. Imagine you only
have a text input but you want to generate three outputs: text, audio, and image.
For a text prompt that says "sea shore sound ambiance", CoDi will generate
three outputs that are related to the text input. It might generate another
text description that says "waves crash the shore, seagulls", an audio output
that has the sound of the sea shore, and an image output that shows the sea shore.
Why is an AI model like CoDi important?
It's because it breaks the boundaries
between modalities and enables more natural and holistic human-computer
interaction. CoDi can help us create dynamic and engaging content that can
appeal to multiple senses and emotions. It can also help us access information
and express ourselves in different ways that suit our preferences and needs. For
instance, it can help create assistive technology for people with disabilities. It can
generate captions for videos or images for people who are deaf or hard of
hearing. It can also generate audio descriptions or text summaries for people
who are blind or visually impaired. It can even generate sign language videos or
images for people who use sign language as their primary mode of communication.
Another example is education. It can
create special tools to help different types of learners, adjusting content to match
what a learner knows and wants to achieve. Moreover, it can tailor content to
learners' interests and preferences.
CoDi is also affordable and easy to access. It does not require expensive hardware or software to run. It is available as an Azure cognitive service that anyone can use through an API or a web interface. It is also scalable and flexible: it can handle any combination of modalities and generate any-to-any outputs. You can also fine-tune CoDi to better suit particular domains and tasks. To sum up, CoDi is a revolutionary AI model that can generate anything from anything, all at once, through composable diffusion. It is leading us into a new era of generative AI that can enrich our lives and experiences.

Source: Microsoft Research Blog