
Google DeepMind has showcased the latest results of its generative AI video-to-audio research. The system combines what is seen on screen with the user’s written prompt to create audio synced to the video.
Called V2A (video-to-audio), the system can be paired with video-generation models such as Veo to create soundtracks, sound effects, and dialogue for on-screen action.
DeepMind also claims it can generate “an unlimited number of soundtracks for any video input” by steering the output with positive and negative prompts.
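DeepMind has not published an API for V2A, so as a purely hypothetical illustration of what prompt-conditioned generation might look like in practice, the sketch below loops over positive/negative prompt pairs to produce several different soundtracks for the same clip. Every name here, including generate_audio, is invented for the example, and the stub simply returns placeholder audio.

```python
# Hypothetical illustration only -- not DeepMind's API.
from typing import NamedTuple

import numpy as np


class PromptPair(NamedTuple):
    positive: str  # sounds to steer the output toward
    negative: str  # sounds to steer the output away from


def generate_audio(video: np.ndarray, prompts: PromptPair, seed: int) -> np.ndarray:
    """Stand-in for a V2A-style model: returns a placeholder waveform."""
    rng = np.random.default_rng(seed)
    num_samples = video.shape[0] * 1_600  # e.g. 16 kHz audio for a 10 fps clip
    return rng.standard_normal(num_samples).astype(np.float32)


video = np.zeros((100, 224, 224, 3), dtype=np.float32)  # dummy 10 s clip at 10 fps

# Different prompt pairs (and seeds) yield different soundtracks for one video.
variants = [
    PromptPair("cinematic orchestral score, tense strings", "dialogue, crowd noise"),
    PromptPair("rain on pavement, distant thunder", "music"),
]
soundtracks = [generate_audio(video, p, seed=i) for i, p in enumerate(variants)]
print([s.shape for s in soundtracks])
```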
The system works by encoding the video input into a compressed representation, then iteratively refining the audio from noise, guided by the user’s text prompt and the visual input. The refined audio is then decoded into a waveform, which can be recombined with the video input.
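DeepMind has not released implementation details beyond that description, but the general shape of such a pipeline can be sketched with placeholder components: a toy “encoder” that compresses the frames, a refinement loop that starts from pure noise and repeatedly nudges a latent toward the video-and-prompt conditioning, and a toy “decoder” that emits a waveform. None of this is the actual model; it only mirrors the steps described above.

```python
# Toy sketch of a video-to-audio pipeline; every component is a placeholder.
import numpy as np

rng = np.random.default_rng(0)


def encode_video(frames: np.ndarray) -> np.ndarray:
    """Placeholder encoder: compress video frames into a small feature vector."""
    return frames.mean(axis=(1, 2, 3))  # one crude feature per frame


def refine(latent: np.ndarray, video_feats: np.ndarray, prompt: str) -> np.ndarray:
    """Placeholder refinement step: nudge the noisy latent toward the conditioning.

    A real model would be conditioned on both the video features and the text
    prompt; here the prompt is accepted but unused, and the "conditioning" is
    just the video features tiled to the latent's length.
    """
    target = np.resize(video_feats, latent.shape)
    return latent + 0.1 * (target - latent)  # small step toward the target


def decode_audio(latent: np.ndarray) -> np.ndarray:
    """Placeholder decoder: map the refined latent to a waveform in [-1, 1]."""
    return np.tanh(latent).astype(np.float32)


frames = rng.standard_normal((100, 224, 224, 3))  # dummy 10 s clip at 10 fps
video_feats = encode_video(frames)                # 1. encode/compress the video

latent = rng.standard_normal(16_000 * 10)         # 2. start from pure noise (16 kHz, 10 s)
for _ in range(50):                               # 3. iterative refinement, guided by
    latent = refine(latent, video_feats, "footsteps on gravel")  # video + prompt

waveform = decode_audio(latent)                   # 4. decode to a waveform,
print(waveform.shape)                             #    ready to be recombined with the video
```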
The user isn’t required to manually sync the audio and video tracks; the system does it automatically.
The DeepMind team said, “By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes while responding to the information provided in the annotations or transcripts.”
The system isn’t entirely flaw-free yet. First, the output audio quality depends on the fidelity of the video input; second, the system can falter when video artifacts or distortions are present.
DeepMind acknowledged that syncing generated dialogue with characters’ on-screen lip movements is still a challenge as well.
“V2A attempts to generate speech from the input transcripts and synchronize it with characters’ lip movements. But the paired video-generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.”
The team also revealed the system still has to undergo “rigorous safety assessments and testing” before it’s released to the public.
Stability AI also released a similar product last week, and ElevenLabs released its sound effects tool last month.