Researchers have created a new machine learning tool that can turn audio clips into a scarily realistic lip-synced video of a person speaking. Experts at the University of Washington developed a new algorithm that converts the audio files of an individual's speech into realistic mouth shapes and movements and then grafts them onto the head of the person in another existing video.
The team used the tool to successfully generate realistic, lip-synced videos of former US president Barack Obama talking about terrorism, job creation, fatherhood and various other topics using audio clips of his previous speeches, weekly video addresses and even decades-old interviews.
The researchers said they chose Obama because there are thousands of hours of presidential video in the public domain for the neural network to learn from.
The system uses a neural network that was trained to watch videos of a person talking and translate different audio files into realistic mouth shapes.
Using a new mouth synthesis technique combined with previous research from the university's Graphics and Imaging Laboratory team, the researchers then superimposed and blended these mouth shapes onto an existing video of that person. The technology also allowed a small time shift so that the neural network could anticipate what the speaker would say next.
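One way to picture that time shift is as a pairing of each mouth shape with audio from slightly later in the clip, so that the model has already "heard" a little of what comes next before it commits to a shape. The following Python sketch is purely illustrative and is not the researchers' code: the function name `make_delayed_pairs`, the `delay` parameter and the string placeholders for audio and mouth frames are all assumptions made for this example.

```python
def make_delayed_pairs(audio_frames, mouth_shapes, delay):
    """Pair the audio at frame t with the mouth shape at frame t - delay.

    Hypothetical helper: with delay > 0, the target mouth shape for a given
    moment is matched with audio from `delay` frames later, which gives a
    model a short look ahead at the upcoming sound.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(audio_frames) != len(mouth_shapes):
        raise ValueError("audio and mouth sequences must have the same length")

    pairs = []
    for t in range(delay, len(audio_frames)):
        pairs.append((audio_frames[t], mouth_shapes[t - delay]))
    return pairs


# Toy data: five audio frames and five mouth shapes (placeholders only).
audio = ["a0", "a1", "a2", "a3", "a4"]
mouth = ["m0", "m1", "m2", "m3", "m4"]
pairs = make_delayed_pairs(audio, mouth, delay=2)
```

With a delay of two frames, the first mouth shape `m0` is paired with audio frame `a2`, and the last two mouth shapes are left without a pair because no later audio exists for them.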
For the Obama videos, the team said the system required around 14 hours of footage to learn from. However, they aim to have the algorithm recognize a person's voice and speech patterns with just an hour of reference video in the future.
Researchers said the tool could be used to improve video calls or even lead to more advanced algorithms that can determine if a video is real or fake.
"In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models," Ira Kemelmacher-Shlizerman, an assistant professor at the UW's Paul G. Allen School of Computer Science & Engineering, said in a statement.
Since streaming audio over the internet takes up less bandwidth, researchers say the new system could put an end to those annoying, glitchy video chats that time out due to poor connections.
"When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good," co-author and Allen School professor Steve Seitz said. "So if you could use the audio to produce much higher-quality video, that would be terrific."
Previous audio-to-video conversion processes involved filming multiple people in a studio saying the same sentences over and over to capture how a particular sound correlates with different mouth shapes, a time-consuming, tedious and often expensive undertaking.
"People are particularly sensitive to any areas of your mouth that don't look realistic," lead author Supasorn Suwajanakorn said. "If you don't render teeth right or the chin moves at the wrong time, people can spot it right away and it's going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley."
The researchers' system, however, can train on videos found "in the wild" online or anywhere else. Given the growing concerns over fake news spreading online, the researchers pointed out that their program only works with audio from the actual speaker.
"You can't just take anyone's voice and turn it into an Obama video," Seitz said. "We very consciously decided against going down the path of putting other people's words into someone's mouth. We're simply taking real words that someone spoke and turning them into realistic video of that individual."