A research paper published by Google has revealed details of a new text-to-speech system which engineers claim is near-perfect at recreating human voices.
Known as Tacotron 2, the system is a 'neural network architecture' which synthesises speech from written text using two separate engines. The first engine turns the text into a spectrogram, a visual chart of audio frequencies over time. This data is then fed into a second AI engine, known as WaveNet, which reads the chart and generates the corresponding speech.
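For readers curious about what such a chart contains, here is a minimal sketch of a spectrogram in plain Python. It is not Google's code and does not reproduce either engine: the function names (`tone`, `spectrogram`, `peak_frequency`) and all parameter values are assumptions made for illustration, and it works in the opposite direction to Tacotron 2, measuring the frequencies in an existing sound rather than predicting them from text. The paper also uses a mel-scale spectrogram, whereas this sketch uses evenly spaced frequency bins for simplicity.

```python
import cmath
import math


def tone(freq_hz, sample_rate, num_samples):
    """Generate a pure sine tone as a list of samples (a stand-in for recorded speech)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(num_samples)]


def spectrogram(samples, frame_size=64, hop=32):
    """Chart how much energy each frequency has in each short slice of time.

    Returns a list of frames; each frame is a list of magnitudes,
    one per frequency bin from 0 Hz up to half the sample rate.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window fades the edges of the slice to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform; slow, but fine for a small demo.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def peak_frequency(frame_magnitudes, sample_rate, frame_size):
    """Return the frequency (in Hz) of the strongest bin in one frame."""
    k = max(range(len(frame_magnitudes)), key=frame_magnitudes.__getitem__)
    return k * sample_rate / frame_size


# Usage: chart a 1 kHz tone sampled at 8 kHz and report the strongest frequency.
signal = tone(1000, sample_rate=8000, num_samples=400)
chart = spectrogram(signal, frame_size=64, hop=32)
print(len(chart), "time slices,", len(chart[0]), "frequency bins each")
print("strongest frequency in first slice:", peak_frequency(chart[0], 8000, 64), "Hz")
```

In a system like the one described, a chart of this general kind is the hand-off between the two engines: the first produces it from text, and the second turns it back into an audible waveform.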
Below are pairs of audio samples provided by Google alongside its paper, which has not yet been peer reviewed. In each pair, one sample was generated by a computer, while the other is a real person speaking.
Google does not specify which is which in the paper, but there is a big giveaway: some files are labelled 'gen' in the page source of the research paper, likely indicating that these were generated by a computer. We have placed these second in each pair below.
Here are their examples. Can you tell the difference?
"That girl did a video about Star Wars lipstick."
"George Washington was the first President of the United States."
"She earned a doctorate in sociology at Columbia University."
"I'm too busy for romance."
Currently, the Tacotron 2 system can only produce speech in a woman's voice. To recreate male voices, Google would have to retrain the system.