Google Speech Synthesis Gets More Realistic

By John Lister

Google says it's made the most realistic computer speech simulation ever. It uses artificial intelligence to reproduce the way humans put words together.

The idea of Google's "Cloud Text-to-Speech" is to go beyond the traditional approach to speech synthesis. That approach effectively boils down to recording a batch of sound files of different syllables, then patching them together to form words. It works well for some languages such as Japanese, where speech patterns are very regular, but not so well for languages such as English, where pronunciation is more complex.

Full Sentences Analyzed

For example, the way that different syllables blend together isn't always consistent in English. There's also variation in which syllables are stressed. That's why some speech synthesis still sounds like a machine talking.

Google says its approach involves taking recordings of human beings saying different words and then analyzing the audio wave patterns. Its system can take genuine examples of how different people say real words and combine them into a consistent voice. The company says that because it works from recordings of full speech rather than standalone syllables, the results have more realistic features, right down to the sound of lips smacking on some words. (Source: theverge.com)

'DeepMind' Behind System

Taking this approach uses a huge amount of computing power, which is why Google's system works by remote processing on its servers rather than on individual computers or devices. It's powered by "DeepMind," its artificial intelligence system that aims to mirror the way humans learn different skills and make intelligent decisions. Using the remote servers provides enough power that 20 seconds' worth of audio can be generated in one second.

Though Google plans to license the system for use in websites and applications, it also offers a free trial page. While the output certainly wouldn't be confused with a real person speaking, it is an improvement on some previous technologies. For example, it pronounces words more appropriately when they form part of a question rather than a mere statement. It's also good at capturing regional variations, such as Australian English speakers raising their pitch over the course of a sentence. (Source: google.com)
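For developers, that licensing happens through the Cloud Text-to-Speech API. The sketch below is a minimal, illustrative example using Google's Python client library; the particular voice name, language code and credential setup are assumptions for illustration and may differ from what the current documentation lists.

    # Minimal sketch: synthesizing speech with Google's Cloud Text-to-Speech API.
    # Assumes the google-cloud-texttospeech package is installed and application
    # credentials are already configured; the voice name below is illustrative.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    # The text to be spoken.
    synthesis_input = texttospeech.SynthesisInput(
        text="Is this the most realistic computer speech yet?"
    )

    # Request an Australian English voice (a WaveNet-style voice name is assumed).
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-AU",
        name="en-AU-Wavenet-A",
    )

    # Ask for MP3 output.
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
    )

    response = client.synthesize_speech(
        input=synthesis_input,
        voice=voice,
        audio_config=audio_config,
    )

    # Write the returned audio bytes to a file.
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)

Because the heavy processing happens on Google's servers, the client simply sends text and receives audio bytes back, which is why even modest devices can use the service.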

What's Your Opinion?

Does synthesized speech need to be improved? What uses can you see for it if it gets more realistic? Do you think computerized speech can ever sound completely believable?


Comments

Dennis Faas:

Surely there will be many great things to come from this technology. That said, the one thing that came to mind when reading this article is that tech support scammers (especially from India) could use this technology to significantly improve their reach.

Let's look at an example: a dead giveaway that "something just isn't right" when dealing with these people over the phone (if and when they call your home claiming your PC has a virus) is that they are foreign, very difficult to understand, and asking for money to "fix" the "problem". Now, if you get rid of the "foreign" and "difficult to understand" bit, the scam seems that much more legitimate. "Microsoft Bob" from India now sounds like the legit "Microsoft Bob" from Redmond, Washington. When this technology improves even further, it will be used in real-time conversations.

I recently watched the BBC TV show "Click", which demonstrated how video and speech are being analyzed and reprogrammed to generate fake video and speech. There's even a fake speech given by Barack Obama on YouTube.