Tech giant Google’s efforts to generate natural-sounding speech from text have taken a big leap forward.
The company has developed a text-to-speech artificial intelligence system, called Tacotron 2, that can speak in a very human-like voice, it said in a blog post.
A team of Google researchers wrote in the blog post that the new approach does not use complex linguistic and acoustic features as input. “Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts,” they said.
Research into text-to-speech technology has progressed greatly over the past few years and many tech companies have been working on it.
The Google researchers said that they incorporated ideas from past work such as Tacotron and WaveNet to come up with the improved Tacotron 2 system.
How does Tacotron 2 work? The researchers explained that the new system uses a sequence-to-sequence model optimised for text-to-speech to map a sequence of letters to a sequence of features that encode the audio.
“These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally, these features are converted to a 24 kHz waveform using a WaveNet-like architecture,” the researchers said.
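To make those figures concrete, the arithmetic below works out what the quoted numbers imply: how many audio samples each spectrogram frame spans, and how many frames and feature values describe one second of speech. It is only a back-of-the-envelope sketch. The variable names are illustrative, and it assumes the 12.5-millisecond figure is the spacing between consecutive frames.

```python
# Figures quoted by the researchers.
SAMPLE_RATE_HZ = 24_000      # sample rate of the output waveform
FRAME_SHIFT_MS = 12.5        # assumed spacing between spectrogram frames
FEATURES_PER_FRAME = 80      # dimensions of each spectrogram frame

# Audio samples the waveform stage must produce for each frame.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000

# Spectrogram frames needed to describe one second of speech.
frames_per_second = 1000 / FRAME_SHIFT_MS

# Feature values describing one second of speech.
features_per_second = frames_per_second * FEATURES_PER_FRAME

print(samples_per_frame, frames_per_second, features_per_second)
```

Under these assumptions, each 80-value frame stands in for 300 audio samples, so the spectrogram is a far more compact description of the speech than the waveform it is eventually turned into.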
The researchers also evaluated the generated voices. “In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings,” they said.
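Listener ratings of this kind are commonly summarised as an average. The sketch below assumes a 1-to-5 naturalness scale and uses made-up ratings purely for illustration. The article gives neither the scale nor the raw scores, and the function name is hypothetical.

```python
from statistics import mean


def mean_naturalness(ratings):
    """Average listener ratings, assumed to lie on a 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings are assumed to lie between 1 and 5")
    return mean(ratings)


# Hypothetical ratings, not data from Google's evaluation.
example_ratings = [5, 4, 4, 5, 3]
print(mean_naturalness(example_ratings))
```

Comparing a synthetic voice with professional recordings would then mean computing this average for each and checking how close the two numbers are.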
Still, there are some difficult problems to solve.
For example, the new system has difficulty pronouncing complex words such as ‘decorum’ and ‘merlot’. In extreme cases, it can randomly generate strange noises.
Also, the system cannot yet generate audio in real time. “Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own,” the researchers wrote.