Synthesizing Voices

About a year ago I wrote an article about deepfakes and their potential threat to politicians and social media websites. Since then, however, deepfakes have not had a huge impact on the internet, and I rarely hear about them anymore. This is partly because deepfakes are still not perfect and most of them can be distinguished easily. But the main reason deepfakes haven't had a negative impact yet is that social media platforms have been quick to take down any deepfakes they see.


Currently, most deepfakes are good at mimicking facial expressions but not voices. However, this might change, as I have recently come across multiple websites and communities that are working on synthesizing voices using AI. The two I've come across are Uberduck and Vo.Codes. These websites are mostly for fun and are not supposed to be used maliciously.

What they do


Most of these websites work in a very similar manner. You enter the text that you would like to synthesize, then you choose the voice that you want it spoken in. After that, the website generates an audio file for you with the synthesized voice. Most of these voices actually turn out really realistic but are still quite distinguishable from the real thing; it depends on the voice and the text you give it. Some voices have less training data behind them, so they sound worse; sometimes the AI has trouble saying short sentences; and some text is just harder for the AI to pronounce.

These websites have all sorts of voices ranging from public figures like Barack Obama to cartoon characters like Homer Simpson to rappers like Eminem.


How they work

Uberduck and Vo.Codes both use Tacotron 2, a text-to-speech model for which NVIDIA has published an open source implementation, to synthesize their voice models. The backbone of the AI, though, is the data that the community submits. Anyone can submit their trained voice models to improve each voice. People can also request voices and share their own datasets to make the synthesized voices sound more realistic.
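To give a rough idea of the shape of such a system, here is a toy sketch in Python. It is not how either website is actually implemented. Tacotron 2 turns text into a spectrogram and a separate vocoder turns that spectrogram into audio; the sketch only mirrors those two stages with stand-in functions. Every name in it (text_to_ids, fake_acoustic_model, fake_vocoder, the symbol table, and the frame sizes) is made up for illustration, and the "voice" it produces is just a sequence of beeps.

```python
import math
import struct
import wave

# Hypothetical symbol table; real systems use a larger set, often with phonemes.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz '.,?!"
SYMBOL_TO_ID = {ch: i for i, ch in enumerate(SYMBOLS)}

SAMPLE_RATE = 22050
FRAMES_PER_SYMBOL = 5    # made-up constant; a real model learns how long each sound lasts
SAMPLES_PER_FRAME = 256  # made-up constant for this toy example


def text_to_ids(text):
    """Lowercase the text and map each known character to an integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


def fake_acoustic_model(ids):
    """Stand-in for stage 1. A trained network would predict spectrogram
    frames here; this toy version emits one made-up pitch value per frame."""
    frames = []
    for symbol_id in ids:
        pitch_hz = 120.0 + 10.0 * symbol_id
        frames.extend([pitch_hz] * FRAMES_PER_SYMBOL)
    return frames


def fake_vocoder(frames):
    """Stand-in for stage 2. A neural vocoder would turn spectrogram frames
    into audio samples; this toy version plays a sine tone for each frame."""
    samples = []
    phase = 0.0
    for pitch_hz in frames:
        for _ in range(SAMPLES_PER_FRAME):
            phase += 2.0 * math.pi * pitch_hz / SAMPLE_RATE
            samples.append(0.3 * math.sin(phase))
    return samples


def write_wav(path, samples):
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(SAMPLE_RATE)
        pcm = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
        wav_file.writeframes(pcm)


def synthesize(text, path):
    """Run the whole toy pipeline: text -> IDs -> frames -> samples -> file."""
    samples = fake_vocoder(fake_acoustic_model(text_to_ids(text)))
    write_wav(path, samples)
    return len(samples)


if __name__ == "__main__":
    synthesize("Hello there.", "demo.wav")
```

In a real system, the two stand-in functions are the parts that get trained on recordings of a specific person, which is why the amount and quality of data behind each voice matters so much.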


Just for fun, here are a couple of examples of the AI in action.

This is the voice of Homer Simpson - generated by Vo.Codes

This is the voice of Eminem - generated by The voice struggles to pronounce "mimicking" and "synthesizing".

This is the voice of David Attenborough - generated by Vo.Codes

This is the voice of Freddie Mercury singing - generated by The voice struggles to pronounce "mimicking", "expressions", and "synthesizing".

This is the voice of SpongeBob SquarePants - generated by Vo.Codes

Final Thoughts

It's still very uncertain what the future of this technology is, but right now it's something that is really fun to mess with. People are making their own comedic videos and creations using these audio files, and I myself have enjoyed generating voices of people I know.