Northeastern professor creates voices for the voiceless

Corey Dockser

Text-to-speech technology has become widely available in modern society, granting verbal communication to millions. But, with a limited set of voices available, users are unable to identify themselves by sound. It was this problem that led to the creation of VocaliD.

VocaliD was founded in 2014 after six years of development by Rupal Patel, a professor of communication sciences and disorders and computer and information sciences at Northeastern University. Patel began her work with synthetic voices in 2007. At the time, her focus was on creating unique voices without distorting the speech to the point of unintelligibility.

The program is capable of combining a brief vocalization from someone with impaired speech with thousands of lines read by a single volunteer donor, recorded in Patel’s lab.

Media outlets began to pick up Patel’s work following an article published by the Northeastern College of Computer and Information Science. An interview with NPR pushed Patel to think beyond research.

“We were just doing studies about how to make these voices and hadn’t really necessarily played the custom voices for any of the people that could potentially use it,” Patel said. The NPR reporter pressed her repeatedly on whether she could use her work to help people in need more directly.

Soon, Patel began to consult with some of the people who could benefit from her research. This convinced her to move away from theory and toward practical implementation. In 2013, she gave a TED talk, which led her to the idea of crowdsourcing voices.

As she was practicing her talk, the people she was working with kept asking her what the average person could do after seeing her presentation. She decided to ask people to donate their voices.

Since the launch of VocaliD, 28,000 people from 110 countries, ranging in age from 6 to 91, have contributed to the company’s proprietary voice database, the Human Voice Bank.

Contributing to the bank is a largely automated process requiring only a computer, a microphone and an internet connection. Donors complete an audition to see if their recordings are of high enough quality for the bank. Geoff Meltzner, vice president and director of research and technology at VocaliD, said the remaining work is simple.

“Once you’ve passed the audition, you then enter the Voice Bank and you get presented with what kind of material you want to read, whether it’s children’s stories or news stories or science articles or fiction, and you’re prompted with a set of sentences for you to record,” Meltzner said.

These stored voices have different roles depending on which of VocaliD’s two products recipients choose. The Vocal Legacy is for people who are able to speak clearly now, but are concerned about losing their speech in the future. After five to seven hours of recording, the website will turn those recordings into a downloadable file.

The BeSpoke voice, on the other hand, is more complex. It is designed for those who can generate a vocalization but not speech, and recipients need only submit a two- to three-second clip to the website to initiate the process. This clip is then run through the database to find a matching donor, and the donor’s samples are combined with the recipient’s to form a new, unique voice. Requests can be made for certain desirable traits, like accents. Like the Legacy, this voice is stored on the website and available for download at any time.

For recipients who don’t want the same voice forever, either because they’ve aged or they simply want to try on something else, Patel and her team are developing aging methods, too.

“One of the patents we have right now is to be able to grow the voice algorithmically,” she said. “In the child age band we have ways to modify the voice slowly and gradually. But as you jump from child to teenager there’s actually a pretty huge jump, and at that point it just isn’t going to sound good enough — it’s going to distort the speech signal too much so our preference is to find a new voice.”

Improvements to VocaliD are ongoing. For example, the Human Voicebank 2.0 was launched Nov. 13 and expanded recording options by allowing donors to pick from a wide variety of texts to suit their preferences, rather than the limited collection of sentences previously offered. Looking farther ahead, the team would like to expand beyond English to include Spanish, Hebrew, Chinese and various Indian dialects, Meltzner said.

Both of VocaliD’s products have the same price tag: $1,500. Getting a voice is a one-time purchase. As insurance won’t yet cover the cost of a prosthetic voice, potential users are encouraged to crowdfund the expense. While sticker shock could be a deterrent to potential users, the sheer uniqueness of the technology is enough to win over some, at least for the moment.

“I think it’s definitely a worthwhile investment, especially if you start young,” said Rohith Parvathaneni, a second-year data science major. “The price can be a bit high right now but that may just be due to the fact that it’s new software so with time it’s probably going to go down as it’s optimized.”