Here is a short review of freely available (open source or not) « text-to-speech » technologies. I dug into this topic because I wanted to check whether anyone had invented a software package that would turn my RSS aggregator into a personalized radio. More precisely, while I am doing some other task (feeding one of my kids, brushing my teeth, having my breakfast, …) I would like to be able to check my favorite blogs for news without having to read anything. My conclusion : two packages come close to the expected result.
Regarding features, the most advanced one is NewsAloud from nextup.com. It acts as a simple and limited news aggregator combined with a text-to-speech engine that reads selected newsfeeds aloud. But it still lacks some important features (loading my OPML subscription file so that I don’t have to enter my favorite RSS feeds one by one, displaying a scrolling text as it is read, …) and worst of all : it is NOT open source.
The second nice-looking package going in the expected direction is just a nice hack called BlogTalker, which enables any IBlogExtension-compatible .Net aggregator (RSSBandit, NewsGator…) to read any blog entry aloud. But it is just a proof-of-concept since it cannot be set up to read a whole newsfeed or a set of newsfeeds. It seems to me that adding TTS abilities to existing news aggregators is the way to go (compared to NewsAloud which is coming from TTS technologies and trying to build a news aggregator from there). And BlogTalker successfully passes the « is it open source ? » test.
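To give an idea of how little is needed for the OPML import I am missing, here is a minimal sketch in Python. It assumes the usual OPML layout where each subscription is an `outline` element carrying an `xmlUrl` attribute; the sample file and its URLs are made up for illustration and have nothing to do with how NewsAloud works internally.

```python
import xml.etree.ElementTree as ET


def feed_urls_from_opml(opml_text):
    """Return the xmlUrl of every <outline> element that has one, in document order."""
    root = ET.fromstring(opml_text)
    return [
        outline.attrib["xmlUrl"]
        for outline in root.iter("outline")  # iter() also walks nested folders
        if "xmlUrl" in outline.attrib
    ]


# Hypothetical subscription file, only here to exercise the function.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Hypothetical subscriptions</title></head>
  <body>
    <outline text="Tech">
      <outline text="Example blog" type="rss" xmlUrl="http://example.org/rss.xml"/>
    </outline>
    <outline text="Another blog" type="rss" xmlUrl="http://example.com/index.rdf"/>
  </body>
</opml>"""

if __name__ == "__main__":
    for url in feed_urls_from_opml(SAMPLE_OPML):
        print(url)
```

Folders (outlines without an `xmlUrl`) are simply skipped, so nested subscription lists come out as one flat list of feeds.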
Both packages depend on third-party text-to-speech engines (the « voices » you install on your system). As such, they are dependent on the quality of the underlying TTS engine. For example, if you are a Windows user, you can freely use some Microsoft voices (Mike, Mary, Sam, robot voices, …) or Lernout & Hauspie voices or many other freely available TTS engines that support the Microsoft Speech API version 4 or 5 (or above ?). The problem is that these voices do not sound good enough to me. As a native French speaker, I am comfortable with the LH Pierre or LH Veronique French voices even if they still sound robotic. But for listening to English newsfeeds in the long run, the MS, LH or other voices are not good enough. Fortunately, AT&T invented its « natural voices » which sound extremely … natural according to the samples provided online. Unfortunately, you have to purchase them. I will wait for this new kind of natural voices to become commoditized.
Meanwhile, I have to admit that TTS-enabled news aggregators are not ready for end-users. You can assemble a nice proof-of-concept but the quality is still lacking because of the following three issues : aggregators are not fully mature (from a usability point-of-view), high-quality TTS engines are still rare, and nobody has managed to integrate them well with one another yet. With the maturation of audio streaming technologies, I expect some hacker some day to TTS-enable my favorite CMS : Plone. With the help of some of the Plone aggregation modules (CMFFeed, CMFSin, …), it would be able to stream a personalized audio newsfeed directly to WinAmp… Does it sound like a dream ? Not sure…
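The aggregator half of this dream is not the hard part. As a rough sketch (not what NewsAloud or BlogTalker actually do), the Python below turns an RSS 2.0 document into a list of sentences ready to be handed to a speech engine. The `speak` parameter is a placeholder of my own: it defaults to `print`, and you would swap in whatever call your TTS engine offers. The sample feed is invented.

```python
import html
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")


def plain_text(fragment):
    """Strip markup and collapse whitespace so the engine does not read tags aloud."""
    text = html.unescape(TAG_RE.sub(" ", fragment or ""))
    return " ".join(text.split())


def reading_script(rss_text, max_items=5):
    """Return one 'Title. Summary' string per <item>, at most max_items of them."""
    root = ET.fromstring(rss_text)
    script = []
    for item in root.iter("item"):
        if len(script) >= max_items:
            break
        title = plain_text(item.findtext("title"))
        summary = plain_text(item.findtext("description"))
        script.append(". ".join(part for part in (title, summary) if part))
    return script


def read_aloud(script, speak=print):
    """Feed each entry to the speech callable (print is only a stand-in)."""
    for entry in script:
        speak(entry)


# Hypothetical feed, only here to exercise the functions.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Hypothetical blog</title>
  <item>
    <title>TTS review</title>
    <description>&lt;p&gt;Two packages come close.&lt;/p&gt;</description>
  </item>
  <item>
    <title>Second post</title>
    <description>No markup here.</description>
  </item>
</channel></rss>"""

if __name__ == "__main__":
    read_aloud(reading_script(SAMPLE_RSS))
```

The tag stripping is deliberately crude. A real aggregator would need a proper HTML parser and some care with long entries, which is exactly the kind of integration work that is still missing.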
During my tests, I encountered several other TTS utilities that are open source (or free or included in Windows) :
- Windows Narrator is a nice feature that reads any Windows message box for more accessibility. It seems to be bundled in all the recent Windows releases. Windows TTS features are also delivered with the help of the friendly-but-useless Microsoft Agents.
- Speakerdaemon’s concept is simple : it monitors any set of local files or URLs and it speaks a predefined message at any change in the local or remote resource (« Your favorite weblog has been updated ! »). Too bad it cannot read the content or excerpts (think regular expressions) of these resources.
- SayzMe sits in your icon tray and reads any text that is pasted by Windows into the clipboard. Limited but easy.
- Clip2Speech offers the same simple set of features as SayzMe plus it allows you to convert text to .WAV files.
- Voxx Open Source is somewhat ambitious. It offers both TTS features (read any highlighted text when you hit Ctrl-3, read message boxes, read any text file, convert text to .WAV or .MP3, …) and speech recognition. Once again, it is « just » a packaging (front-end) of third-party speech recognition engines. As such, it uses by default the Microsoft Speech recognizer, which is not available in French (but in U.S. English, Chinese and Japanese if I remember correctly). I have yet to try it in U.S. English with a headset microphone since my laptop microphone catches too much noise for it to be usable. The speech recognition feature allows the user to dictate a text or to command Voxx or Windows via voice. So it is an open source competitor to IBM ViaVoice or ScanSoft Dragon NaturallySpeaking.
- PhantomSpeech is middleware that plugs into TTS engines and allows application developers to add TTS capabilities to their applications. It is said to be distributed with add-ins for Office 2000. Indeed I could display a PhantomSpeech toolbar in Word 2003. It could read a text but only using the female Microsoft voice. And this toolbar showed unexpected behaviors and errors within Office. Not reliable as a front-end application. Anyway, the use and configuration of speech engines is really a mess. The result is that PhantomSpeech does not seem really intended for end-users but maybe just for developers.
- CHIPSpeaking is a nice utility for « the vocally disabled » (people who cannot speak). It allows the user to dictate sentences with a virtual keyboard and to record predefined sentences that are read aloud with one click.
- ReadPlease (the free version) is just a nice simple text reader made by developers who played too much with Warcraft (click on the faces and you’ll see why). The word being read is highlighted. Simple options allow users to change the voices with one click (which is cool when you switch between several languages) or to customize the size of the text, …
- Spacejock’s yRead is another text reader that includes a pronunciation editor (read km as « kilometers » please) and also allows the download of public domain texts available from Project Gutenberg. The phrase being read is highlighted and you can easily switch from one voice (and language) to another. Too bad its window always steals the focus when it reads a new phrase.
- For the *nix-inclined people, I should also mention the famous Festival suite of TTS components (Festival, FLite, Festvox). For the Java-inclined people, don’t miss the FreeTTS engine (that is based on Festival Lite !) and the associated end-user applications. An example of an end-user application based on Festival is the CMU Communicator, see its sample conversation as a demo.
- Last but not least, do not miss Euler and the underlying MBROLA package. Euler is a simple open source reading machine based on MBROLA that implements a huge number of voices in many, many languages ; these voices can also include very natural intonations and vocal stresses. Euler + MBROLA were produced by an academic research program. They are free for non-commercial use and their source code is available (BTW, it is said that MBROLA could not be distributed under an open source license because of a France Telecom software patent !). Beware : the installation of MBROLA may be quite tricky. First, download the MBROLATools Windows binaries package, download patch #1 and read the instructions included (I had problems when trying patch #2 so I did not use it), download as many MBROLA voices as you want (wow ! that many languages supported !), then download Win Euler (or any other MBROLA compatible TTS engine from third parties ; note that MBROLA is supported by Festival).
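A pronunciation editor like the one in yRead (the « read km as kilometers » feature mentioned above) boils down to a substitution table applied to the text before it reaches the voice. Here is a minimal sketch of that idea in Python. The table entries and the function name are my own inventions for illustration; I have no idea how yRead really stores or applies its rules.

```python
import re

# Hypothetical pronunciation table: written form -> what the voice should say.
LEXICON = {
    "km": "kilometers",
    "TTS": "text to speech",
    "RSS": "R S S",
}


def apply_lexicon(text, lexicon=LEXICON):
    """Replace whole-word occurrences of each lexicon key with its spoken form."""
    if not lexicon:
        return text
    # Longest keys first so that a short key never shadows a longer one.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: lexicon[match.group(1)], text)


if __name__ == "__main__":
    print(apply_lexicon("The race is 10 km long"))
```

Because the match is on word boundaries, « km » inside a longer word is left alone, and so is « 10km » written without a space, which a real editor would probably want to handle too.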
Further ranting about TTS engines : I feel like the ecosystem of speech engines is really not mature enough. Sure, several vendors provide speech engines. But they are not uniformly supported by the O.S. There was a Microsoft S.A.P.I. version 4 (SDK available here) which is now at version 5.1, but people even mention a v.6 (included in Office 2003 U.S. ?) and a v.7 to be included in Longhorn (note that there is also another TTS API : the Java Speech API 1.0 – JSAPI – as implemented by FreeTTS… bridgeable with MS SAPI ?). But like any Microsoft standard, these APIs are … not that standardized (i.e. they seem to be Microsoft-specific). Even worse : they seem rather unstable since installing various speech engines gives strange results : some software detects most of the installed TTS engines, others only detect SOME of the SAPI v.4 TTS engines, and some others display a mix of some of your SAPI4 TTS engines and some of your SAPI5 TTS engines… In order to be able to use SAPI5 engines and the control panel I had to install Microsoft Reader and to TTS-enable it (additional download). What a mess ! The result is that you cannot easily control which voices you will be using on your computer (which ones will be supported or not ?). As a further example, I could not install and use the free CMU Kal Diphone voice and I still don’t know why. Is it the API’s fault ? The engine’s fault ? Don’t know… Last remark on this point : Festival seems to be the main open source stream in the field of TTS technologies but it does not seem to be fully mature ; and the end-user applications based on it seem to be quite rare. Let’s wait a few more years before it becomes a mainstream, user-friendly and free technology.
More precisely, the TTS puzzle seems to be made of the following parts :
- a TTS engine made of three parts :
  - a text processing system that takes a text as input and produces phonetic and prosodic (duration of phonemes and a piecewise linear description of pitch) commands
  - a speech synthesizer that transforms phonemes plus a prosody (think « speech melody ») into speech
  - a « voice » that is a reference database that allows the speech to be synthesized according to the tone, characteristics and accent of a given voice
- an A.P.I. that hosts this engine and publishes its features toward end-user applications and may provide some general features such as a control panel
- an end-user application (a reading machine, a file monitor with audio alerts, an audio news aggregator, …) that invokes the dedicated speech API
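To make the « phonetic and prosodic commands » less abstract, here is a small Python sketch of what the text processing part might hand over to the synthesizer. It writes one line per phoneme in the style of the .pho files that MBROLA reads, as far as I understand that format : a phoneme symbol, a duration in milliseconds, then optional (position in %, pitch in Hz) pairs describing the piecewise linear pitch curve. The `Phone` class, the phoneme symbols and all the numbers for « hello » are invented for illustration ; real symbols depend on the voice database you use.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Phone:
    """One prosodic command: a phoneme, its duration, and optional pitch targets."""
    symbol: str
    duration_ms: int
    # (position within the phoneme in %, pitch in Hz); linear interpolation between points
    pitch_points: List[Tuple[int, int]] = field(default_factory=list)


def to_pho(phones):
    """Serialize phones as '<symbol> <duration> [<position> <pitch>]...' lines."""
    lines = []
    for phone in phones:
        parts = [phone.symbol, str(phone.duration_ms)]
        for position, hertz in phone.pitch_points:
            parts += [str(position), str(hertz)]
        lines.append(" ".join(parts))
    return "\n".join(lines)


# Hypothetical output of the text processing stage for "hello" ("_" stands for silence).
HELLO = [
    Phone("_", 100),
    Phone("h", 80, [(0, 120)]),
    Phone("@", 60),
    Phone("l", 70),
    Phone("@U", 200, [(20, 130), (90, 100)]),
    Phone("_", 100),
]

if __name__ == "__main__":
    print(to_pho(HELLO))
```

The synthesizer and the « voice » database then turn such a list into audio, which is why the same commands sound different from one voice to another.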
You can get more detailed information from the MBROLA project site.
These were my notes and ranting about text-to-speech technologies. Please drop me a comment if you feel that my explanations are wrong or biased, as I don’t know this field in detail and I may have made a lot of errors here. Thanks !