Gnuspeech is an extensible, text-to-speech package, based on real-time,
articulatory, speech-synthesis-by-rules.
That is, it converts text strings into phonetic descriptions, aided by a
pronouncing dictionary, letter-to-sound rules, and rhythm and intonation
models; transforms the phonetic descriptions into parameters for a
low-level articulatory synthesiser; and uses these to drive an
articulatory model of the human vocal tract, producing output suitable
for the normal sound output devices used by GNU/Linux.
The research that provides the foundation of the system was carried
out in research departments in France, Sweden, Poland, and Canada.
Some of the features of gnuspeech and associated tools include:
[Figure: Overview of the main Articulatory Speech Synthesis System]
The name gnuspeech is a play on words: it is a new approach to speech synthesis from text, and it is also a GNU project, aimed at providing high quality text-to-speech output for GNU/Linux. In addition, it provides a comprehensive tool for psychophysical and linguistic experiments.
gnuspeech is currently under development. It is being ported from an original NeXTSTEP 3.x version to run under GNU/Linux. No full GNU/Linux release is currently available, but a release of the interactive Monet system for Mac OS/X and GNUstep is available, with some work remaining to be completed for the GNUstep version (as of 2008-05-23; see the next section for details).
Development & Coming Soon
gnuspeech is being ported both to GNU/Linux and to the Macintosh under OS/X. There are a number of components/apps/modules which have to be ported. Some have already been ported. Interested persons are invited to contact the authors/developers through the GNU project facilities. To join the gnuspeech mailing list, please visit the subscription page. The current state of the project is as follows:
- Monet:
The interactive language database and testing tool used to create the original databases for English Text-To-Speech conversion using the new articulatory model of the vocal tract (the “tube model”, basically a wave-guide or lattice filter that emulates the properties of the acoustic tube directly rather than through the use of formant filters etc.). Monet translates its symbolic input into a digital waveform representing the “spoken” version of the input. Monet was originally designed and developed by Craig Schock (based on an original specification by David Hill, with testing and suggestions for improvements by David Hill and Leonard Manzara) as proprietary research software used in-house for the development of the Trillium Sound Research TextToSpeech package offered on the NeXT computer. It was also available as part of the Trillium Experimenter Kit. On the demise of NeXT (whose remains were bought by Apple Computer), Monet and all other Trillium software were reconfigured as a GNU project (gnuspeech) and made available to the community under a General Public Licence. It can be found at the web site http://savannah.gnu.org/projects/gnuspeech/.
To access the sources for Monet and other components, click on “-Browse Sources Repository” under “Development Tools”. Monet is there under “…/current/Applications” but requires the tube model and other components to compile (see “…/Frameworks” and “…/Tools” under “current”). At present, complete compilation is only possible under Macintosh OS/X 4.3 or later, though the sources are being modified to compile under GNUstep as well, and this may introduce certain minor glitches in the Mac OS/X compilation from time to time. The big hold-up in getting full compilation under GNUstep is the lack of suitable audio output facilities under GNUstep. Compilation under Mac OS/X uses Core Audio, and the plan is to implement the needed components of Core Audio for GNUstep.
Two people concerned with the ongoing GNUstep development (Greg Casamento, the chief GNUstep maintainer, and Robert Slover) have been considering the problem. Both have been extremely busy, especially Greg after taking over as chief of the GNUstep project. The implementation is on Robert's “to-do” list. Until then, those wishing to try out Monet and do further development will have to work on the Mac, using the source, which is designed to compile either under OS/X with Xcode/Interface Builder or under GNUstep. The Mac port was done by Steve Nygard, following his experience at OmniGroup, and is pretty well complete except for a few items such as modifying the intonation patterns for the automatically generated speech. Steve had worked on the original NeXT implementation for Trillium whilst he was at the University of Calgary. Monet’s emulation of the human vocal tract depends on research carried out by Fant and his colleagues at the Speech Technology Lab at KTH, Stockholm, on formant sensitivity analysis, and by René Carré at the ENST Dept. of Signals in Paris on the “Distinctive Region Model” for controlling the artificial vocal tract.
- The tube model:
This was originally a ‘C’ implementation of the tube model that forms the core of the synthesis system, and was created by Leonard Manzara, who also ported it to the DSP56001 signal processor and made it run in real-time. It is based on work by Perry Cook and Julius Smith at the Stanford University Center for Computer Research in Music and Acoustics (CCRMA). The version required to compile Monet is available in the same repository as Monet, but under “...current/Tools/softwareTRM”. A copy of the original ‘C’ version is available in the repository under “gnuspeech/gnuspeech/trillium/src/softwareTRM/tube.c”.
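As a rough illustration of what a wave-guide tube model computes, the sketch below shows a single scattering junction between two tube sections of different cross-sectional area, in the textbook Kelly-Lochbaum form for pressure waves. It is not taken from the gnuspeech sources: the function names `reflection_coefficient` and `scatter` are hypothetical, and a usable vocal tract model would chain many such junctions with delay lines and add losses and end effects.

```python
from typing import Tuple


def reflection_coefficient(area_left: float, area_right: float) -> float:
    """Pressure-wave reflection coefficient at the boundary between two
    tube sections, seen by a wave travelling from left to right."""
    return (area_left - area_right) / (area_left + area_right)


def scatter(forward_in: float, backward_in: float, k: float) -> Tuple[float, float]:
    """Apply one scattering junction (hypothetical helper, not gnuspeech code).

    forward_in  : right-going wave arriving from the left section
    backward_in : left-going wave arriving from the right section
    Returns (forward_out, backward_out): the right-going wave entering the
    right section and the left-going wave returned to the left section.
    """
    forward_out = (1.0 + k) * forward_in - k * backward_in
    backward_out = k * forward_in + (1.0 - k) * backward_in
    return forward_out, backward_out


if __name__ == "__main__":
    # A narrowing from area 2.0 to area 1.0 reflects part of the wave.
    k = reflection_coefficient(2.0, 1.0)
    print(scatter(1.0, 0.0, k))
```

With equal areas the coefficient is zero and both waves pass through unchanged; in every case the total pressure is the same on both sides of the junction, which is the property the equations are built to preserve.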
- Synthesizer:
This is not, in fact, a complete synthesizer! It is an interactive application that allows a user (usually a language developer or someone interested in the behaviour of the tube model) to interact directly with the tube model, listen to the output under different static conditions, and analyse the output. It was an important tool used in developing the databases for the original British English TextToSpeech system because it allowed the tube configurations needed to define the speech “postures” (of the vocal tract) to be explored and finalised. Although it has built-in analysis and display features, it was also used in conjunction with a Kay Sonagraf spectrum analyser that was used to analyse the spectrum of natural speech in order to compare the spectral analyses of putative “postures” with what was seen in natural speech in a form that was the same for both. The Sonagraf was also used to check the output of Monet against the same utterances in natural speech. Synthesizer is 70% ported to the Mac under OS/X but none of the new sources is yet available. I (David Hill) am the one working on this, but I keep getting diverted. It should have been finished 6 months ago! Real soon now! The original version of Synthesizer was created (for the NeXT) by Leonard Manzara.
- PrEditor:
This was an application to allow users to create and maintain their own dictionaries. The original TextToSpeech kit looked up several dictionaries in the order User, Application and Main. PrEditor allows the User and Application dictionaries to be created and maintained. An initial port was begun by Eric Zoerner and is in a sub-subdirectory under the same subdirectory as Monet. It is not yet functional. The original PrEditor on the NeXT was written by Vince DeMarco and David Marwood, documented by Leonard Manzara and later upgraded by Michael Forbes.
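The lookup order described above (User, then Application, then Main) can be sketched as follows. This is only an illustration under stated assumptions: the function name `lookup_pronunciation`, the lower-casing of the key, and the placeholder pronunciation strings are all hypothetical and do not reflect the actual dictionary format or the TextToSpeech kit's code.

```python
from typing import Mapping, Optional, Sequence


def lookup_pronunciation(
    word: str, dictionaries: Sequence[Mapping[str, str]]
) -> Optional[str]:
    """Return the first pronunciation found, searching the dictionaries
    in the order given. Returns None if no dictionary has the word, in
    which case a caller would fall back to letter-to-sound rules."""
    key = word.lower()  # assumption: entries are stored in lower case
    for dictionary in dictionaries:
        if key in dictionary:
            return dictionary[key]
    return None


if __name__ == "__main__":
    # Placeholder entries; real entries would hold phonetic transcriptions.
    user = {"read": "user-entry"}
    application = {"read": "app-entry", "monet": "app-entry"}
    main = {"read": "main-entry", "monet": "main-entry", "speech": "main-entry"}
    for w in ("read", "Monet", "speech", "unlisted"):
        print(w, lookup_pronunciation(w, [user, application, main]))
```

Searching the User dictionary first lets a user override an Application or Main entry without editing either of those dictionaries.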
- The “Main” dictionary:
This has not really changed since the original NeXT implementation and is incorporated as a module in the source code for Monet. Its pronunciation is a hybrid of British (RP) English (mainly the vowels and related stuff) and General American (especially the rhotic “r” sound). It includes around 70,000 words, plus facilities for creating/checking derivatives such as plurals, adverbs …, and information concerning word stress and part-of-speech. The part-of-speech information is still not used. The main dictionary was compiled mainly by me, David Hill, after a preliminary version plus creation tools were set up by Craig Schock.
- BigMouth:
(Not to be confused with a different app of a similar name by a different company.) This was an application that allowed text-to-speech to be tried out without reference to any particular application on the NeXT and also drove the speech service. It used the TextToSpeech Server that ran as a daemon, started at boot time. It has yet to be ported (see also the next item on Real-time Monet). The original source for BigMouth was created by Leonard Manzara.
- Real-time Monet and the TextToSpeech Server (TTS Server):
Monet incorporates all kinds of interactive interfaces for creating and modifying the databases relating to the language being created or managed. It also has the means to use these databases to create the output speech waveform. The original NeXT-based TextToSpeech Kit came in three versions: the User Kit, which simply provided speech output as a service available to any application; the Developer Kit, which provided the means to incorporate speech into applications directly; and the Experimenter Kit, which allowed full access to all the tools used by Trillium in developing language databases, including dictionaries. All of these used the TextToSpeech Server for the actual conversion of text to speech output. The task was made easier on the NeXT, which was relatively slow, by using the built-in DSP (a Motorola DSP56001). In the Mac implementation of Monet and Synthesizer, the host computer performs all the computation, as CPU speeds are two orders of magnitude or more faster than the old NeXT. The use of the DSP on the NeXT also gave a certain absolute separation between the tasks associated with creating the event framework for synthesis, and the tasks associated with transforming the event framework into the digital speech waveform (Real-time Monet) and outputting it, the latter tasks being carried out by the tube model. Thus the tube model ran on the DSP in real-time and communicated by DMA access. There was also a ‘C’ version of the tube model which could not run in real-time. It was useful for producing a slightly higher quality of speech since it did not have to be squeezed into the DSP and rigorously optimised because of the marginal ability (even on the DSP) to run in real-time. The ‘C’ version of the tube model is what forms the basis of the current port, which is possible because of the greatly increased processor speeds these days.
Real-time Monet is a stripped-down version of Monet. All the database creation and manipulation components are absent, as are all interactive interfaces. On the NeXT version, the defaults database was used to hold the parameters for controlling static aspects of the synthesis (tube length, mean pitch, and so on—the so-called “utterance-rate parameters”) and Real-time Monet computed the event framework from the input text via an intermediate input syntax which resulted from pre-processing the text. This pre-processing included dictionary look-up to get the correct pronunciation (deficient in the sense there was no grammatical parsing or attempt to determine meaning, so that different pronunciations of words with the same spelling could not be disambiguated). The word stress information from the dictionary was used to determine the rhythmic framework according to the Jones/Abercrombie/Halliday (British) “tendency-towards-isochrony” theory of British English speech by placing “foot” boundaries before the word stress in words having word-stressed syllables. The punctuation was also used in this process, and allowed a distinction to be made between statements, emphatic statements, questions, and questions expecting a yes/no answer for purposes of selecting different intonation contours (not ever really done totally satisfactorily). Without using knowledge of meaning, it was hard to decide where the tonic (information point) of the phrase or sentence should be marked, which means that the tonic foot was generally placed in phrase/sentence final position by default. This causes some degradation of the speech rhythm and intonation and is the first deficiency that should be corrected.
That said, Real-time Monet and the TextToSpeech Server have yet to be ported or rewritten for GNU/Linux and the Mac. The current Monet port, like the original Monet, incorporates the tube model to generate output and expects the output of the text pre-processor as input. A new applet (unfortunately named “GnuSpeech” and presently residing in the “gnuspeech/current/Frameworks” folder) allows plain text to be converted into the syntax needed for the current version of Monet. Steve Nygard recently “tidied things up”, following comments from people on the list, and I haven't checked out the resulting new arrangements to see if I can still understand the relationships well enough to compile it all, having many balls in the air. Any time I spend will be spent finishing Synthesizer. Knowing Steve, I am sure there’s no problem with compiling Monet and associated modules in their re-arranged form. Please communicate your experience on the mailing list (to join, visit the subscription page).
There's a diagram of the relationships between the various TTS components of the complete system above.
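The rhythm rule described for Real-time Monet (a “foot” boundary before each word-stressed syllable, with the tonic foot placed last by default) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the Real-time Monet algorithm: the names `group_into_feet` and `default_tonic_index` are hypothetical, syllables are given as plain strings with a stress flag, and punctuation and intonation contours are ignored.

```python
from typing import List, Optional, Tuple

# (syllable text, True if the dictionary marks it as word-stressed)
Syllable = Tuple[str, bool]


def group_into_feet(syllables: List[Syllable]) -> List[List[str]]:
    """Start a new foot at every stressed syllable. Unstressed syllables
    before the first stress form an initial foot of their own."""
    feet: List[List[str]] = []
    for text, stressed in syllables:
        if stressed or not feet:
            feet.append([])
        feet[-1].append(text)
    return feet


def default_tonic_index(feet: List[List[str]]) -> Optional[int]:
    """Without knowledge of meaning, place the tonic on the final foot."""
    return len(feet) - 1 if feet else None


if __name__ == "__main__":
    phrase = [("the", False), ("cat", True), ("sat", True),
              ("on", False), ("the", False), ("mat", True)]
    feet = group_into_feet(phrase)
    print(feet, default_tonic_index(feet))
```

The final-position default in `default_tonic_index` is the behaviour the text identifies as the first deficiency to correct, since the information point of a phrase is not always at its end.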
- ServerTest and ServerTestPlus:
This was an interactive module to allow the functioning of the TextToSpeech Server to be tested as it was running. There were originally two versions (plain and Plus), the latter having a number of “hidden” methods that were restricted to Trillium's “in-house” use. Now that the whole system is available under a GPL, the restricted “ServerTest” version is obsolete and the name ServerTest will refer to a reimplementation of ServerTestPlus. One of the 18 originally-hidden methods allowed plain text to be converted into the intermediate Real-time Monet input syntax. It was hidden to keep the main dictionary material proprietary, as it could have been used to decode the encoded dictionary. This particular function is currently provided by the misleadingly-named GnuSpeech applet (see above). ServerTest will be needed once the TextToSpeech Server has been re-implemented—something that has not yet been done. The original versions were written by Leonard Manzara.
- WhosOnFirst:
WhosOnFirst was the first publicly available software associated with the Trillium TextToSpeech system and was designed as a bit of a teaser. As issued, it provided indication, on the NeXT console, of remote logins. It also told the user that if they had the Trillium TextToSpeech system, they could get voice alerts not only to remote logins, but other system activity such as application launches. The App was written by Craig Schock and was instrumental in catching and identifying a hacker trying to break into our system soon after it was set up. WhosOnFirst has not yet been ported and for best value must await a ported version of the TextToSpeech Server.
- say:
A command line interface to the TextToSpeech Server that can be used from a terminal or in shell scripts. It was written by Craig Schock and has not been ported yet.
- SpeechManager:
The SpeechManager was provided to allow the TextToSpeech Server parameters to be optimised for different systems since no particular setting of priorities, initial silence fill, and so on could be right for all systems. In particular, in networked systems, or systems with a high compute load from other tasks, the speech would sometimes crackle due to interference from other tasks. The App, which could only be run as root, allowed the TextToSpeech Server to be restarted, and the various parameters controlling priority and so on to be set to new values to avoid crackling whilst minimising the use of system resources. It may be that these functions are obsolete these days, given the increased compute power available. Some functions (such as reporting the version of the main dictionary in use, or restarting the TextToSpeech Server) may still be required when the TTS Server is reimplemented. The original App was written by Craig Schock. It has not been ported.
- SpeechRegistrar:
An applet that was provided to allow any of the TextToSpeech Kits to be registered, using a password, and run under the root account. The original function is now obsolete, but may be useful, in revised form, as a way of building user groups for the ported system. It was written by Craig Schock. It has not been ported.
- TrilliumSoundEditor:
This was a speech editor and analysis program intended to provide a more versatile replacement for the publicly available Sonagram program written by Hiroshi Momose. Although TrilliumSoundEditor was never completely finished, it provided the basic functionality required for speech development and could be finished/upgraded/ported at some point in the future. The program was written by Craig Schock. None of the App has yet been ported.
As a summary, much of the core software has been or is being ported to the Mac under OS/X, but porting anything that “speaks” is blocked from completion under GNU/Linux by lack of suitable audio output facilities. Thus Monet has been ported to the Mac under OS/X using Xcode/Interface Builder, and it produces speech from input text as well as providing the development facilities for managing and creating language databases for text-to-speech. The Monet source will also compile, more or less, under GNUstep within the GNU/Linux environment, but without built-in speech output facilities. The sources are in the gnuspeech repository (see below). Synthesizer is in the process of being ported to Mac OS/X using Xcode/Interface Builder and is about 70% complete. Sources are not yet publicly available. PrEditor is in the process of being ported and the sources are in the gnuspeech repository. Some accessory tools are available. There is an immediate need to port the TextToSpeech Server (the daemon, or stripped version of Monet). Stripping the current Monet is likely a better approach than porting the original, for both the Mac and GNU/Linux versions, based on a source that will compile for either using conditional compilation, as for the current Monet. Other items are as noted in the text above. Robert Slover has undertaken to solve the audio output requirement for GNU/Linux; he just needs time beyond that devoted to the work that earns his living! Greg Casamento has simply run out of resources for taking on this task as he is now the chief GNUstep maintainer.
Gnuspeech is currently fully available as a NeXTSTEP 3.x version, and partly available (specifically Monet) as a version that compiles for both Mac OS/X and GNU/Linux under GNUstep. These files are available in the CVS repository.
Developers should contact the authors/developers through the GNU project facilities. To join the gnuspeech mailing list, please visit the subscription page. Papers and manuals are available on-line (see below).
A number of papers and manuals relevant to gnuspeech exist:
provides a reasonably detailed explanation of the theory underlying the tube resonance model.
The Tube Resonance Model: a write-up of the waveguide model of the acoustic tubes that form the underlying model of the human vocal apparatus.
There is the original NeXTSTEP Developer Package, which is available under a GPL, but does not run under GNU/Linux. There is also now a version of the full Monet system for Mac OS/X and GNUstep that provides the core of the text-to-speech development facilities and allows arbitrary text to be changed to speech. Note that further work is needed to strip this version to make a daemon-like module for incorporation within applications, or as a service, as noted above. Check out the Savannah CVS repository and search for gnuspeech. Current work is under the “current” directory.
See the section on Manuals and papers
To contact the maintainers of gnuspeech, report a bug, contribute fixes
or improvements, join the development team, or join the gnuspeech mailing list, please visit the gnuspeech project page and use the facilities provided.
Please send FSF & GNU inquiries & questions to
gnu@gnu.org.
We thank David Hill for writing this page.
Please send comments on these web pages to
webmasters@www.gnu.org.
Copyright (C) 1998, 2001 Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111, USA
Verbatim copying and distribution of this entire article is
permitted in any medium, provided this copyright notice is preserved.
Page last updated 2008-10-16 @ 19:53 PDT