> Hardware and Software Aspects of a Speech Synthesizer
> Developed for Persons with Disabilities
> Tony Vitale
>
> Assistive Technology Group
>
> Digital Equipment Corporation
>
> vitale@speech.enet.dec.com
>
> Copyright 1993. Journal of the American Voice Input/Output
> Society.
>
> NOTE
>
> Reprinted with Permission from the Editor
>
> IBM, PCXT, and Personal Computer AT are registered trademarks
> of International Business Machines, Inc., and Screen Reader is
> a trademark of International Business Machines, Inc., Microsoft
> Windows is a trademark and MS-DOS is a registered trademark
> of Microsoft Corporation, Jaws is a trademark of Henter-Joyce,
> Inc., Vocalize is a trademark of G.W. Micro, ASAP is a trademark
> of Microtalk, Touch Tone is a trademark of American Telephone
> and Telegraph, Bookwise is a trademark of Xerox Imaging Systems,
> Inc., Vert Pro is a trademark of TeleSensory Corporation, Flip-
> per is a trademark of Omnicron. Trademarks of other products are
> held by the companies producing them.
>
> Digital Equipment Corporation
> Maynard, Massachusetts
>
> CONTENTS
>
> Preface
>
> 1 Motivation
>
> 2 Description
>
> 3 Architecture
>
> 3.1 Generic Functionality
>
> 4 Command Functionality and Speech Functionality
>
> 4.1 Command Interface
>
> 4.2 Speech Improvements
>
> 5 Communication
>
> 6 Windows[TM] Support
>
> 7 Testing
>
> 7.1 Field Testing
>
> 7.2 Segmental and Prosodic Intelligibility Testing
>
> 7.3 Screen-Access Software Testing
>
> 7.4 Learning Disabilities Software Testing
>
> 8 Commercial Applications Workstations
>
> 9 Telephone-based Response
>
> 10 Conclusion
>
> Appendix A SELECTED REFERENCES
>
> Preface
>
> Until fairly recently, very high quality speech synthesis has
> been too expensive for extensive use as assistive technology or
> for prototyping and developing applications for the multimedia
> workstation. Furthermore, the functionality inherent in such
> devices has not been optimized for either type of application.
> This paper discusses a new generation of synthesizer which
> offers high quality text-to-speech at reduced cost and with
> more appropriate functionality for such applications. This
> is a PC card which is XT/AT compatible, capable of running on
> machines from the 8088-based XTs to the 80486-class machines
> operating at high bus speeds. It combines the latest in high
> quality text-to-speech technology with functionality designed
> to be used with currently available software such as screen-
> readers. This functionality includes the ability for immediate
> stop and start speaking, a large buffering capability, large
> internal (fixed) and user (modifiable) dictionaries, faster
> speaking rates, and improved pronunciation capability including
> the ability to automatically pronounce proper names. Output
> of stored (digitized) voice files is also possible. The card
> has no on-board telephonics but could be combined with one of
> the available telephonics cards to provide for voice-response
> applications.
>
> 1 Motivation
>
> Speech I/O technology has long been important for individuals
> with disabilities. Moreover, with the recent passage of the
> Americans with Disabilities Act (ADA), speech has become
>
> The Americans with Disabilities Act P.L. 101-336 became law
> when it was signed by President George Bush at 10:26 AM on
> July 26, 1990. This established a "clear and comprehensive
> prohibition of discrimination on the basis of disability." An
> earlier version of this paper was presented at the Symposium on
> Telecommunications, Office Automation, and the Americans with
> Disabilities Act in Minneapolis, September 1992. I would like to
> thank Barbara Wise of the Dept. of Psychology at the University
> of Colorado who articulated much of the section on learning
> disabilities, and Ken Kuenzel of Kuenzel Software Technology for
> his observations on Windows[TM].
>
> Voice input is beneficial for individuals with upper extremity
> problems such as difficulty moving their arms and hands, and
> synthesized speech output is useful for individuals with vocal
> impairments as well as learning disabilities. Furthermore, it
> is estimated that as many as 12.8 million Americans have some
> visual limitation and speech technology can be of assistance in
> reading computer-generated text aloud.
>
> High quality speech synthesis has traditionally been rele-
> gated to the commercial sector and has been fairly expensive
> for individuals within the disabled community. These users have
> typically had the choice of a low-cost synthesizer with corre-
> spondingly low intelligibility levels and limited functionality,
> or a high-end device with a greater range of functionality, high
> intelligibility and greater naturalness but available only at
> a relatively high cost. These high-end systems were typically
> designed for telecom applications rather than for PC-based work-
> stations and assistive applications which might run on such
> workstations.
>
> 2 Description
>
> The synthesis system consists of an option card and software
> which run on MS-DOS[TM] systems and provide synthesized voice
> output of ASCII text sent by other PC software applications.
> The card can be used in IBM PC/XT/AT[TM] or 100% compatible
> personal computers and fits in a full-length option slot on an
> XT/ISA/EISA (8/16/32-bit) bus. Either a 5.25 inch or 3.5 inch
> disk drive can be used to load the distribution media. The card
> runs on DOS V3.3 or later. Both commands and text are passed to
> the synthesizer card through a memory-resident DOS driver, which
> is a Terminate and Stay Resident (TSR) program. The synthesizer
> card comes with an external loudspeaker which has both a volume
> control and a jack for headphones. A speaker jack can be plugged
> directly into the board for traveling and similar uses. The
> speech and functionality microcode results in high quality
> speech synthesis. A wide variety of applications software for
> screen-reading and learning disabilities is available through
> third parties (below).
>
> The synthesizer was designed for persons with visual impair-
> ments, learning disabilities and severe speech impairments to
> enable them to have access to speech output when they use a com-
> puter. For some, the primary benefit is an additional dimension
> for input to assist them in hearing which keys have been typed
> and in understanding or "seeing" the screen. For others, the
> benefit is speech output that might enable them to speak on the
> telephone or make comments in a vocational environment.
>
> 3 Architecture
>
> The synthesizer is a phoneme-based formant synthesizer and has
> a three-level architecture (Bruckert et al. 1983). In Level 1,
> text from the PC is converted from normal (ASCII) orthographic
> text into phonemic code. The latter uses a standard phonemic al-
> phabet in which each symbol is phonetically unambiguous. It also
> uses an internal dictionary and letter-to-sound rules to perform
> this conversion. In the second level, the phonemic code is con-
> verted into synthesizer control parameters. These are continuous
> variables which control aspects of the speech such as pitch,
> amplitude, duration and the like for the various voices. In the
> last level, the control parameters generate a speech waveform.
> This waveform is converted to an analog speech signal through
> a D/A converter. In the second and third levels, a synthesizer
> control command (a set of phonetic parameters) is generated ev-
> ery 6.4 milliseconds, and the digital signal processor generates
> a speech waveform value every 100 microseconds. This process
> generates "frames" of speech which are perceived by a listener
> to be one continuous, unbroken sequence.
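>
> The timing figures above imply a sample rate and a frame size.
> The following Python sketch simply works through that
> arithmetic. The function names are illustrative only and are
> not part of the synthesizer's software.

```python
# Timing figures quoted in the text, expressed in microseconds.
FRAME_PERIOD_US = 6400   # one set of control parameters every 6.4 ms
SAMPLE_PERIOD_US = 100   # one waveform value every 100 microseconds


def sample_rate_hz(sample_period_us: int = SAMPLE_PERIOD_US) -> float:
    """Number of waveform values produced per second."""
    return 1_000_000 / sample_period_us


def samples_per_frame(frame_period_us: int = FRAME_PERIOD_US,
                      sample_period_us: int = SAMPLE_PERIOD_US) -> int:
    """Number of waveform values generated under one control frame."""
    return frame_period_us // sample_period_us


def frames_per_second(frame_period_us: int = FRAME_PERIOD_US) -> float:
    """Number of control frames issued per second."""
    return 1_000_000 / frame_period_us


if __name__ == "__main__":
    print(sample_rate_hz(), samples_per_frame(), frames_per_second())
```

> Under these figures the waveform is produced at 10 kHz, each
> control frame covers 64 waveform values, and roughly 156 frames
> are issued per second.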
>
> 3.1 Generic Functionality
>
> Because many potential users of speech synthesizers have visual
> impairments, accessibility and ease of use were key factors. A
> sizable and representative population of users and developers
> with visual impairments was asked to
> provide input for the product. A Getting Started Card in Braille
> assists the blind user by listing the contents of the package.
> Installation is made easier for the unsighted person by pro-
> viding instructions in three formats: (a) a hardcopy form for
> reference by a sighted person; (b) a softcopy form in ASCII
> which can be read to the user by the synthesizer itself after
> a screen-reader or similar software is installed, and (c) an
> audio tape version of the installation procedure which can be
> used with any standard audio cassette recorder. Furthermore,
> the ASCII form can be output to a Braille printer for those
> individuals who can read Braille.
>
> To further assist visually-impaired users, the manual contains
> detailed descriptions of the locations of switches and the size
> and shape of certain parts of the board. The board is installed
> using standard PC-card installation procedures. The software
> installation is initiated by simply inserting a diskette and
> typing Install. For unsighted users, a series of tones is used
> in the installation procedure. A falling tone is an indication
> to change or remove the diskette; a rising tone is an indication
> to press Enter. Non-default or special installations such as
> those which place files in non-default directories may require
> a sighted assistant. When the installation is completed, a
> verbal message is spoken by the synthesizer so that the user
> will know the board is operating correctly. This startup message
> is customizable and easily modified. Many unsighted users have
> successfully installed the software without difficulty.
>
> The original implementation of the system contained a BIOS ROM
> which handled such items as IRQ and BIOS address settings. In a
> later implementation, the BIOS ROM was removed, greatly
> simplifying setup so that only I/O addresses need to be set. In the event
> that there are conflicts with other boards already installed
> in the PC, an automatic configuration utility is provided so
> that conflicts are automatically analyzed and recommendations
> are made for changes in switch settings. In the current version
> with the BIOS ROM removed, there are only 4 possible switch set-
> tings. In case of a conflict, a sighted assistant is necessary
> since the screen cannot be read by special software until the
> synthesizer is properly configured.
>
> Simple in-line commands were created to allow the user to con-
> trol both speech and functionality. The PC card is soft-loadable
> which provides for more flexibility in future upgrades and modi-
> fications of the existing software.
>
> 4 Command Functionality and Speech Functionality
>
> 4.1 Command Interface
>
> The synthesizer card provides commands for a wide range of func-
> tions. There are commands for modifying characteristics (i.e.
> the phonetic parameters) of a particular voice and increasing
> or decreasing the speaking rate (above). For developers who
> need to work with customized voices or speech researchers who
> are attempting to analyze and better understand human speech,
> the command set can set a wide variety of speech parameters,
> from pitch range (for a greater excitement level) and head size
> (for a deeper voice) to formant frequencies and bandwidths.
> There are 28 acoustic parameters, including formant frequencies,
> bandwidths and gains, which can be modified and set. One of the
> goals for the future is to develop an interface which would
> allow for more facile and flexible methods of manipulating
> prosodic features, perhaps via a joystick or similar device.
> Speech researchers as well as disabled users (especially
> vocally-impaired individuals) would then be able to work more
> effectively with speech parameters to create a more natural
> output.
>
> Since ambiguities exist in a variety of text outside of normal
> orthography (below), a number of settable modes are provided.
> These include a math mode in which numeric text is inter-
> preted as mathematical notation, a spell mode in which words
> are spelled rather than pronounced, a European mode in which
> certain numeric sequences are interpreted in a European (vs. US)
> style, and a name mode in which selected words are treated as
> proper names (below).
>
> One mode which was added at the request of users of
> screen-reading software was a citation mode. This requires a
> "citation" pronunciation of a word in certain circumstances. For
> example, if a visually-impaired person using a screen-reader is
> in word mode (i.e. each word is pronounced separately), certain
> words should be pronounced as they would be in isolation. Some
> of the higher quality synthesizers are tuned to natural speech
> within discourse and take into account linguistic phenomena
> such as vowel reduction, consonant cluster simplification, sub-
> phonemic attenuation and morphophonemic alternation. However,
> this tends to reduce intelligibility and distract the listener
> when such words are pronounced in isolation. For example, the
> word to exhibits different phonetic behavior when occurring in
> different environments such as before a consonant, a vowel or
> silence. If the context is / - # [+cons], for example, the vowel
> is acoustically quite different than if it is in / - # [+voc]
> (i.e. similar to the morphophonemic alternation of the definite
> article in similar contexts). Therefore, a citation mode was
> created such that a set of words would always be pronounced as
> if they occurred in isolation. Naturally, retention of this mode
> in running discourse would have a detrimental effect on natu-
> ralness for the same reason. Further development is underway to
> make this mode automatic, i.e. to anticipate when a user wants
> words pronounced in isolation.
>
> Punctuation can be processed in a number of different ways
> depending upon the application. In the normal mode, only
> punctuation which is not clause-final or phrase-final is spoken.
> However, two additional modes were created whereby all or no
> punctuation is spoken. Applications where special text (such as
> a high level computer language) needs to be read can therefore
> turn all punctuation on. Other applications, such as scanning of
> unrestricted text, may turn punctuation completely off.
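>
> The three punctuation modes can be illustrated with a short
> Python sketch. The mode names ("some", "all", "none"), the
> spoken names of the symbols and the function name are
> assumptions made for illustration, and the sketch treats a
> fixed set of characters as clause-final regardless of position,
> which is a simplification of what a real text-to-speech front
> end would do.

```python
# Characters treated here as clause- or phrase-final punctuation
# (a simplifying assumption: position in the text is ignored).
CLAUSE_FINAL = set(".,;:?!")

# Hypothetical spoken names for a few symbols.
SPOKEN_NAMES = {
    ".": "period", ",": "comma", ";": "semicolon", ":": "colon",
    "?": "question mark", "!": "exclamation point",
    "=": "equals", "(": "left paren", ")": "right paren",
    "#": "number sign", "*": "star",
}


def spoken_tokens(text: str, mode: str = "some") -> list:
    """Return the words and punctuation names that would be spoken."""
    if mode not in ("some", "all", "none"):
        raise ValueError("mode must be 'some', 'all' or 'none'")
    tokens = []
    word = ""
    for ch in text:
        if ch.isalnum() or ch == "'":
            word += ch
            continue
        if word:                      # a word just ended
            tokens.append(word)
            word = ""
        if ch.isspace():
            continue
        name = SPOKEN_NAMES.get(ch, ch)
        if mode == "all":
            tokens.append(name)
        elif mode == "some" and ch not in CLAUSE_FINAL:
            tokens.append(name)
        # mode == "none": punctuation is silently dropped
    if word:
        tokens.append(word)
    return tokens
```

> For the input "x = 1;" the "all" mode speaks every symbol, the
> "some" mode drops only the final semicolon, and the "none" mode
> speaks the words alone.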
>
> Commands allow the application to terminate speech immediately
> instead of waiting for the buffered text to complete. The com-
> mand interface also allows for the resumption of speaking where
> the text left off, or the flushing of text and immediate pro-
> cessing of new text.
>
> There are modes for letter-by-letter and word-by-word pronuncia-
> tion as well as full clause (normal) pronunciation such that the
> synthesizer is able to immediately speak single characters with-
> out waiting for an entire clause to be buffered. This is useful
> in applications requiring auditory feedback for what was typed
> on the keyboard and is one of the most widely used features by
> unsighted individuals. The software also provides normal clause
> buffering for highly natural speech. The mode in which
> characters are spoken immediately necessitated a restructuring
> of certain modules, since prosodic contours are typically based
> on clause-level strings to attain greater naturalness. This was
> done through the creation of dedicated tables containing all of
> the characters which could be produced on a typical keyboard.
> This allows the unsighted user immediate aural feedback of all
> characters, including those which are not echoed, such as
> spacebar and backspace. Commands therefore simplify the
> interface to screen-readers for the processing of letters,
> words, phrases, clauses, paragraphs and whole documents. The
> input and output buffers are each 4 Kbytes.
>
> Volume control is settable both in hardware and in software.
> There is a volume control on the external loudspeaker (above)
> but volume is also settable in software by a command sequence
> through the standard command interface or directly through the
> TSR. The software control was added since various manufacturers
> of screen-access software need to be able to easily manipulate
> various aspects of speech such as volume.
>
> Commands also exist to generate tones (e.g., for margin bell,
> alert etc.) in addition to speech sounds. A developer still has
> the flexibility of modifying acoustic parameters such as pitch,
> duration and the like to create different voice qualities. Be-
> cause of this ability to modify pitch and duration, vocal music
> can be synthesized as well. Pitch was tuned to the musical (A
> = 440Hz) rather than the physical (A = 430.4 Hz) scale since a
> common application especially among vocally-impaired youngsters
> is to use the synthesizer as lyrical accompaniment to a musical
> instrument.
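>
> To show what tuning to A = 440 Hz means for sung notes, the
> following Python sketch computes note frequencies from that
> reference. Twelve-tone equal temperament is assumed here; the
> text does not state which temperament the synthesizer uses,
> and the function name is illustrative.

```python
def note_frequency_hz(semitones_from_a4: int, a4_hz: float = 440.0) -> float:
    """Frequency of the note a given number of semitones above
    (positive) or below (negative) the A above middle C, assuming
    twelve-tone equal temperament."""
    return a4_hz * 2 ** (semitones_from_a4 / 12)


if __name__ == "__main__":
    # Middle C lies nine semitones below the reference A.
    print(round(note_frequency_hz(-9), 2))
```

> Passing a4_hz=430.4 instead gives the corresponding notes on
> the physical scale mentioned above.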
>
> 4.2 Speech Improvements
>
> Intelligibility and naturalness retain their high priority.
> Word pronunciation is extremely accurate. Normal words such as
> the common nouns found in a hardcopy dictionary are rarely mis-
> pronounced. There have been modifications in letter-to-phoneme
> rules and a more sophisticated morpheme decomposition algorithm
> has been added. Rules have also been added for proper names,
> medical terms and other subsets of the English lexicon which
> would require extensive memory if a database (i.e. dictionary)
> were exclusively used.
>
> The synthesizer contains a large built-in dictionary which
> assists both the pronunciation of individual words and the
> rhythmic naturalness of the speech. This fixed (non-accessible) dictionary is
> many times larger than previous versions of the same synthesizer
> and consists of complex lexical entries with a wider variety of
> syntactic and semantic information. This is used to feed other
> modules in order to increase naturalness. This has allowed for
> improvements such as automatic homograph handling (below) and
> will, in the future, assist in more natural pausing, contextual
> timing (e.g., phrase-final lengthening), higher intelligibility
> at fast speaking speeds, and generally more natural prosodics.
> Such fixed dictionaries are inaccessible to the user although a
> user dictionary is accessible and modifiable (below).
>
> Heuristic rules have been refined and are now more intelligent.
> In addition to normal number processing for dates, fractions and
> the like, unpronounceable sequences such as initialisms (e.g.,
> FBI, IRQ, EEC) are also handled automatically.
>
> A user (modifiable) dictionary can be used to load application-
> specific words, DOS-specific terms, and the like. This is also
> much larger than those of earlier versions although the size
> is somewhat variable and depends upon what other software is
> resident on the board. However, the usable space may be as high
> as 350 Kbytes and is sufficient to load thousands of words.
> Because of the large dictionary, developers can now input many
> keyboard key names and commonly used DOS and PC application
> words and commands as well as developing their own application-
> specific lexicon.
>
> Speech rate runs from a slow speed of 100 wpm to an upper rate
> of approximately 720 wpm. Very rapid speaking speeds are useful
> for applications where scanning large bodies of text is nec-
> essary. The rate of speech for a normal speaker of English
> is roughly 160-220 words per minute. Although human speech
> has been timed at over 550 wpm, there are physiological con-
> straints which reduce intelligibility dramatically in very rapid
> speech. Furthermore, when speech at these rapid rates is slowed
> down to more normal rates, intelligibility is comparatively much
> lower. Informal observations suggest high quality synthesized
> speech may be more comprehensible at these high speeds than hu-
> man speech perhaps due to the fact that with redundancy as a
> constant there is an upper limit to the speed of movement of
> human articulators. When speech at these high speeds is slowed
> down, intelligibility is lower when compared to normal
>
> A similar result occurs when a read passage at normal speed
> is sped up or slowed down due to a consequent frequency shift.
> Foulke (1969) has claimed that intelligibility declines rapidly
> at speeds in excess of 275 wpm due, in part, to limitations
> in short-term memory. It is also interesting to note that this
> threshold may be language-dependent. In any event, many indi-
> viduals with visual impairments have requested faster scanning
> rates in synthesizers in order to more efficiently scan lengthy
> documents while searching for particular information.
>
> While speaking speed can run in excess of 450-500 words per
> minute, many sighted people have the ability to scan text at
> speeds in excess of 3000 wpm. Visually-impaired individuals need
> the ability to scan text aurally at speeds which exceed those of
> normal speaking (approximately 180-200 wpm) but clearly not as
> fast as visual scanning speeds. More efficient algorithms are
> being tested to maintain high intelligibility at these speeds by
> automatically adjusting duration and pausing as well as reducing
> morphosyntactic complexity and utilizing periodic segmentation.
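>
> The practical effect of the speaking rates discussed above can
> be seen with a little arithmetic. The Python sketch below
> estimates how long a document takes to hear at a given rate;
> the 7200-word document is an arbitrary example and the function
> name is illustrative.

```python
def listening_minutes(word_count: int, words_per_minute: float) -> float:
    """Minutes needed to speak word_count words at a constant rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


if __name__ == "__main__":
    # Rates taken from the text: slowest, typical speech, fastest.
    for rate in (100, 180, 720):
        print(rate, "wpm:", listening_minutes(7200, rate), "minutes")
```

> A 7200-word document takes 72 minutes at 100 wpm and 40 minutes
> at 180 wpm, but only 10 minutes at 720 wpm.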
>
> Another problem which was addressed is the more accurate pro-
> nunciation of proper nouns such as first names, last names and
> street names (Church 1986; Spiegel 1985; Vitale 1989, 1991b
> and others). Names commonly found in a telephone book are es-
> sentially loanwords from various languages following different
> phonotactic rules. It is no longer reasonable for a high quality
> synthesis device to exclude this large subset of words from be-
> ing properly handled. Therefore, rules were added which handled
> such names with a higher degree of accuracy and greater level
> of intelligence. The algorithms underlying this functionality
> were originally developed for large commercial telecom applica-
> tions such as reverse directory assistance (i.e. number-to-name)
> but have now been modified to run in real-time on the PC card
> (Vitale 1989, 1991b). Name pronunciation can be run in two
> different modes: (a) process the very next word as a proper
> name, or (b) process all uppercase non-sentence-initial words
> as proper names.
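>
> Mode (b) can be approximated with the following Python sketch,
> which marks capitalized words that do not begin a sentence as
> candidates for name pronunciation. It assumes that "uppercase"
> means an initial capital letter and that sentences end with a
> period, question mark or exclamation point; neither detail is
> specified in the text, and the function name is hypothetical.

```python
import re


def name_candidates(text: str) -> list:
    """Return capitalized words that are not sentence-initial."""
    candidates = []
    sentence_initial = True
    # Tokens are either words or sentence-ending punctuation marks.
    for token in re.findall(r"[A-Za-z']+|[.!?]", text):
        if token in ".!?":
            sentence_initial = True
            continue
        if token[0].isupper() and not sentence_initial:
            candidates.append(token)
        sentence_initial = False
    return candidates


if __name__ == "__main__":
    print(name_candidates("Please call Tony Vitale. He lives in Maynard."))
```

> A rule this simple also flags capitalized words which are not
> names (the pronoun "I", for example) and misses names at the
> start of a sentence, which is one reason mode (a) is useful
> when the application already knows which word is a name.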
>
> Homographs, forms which are spelled alike but pronounced differ-
> ently (from Gk. homo 'same' and graph 'writing'), are handled
> automatically in most cases and this is accomplished through
> a relatively simple morphosyntactic scan. There are over two
> hundred pairs of these words in English and it is a linguistic
> phenomenon found in many other languages as diverse as French
> and Japanese. Examples of such words in English are record,
> permit, attribute, deliberate, bass, and many others which can
> have two different pronunciations depending upon how the word
> is used in context. Therefore, to handle sentences such as They
> refused the refuse, human intervention (manual phonemicization)
> is typically needed. Some homographs are quite common and their
> mispronunciation has been a constant source of distraction to
> synthesizer users in the blind community. Since text-to-phoneme
> algorithms are somewhat simplistic and simply convert a spelled
> form to a sequence of sound symbols, homographs are clearly
> troublesome.
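>
> The idea of a simple morphosyntactic scan can be sketched in
> Python as follows. This is a toy illustration and not the
> synthesizer's algorithm: the word lists are small hand-picked
> assumptions, only the immediately preceding word is examined,
> and inflected forms such as "refused" are not analyzed.

```python
import re

# Hand-picked cue words (illustrative, far from complete).
DETERMINERS = {"the", "a", "an", "this", "that", "his", "her", "their"}
VERB_CUES = {"to", "i", "you", "we", "they", "will", "would", "can", "must"}

# A few noun/verb homographs mentioned in or suggested by the text.
HOMOGRAPHS = {"record", "permit", "refuse", "attribute"}


def tag_homographs(sentence: str) -> list:
    """Guess a noun or verb reading for each known homograph."""
    words = re.findall(r"[a-z']+", sentence.lower())
    tagged = []
    for i, word in enumerate(words):
        if word not in HOMOGRAPHS:
            continue
        previous = words[i - 1] if i > 0 else ""
        if previous in DETERMINERS:
            reading = "noun"      # e.g. "the record"
        elif previous in VERB_CUES:
            reading = "verb"      # e.g. "they record"
        else:
            reading = "unknown"   # the scan cannot decide
        tagged.append((word, reading))
    return tagged


if __name__ == "__main__":
    print(tag_homographs("They record the record."))
```

> Once a reading is chosen, the matching pronunciation (for
> example, stress on the first syllable for the noun "record" and
> on the second for the verb) can be selected from the
> dictionary entry.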
>
> In practice, of course, the homograph problem has rarely re-
> sulted in intelligibility problems except perhaps in telephone-
> based applications. However, it is simply one more issue which,
> when handled automatically, contributes to the overall natu-
> ralness of speech synthesis systems. A fuller discussion of the
> homograph problem is beyond the scope of this paper. However,
> the problem is more pervasive than is sometimes believed and can
> include function words (e.g., can, just), proper names (e.g.,
> Robert, Guy), and other lexical categories. Clearly, homographs
> belonging to the same form class such as proper names (above),
> and even simple verbs (e.g., read, tear) are much more difficult
> to handle since a simple parse is insufficient to disambiguate
> them.
>
> Finally, the card has the capability of outputting digitized
> speech in addition to synthesized speech. It does not have
> the capability for the user to input voice and digitize the
> utterance. However, a voice segment which has already been
> recorded, digitized, and stored on the PC can be played back
> through the card. Voice files can be created using one of the
> digitizing boards currently on the market. The playback is done
> at 10 kHz. A special command can be used to synchronize digitized
> data with the text stream.
>
> 5 Communication
>
> There are a number of ways to communicate directly with the
> board. It can be treated as a standard ASCII device connected
> to a serial port or a parallel printer port and thus simple
> copy or print commands may be issued (e.g., print (filename)
> lptx or copy (filename) comx). Communication may also be
> accomplished directly with the TSR. Both types of functionality
> can be present simultaneously so it is possible to utilize both
> paths at the same time. Synchronization of commands and data, if
> both paths are utilized, is the responsibility of the
> application. Also, many of the command strings such as change
> rate, change voice, start speaking, stop speaking, index, index
> reply and the like can be shortened to the smallest unambiguous
> substring. These mnemonic commands are more user-friendly than
> the escape sequences of the past. Most commands also contain
> a number of parameters for finer granularity of control. For
> example, to generate an error message, the user has the option
> of generating the message as a text string, an escape sequence,
> a tone, or even a synthesized voice message in any pre-selected
> voice or speaking rate. Tones and voice messages were added for
> unsighted individuals.
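>
> Shortening a command to its smallest unambiguous substring can
> be sketched in Python as prefix matching against the command
> table. The command names below are hypothetical stand-ins and
> are not the synthesizer's actual command set.

```python
# Hypothetical command names, used only for illustration.
COMMANDS = ["rate", "voice", "volume", "error", "index",
            "punctuation", "phoneme"]


def resolve_command(abbrev: str, commands=COMMANDS) -> str:
    """Expand an abbreviation to the single command it identifies."""
    key = abbrev.lower()
    if key in commands:               # an exact name always wins
        return key
    matches = [name for name in commands if name.startswith(key)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {abbrev!r}")
    raise ValueError(f"ambiguous command {abbrev!r}: {', '.join(matches)}")


def shortest_unambiguous(command: str, commands=COMMANDS) -> str:
    """Return the shortest prefix that identifies only this command."""
    for length in range(1, len(command) + 1):
        prefix = command[:length]
        if sum(name.startswith(prefix) for name in commands) == 1:
            return prefix
    return command


if __name__ == "__main__":
    print(resolve_command("ra"), shortest_unambiguous("voice"))
```

> With this table, "ra" expands to "rate", while "vo" is rejected
> because it matches both "voice" and "volume"; the shortest
> usable form of "voice" is "voi".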
>
> Other commands control the processing of phonemic text. Text-
> to-speech devices still require a utility whereby a user can
> create a specialized lexical entry or unusual word such as
> application-specific jargon which can be phonemicized within
> the text itself rather than as a dictionary entry. When set,
> this command allows everything within square brackets to be
> interpreted as phonemic text, by defining the characters "[" and
> "]" as phoneme delimiters. Thus, all text and characters which
> appear between square brackets will be interpreted as phonemic
> text and will be pronounced as such. These commands can be used
> in-line within normal orthographic text. Also, in a previous
> implementation, square brackets were nested (i.e. necessitating
> an equal number of closed and open brackets) which occasionally
> presented problems in extricating text from phonemic mode. In
> the system described here, one square bracket is sufficient to
> close phonemic mode.
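>
> The non-nested bracket behavior can be sketched in Python as a
> small two-state scanner. The treatment of a second "[" inside
> phonemic mode (ignored) and of a stray "]" in normal text (kept
> as an ordinary character) are assumptions, and the phoneme
> strings in the example are placeholders, not real symbols.

```python
def split_phonemic(text: str) -> list:
    """Split text into ("text", ...) and ("phonemic", ...) segments.

    A single "]" always closes phonemic mode, however many "["
    characters preceded it.
    """
    segments = []
    mode = "text"
    buffer = []
    for ch in text:
        if ch == "[" and mode == "text":
            if buffer:
                segments.append(("text", "".join(buffer)))
                buffer = []
            mode = "phonemic"
        elif ch == "[" and mode == "phonemic":
            continue                  # assumption: extra "[" is ignored
        elif ch == "]" and mode == "phonemic":
            if buffer:
                segments.append(("phonemic", "".join(buffer)))
                buffer = []
            mode = "text"
        else:
            buffer.append(ch)
    if buffer:
        segments.append((mode, "".join(buffer)))
    return segments


if __name__ == "__main__":
    print(split_phonemic("say [p1 p2] now"))
```

> Because brackets do not nest, an accidental extra "[" cannot
> leave the remainder of a document stuck in phonemic mode.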
>
> While there are no on-board telephonics, the PC card can
> generate Dual Tone Multi-Frequency (DTMF) tones (Touch-Tones[TM])
> for 0-9, *, #, "," (pause) and A, B, C, D (for handsets which
> contain these). The comma can be used to generate a 2-second
> pause for applications where intertonal pause is required for
> dialout.
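>
> As an illustration of the dialing symbols listed above, the
> following Python sketch maps each key to its standard DTMF
> frequency pair and expands a dial string into a sequence of
> tones and pauses. The frequency table is the standard DTMF
> keypad layout; the 0.1-second tone length and the function name
> are assumptions, and the sketch does not generate audio.

```python
# Standard DTMF keypad: each key combines one row and one column tone.
ROW_HZ = (697, 770, 852, 941)
COLUMN_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

DTMF = {
    key: (ROW_HZ[r], COLUMN_HZ[c])
    for r, row in enumerate(KEYPAD)
    for c, key in enumerate(row)
}

PAUSE_SECONDS = 2.0   # the comma produces a 2-second pause


def dial_plan(digits: str, tone_seconds: float = 0.1) -> list:
    """Expand a dial string into (symbol, frequencies, duration) steps."""
    plan = []
    for ch in digits.upper():
        if ch == ",":
            plan.append((",", None, PAUSE_SECONDS))
        elif ch in DTMF:
            plan.append((ch, DTMF[ch], tone_seconds))
        else:
            raise ValueError(f"not a dialable symbol: {ch!r}")
    return plan


if __name__ == "__main__":
    for step in dial_plan("9,1"):
        print(step)
```

> A dial string such as "9,1" therefore becomes the tone pair for
> 9, a 2-second pause, and the tone pair for 1.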
>
> Other commands can be used to generate sounds of different
> frequencies and lengths depending upon the parameters set
> (above). This allows for a wide variety of sounds for purposes
> such as notification, warning, and so on. Regular tones can also
> be used for a number of other purposes, such as indicating the
> margin bell in screen-reading applications, which is useful for
> someone who wishes to work in a quiet environment without using
> the speaker on the PC.
>
> Verbal feedback can also be given on incorrectly entered command
> strings. For example, the command [:error ...] can set the
> error mode for the module to be, e.g., visual, verbal, or a
> tone warning (above). This command is useful for debugging in
> an application development setting and is especially useful for
> unsighted individuals.
>
> 6 Windows[TM] Support
>
> With the recent interest in the graphical user interface (GUI)
> there has been renewed interest in as well as concern about the
> interface for unsighted individuals. A number of large corpo-
> rations have been investigating the use of GUIs with respect
> to the Americans with Disabilities Act. In Europe, a simi-
> lar law for individuals with disabilities requires ergonomic
> considerations in the design of GUI software.
>
> Third party developers can be supported via a private Dynamic
> Link Library (DLL) that provides functionality similar to that
> in the DOS TSR/driver. The DLL also supports a standard
> configuration interface dialog, standard parameter setting
> dialogs, and default user initialization files. Users of
> standard Windows programs will be able to access the DLL via an
> installable driver that can be configured as the default
> printer, allowing any Windows software package that prints to
> output to the module.
>
> Note to Section 5: Tone dialing through a handset microphone is
> usually not very reliable.
>
> 7 Testing
>
> 7.1 Field Testing
>
> Extensive field testing was conducted throughout a widespread
> geographical area over a period of 3 months with individuals who
> were visually-impaired and learning-disabled and using software
> programs developed for reading screens (below). A number of
> software developers who participated in these tests also had
> some disability, often blindness or severe visual impairment.
> The feedback allowed for a number of essential modifications to
> be made to both hardware and software.
>
> Early testing determined that it was difficult for an unsighted
> person to effectively feel the ON/OFF position of the DIP-
> switches as these were on a corner of the original switch pack.
> The part was eventually changed so that the switches were in
> a more accessible position and the ON/OFF status could be more
> easily determined. In addition, an installation procedure was
> developed so that unsighted individuals could install the board
> and software without the assistance of a sighted person (above).
> A series of tones is used in this procedure as an indication to
> change or remove the diskette or to press Enter.
>
> 7.2 Segmental and Prosodic Intelligibility Testing
>
> Testing was continued in areas of the synthesizer relating
> to intelligibility and naturalness in both the segmental and
> prosodic domain. For example, investigation has begun on which
> levels of structure contribute most to naturalness (e.g., glot-
> tal pulse, segmental phonetic, prosodic, etc.) and results of
> such analysis should yield interesting clues about how listen-
> ers perceive naturalness. Hopefully, this will ultimately lead
> to a better definition of naturalness as well as allow modi-
> fications to values and rules which relate to factors such as
> timing, F0 and the like. Segmental durations typically have a
> more limited effect on intelligibility although clearly tar-
> gets have to be hit. If the onset of voicing were to begin
> too early in a segment, a voiceless stop might be perceived
> as voiced thus creating a phonological (and hence intelligi-
> bility) problem. Cumulatively, even sub-phonemic inaccuracies
> contribute perceptually to an overall lack of naturalness. On
> the prosodic level, improper stress on many noun compounds, lack
> of pauses for breath groups and sense groups, over-generalized
> question intonation and the like, all have a negative effect on
> naturalness.
>
> 7.3 Screen-Access Software Testing
>
> The synthesizer card works with many of the available
> screen-access programs. These
> software packages include JAWS[TM] (Job Access with Speech -
> Henter-Joyce), Vocal-Eyes[TM] (G.W. Micro), ASAP[TM] (Automatic
> Screen Access Program - MicroTalk), Vert Pro[TM] (TeleSensory
> Corporation), Flipper[TM] (Omnicron), Screen Reader[TM] (IBM)
> and a number of others.
>
> Screen-access software or screen-readers are software packages
> which are designed to allow visually-impaired individuals the
> ability to navigate through screen displays of an application
> by an intelligent processing of text in conjunction with the
> functionality of a speech synthesizer. Screen-readers offer
> such functions as a typing mode for auditory feedback (above),
> pitch changes to determine the position of uppercase letters,
> different voices for keyboard entry and screen changes, and the
> ability to navigate around the screen by auditory feedback of
> cursor position.
>
> Some educational applications for screen access programs
> (including both learning-disability and general pedagogical
> applications) require special functionality to pronounce words
> by syllable. While syllabic segmentation on phonetic form is a
> prerequisite for accurate lexical stress assignment within the
> architecture of a speech synthesizer and is done internally,
> the syllable-by-syllable pronunciation of words more properly
> belongs in the application where different theories of
> syllabification can be tested and implemented. Creative ways have
> been found to syllabify from an orthographic rather than a
> phonetic form. Examples of
> this are Bookwise[TM] from Xerox Imaging Systems, and Reading
> with ROSS, a program for children with learning disabilities
> from the University of Colorado.
>
> 7.4 Learning Disabilities Software Testing
>
> Another benefit of text-to-speech is in the area of learning
> disabilities. The device was also tested at a number of sites
> which utilized software for children with reading and spelling
> problems. Children with severe reading problems, or dyslexia,
> typically have inherited problems in certain language processes,
> especially phonemic awareness (Olson et al. 1989). Such chil-
> dren have difficulty hearing the order of sounds (phonotactic
> distinctions) within a syllable which makes it difficult to map
> speech onto print and which underlies the problems in reading
> and spelling. The Reading with ROSS program at the University
> of Colorado has helped children improve their skills in word
> recognition, phonology, and attitude about reading, by providing
> speech support for any word that a child finds difficult while
> reading a story on the computer (Olson and Wise, 1992).
>
> A program with speech support for given words works well with
> synthetic speech, but could be accomplished almost as well with
> current versions of digitized speech. However, other programs
> have been developed or are in the process of development that
> can only be implemented with the text-to-speech capabilities
> of synthesized speech. For example, the Spello program, also
> from the University of Colorado, was designed to improve the low
> spelling skills of reading disabled children and, more impor-
> tantly, to remediate the underlying difficulties manipulating
> sounds and linking sounds to print (Wise & Olson, 1992). With
> Spello, children make an attempt at spelling a word that the
> computer pronounces. The synthesizer can pronounce not only the
> "target" correct word, but also every attempt that the child
> makes, so long as it contains a vowel. The Spello program thus
> gives unique phonological feedback that would be impossible with
> digitized speech, because obviously all possible versions of a
> word a child might attempt cannot be recorded. For example, a
> child may discover that cowk or chack does not sound much like
> chalk, but that chawk and chalk sound the same. The program
> would then give the children spelling feedback about which let-
> ters are in the correct word, and which are in the right place.
> Children spent significantly more time exploring how to spell
> the words when provided with speech feedback for errors than
> when provided with feedback for only the correct word (as dig-
> itized speech could do). Children appeared to make significant
> gains also in their ability to read related nonsense words after
> a week of training.
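>
> The letter feedback described above can be illustrated with a
> minimal sketch. It is not the Spello implementation; the function
> names, the label strings, and the treatment of y as a vowel are
> assumptions made only for illustration.

```python
from collections import Counter
from itertools import zip_longest

VOWELS = set("aeiouy")  # assumption: y is counted as a vowel


def can_pronounce(attempt: str) -> bool:
    """An attempt is treated as speakable only if it contains a vowel."""
    return any(ch in VOWELS for ch in attempt.lower())


def spelling_feedback(attempt: str, target: str) -> list[tuple[str, str]]:
    """Label each attempted letter: 'right place', 'in word', or 'not in word'."""
    attempt, target = attempt.lower(), target.lower()
    # Target letters that were not matched in their own position.
    remaining = Counter(
        t for a, t in zip_longest(attempt, target) if t is not None and a != t
    )
    labels = []
    for i, ch in enumerate(attempt):
        if i < len(target) and target[i] == ch:
            labels.append((ch, "right place"))
        elif remaining[ch] > 0:
            remaining[ch] -= 1  # use up one unmatched occurrence
            labels.append((ch, "in word"))
        else:
            labels.append((ch, "not in word"))
    return labels
```

> With the target chalk, the attempt chawk yields "right place" for
> c, h, a and k, and "not in word" for w. An attempt such as chk
> would be rejected by can_pronounce before being spoken.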
>
> Another program will allow children to explore and correct
> errors they have made in reading and can be simplified with
> the use of synthesized speech. Nonsense spellings such as skable
> and plumble would be segmented ska/ble and plum/ble to rhyme
> with stable and mumble. Similarly, has/ket and pa/sin
>
> In commercial applications, it could be argued that such cases
> need not be handled. In an assistive technology setting, how-
> ever, every attempt had to be made to incorporate these even
> though they do not properly constitute normal English ortho-
> graphic form.
>
> Initially, the more sophisticated letter-to-sound and morpheme-
> decomposition algorithms resulted in a number of errors on
> these non-standard spellings since phonological, phonotactic
> and morphophonemic rules were optimized to the target language
> (English) and such nonsense words fall outside the scope of the
> language. Nevertheless, modifications were made on letter-to-
> sound and morpheme decomposition rules to allow for this. Such
> modifications turn out to be positive since it can be argued
> that a good synthesis architecture should attempt to emulate the
> performance of a human in its errors as well as its accuracy.
>
> 8 Commercial Applications Workstations
>
> Many workstations are now PC-based and therefore the PC-based
> synthesizer will be expected to play a prominent role in speech
> output in such workstations in the next few years. One example
> of a workstation application which utilized the synthesizer
> card in field test for its synthesized voice output is a medical
> workstation which combines graphics, full-motion video, and
> speech I/O (Grams 1992). It contains large and comprehensive
> medical dictionaries, the phonetics of which were translated to
> those of the synthesizer so that any item in the dictionary can
> be accurately pronounced. Its database contains 8000 images,
> drug databases (including drug-drug interactions), knowledge
> equivalence of 12 textbooks, high resolution color displays,
> synthesized speech, digitized speech and speech recognition.
> A researcher or physician is able to reference abstracts from
> over 1000 journals within four weeks of publication making modem
> access unnecessary and allowing the workstation to function more
> as a stand-alone system. This workstation or similar ones will
> hopefully be in use in hospitals and doctors' offices around the
> country in the not-too-distant future. It is a small intellectual
> leap from this generic workstation to one which could be used by
> physically-challenged individuals in universities, libraries,
> clinics, schools, offices and the home.
>
> 9 Telephone-based Response
>
> The PC card has no on-board telephonics. However, fairly so-
> phisticated telephonics boards for the PC are now available and
> these typically offer stored voice output capabilities and han-
> dle a number of telephone ports. These telephonics boards and
> the synthesizer card could be combined into a packaged system to
> handle telephone-based voice messaging and voice response ap-
> plications such as E-mail access, reverse directory assistance,
> and database inquiry such as the Intelligent Newspaper for the
> Blind, financial and medical applications, and others. The soft-
> ware allows accommodation of up to four synthesizer cards in one
> PC.
>
> A number of new and exciting applications are being written for
> the disabled community. For example, the Intelligent Newspaper,
> similar to the Talking Yellow Pages, will use speech synthe-
> sis (and DTMF tones or speech recognition) to allow access to
> any information contained in the daily paper: sports, weather,
> news, etc. Newspapers will also be on-line at many public li-
> braries and a greater variety of voice response services will be
> available to assist people with limited mobility.
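>
> A keypad-driven service of the kind described above might be
> sketched as follows. The key assignments, function names, and
> prompt wording are hypothetical; the sketch only shows how a DTMF
> digit could select the text that is handed to the synthesizer.

```python
# Hypothetical mapping of telephone keypad digits to newspaper sections.
SECTIONS = {"1": "sports", "2": "weather", "3": "news"}


def menu_prompt() -> str:
    """Build the spoken menu from the section table."""
    return " ".join(
        f"Press {key} for {name}." for key, name in sorted(SECTIONS.items())
    )


def respond(digit: str, articles: dict[str, str]) -> str:
    """Return the text to be spoken for one DTMF digit."""
    section = SECTIONS.get(digit)
    if section is None:
        # Unknown key: repeat the menu instead of failing silently.
        return "Invalid selection. " + menu_prompt()
    return articles.get(section, f"No {section} items are available.")
```

> For example, respond("2", {"weather": "Sunny."}) returns the
> weather text, while an unassigned key such as 9 returns the menu
> again.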
>
> Another interesting telephone-based application is the automa-
> tion of relay services for hearing-impaired individuals. Some
> of the current research focuses on automatic speech recogni-
> tion (ASR) (Kanevsky et al. 1991) while others, understanding
> the limitations of large vocabulary ASR over the telephone, are
> concentrating on how speech synthesis, combined with spelling
> correction algorithms, can improve such services by ensuring
> privacy and eliminating the need for operator-assisted calls
> (Kukich 1992).
>
> 10 Conclusion
>
> A great deal of assistive voice technology has been generated
> from commercial applications. Major progress is being made
> because voice is a technology which supplements input devices
> such as the keyboard and mouse, which helps compensate for
> the sensory overload on our vision and limitations in manual
> dexterity, and which spans the distance gap via telephone lines.
> The logical connection here is that the non-disabled individual
> needs speech technology for essentially the same reason as the
> disabled individual: the need to work more expediently and
> play more enjoyably. Non-disabled individuals simply have a
> different point d'appui. While the priorities of activities
> within speech I/O technology necessarily change when discussing
> its use as assistive technology rather than within the context
> of telecommunications or office automation, many of the goals
> remain the same and a great deal of the groundbreaking work done
> in assistive technology is useful for the commercial markets as
> well.
>
> Conversely, the commercial applications such as those for
> telecommunications and office automation will spawn seminal
> research and development which will benefit assistive devices.
> The development of any of these applications means greater ac-
> cessibility for the visually or vocally-impaired individual. The
> technology is here to do much of what the applications will soon
> call for. The multi-media workstation now combines graphics,
> image, and voice input as well as output, e.g., voice-activated
> window systems, verbal feedback on input commands, verbal feed-
> back within windows, the integration of the telephone into the
> workstation, and the like. It is the user interface and the ap-
> plications which still need to be determined. But it is a small
> intellectual leap from this generic workstation to one which
> could be used by physically-challenged and speech, language and
> hearing-impaired individuals in universities, libraries, clinics
> and the home.
>
> The recent passage of the Americans with Disabilities Act has
> mandated that individuals with disabilities must be allowed ac-
> cess to employment, public services (including transportation),
> public accommodations, and telecommunications. In the next few
> years, voice output systems will be installed at airports as
> well as in train and bus stations for directional assistance.
> There is even a movement now underway in some cities to install
> simple voice systems on buses and perhaps later at traffic sig-
> nals to provide location and status messages. Through this all,
> speech technology is and will remain a seminal solution for the
> challenges faced by individuals with disabilities.
>
> APPENDIX A
>
> SELECTED REFERENCES
>
> Arons, B., 1992. A Review of Time-Compressed Speech. In
> Proceedings of the American Voice Input/Output Society.
>
> Bruckert, E., Minow, M., and Tetschner, W., 1983. Three-Tiered
> Software and VLSI Aid Developmental System to Read Text Aloud.
> Electronics. April 23.
>
> Church, K. W., 1986. Stress Assignment in Letter to Sound Rules
> for English. In Proceedings, IEEE International Conference on
> Acoustics, Speech and Signal Processing 4:2423-2426.
>
> Foulke, W., and Sticht, T. G., 1969. Review of Research on the
> Intelligibility and Comprehension of Accelerated Speech.
> Psychological Bulletin 72.56-62.
>
> Grams, R., 1992. A Physician's Workstation Designed for NASA and
> Earth-Based Applications. Journal of Medical Systems.
>
> Kanevsky, D., Danis C., Daggett, G., Epstein, E., Gopalakr-
> ishnan, P., Nahamoo, D., 1991. Prospects of Automatic Speech
> Recognition and Relay Se Proceedings of RESNA. July, Kansas
> City.
>
> Klatt, D. H., 1987. Review of Text to Speech Conversion for
> English. Journal of the Acoustical Society of America, Vol.
> 82(3):737-793.
>
> Kukich, K., 1992. Spelling Correction for the Telecommunications
> Network for the Deaf. Communications of the ACM 35.5, pp. 80-90.
>
> Lazzaro, J. J., 1990. Opening Doors for the Disabled. Byte,
> August, pp. 258-268.
>
> Olson, R.K., Wise, B.W., Conners, F., Rack, J., and Fulker, D.,
> 1989. Specific Deficits in Component Reading Processes: Genetic
> and Environmental Influences. Journal of Learning Disabilities,
> 22, 339-348.
>
> Olson, R.K. and Wise, B.W., 1992. Reading on the Computer with
> Orthographic and Speech Feedback: An Overview of the Colorado
> Remediation Project. Reading and Writing, 4, 107-144.
>
> Spiegel, M., 1985. Pronouncing Surnames Automatically. In
> Proceedings of the American Voice Input/Output Society, 109-132.
>
> Vitale, A.J., 1989. Application-Driven Technology: Automated
> Customer Name and Address. Proceedings of the American Voice
> Input/Output Society. October, Newport Beach, California.
>
> Vitale, A.J., 1991a. Speech Synthesis as a Prosthesis for Vocal
> Dysfunction: Modifications in Functionality and Improvements in
> Base Technology. Proceedings of Speech Tech '91. 221-230.
>
> Vitale, A.J., 1991b. An Algorithm for High Accuracy Name
> Pronunciation by Parametric Speech Synthesizer. Journal of
> Computational Linguistics 17,3. pp. 257-276.
>
> Vitale, A.J., 1992. Issues in Speech Technology for Persons with
> Disabilities. Journal of the American Voice Input/Output Society.
>
> Wise, B.W. and Olson, R.K., 1992. How Poor Readers and Spellers
> Use Interactive Speech in a Computerized Spelling Program.
> Reading and Writing, 4, 145-163.
>
> Vitale, A.J., 1993. Hardware and Software Aspects of a Speech
> Synthesizer Developed for Persons with Disabilities. Journal of
> the American Voice Input/Output Society 13.27-39.
>