CODI: Cornucopia of Disability Information

Hardware and Software Aspects of a Speech Synthesizer Developed for Persons with Disabilities

 
>               Hardware and Software Aspects of a Speech Synthesizer
>                     Developed for Persons with Disabilities

>                                     Tony Vitale
>
>                             Assistive Technology Group
>
>                            Digital Equipment Corporation
>
>                             vitale@speech.enet.dec.com
>
>          Copyright 1993. Journal of the American Voice Input/Output
>          Society.
>
>                                        NOTE
>
>             Reprinted with Permission from the Editor
>
>          IBM, PCXT, and Personal Computer AT are registered trademarks
>          of International Business Machines, Inc., and Screen Reader is
>          a trademark of International Business Machines, Inc., Microsoft
>          Windows is a trademark and MS-DOS is a registered trademark
>          of Microsoft Corporation, Jaws is a trademark of Henter-Joyce,
>          Inc., Vocalize is a trademark of G.W. Micro, ASAP is a trademark
>          of Microtalk, Touch Tone is a trademark of American Telephone
>          and Telegraph, Bookwise is a trademark of Xerox Imaging Systems,
>          Inc., Vert Pro is a trademark of TeleSensory Corporation, Flip-
>          per is a trademark of Omnicron. Trademarks of other products are
>          held by the companies producing them.
>
>          Digital Equipment Corporation
>          Maynard, Massachusetts
>
>                                      CONTENTS
>
>              Preface
>
>             1 Motivation
>
>             2 Description
>
>             3 Architecture
>
>              3.1 Generic Functionality
>
>             4 Command Functionality and Speech Functionality
>
>              4.1 Command Interface
>
>              4.2 Speech Improvements
>
>             5 Communication
>
>             6 Windows[TM] Support
>
>             7 Testing
>
>              7.1 Field Testing
>
>              7.2 Segmental and Prosodic Intelligibility Testing
>
>              7.3 Screen-Access Software Testing
>
>              7.4 Learning Disabilities Software Testing
>
>             8 Commercial Applications Workstations
>
>             9 Telephone-based Response
>
>             10 Conclusion
>
>          Appendix A  SELECTED REFERENCES
>
>          Preface
>
>          Until fairly recently, very high quality speech synthesis has
>          been too expensive for extensive use as assistive technology or
>          for prototyping and developing applications for the multimedia
>          workstation. Furthermore, the functionality inherent in such
>          devices has not been optimized for either type of application.
>          This paper discusses a new generation of synthesizer which
>          offers high quality text-to-speech at reduced cost and with
>          more appropriate functionality for such applications. This
>          is a PC card which is XT/AT compatible, capable of running on
>          machines from the 8088-based XTs to the 80486-class machines
>          operating at high bus speeds. It combines the latest in high
>          quality text-to-speech technology with functionality designed
>          to be used with currently available software such as screen-
>          readers. This functionality includes the ability for immediate
>          stop and start speaking, a large buffering capability, large
>          internal (fixed) and user (modifiable) dictionaries, faster
>          speaking rates, and improved pronunciation capability including
>          the ability to automatically pronounce proper names. Output
>          of stored (digitized) voice files is also possible. The card
>          has no on-board telephonics but could be combined with one of
>          the available telephonics cards to provide for voice-response
>          applications.
>
>          1  Motivation
>
>          Speech I/O technology has long been important for individuals
>          with disabilities. Moreover, with the recent passage of the
>          Americans with Disabilities Act (ADA), speech has become
>
>          The Americans with Disabilities Act P.L. 101-336 became law
>          when it was signed by President George Bush at 10:26 AM on
>          July 26, 1990. This established a "clear and comprehensive
>          prohibition of discrimination on the basis of disability." An
>          earlier version of this paper was presented at the Symposium on
>          Telecommunications, Office Automation, and the Americans with
>          Disabilities Act in Minneapolis, September 1992. I would like to
>          thank Barbara Wise of the Dept. of Psychology at the University
>          of Colorado who articulated much of the section on learning
>          disabilities, and Ken Kuenzel of Kuenzel Software Technology for
>          his observations on Windows[TM].
>
>          Voice input is beneficial for individuals with upper extremity
>          problems such as difficulty moving their arms and hands, and
>          synthesized speech output is useful for individuals with vocal
>          impairments as well as learning disabilities. Furthermore, it
>          is estimated that as many as 12.8 million Americans have some
>          visual limitation and speech technology can be of assistance in
>          reading computer-generated text aloud.
>
>          High quality speech synthesis has traditionally been rele-
>          gated to the commercial sector and has been fairly expensive
>          for individuals within the disabled community. These users have
>          typically had the choice of a low-cost synthesizer with corre-
>          spondingly low intelligibility levels and limited functionality,
>          or a high-end device with a greater range of functionality, high
>          intelligibility and greater naturalness but available only at
>          a relatively high cost. These high-end systems were typically
>          designed for telecom applications rather than for PC-based work-
>          stations and assistive applications which might run on such
>          workstations.
>
>          2  Description
>
>          The synthesis system is an option card and software which runs
>          on MS-DOS[TM] systems and provides synthesized voice output of
>          ASCII text sent to it by other PC software applications. The
>          card can be used in IBM PC/XT/AT [TM] or 100% compatible per-
>          sonal computers and fits in an XT/EISA/ISA 8/16/32 bit bus, full
>          length option slot. Either a 5.25 inch or 3.50 inch disk drive
>          can be used to load the distribution media. The card can run
>          on DOS V3.3 or later. The interface to the synthesizer card for
>          both commands and text is done through a memory-resident DOS
>          driver which is a Terminate and Stay Resident (TSR) program.
>          The synthesizer card comes with an external loudspeaker which
>          has both a volume control and a jack for headphones. A
>          speaker jack can be plugged directly into the board for trav-
>          eling and similar uses. The speech and functionality microcode
>          results in high quality speech synthesis. A wide variety of ap-
>          plications software for screen-reading and learning disabilities
>          is available through third parties (below).
>
>          The synthesizer was designed for persons with visual impair-
>          ments, learning disabilities and severe speech impairments to
>          enable them to have access to speech output when they use a com-
>          puter. For some, the primary benefit is an additional dimension
>          for input to assist them in hearing which keys have been typed
>          and in understanding or "seeing" the screen. For others, the
>          benefit is speech output that might enable them to speak on the
>          telephone or make comments in a vocational environment.
>
>          3  Architecture
>
>          The synthesizer is a phoneme-based formant synthesizer and has
>          a three-level architecture (Bruckert et al. 1983). In Level 1,
>          text from the PC is converted from normal (ASCII) orthographic
>          text into phonemic code. The latter uses a standard phonemic al-
>          phabet in which each symbol is phonetically unambiguous. It also
>          uses an internal dictionary and letter-to-sound rules to perform
>          this conversion. In the second level, the phonemic code is con-
>          verted into synthesizer control parameters. These are continuous
>          variables which control aspects of the speech such as pitch,
>          amplitude, duration and the like for the various voices. In the
>          last level, the control parameters generate a speech waveform.
>          This waveform is converted to an analog speech signal through
>          a D/A converter. In the second and third levels, a synthesizer
>          control command (a set of phonetic parameters) is generated ev-
>          ery 6.4 milliseconds, and the digital signal processor generates
>          a speech waveform value every 100 microseconds. This process
>          generates "frames" of speech which are perceived by a listener
>          to be one continuous, unbroken sequence.
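>
>          The timing figures above imply an output rate of 10,000
>          waveform values per second and 64 waveform values per
>          control frame. A minimal Python sketch of that arithmetic
>          follows; the variable names are illustrative and are not
>          taken from the product.

```python
# Timing figures quoted in the text; everything else is derived arithmetic.
FRAME_PERIOD_US = 6400   # one set of control parameters every 6.4 ms
SAMPLE_PERIOD_US = 100   # one waveform value every 100 microseconds

# Waveform values produced per second (the D/A output rate).
sample_rate_hz = 1_000_000 // SAMPLE_PERIOD_US

# Waveform values generated from each set of control parameters.
samples_per_frame = FRAME_PERIOD_US // SAMPLE_PERIOD_US

# Control-parameter frames produced per second.
frames_per_second = 1_000_000 / FRAME_PERIOD_US

print(sample_rate_hz, samples_per_frame, frames_per_second)
```

>          Each 6.4 ms frame therefore spans 64 output samples, which
>          is short enough for successive frames to be heard as one
>          continuous signal.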
>
>          3.1  Generic Functionality
>
>          Because many potential users of speech synthesizers have visual
>          impairments, accessibility and ease of use were key design
>          factors. A sizable and representative population of users and
>          developers with visual impairments was asked to
>          provide input for the product. A Getting Started Card in Braille
>          assists the blind user by listing the contents of the package.
>          Installation is made easier for the unsighted person by pro-
>          viding instructions in three formats: (a) a hardcopy form for
>          reference by a sighted person; (b) a softcopy form in ASCII
>          which can be read to the user by the synthesizer itself after
>          a screen-reader or similar software is installed, and (c) an
>          audio tape version of the installation procedure which can be
>          used with any standard audio cassette recorder. Furthermore,
>          the ASCII form can be output to a Braille printer for those
>          individuals who can read Braille.
>
>          To further assist visually-impaired users, the manual contains
>          detailed descriptions of the locations of switches and the size
>          and shape of certain parts of the board. The board is installed
>          using standard PC-card installation procedures. The software
>          installation is initiated by simply inserting a diskette and
>          typing Install. For unsighted users, a series of tones is used
>          in the installation procedure. A falling tone is an indication
>          to change or remove the diskette; a rising tone is an indication
>          to press Enter. Non-default or special installations such as
>          those which place files in non-default directories may require
>          a sighted assistant. When the installation is completed, a
>          verbal message is spoken by the synthesizer so that the user
>          will know the board is operating correctly. This startup message
>          is customizable and easily modified. Many unsighted users have
>          successfully installed the software without difficulty.
>
>          The original implementation of the system contained a BIOS ROM
>          which handled such items as IRQ and BIOS address settings. In a
>          later implementation, the BIOS ROM was removed, greatly simplifying
>          setup: only I/O addresses need to be set. In the event
>          that there are conflicts with other boards already installed
>          in the PC, an automatic configuration utility is provided so
>          that conflicts are automatically analyzed and recommendations
>          are made for changes in switch settings. In the current version
>          with the BIOS ROM removed, there are only 4 possible switch set-
>          tings. In case of a conflict, a sighted assistant is necessary
>          since the screen cannot be read by special software until the
>          synthesizer is properly configured.
>
>          Simple in-line commands were created to allow the user to con-
>          trol both speech and functionality. The PC card is soft-loadable
>          which provides for more flexibility in future upgrades and modi-
>          fications of the existing software.
>
>          4  Command Functionality and Speech Functionality
>
>          4.1  Command Interface
>
>          The synthesizer card provides commands for a wide range of func-
>          tions. There are commands for modifying characteristics (i.e.
>          the phonetic parameters) of a particular voice and increasing
>          or decreasing the speaking rate (above). For developers who
>          need to work with customized voices or speech researchers who
>          are attempting to analyze and better understand human speech,
>          the command set has the ability to set a wide variety of speech
>          parameters from pitch range (for a greater excitement level)
>          and head size (for a deeper voice) to formant frequencies and
>          bandwidths. There are 28 acoustic parameters including formant
>          frequencies, bandwidths and gains which can be modified and
>          set. One of the goals for the future is to develop an interface
>          which would allow for more facile and flexible methods of ma-
>          nipulating prosodic features, perhaps via a joystick or similar
>          device. Therefore, speech researchers as well as disabled users
>          (especially vocally-impaired individuals) will be able to work
>          more effectively with speech parameters to create a more natural
>          output.
>
>          Since ambiguities exist in a variety of text outside of normal
>          orthography (below), a number of settable modes are provided.
>          These include a math mode in which numeric text is inter-
>          preted as mathematical notation, a spell mode in which words
>          are spelled rather than pronounced, a European mode in which
>          certain numeric sequences are interpreted in a European (vs. US)
>          style, and a name mode in which selected words are treated as
>          proper names (below).
>
>          One mode which was added at the request of users of screen-reading
>          software was a citation mode. This requires a "citation"
>          pronunciation of a word in certain circumstances. For
>          example, if a visually-impaired person using a screen-reader is
>          in word mode (i.e. each word is pronounced separately), certain
>          words should be pronounced as they would be in isolation. Some
>          of the higher quality synthesizers are tuned to natural speech
>          within discourse and take into account linguistic phenomena
>          such as vowel reduction, consonant cluster simplification, sub-
>          phonemic attenuation and morphophonemic alternation. However,
>          this tends to reduce intelligibility and distract the listener
>          when such words are pronounced in isolation. For example, the
>          word "to" exhibits different phonetic behavior when occurring in
>          different environments such as before a consonant, a vowel or
>          silence. If the context is / __ # [+cons], for example, the vowel
>          is acoustically quite different than if it is in / __ # [+voc]
>          (i.e. similar to the morphophonemic alternation of the definite
>          article in similar contexts). Therefore, a citation mode was
>          created such that a set of words would always be pronounced as
>          if they occurred in isolation. Naturally, retention of this mode
>          in running discourse would have a detrimental effect on natu-
>          ralness for the same reason. Further development is underway to
>          make this mode automatic, i.e. to anticipate when a user wants
>          words pronounced in isolation.
>
>          Punctuation can be processed in a number of different ways,
>          depending upon the application. By default, only punctuation
>          that is not clause-final or phrase-final is spoken. However,
>          two additional modes were created whereby all or no punctuation
>          is spoken. Applications which need to read special text (such
>          as a high-level computer language) can therefore turn all
>          punctuation on. Other applications, such as scanning of
>          unrestricted text, may turn punctuation completely off.
>
>          Commands allow the application to terminate speech immediately
>          instead of waiting for the buffered text to complete. The com-
>          mand interface also allows for the resumption of speaking where
>          the text left off, or the flushing of text and immediate pro-
>          cessing of new text.
>
>          There are modes for letter-by-letter and word-by-word pronuncia-
>          tion as well as full clause (normal) pronunciation such that the
>          synthesizer is able to immediately speak single characters with-
>          out waiting for an entire clause to be buffered. This is useful
>          in applications requiring auditory feedback for what was typed
>          on the keyboard and is one of the most widely used features by
>          unsighted individuals. The software also provides normal clause
>          buffering for highly natural speech. This mode, whereby characters
>          are spoken immediately, necessitated a restructuring of
>          certain modules since prosodic contours are typically based on
>          clause-level strings to attain greater naturalness. This was
>          done through the creation of dedicated tables containing all of
>          the characters which could be produced on a typical keyboard.
>          This then allows the unsighted user immediate aural feedback
>          of all characters including those which are not echoed such as
>          spacebar and backspace. Commands therefore simplify the interface
>          to screen-readers for the processing of letters, words,
>          phrases, clauses, paragraphs and whole documents. The input and
>          output buffers are each 4 Kbytes.
>
>          Volume control is settable both in hardware and in software.
>          There is a volume control on the external loudspeaker (above)
>          but volume is also settable in software by a command sequence
>          through the standard command interface or directly through the
>          TSR. The software control was added since various manufacturers
>          of screen-access software need to be able to easily manipulate
>          various aspects of speech such as volume.
>
>          Commands also exist to generate tones (e.g., for margin bell,
>          alert etc.) in addition to speech sounds. A developer still has
>          the flexibility of modifying acoustic parameters such as pitch,
>          duration and the like to create different voice qualities. Be-
>          cause of this ability to modify pitch and duration, vocal music
>          can be synthesized as well. Pitch was tuned to the musical (A
>          = 440 Hz) rather than the physical (A = 430.4 Hz) scale since a
>          common application especially among vocally-impaired youngsters
>          is to use the synthesizer as lyrical accompaniment to a musical
>          instrument.
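>
>          As an illustration of what tuning to A = 440 Hz means for
>          sung notes, the sketch below computes note frequencies
>          relative to that reference. It assumes 12-tone equal
>          temperament, which the text does not specify, and the
>          function name is hypothetical.

```python
A4_HZ = 440.0  # reference pitch stated in the text


def note_frequency(semitones_from_a4: int) -> float:
    """Frequency of the note a given number of semitones from A4.

    Assumes 12-tone equal temperament (an assumption, not a detail
    given in the paper): each semitone multiplies the frequency by
    the twelfth root of two.
    """
    return A4_HZ * 2.0 ** (semitones_from_a4 / 12.0)


# A4 itself, the A one octave higher, and middle C (9 semitones below A4).
for offset in (0, 12, -9):
    print(offset, round(note_frequency(offset), 2))
```

>          An application that sets pitch from a table of this kind
>          stays in tune with an instrument tuned to the same
>          reference.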
>
>          4.2  Speech Improvements
>
>          Intelligibility and naturalness retain their high priority.
>          Word pronunciation is extremely accurate. Normal words such as
>          the common nouns found in a hardcopy dictionary are rarely mis-
>          pronounced. There have been modifications in letter-to-phoneme
>          rules and a more sophisticated morpheme decomposition algorithm
>          has been added. Rules have also been added for proper names,
>          medical terms and other subsets of the English lexicon which
>          would require extensive memory if a database (i.e. dictionary)
>          were exclusively used.
>
>          The synthesizer contains a large built-in dictionary which assists
>          both the pronunciation of individual words and the rhythmic
>          naturalness of the speech. This fixed (non-accessible) dictionary is
>          many times larger than previous versions of the same synthesizer
>          and consists of complex lexical entries with a wider variety of
>          syntactic and semantic information. This is used to feed other
>          modules in order to increase naturalness. This has allowed for
>          improvements such as automatic homograph handling (below) and
>          will, in the future, assist in more natural pausing, contextual
>          timing (e.g., phrase-final lengthening), higher intelligibility
>          at fast speaking speeds, and generally more natural prosodics.
>          Such fixed dictionaries are inaccessible to the user although a
>          user dictionary is accessible and modifiable (below).
>
>          Heuristic rules have been refined and are now more intelligent.
>          In addition to normal number processing for dates, fractions and
>          the like, unpronounceable sequences such as initialisms (e.g.,
>          FBI, IRQ, EEC) are also handled automatically.
>
>          A user (modifiable) dictionary can be used to load application-
>          specific words, DOS-specific terms, and the like. This is also
>          much larger than those of earlier versions although the size
>          is somewhat variable and depends upon what other software is
>          resident on the board. However, the usable space may be as high
>          as 350 Kbytes and is sufficient to load thousands of words.
>          Because of the large dictionary, developers can now input many
>          keyboard key names and commonly used DOS and PC application
>          words and commands as well as developing their own application-
>          specific lexicon.
>
>          Speech rate runs from a slow speed of 100 wpm to an upper rate
>          of approximately 720 wpm. Very rapid speaking speeds are useful
>          for applications where scanning large bodies of text is nec-
>          essary. The rate of speech for a normal speaker of English
>          is roughly 160-220 words per minute. Although human speech
>          has been timed at over 550 wpm, there are physiological con-
>          straints which reduce intelligibility dramatically in very rapid
>          speech. Furthermore, when speech at these rapid rates is slowed
>          down to more normal rates, intelligibility is comparatively much
>          lower. Informal observations suggest high quality synthesized
>          speech may be more comprehensible at these high speeds than hu-
>          man speech perhaps due to the fact that with redundancy as a
>          constant there is an upper limit to the speed of movement of
>          human articulators. When speech at these high speeds is slowed
>          down, intelligibility is lower when compared to normal
>
>          A similar result occurs when a read passage at normal speed
>          is sped up or slowed down due to a consequent frequency shift.
>          Foulke (1969) has claimed that intelligibility declines rapidly
>          at speeds in excess of 275 wpm due, in part, to limitations
>          in short-term memory. It is also interesting to note that this
>          threshold may be language-dependent. In any event, many indi-
>          viduals with visual impairments have requested faster scanning
>          rates in synthesizers in order to more efficiently scan lengthy
>          documents while searching for particular information.
>
>          While speaking speed can run in excess of 450-500 words per
>          minute, many sighted people have the ability to scan text at
>          speeds in excess of 3000 wpm. Visually-impaired individuals need
>          the ability to scan text aurally at speeds which exceed those of
>          normal speaking (approximately 180-200 wpm) but clearly not as
>          fast as visual scanning speeds. More efficient algorithms are
>          being tested to maintain high intelligibility at these speeds by
>          automatically adjusting duration and pausing as well as reducing
>          morphosyntactic complexity and utilizing periodic segmentation.
>
>          Another problem which was addressed is the more accurate pro-
>          nunciation of proper nouns such as first names, last names and
>          street names (Church 1986; Spiegel 1985; Vitale 1989, 1991b
>          and others). Names commonly found in a telephone book are es-
>          sentially loanwords from various languages following different
>          phonotactic rules. It is no longer reasonable for a high quality
>          synthesis device to exclude this large subset of words from be-
>          ing properly handled. Therefore, rules were added which handled
>          such names with a higher degree of accuracy and greater level
>          of intelligence. The algorithms underlying this functionality
>          were originally developed for large commercial telecom applica-
>          tions such as reverse directory assistance (i.e. number-to-name)
>          but have now been modified to run in real-time on the PC card
>          (Vitale 1989, 1991b). Name pronunciation can be run in two dif-
>          ferent modes: (a) process the very next word as a proper name,
>          or (b) process all uppercase non-sentence-initial words as
>          proper names.
>
>          Homographs, forms which are spelled alike but pronounced differ-
>          ently (from Gk. homo 'same' and graph 'writing'), are handled
>          automatically in most cases and this is accomplished through
>          a relatively simple morphosyntactic scan. There are over two
>          hundred pairs of these words in English and it is a linguistic
>          phenomenon found in many other languages as diverse as French
>          and Japanese. Examples of such words in English are record, permit,
>          attribute, deliberate, bass, and many others which can
>          have two different pronunciations depending upon how the word
>          is used in context. Therefore, to handle sentences such as "They
>          refused the refuse", human intervention (manual phonemicization)
>          is typically needed. Some homographs are quite common and their
>          mispronunciation has been a constant source of distraction to
>          synthesizer users in the blind community. Since text-to-phoneme
>          algorithms are somewhat simplistic and simply convert a spelled
>          form to a sequence of sound symbols, homographs are clearly
>          troublesome.
>
>          In practice, of course, the homograph problem has rarely re-
>          sulted in intelligibility problems except perhaps in telephone-
>          based applications. However, it is simply one more issue which,
>          when handled automatically, contributes to the overall natu-
>          ralness of speech synthesis systems. A fuller discussion of the
>          homograph problem is beyond the scope of this paper. However,
>          the problem is more pervasive than is sometimes believed and can
>          include function words (e.g., can, just), proper names (e.g.,
>          Robert, Guy), and other lexical categories. Clearly, homographs
>          belonging to the same form class such as proper names (above),
>          and even simple verbs (e.g., read, tear) are much more difficult
>          to handle since a simple parse is insufficient to disambiguate
>          them.
>
>          Finally, the card has the capability of outputting digitized
>          speech in addition to synthesized speech. It does not have
>          the capability for the user to input voice and digitize the
>          utterance. However, a voice segment which has already been
>          recorded, digitized, and stored on the PC can be played back
>          through the card. Voice files can be created using one of the
>          digitizing boards currently on the market. The playback is done
>          at 10 kHz. A special command can be used to synchronize digitized
>          data with the text stream.
>
>          5  Communication
>
>          There are a number of ways to communicate directly with the
>          board. It can be treated as a standard ASCII device connected
>          to a serial port or a parallel printer port and thus simple
>          copy or print commands may be issued (e.g., print (filename)
>          lptx or copy (filename) comx). Communication may also be
>          accomplished directly with the TSR. Both types of functionality
>          can be present simultaneously so it is possible to utilize both
>          paths at the same time. Synchronization of commands and data, if
>          both paths are utilized, is the responsibility of the application.
>          Also, many of the command strings such as change rate,
>          change voice, start speaking, stop speaking, index, index reply,
>          and the like can be shortened to the smallest unambiguous
>          substring. These mnemonic commands are more user-friendly than
>          the escape sequences of the past. Most commands also contain
>          a number of parameters for finer granularity of control. For
>          example, to generate an error message, the user has the option
>          of generating the message as a text string, an escape sequence,
>          a tone, or even a synthesized voice message in any pre-selected
>          voice or speaking rate. Tones and voice messages were added for
>          unsighted individuals.
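>
>          Shortening a command to its smallest unambiguous substring
>          can be sketched as a prefix match against the command set.
>          The keywords below are hypothetical stand-ins named after
>          the functions listed above, not the card's actual command
>          strings, and the function names are likewise illustrative.

```python
from typing import Optional, Sequence

# Hypothetical keywords; the card's real command strings are not given here.
COMMANDS = ("rate", "voice", "start", "stop", "index", "error")


def resolve(abbrev: str, commands: Sequence[str] = COMMANDS) -> Optional[str]:
    """Return the one command that begins with abbrev, or None.

    None is returned when the abbreviation is empty, matches nothing,
    or matches more than one command (i.e. it is ambiguous).
    """
    if not abbrev:
        return None
    matches = [c for c in commands if c.startswith(abbrev.lower())]
    return matches[0] if len(matches) == 1 else None


def shortest_unambiguous(command: str,
                         commands: Sequence[str] = COMMANDS) -> str:
    """Return the shortest prefix of command that resolves only to it."""
    for length in range(1, len(command) + 1):
        if resolve(command[:length], commands) == command:
            return command[:length]
    return command  # no shorter unambiguous form exists


print(resolve("r"), resolve("st"), shortest_unambiguous("start"))
```

>          With this keyword set, "st" is rejected because it could
>          mean either start or stop, while "sta" and "sto" are
>          accepted.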
>
>          Other commands control the processing of phonemic text. Text-
>          to-speech devices still require a utility whereby a user can
>          create a specialized lexical entry or unusual word such as
>          application-specific jargon which can be phonemicized within
>          the text itself rather than as a dictionary entry. When set,
>          this command allows everything within square brackets to be
>          interpreted as phonemic text, by defining the characters "[" and
>          "]" as phoneme delimiters. Thus, all text and characters which
>          appear between square brackets will be interpreted as phonemic
>          text and will be pronounced as such. These commands can be used
>          in-line within normal orthographic text. Also, in a previous
>          implementation, square brackets were nested (i.e., requiring
>          an equal number of opening and closing brackets), which
>          occasionally presented problems in extricating text from
>          phonemic mode. In the system described here, a single closing
>          square bracket is sufficient to close phonemic mode.
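>
>          The non-nested bracket behavior described above can be
>          sketched in Python as follows. This is only an illustration
>          of the delimiter logic, not the synthesizer's parser; the
>          function name is hypothetical, and treating a "[" that occurs
>          inside phonemic mode as an ordinary character is an
>          assumption.
>
```python
def split_phonemic(text):
    """Split text into ("text", s) and ("phonemic", s) segments.

    "[" switches to phonemic mode and the first "]" switches back;
    no nesting depth is tracked, so one "]" always exits.
    """
    segments = []
    mode = "text"
    buf = []

    def flush():
        # Emit the buffered characters under the current mode.
        if buf:
            segments.append((mode, "".join(buf)))
            buf.clear()

    for ch in text:
        if ch == "[" and mode == "text":
            flush()
            mode = "phonemic"
        elif ch == "]" and mode == "phonemic":
            flush()
            mode = "text"
        else:
            # Assumption: any other bracket is kept as a plain character.
            buf.append(ch)
    flush()
    return segments
```
>
>          With a placeholder string, split_phonemic("say [xyz] now")
>          separates the bracketed span from the surrounding text, and
>          an input such as "a [[b] c" still leaves phonemic mode at
>          the first closing bracket.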
>
>          While there are no on-board telephonics, the PC card can
>          generate Dual Tone Multi-Frequency (DTMF) tones
>          (Touch-Tones[TM]) for 0-9, *, #, "," (pause) and A, B, C, D
>          (for handsets which contain these). The comma can be used to
>          generate a 2-second pause for applications where an intertonal
>          pause is required for dialout.[3]
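>
>          As an illustration of the tone set listed above, the sketch
>          below builds the standard DTMF frequency table in Python and
>          turns a dial string into a sequence of tone steps. The
>          row and column frequencies are the standard DTMF values; the
>          function name, the step format, and the 0.1-second tone
>          length are assumptions, and only the 2-second comma pause
>          comes from the text.
>
```python
# Standard DTMF keypad: each key is the sum of one row (low) tone
# and one column (high) tone, with frequencies in Hz.
ROWS = (697, 770, 852, 941)
COLS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

DTMF = {
    key: (ROWS[r], COLS[c])
    for r, row in enumerate(KEYPAD)
    for c, key in enumerate(row)
}

PAUSE_SECONDS = 2.0  # the comma pause described in the text
TONE_SECONDS = 0.1   # assumed tone length; not given in the text


def dial_plan(digits):
    """Translate a dial string into (low_hz, high_hz, seconds) steps.

    A comma becomes a silent step, written as (0, 0, PAUSE_SECONDS).
    """
    plan = []
    for ch in digits.upper():
        if ch == ",":
            plan.append((0, 0, PAUSE_SECONDS))
        elif ch in DTMF:
            low, high = DTMF[ch]
            plan.append((low, high, TONE_SECONDS))
        else:
            raise ValueError(f"not a DTMF character: {ch!r}")
    return plan
```
>
>          For example, dial_plan("9,1") yields the tone pair for 9, a
>          2-second silence, and then the tone pair for 1.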
>
>          Other commands can be used to generate sounds of different fre-
>          quencies and lengths depending upon the parameters set (above).
>          This allows for a wide variety of sounds for purposes such as
>          notification, warning, and so on. In screen-reading
>          applications, regular tones can also serve other purposes,
>          such as a margin bell, which is useful for someone who wishes
>          to work in a quiet environment without using the speaker on
>          the PC.
>
>          Verbal feedback can also be given on incorrectly entered command
>          strings. For example, the command [:error ...] can set the
>          error mode for the module to be, e.g., visual, verbal, or a
>          tone warning (above). This command is useful for debugging in
>          an application development setting and is especially useful for
>          unsighted individuals.
>
>          6  Windows[TM] Support
>
>          With the recent interest in the graphical user interface (GUI)
>          there has been renewed interest in as well as concern about the
>          interface for unsighted individuals. A number of large corpo-
>          rations have been investigating the use of GUIs with respect
>          to the Americans with Disabilities Act. In Europe, a simi-
>          lar law for individuals with disabilities requires ergonomic
>          considerations in the design of GUI software.
>
>          Third party developers can be supported via a private Dynamic
>          Link Library (DLL) that provides functionality similar to that
>          in the DOS TSR/driver. The DLL also supports a standard
>          configuration interface dialog, standard parameter setting
>          dialogs, and default user initialization files. Users of
>          standard Windows programs will be able to access the DLL via
>          an installable driver that can be configured as the default
>          printer, allowing any Windows software package that prints to
>          output to the module.
>
>          [3] Tone dialing through a handset microphone is usually not
>          very reliable.
>
>          7  Testing
>
>          7.1  Field Testing
>
>          Extensive field testing was conducted throughout a widespread
>          geographical area over a period of 3 months with individuals who
>          were visually-impaired and learning-disabled and using software
>          programs developed for reading screens (below). A number of
>          software developers who participated in these tests also had
>          some disability, often blindness or severe visual impairment.
>          The feedback allowed for a number of essential modifications to
>          be made to both hardware and software.
>
>          Early testing determined that it was difficult for an unsighted
>          person to effectively feel the ON/OFF position of the DIP-
>          switches as these were on a corner of the original switch pack.
>          The part was eventually changed so that the switches were in
>          a more accessible position and the ON/OFF status could be more
>          easily determined. In addition, an installation procedure was
>          developed so that unsighted individuals could install the board
>          and software without the assistance of a sighted person (above).
>          A series of tones is used in this procedure as an indication to
>          change or remove the diskette or to press Enter.
>
>          7.2  Segmental and Prosodic Intelligibility Testing
>
>          Testing was continued in areas of the synthesizer relating
>          to intelligibility and naturalness in both the segmental and
>          prosodic domain. For example, investigation has begun on which
>          levels of structure contribute most to naturalness (e.g., glot-
>          tal pulse, segmental phonetic, prosodic, etc.) and results of
>          such analysis should yield interesting clues about how listen-
>          ers perceive naturalness. Hopefully, this will ultimately lead
>          to a better definition of naturalness as well as allow
>          modifications to values and rules which relate to factors such
>          as timing, F0 and the like. Segmental durations typically have a
>          more limited effect on intelligibility although clearly tar-
>          gets have to be hit. If the onset of voicing were to begin
>          too early in a segment, a voiceless stop might be perceived
>          as voiced thus creating a phonological (and hence intelligi-
>          bility) problem. Cumulatively, even sub-phonemic inaccuracies
>          contribute perceptually to an overall lack of naturalness. On
>          the prosodic level, improper stress on many noun compounds, lack
>          of pauses for breath groups and sense groups, over-generalized
>          question intonation and the like, all have a negative effect on
>          naturalness.
>
>          7.3  Screen-Access Software Testing
>
>          The synthesizer card works with many of the available
>          screen-access software packages. These packages include
>          JAWS[TM] (Job Access with Speech - Henter-Joyce),
>          Vocal-Eyes[TM] (G.W. Micro), ASAP[TM] (Automatic Screen Access
>          Program - MicroTalk), Vert Pro[TM] (TeleSensory Corporation),
>          Flipper[TM] (Omnicron), Screen Reader[TM] (IBM) and a number
>          of others.
>
>          Screen-access software or screen-readers are software packages
>          which are designed to allow visually-impaired individuals to
>          navigate through screen displays of an application by
>          intelligent processing of text in conjunction with the
>          functionality of a speech synthesizer. Screen-readers offer
>          such functions as a typing mode for auditory feedback (above),
>          pitch changes to determine the position of uppercase letters,
>          different voices for keyboard entry and screen changes, and
>          the ability to navigate around the screen by auditory feedback
>          of cursor position.
>
>          Some educational applications for screen access programs
>          (including both learning-disability and general pedagogical
>          applications) require special functionality to pronounce words
>          by syllable. While syllabic segmentation on phonetic form is a
>          prerequisite for accurate lexical stress assignment within the
>          architecture of a speech synthesizer and is done internally,
>          the syllable-by-syllable pronunciation of words more properly
>          belongs in the application where different theories of
>          syllabification can be tested and implemented. Creative ways
>          have been found to syllabify from an orthographic rather than
>          a phonetic form. Examples of this are Bookwise[TM] from Xerox
>          Imaging Systems, and Reading with ROSS, a program for children
>          with learning disabilities from the University of Colorado.
>
>          7.4  Learning Disabilities Software Testing
>
>          Another benefit of text-to-speech is in the area of learning
>          disabilities. The device was also tested at a number of sites
>          which utilized software for children with reading and spelling
>          problems. Children with severe reading problems, or dyslexia,
>          typically have inherited problems in certain language processes,
>          especially phonemic awareness (Olson et al. 1989). Such chil-
>          dren have difficulty hearing the order of sounds (phonotactic
>          distinctions) within a syllable which makes it difficult to map
>          speech onto print and which underlies the problems in reading
>          and spelling. The Reading with ROSS program at the University
>          of Colorado has helped children improve their skills in word
>          recognition, phonology, and attitude about reading, by providing
>          speech support for any word that a child finds difficult while
>          reading a story on the computer (Olson and Wise, 1992).
>
>          A program with speech support for given words works well with
>          synthetic speech, but could be accomplished almost as well with
>          current versions of digitized speech. However, other programs
>          have been developed or are in the process of development that
>          can only be implemented with the text-to-speech capabilities
>          of synthesized speech. For example, the Spello program, also
>          from the University of Colorado, was designed to improve the low
>          spelling skills of reading-disabled children and, more
>          importantly, to remediate the underlying difficulties in
>          manipulating sounds and linking sounds to print (Wise and
>          Olson, 1992). With
>          Spello, children make an attempt at spelling a word that the
>          computer pronounces. The synthesizer can pronounce not only the
>          "target" correct word, but also every attempt that the child
>          makes, so long as it contains a vowel. The Spello program thus
>          gives unique phonological feedback that would be impossible with
>          digitized speech, because obviously all possible versions of a
>          word a child might attempt cannot be recorded. For example, a
>          child may discover that cowk or chack does not sound much like
>          chalk, but that chawk and chalk sound the same. The program
>          would then give the children spelling feedback about which let-
>          ters are in the correct word, and which are in the right place.
>          Children spent significantly more time exploring how to spell
>          the words when provided with speech feedback for errors than
>          when provided with feedback for only the correct word (as
>          digitized speech could do). Children also appeared to make
>          significant gains in their ability to read related nonsense
>          words after a week of training.
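>
>          The kind of spelling feedback described above (which letters
>          are in the correct word, and which are in the right place)
>          can be sketched in Python. This is not the Spello program's
>          actual algorithm, which is not described here; the function
>          names, the status labels, the handling of repeated letters,
>          and the treatment of "y" as a vowel are all assumptions.
>
```python
from collections import Counter

VOWELS = set("aeiouy")  # treating y as a vowel is an assumption


def pronounceable(attempt):
    """The text says an attempt can be spoken if it contains a vowel."""
    return any(ch in VOWELS for ch in attempt.lower())


def letter_feedback(attempt, target):
    """Mark each letter of `attempt` against `target`.

    Returns a list of (letter, status) pairs, where status is
    "right place", "in word", or "not in word".
    """
    attempt, target = attempt.lower(), target.lower()
    status = [None] * len(attempt)
    # Target letters not matched in place stay available for
    # "in word" matches; counting them avoids over-crediting repeats.
    spare = Counter()
    for i, ch in enumerate(target):
        if i < len(attempt) and attempt[i] == ch:
            status[i] = "right place"
        else:
            spare[ch] += 1
    for i, ch in enumerate(attempt):
        if status[i] is None:
            if spare[ch] > 0:
                spare[ch] -= 1
                status[i] = "in word"
            else:
                status[i] = "not in word"
    return list(zip(attempt, status))
```
>
>          For the attempt chawk against the target chalk, this sketch
>          marks c, h, a and k as being in the right place and w as not
>          in the word.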
>
>          Another program will allow children to explore and correct
>          errors they have made in reading and can be simplified with
>          the use of synthesized speech. Nonsense spellings such as skable
>          and plumble would be segmented ska/ble and plum/ble to rhyme
>          with stable and mumble. Similarly, has/ket and pa/sin
>
>          In commercial applications, it could be argued that such cases
>          need not be handled. In an assistive technology setting, how-
>          ever, every attempt had to be made to incorporate these even
>          though they do not properly constitute normal English ortho-
>          graphic form.
>
>          Initially, the more sophisticated letter-to-sound and morpheme-
>          decomposition algorithms resulted in a number of errors on
>          these non-standard spellings since phonological, phonotactic
>          and morphophonemic rules were optimized to the target language
>          (English) and such nonsense words fall outside the scope of the
>          language. Nevertheless, modifications were made on letter-to-
>          sound and morpheme decomposition rules to allow for this. Such
>          modifications turn out to be positive since it can be argued
>          that a good synthesis architecture should attempt to emulate the
>          performance of a human in its errors as well as its accuracy.
>
>          8  Commercial Applications Workstations
>
>          Many workstations are now PC-based and therefore the PC-based
>          synthesizer will be expected to play a prominent role in speech
>          output in such workstations in the next few years. One example
>          of a workstation application which utilized the synthesizer
>          card in field tests for its synthesized voice output is a medical
>          workstation which combines graphics, full-motion video, and
>          speech I/O (Grams 1992). It contains large and comprehensive
>          medical dictionaries, the phonetics of which were translated to
>          those of the synthesizer so that any item in the dictionary can
>          be accurately pronounced. Its database contains 8000 images,
>          drug databases (including drug-drug interactions), the knowledge
>          equivalent of 12 textbooks, high resolution color displays,
>          synthesized speech, digitized speech and speech recognition.
>          A researcher or physician is able to reference abstracts from
>          over 1000 journals within four weeks of publication making modem
>          access unnecessary and allowing the workstation to function more
>          as a stand-alone system. This workstation or similar ones will
>          hopefully be in use in hospitals and doctors' offices around the
>          country in the not-too-distant future. It is a small intellectual
>          leap from this generic workstation to one which could be used by
>          physically-challenged individuals in universities, libraries,
>          clinics, schools, offices and the home.
>
>          9  Telephone-based Response
>
>          The PC card has no on-board telephonics. However, fairly so-
>          phisticated telephonics boards for the PC are now available and
>          these typically offer stored voice output capabilities and han-
>          dle a number of telephone ports. These telephonics boards and
>          the synthesizer card could be combined into a packaged system to
>          handle telephone-based voice messaging and voice response ap-
>          plications such as E-mail access, reverse directory assistance,
>          and database inquiry such as the Intelligent Newspaper for the
>          Blind, financial and medical applications, and others. The soft-
>          ware allows accommodation of up to four synthesizer cards in one
>          PC.
>
>          A number of new and exciting applications are being written for
>          the disabled community. For example, the Intelligent Newspaper,
>          similar to the Talking Yellow Pages, will use speech synthe-
>          sis (and DTMF tones or speech recognition) to allow access to
>          any information contained in the daily paper: sports, weather,
>          news, etc. Newspapers will also be on-line at many public li-
>          braries and a greater variety of voice response services will be
>          available to assist people with limited mobility.
>
>          Another interesting telephone-based application is the automa-
>          tion of relay services for hearing-impaired individuals. Some
>          of the current research focuses on automatic speech recogni-
>          tion (ASR) (Kanevsky et al. 1991) while others, understanding
>          the limitations of large vocabulary ASR over the telephone, are
>          concentrating on how speech synthesis, combined with spelling
>          correction algorithms, can improve such services by ensuring
>          privacy and eliminating the need for operator-assisted calls
>          (Kukich 1992).
>
>          10  Conclusion
>
>          A great deal of assistive voice technology has been generated
>          from commercial applications. Major progress is being made
>          because voice is a technology which supplements input devices
>          such as the keyboard and mouse, which helps compensate for
>          the sensory overload on our vision and limitations in manual
>          dexterity, and which spans the distance gap via telephone lines.
>          The logical connection here is that the non-disabled individual
>          needs speech technology for essentially the same reason as the
>          disabled individual: the need to work more expediently and
>          play more enjoyably. Non-disabled individuals simply have a
>          different point d'appui. While the priorities of activities
>          within speech I/O technology necessarily change when discussing
>          its use as assistive technology rather than within the context
>          of telecommunications or office automation, many of the goals
>          remain the same and a great deal of the groundbreaking work done
>          in assistive technology is useful for the commercial markets as
>          well.
>
>          Conversely, the commercial applications such as those for
>          telecommunications and office automation will spawn seminal
>          research and development which will benefit assistive devices.
>          The development of any of these applications means greater ac-
>          cessibility for the visually or vocally-impaired individual. The
>          technology is here to do much of what the applications will soon
>          call for. The multi-media workstation now combines graphics,
>          image, and voice input as well as output, e.g., voice-activated
>          window systems, verbal feedback on input commands, verbal feed-
>          back within windows, the integration of the telephone into the
>          workstation, and the like. It is the user interface and the ap-
>          plications which still need to be determined. But it is a small
>          intellectual leap from this generic workstation to one which
>          could be used by physically-challenged and speech, language and
>          hearing-impaired individuals in universities, libraries, clinics
>          and the home.
>
>          The recent passage of the Americans with Disabilities Act has
>          mandated that individuals with disabilities must be allowed ac-
>          cess to employment, public services (including transportation),
>          public accommodations, and telecommunications. In the next few
>          years, voice output systems will be installed at airports as
>          well as in train and bus stations for directional assistance.
>          There is even a movement now underway in some cities to install
>          simple voice systems on buses and perhaps later at traffic sig-
>          nals to provide location and status messages. Through this all,
>          speech technology is and will remain a seminal solution for the
>          challenges faced by individuals with disabilities.
>
>                                     APPENDIX  A
>
>                                 SELECTED REFERENCES
>
>          Arons, B., 1992. A Review of Time-Compressed Speech. In
>          Proceedings of the American Voice Input/Output Society.
>
>          Bruckert, E., Minow, M., and Tetschner, W., 1983. Three-Tiered
>          Software and VLSI Aid Developmental System to Read Text Aloud.
>          Electronics. April 23.
>
>          Church, K. W., 1986. Stress Assignment in Letter to Sound Rules
>          for English. In Proceedings, IEEE International Conference on
>          Acoustics, Speech and Signal Processing 4:2423-2426.
>
>          Foulke, W., and Sticht, T. G., 1969. Review of Research on the
>          Intelligibility and Comprehension of Accelerated Speech.
>          Psychological Bulletin 72.56-62.
>
>          Grams, R., 1992. A Physician's Workstation Designed for NASA and
>          Earth-Based Applications. Journal of Medical Systems.
>
>          Kanevsky, D., Danis, C., Daggett, G., Epstein, E.,
>          Gopalakrishnan, P., Nahamoo, D., 1991. Prospects of Automatic Speech
>          Recognition and Relay Se Proceedings of RESNA. July, Kansas
>          City.
>
>          Klatt, D. H., 1987. Review of Text to Speech Conversion for
>          English. Journal of the Acoustical Society of America, Vol.
>          82(3):737-793.
>
>          Kukich, K., 1992. Spelling Correction for the Telecommunications
>          Network for the Deaf. Communications of the ACM 35.5, pp. 80-90.
>
>          Lazzaro, J. J., 1990. Opening Doors for the Disabled. Byte,
>          August, pp. 258-268.
>
>          Olson, R.K., Wise, B.W., Conners, F., Rack, J., and Fulker, D.,
>          1989. Specific Deficits in Component Reading Processes: Genetic
>          and Environmental Influences. Journal of Learning Disabilities,
>          22, 339-348.
>
>          Olson, R.K. and Wise, B.W., 1992. Reading on the Computer with
>          Orthographic and Speech Feedback: An Overview of the Colorado
>          Remediation Project. Reading and Writing, 4, 107-144.
>
>          Spiegel, M., 1985. Pronouncing Surnames Automatically. In
>          Proceedings of the American Voice Input/Output Society, 109-132.
>
>          Vitale, A.J., 1989. Application-Driven Technology: Automated
>          Customer Name and Address. Proceedings of the American Voice
>          Input/Output Society. October, Newport Beach, California.
>
>          Vitale, A.J., 1991a. Speech Synthesis as a Prosthesis for Vocal
>          Dysfunction: Modifications in Functionality and Improvements in
>          Base Technology. Proceedings of Speech Tech '91. 221-230.
>
>          Vitale, A.J., 1991b. An Algorithm for High Accuracy Name
>          Pronunciation by Parametric Speech Synthesizer. Journal of
>          Computational Linguistics 17,3. pp. 257-276.
>
>          Vitale, A.J., 1992. Issues in Speech Technology for Persons with
>          Disabilities. Journal of the American Voice Input/Output Society.
>
>          Wise, B.W. and Olson, R.K., 1992. How Poor Readers and Spellers
>          use Interactive Speech in a Computerized Spelling Program.
>          Reading and Writing, 4, 145-163.
>
>          Vitale, A.J., 1993. Hardware and Software Aspects of a Speech
>          Synthesizer Developed for Persons with Disabilities. Journal of
>          the American Voice Input/Output Society 13.27-39.
>