(UM)
  • The problems in multilingual text processing for text-to-speech synthesis system
    Rojc, Matej, 1972-
    The language-independent approach requires separating all language-specific information into a language-specific inventory, which is composed of different lexica, various dictionaries ... and lists. The remaining core should represent the language-independent text-processing engine. During tokenization, the sentence string is read from left to right. The longest substring matched by the automaton is identified as a token. The automaton only recognizes tokens that are followed by a standard word delimiter, such as a blank or a punctuation mark. This follows from the longest-match search: if a string is immediately followed by a letter or a digit, it cannot be identified as a token, since incorporating the next character necessarily produces a longer match in the automaton. Tokenization is a non-trivial problem, especially in the multilingual case. Confronted with large corpora of raw text, the engine makes difficult choices whose repercussions are sometimes felt only long after. Token descriptions usually differ between languages. The basic idea is to store the token descriptions for each language in external files. An appropriate compiler should then generate the automaton for a given language from the descriptions in these files.
    Material type - conference contribution
    Year - 1999
    Language - English
    COBISS.SI-ID - 5492502

source: Advances in speech technology : recent progress in speech technology : proceedings (pp. 151-157)
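
The longest-match tokenization outlined in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: Python regular expressions stand in for the compiled automaton, and the in-memory dictionary TOKEN_DESCRIPTIONS stands in for the external per-language description files. All names, token classes, and patterns here (TOKEN_DESCRIPTIONS, compile_language, tokenize, the "en" and "sl" entries, the UNKNOWN and PUNCT fallbacks) are assumptions made for illustration only.

```python
import re

# Hypothetical per-language token descriptions. In the approach described
# above these would live in external files, one set per language.
TOKEN_DESCRIPTIONS = {
    "en": {
        "DATE": r"\d{1,2}/\d{1,2}/\d{4}",
        "NUMBER": r"\d+(?:\.\d+)?",
        "WORD": r"[A-Za-z]+",
    },
    "sl": {
        "DATE": r"\d{1,2}\.\d{1,2}\.\d{4}",
        "NUMBER": r"\d+(?:,\d+)?",
        "WORD": r"[A-Za-zČŠŽčšž]+",
    },
}


def compile_language(descriptions):
    """Turn token descriptions into matchers (regexes stand in for an automaton)."""
    return [(name, re.compile(pattern)) for name, pattern in descriptions.items()]


def tokenize(sentence, matchers):
    """Read the sentence left to right and emit the longest delimited match."""
    tokens = []
    pos = 0
    while pos < len(sentence):
        if sentence[pos].isspace():
            pos += 1
            continue
        best = None  # (token class, end position) of the longest valid match
        for name, rx in matchers:
            m = rx.match(sentence, pos)
            if m is None or m.end() == pos:
                continue
            end = m.end()
            # A candidate directly followed by a letter or digit is rejected:
            # it must be followed by a delimiter or the end of the string.
            if end < len(sentence) and sentence[end].isalnum():
                continue
            if best is None or end > best[1]:
                best = (name, end)
        if best is not None:
            tokens.append((best[0], sentence[pos:best[1]]))
            pos = best[1]
        elif sentence[pos].isalnum():
            # Fallback (an assumption of this sketch): take the whole
            # alphanumeric run as one unknown token.
            end = pos
            while end < len(sentence) and sentence[end].isalnum():
                end += 1
            tokens.append(("UNKNOWN", sentence[pos:end]))
            pos = end
        else:
            tokens.append(("PUNCT", sentence[pos]))
            pos += 1
    return tokens


if __name__ == "__main__":
    for lang, text in [("en", "Call 12/05/1999 at 3.5 pm."),
                       ("sl", "Dne 12.5.1999 je 3,5")]:
        print(lang, tokenize(text, compile_language(TOKEN_DESCRIPTIONS[lang])))
```

In this sketch the same tokenize function serves both languages, and only the description set passed to compile_language changes. At the date string, both the NUMBER and DATE descriptions match, and the longer DATE match is chosen.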

Library / Call number – location, inventory no. ... / Copy status
Knjižnica tehniških fakultet, Maribor: holdings information not available (database maintenance)