A CONTINUOUS SPEECH RECOGNITION SYSTEM FOR TURKISH LANGUAGE BASED ON TRIPHONE MODEL

by Fatma PATLAR
Istanbul Kultur University, 2009
THESIS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER ENGINEERING PROGRAMME
SUPERVISOR: Assist. Prof. Dr. Ertuğrul SAATÇİ

ABSTRACT

A CONTINUOUS SPEECH RECOGNITION SYSTEM FOR TURKISH LANGUAGE BASED ON TRIPHONE MODEL
PATLAR, Fatma
M.S. in Computer Engineering
Supervisor: Assist. Prof. Dr. Ertuğrul SAATÇİ
August 2009, 73 pages

Speech recognition has been growing in popularity across a range of applications, including automated dictation systems and command interfaces. Embedding recognition into a product allows a unique level of hands-free usage and user interaction. Our main goal was to develop a system that performs a relatively accurate transcription of speech; in particular, a continuous speech recognition system for Turkish based on a triphone model. Turkish differs from Indo-European languages (English, Spanish, French, German, etc.) in its agglutinative, suffixing morphology. The vocabulary growth rate is therefore very high and, as a consequence, constructing a continuous speech recognition system for Turkish based on whole words is not feasible. With this in mind, the acoustic models in this thesis are based on triphones, each modelled as a five-state Hidden Markov Model. Mel-Frequency Cepstral Coefficients (MFCC) were preferred as the feature extraction method, and training is carried out with embedded training based on Baum-Welch re-estimation. Recognition is implemented on a search network, which can ultimately be seen as HMM states connected by transitions; the Viterbi token passing algorithm runs on this network to find the most likely state sequence for the utterance. To make recognition more accurate, a bigram language model is also constructed. MATLAB is used for processing speech, and the Hidden Markov Model Toolkit (HTK) is used to train the models and perform recognition.

The performance has been evaluated using two different databases: the general-purpose TURTEL speech database, used for the speaker-independent system tests, and a purpose-built weather forecast report database, used for the speaker-dependent system tests. In the recognition experiments, the word accuracy of the speaker-independent system was measured at 59-63 percent. After finding the optimum decision tree pruning factor empirically, the system tests were repeated using the language model and the optimum pruning factor. These adjustments improved performance by 30-33 percentage points, and word accuracy reached 92-93 percent in all tests. While the word accuracy of the speaker-dependent tests on the single-speaker database was 89-93 percent, using the language model and the optimum decision tree pruning factor increased performance, and word accuracy reached 95-97 percent.

Keywords: Continuous Speech Recognition, Triphone, Hidden Markov Model, Language Modelling, Bigram language model

ÖZET

ÜÇLÜ SES MODELLİ TÜRKÇE SÜREKLİ KONUŞMA TANIMA SİSTEMİ
PATLAR, Fatma
M.S., Department of Computer Engineering
Supervisor: Assist. Prof. Dr. Ertuğrul SAATÇİ
August 2009, 73 pages

The popularity of speech-recognition-based applications grows by the day. Dictation systems and command-interface systems are examples of such applications.
Integrating speech recognition into a product offers the user a unique level of ease of use and interaction. Our main aim here was to design a large-vocabulary system that provides relatively accurate transcription for Turkish. With its agglutinative morphology, Turkish generally differs from the Indo-European languages (English, Spanish, French, German, etc.). This structure causes a large growth in vocabulary and, as a result, word-based continuous speech recognition systems are not feasible for Turkish. Taking this fact into account, the acoustic models in this thesis are built on triphones modelled as five-state Hidden Markov Models. The Mel cepstral coefficient approach was preferred for feature vector extraction, and training was performed with the "embedded training" method, which uses the Baum-Welch re-estimation algorithm. Recognition is implemented with the Viterbi token passing algorithm operating on a search network; this search network can in fact be seen as the model states connected to one another by transitions. A bigram language model is also applied in order to obtain more accurate recognition. MATLAB was used for the analysis and processing of speech; the Hidden Markov Model Toolkit (HTK) was used for training the models and performing recognition.

Two separate speech databases were used for training and testing: the general-purpose TURTEL database for the speaker-independent tests, and a purpose-built weather forecast report database for the speaker-dependent system tests. In the speaker-independent recognition tests, word accuracy was measured at 59-63 percent. To increase system performance, the optimum decision tree pruning threshold was selected; applying it to the system together with the language model yielded an improvement of 30-33 percentage points, raising word accuracy to 92-93 percent. In the speaker-dependent tests on the single-speaker database, accuracy was around 89-93 percent; using the optimum decision tree and the language model was observed to raise accuracy to 95-97 percent.

Keywords: Continuous Speech Recognition, Language Model, Triphone, Hidden Markov Model, Bigram Language Model

To My Family

ACKNOWLEDGEMENTS

I would like to express my appreciation to my advisor, Assist. Prof. Dr. Ertuğrul SAATÇİ, for his valuable supervision and strong encouragement throughout the development of this thesis. I am most grateful to my family for everything, and to all my friends in the office and elsewhere for their patience, help, and for being very supportive and kind to me. Finally, I would like to thank The Scientific and Technical Research Council of Turkey (TÜBİTAK, UEKAE) for supporting this thesis.

CONTENTS

ABSTRACT
ÖZET
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS – ABBREVIATIONS
EQUATIONS
INTRODUCTION
CHAPTER 1: SPEECH PROCESSING
1 BACKGROUND INFORMATION
1.1 Historical Knowledge on Speech Processing
1.2 Speech Production System
1.3 Language and Phonetic Features
1.4 Speech Representation
1.4.1 Three-State Representation
1.4.2 Spectral Representation
1.4.3 Parameterization of the Spectral Activity
1.5 Speech Recognition System
1.5.1 Categories of Speech Recognition
CHAPTER 2: SPEECH TO TEXT SYSTEM
2.6 Preprocessing
2.7 Feature Extraction
2.7.1 Frame Blocking
2.7.2 Windowing
2.7.3 Fast Fourier Transform
2.7.4 Mel Frequency Cepstrum Coefficient
2.7.5 Mel Filter Bank
2.7.6 Discrete Cosine Transformation
2.8 Training and Recognition
2.8.1 Dynamic Time Warping
2.8.2 Artificial Neural Networks
2.8.3 Hidden Markov Model
CHAPTER 3: EXPERIMENTAL WORK
3.1 Data Preparation
3.1.1 Preparing Dictionary
3.2 Recording the Data
3.3 Preprocessing
3.4 Feature Extraction
3.5 Model Creation
3.5.1 Monophone Model
3.5.2 Triphone Model
3.5.3 Language Model
3.6 Training
3.7 Recognition
3.8 Offline Recognition
3.9 Live Recognition
CHAPTER 4: SIMULATION RESULTS
4.1 Properties of the Test Data
4.2 Parameters of the System
4.3 Measures of Recognition Performance
4.4 Experimental Results
CONCLUSION
FUTURE WORK
REFERENCES
APPENDIX A
APPENDIX B

LIST OF FIGURES

Figure 1: Schematic diagram of the speech production [13]
Figure 2: Human Vocal Mechanism [13]
Figure 3: Three state representation
Figure 4: Speech amplitude (a) and Spectrogram (b)
Figure 5: Speech Recognition Process
Figure 6: Preprocessing steps
Figure 7: Effect of the noise cancelling (word record is sIfIr (zero))
Figure 8: The effect of the Preemphasis Filter in the Time and Frequency Domain (word record is sIfIr)
Figure 9: The preemphasized signal cut down by the VAD
Figure 10: Feature Extraction Steps
Figure 11: Mel scale Filter Bank
Figure 12: A five state left to right HMM
Figure 13: Illustration of the sequence of operations required for the computation of the joint event that the system is in state S_i at time t and state S_j at time t+1 [28]
Figure 14: MFCC feature vector extraction process
Figure 15: Frame blocking scheme
Figure 16: Hamming Window
Figure 17: Hanning Window
Figure 18: Sample HMM Model (states 1 and 5 denote non-emitting states)
Figure 19: Making Monophone Model
Figure 20: Triphone model of 'bitki'
Figure 21: Making Triphone from Monophone
Figure 22: Concatenating triphone HMMs for a composite HMM
Figure 23: A sample decision tree for phone /a/
Figure 24: Training process
Figure 25: Recognition Network
Figure 26: Recognition search network

LIST OF TABLES

Table 1: Collected data
Table 2: Recognition performance under all conditions (for TURTEL)
Table 3: Recognition performance for weather reports under various conditions
LIST OF SYMBOLS – ABBREVIATIONS

ASR Automatic Speech Recognition
CSR Continuous Speech Recognition
DFT Discrete Fourier Transform
FFT Fast Fourier Transform
MFCC Mel Frequency Cepstral Coefficient
HMM Hidden Markov Model
HTK Hidden Markov Model Toolkit
a_ij The state transition probability
b_j(k) The observation symbol probability
π The initial state distribution
λ HMM model
S HMM states
V Observation symbols
O Observation sequence
α Forward variable
β Backward variable
δ Partial probability; the best score (highest probability)

EQUATIONS

(2.1) Preemphasis Filter
(2.2) Square Energy Algorithm
(2.3) Rectangular Window
(2.4) Hamming Window
(2.5) Fourier Series and Coefficients
(2.6) Discrete Fourier Transform
(2.7) Mel Frequency
(2.8) Discrete Cosine Transformation
(2.9) HMM state transition probability distribution for each state
(2.10) HMM observation symbol probability distribution
(2.11) HMM initial state distribution
(2.12) Forward Algorithm partial probability
(2.13) Backward Algorithm partial probability
(3.1) Observed speech waveform
(4.1) Percentage of correctly recognized words
(4.2) Accuracy of recognition

INTRODUCTION

Speech recognition has been an active area of research for over sixty years, and its popularity has grown steadily in recent years. Speech has always played an important role in human communication because of its inherent superiority over other modes. As time passed, the need for better control of complex machines appeared, and speech response systems began to play a major role in human-machine communication. Although technology has created many other ways for humans to communicate, nothing replaces or equals speech. In the past, humans were required to interact with machines in the language of those machines, since the machines could not be trained to understand human speech. With speech recognition and speech response systems, humans can communicate with machines using natural human language. The use of voice processing systems for voice input and output provides a degree of freedom for mobility, alternative modalities for command and data communication, and the possibility of a substantial reduction in the time needed to learn to interface with a complex system. When incorporated into an effective voice control or voice data entry system, these characteristics of speech yield many advantages over other methods of human-machine interaction [1].

Commercial application of speech recognition started in the early 1970s. Since then, many advances have been made in the development of more powerful algorithms for speech analysis and classification. The extraction of word features to improve performance in isolated and connected word recognition, and the reduction of the cost of the related hardware, have been the major goals of this field. Most successful speech recognition systems today use the statistical approach: Hidden Markov Models are used to model speech variations in a statistical and stochastic manner [2]. The idea is to collect statistics about the variations in the signal over a database of samples, and to use these to build a representation of the stochastic process. The speech signal carries acoustic information about the symbolic translation of its meaning. This acoustic information is embodied in HMMs, and sound events are modelled by them. Besides the acoustic information, the sequence of the symbols also constitutes a stochastic process, which can likewise be modelled. Such statistical approaches to modelling the symbolic events are called language models.
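As an illustration of this idea, the MATLAB sketch below estimates bigram probabilities by maximum likelihood from a toy word sequence. The corpus and variable names are invented for the example; this is not the language model construction used in the thesis.

```matlab
% Minimal sketch (not the thesis implementation): a bigram language model
% estimated by maximum likelihood, P(w2 | w1) = C(w1, w2) / C(w1).
corpus = {'hava','durumu','raporu','hava','durumu'};   % toy corpus
C1 = containers.Map('KeyType','char','ValueType','double');  % history counts
C2 = containers.Map('KeyType','char','ValueType','double');  % bigram counts
for k = 1:numel(corpus)-1
    w1  = corpus{k};
    w12 = [corpus{k} ' ' corpus{k+1}];
    if isKey(C1, w1),  C1(w1)  = C1(w1)  + 1; else C1(w1)  = 1; end
    if isKey(C2, w12), C2(w12) = C2(w12) + 1; else C2(w12) = 1; end
end
p = C2('hava durumu') / C1('hava');   % P(durumu | hava) = 1.0 in this toy corpus
```

In practice such raw counts are smoothed so that unseen word pairs do not receive zero probability; the sketch omits this for brevity.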
There are few studies on Turkish speech recognition compared with most other languages; the large-vocabulary recognition field for Turkish, in particular, is largely unexplored. The main reason for this is the difficulty posed by the Turkish language itself. Turkish is an agglutinative language, which makes it highly productive in terms of word generation. This is an important problem for a traditional speech recognizer, which has a fixed vocabulary. In real life, this speech recognition technology might be used to assist people with functional disabilities, and it can also be applied to many other areas. This leads to the vision of intelligent computers and homes, where one could switch the lights on and off, use a computer without touching the keyboard, or operate domestic appliances by voice.

CHAPTER 1: SPEECH PROCESSING

1 BACKGROUND INFORMATION

1.1 Historical Knowledge on Speech Processing

It has been over 60 years since the first work on speech recognition, and much has been accomplished, not only in recognition accuracy but also in other performance factors. For example, in isolated word recognition, vocabulary sizes have increased from ten to several hundred words. Highly confusable words have been distinguished, adaptation to the individual speaker has improved, speaker-dependent and speaker-independent systems have been developed, the telephone and noisy, distorting channels have been used effectively, and the effects of other environmental conditions such as vibration, g-forces and emotional stress have also been explored [3].

Historically, the start and development of speech recognition research can be summarized as follows. The first serious attempt at automatic speech recognition was described by Dreyfus-Graf (in France) in 1950: in the 'stenosograph', different sequences of sounds traced different tracks around the screen [3]. In 1952, Davis, Biddulph, and Balashek of Bell Telephone Laboratories developed the first complete speech recognizer. Several years later, Dudley and Balashek (1958) developed a recognizer called "Audrey" that used ten frequency bands and derived certain spectral features whose durations were compared with stored feature patterns for the words in the vocabulary. A major event in the history of speech recognition occurred in 1972, when the first commercial word recognizers appeared (100 words, with phonological constraints), from Scope Electronics Inc. and Threshold Technology Inc.
Early history of speech recognition:
1947: Sound Spectrograph
1952: Digits, using word templates, 1 speaker
1958: Digits, using phonetic sequences
1960: Digits, digital computer time normalization
1962: IBM Shoebox Recognizer
1964: Word recognizers for Japanese
1965: Vowels and consonants detected in continuous speech
1967: Voice actuated astronaut manoeuvring unit
1968: 54-word recognizers, digit string recognizer
1969: Vicens-Reddy recognizer of continuous speech
1972: 1st commercial word recognizer, 100 words / phonological constraints
1974: Dynamic programming (200 words), telephone; oxygen mask
1975: Alphabet and digits, 91 words, multiple talkers, no training
1976: ARPA systems: HARPY, HEARSAY, HWIM; 182 talkers, 97% accuracy, telephone
1977: CRT-compatible voice terminal
1978: IBM Continuous Speech Recognizer

Spoken language systems technology made rapid advances during the 1980s, supported by progress in the underlying speech and language technologies as well as by rapid advances in computing technology. As a result, there are now several research prototype spoken language systems that support limited interactions in domains such as travel planning, urban exploration, and office management. These systems operate in near real time, accepting spontaneous, continuous speech from speakers with no prior enrolment; they have vocabularies of 1000-2000 words and an overall correct understanding rate of almost 90%.

Looking at speech recognition from the perspective of the Turkish language, there are only a few studies on Turkish speech recognition. Most of these studies have been on isolated word recognition, where only one word is recognized at each search step [4] [5] [6] [7] [8]. Recently, large-vocabulary continuous speech recognition systems have begun to be studied. However, the performance of these systems is poor and not at the level of systems for other languages such as English. The same or similar methods used in English speech recognition can generally also be used for Turkish. Acoustic modelling is almost universal across languages, since the methods used yield similar results [9] [10] [11]. However, the classical language modelling techniques cannot be applied successfully to Turkish. Turkish is an agglutinative language; thus a great number of different words can be built from a base by using derivations and inflections [12]. This productive derivational and inflectional structure brings the problem of a high vocabulary growth rate. Furthermore, Turkish is a free word order language, so language models constructed from the co-occurrences of words do not perform well.

1.2 Speech Production System

Human communication can be seen as a comprehensive process running from speech production by the talker to speech perception by the listener, as shown in Figure 1.

Figure 1: Schematic diagram of the speech production [13]

The first element, speech formulation (A), is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B) to produce the actual speech waveform. The waveform is transferred via the air (C) to the listener. During this transfer the acoustic wave can be affected by external sources, for example noise, resulting in a more complex waveform.
When the wave reaches the listener's hearing system, the listener perceives the waveform (D) and the listener's mind (E) starts processing it to comprehend its content, so that the listener understands what the talker is trying to say [13]. One issue in speech recognition is simulating how the listener processes the speech produced by the talker. Several actions take place in the listener's hearing system during the processing of speech signals. The perception process can be seen as the inverse of the speech production process. It is worth mentioning that both production and perception are highly nonlinear processes [13].

To understand how speech production is performed, one needs to know how the human vocal mechanism is constructed. The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to form nasal sounds when needed: when the velum is lowered, the nasal cavity is coupled to the vocal tract to form the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw and velum, and varies from 0 to 20 cm² [14].

Figure 2: Human Vocal Mechanism [13]

When humans produce speech, air is expelled from the lungs through the trachea. The air flowing from the lungs causes the vocal cords to vibrate, and by shaping the vocal tract, lips, tongue and jaw, and possibly using the nasal cavity, different sounds can be produced.

1.3 Language and Phonetic Features

Speech production begins in the mind, when we form a thought that is to be produced and transferred to the listener. After forming the desired thought, we construct a phrase or sentence by choosing a collection of finite, mutually exclusive sounds. The basic theoretical units that bring linguistic meaning to the speech formed in the mind are called phonemes.

Phonetics is the study of the sounds of human speech. It is concerned with the actual properties of speech sounds (phones) and their production, audition and perception, while phonology, which emerged from it, studies sound systems and abstract sound units (such as phonemes and distinctive features). Phonetics deals with the sounds themselves rather than with the contexts in which they are used in languages.

Turkish has an agglutinative morphology with productive inflectional and derivational suffixation. Since it is possible to produce new words with suffixes, the number of words is very high: according to [15], the number of distinct words in a corpus of 10 million words is greater than 400 thousand. Such sparseness increases the number of parameters to be estimated for a word-based language model. Language modelling for agglutinative languages therefore needs to differ from the modelling of a language like English; such languages also have inflections, but not to the same degree as an agglutinative language [16].

The Turkish alphabet contains 29 letters (45 phonetic symbols), and Turkish is a phoneme-based language, which means phonemes are represented by letters in the written language. There are 8 vowels and 21 consonants:
Vowels: a e ı i o ö u ü
Consonants: b c ç d f g ğ h j k l m n p r s ş t v y z
There is a nearly one-to-one mapping between written text and its pronunciation, but some vowels and consonants have variants depending on where they are produced in the vocal tract.
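Because of this near one-to-one mapping, a first-pass pronunciation for a dictionary entry can be generated directly from the orthography. The MATLAB fragment below is a hypothetical sketch of that idea, using the letters themselves as phone symbols; it is not the dictionary-preparation procedure used in the thesis.

```matlab
% Hypothetical sketch: derive a first-pass pronunciation from Turkish
% orthography by mapping each letter to a phone symbol one-to-one.
% A real dictionary needs extra rules (e.g., for soft g and allophones).
word   = 'bitki';
phones = cellstr(word')';                    % {'b','i','t','k','i'}
fprintf('%s -> %s\n', word, strjoin(phones, ' '));
```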
1.4 Speech Representation

The speech signal and all its characteristics can be represented in two different domains: the time domain and the frequency domain. A speech signal is a slowly time-varying signal in the sense that, when examined over a short period of time (between 5 and 100 ms), its characteristics are short-time stationary. This is not the case over a longer time perspective (approximately T > 0.5 s), where the signal's characteristics are non-stationary, changing to reflect the different sounds spoken by the talker.

The speech representation can exist in either the time or the frequency domain, and in three different forms [14]: the three-state representation, the spectral representation, and the parameterization of the spectral activity.

1.4.1 Three-State Representation

The three-state representation is one way to classify events in speech. The events of interest are:
Silence (S) – No speech is produced.
Unvoiced (U) – The vocal cords are not vibrating, resulting in an aperiodic or random speech waveform.
Voiced (V) – The vocal cords are tensed and vibrating periodically, resulting in a speech waveform that is quasi-periodic. Quasi-periodic means that the waveform can be seen as periodic over a short time period (5-100 ms) during which it is stationary. Stationary signals can be defined as signals whose statistical parameters are constant over time.

Figure 3: Three state representation

The upper plot (a) contains the whole speech sequence, and plot (b) reproduces a part of plot (a) by zooming into an area of the whole sequence. The figure shows the segmentation into the three-state representation in relation to the different parts of plot (b). Segmenting the speech waveform into well-defined states is not straightforward, but this difficulty is not as big a problem as one might think: in speech recognition applications the boundaries between different states need not be exactly defined, and are therefore not crucial.

As complementary information, the three states can be combined, resulting in three other types of excitation signals:
Mixed: simultaneously voiced and unvoiced.
Plosive: a short region of silence followed by voiced or unvoiced speech, or both; air pressure builds up behind a closure and is then suddenly released.
Whisper: no excitation; air is moved through the vocal tract.

1.4.2 Spectral Representation

Spectral representation of the speech intensity over time is very popular, and the most popular form is the sound spectrogram.

Figure 4: Speech amplitude (a) and Spectrogram (b)

In Figure 4, the speech waveform is given in the time domain in (a), and (b) shows a spectrogram in the time-frequency domain. The darkest (dark blue) parts represent regions where no speech is produced, and the lighter (red) parts represent the intensity when speech is produced [17].
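A spectrogram such as the one in Figure 4(b) can be produced with a short-time Fourier transform. The MATLAB sketch below is illustrative only; the frame sizes and the random placeholder signal are assumptions, not values from the thesis.

```matlab
% Illustrative sketch: magnitude spectrogram via a short-time FFT.
fs = 8000;  x = randn(2*fs, 1);        % placeholder for a speech recording
N  = 256;   M = 128;                   % frame length and hop size (assumed)
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));      % Hamming window
numFrames = floor((length(x) - N)/M) + 1;
S = zeros(N/2 + 1, numFrames);
for k = 1:numFrames
    seg = x((k-1)*M + (1:N));          % k-th frame of the signal
    X   = fft(seg .* w);
    S(:, k) = abs(X(1:N/2 + 1));       % keep the non-negative frequencies
end
imagesc((0:numFrames-1)*M/fs, (0:N/2)*fs/N, 20*log10(S + eps));
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)');
```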
1.4.3 Parameterization of the Spectral Activity

When speech is considered as a time-varying signal, its characteristics can be represented via a parameterization of the spectral activity. This representation is based on the model of speech production, in which the human vocal tract is (roughly) described as a tube excited by air either at its end or at a point along the tube. From acoustic theory it is known that the transfer function of the energy from the excitation source to the output can be described in terms of the natural frequencies or resonances of the tube, better known as formants. Formants represent the frequencies that pass the most acoustic energy from the source to the output. This representation is highly efficient, but it is of more theoretical than practical interest, because it is difficult to reliably estimate the formant frequencies in low-level speech and to define formants for unvoiced (U) and silence (S) regions.

1.5 Speech Recognition System

Recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final result, as in applications such as command and control, data entry, and document preparation. Speech recognition systems can be characterized by many parameters, some of the more important being speaking mode and style, vocabulary, and enrolment. An isolated-word speech recognition system requires that the speaker pause briefly between words; connected-word recognition is similar but allows only a minimal pause; a continuous speech recognition system requires no pauses at all. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrolment: a user must provide samples of his or her speech before using them. Other systems are said to be speaker-independent, in that no enrolment is necessary. Other parameters depend on the specific task; recognition is generally more difficult when vocabularies are large or contain many similar-sounding words.

1.5.1 Categories of Speech Recognition

1.5.1.1 Recognition of Isolated Words

An isolated word recognition system operates on single words at a time, requiring a distinct pause between words [18]: a pause of about half a second (500 ms) or more must be inserted between spoken words [19]. This is the simplest form of recognition to perform, because the end points are easier to find and the pronunciation of a word tends not to affect the others. Because the occurrences of words are more consistent, they are easier to recognize. The technique is to compare the incoming speech signal with an internal representation of the acoustic pattern of each word in a relatively small vocabulary, and to select the best match using some distance metric [18]. Recognition success is usually close to 100%, which is adequate for many applications, but this is far from being a natural way of communicating [19].

1.5.1.2 Recognition of Connected Speech

Connected word recognition is one of the most interesting problems at the present stage of speech recognition research, since isolated word recognition techniques have been improved to a practically high level [20]. Connected-word systems (or, more correctly, 'connected utterance' systems) are similar to isolated-word systems but allow separate utterances to be 'run together' with a minimal pause between them [21].

1.5.1.3 Recognition of Continuous Speech

A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects [18]. The first problem is finding the utterance boundaries so as to allow users to speak almost naturally while the computer determines the content. Another problem is co-articulation.
The production of each phoneme is affected by the production of the surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech [18]. Basically, these types of recognizers are what is known as computer dictation.

CHAPTER 2: SPEECH TO TEXT SYSTEM

Speech recognition, also known as automatic speech recognition (ASR) or speech-to-text (STT), is a process which converts the acoustic speech signal into written text. A typical example of a speech-to-text application is a dictation system. Speech-to-text systems may also be used as an intermediate step for making spoken language accessible to machine translation. Systems generally perform two different types of recognition: single-word and continuous speech recognition. Continuous speech is more difficult to handle because of a variety of effects such as speech rate, co-articulation, etc. Co-articulation refers to changes in the articulation of a speech segment caused by its neighbouring segments.

Speech-to-text systems can be separated into two categories: speaker-dependent and speaker-independent. A speaker-dependent system is developed to operate for a single speaker. These systems are usually easier to develop and more accurate, but not as flexible as speaker-independent systems. A speaker-independent system is developed to recognize speech regardless of the speaker, i.e. it does not need to be trained to recognize individual speakers. Such systems are the most difficult to develop and their accuracy is lower than that of speaker-dependent systems; however, they are more flexible.

In this chapter we explain the general speech-to-text system framework and its modules. A general speech-to-text system involves several steps, which can be categorized as follows: data preparation, preprocessing, feature extraction, training, and recognition. In addition, a language model can be incorporated to increase the performance of the system. A block diagram of a speech-to-text system is shown in Figure 5.

Figure 5: Speech Recognition Process

2.6 Preprocessing

Preprocessing prepares the speech signal for further processing; in other words, it enhances the speech signal. The objective of preprocessing is to modify the speech signal in such a way that it becomes more suitable for feature extraction. Preprocessing involves three steps: noise cancelling, preemphasis and voice activation detection.

Figure 6: Preprocessing steps

Noise Cancelling: Noise cancelling reduces unwanted ambient sounds by filtering out the frequency bands that carry no speech. Humans can produce sounds between roughly 100 and 4500 Hz, and the range used for speech production is smaller. Thus, if telephone recordings (sampled at 8000 Hz) are used, cutting off below 120 Hz and above 3400 Hz is a logical choice, because above 3400 Hz human speech carries no information at this sampling rate.

Figure 7: Effect of the noise cancelling (word record is sIfIr (zero))
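As a concrete illustration of such band-limiting, the MATLAB sketch below simply zeroes the FFT bins outside roughly 120-3400 Hz. This crude frequency-domain mask is only a stand-in for whatever filter design is actually used; the signal and sampling rate are placeholders.

```matlab
% Illustrative sketch: crude noise reduction by band-limiting to
% 120-3400 Hz in the frequency domain (8 kHz telephone-style signal).
fs = 8000;  x = randn(fs, 1);              % placeholder one-second signal
L  = length(x);
f  = (0:L-1)' * fs / L;                    % DFT bin frequencies
keep = (f >= 120 & f <= 3400) | ...        % pass band ...
       (f >= fs-3400 & f <= fs-120);       % ... and its conjugate mirror
y = real(ifft(fft(x) .* keep));            % band-limited signal
```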
Preemphasis: The first stage in MFCC feature extraction is to boost the amount of energy in the high frequencies. It turns out that if we look at the spectrum of a voiced segment such as a vowel, there is more energy at the lower frequencies than at the higher frequencies. This drop in energy across frequencies (called spectral tilt) is caused by the nature of the glottal pulse. Boosting the high-frequency energy makes information from the higher formants more available to the acoustic model and improves phone detection accuracy [22].

This is usually done with a high-pass filter, most commonly a first-order FIR filter. The filter characteristic is given by equation (2.1):

$H(z) = 1 - a z^{-1}$    (2.1)

The value $a$ is chosen to be approximately 0.95. There are two reasons behind this [23]. The first is to introduce a zero near $z = 1$, so that the spectral contributions of the larynx and the lips are effectively eliminated.

Figure 8: The effect of the Preemphasis Filter in the Time and Frequency Domain (word record is sIfIr)

In this way, the analysis can be said to seek parameters corresponding to the vocal tract only. The second reason is that, if the speech signal is dominated by low frequencies, preemphasis prevents numerical instability [24].

Voice Activation Detection: An important problem in speech processing is detecting the presence of speech against a background of noise. This problem is often referred to as the end point location problem [25] [16]. Accurate detection of a word's start and end points means that the subsequent processing of the data can be kept to a minimum. In order to decide the start and end points of a speech signal, the energy of each block is calculated first, using the sum-of-squares energy measure:

$E_m = \sum_{n} x_m^2[n]$    (2.2)

where $x_m[n]$ denotes the samples of the $m$-th block. Assuming that there is no speech in the first few frames of the recording, the average energy of these frames gives the maximum and minimum energy estimates, which are used to compute the thresholds for cutting away the unvoiced (non-speech) parts of the speech.

Figure 9: The preemphasized signal cut down by the VAD
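The two preprocessing computations above can be sketched in a few lines of MATLAB. This is an illustrative reading of Eqs. (2.1) and (2.2), not the thesis code; the block size and the threshold rule are assumptions.

```matlab
% Illustrative sketch: preemphasis per Eq. (2.1) and short-time energy
% per Eq. (2.2) for a simple endpoint (VAD) decision.
a  = 0.95;  fs = 8000;
x  = randn(fs, 1);                     % placeholder speech signal
y  = filter([1 -a], 1, x);             % y[n] = x[n] - a*x[n-1]
N  = 200;                              % 25 ms blocks (assumed)
numBlocks = floor(length(y)/N);
E = zeros(numBlocks, 1);
for m = 1:numBlocks
    blk  = y((m-1)*N + (1:N));         % m-th block
    E(m) = sum(blk.^2);                % E_m: sum of squared samples
end
thr    = mean(E(1:5));                 % assume the first blocks are silence
voiced = E > thr;                      % crude speech / non-speech decision
```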
2.7 Feature Extraction

The purpose of feature extraction is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate). The speech signal is a slowly time-varying signal (it is called quasi-stationary) [26]. When examined over a sufficiently short period of time (typically between 20 and 100 ms), its characteristics are fairly stationary; however, over long periods of time (on the order of 0.2 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

A wide range of possibilities exists for parametrically representing the speech signal for the recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it is the feature used in this work. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. Linear prediction coefficients mainly model the vocal tract when extracting feature coefficients from speech, whereas MFCCs apply the mel scale to the power spectrum of speech in order to imitate the human hearing mechanism. The MFCC technique makes use of two types of filters: linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the mel frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The speech waveform itself may vary from time to time with the physical condition of the speaker's vocal cords; MFCCs are less susceptible to such variations than the waveforms themselves [26] [34].

Figure 10: Feature Extraction Steps

In a typical feature extraction process, the first step is windowing the speech to divide it into frames. Since high-frequency formants have smaller amplitudes than low-frequency formants, the high frequencies may be emphasized to obtain similar amplitudes for all formants [4]. After windowing, the magnitude-squared FFT is used to find the power spectrum of each frame. Filter bank processing, based on the mel scale, is then applied to the power spectrum. Finally, the discrete cosine transformation is applied after converting the power spectrum to the log domain, in order to compute the MFCCs.

2.7.1 Frame Blocking

Speech signals change every few milliseconds; because of this, speech should be analyzed in short-duration frames. Spectral evaluation of a signal is reliable only when the signal can be assumed to be stationary. Since speech characteristics do not change much over a short time period, the speech signal is blocked into frames of N samples (generally corresponding to 20-100 ms), with adjacent frames separated by M samples (M < N), so that consecutive frames overlap.
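A minimal MATLAB sketch of this frame blocking, with a Hamming window (the window of Eq. (2.4)) applied to each frame, is given below; N and M are illustrative values, not the settings used in the thesis.

```matlab
% Illustrative sketch: block the signal into overlapping N-sample frames
% shifted by M samples (M < N), then apply a Hamming window to each frame.
fs = 8000;  x = randn(fs, 1);          % placeholder preprocessed speech
N = 200;  M = 80;                      % 25 ms frames, 10 ms shift (assumed)
w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));       % Hamming window, Eq. (2.4)
numFrames = floor((length(x) - N)/M) + 1;
frames = zeros(N, numFrames);
for k = 1:numFrames
    frames(:, k) = x((k-1)*M + (1:N)) .* w;     % windowed k-th frame
end
```

Each column of `frames` is then passed on to the FFT and mel filter bank stages described in the following subsections.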