Phonotactic Universals and Automatic Speech Recognition

Humans have exceptionally detailed knowledge of what constitutes a likely word in their native language. This knowledge is referred to as phonotactics, and it is crucial to many aspects of linguistic behavior, such as word production, word comprehension, and word choice. To illustrate, consider the nonsense words "blick" and "bnick". Neither is a word of English. Nonetheless, every English speaker knows that "blick" could be a word, while "bnick" could not. Speakers acquire this knowledge to a great extent from statistical regularities in the set of words in their language: frequent and infrequent patterns of sounds and acoustic features allow the language learner to induce constraints on what words are possible (and likely) in their language.

Crucially, however, not every logically possible regularity can be learned as a phonotactic constraint. Some words are considered substantially less well-formed than their statistical properties would imply. For example, "bdick" is less acceptable to speakers of English than "bnick", despite the fact that both are unattested. When we examine other languages, similar patterns often emerge. For example, unlike English, Russian allows words beginning in both "bd" and "bn"; however, as in English, "bd" is dispreferred and less frequent. Many such phonotactic universals hold across the world's languages.

A detailed understanding of how people learn and represent both universal and language-specific phonotactic constraints is an important scientific problem in its own right, and an important, under-explored factor in engineering. The field of automatic speech recognition (ASR) seeks to build models that automatically transcribe speech. Although ASR has advanced steadily over the last three decades, the basic training paradigm has changed little: it relies heavily on large corpora of annotated data and on specialized electronic pronunciation dictionaries. One consequence of this methodology is that, because of the expense and effort needed to create training resources, ASR systems have been produced for only a relatively small number of the world's languages; a generous estimate might be 100-150 out of the 6,000-7,000 languages spoken worldwide.

Our goal for this project is to bring together researchers from Linguistics, Psychology, and Computer Science to (i) formalize a model of human phonotactic knowledge that can learn language-specific probabilistic patterns in an unsupervised fashion while enforcing known phonotactic universals; (ii) evaluate the model's success from a psychological point of view on a corpus of crowd-sourced word-likeness judgments; (iii) integrate this model with current, state-of-the-art approaches to ASR and thereby sharply reduce the amount of supervised training data the system needs; and (iv) apply the system to a large number of different languages, including several relatively rare ones.

Our first goal is to model the way in which human learners acquire the sound patterns of their own language in an unsupervised fashion from limited amounts of data. To do this, we will construct a series of probabilistic models of increasing sophistication, inspired by the representations and constraints used in linguistic theories of phonotactics. These will include Markov models, hidden Markov models, factorial hidden Markov models, infinite hidden Markov models and, finally, an infinite factorial hidden Markov model. Each of these models enriches the space of inferable generalizations: hidden Markov models allow for "hidden" states that can capture abstract groupings of sounds (cf. natural classes) by learning multiple probability distributions over phones in an unsupervised way. Factorial HMMs allow for entire vectors of hidden states, letting the model capture generalizations driven by particular phonological features (cf. autosegmental representations). The infinite versions of these models allow the number of states to be inferred directly from the data.
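
To make the first step in this progression concrete, the following minimal sketch scores phone strings with a plain HMM whose two hidden states stand in for abstract sound classes. All phones, states, and probability tables here are hypothetical toy values; in the project such parameters would be learned from data (e.g. by EM or Bayesian inference), not set by hand.

```python
import numpy as np

# Toy HMM over phones with two hidden "sound class" states
# (0 ~ consonant-like, 1 ~ vowel-like). All values are hypothetical.
phones = ["b", "l", "n", "i", "k"]
IDX = {p: i for i, p in enumerate(phones)}

start = np.array([0.9, 0.1])                       # P(initial state)
trans = np.array([[0.3, 0.7],                      # P(next state | state)
                  [0.8, 0.2]])
emit = np.array([[0.35, 0.30, 0.25, 0.02, 0.08],   # P(phone | state 0)
                 [0.02, 0.03, 0.02, 0.90, 0.03]])  # P(phone | state 1)

def log_prob(word):
    """Forward algorithm: log marginal probability of a phone string."""
    alpha = start * emit[:, IDX[word[0]]]
    for ph in word[1:]:
        alpha = (alpha @ trans) * emit[:, IDX[ph]]
    return float(np.log(alpha.sum()))

# The score difference reflects only the toy tables; a trained model
# would learn asymmetries like "bl" vs. "bn" onsets from a lexicon.
print(log_prob("blik"), log_prob("bnik"))
```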

By employing such sophisticated machinery we will go beyond the current state of the art in phonotactic modeling. To date, probabilistic models of phonotactics have relied exclusively on n-gram-based statistics (e.g. feature-based trigrams, Hayes and Wilson 2008). Not only are these models limited in the kinds of generalizations they can learn, but they also require the space of possible generalizations to be defined a priori through a phone inventory or a feature system. We will be the first to apply unsupervised Markov models of this kind to the problem of phonotactics, and our findings will be highly relevant to researchers in linguistics by identifying the ways in which sound structures converge across languages.
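
For contrast with the n-gram baseline just described, here is a minimal sketch of a string-bigram phonotactic model with add-one smoothing, trained on a hypothetical toy lexicon. Note that Hayes and Wilson's actual model is a feature-based maximum-entropy model; this sketch illustrates only the simpler string-based n-gram idea and its reliance on a pre-defined inventory.

```python
import math
from collections import Counter

# Toy string-bigram phonotactic model with add-one smoothing.
# The lexicon is hypothetical; "#" marks word boundaries.
lexicon = ["blik", "brik", "blak", "bad", "snik", "slik"]
BOUND = "#"

bigrams, contexts, alphabet = Counter(), Counter(), {BOUND}
for w in lexicon:
    seq = BOUND + w + BOUND
    alphabet.update(seq)
    for a, b in zip(seq, seq[1:]):
        bigrams[(a, b)] += 1
        contexts[a] += 1

def score(word):
    """Add-one-smoothed log probability of a phone string."""
    seq = BOUND + word + BOUND
    V = len(alphabet)
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + V))
               for a, b in zip(seq, seq[1:]))

print(score("blik"))  # attested "bl" onset: higher score
print(score("bnik"))  # unattested "bn" onset: lower score
```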

We will train these phonotactic models on a unique corpus consisting of pronunciation dictionaries of 62 different languages from 26 major language families. Each dictionary contains between 1,000 and 100,000 words, with about 10,000 words on average (over 700,000 words in total), annotated for meaning and grammatical category in addition to a detailed phonetic transcription of each word's pronunciation. We will begin by addressing well-known phonotactic problems such as vowel harmony, consonant clusters, and consonant co-occurrence restrictions in turn, culminating in the formalization of a complete model of the phonotactic generalizations humans can learn from their language. Each model's accuracy will be evaluated against a crowd-sourced database of word-likeness judgments, as sketched below.
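
One natural form for this evaluation is a rank correlation between model scores and mean human ratings. The sketch below uses scipy's Spearman correlation; the nonce words, scores, and ratings shown are entirely hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical nonce words, model log-probabilities, and mean
# crowd-sourced word-likeness ratings (1-7 scale).
nonce_words = ["blick", "plick", "bnick", "bdick", "lbick"]
model_scores = [-6.1, -6.4, -9.8, -12.4, -14.0]
human_ratings = [5.9, 5.6, 3.1, 1.9, 1.2]

rho, pval = spearmanr(model_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```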

With a detailed and accurate model of phonotactics, we can take a step toward the automatic acquisition of speech without the twin crutches of modern ASR training: annotated speech data and a pronunciation dictionary. We plan initially to address the problem of detecting phone-like units in the speech signal using non-parametric Bayesian inference. We will explore a Dirichlet process mixture model to decode the composition of the phonetic units from speech audio. The Dirichlet process allows the size of a language's phonetic inventory to be inferred rather than fixed in advance, and the non-parametric Bayesian approach lets the mixture model learn both the topology and the probabilistic structure of each phone-like unit automatically.
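
As an illustration of the inventory-size inference, the sketch below clusters synthetic two-dimensional "frames" with scikit-learn's truncated Dirichlet-process Gaussian mixture (BayesianGaussianMixture). A real system would cluster frame-level acoustic features such as MFCCs and add temporal structure; the data and truncation level here are hypothetical.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic 2-D "frames" drawn from three latent units (hypothetical
# stand-ins for frame-level acoustic features such as MFCCs).
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(loc, 0.3, size=(200, 2))
                    for loc in ([0, 0], [3, 1], [1, 4])])

# Truncated Dirichlet-process Gaussian mixture: the truncation level
# (n_components) caps the inventory, but the DP prior prunes unused
# components, so the effective inventory size is inferred from the data.
dpgmm = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(frames)

active = int((dpgmm.weights_ > 0.01).sum())
print(f"Inferred phone-like units: {active} of {dpgmm.n_components} allowed")
```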

One of the methods we propose to explore for learning word pronunciations extends recent work on automatically learning a pronunciation dictionary from annotated data. We believe we can extend this approach to the case where an initial set of phone-like units is learned automatically, using a language model trained on text data to iteratively refine a set of possible lexical pronunciations based on phonotactically probable sequences of phones from our phonotactic model. We will also consider whether such techniques can be effective in a completely unsupervised training mode, or whether a small amount of annotated data is needed to bootstrap the procedure. For example, we will use our phonotactic model to discover the broad phone class each learned phonetic unit belongs to by comparing the co-occurrence patterns of phonetic units with the expectations given by phonotactic constraints, as in the sketch below. Having determined the broad phone class of each phonetic unit, we can further constrain the predicted pronunciations of different words and achieve more precise results.
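
One way such a comparison might be implemented is sketched here; this is an interpretation on our part, with entirely hypothetical numbers. Each discovered unit is summarized by how often consonant-like versus vowel-like units follow it, and that profile is matched against the follower profiles a phonotactic model would predict for each broad class.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-negative profile vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Expected follower profile [P(next is C), P(next is V)] for each broad
# class, as a phonotactic model might predict it (toy numbers).
expected = {"C": np.array([0.2, 0.8]),   # consonants tend to precede vowels
            "V": np.array([0.7, 0.3])}   # vowels tend to precede consonants

# Observed follower counts for two discovered units, collapsed onto a
# provisional C/V split of the other units (toy numbers).
observed = {"unit_07": np.array([12.0, 88.0]),
            "unit_13": np.array([66.0, 34.0])}

for unit, profile in observed.items():
    label = max(expected, key=lambda c: cosine(profile, expected[c]))
    print(unit, "->", label)   # unit_07 -> C, unit_13 -> V
```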

We will profit from synergy among efforts in a variety of disciplines: behavioral methods from cognitive science will help assess how accurately different models capture human knowledge of phonotactics; linguistic theory will help us understand the articulatory and perceptual sources of the observed patterns; and computer science will allow us to implement the different models effectively and will provide us with sophisticated tools for ASR.

We expect the outcomes of this research to benefit the field of spoken language acquisition by significantly reducing the human expertise (ideally to none) and the expense required to create ASR technology for new languages. Thus, we believe we can alter the landscape for the thousands of languages of the world that are resource-constrained. We plan to demonstrate the results of this research on several such languages to show the feasibility of these techniques. Finally, we hope to address the linguistic question of what constitutes a likely sound pattern in human language.
