Infants are born into a rich and complex environment from which they construct mental representations to model structure that they find in the world. These representations enable infants to understand and predict their surroundings and ultimately to achieve their goals. They accomplish this using a combination of evolved innate structures and powerful learning algorithms.
To explore issues of early language learning, we have developed CELL, a computational model which acquires words from multimodal sensory input. CELL stands for Cross-channel Early Lexical Learning. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent intermodal structure. The model was implemented using current methods of computer vision and speech processing. By using these methods, the system is able to process natural speech and images directly without reliance on manual annotation or transcription. Although the model is limited in its ability to deal with complex scenes and noisy acoustic signals, it nonetheless demonstrates the potential of using these techniques for the purpose of modeling cognitive processes involved in language acquisition.
CELL learns by finding and modeling consistent structure across channels of sensor data. The model relies on a set of innate mechanisms which specify how speech and visual input are represented and compared, and probabilistic learning mechanisms for integrating information across modalities. These innate mechanisms are motivated by empirical findings in the infant development literature. CELL has been implemented for the task of learning shape names from a database of infant-directed speech recordings which were paired with images of objects. (1)
2. Problems of early lexical acquisition
CELL addresses three inter-related questions of early lexical acquisition. First, how do infants discover speech segments which correspond to the words of their language? Second, how do they learn perceptually grounded semantic categories? And tying these questions together: How do infants learn to associate linguistic units with appropriate semantic categories?
Discovering spoken units of a language is difficult since most utterances contain multiple connected words. There are no equivalents of the spaces between printed words when we speak naturally; there are no pauses or other cues which separate the continuous flow of words. Imagine hearing a foreign language for the first time. Without knowing any of the words of the language, imagine trying to determine the location of word boundaries in an utterance, or for that matter, even the number of words. Infants first attempting to segment spoken input face a similarly difficult challenge. This problem is often referred to as the speech segmentation or word discovery problem. Our goal was to understand and model the identification and extraction of semantically salient words from fluent contexts.
In addition to successfully segmenting speech, infants must learn categories which serve as referents of words. In the current work we consider object shape categories derived from camera images. No pre-existing shape categories are assumed in the model. Instead, visual categories are formed from observations alone. By representing object shapes, the model is able to learn words which refer to objects based on their shape.
A third problem of interest is how infants learn to associate linguistic units with appropriate semantic categories. Input to the model, as to infants, consists of spoken utterances paired with visual contexts. Each utterance may consist of one or more words. Similarly, each context may be an instance of many possible shape categories. Given a pool of utterance-context pairs, the learner must infer speech-to-shape mappings (lexical items) which best fit the data.
Within CELL, these three problems are treated as different facets of one underlying problem: to discover structure across spoken and contextual input.
The CELL model addresses problems of word discovery from fluent speech and word-to-meaning acquisition within a single framework. Although computational modeling efforts have not explored these problems jointly, there are several models which treat the problems separately.
Several models have been proposed which perform a complete segmentation of an unsegmented corpus. In contrast, our model does not attempt to perform a complete segmentation. Our goal of word discovery refers to the problem of discovering some words of the underlying language from unsegmented input. Our task is thus a subtask of complete segmentation. Models of complete segmentation are nonetheless interesting as indicators of the extent to which segmentation may be performed by analysis of speech input alone.
Speech segmentation models may be divided into two classes. One class attempts to detect word boundaries based on local sound sequence patterns or statistics. As a side effect of finding segmentation boundaries, words are also found. The idea of finding boundaries by considering probabilities or frequencies of local sound sequences dates back to Harris (1954) and has been explored more recently with computational models. Harrington, Watson and Cooper (1989) developed a computational algorithm which detected word boundaries based on trigrams of phonemes (sequences of three phonemes). The model was trained by giving it a lexicon of valid words of the language. A list of all within-word trigrams was compiled from this lexicon. To segment utterances, the model detected all trigrams which did not occur word-internally during training. In experiments, the system achieved 37% word boundary detection using a training lexicon of 12,000 common English words. The performance could be improved by using trigram probabilities rather than discrete occurrence tables. Although this work did not address learning word boundaries (since the training phase relied on a presegmented lexicon), the results demonstrate that phonotactic patterns may aid segmentation. This hypothesis is supported by infant research which has shown that 8-month-old infants are sensitive to transition probabilities of syllables, suggesting that they may use these cues to aid in segmentation (Saffran, Aslin & Newport, 1996).
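The trigram approach can be illustrated with a minimal sketch. This is not Harrington et al.'s implementation; the tiny lexicon and letter-as-phoneme strings are hypothetical stand-ins chosen only to show the mechanism: any trigram unattested word-internally during training signals that a word boundary falls somewhere within it.

```python
def within_word_trigrams(lexicon):
    """Collect every trigram that occurs inside a known word."""
    trigrams = set()
    for word in lexicon:
        for i in range(len(word) - 2):
            trigrams.add(word[i:i + 3])
    return trigrams

def candidate_boundaries(utterance, trigrams):
    """Return start positions of trigrams never seen word-internally.
    Each flagged trigram is evidence that a boundary lies inside it."""
    return [i for i in range(len(utterance) - 2)
            if utterance[i:i + 3] not in trigrams]

# Hypothetical training lexicon (letters stand in for phonemes).
lexicon = ["big", "dog", "bigdog"]
trigrams = within_word_trigrams(lexicon)
print(candidate_boundaries("dogbig", trigrams))  # → [1, 2]
```

The flagged positions 1 and 2 bracket the true boundary between "dog" and "big"; as the paper notes, replacing the discrete occurrence set with trigram probabilities would allow graded boundary evidence.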
A second class of segmentation algorithms explicitly models the words of the language. The concept of minimum description length (MDL) (Rissanen, 1978) provides a natural framework for constructing such algorithms. Within the MDL framework, the objective of the language learner is to acquire a lexicon which is most consistent with the observed linguistic input. A corpus of input text or speech is encoded as sequences of items from the acquired lexicon. A set of utterances may then be represented by the lexicon and a sequence of indices into the lexicon. The encoding of indices into the lexicon is optimized by assigning shorter codes to common lexical items. A trade-off exists between the size of the lexicon and the length of the resulting encoding of a corpus of utterances. The MDL framework provides a probabilistically sound basis for optimizing this trade-off to arrive at an optimal lexicon. de Marcken (1996) developed an algorithm which obtained a hierarchical decomposition of an unsegmented corpus of text or phoneme transcripts. The decomposition was optimized within the MDL framework. Rather than posit word boundaries, this model generated multiple levels of possible segmentation. The hierarchical design reflects the hierarchical nature of language extending from phonemes, morphemes and words to phrases. Brent (1999) developed an algorithm which generated a prior probability distribution over all possible sequences of all possible words constructed from a given alphabet. A corpus of unsegmented utterances was treated as a single observation sample in this model. The lexicon which was most probable for the observed corpus was selected and could be used to segment the corpus. Brent reported favorable segmentation results in comparison to several alternative schemes.
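The MDL trade-off can be made concrete with a toy scoring function. This is not de Marcken's or Brent's actual algorithm; the per-phoneme bit cost and the miniature corpus are illustrative assumptions. A candidate lexicon is charged for spelling out its entries plus the ideal code length (-log2 p) of the corpus rewritten as lexicon items, so frequent items earn short codes.

```python
import math
from collections import Counter

BITS_PER_PHONEME = 5  # hypothetical cost of spelling out one phoneme

def description_length(lexicon, segmented_corpus):
    """Total bits: lexicon spelling cost + ideal encoding of the corpus
    as a sequence of lexicon items (shorter codes for frequent items)."""
    lexicon_bits = sum(len(w) * BITS_PER_PHONEME for w in lexicon)
    counts = Counter(segmented_corpus)
    total = sum(counts.values())
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# Two candidate analyses of the same unsegmented stream "thedog" x 10:
corpus_as_words = ["the", "dog"] * 10
corpus_as_chars = list("thedog" * 10)
print(description_length({"the", "dog"}, corpus_as_words))   # → 50.0
print(description_length(set("thedog"), corpus_as_chars))    # → ~185.1
```

The word-level lexicon wins because the cost of storing "the" and "dog" is repaid many times over by the shorter corpus encoding, which is exactly the trade-off the MDL criterion optimizes.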
This algorithm also operates according to the minimum description length criterion since choosing a maximally probable model of the language is equivalent to minimizing description length (Cover & Thomas, 1991).
In addition to problems of word discovery from unsegmented speech, CELL also addresses the problem of learning word-to-meaning associations. CELL is concerned with learning words whose referents may be learned from direct physical observations. The current instantiation of the model, however, does not address learning words which are abstractly defined or difficult to learn by direct observation. Nonetheless, acquisition of word meaning in this limited sense is not trivial. There are multiple levels of ambiguity when learning word meaning from context. First, words often appear in the absence of their referents, even in infant-directed speech. This introduces ambiguity for learners attempting to link words to referents by observing co-occurrences. Ambiguity may also arise from the fact that a given context may be interpreted in numerous different ways (Quine, 1960). Even if a word is assumed to refer to a specific context, an unlimited number of interpretations of the context may be logically possible. Further ambiguities arise since both words and contexts are observed through perceptual processes that are susceptible to multiple sources of variation and distortion. The remainder of this section discusses approaches to resolving such ambiguities.
Infant-directed speech often refers to the infant's immediate context (Snow, 1977). Thus it is reasonable for the learner to assume that some or all of the words of an utterance will refer to some aspect of the immediate context. Ambiguities inherent in single utterance-context observations may be resolved by integrating evidence from multiple observations.
Siskind (1992) modeled the acquisition of associations of words to semantic symbols using cross-situational learning. By considering multiple situations, the most likely word-to-symbol associations were obtained by looking for consistent word-to-context patterns. The model acquired partial knowledge from ambiguous utterance-context pairs which were combined across situations to eliminate ambiguity. In related work, Sankar and Gorin (1993) created a computer simulation of a blocks world in which a person could interactively type sentences which were associated with synthetic objects of various colors and shapes. The system learned to identify words which could be visually grounded and associated them with appropriate shapes and colors. The mutual information between the occurrence of a word and a shape or color type computed from multiple observations was used to evaluate the strength of association. CELL employs a cross-situational strategy to resolve word-referent ambiguity using mutual information.
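The cross-situational strategy can be sketched with a simplified stand-in for the mutual-information scoring described above: pointwise mutual information between a word's occurrence and a shape category, computed over a pool of utterance-context observations. The observations below are hypothetical, not data from Sankar and Gorin or from CELL.

```python
import math

# Each observation pairs the set of words heard with the shape in view.
observations = [
    ({"look", "a", "ball"}, "round"),
    ({"the", "ball", "rolls"}, "round"),
    ({"see", "the", "cup"}, "cylinder"),
    ({"a", "big", "cup"}, "cylinder"),
]

def pmi(word, shape, observations):
    """Pointwise mutual information between a word and a shape category,
    estimated from co-occurrence counts across situations."""
    n = len(observations)
    n_w = sum(word in words for words, _ in observations)
    n_s = sum(s == shape for _, s in observations)
    n_ws = sum(word in words and s == shape for words, s in observations)
    if n_ws == 0:
        return float("-inf")
    return math.log2((n_ws / n) / ((n_w / n) * (n_s / n)))

print(pmi("ball", "round", observations))  # → 1.0 (consistent pairing)
print(pmi("the", "round", observations))   # → 0.0 (uninformative word)
```

No single observation disambiguates "ball" from the other words in its utterances; only by pooling situations does the consistent word-shape pairing stand out, which is the essence of cross-situational learning.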
Even if we assume that utterances refer to immediate contexts, Quine observed that any feature or combination of features of the context may serve as the referent of a word. To overcome this problem, a prior bias can be assumed to favor certain meanings over others. In humans, consistent constraints bias which aspects of the environment are most salient and thus likely to serve as referents for words (for example, the shape bias (Landau, Smith & Jones, 1988)). Computational models similarly may be preprogrammed to attend to specific features of contextual input and ignore others. For example, Sankar and Gorin's model only represented shape and color attributes of synthetic objects, thus constraining their model to only learn words groundable in these input channels. By not representing texture, weight and countless other potential attributes (and combinations of attributes) of an object, implicit constraints were placed on what was learnable.
The choice of representation is equally important in constraining the semantics of acquired words. Regier (1996) developed a model for learning spatial words ("above", "below", "through", etc.) by presenting a neural network with line drawing animations paired with word labels. Regier proposed that a simple set of geometrical attributes derived from the relative positions, shapes, and sizes of objects would serve as the grounding for spatial terms. He showed that his choice of attributes was sufficient for learning a variety of spatial terms across several languages. A general purpose learning system which could represent many attributes in addition to those hardwired in Regier's model would likely be much slower to learn and may initially be more prone to incorrect generalizations. For the experiments reported in this paper, CELL does not address Quine's dilemma since only one type of contextual attribute, object shape, is represented.
A final type of ambiguity arises due to natural variations of sensory phenomena. A word may be uttered with an infinite number of variations and yet be recognized. An object's shape may also vary in countless ways and still be identified. A variety of statistical pattern recognition techniques exist for representing and classifying noisy signals. Popular methods include artificial neural networks and probability density estimation (2) (Bishop, 1995). Computational models which learn from examples can acquire central prototypes from multiple observations and exhibit prototypicality effects similar to human subjects (Rosch, 1975). In the domain of word learning, Plunkett, Sinha, Moller and Strandsby (1992) created a connectionist neural network which learned to pair labels with visual …