Abstract
This dissertation develops an algorithmic approach to linguistics by studying the unsupervised learning of linguistic structure, with a focus on morphological paradigms. The work emphasizes reproducibility, accessibility, and extensibility in linguistic research.
The first major chapter studies stem extraction, focusing on analyzing morphological paradigms one at a time. Given a morphological paradigm, what is the stem, and how can we tell algorithmically? While it might appear trivial to extract "jump" from the English verbal paradigm jump-jumps-jumped-jumping, any non-concatenative morphology in any language presents an immediate challenge to an algorithm based on a substring approach to stem extraction. From the perspective of minimizing description length, the stem is best modeled as the longest common subsequence across word forms in a given morphological paradigm.
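As an illustration of the subsequence idea, the following is a minimal Python sketch, not the dissertation's implementation. It assumes a paradigm is given as a plain list of orthographic word forms, and it folds a standard pairwise longest-common-subsequence routine over that list. The function names `lcs` and `stem_candidate` are hypothetical. Folding pairwise is a heuristic: with more than two forms it need not return a longest subsequence common to all of them.

```python
from functools import reduce


def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (dynamic programming)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def stem_candidate(paradigm: list[str]) -> str:
    """Fold pairwise LCS over all word forms (heuristic for three or more forms)."""
    if not paradigm:
        raise ValueError("paradigm must contain at least one word form")
    return reduce(lcs, paradigm)
```

Under these assumptions, `stem_candidate(["jump", "jumps", "jumped", "jumping"])` returns `"jump"`, while the ablaut set `["sing", "sang", "sung"]` yields `"sng"`, a discontinuous stem that a substring-based approach cannot return.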
The next chapter explores paradigm similarity, considering multiple morphological paradigms at a time. The linguistic phenomenon of interest is inflection classes. Cross-linguistically, inflection classes tend to exhibit partial similarity. For instance, while Spanish verbs are customarily categorized as -ar, -er, and -ir verbs, the -er and -ir verbs are conjugationally more similar to each other than either is to the -ar verbs. This chapter develops a hierarchical clustering algorithm that characterizes such partial similarity across morphological paradigms in a tree structure.
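The Spanish example can be made concrete with a small Python sketch. This is not the chapter's algorithm: the distance measure (the proportion of paradigm cells whose endings differ), the average-linkage criterion, and all function names are assumptions chosen for illustration. The data are the regular present indicative endings only.

```python
from itertools import combinations

# Regular present indicative endings: 1sg, 2sg, 3sg, 1pl, 2pl, 3pl.
ENDINGS = {
    "-ar": ("o", "as", "a", "amos", "áis", "an"),
    "-er": ("o", "es", "e", "emos", "éis", "en"),
    "-ir": ("o", "es", "e", "imos", "ís", "en"),
}


def cell_distance(p, q):
    """Proportion of paradigm cells in which the two endings differ."""
    return sum(x != y for x, y in zip(p, q)) / len(p)


def leaves(tree):
    """List the class labels at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def average_linkage(t1, t2, paradigms):
    """Mean distance over all pairs of leaves drawn from the two subtrees."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    total = sum(cell_distance(paradigms[a], paradigms[b]) for a, b in pairs)
    return total / len(pairs)


def cluster(paradigms):
    """Agglomerative clustering; returns a binary tree as nested tuples.

    Assumes `paradigms` is non-empty and all paradigms have the same cells.
    """
    trees = sorted(paradigms)
    while len(trees) > 1:
        # Merge the two closest subtrees.
        i, j = min(
            combinations(range(len(trees)), 2),
            key=lambda ij: average_linkage(trees[ij[0]], trees[ij[1]], paradigms),
        )
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]
```

With this data, -er and -ir differ in two of six cells, whereas -ar differs from each of them in five, so `cluster(ENDINGS)` returns `("-ar", ("-er", "-ir"))`: the tree groups -er with -ir before joining -ar.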
The final major chapter explores how tables of morphological paradigms can be learned from raw data, such as an unannotated text corpus. The point of departure is the Linguistica program (Goldsmith 2001). Linguistica induces morphological paradigms from raw text by learning recurring morphological patterns called signatures, but it does not capture the relationships between signatures, so signatures that differ by inflection class remain unconnected. This chapter aligns the signatures from Linguistica by leveraging syntagmatic information available in a raw text corpus to induce what is akin to word category knowledge.
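To make the notion of a signature concrete, here is a deliberately naive Python sketch. It is not Linguistica's algorithm, which selects analyses by minimizing description length; it only groups stems by the exact set of suffixes they occur with in a word list. The function name `toy_signatures` and the `min_stem_len` parameter are hypothetical.

```python
from collections import defaultdict


def toy_signatures(words, min_stem_len=3):
    """Map each suffix set (a signature) to the stems that share it.

    Naive illustration only: every split point is tried, and a signature is
    kept if it has at least two suffixes and at least two stems.
    """
    stem_to_suffixes = defaultdict(set)
    for word in set(words):
        for cut in range(min_stem_len, len(word) + 1):
            # The empty string stands for the null suffix.
            stem_to_suffixes[word[:cut]].add(word[cut:])

    signature_to_stems = defaultdict(set)
    for stem, suffixes in stem_to_suffixes.items():
        if len(suffixes) >= 2:
            signature_to_stems[frozenset(suffixes)].add(stem)

    return {
        sig: stems for sig, stems in signature_to_stems.items() if len(stems) >= 2
    }
```

Given the eight forms of "jump" and "walk" (bare, -s, -ed, -ing), this returns a single signature, the suffix set {null, s, ed, ing}, shared by the stems "jump" and "walk". Stems from a different inflection class would surface under a separate suffix set, with nothing in the output linking the two.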