Integrating Language Documentation and Computational Tools for Yupik, an Alaska Native Language


One locus of crosslinguistic variation in how languages build words is whether meaning is encoded in free morphemes (units of meaning) that stand alone as words, or in morphemes that must combine with other morphemes to form words. While English has many free morphemes, the Alaska Native language St. Lawrence Island/Siberian Yupik uses the second strategy, building very complex words that are often the size of a sentence. These properties are known as agglutination and polysynthesis. Researchers will document critical structures in the language, digitize existing Yupik materials, and build computational tools to help the community and other researchers. The data from Yupik are extremely important to language science, since many of the phenomena displayed in the language are rare and not well understood. Creating computational tools for languages with very complex words, like Yupik, also benefits computer scientists and language scientists, because it helps researchers improve computational tools for languages like English. The Native American Languages Act, passed by the US Congress in 1990, enacted into policy the recognition of the unique status and importance of Native American languages. This project will build and improve tools such as a morphological analyzer, a spellchecker, and a searchable dictionary, all of value to the community in revitalizing its language. Graduate students will be trained in these methods, and researchers will hold outreach meetings with high school students in the language community to teach them computer and coding skills that will enable them to build further tools. All data gathered will be permanently archived at the Alaska Native Language Archive.

A team of language and computer scientists from the University of Illinois at Urbana-Champaign and George Mason University will undertake this project. It involves three interconnected parts: digitizing existing materials on and in Yupik for use by community members and researchers; recording and analyzing the speech of Yupik speakers; and working with the community to build computer tools for Yupik and teaching students how to do so. A successful computational model of Yupik linguistic phenomena has implications for unsupervised and semi-supervised methods in morphology induction and grammar induction, because the morphophonological changes found in Yupik are pervasive, much more so than those handled by the models used in other approaches to unsupervised morphology induction. This work is likely to have important implications for the appropriate computational modeling of polysynthetic agglutinative morphosyntax. The team will access materials at several archives, scan them, and clean and process the scans so that they are digitally accessible and searchable. This will create a digital corpus of Yupik materials for use by the community and for linguistic investigations into grammatical mood, tense, and aspect, to better understand these complex morphosemantic constructions. The data will also improve the computational tools being developed in this project, providing the Yupik community with access to modern tools like spellcheckers, electronically searchable dictionaries, and electronic books. Finally, in its tight integration of field work with the development of computational tools for analyzing the language, this project will serve as a model for future collaborations of this kind.

Logistics Summary

This collaboration between Schwartz (1761680, U of IL) and Schreiner (1760977, George Mason U) combines computational methods with traditional fieldwork and language description to effectively and efficiently document critical aspects of the endangered St. Lawrence Island/Siberian Yupik (Yupik) language. It also develops and improves computational tools that will aid further documentation and analysis and support pedagogical and language revitalization efforts by the Yupik-speaking community. From 2020 to 2021, the team worked via digital channels (telephone, Facebook Messenger, text message, etc.) to continue documenting the language and building computer tools for Yupik and its speakers. The team will return to the field sites in 2022 as permitted.

Season    Field Site
2017      Alaska - Gambell
2017      Alaska - Nome
2018      Alaska - Gambell
2019      Alaska - Fairbanks
2019      Alaska - Gambell
2020      Distance
2021      Distance
2022      Alaska - Gambell

Funded Institutions
George Mason University
University of Illinois Urbana-Champaign
Other Research Location(s)
Gambell, AK
Nome, AK
Fairbanks, AK
Project Start Date
Aug 2018
Award Year