Integrating Language Documentation and Computational Tools for Yupik, an Alaska Native Language

Abstract

Languages differ in how they build words: meaning may be encoded in free morphemes (units of meaning) that stand alone as words, or in morphemes that must combine with other morphemes to form words. While English has many free morphemes, the Alaska Native language St. Lawrence Island/Siberian Yupik uses the second strategy, producing very complex words that are often sentence-sized. These properties are known as agglutination and polysynthesis. Researchers will document critical structures in the language, digitize existing Yupik materials, and build computational tools to help the community and other researchers. The data from Yupik are extremely important to language science, since many of the phenomena displayed in the language are rare and not well understood. Creating computational tools for languages with very complex words, like Yupik, also benefits computer scientists and language scientists, because it helps researchers improve computational tools for languages like English. The Native American Languages Act, passed by the US Congress in 1990, made it policy to recognize the unique status and importance of Native American languages. This project will build and improve tools such as a morphological analyzer, a spellchecker, and a searchable dictionary, all of value to the community in revitalizing its language. Graduate students will be trained in these methods, and researchers will hold outreach meetings with high school students in the language community to teach them important computer and coding skills that will enable them to build further tools. All data gathered will be permanently archived at the Alaska Native Language Archive.

The project is a collaboration of language and computer scientists from the University of Illinois at Urbana-Champaign and George Mason University. It has three interconnected parts: digitizing existing materials on and in Yupik for use by community members and researchers; recording and analyzing the speech of Yupik speakers; and working with the community to build computer tools for Yupik and teaching students how to do so. A successful computational model of Yupik linguistic phenomena has implications for unsupervised and semi-supervised methods in morphology induction and grammar induction, because morphophonological change in Yupik is pervasive, much more so than is assumed by the models used in other approaches to unsupervised morphology induction. This work is likely to have important implications for the appropriate computational modeling of polysynthetic agglutinative morphosyntax. The team will access materials at several archives, scan them, and clean and process the scans so that they are digitally accessible and searchable. This will create a digital corpus of Yupik materials for use by the community and for linguistic investigations into grammatical mood, tense, and aspect, to better understand these complex morphosemantic constructions. The data will also improve the computational tools being developed in this project, providing the Yupik community with access to modern tools like spellcheckers, electronically searchable dictionaries, and electronic books. Finally, in its tight integration of field work and the development of computational tools for the analysis of the language, this project will serve as a model for future collaborations of this kind.

Logistics Summary

This collaboration between Schwartz (1761680, U of IL) and Schreiner (1760977, George Mason U) brings together computational methods with traditional fieldwork and language description to effectively and efficiently document critical aspects of the endangered St. Lawrence Island/Siberian Yupik (Yupik) language, while developing and improving computational tools that will aid in further documentation and analysis and support pedagogical and language revitalization efforts by the Yupik-speaking community. From 2020 to 2021, the team worked via digital channels (telephone, Facebook Messenger, text message, etc.) to continue documenting the language and building computer tools for Yupik and its speakers. The team will return to the field sites in 2022 as permitted.

Season    Field Site
2017      Alaska - Gambell
2017      Alaska - Nome
2018      Alaska - Gambell
2019      Alaska - Fairbanks
2019      Alaska - Gambell
2020      Distance
2021      Distance
2022      Alaska - Gambell

Principal Investigators


Co-Principal Investigators

Project Outcomes

This project combined computational methods with traditional linguistic documentation of an endangered Alaska Native language, Akuzipik (St. Lawrence Island Yupik). We first digitized a large number of existing language resources in and about Akuzipik. These materials comprised stories, narratives, and pedagogical materials. Over the course of the project, we developed computer programs to help researchers and speakers understand and analyze Akuzipik. These programs were then used to analyze the digitized materials, which revealed gaps in the existing documentation of the language. When a word or rule was found to be missing from the documentation, we consulted speakers of the language to fill the gap. We also documented other parts of the language about which little had previously been written (e.g., tense and aspect, negation, and noun phrases). The sounds of the language were also studied in much greater depth than before, including the first acoustic analyses of the language and the first ultrasound study of this or any closely related language. These activities resulted in a more complete record of the language. During the project, we also digitized the existing Akuzipik dictionary and turned it into a website that speakers can easily use. This website draws on our computational tools to break down user input and look up the base word and the other parts of the word.
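The dictionary lookup described above, which breaks a word into a base plus its other parts and then looks each part up, can be illustrated with a minimal sketch. This is not the project's implementation: the lexicon entries, glosses, and function names below are invented placeholders (they are not Akuzipik forms), and the sketch deliberately ignores the sound changes at morpheme boundaries that a real analyzer for the language would have to model.

```python
from typing import List, Optional, Tuple

# Hypothetical toy lexicon: invented placeholder forms, NOT real Akuzipik data.
BASES = {"taba": "toy base A", "kilu": "toy base B"}
SUFFIXES = {"so": "toy suffix 1", "ne": "toy suffix 2"}


def split_suffixes(rest: str) -> Optional[List[str]]:
    """Split `rest` into a sequence of known suffixes, or return None."""
    if not rest:
        return []
    for end in range(1, len(rest) + 1):
        piece = rest[:end]
        if piece in SUFFIXES:
            tail = split_suffixes(rest[end:])
            if tail is not None:
                return [piece] + tail
    return None


def segment(word: str) -> Optional[List[str]]:
    """Return [base, suffix, ...] if the word is a known base followed
    by known suffixes; otherwise return None."""
    for end in range(len(word), 0, -1):  # try the longest candidate base first
        base = word[:end]
        if base in BASES:
            suffixes = split_suffixes(word[end:])
            if suffixes is not None:
                return [base] + suffixes
    return None


def lookup(word: str) -> Optional[List[Tuple[str, str]]]:
    """Segment the user's input and attach a gloss to each part."""
    parts = segment(word.lower())
    if parts is None:
        return None
    entries = [(parts[0], BASES[parts[0]])]
    entries += [(s, SUFFIXES[s]) for s in parts[1:]]
    return entries


if __name__ == "__main__":
    print(lookup("tabasone"))
    print(lookup("xyz"))
```

A lookup that returns nothing for a word found in real text is also how a tool of this kind can point to gaps in the documentation: the unanalyzed word becomes a candidate for follow-up with speakers.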

Intellectual merit: While some documentation of the language existed before the project, many previously missing or misunderstood elements have now been clarified, and a much more complete record of the language exists. The methods for the development of the computer tools for Akuzipik have been useful to scholars working with other related or similar languages. Techniques we used for eliciting information about the language from speakers are of use to other scholars undertaking similar work.

Broader impacts: The digitization of language materials and texts, as well as the documentation of the language and development of computer programs, has supported Akuzipik speakers in their language revitalization and reclamation efforts. The research team has also supported the speaker communities as they have pursued grants related to language revitalization activities.

Project PI(s)
Funded Institutions: George Mason University; University of Illinois Urbana-Champaign
Other Research Location(s): Gambell, AK; Nome, AK; Fairbanks, AK
Project Start Date: Aug 2018
Award Year: FY18