Finding words that aren't there: Using word embeddings to improve dictionary search for low-resource languages
The next Department of Linguistics public lecture is presented by Dr. Antti Arppe (PhD), University of Alberta
Date: Thursday, March 14
Time: 2:30pm
Location: Arts 207 or via Zoom
This event is free and open to the public. | Register to attend via Zoom
About this event
The Department of Linguistics invites you to a public lecture by Dr. Antti Arppe (PhD), associate professor of Quantitative Linguistics at the University of Alberta and founding director of the Alberta Language Technology Laboratory (ALTLab).
Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require many orders of magnitude more training data than exists for low-resource languages in general, and for endangered ones in particular. However, dictionary definitions written in a far better-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language data.
By leveraging a pre-trained English word embedding to compute sentence embeddings for definitions in a Plains Cree (nêhiyawêwin) dictionary, we have obtained promising results for dictionary search. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform powerful searches for words that do not occur at all in the dictionary. These techniques are directly applicable to any bilingual dictionary providing translations between a high-resource and a low-resource language.
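As a rough illustration of the idea described above, the sketch below averages word vectors to get an embedding for each English definition, then ranks entries by cosine similarity to the query. It is only a minimal sketch under stated assumptions, not the method presented in the lecture: the three-dimensional vectors in TOY_VECTORS are made up for the example (a real system would load a pre-trained English embedding), the headwords are placeholders rather than actual Plains Cree entries, and averaging is just one simple way to build a sentence embedding.

```python
import math
import re

# Hypothetical toy word vectors, invented for this example only.
# A real system would load a pre-trained English word embedding instead.
TOY_VECTORS = {
    "dog": [0.90, 0.10, 0.00],
    "puppy": [0.85, 0.15, 0.00],
    "water": [0.00, 0.10, 0.90],
    "river": [0.05, 0.15, 0.85],
    "bread": [0.10, 0.90, 0.10],
}

# Placeholder headwords mapped to English definitions (not real dictionary data).
DEFINITIONS = {
    "headword_A": "a dog",
    "headword_B": "water",
    "headword_C": "bread",
}


def embed_sentence(text):
    """Average the vectors of all known words in text; None if no word is known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vectors = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vectors:
        return None
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, definitions, top_k=3):
    """Return up to top_k (headword, score) pairs, most similar definition first."""
    query_vec = embed_sentence(query)
    if query_vec is None:
        return []
    scored = []
    for headword, definition in definitions.items():
        def_vec = embed_sentence(definition)
        if def_vec is not None:
            scored.append((headword, cosine(query_vec, def_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# "puppy" appears in no definition, yet the entry defined as "a dog" ranks first.
print(search("puppy", DEFINITIONS))
```

The point of the example is the last line: a keyword search for "puppy" would find nothing, because the word occurs in no definition, while the embedding-based search still surfaces the semantically closest entry.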