- Lectures
- Institute of Linguistics
- Location
Room 519, 5th floor, HSSB, Institute of Linguistics, Academia Sinica
- Speaker Name
Joshua K. Hartshorne, Assistant Professor (Boston College)
- State
Definitive
The recently published Handbook of Formosan Linguistics clearly shows the impressive progress that has been made in fieldwork and linguistic analysis. At the same time, it highlights just how little quantitative work has been done, with almost no computational, psycholinguistic, or language acquisition studies. Because quantitative studies have been important drivers of linguistic theory, their absence is a problem not just for the quantitative language sciences but for linguistic theory as a whole. Critically, because all Formosan languages are endangered, the window of opportunity for experimental work is rapidly closing.
Currently, the main obstacle to quantitative studies of Formosan languages is the lack of machine-readable corpora. It is obvious that corpora are the basis of essentially all computational linguistics. It may be less immediately obvious that they are a prerequisite for psycholinguistic and language acquisition studies as well: most experimental designs require knowledge of word frequencies, cloze probabilities, and other corpus-based statistics.
In principle, a substantial amount of corpus material has already been compiled for Formosan languages. In practice, converting this material to a format that allows statistical analysis is a major undertaking. In this talk, I describe our grant-funded efforts to build FormosanBank, a machine-readable corpus of all 16 extant languages. I describe progress so far, as well as short-term plans and how researchers can contribute. I also present initial results from using machine learning to speed up the collection and processing of new corpora.