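- Talk or lecture
- Institute of Linguistics
- Venue
Conference Room 519, Institute of Linguistics, Academia Sinica
- Speaker
Joshua K. Hartshorne, Assistant Professor (Boston College, USA)
- Event status
Confirmed
- Event URL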
The recently published Handbook of Formosan Linguistics clearly shows the impressive progress that has been made in fieldwork and linguistic analysis. At the same time, it highlights just how little quantitative work has been done, with almost no computational, psycholinguistic, or language acquisition studies. Because quantitative studies have been important drivers of linguistic theory, their absence is a problem not just for the quantitative language sciences but for linguistic theory as a whole. Critically, because all Formosan languages are endangered, the window of opportunity for experimental work is rapidly closing.
Currently, the main obstacle to quantitative studies of Formosan languages is the lack of machine-readable corpora. It is obvious that corpora are the basis of essentially all computational linguistics. It may be less immediately obvious that they are a prerequisite for psycholinguistic and language acquisition studies as well. However, most experimental designs require knowledge of word frequencies, cloze probabilities, and other corpus-based statistics.
In principle, a substantial amount of corpus material has already been compiled for Formosan languages. In practice, converting this material to a format that allows statistical analysis is a major undertaking. In this talk, I describe our grant-funded efforts to build FormosanBank, a machine-readable corpus of all 16 extant languages. I describe progress so far, as well as short-term plans and how researchers can contribute. I also present initial results from using machine learning to speed up the collection and processing of new corpora.