Developing embedding models for Scottish Gaelic

W Lamb, M Sinclair. Proceedings of the 2nd Celtic Language Technology Workshop, 2016. research.ed.ac.uk

We detail a preliminary project on encoding and evaluating word embeddings for Scottish Gaelic. Word embedding methodologies show promise for diverse natural language processing (NLP) tasks and can be built from raw, unstructured text. Accordingly, they are attractive for under-resourced languages like Gaelic. We instantiated three embedding models on two versions of a 5.8 million token corpus: 1) tokenised and 2) tokenised/lemmatised. Using a simple POS tagger, we quantitatively measured the syntactic similarity between nearest neighbours for each model’s vector-space representations of words. We also queried the models to assess their semantic specificity and breadth. Models built from the tokenised corpus exhibited the effects of data sparsity for semantically constrained queries. The lemmatised versions had more semantic robustness, but at the expense of inflectional sensitivity. We note divergences between the models and an apparent inverse relationship between their semantic and syntactic capacities. Finally, we highlight the promise of word embeddings for a range of future work and downstream applications.
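The nearest-neighbour evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact metric, so this assumes one plausible reading: for each query word, take its k nearest neighbours by cosine similarity and report the fraction that share the query's POS tag. The vectors, tags, and function names below are hypothetical toy values chosen for illustration; they are not taken from the paper's corpus, tagger, or models.

```python
import math

# Hypothetical toy embeddings and POS tags (assumptions for illustration only;
# not drawn from the paper's corpus or models).
VECTORS = {
    "cù": [0.9, 0.1, 0.0],       # "dog"
    "cat": [0.8, 0.2, 0.1],      # "cat"
    "each": [0.7, 0.3, 0.0],     # "horse"
    "ruith": [0.1, 0.9, 0.2],    # "run"
    "coisich": [0.2, 0.8, 0.3],  # "walk"
    "mòr": [0.5, 0.5, 0.7],      # "big"
}
POS = {"cù": "N", "cat": "N", "each": "N", "ruith": "V", "coisich": "V", "mòr": "A"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, k):
    """Return the k words most similar to `word`, excluding the word itself."""
    scored = [
        (other, cosine(VECTORS[word], vec))
        for other, vec in VECTORS.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


def pos_agreement(word, k):
    """Fraction of the k nearest neighbours that share the query word's POS tag."""
    neighbours = nearest_neighbours(word, k)
    matches = sum(1 for n in neighbours if POS[n] == POS[word])
    return matches / len(neighbours)


def mean_pos_agreement(k):
    """Average POS agreement over every word in the toy vocabulary."""
    return sum(pos_agreement(w, k) for w in VECTORS) / len(VECTORS)


if __name__ == "__main__":
    for w in VECTORS:
        print(w, nearest_neighbours(w, 2), pos_agreement(w, 2))
    print("mean agreement at k=2:", mean_pos_agreement(2))
```

Running the same score over a tokenised model and a lemmatised model would give one rough way to compare their syntactic behaviour, which is the kind of contrast the abstract reports.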
