Developing embedding models for Scottish Gaelic

W Lamb, M Sinclair - Proceedings of the 2nd Celtic Language …, 2016 - research.ed.ac.uk
Proceedings of the 2nd Celtic Language Technology Workshop, 2016
We detail a preliminary project on encoding and evaluating word embeddings for Scottish Gaelic. Word embedding methodologies show promise for diverse natural language processing (NLP) tasks and can be built from raw, unstructured text. Accordingly, they are attractive for under-resourced languages like Gaelic. We instantiated three embedding models on two versions of a 5.8 million token corpus: 1) tokenised and 2) tokenised/lemmatised. Using a simple POS tagger, we quantitatively measured the syntactic similarity between nearest neighbours for each model’s vector-space representations of words. We also queried the models to assess their semantic specificity and breadth. Models built from the tokenised corpus exhibited the effects of data sparsity for semantically constrained queries. The lemmatised versions had more semantic robustness, but at the expense of inflectional sensitivity. We note divergences between the models and an apparent inverse relationship between their semantic and syntactic capacities. Finally, we highlight the promise of word embeddings for a range of future work and downstream applications.
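
The nearest-neighbour evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional vectors, and the hand-assigned POS tags are all hypothetical, and cosine similarity is assumed as the neighbour metric. The idea shown is simply to find each word's closest neighbours in the vector space and report the share of neighbours carrying the same POS tag as the query word.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(word, vectors, k):
    """Return the k words whose vectors are most similar to `word`."""
    scores = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(reverse=True)
    return [other for _, other in scores[:k]]

def pos_agreement(vectors, pos_tags, k=1):
    """Fraction of (word, neighbour) pairs that share a POS tag."""
    matches = 0
    total = 0
    for word in vectors:
        for neighbour in nearest_neighbours(word, vectors, k):
            total += 1
            if pos_tags[word] == pos_tags[neighbour]:
                matches += 1
    return matches / total if total else 0.0

# Hypothetical toy data: real embeddings would have hundreds of dimensions
# and the tags would come from a POS tagger rather than a hand-written dict.
toy_vectors = {
    "cù": [0.9, 0.1],
    "cat": [0.8, 0.2],
    "ruith": [0.1, 0.9],
    "coisich": [0.2, 0.8],
}
toy_pos = {"cù": "NOUN", "cat": "NOUN", "ruith": "VERB", "coisich": "VERB"}

agreement = pos_agreement(toy_vectors, toy_pos, k=1)
print(f"POS agreement among nearest neighbours: {agreement:.2f}")
```

A higher agreement score would suggest that a model groups words by syntactic category. Comparing the score across models trained on the tokenised and the lemmatised corpus is one way to make the syntactic comparison described in the abstract concrete.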