A context-free method for the computational analysis of Buddhist texts

C. Handy. In: Digital Humanities and Buddhism: An Introduction. Berlin: De Gruyter, 2019. degruyter.com
This study demonstrates a practical method for extracting recurrent strings from digitized texts in cases where grammar, vocabulary and other information about the texts are partly or entirely unknown. My method involves building concordances of words and phrases from digitized input sets of texts, using a simple but effective pattern-recognition algorithm. The algorithm can be generalized to work with information in any language, but I restrict this study to just three major languages of the Buddhist tradition: classical Sanskrit, classical Tibetan and classical Chinese. I use free text files available in online databases so that my examples can be verified easily. I also provide C source code examples of the algorithm, available at a web link mentioned later in this paper. ¹

In modern English, and in many other modern languages, words in texts are separated by spaces, are non-inflected, and are essentially discrete particles that can be read as individual strings into a computer. A practical consequence of this linguistic feature is that computer spell-checkers, text search algorithms, and similar functions are computationally friendly, with fast processing and small storage requirements for texts. Most computer programming languages in use today have string-analysis functions based on European languages using a standard roman character set. Unicode standards make it easier to input and output non-roman scripts, but the general string functions in C and Unix (and in later
Thanks to Lance Adams for providing use of his high-performance computing environment, a Linux cluster of 128 logical processors, which greatly reduced the processing time required for this project. While my computer program is able to run on an ordinary desktop computer (or notebook computer), having this extra power was of great benefit in testing the limits of the algorithm, allowing me to process entire corpora in a reasonable amount of time.
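The abstract does not spell out the pattern-recognition algorithm itself, so the following is only a minimal sketch of the general idea it describes: finding recurrent strings without any knowledge of grammar, vocabulary or word boundaries. It assumes UTF-8 input and simply counts how often each run of n consecutive code points occurs, which is a plain n-gram tally rather than the author's method. All names (`utf8_len`, `index_code_points`, `count_ngram`, `top_ngram`, `MAX_CP`) are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CP 4096 /* hypothetical cap on code points, for this sketch only */

/* Byte length of the UTF-8 sequence that starts with lead byte b.
   Malformed or continuation bytes are treated as length 1. */
static size_t utf8_len(unsigned char b) {
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

/* Store the byte offset of each code point of s in offs[0..count-1] and the
   end offset in offs[count]; offs must hold max + 1 entries. Returns count. */
static size_t index_code_points(const char *s, size_t *offs, size_t max) {
    size_t len = strlen(s), pos = 0, count = 0;
    while (pos < len && count < max) {
        offs[count++] = pos;
        pos += utf8_len((unsigned char)s[pos]);
    }
    if (pos > len) pos = len; /* input ended inside a multi-byte sequence */
    offs[count] = pos;
    return count;
}

/* Count occurrences (overlaps included) of the string made of the n code
   points that start at code point i. No word boundaries are assumed. */
static size_t count_ngram(const char *s, const size_t *offs, size_t ncp,
                          size_t i, size_t n) {
    assert(n >= 1 && i + n <= ncp);
    size_t bytes = offs[i + n] - offs[i];
    size_t hits = 0;
    for (size_t j = 0; j + n <= ncp; j++) {
        if (offs[j + n] - offs[j] == bytes &&
            memcmp(s + offs[i], s + offs[j], bytes) == 0)
            hits++;
    }
    return hits;
}

/* Copy the most frequent n-code-point string of s into out (first one wins
   on ties) and return its count; returns 0 if s has fewer than n code points. */
size_t top_ngram(const char *s, size_t n, char *out, size_t outsz) {
    static size_t offs[MAX_CP + 1];
    size_t ncp = index_code_points(s, offs, MAX_CP);
    size_t best = 0, best_i = 0;
    if (outsz > 0) out[0] = '\0';
    if (n == 0 || ncp < n) return 0;
    for (size_t i = 0; i + n <= ncp; i++) {
        size_t c = count_ngram(s, offs, ncp, i, n);
        if (c > best) { best = c; best_i = i; }
    }
    size_t bytes = offs[best_i + n] - offs[best_i];
    if (bytes + 1 <= outsz) {
        memcpy(out, s + offs[best_i], bytes);
        out[bytes] = '\0';
    }
    return best;
}
```

With a 64-byte buffer `buf`, calling `top_ngram("evam maya evam", 4, buf, sizeof buf)` returns 2 and copies `evam` into `buf`. The quadratic scan keeps the sketch short; a concordance over an entire corpus would more plausibly use a hash table or suffix array, which is a design choice this example does not attempt to settle.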