On the Use of ArXiv as a Dataset

Authors

Colin Clement, Matthew Bierbaum, Kevin O'Keeffe, Alexander Alemi

Publication date

2019/4/30

Journal

arXiv preprint arXiv:1905.00075

Description

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure (created by citation, affiliation, and co-authorship), make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal, by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7 million edge citation graph, with an 11 billion word corpus of full-text research articles. We present some baseline classification results, and motivate application of more exciting generative graph models.

Total citations

Cited by 98

Citations per year: 2019: 4, 2020: 3, 2021: 20, 2022: 21, 2023: 24, 2024: 26

Scholar articles

On the use of arxiv as a dataset

CB Clement, M Bierbaum, KP O'Keeffe, AA Alemi - arXiv preprint arXiv:1905.00075, 2019

Cited by 93 · All 6 versions

Stochastic nonlinear model for somatic cell population dynamics during ovarian follicle activation*

F Clément, F Robin, R Yvinec - Journal of Mathematical Biology, 2021

Cited by 5 · All 16 versions