Automated integration of genomic metadata with sequence-to-sequence models

G Cannizzaro, M Leone, A Bernasconi… - Machine Learning and …, 2021 - Springer
Machine Learning and Knowledge Discovery in Databases. Applied Data Science …, 2021 - Springer
Abstract
While exponential growth in public genomic data can afford great insights into biological processes underlying diseases, a lack of structured metadata often impedes its timely discovery for analysis. In the Gene Expression Omnibus, for example, descriptions of genomic samples lack structure, with different terminology (such as “breast cancer”, “breast tumor”, and “malignant neoplasm of breast”) used to express the same concept. To remedy this, we learn models to extract salient information from this textual metadata. Rather than treating the problem as classification or named entity recognition, we model it as machine translation, leveraging state-of-the-art sequence-to-sequence (seq2seq) models to directly map unstructured input into a structured text format. The application of such models greatly simplifies training and allows for imputation of output fields that are implied but never explicitly mentioned in the input text.
We experiment with two types of seq2seq models: an LSTM with attention and a transformer (in particular GPT-2), noting that the latter outperforms both the former and a multi-label classification approach based on a similar transformer architecture (RoBERTa). The GPT-2 model showed a surprising ability to predict attributes with a large set of possible values, often inferring the correct value for unmentioned attributes. The models were evaluated in both homogeneous and heterogeneous training/testing environments, indicating the efficacy of the transformer-based seq2seq approach for real data integration applications.
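To make the translation framing concrete, the sketch below shows one way a structured record could be flattened into a target string for a seq2seq model and parsed back from generated text. The abstract does not specify the output format, so the "field: value | field: value" layout, the field names, the helper names `serialize` and `parse`, and the sample description are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical target format: "field: value" pairs joined by " | ".
# The paper's actual structured text format is not given in the abstract.
FIELD_SEP = " | "
KV_SEP = ": "


def serialize(record: dict) -> str:
    """Flatten structured attributes into the text a seq2seq model would be trained to emit."""
    return FIELD_SEP.join(f"{key}{KV_SEP}{value}" for key, value in record.items())


def parse(text: str) -> dict:
    """Recover attributes from generated text, skipping fragments without a key/value separator."""
    record = {}
    for part in text.split(FIELD_SEP):
        key, sep, value = part.partition(KV_SEP)
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


# Made-up training pair: free-text sample description -> structured target string.
source_text = "RNA-seq of breast tumor biopsy, female patient"
target = serialize({"disease": "breast cancer", "tissue": "breast", "sex": "female"})
```

In this illustration the target normalizes "breast tumor" to a single canonical term, which is the kind of terminology variation the abstract describes. Because the output is plain text, the same parser handles any subset of fields the model chooses to emit.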