Abstract

Data-driven research is popular across scientific fields, and we are interested in the combined value of the datasets used in a given area. Motivated by previous studies of research originality and of how combinatorial work improves science, we seek to establish strategies for retrieving words that carry dataset information from academic publications, using COVID-19 epidemiological papers as a case study. We used LDA and word embedding algorithms to separate epidemiological papers from clinical ones. We also annotated each sentence in the titles and abstracts according to whether it mentions dataset information. Pre-trained word representations enabled classification models to discriminate between data and non-data sentences. Unexpectedly, while more diverse terms in a publication's abstract and title help advertise it in terms of citation, they make the publication less likely to be among the top-cited papers. In conclusion, although we have not yet achieved accurate identification of data sentences in papers, we have uncovered techniques for filtering likely data sentences. As a next step, we suggest inspecting a larger corpus to evaluate the impact of alternative datasets and to gather more information on papers' word representations and citations.
