Linguistic Dumpster Diving: Geographical Classification of Arabic Text

Zacharski, Ron; Abdelali, Ahmed; Helmreich, Stephen; Cowie, Jim

2009

Abstract

In many text analysis tasks it is common to remove frequently occurring words during pre-processing. Frequent words are removed for two reasons: first, they are unlikely to contribute meaningfully to the results; and second, removing them can greatly reduce the computation the analysis requires. In the literature, such words have been called 'noise' in the system, 'fluff words', and 'non-significant words'. While removing frequent words is appropriate for many text analysis tasks, it is not appropriate for all of them: there are many tasks in which frequent words play a crucial role. To cite just one example, Mosteller and Wallace, in their seminal book on stylometrics, noted that the frequencies of various function words could distinguish the writings of Alexander Hamilton from those of James Madison. We use a similar frequent-word technique to classify Arabic news stories geographically. To represent a document, we discard all content words and retain only the most frequent words, so that each document is represented by a vector of common-word frequencies. Our study used a collection of 4,167 Arabic documents from 5 newspapers (representing Egypt, Sudan, Libya, Syria, and the U.K.). We train a support vector machine on this data using the sequential minimal optimization algorithm and evaluate the approach using 10-fold cross-validation. Depending on the number of frequent words used, classification accuracy ranges from 92% to 99.8%.
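The document representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The whitespace tokenizer, the use of relative rather than raw frequencies, the toy English corpus, and the helper names `tokenize`, `most_frequent_words`, and `frequency_vector` are all assumptions made for illustration; the abstract does not specify these details. Training the classifier on the resulting vectors is not shown.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. Real Arabic text would
    # likely need normalization that the abstract does not describe.
    return text.split()


def most_frequent_words(documents, k):
    """Return the k most frequent words across the whole collection."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def frequency_vector(document, vocabulary):
    """Represent a document only by the relative frequency of each frequent word."""
    tokens = tokenize(document)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[word] / total for word in vocabulary]


# Toy corpus (hypothetical, in English for readability).
corpus = [
    "the cat sat on the mat and the dog sat by the door",
    "a bird and a fish and a frog met in the pond",
    "the sun rose and the moon set in the sky",
]

vocabulary = most_frequent_words(corpus, k=3)
vectors = [frequency_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print([round(value, 3) for value in vector])
```

Each document is reduced to one number per frequent word, and every content word is ignored. Vectors of this kind, paired with a label for the source newspaper, are what a classifier would be trained on.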

Details

Title: Linguistic Dumpster Diving: Geographical Classification of Arabic Text

Authors:
Zacharski, Ron (University of Mary Washington)
Abdelali, Ahmed (New Mexico State University)
Helmreich, Stephen (New Mexico State University)
Cowie, Jim (New Mexico State University)

Content Type: Article

Published in: Journal of the Chicago Colloquium on Digital Humanities and Computer Science

Identifier(s): DOI: https://doi.org/10.6082/M1CJ8BNF

Publication Date: 2009

Record Appears in: Humanities Division, Journal of the Chicago Colloquium on Digital Humanities and Computer Science, Vol. 1, No. 1 (2009)

Record Created: 2018-03-05
