Published May 20, 2026 | Version v1
Thesis

Reconstructing Textual Representation: Encoding, Rare Characters, and Data Preparation in Japanese Literary Corpora

  • 1. University of Chicago

Description

This thesis examines the structural tensions between renderability, computational tractability, and textual fidelity in the construction of literary corpora, using Aozora Bunko, a major open-access repository of modern Japanese literature, as its primary case. Because early character encoding standards could not accommodate the full range of characters in its holdings, the corpus represents rare characters as bracketed descriptive annotation strings inserted directly into the prose. Rather than imposing a single normalization strategy, this project develops a non-destructive and reversible processing pipeline that externalizes all transformations into structured data layers independent of the source text, allowing researchers to generate task-specific outputs without irreversibly modifying the original. The project argues that corpus construction decisions are not merely technical but reflect historical compromises embedded in encoding standards, and that a layered, recoverable architecture provides both infrastructure for future maintenance and a new queryable resource for the study of character encoding across the full collection.
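The layered, reversible approach described above can be illustrated with a minimal sketch. This is not the thesis's actual pipeline: the regular expression assumes a simplified annotation shape (a reference mark followed by full-width brackets), the sample string uses placeholder contents rather than a real corpus annotation, and the names `Annotation`, `extract_layer`, and `render` are hypothetical. The point is only that annotations can be recorded as offsets in a separate layer while the source text stays untouched.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumption: a rare-character annotation looks like ※［＃ ... ］.
# Real annotations in the corpus may vary more than this pattern allows.
ANNOTATION = re.compile(r"※［＃(?P<body>[^］]*)］")


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the annotation in the unmodified source
    end: int    # offset just past the annotation
    body: str   # descriptive text found inside the brackets


def extract_layer(source: str) -> list[Annotation]:
    """Record every annotation as standoff data; the source is not changed."""
    return [
        Annotation(m.start(), m.end(), m.group("body"))
        for m in ANNOTATION.finditer(source)
    ]


def render(source: str, layer: list[Annotation],
           replace: Callable[[Annotation], str]) -> str:
    """Build a task-specific output by substituting each annotated span."""
    parts, pos = [], 0
    for ann in layer:
        parts.append(source[pos:ann.start])  # text before the annotation
        parts.append(replace(ann))           # task-specific substitution
        pos = ann.end
    parts.append(source[pos:])               # text after the last annotation
    return "".join(parts)


# Placeholder sample: 説明 ("description") and コード ("code") stand in
# for the real contents of an annotation.
source = "前の文※［＃「説明」、コード］後の文"
layer = extract_layer(source)

# One possible output: mask each rare character with a single placeholder.
masked = render(source, layer, lambda ann: "〓")

# Reversibility: re-inserting the original spans reproduces the source.
restored = render(source, layer, lambda ann: source[ann.start:ann.end])
```

Because every substitution is computed from the source plus the layer, different tasks can pass different `replace` functions without any of them altering the original text.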

Additional details

UChicago Information

Division(s)
Arts & Humanities Division
Department(s)
Master of Arts in Digital Studies of Language, Culture, and History