Published May 20, 2026 | Version v1
Thesis
Reconstructing Textual Representation: Encoding, Rare Characters, and Data Preparation in Japanese Literary Corpora
Contributors
Advisors:
Description
This thesis examines the structural tensions between renderability, computational tractability, and textual fidelity in the construction of literary corpora, using Aozora Bunko, a major open-access repository of modern Japanese literature, as its primary case. Because early character encoding standards could not accommodate the full range of characters in its holdings, the corpus represents rare characters as bracketed descriptive annotation strings inserted directly into the prose. Rather than imposing a single normalization strategy, this project develops a non-destructive and reversible processing pipeline that externalizes all transformations into structured data layers independent of the source text, allowing researchers to generate task-specific outputs without irreversibly modifying the original. The project argues that corpus construction decisions are not merely technical but reflect historical compromises embedded in encoding standards, and that a layered, recoverable architecture provides both infrastructure for future maintenance and a new queryable resource for the study of character encoding across the full collection.
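The layered, non-destructive approach described above can be illustrated with a minimal sketch. It is not the thesis's actual pipeline: the annotation pattern (`※［＃…］`), the `Annotation` record, the `extract_layer` and `render` helpers, and the sample string are all assumptions introduced here for illustration. The idea shown is only that annotations are recorded as offsets in a separate layer, and that every output is generated from the untouched source plus that layer.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumed annotation shape: "※［＃" + free-text description + "］".
# The real corpus conventions may be richer than this pattern covers.
GAIJI = re.compile(r"※［＃(.*?)］")


@dataclass(frozen=True)
class Annotation:
    start: int        # offset of the annotation in the untouched source
    end: int          # offset one past the end of the annotation
    description: str  # text found between the brackets


def extract_layer(source: str) -> list[Annotation]:
    """Build a stand-off layer; the source string itself is never edited."""
    return [Annotation(m.start(), m.end(), m.group(1))
            for m in GAIJI.finditer(source)]


def render(source: str, layer: list[Annotation],
           resolve: Callable[[Annotation], str]) -> str:
    """Produce a task-specific view by substituting each annotated span."""
    parts, pos = [], 0
    for ann in layer:
        parts.append(source[pos:ann.start])
        parts.append(resolve(ann))
        pos = ann.end
    parts.append(source[pos:])
    return "".join(parts)


# Invented sample text, not taken from the corpus.
sample = "前※［＃「仮＋例」、架空の注記］後"
layer = extract_layer(sample)

# One possible output: replace every annotation with a placeholder glyph.
placeholder_view = render(sample, layer, lambda ann: "〓")

# Reversibility: substituting each span with its own original text
# reproduces the source exactly.
round_trip = render(sample, layer, lambda ann: sample[ann.start:ann.end])
```

Because the layer stores offsets rather than rewriting the text, different `resolve` functions (a placeholder, a Unicode code point lookup, or the original annotation) can yield different outputs from the same source.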