Reconstructing Textual Representation: Encoding, Rare Characters, and Data Preparation in Japanese Literary Corpora

Li, Shuning

doi:10.6082/uchicago.17337

Published May 20, 2026 | Version v1

Thesis Metadata-only

Li, Shuning¹

1. University of Chicago

Contributors

Advisors:

This thesis examines the structural tensions between renderability, computational tractability, and textual fidelity in the construction of literary corpora, using Aozora Bunko, a major open-access repository of modern Japanese literature, as its primary case. Because early character encoding standards could not accommodate the full range of characters in its holdings, the corpus represents rare characters as bracketed descriptive annotation strings inserted directly into the prose. Rather than imposing a single normalization strategy, this project develops a non-destructive and reversible processing pipeline that externalizes all transformations into structured data layers independent of the source text, allowing researchers to generate task-specific outputs without irreversibly modifying the original. The project argues that corpus construction decisions are not merely technical but reflect historical compromises embedded in encoding standards, and that a layered, recoverable architecture provides both infrastructure for future maintenance and a new queryable resource for the study of character encoding across the full collection.
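The layered, reversible approach described in the abstract can be illustrated with a minimal stand-off sketch. It is not drawn from the thesis, whose full text is not part of this record. It assumes annotations of the general shape ※［＃「description」、reference］; the sample sentence, the names GaijiRecord, extract_layer, and render, and the choice of "〓" as a placeholder are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumed annotation shape: ※［＃「description」、reference］
# (an illustrative pattern, not a full grammar of the corpus notation).
GAIJI = re.compile(r"※［＃「(?P<desc>[^」]+)」、(?P<ref>[^］]+)］")


@dataclass(frozen=True)
class GaijiRecord:
    """One stand-off entry; offsets point into the untouched source."""
    start: int
    end: int
    description: str
    reference: str


def extract_layer(source: str) -> list[GaijiRecord]:
    """Collect annotations into a separate layer without editing the source."""
    return [
        GaijiRecord(m.start(), m.end(), m["desc"], m["ref"])
        for m in GAIJI.finditer(source)
    ]


def render(source: str, layer: list[GaijiRecord],
           replace: Callable[[GaijiRecord], str]) -> str:
    """Build a task-specific view by substituting each annotated span."""
    out, cursor = [], 0
    for rec in layer:
        out.append(source[cursor:rec.start])
        out.append(replace(rec))
        cursor = rec.end
    out.append(source[cursor:])
    return "".join(out)


# Hypothetical sample sentence containing one annotation.
sample = "彼は草を※［＃「てへん＋劣」、第3水準1-84-77］った。"
layer = extract_layer(sample)

# One derived view: mask each rare character with a placeholder.
masked = render(sample, layer, lambda rec: "〓")

# Another view: reinsert the original spans, recovering the source exactly.
restored = render(sample, layer, lambda rec: sample[rec.start:rec.end])
```

Because the layer stores only offsets and extracted fields, any number of derived views can be generated from the same source, and the source itself is never rewritten.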

Additional details

Division(s): Arts & Humanities Division
Department(s): Master of Arts in Digital Studies of Language, Culture, and History
