Abstract
As machine learning (ML) models have grown in size and scope in recent years, so has the amount of data needed to train them. Unfortunately, individuals whose data is used in large-scale ML models may face unwanted consequences. Such data use may violate individuals' privacy or enroll them in an unwanted ML application. Furthermore, recent advances have greatly enhanced models' ability to generate synthetic data like text and images. Because these generative models are trained on massive datasets scraped from the internet, and can memorize and regurgitate their training data, they have unleashed a fresh wave of privacy and intellectual property concerns.
While user data privacy issues are well-recognized in the ML research community, most attempts to address them take a model-centric approach. Existing solutions assume either that model trainers are well-intentioned and that data has been obtained with consent, or that data use is inevitable and that the best path forward is to mitigate privacy risks. These solutions work but overlook a significant problem: data is often taken without consent, and users do not trust model trainers.
This raises the question: what if data use were not inevitable? What if, instead, users had agency over how and if their data is used in ML systems? This thesis argues that data agency, the ability to know and control how and if one's data is used in ML systems, is an important complement to existing ML data privacy protection approaches. Such agency would shift the current power dynamic, which renders users helpless at the hands of model creators, and help users control their digital destinies. Solutions of this nature would augment current work on data privacy, giving users, not just model trainers, control over how their data is used.
This thesis explores solutions that provide users with data agency against large-scale ML systems, allowing individuals to disrupt or discover when their data is used in such systems. It proposes three solutions that prevent or trace data use in ML systems or, in extreme cases, directly attack the ML system. It focuses on the use case of large-scale facial recognition (FR) systems, a machine learning technology that has recently become a flashpoint for civil liberties and privacy issues. With this use case in mind, the thesis finally develops a framework for reasoning broadly about FR data agency.