Abstract

Statistical mechanics was historically significant for its ability to link the microscopic descriptions of matter and energy to the macroscopic observations of thermodynamics. Recent cross-disciplinary work has used insights from statistical mechanics to solve the inverse problem (going from observed phenomena to underlying interactions and principles), with applications ranging from protein research to theoretical neuroscience. Inverse statistical physics relies on a key property of information entropy: fitting a distribution to data so that it satisfies the observed constraints but is otherwise maximally entropic yields a probabilistic model of the system that is least biased by unobserved assumptions, and is therefore maximally predictive. These maximum entropy models belong to a wider class of architectures widely used in machine learning, known as energy-based models. When such models are fit to real data from complex, multi-dimensional systems, the learned distribution should ideally generate states representative of the ground truth. In practice, however, this is often not the case; specialized sampling must be performed to generate the desired outputs. For example, energy-based models trained on sequences from evolutionarily related protein families can learn the generic constraints needed to produce novel functional sequences, as validated by in vivo experiments. However, these learned energy functions must be rescaled by a temperature parameter in order to sample such novel functional sequences. Here we use minimal, physically motivated energy-based models to systematically interrogate the differences between the data-generation processes of ground-truth and learned models sampled at varying temperatures. This setting permits a close examination of the surprising ability of temperature tuning of learned energy functions, a poorly understood heuristic used across machine learning, to improve sampling performance. Whether the post-hoc sampling temperature needs to be raised or lowered, and by how much, depends on several factors: the choice of objective function, the amount of training data, and, most importantly, properties of order and disorder inherent to the true system. Crucially, we show that the need to lower the temperature to improve generative performance arises from a tendency of fit models to overestimate the probability mass on excited states when training data are scarce and the ground truth is characterized by a strong preference for producing a few ground states, induced by large "energy gaps" or a low ground-truth "temperature." Additionally, we show in a minimal setting that the temperature-tuning phenomenon may be directly linked to a wide body of empirical evidence for a synergistic cluster of amino acids, or sector, within a protein sequence that determines the sequence's functionality.
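For context, the maximum-entropy construction and the post-hoc temperature rescaling referred to above take the standard Boltzmann form; the sketch below uses generic notation and does not reproduce the specific observables or models of the thesis:

    P(x) = \frac{1}{Z} e^{-E(x)}, \qquad E(x) = \sum_i \lambda_i f_i(x), \qquad Z = \sum_x e^{-E(x)}

    P_T(x) = \frac{1}{Z(T)} e^{-E(x)/T}, \qquad Z(T) = \sum_x e^{-E(x)/T}

Here the Lagrange multipliers \lambda_i enforce the observed constraints on the statistics f_i, and sampling at a post-hoc temperature T < 1 concentrates probability mass on the low-energy (ground) states, while T > 1 spreads it over excited states.

As a concrete illustration of sampling a learned energy function at a rescaled temperature, the following is a minimal Python sketch using single-spin-flip Metropolis moves on an Ising-like pairwise model; the couplings J, fields h, and step counts are hypothetical placeholders, not the models studied in the thesis:

    import numpy as np

    def metropolis_sample(energy, x0, n_steps, T=1.0, rng=None):
        # Draw an approximate sample from P_T(x) ~ exp(-E(x)/T) using
        # single-spin-flip Metropolis moves; `energy` is any learned or
        # ground-truth energy function and T is the post-hoc temperature.
        rng = np.random.default_rng() if rng is None else rng
        x = x0.copy()
        e = energy(x)
        for _ in range(n_steps):
            i = rng.integers(len(x))        # pick one spin to flip
            x_new = x.copy()
            x_new[i] = -x_new[i]
            e_new = energy(x_new)
            # Metropolis acceptance criterion at temperature T
            if rng.random() < np.exp(-(e_new - e) / T):
                x, e = x_new, e_new
        return x

    # Hypothetical Ising-like "learned" energy: random symmetric couplings J and fields h
    L = 20
    rng = np.random.default_rng(0)
    J = rng.normal(scale=0.1, size=(L, L))
    J = (J + J.T) / 2.0
    np.fill_diagonal(J, 0.0)
    h = rng.normal(scale=0.1, size=L)
    energy = lambda x: -0.5 * x @ J @ x - h @ x

    x0 = rng.choice([-1, 1], size=L)
    cold = metropolis_sample(energy, x0, n_steps=5000, T=0.5)  # lowered post-hoc temperature
    hot = metropolis_sample(energy, x0, n_steps=5000, T=1.5)   # raised post-hoc temperature

Lowering T biases the chain toward the model's ground states, while raising T yields more disordered samples; whether either adjustment improves agreement with the ground truth is the question the thesis interrogates.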
