Abstract
Accurately estimating substitution patterns in e-commerce is difficult because most demand models rely on hand-coded product attributes that are often missing or incomplete. I propose a multimodal-embedding approach that replaces those attributes with low-dimensional features extracted from product images and text by pre-trained deep-learning models. Principal components of the embeddings enter a mixed logit alongside price, allowing visual and textual similarity to discipline cross-price elasticities. Applied to 3,478 Amazon purchases in a 25-item Headsets category, adding just two image principal components from a ResNet-50 encoder lowers the Akaike Information Criterion by 296 points relative to a price-only logit and reduces the out-of-sample mean absolute error of market-share forecasts by 22%. Diversion ratios become more concentrated, raising the category-level Herfindahl–Hirschman Index from 0.073 to 0.088 (+21%) and revealing tighter competition within visually defined sub-segments such as mid-range gaming headsets. These results demonstrate that information already present in product pages can materially improve demand estimation, even when no structured attributes are available.
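The core feature-construction step can be sketched as follows. This is a minimal illustration with simulated data: the embedding matrix stands in for ResNet-50 output, and the plain multinomial-logit share formula and all coefficient values are assumptions for exposition, not the paper's estimated mixed-logit specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins: 25 products with 2048-dim image embeddings
# (in the paper these would come from a pre-trained ResNet-50 encoder).
n_products, embed_dim = 25, 2048
embeddings = rng.normal(size=(n_products, embed_dim))
prices = rng.uniform(20.0, 120.0, size=n_products)

# PCA via SVD: keep the first two principal components of the embeddings.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ vt[:2].T  # shape (25, 2): two image principal components

# Toy linear utility: price plus the two principal components
# (illustrative coefficients, not estimates from the paper).
beta_price, beta_pc = -0.05, np.array([0.3, -0.2])
utility = beta_price * prices + pcs @ beta_pc

# Logit choice probabilities serve as predicted market shares.
expu = np.exp(utility - utility.max())
shares = expu / expu.sum()

# Herfindahl-Hirschman Index of the predicted shares.
hhi = float((shares ** 2).sum())
```

In an estimation setting, `beta_price` and `beta_pc` would be fitted (with random coefficients in the mixed logit) rather than fixed, and the principal components would be computed once from the real embedding matrix before entering the likelihood.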