The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models
Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli
Real-world datasets differ across classes in both structure and frequency, but most theory for diffusion models assumes homogeneous data. This work develops a high-dimensional analytical framework for class-dependent learning in score-based diffusion models. Using a random-features model trained on Gaussian mixtures, the paper characterizes how class variance, centroid geometry, and sampling imbalance shape the timing of generalization and memorization. The analysis predicts that diffusion models may memorize some classes while others remain underlearned, and the theory is validated with U-Net experiments on Fashion-MNIST.
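To make the data setting concrete, the sketch below draws labelled points from an imbalanced Gaussian mixture in which each class has its own centroid, variance, and sampling frequency. It is only an illustration of the three ingredients named in the abstract, not the paper's code: the function name `sample_imbalanced_mixture`, the isotropic covariance, and every numeric value (dimension, fractions, centroids, variances) are assumptions chosen for the example.

```python
import random


def sample_imbalanced_mixture(n, dim, fractions, centroids, variances, seed=0):
    """Draw n labelled points from an isotropic Gaussian mixture.

    fractions[c] -- sampling probability of class c (the imbalance)
    centroids[c] -- mean vector of class c (the centroid geometry)
    variances[c] -- per-coordinate variance of class c (the class variance)
    """
    rng = random.Random(seed)
    classes = list(range(len(fractions)))
    data = []
    for _ in range(n):
        # Pick a class according to the (possibly unequal) class fractions.
        c = rng.choices(classes, weights=fractions, k=1)[0]
        std = variances[c] ** 0.5
        # Sample each coordinate independently around the class centroid.
        x = [rng.gauss(centroids[c][i], std) for i in range(dim)]
        data.append((c, x))
    return data


# Hypothetical two-class setup: opposite centroids, unequal variance and frequency.
dim = 8
samples = sample_imbalanced_mixture(
    n=2000,
    dim=dim,
    fractions=[0.9, 0.1],
    centroids=[[1.0] * dim, [-1.0] * dim],
    variances=[1.0, 0.25],
)
counts = [sum(1 for c, _ in samples if c == k) for k in (0, 1)]
print(counts)  # class 0 should be far more frequent than class 1
```

Varying `fractions`, `centroids`, and `variances` independently is one simple way to probe how each factor affects a model trained on such data.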