No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Highlights
- "Multimodal models require exponentially more data to achieve linear improvements in downstream zero-shot performance"
- "Concept frequency is still predictive of performance"
- Overall a solid, insightful, and enjoyable paper to read.
Summary
The commonly held notion is that zero-shot performance reflects a model's ability on "unseen" data. However, concepts considered unseen in downstream tasks often appear, if only infrequently, in massive pretraining datasets. The authors show that zero-shot performance is substantially worse for rare concepts, revealing a strong correlation between concept frequency and performance. Specifically, there is a log-linear relationship between concept frequency in pretraining data and zero-shot accuracy on downstream tasks. To better evaluate this long-tail effect, the authors release a curated benchmark of 290 rare concepts, enabling more thorough evaluation of zero-shot performance on infrequent concepts.
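As a rough illustration of what counting concept frequency over pretraining captions could look like, here is a minimal sketch with hypothetical `concepts` and `captions` inputs; the paper's actual frequency-estimation pipeline is more involved than a simple caption match.

```python
# Rough illustration (not the authors' exact pipeline): estimate how often
# each downstream concept is mentioned in a corpus of pretraining captions.
# `concepts` and `captions` below are hypothetical placeholders.
from collections import Counter
import re

concepts = ["eggnog", "tropical kingbird", "golden retriever"]
captions = [
    "a glass of eggnog on a wooden table",
    "a golden retriever playing in the park",
    "a tropical kingbird perched on a branch",
]

def concept_frequency(concepts, captions):
    """Count captions that mention each concept (simple word-boundary match)."""
    counts = Counter({c: 0 for c in concepts})
    patterns = {c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE) for c in concepts}
    for caption in captions:
        for concept, pattern in patterns.items():
            if pattern.search(caption):
                counts[concept] += 1
    return counts

print(concept_frequency(concepts, captions))
```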
Key Contributions
- The current notion of zero-shot is misleading because massive pretraining datasets contain all kinds of concepts at varying frequencies; "unseen" is rarely truly unseen in the raw data. For rare concepts, zero-shot performance is far worse, and zero-shot accuracy is highly correlated with concept frequency.
- There is a log-linear relationship between concept frequency in pretraining data and downstream zero-shot performance, and it holds even after accounting for pretraining/downstream dataset similarity and the effect of synthetic data (see the sketch after this list).
- Because concepts follow a long-tailed distribution across pretraining datasets, the authors curate the 290 concepts identified as least frequent across datasets (e.g., eggnog, tropical kingbird) into a test set of 130K samples, using text-to-image generators such as SDXL to synthesize the data.
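A minimal sketch of what fitting the log-linear trend might look like, assuming per-concept zero-shot accuracy and pretraining frequency counts are already available (the arrays below are hypothetical, not numbers from the paper):

```python
# Minimal sketch (hypothetical data): fit zero-shot accuracy as a linear
# function of log concept frequency, i.e. accuracy ~ a * log10(freq) + b.
import numpy as np

concept_freq = np.array([10, 100, 1_000, 10_000, 100_000])   # pretraining counts
zero_shot_acc = np.array([0.05, 0.18, 0.31, 0.44, 0.58])     # downstream accuracy

slope, intercept = np.polyfit(np.log10(concept_freq), zero_shot_acc, deg=1)
print(f"accuracy ~ {slope:.3f} * log10(freq) + {intercept:.3f}")
# A linear fit in log-frequency means exponentially more pretraining data is
# needed for each additional linear gain in downstream accuracy.
```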
Strengths
- The contributions and writing are crystal-clear, easy to understand, and easy to reference later. The authors employ effective repetition of the key findings they want to drive home.
- The authors tested a variety of popular models, trained on both open-source and closed-source pretraining data.
- The authors showed the robustness of their analysis by accounting for similarity between pretraining and downstream datasets, the effects of synthetic and balanced datasets, and concept misalignment between images and captions.
Weaknesses / Questions
- I would have liked to see more than one graph about their dataset in the evaluations, perhaps by pulling one of the graphs from the textbook-length appendix they wrote.
Related Work
- Zero-shot learning
- Datasets
- LLMs
- Synthetic data generation