Model Release Update: BMC-CLIP1.1 Scaling Experiment
By Min Sun, Stanford Data Science Scholar—reposted with permission
Earlier this year, we released BMC-CLIP, a CLIP model continually pre-trained on a subset of BIOMEDICA, a dataset comprising 24M image-caption pairs and 31M image-inline text (figure reference) pairs from scientific literature. Continual pretraining with a batch size of 4,096 (4 × H100) achieved state-of-the-art zero-shot classification and retrieval performance on 40 biomedical tasks using 10× less compute than prior models.
Today, with support from Marlowe and Stanford Data Science, we’re releasing two new BMC-CLIP-1.1 models trained with larger batch sizes of 8,192 (8 × H100) and 32,768 (32 × H100). To the best of our knowledge, this is the largest biomedical CLIP experiment conducted to date. This matters because batch size is key in contrastive learning: each batch supplies more negative pairs, sharpening the model’s ability to distinguish relevant from irrelevant image-text matches.
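To make the batch-size point concrete, here is a minimal, illustrative sketch of a symmetric CLIP-style contrastive loss (not our exact training code): every other example in the batch acts as a negative, so a batch of N pairs gives each image N − 1 negative captions and vice versa.

```python
# Illustrative sketch only: symmetric InfoNCE loss over a batch of paired embeddings.
# With a batch of 32,768 pairs, each sample is contrasted against 32,767 negatives,
# versus 4,095 negatives at the original batch size of 4,096.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over an N x D batch of paired embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # N x N similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Row i's positive is column i; the other N - 1 columns are negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings:
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```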
🧪 As expected: Scaling helps. Bigger batch sizes lead to faster convergence and a more stable training loss. Interestingly, we also observe checkpoint-specific tradeoffs: some early-epoch checkpoints outperform the final ones on specific domains, achieving strong SOTA performance on some tasks! Given these findings, we are releasing all checkpoints for transparency and research. We hope the community can leverage the fully open-source nature of our contributions to study biomedical domain adaptation and model merging.
📂 All checkpoints will be made available here: http://bit.ly/4jVFFIl
📂 All models + data: huggingface.co/BIOMEDICA
📄 BIOMEDICA Paper: https://arxiv.org/abs/2501.07171
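If you want to try a released checkpoint, a minimal zero-shot classification sketch with the open_clip library is below. The repo id "BIOMEDICA/bmc-clip-1.1" and the image/label names are placeholders; substitute the checkpoint id you want from the Hugging Face page linked above.

```python
# Minimal zero-shot sketch with open_clip (repo id below is a placeholder).
import torch
import open_clip
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained("hf-hub:BIOMEDICA/bmc-clip-1.1")
tokenizer = open_clip.get_tokenizer("hf-hub:BIOMEDICA/bmc-clip-1.1")
model.eval()

# Replace with your own image and candidate labels.
image = preprocess(Image.open("histology_slide.png")).unsqueeze(0)
labels = ["H&E stained tissue", "chest X-ray", "brain MRI"]
text = tokenizer([f"an image of {label}" for label in labels])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze().tolist())))
```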
🙏 Once again, huge thanks to Marlowe and Stanford Data Science for generously providing the compute; these large-scale experiments (8× and 32× H100!) simply wouldn't have been possible without their support. Their infrastructure enabled us to explore how scaling batch size impacts CLIP-style models in biomedicine. We also thank NVIDIA for their support in making these experiments possible.
Shout-out to Alejandro Lozano, James Burgess, Serena Yeung-Levy, and Rob Tibshirani for supervising this experiment.
🚀 The future of multimodal biomedical AI is open source.

