By Gowri Nayar
Proteins are the working parts of biology. They build, signal, regulate, and repair. But understanding what a protein does is often surprisingly hard, especially when little experimental data exists for it. This is where data science can help.
In our new paper, GATSBI: Improving context-aware protein embeddings through biologically motivated data splits, we introduce a machine learning framework designed to learn richer protein representations by combining many kinds of biological information at once. Rather than treating proteins as isolated sequences, GATSBI brings together protein interactions, co-expression patterns, tissue-specific associations, and sequence-based embeddings into a single graph-based model.
But the technical contribution is only part of the story. One of the biggest messages of this work is that evaluation matters just as much as modeling. In many machine learning benchmarks, models are tested with convenient random train-test splits. Those are easy to run, but they often do not reflect the questions scientists actually care about. Can a model recover a missing relationship between proteins we already know something about? Can it say something useful about a protein that is mostly unstudied? Those are very different challenges, and our results show they lead to very different conclusions about performance.
GATSBI was built around this idea. We tested it using biologically meaningful data splits that better match real research settings. That change in evaluation revealed something important: model performance can look much better on paper than it does in the situations where biologists most need help. By aligning the benchmark with the real task, we get a more honest picture of what these embeddings can actually do.
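To make the difference between these two evaluation questions concrete, here is a minimal Python sketch. It is an illustration only, not the split procedure from the paper: the function names, the toy protein IDs (P1 to P6), and the 20% test fraction are all assumptions made for the example. The first function mimics a convenient random split over known relationships; the second holds out entire proteins, so every test relationship involves a protein the model never saw during training.

```python
import random

def random_edge_split(edges, test_frac=0.2, seed=0):
    """Randomly hold out a fraction of edges (hypothetical helper).

    Test edges usually connect proteins that still appear in training,
    so this asks: can we recover a missing link between known proteins?
    """
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def held_out_protein_split(edges, test_frac=0.2, seed=0):
    """Hold out whole proteins (hypothetical helper).

    Every edge touching a held-out protein goes to the test set, so the
    protein is entirely unseen in training, like an unstudied protein.
    """
    rng = random.Random(seed)
    proteins = sorted({p for edge in edges for p in edge})
    rng.shuffle(proteins)
    n_held = max(1, int(len(proteins) * test_frac))
    held_out = set(proteins[:n_held])
    train = [e for e in edges if held_out.isdisjoint(e)]
    test = [e for e in edges if not held_out.isdisjoint(e)]
    return train, test, held_out

# Toy interaction list with made-up protein IDs (assumption, not real data).
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"),
    ("P4", "P5"), ("P5", "P6"), ("P1", "P6"), ("P2", "P5"),
]

rand_train, rand_test = random_edge_split(edges)
cold_train, cold_test, held_out = held_out_protein_split(edges)

# Proteins that appear in at least one training edge of the second split.
train_proteins = {p for e in cold_train for p in e}
print("held-out proteins:", sorted(held_out))
print("leak into training:", bool(held_out & train_proteins))
```

In the second split, no held-out protein appears in any training edge, so a model has to generalize to it without any direct evidence. A model can do well under the first split and still struggle under the second, which is why the choice of split changes the conclusions drawn about performance.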
That is especially important for understudied proteins. Biologists have deep knowledge of only a relatively small fraction of proteins, while many others remain poorly characterized. These less-studied proteins are often where computational predictions could make the biggest impact. Across multiple downstream tasks, GATSBI consistently outperformed existing pretrained embeddings, with some of the strongest gains appearing exactly in this lower-evidence regime.
For the data science community more broadly, this work highlights three ideas.
First, integrating heterogeneous data can be more powerful than relying on a single source of information. Second, graphs remain a natural and effective way to represent structured scientific systems. And third, benchmark design is not just a technical detail. It shapes what we believe a model has learned. In scientific machine learning, choosing the right evaluation setup is part of the method itself.
Protein representation learning sits at a particularly exciting intersection of graph machine learning, multimodal data integration, and real-world scientific discovery. The goal is not only to build models that score well, but to build models that are useful when the data are incomplete, biased, and unevenly distributed, which is exactly what real biological data look like.
You can read the paper here and learn more about this work at ISMB 2026 in Washington, D.C.