Inside GEM-1: How it goes beyond memorization to predict the unseen
Embeddings are key to GEM-1’s prediction ability
One of the most frequent questions we receive is how GEM-1 can accurately predict outcomes for novel genetic and molecular perturbations. The answer lies in our approach: GEM-1 is built to decode biological mechanisms rather than just extrapolate from known data. We are excited to showcase the hard science behind these predictions and demonstrate the transparency and reproducibility of the GEM-1 platform.
To understand how GEM-1 achieves its predictive power, it is essential to grasp the concepts of embeddings and latent space.
Embeddings are numerical vectors that map complex inputs into a shared mathematical space. By translating biological entities into this format, the model can efficiently compare, search, and reason about the relationships between different drugs and genes.
Latent Space is a more specialized concept. It refers to the compressed numerical representation within a generative model, such as a Variational Autoencoder (VAE), that is optimized to mimic a specific probability distribution.
GEM-1 utilizes a Gaussian latent space, which constrains the form these embeddings can take, ensuring the model learns a structured, continuous representation of biology rather than just a list of data points.
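The idea of a Gaussian latent space can be sketched with the standard VAE reparameterization trick: an encoder outputs the mean and log-variance of a Gaussian, and latent vectors are sampled from that distribution. The encoder, weights, and dimensions below are toy stand-ins for illustration, not GEM-1's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Toy encoder: linear maps from input features to the parameters
    of a Gaussian latent distribution (mean and log-variance)."""
    return x @ w_mu, x @ w_logvar

def sample_latent(mu, logvar, rng):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) as
    z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical 8-dimensional input features for one perturbation,
# mapped into a 4-dimensional Gaussian latent space.
x = rng.standard_normal((1, 8))
w_mu = rng.standard_normal((8, 4)) * 0.1
w_logvar = rng.standard_normal((8, 4)) * 0.1

mu, logvar = encode(x, w_mu, w_logvar)
z = sample_latent(mu, logvar, rng)
print(z.shape)  # (1, 4)
```

Because every latent vector is pulled toward the same Gaussian, nearby points in the space decode to similar outputs, which is what makes the representation continuous rather than a list of isolated data points.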

Feature-Based Learning and Transferability
Drugs and genes do not exist in isolation; they belong to shared families and perform related functions. In traditional modeling, entities are often represented via one-hot encoding, which essentially treats every drug or gene as a unique, isolated label. However, this approach severely limits a model’s ability to generalize, particularly when working with large, heterogeneous datasets where specific perturbations may only appear a handful of times.
The Problem with “Label-Only” Learning
Modeling gene expression is fundamentally difficult because biological data is noisy and context-dependent. If a model treats a perturbation merely as a name, like “Drug A”, it cannot “connect the dots” when it encounters “Drug B,” even if the two are chemically nearly identical.
Without a deeper representation, the model sees a drug’s signature in a liver cell and its signature in a T cell as two disparate, unrelated events. It fails to recognize the underlying mechanism, making it impossible to transfer knowledge from one biological context to another. Therefore, we built our model to address this learning gap and enable out-of-context knowledge transfer.
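The limitation of label-only learning is easy to see numerically: one-hot vectors for any two distinct drugs are orthogonal, so the model has no signal that they are related. The fingerprint bits below are hypothetical, chosen only to illustrate the contrast:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot ("label-only") encoding: every drug is its own isolated axis,
# so two chemically near-identical drugs share zero information.
drug_a_onehot = np.array([1.0, 0.0, 0.0])
drug_b_onehot = np.array([0.0, 1.0, 0.0])
print(cosine(drug_a_onehot, drug_b_onehot))  # 0.0 -- no basis for transfer

# Feature-based encoding (hypothetical substructure bits): shared
# features give the model a handle to "connect the dots."
drug_a_feat = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])
drug_b_feat = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 1.0])
print(round(cosine(drug_a_feat, drug_b_feat), 2))  # 0.89 -- high similarity
```

With feature-based representations, whatever the model learns about Drug A partially transfers to Drug B in proportion to their shared features, which is exactly what one-hot labels forbid.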
GEM-1: Translating Identity into Function
GEM-1 uses rich multi-modal embeddings to overcome this learning gap. Instead of seeing a simple label, the model perceives a composite of biological features.
For molecules the model integrates:
Molecular Fingerprints: Encodings of the molecule’s substructural building blocks
Target Interaction Network: How the drug interfaces with the proteome
Topological Connectivity: The graph connectivity of the molecule’s atoms and bonds
For genetic perturbations the model integrates:
Amino Acid Sequence: Using pretrained protein LMs to encode identity
Protein-Protein Interaction Network: How the target interfaces with the proteome
Text Embedding: LLM-derived summary of literature knowledge about the target
These diverse sources of prior information are integrated into the perturbation-encoding sub-network, mapping them into a unified functional space. By operating in this high-dimensional feature space, GEM-1 doesn’t just recognize a drug’s name. Instead, it understands its biological “character,” allowing the model to accurately predict the effects of novel perturbations in contexts it has never seen before.
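The fusion step described above can be sketched as concatenating the per-modality embeddings and passing them through a learned projection. Everything here is a toy stand-in: the dimensions, the single linear layer, and the `tanh` nonlinearity are illustrative assumptions, not the actual perturbation-encoding sub-network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-modality embeddings for a single small molecule.
# Dimensions are illustrative, not GEM-1's actual sizes.
fingerprint = rng.standard_normal(16)  # substructure fingerprint
target_net = rng.standard_normal(8)    # target-interaction features
topology = rng.standard_normal(8)      # graph-connectivity features

# A sub-network fuses the complementary views into one functional
# embedding; a single learned linear projection stands in for it here.
fused = np.concatenate([fingerprint, target_net, topology])  # shape (32,)
w_proj = rng.standard_normal((fused.size, 12)) * 0.1         # learned in training
functional_embedding = np.tanh(fused @ w_proj)               # shape (12,)
print(functional_embedding.shape)
```

Because the projection is trained jointly with the prediction task, the fused embedding can weight each modality by how much it actually explains observed biology.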
We rigorously evaluated numerous embedding models against experimental benchmarks to identify the specific combination that offers the most complementary information. We found that combining complementary views of a perturbation outperformed any single embedding model.
During model training, GEM-1 further integrates these different sources of information, mapping them into a common functional latent space. This allows the model to build an understanding that goes beyond any single underlying embedding model.
Two drugs under development for cancer therapy, pladienolide B and E7820, provide a striking example of the emergent understanding within GEM-1. The two molecules have very different structures and target different proteins in cells, SF3B1 and RBM39, respectively. So it’s not surprising that the two drugs aren’t placed close to each other when single embeddings are considered alone. In GEM-1’s latent space, though, they are right next to each other! It turns out that this makes sense molecularly – SF3B1 and RBM39 are both important components of the RNA splicing machinery – as well as therapeutically, as pladienolide B and E7820 have been shown to be effective in similar cancer subtypes in preclinical studies. GEM-1 captures this functional similarity by integrating multiple data types, revealing that these perturbations belong together biologically even though their chemical structures and direct targets appear unrelated.
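This kind of neighborhood query is simple to express: rank all perturbations by cosine similarity to a query vector in the latent space. The latent vectors below are toy values invented for illustration, NOT real GEM-1 embeddings; only the ordering they produce mirrors the relationship described above:

```python
import numpy as np

def nearest(query, names, vectors):
    """Rank entries by cosine similarity to the query vector."""
    v = np.asarray(vectors)
    q = np.asarray(query)
    sims = v @ q / (np.linalg.norm(v, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(names[i], float(sims[i])) for i in order]

# Toy latent vectors (NOT real GEM-1 embeddings): two splicing-modulating
# drugs placed near each other, an unrelated mechanism placed far away.
names = ["pladienolide B", "E7820", "unrelated drug"]
latent = [[0.9, 0.1, 0.0],
          [0.8, 0.2, 0.1],
          [0.0, 0.1, 0.9]]

query = latent[0]  # who are pladienolide B's neighbors?
print(nearest(query, names, latent))  # E7820 ranks as the closest neighbor
```

In practice, the same query against the real latent space is what surfaces functional relationships that no single input embedding contains on its own.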

This post was authored by Greg Koytiger, VP of AI at Synthesize Bio.
Learn more
You can learn more about the GEM-1 model by reading the preprint and trying it out on our webapp.

