How Our AI Dataset Can Improve on Gold Standard Lab Datasets
How SYNTH-TEx, a featured AI dataset, captures real-world biology without the limitations of GTEx
Synthesize Bio’s goal is ambitious: to build a generative model that can accurately predict gene expression specific to any biological context. Earlier this year, we put our model through a major test: generating and evaluating a comprehensive collection of healthy human tissue gene expression profiles. It worked! The result is our very first publicly released AI dataset: SYNTH-TEx, the AI-generated tissue gene expression atlas. (By the way, it’s free to use!)
Improving Upon the Gold Standard
Building a general-purpose model for genomics requires immense amounts of data, but it also requires rigor. We intentionally excluded major, high-impact datasets like the foundational Genotype-Tissue Expression (GTEx) dataset from our model’s training. Why? Because GTEx is the gold standard for assessing how well a model has truly learned the nuanced signals of tissue- and sex-specific expression, and a benchmark is only meaningful if the model has never seen it.
However, even gold standard lab-generated datasets like GTEx have limitations. We identified two key areas where we knew an AI-generated dataset could actually improve upon the status quo:
1. The Sex Imbalance: The GTEx cohort has a male-to-female ratio of about 2:1, which can skew analyses or diminish statistical power for sex-specific findings.
2. The Sample Scarcity: Although GTEx is massive, many tissues within it still lack the deep sample coverage needed for robust statistical modeling.
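To see why a skewed sex ratio costs statistical power, consider a simple two-group comparison of mean expression. Under the simplifying assumptions of equal variance in both sexes and a fixed total sample count (both chosen for illustration, not measured properties of GTEx), the variance of the estimated male-female difference is proportional to 1/n_male + 1/n_female. The sketch below compares a 2:1 split with a balanced one; the total of 300 is hypothetical.

```python
def diff_variance_factor(n_male: int, n_female: int) -> float:
    """Variance of a difference in group means, in units of the per-sample variance."""
    return 1.0 / n_male + 1.0 / n_female


# Hypothetical total of 300 samples for one tissue, split two ways.
skewed = diff_variance_factor(200, 100)    # 2:1 male-to-female split
balanced = diff_variance_factor(150, 150)  # 1:1 split

# Fraction of the balanced design's precision that the skewed design retains.
relative_efficiency = balanced / skewed
print(f"{relative_efficiency:.3f}")  # 0.889
```

Under these assumptions, a 2:1 split retains roughly 89% of the precision of a balanced design with the same total. The penalty is steeper in practice when the smaller group is also small in absolute terms, which is where the two limitations above compound.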
Enter SYNTH-TEx: Generative Genomics for Data Balance & Depth
SYNTH-TEx was our answer to GTEx’s imbalance and scarcity issues. Instead of being limited by what was available in the tissue bank, we used our generative model to create AI samples that fill these gaps. The result is a more balanced resource: 100 samples per sex per tissue for 23 primary tissues, and 100 relevant samples for each of an additional 7 tissues.
Here’s how SYNTH-TEx improved the dataset:
Sex-Balanced Design: For tissues present in both sexes, we generated 100 samples per sex. This resolved the 2:1 male bias in GTEx, providing researchers with an equitable dataset for studying sex differences.
Increased Depth: Relative to GTEx, we increased the total number of samples for 15 out of 30 tissues, enriching the dataset for complex analyses.
This wasn’t just more data; it was better structured data, designed to push the boundaries of model validation and biological discovery.
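The balanced layout described above can be written down as a simple sample sheet. The following is a minimal sketch, not the pipeline used to build SYNTH-TEx: the tissue names are placeholders, the field names are hypothetical, and the sex field is left empty for the 7 additional tissues because this post does not specify how those samples were assigned.

```python
from collections import Counter
from itertools import product

SAMPLES_PER_GROUP = 100

# Placeholder tissue names; the real SYNTH-TEx tissue list is not reproduced here.
two_sex_tissues = [f"tissue_{i:02d}" for i in range(1, 24)]        # 23 tissues
single_group_tissues = [f"tissue_{i:02d}" for i in range(24, 31)]  # 7 tissues

design = []
for tissue, sex in product(two_sex_tissues, ("female", "male")):
    design.extend(
        {"tissue": tissue, "sex": sex, "sample_index": i}
        for i in range(SAMPLES_PER_GROUP)
    )
for tissue in single_group_tissues:
    # Sex is left unspecified here (an assumption of this sketch).
    design.extend(
        {"tissue": tissue, "sex": None, "sample_index": i}
        for i in range(SAMPLES_PER_GROUP)
    )

# Every (tissue, sex) group should contain exactly the same number of samples.
per_group = Counter((row["tissue"], row["sex"]) for row in design)
print(len(design))  # 23 * 2 * 100 + 7 * 100 = 5300
```

Fixing the group size up front is what makes the design balanced by construction: every tissue-by-sex cell has the same count, so downstream comparisons do not need to correct for unequal group sizes.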
From AI to Biology: Validating the AI-Generated Atlas
To validate that these AI-generated samples faithfully capture real-world biology, we conducted several comparisons against the GTEx lab-generated samples.
1. High-Level Clustering (UMAP)
We first mapped our SYNTH-TEx samples into the GTEx latent space using UMAP embeddings. The results were striking: the overlap of our AI-generated samples and the lab-generated GTEx samples in the correct tissue cluster was consistently high.
This showed that at a macro level, our model had successfully learned the foundational molecular signatures that define a tissue, whether it was brain, heart, or muscle.
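One way to put a number on "overlap in the correct tissue cluster" is nearest-neighbor label agreement: for each AI-generated sample, check whether its closest lab-generated neighbors in the shared embedding carry the same tissue label. This post does not state how overlap was measured, so the sketch below is an assumed metric rather than the analysis behind SYNTH-TEx. The function name and the toy coordinates are illustrative only; real coordinates would come from the UMAP projection.

```python
import math
from collections import Counter


def knn_tissue_agreement(reference, queries, k=3):
    """Fraction of query points whose k nearest reference points have, by
    majority vote, the same tissue label as the query.

    reference, queries: lists of ((x, y), tissue_label) pairs in one 2-D embedding.
    """
    hits = 0
    for point, label in queries:
        nearest = sorted(reference, key=lambda ref: math.dist(point, ref[0]))[:k]
        majority_label, _ = Counter(lbl for _, lbl in nearest).most_common(1)[0]
        hits += majority_label == label
    return hits / len(queries)


# Toy coordinates for illustration only; these are not SYNTH-TEx or GTEx values.
reference = [((0.0, 0.1), "brain"), ((0.2, 0.0), "brain"), ((0.1, 0.3), "brain"),
             ((5.0, 5.1), "heart"), ((5.2, 4.9), "heart"), ((4.8, 5.0), "heart")]
queries = [((0.1, 0.1), "brain"), ((5.1, 5.0), "heart"), ((4.9, 5.2), "heart")]

agreement = knn_tissue_agreement(reference, queries)
print(agreement)  # 1.0 for this toy example
```

A score near 1.0 means generated samples land among lab samples of the same tissue; a score computed per tissue would show which tissues the model places less reliably.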
2. Tissue- and Sex-Specific Expression
For a closer look, we examined the expression of known tissue-specific genes (like ACTA1 for muscle or APOB for liver) and sex-biased genes.
We compared the expression of the female-biased gene XIST and the male-biased gene CD99 side by side. Our model successfully learned that XIST expression is significantly higher in female samples across tissues, demonstrating its grasp of fundamental sex differences in the transcriptome.
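A common way to summarize a sex-biased gene is the log2 fold change between female and male samples. The sketch below is a minimal illustration with made-up expression values in arbitrary units. It is not the statistical test used for SYNTH-TEx, and the pseudocount of 1.0 is an assumed convention to avoid taking the log of zero.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 ratio of mean expression (group_a over group_b), with a pseudocount
    added to both means so that zero expression does not break the logarithm."""
    return math.log2((mean(group_a) + pseudocount) / (mean(group_b) + pseudocount))


# Toy expression values for illustration only; not taken from SYNTH-TEx or GTEx.
xist_female = [60.0, 64.0, 65.0]
xist_male = [0.0, 0.0, 0.0]

lfc = log2_fold_change(xist_female, xist_male)
print(lfc)  # 6.0; positive values mean higher expression in the first group
```

For a female-biased gene such as XIST, the value should be strongly positive in every tissue; for a male-biased gene, the same calculation should come out negative.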
The success of SYNTH-TEx validated a core aim of our work: AI that generates high-fidelity, biologically relevant genomic data. Such data does not merely mimic the training set; it supplements existing public resources in ways that fundamentally improve them.
The Next Chapters
The release of SYNTH-TEx was a big step: the debut demonstration of our generative genomics platform. If you’d like to learn more, you can download all the data and see how we analyzed it.
We are actively focused on refining our models by enhancing metadata curation, leveraging sophisticated knowledge graphs, and testing new model architectures. We expect our AI data to improve even more rapidly, offering data with ever-increasing accuracy and utility. We invite you to check out our preprint to learn more about how the model works.
SYNTH-TEx marked the beginning of our mission to demonstrate how our models can make great synthetic data. Stay tuned for more posts about our Featured AI Datasets: SYNTH-interferon, SYNTH-cancer, and SYNTH-sc-lung!
This post was authored by Alex Abbas, Director of Computational Biology at Synthesize Bio.






