Mapping how small-molecule perturbations reshape cellular states at single-cell resolution is essential for building predictive, context-aware biological models and for drug-response discovery. This dataset supplies paired single-cell expression profiles and endpoint growth-rate summaries designed specifically to train and evaluate models that predict how cells respond to pharmacological perturbations.
What Sets It Apart
- Scale and pairing: 1,831,648 individual cell profiles across 52 cancer cell lines and 91 treatment conditions, plus a complementary growth-rate summary for each (cell line, condition) pair. This pairing lets a model train on transcriptional states and phenotypic outcomes together, so it can learn molecular signatures that track measured sensitivity.
- Tabular, indexed metadata: Separate tables for expression_data, gene_metadata, cell_line_metadata, drug_metadata, and summary_statistics make it straightforward to join assays, map token IDs to gene symbols or Ensembl IDs, and filter by cell-line driver mutation or drug mechanism. The result is reproducible, queryable slices for benchmark tasks.
- Designed for high-throughput ML workflows: Data shards and streaming access via the Hugging Face datasets API let you iterate without downloading the entire archive. The gene vocabulary preserves Tahoe-100M token IDs and adds EmeraldBay-only genes, which keeps it compatible with prior Tahoe model vocabularies.
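As a rough illustration of the join-and-filter workflow described above, the sketch below decodes token IDs into gene symbols and keeps only cells from lines that carry a given driver mutation. Every field name here (`token_id`, `gene_symbol`, `cell_line_id`, `driver_mutations`, `genes`, `expressions`) and every toy row is an assumption made for illustration; check the actual column names against the dataset schema before adapting it. The function takes any iterable of records, so it should apply equally to an in-memory list or to a streamed split such as one returned by `datasets.load_dataset(..., streaming=True)`.

```python
from typing import Dict, Iterable, Iterator, List

# Toy stand-ins for the gene_metadata and cell_line_metadata tables.
# Column names are hypothetical, not taken from the real schema.
gene_metadata: List[Dict] = [
    {"token_id": 3, "gene_symbol": "TP53", "ensembl_id": "ENSG00000141510"},
    {"token_id": 7, "gene_symbol": "KRAS", "ensembl_id": "ENSG00000133703"},
]
cell_line_metadata: List[Dict] = [
    {"cell_line_id": "CL_A", "driver_mutations": ["KRAS"]},
    {"cell_line_id": "CL_B", "driver_mutations": ["TP53"]},
]

# Toy stand-in for expression_data: one record per cell, with parallel
# lists of gene token IDs and expression values (assumed layout).
expression_data: List[Dict] = [
    {"cell_line_id": "CL_A", "drug": "drug_x", "genes": [3, 7], "expressions": [2.0, 5.0]},
    {"cell_line_id": "CL_B", "drug": "drug_x", "genes": [7], "expressions": [1.0]},
]


def decode_cells(
    records: Iterable[Dict],
    gene_table: List[Dict],
    line_table: List[Dict],
    driver: str,
) -> Iterator[Dict]:
    """Yield cells from lines carrying `driver`, with token IDs mapped to symbols."""
    token_to_symbol = {g["token_id"]: g["gene_symbol"] for g in gene_table}
    keep = {c["cell_line_id"] for c in line_table if driver in c["driver_mutations"]}
    for rec in records:
        if rec["cell_line_id"] not in keep:
            continue
        # Unknown token IDs are kept under a placeholder name rather than dropped.
        symbols = [token_to_symbol.get(t, f"unknown_{t}") for t in rec["genes"]]
        yield {
            "cell_line_id": rec["cell_line_id"],
            "drug": rec["drug"],
            "expression": dict(zip(symbols, rec["expressions"])),
        }


kras_cells = list(decode_cells(expression_data, gene_metadata, cell_line_metadata, "KRAS"))
```

Because `decode_cells` is a generator, it processes one record at a time and never needs the full table in memory, which matches the streaming use case.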
Who It's For and Trade-offs
- Great fit if you are training or benchmarking models that predict perturbation effects from single-cell expression (e.g., conditional generative models, contrastive embeddings, drug-response regressors) or studying context-dependent gene function across cancer lineages. The paired growth-rate summaries support supervised phenotype prediction tasks and transfer learning experiments.
- Look elsewhere if you need primary patient-derived tissue data, longitudinal time series beyond the five-day endpoint, or only bulk RNA-seq. Emerald Bay is a five-day tumor-pool assay focused on cell-line panels and pharmacological perturbations, not clinical cohorts or multi-timepoint dynamics.
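For the supervised phenotype-prediction use case mentioned above, one plausible setup is to attach each cell's growth-rate label by looking up its (cell line, condition) pair in the summary table. The sketch below shows that lookup on toy data. The field names (`cell_id`, `cell_line_id`, `condition`, `growth_rate`) and all values are assumptions for illustration, not the dataset's documented schema.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Toy stand-in for summary_statistics: one growth-rate value per
# (cell line, condition) pair. Names and numbers are hypothetical.
summary_statistics: List[Dict] = [
    {"cell_line_id": "CL_A", "condition": "drug_x", "growth_rate": 0.42},
    {"cell_line_id": "CL_B", "condition": "drug_x", "growth_rate": 0.91},
]

# Toy stand-in for per-cell records (expression values omitted for brevity).
cells: List[Dict] = [
    {"cell_id": "c1", "cell_line_id": "CL_A", "condition": "drug_x"},
    {"cell_id": "c2", "cell_line_id": "CL_A", "condition": "drug_x"},
    {"cell_id": "c3", "cell_line_id": "CL_B", "condition": "drug_y"},
]


def attach_growth_rate(
    cell_records: Iterable[Dict], summary_rows: List[Dict]
) -> Iterator[Dict]:
    """Yield each cell with its pair-level growth rate; skip cells with no summary row."""
    lookup: Dict[Tuple[str, str], float] = {
        (row["cell_line_id"], row["condition"]): row["growth_rate"]
        for row in summary_rows
    }
    for cell in cell_records:
        key = (cell["cell_line_id"], cell["condition"])
        if key in lookup:
            # Copy the record so the input is left unmodified.
            yield {**cell, "growth_rate": lookup[key]}


labeled = list(attach_growth_rate(cells, summary_statistics))
```

All cells from the same (cell line, condition) pair share one label, so train/test splits are probably best made at the pair level rather than per cell to avoid leaking labels across the split.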
Where It Fits
Use this dataset to build and evaluate AI models that require both high sample counts and explicit treatment labels (dose, compound, combinations), or to extend Tahoe-100M–compatible vocabularies. It is particularly useful for drug-discovery ML pipelines, mechanistic signature discovery, and transfer experiments from cell-line models to other contexts.
