Provides 600,000 synthetic Vietnamese persona texts (100,000 records, 6 personas per record) aligned to Vietnam's 2024 census and surveys, for training and evaluating NLP and text-generation models. Includes 21 demographic and persona fields; licensed under CC BY 4.0; single train split.
Provides ~1M synthetic Salvadoran‑Spanish personas (148k records, ~300M tokens) grounded in 2024 census distributions for demographics, occupations and locations; intended for training/evaluating localized LLMs and synthetic-data workflows. CC BY 4.0, adults only.
Why this matters
Large-scale, regionally grounded persona data for Vietnamese is scarce; this release fills that gap by producing synthetic persona narratives aligned to Vietnam's official 2024 demographic sources. The dataset emphasizes realistic demographic distributions (age, sex, education, occupation, province) across six major provinces and supplies multiple persona types per record to increase conversational diversity in downstream models.
Great fit if you need synthetic, demographically grounded Vietnamese personas to augment training data, reduce sampling bias, or test model behavior across population slices (age, education, occupation, urban/rural, province). It is also useful for benchmarking Vietnamese text-generation and persona-conditioned response diversity.
Look elsewhere if you require fine-grained real personal identifiers (the dataset omits real names and sensitive enterprise fields), child personas (all personas are aged 18 or older), or fully public-source provenance for every seed record (the release mixes public statistics with proprietary Data Designer workflows). Also avoid using synthetic personas as direct substitutes for audited, consented human-subject data in high-stakes domains without additional review.
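To make the idea of testing across population slices (age, occupation, province) more concrete, here is a minimal sketch in plain Python. It runs on a few made-up rows rather than the real data. The field names (`age`, `province`, `occupation`), the age bands, and the sample values are all assumptions for illustration and may not match the dataset's actual schema.

```python
from collections import Counter

# Toy stand-ins for dataset records. Field names and values are
# hypothetical; check the real schema before adapting this.
records = [
    {"age": 23, "province": "Ha Noi", "occupation": "student"},
    {"age": 41, "province": "Da Nang", "occupation": "teacher"},
    {"age": 67, "province": "Ha Noi", "occupation": "retired"},
    {"age": 35, "province": "Ha Noi", "occupation": "engineer"},
]


def age_band(age: int) -> str:
    """Map an adult age to a coarse band (band edges are an arbitrary choice)."""
    if age < 30:
        return "18-29"
    if age < 60:
        return "30-59"
    return "60+"


def slice_counts(rows, key_fn):
    """Count how many rows fall into each slice defined by key_fn."""
    return Counter(key_fn(row) for row in rows)


# One-dimensional slice: records per age band.
by_band = slice_counts(records, lambda r: age_band(r["age"]))

# Two-dimensional slice: records per (province, age band) pair.
by_province_band = slice_counts(
    records, lambda r: (r["province"], age_band(r["age"]))
)

for key, count in sorted(by_province_band.items()):
    print(key, count)
```

The same `slice_counts` helper could be pointed at any other field, for example education or urban/rural status. Comparing slice sizes before evaluation helps flag groups that are too small to support reliable conclusions.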