Nemotron-Personas-El-Salvador

Provides ~1M synthetic Salvadoran‑Spanish personas (148k records, ~300M tokens) grounded in 2024 census distributions for demographics, occupations and locations; intended for training/evaluating localized LLMs and synthetic-data workflows. CC BY 4.0, adults only.

Introduction

Nemotron‑Personas‑El‑Salvador matters because a lack of regionally grounded persona data can amplify bias and degrade model behaviour in local deployments. This release gives model builders a census‑anchored synthetic population in Salvadoran Spanish, so they can condition outputs on realistic age, occupation, municipality and education distributions without exposing real PII.

What Sets It Apart
  • Census‑anchored generation: Personas are conditioned on the VII Census (2024) distributions for age, department/municipality, education, marital status and employment categories — enabling geographically and demographically realistic sampling for El Salvador.
  • High coverage & scale: Published as 148k parquet records (7 personas per record ≈ 1M personas), ~300M tokens total (≈161M persona tokens) with 25 fields (7 persona narratives + 18 contextual attributes) to support fine‑grained conditioning in training and evaluation.
  • Synthetic + provenance controls: Generated with NeMo Data Designer using a probabilistic graphical model plus an Apache 2.0-licensed LLM (openai/gpt-oss-120b) and validators; name distributions were used during generation, but name fields are not exposed, to reduce memorization and re‑identification risk.
  • Practical license & local focus: Released under CC BY 4.0 and built in collaboration with WideLabs and NVIDIA to support Sovereign AI efforts and localized model development.
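
The scale figures in the list above are rounded, so a quick arithmetic check can show how they relate to one another. The sketch below is illustrative only: the constants are the rounded numbers quoted in this listing, and the function name `scale_summary` is hypothetical rather than part of any dataset tooling.

```python
# Rounded figures quoted in this listing (not exact dataset counts).
RECORDS = 148_000             # parquet records
PERSONAS_PER_RECORD = 7       # persona narratives per record
TOTAL_TOKENS = 300_000_000    # ~300M tokens overall
PERSONA_TOKENS = 161_000_000  # ~161M tokens in persona narratives


def scale_summary(records, personas_per_record, total_tokens, persona_tokens):
    """Derive per-record and per-persona averages from the headline figures."""
    personas = records * personas_per_record
    return {
        "personas": personas,
        "tokens_per_record": total_tokens / records,
        "tokens_per_persona": persona_tokens / personas,
        "persona_token_share": persona_tokens / total_tokens,
    }


summary = scale_summary(RECORDS, PERSONAS_PER_RECORD, TOTAL_TOKENS, PERSONA_TOKENS)
for key, value in summary.items():
    print(f"{key}: {value:,.2f}")
```

With these rounded inputs, the result is about 1.04M personas, roughly 2,000 tokens per record and about 155 persona tokens per narrative; exact values depend on the unrounded counts.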

Who it's for, and tradeoffs

Great fit if you are training or evaluating LLMs for Salvadoran Spanish, building synthetic-data pipelines that require realistic demographic anchors, or researching bias mitigation and model collapse from synthetic corpora. Look elsewhere if you need labeled clinical/financial personas, under‑18 profiles, or an authoritative Salvadoran NLP corpus for dialectal speech generation: the persona narratives approximate Salvadoran Spanish rather than deriving from a verified local conversational corpus. Known limitations include underrepresentation of some Indigenous and Afrodescendant identities (small‑cell suppression), omission of religion, and potential residual gender‑role artifacts from the narrative LLM. Use responsibly: check the dataset against your downstream regulatory and privacy requirements before production deployment.

Information

  • Website: huggingface.co
  • Authors: Rodrigo Malossi, Andre Manoel, Shyamala Prayaga, Ashton Sharabiani, Evan Acharya, Bardiya Sadeghi, Will Jennings, Dane Corneil, Yev Meyer, NVIDIA, WideLabs
  • Published date: 2026/06/03
