AIAny - Nemotron-Personas-Vietnam

Introduction

Why This Matters

Large-scale, regionally grounded persona data in Vietnamese is scarce. This release fills that gap with synthetic persona narratives aligned to Vietnam's official 2024 demographic sources. The dataset emphasizes realistic demographic distributions (age, sex, education, occupation, province) across six major provinces and supplies multiple persona types per record to increase conversational diversity in downstream models.

What Sets It Apart

Census-grounded synthesis: persona attributes are generated to match distributions from Vietnam's Population & Housing Census 2024 and the Vietnam Household Living Standards Survey (VHLSS) 2024, so demographic coverage reflects recent official statistics rather than generic web crawls. This is useful when you need regionally representative behavior priors.
Multi-persona per record: each record contains six persona variants (professional, sports, arts, travel, culinary, and a concise persona), enabling augmentation strategies that preserve contextual demographic fields while varying persona voice and intent.
Auditability & reproducibility: produced with a NeMo Data Designer pipeline and a probabilistic graphical model augmented by the SaoLa4-Small component, with an explicit schema (21 fields) and a single train split (100k records). The fixed schema and single split make it straightforward to sample, filter, or integrate the data into training pipelines.
Licensing & scope clarity: the CC BY 4.0 license and the explicit exclusion of enterprise-only fields (e.g., names and personality-trait details) make reuse for research and commercial model training straightforward while making the release's limitations clear.
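
As a rough illustration of the multi-persona augmentation described above, the sketch below expands one record into one example per persona variant while copying the demographic context unchanged. The column names (`professional_persona`, `culinary_persona`, `province`, and so on) are assumptions made for this example, not the confirmed schema; check the released field list before adapting it.

```python
from typing import Any

# Hypothetical column names, assumed for illustration only.
# Replace them with the names in the released 21-field schema.
PERSONA_FIELDS = [
    "professional_persona",
    "sports_persona",
    "arts_persona",
    "travel_persona",
    "culinary_persona",
    "persona",  # the concise persona
]
CONTEXT_FIELDS = ["age", "sex", "education", "occupation", "province"]


def expand_record(record: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one example per persona variant, each carrying the same context."""
    # Copy only the demographic fields that are actually present.
    context = {k: record[k] for k in CONTEXT_FIELDS if k in record}
    examples = []
    for field in PERSONA_FIELDS:
        text = record.get(field)
        if not text:
            continue  # skip variants that are missing or empty
        examples.append({**context, "persona_type": field, "persona_text": text})
    return examples
```

Each output row keeps the same age, sex, education, occupation, and province, so the persona voice varies while the demographic slice stays fixed.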

Who It's For & Trade-offs

Great fit if you need synthetic, demographically grounded Vietnamese personas to augment training data, reduce sampling bias, or test model behavior across population slices (age, education, occupation, urban/rural, province). It is also useful for benchmarking Vietnamese text-generation and persona-conditioned response diversity.
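
One way to test behavior across population slices is to group records by a demographic key and draw a fixed number from each group. The following is a minimal sketch under stated assumptions: the `age` and `province` keys and the age-band boundaries are hypothetical choices for this example, not part of the release.

```python
import random
from collections import defaultdict
from typing import Any


def age_band(age: int) -> str:
    """Map an age to a coarse band; the boundaries are an arbitrary assumption."""
    if age < 18:
        raise ValueError("the dataset covers ages 18+ only")
    if age < 30:
        return "18-29"
    if age < 45:
        return "30-44"
    if age < 60:
        return "45-59"
    return "60+"


def sample_per_slice(
    records: list[dict[str, Any]], k: int, seed: int = 0
) -> dict[tuple[str, str], list[dict[str, Any]]]:
    """Draw up to k records from each (age band, province) slice."""
    rng = random.Random(seed)  # seeded so the draw is repeatable
    slices: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
    for rec in records:
        slices[(age_band(rec["age"]), rec["province"])].append(rec)
    # Iterate slices in sorted key order so results do not depend on input order of slices.
    return {
        key: rng.sample(slices[key], min(k, len(slices[key])))
        for key in sorted(slices)
    }
```

Capping each slice at `k` records gives small slices the same weight as large ones in an evaluation set, which is a deliberate departure from the census-matched proportions.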

Look elsewhere if you require: fine-grained real personal identifiers (the dataset omits real names and sensitive enterprise fields), child personas (only ages 18+), or fully public-source provenance for every seed record (the release mixes public statistics with proprietary Data Designer workflows). Also avoid using synthetic personas as direct substitutes for audited, consented human subject data in high-stakes domains without additional review.
