Nemotron-Personas-Belgium offers a census-grounded, multilingual corpus of synthetic personas designed to make model training and evaluation more representative of Belgium’s demographic, geographic and linguistic reality. Rather than naïvely sampling fictional profiles, the dataset conditions persona attributes on official municipal- and census-level statistics so downstream models can be exposed to realistic age, region, household, education and occupation distributions without using any real PII.
What sets it apart
- Census and municipal grounding: persona attributes are aligned to Belgium’s 2021 Census and updated population/name statistics, covering all 581 communes and the three regions (Flanders, Brussels, Wallonia) and four language areas. This reduces common coverage gaps (e.g., rural, older cohorts) present in many persona corpora.
- Rich contextual conditioning: each record includes 23 fields (6 persona narratives + 16 contextual fields + uuid), enabling targeted conditioning and selection (e.g., by municipality, education level, occupation) for fairness testing, domain simulation or agent behaviour diversity experiments.
- Multilingual delivery and scale: 1.2M records across four localized splits (nl_BE, fr_BE, de_BE, en_BE), with 1.8M persona texts per split (300k records × 6 narratives) and ~1.9B tokens total. Every persona is translated into all four languages to support multilingual model development and evaluation.
- Privacy-aware generation: names are used in the generation pipeline (28k first names, 9.5k family names) and cultural-heritage priors inform realism, but individual name fields are not exposed in the released dataset to mitigate memorization and re-identification risk.
- Production tooling: produced with NeMo Data Designer using a proprietary Probabilistic Graphical Model plus the Apache-2.0-licensed google/gemma-4-31B-it model and validators; developed in collaboration with Pleias and KU Leuven.
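
The targeted selection described in the list above (by municipality, education level or occupation) can be sketched in a few lines. This is a minimal illustration only: the field names `municipality`, `education_level` and `occupation`, and the four toy records, are assumptions made for the example and are not taken from the released schema.

```python
from collections import Counter

# Toy placeholder records. Field names and values are hypothetical;
# check the released schema before relying on any of them.
records = [
    {"uuid": "r1", "municipality": "Gent", "education_level": "secondary", "occupation": "nurse"},
    {"uuid": "r2", "municipality": "Gent", "education_level": "tertiary", "occupation": "teacher"},
    {"uuid": "r3", "municipality": "Namur", "education_level": "tertiary", "occupation": "nurse"},
    {"uuid": "r4", "municipality": "Eupen", "education_level": "secondary", "occupation": "baker"},
]


def select(rows, **criteria):
    """Return the rows whose fields equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def share_by(rows, field):
    """Return each value's share of the rows for one contextual field."""
    counts = Counter(r.get(field) for r in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Select one municipality, then inspect how education levels are distributed.
gent = select(records, municipality="Gent")
print([r["uuid"] for r in gent])             # ['r1', 'r2']
print(share_by(records, "education_level"))  # {'secondary': 0.5, 'tertiary': 0.5}
```

Comparing `share_by` output for a selected subset against the full split is one simple way to check whether a sample over- or under-represents a group before using it in a fairness test.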
Who it’s for — and tradeoffs
Great fit if you need census-aligned, multilingual synthetic personas for training or evaluating LLMs, agent systems, fairness audits or Sovereign AI pipelines where privacy and regional fidelity matter. It’s also suitable for generating multi-turn persona-driven conversation data and stress-testing downstream services across Belgian municipalities.
Look elsewhere or supplement if you require exposed real names, income or fine-grained personality trait fields (these are intentionally excluded), religion modelling, or domain-specific enterprise personas for regulated sectors (finance/healthcare). NVIDIA invites enterprise inquiries for extended variants.
Known limitations include machine-translated non-English splits (translation quality may vary), a small German-speaking portion (~1%), municipality names kept in a single canonical form rather than per-language localization, and some necessary independence assumptions in the generative model.
Practical notes
- License: CC BY 4.0 (commercial use permitted with attribution).
- Data shape: 4 splits × 300k records (nl_BE, fr_BE, de_BE, en_BE), 23 fields, Parquet format (~4.0 GB).
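
As a quick sanity check, the shape figures quoted above can be encoded and verified against a loaded split. The sketch below uses only the numbers stated in this overview. The helper `check_split` is hypothetical, and how you obtain the row and column counts (for example from a Parquet reader) is left to your own tooling.

```python
# Figures quoted in this overview.
SPLITS = ["nl_BE", "fr_BE", "de_BE", "en_BE"]
RECORDS_PER_SPLIT = 300_000
PERSONA_FIELDS = 6    # persona narratives per record
CONTEXT_FIELDS = 16   # contextual fields per record
ID_FIELDS = 1         # uuid

total_records = len(SPLITS) * RECORDS_PER_SPLIT
fields_per_record = PERSONA_FIELDS + CONTEXT_FIELDS + ID_FIELDS
persona_texts_per_split = RECORDS_PER_SPLIT * PERSONA_FIELDS


def check_split(num_rows, num_columns):
    """Return a list of mismatches between a loaded split and the quoted shape."""
    problems = []
    if num_rows != RECORDS_PER_SPLIT:
        problems.append(f"expected {RECORDS_PER_SPLIT} rows, got {num_rows}")
    if num_columns != fields_per_record:
        problems.append(f"expected {fields_per_record} columns, got {num_columns}")
    return problems


print(total_records)            # 1200000
print(fields_per_record)        # 23
print(persona_texts_per_split)  # 1800000
print(check_split(300_000, 23)) # []
```

An empty list from `check_split` means the loaded split matches the quoted shape; any entries point to a partial download or a schema that differs from this overview.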
