Many clinical text-generation pipelines fail upstream because the patient behind the text is underspecified. This release supplies 150,000 machine-generated Vietnamese personas designed as a persona-first conditioning layer for downstream note generation, dialogue generation, and triage simulation. Each persona carries rich demographics and narrative context, which helps reduce brittle, context-light outputs.
What Sets It Apart
- Persona-first schema: dense coverage for demographics, healthcare behavior, and LLM-facing narrative fields so prompts get realistic social and cultural anchors rather than isolated symptoms.
- Scale and variety: 150,000 rows spanning the full life course, regional dialect clusters, long-tail symptom distributions, and varied socioeconomic contexts, so generation can be stress-tested against realistic edge cases.
- Prompt-ready fields and metadata: each row includes a chief complaint, HPI-style narrative snippets, social-barrier cues, and generation metadata (seeds, model IDs) to support reproducible scenario pipelines.
- Purposeful limits: medication and deep medical-history fields are intentionally lighter. These personas provide anchoring context for synthetic generation; they are not substitutes for clinical charts.
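As a rough illustration of the persona-first conditioning described above, the sketch below renders one persona row as a prompt preamble. The column names (`age`, `region`, `occupation`, `chief_complaint`, `hpi_narrative`, `social_barriers`), the example row, and the task wording are all assumptions made for this example and are not taken from the released schema; check the actual field names before adapting it.

```python
from typing import Any, Mapping

# Hypothetical column names: the real schema may use different ones.
PROMPT_FIELDS = (
    "age",
    "region",
    "occupation",
    "chief_complaint",
    "hpi_narrative",
    "social_barriers",
)


def build_persona_prompt(persona: Mapping[str, Any]) -> str:
    """Render a persona row as a conditioning preamble, skipping empty fields."""
    lines = ["Patient persona:"]
    for field in PROMPT_FIELDS:
        value = persona.get(field)
        if value is None or value == "":
            continue  # leave out fields the row does not populate
        lines.append(f"- {field.replace('_', ' ')}: {value}")
    # Illustrative task instruction; swap in whatever the pipeline needs.
    lines.append("Task: write the intake note for this patient in Vietnamese.")
    return "\n".join(lines)


# Made-up row for demonstration only; it is not drawn from the dataset.
example = {
    "age": 62,
    "region": "Mekong Delta",
    "occupation": "rice farmer",
    "chief_complaint": "persistent cough for three weeks",
    "social_barriers": "",
}
prompt = build_persona_prompt(example)
print(prompt)
```

Keeping the field list in one tuple makes it easy to ablate individual anchors (for example, dropping `social_barriers`) when stress-testing how much each one changes the generated output.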
Who It's For and Trade-offs
Great fit if you build synthetic doctor–patient consultations, intake/HPI note generators, triage simulations, or prompt-stress QA pipelines for Vietnamese clinical workflows. Look elsewhere if you need real-world prevalence estimates, clinical-grade patient records, or a dataset cleared for unrestricted commercial healthcare use. The release is machine-generated and can contain implausible or biased combinations, so apply QA and domain review before downstream use. It is licensed CC-BY-NC-4.0, which does not permit commercial use.
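One lightweight way to start the QA pass mentioned above is a rule-based plausibility check run before any persona reaches a generation prompt. The sketch below is only an assumption about what such a check might look like: the field names, the age bounds, and the individual rules are hypothetical, and a real pipeline would pair rules like these with domain review.

```python
from typing import Any, List, Mapping


def flag_persona(row: Mapping[str, Any]) -> List[str]:
    """Return a list of plausibility flags for one persona row (empty if none)."""
    flags: List[str] = []

    # Hypothetical rule: age must be present and within a broad human range.
    age = row.get("age")
    if not isinstance(age, (int, float)) or not 0 <= age <= 110:
        flags.append("age_missing_or_out_of_range")
    # Hypothetical rule: a young child should not be labeled as retired.
    elif age < 15 and row.get("occupation") == "retired":
        flags.append("child_marked_retired")

    # Hypothetical rule: prompt-ready rows need a non-empty chief complaint.
    if not row.get("chief_complaint"):
        flags.append("missing_chief_complaint")

    return flags


# Made-up rows for demonstration only.
rows = [
    {"age": 40, "occupation": "teacher", "chief_complaint": "headache"},
    {"age": 8, "occupation": "retired", "chief_complaint": "fever"},
    {"age": 130},
]
for row in rows:
    print(flag_persona(row))
```

Flagged rows can be routed to manual review rather than dropped outright, which keeps a record of which implausible or biased combinations the generator tends to produce.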
