Most multimodal safety benchmarks are English-centric or focus on generic hazards. KSAFE-MM instead centers Korean cultural and institutional contexts so that evaluations reflect real-world local risks. The dataset stresses that model safety failures often arise not from raw toxicity but from missing local knowledge, culturally grounded cues, and visual-contextual interplay, any of which can enable bypasses or harmful outputs.
Key Findings
- Two-part design: KSAFE-MM-G transforms globally shared safety queries into Korean-grounded multimodal samples; KSAFE-MM-C uses in-the-wild images and localized visual cues combined with jailbreak-style textual intents to probe culture-dependent vulnerabilities.
- Reveals asymmetric failure modes: some models show high attack success rates on culturally tailored prompts, while others refuse benign inputs excessively, which indicates a tradeoff between vulnerability and over-sensitivity.
- Dataset construction emphasizes semantic alignment and privacy filtering: image–query pairs were selected from diverse web sources, de-duplicated, and filtered to avoid references to identifiable individuals or companies.
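The two failure modes above, attack success on harmful prompts and excessive refusal on benign ones, can be reported as a pair of rates per model. The following is a minimal sketch under stated assumptions: the `EvalRecord` fields, the function names, and the binary `complied` judgment are all hypothetical and only illustrate the arithmetic. The source does not describe how KSAFE-MM itself scores responses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class EvalRecord:
    """One judged model response (hypothetical schema, not from KSAFE-MM)."""
    is_harmful_prompt: bool  # True for attack samples, False for benign ones
    complied: bool           # True if the model answered instead of refusing


def attack_success_rate(records: Iterable[EvalRecord]) -> Optional[float]:
    """Fraction of harmful prompts the model complied with."""
    harmful = [r for r in records if r.is_harmful_prompt]
    if not harmful:
        return None  # undefined when there are no harmful prompts
    return sum(r.complied for r in harmful) / len(harmful)


def over_refusal_rate(records: Iterable[EvalRecord]) -> Optional[float]:
    """Fraction of benign prompts the model refused."""
    benign = [r for r in records if not r.is_harmful_prompt]
    if not benign:
        return None  # undefined when there are no benign prompts
    return sum(not r.complied for r in benign) / len(benign)


# Illustrative, made-up judgments for a single model.
sample = [
    EvalRecord(is_harmful_prompt=True, complied=True),
    EvalRecord(is_harmful_prompt=True, complied=True),
    EvalRecord(is_harmful_prompt=True, complied=True),
    EvalRecord(is_harmful_prompt=True, complied=False),
    EvalRecord(is_harmful_prompt=False, complied=True),
    EvalRecord(is_harmful_prompt=False, complied=False),
]

asr = attack_success_rate(sample)
orr = over_refusal_rate(sample)
print(f"attack success rate: {asr}, over-refusal rate: {orr}")
```

Reporting both numbers side by side makes the tradeoff visible: a model that drives one rate down by refusing everything, or by answering everything, will show it in the other.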
Who it's for and tradeoffs
Great fit if you evaluate multimodal LLM safety for non-English markets, build culturally robust moderation or alignment layers, or research localized attack vectors. Look elsewhere if you only need generic, English-only toxicity benchmarks or lightweight synthetic tests. KSAFE-MM is designed for contextual, in-the-wild evaluation, so using it means handling image hosting, language-specific annotation, and culturally informed judgment when interpreting results.
