AIAny - Russian PII NER Benchmark

High-quality PII detection in Russian needs both a fine-grained label set and realistic noisy examples; this dataset supplies both. By combining sanitized production logs with synthetic document templates and manually filtered hard negatives, it stresses models on real-world name/address patterns and the various formats and typos of Russian identity/document numbers.

What sets it apart

21 fine-grained PII types mapped to practical coarse groups (PERSON, LOCATION, RU_DOC_ID, PASSPORT, etc.), enabling both detailed and aggregate evaluation so teams can choose task granularity that matches their guardrails.
Realistic mix: 2,841 sentences and 5,614 entity spans drawn from production-like logs (with real data replaced), synthetic document-style texts, and hard negatives. This makes it useful for measuring both recall on structured identifiers and robustness to borderline cases.
Token-level BIO annotation, evaluated with span-level micro-F1 under overlap matching. The evaluation supports adjacent-span merging and best-threshold reporting, which matches what typical NER/PII pipelines (detection followed by redaction) need.
Includes a comparative zero-shot evaluation of general NER systems (e.g., GLiNER variants) showing strong gains when prompting in Russian and highlighting that many general models still miss document-specific types.
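To make the evaluation setup above concrete, here is a minimal sketch of decoding BIO tags into spans, merging adjacent spans, and computing a micro-averaged span-level F1 with overlap matching. The benchmark's exact matching and merging rules are not specified here, so this assumes that a predicted span counts as correct when it shares at least one token with a gold span of the same label, and that "adjacent" means same-label spans that touch or overlap. All function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode one sentence's BIO tags into (start, end, label) spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        # A stray I- tag with no open span is leniently treated as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def merge_adjacent(spans: List[Span]) -> List[Span]:
    """Merge same-label spans that touch or overlap (assumed merging rule)."""
    merged: List[Span] = []
    for start, end, label in sorted(spans, key=lambda s: (s[2], s[0], s[1])):
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            prev = merged.pop()
            merged.append((prev[0], max(prev[1], end), label))
        else:
            merged.append((start, end, label))
    return sorted(merged)


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a label and at least one token."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def micro_overlap_f1(gold_sents: List[List[Span]], pred_sents: List[List[Span]]) -> float:
    """Pool match counts over all sentences, then compute a single F1."""
    tp_pred = tp_gold = n_pred = n_gold = 0
    for gold, pred in zip(gold_sents, pred_sents):
        tp_pred += sum(any(overlaps(p, g) for g in gold) for p in pred)
        tp_gold += sum(any(overlaps(g, p) for p in pred) for g in gold)
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp_pred / n_pred if n_pred else 0.0
    recall = tp_gold / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


gold = [bio_to_spans(["B-PERSON", "I-PERSON", "O", "B-LOCATION"])]
pred = [bio_to_spans(["O", "B-PERSON", "O", "O"])]
f1 = micro_overlap_f1(gold, pred)  # precision 1.0, recall 0.5
```

Overlap matching is more forgiving than exact-boundary matching, so scores computed this way are not directly comparable with exact-match NER results.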

Who it's for and trade-offs

Great fit if you are building or evaluating Russian-language PII detection, anonymization, or guardrail systems and need a labelled testbed that includes both realistic user-text patterns and structured-document number variations. It helps benchmark recall on names, addresses, contacts, and Russian document IDs (passport, SNILS, INN, OMS, etc.).

Look elsewhere if you need large-scale training corpora (this is a benchmark-sized test set of ~2.8k sentences) or multi-language coverage. The dataset is Russian-focused and intended primarily for evaluation and comparative benchmarking, not for large-scale supervised training.

Where it fits

Use this dataset as a held-out evaluation suite for redaction/PII-detection pipelines, for threshold selection and robustness testing, or to compare zero-shot/multilingual NER approaches against a Russia-specific taxonomy. Expect to evaluate both coarse categories (PERSON + LOCATION) for broad comparability and the fine-grained document-ID labels when measuring domain-specific recall and false positives.
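For the threshold-selection use case, one possible approach is a simple sweep over confidence cut-offs, sketched below. It assumes the model emits spans with a confidence score and uses the same overlap criterion as the benchmark's span-level F1. The tuple layouts and function names are hypothetical, and the spans are assumed to be offsets into a single pooled text.

```python
from typing import Iterable, List, Tuple

Gold = Tuple[int, int, str]         # (start, end_exclusive, label)
Pred = Tuple[int, int, str, float]  # gold fields plus a confidence score


def f1_at(gold: List[Gold], preds: List[Pred], threshold: float) -> float:
    """Overlap-matched F1 using only predictions scored at or above threshold."""
    kept = [p for p in preds if p[3] >= threshold]

    def hit(a, b) -> bool:
        # Same label and at least one shared position.
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    tp_pred = sum(any(hit(p, g) for g in gold) for p in kept)
    tp_gold = sum(any(hit(g, p) for p in kept) for g in gold)
    if not kept or not gold or tp_pred == 0:
        return 0.0
    precision, recall = tp_pred / len(kept), tp_gold / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_threshold(gold: List[Gold], preds: List[Pred], grid: Iterable[float]) -> Tuple[float, float]:
    """Return (threshold, f1) with the highest F1; ties go to the lower threshold."""
    return max(((t, f1_at(gold, preds, t)) for t in sorted(grid)), key=lambda x: x[1])


gold = [(0, 2, "PERSON"), (5, 6, "LOCATION")]
preds = [(0, 2, "PERSON", 0.9), (3, 4, "PERSON", 0.4), (5, 6, "LOCATION", 0.6)]
threshold, score = best_threshold(gold, preds, [0.3, 0.5, 0.7])
```

A threshold chosen on the same data it is reported on gives an optimistic score, so it is worth treating best-threshold numbers as an upper bound when comparing models.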

Practical tip: many off-the-shelf NER models omit Russia-specific document categories. Run coarse-scope evaluations (PERSON + LOCATION) to get cross-model baselines, and cover RU document IDs separately with pattern-based detectors or specialized prompts/labels.
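As an illustration of a pattern-based detector for one document type, the sketch below finds SNILS candidates with a regular expression and checks their two control digits. It is not part of the dataset or its tooling. The regex, the accepted separators, and the function names are assumptions, and the checksum follows the commonly documented SNILS rule (weights 9 down to 1 over the first nine digits, taken modulo 101, with 100 mapped to 0).

```python
import re
from typing import List, Tuple

# Candidate pattern (assumed): 11 digits, optionally grouped as XXX-XXX-XXX YY.
SNILS_RE = re.compile(r"(?<!\d)\d{3}[- ]?\d{3}[- ]?\d{3}[- ]?\d{2}(?!\d)")


def snils_checksum_ok(digits: str) -> bool:
    """Validate the two control digits of an 11-digit SNILS string."""
    if len(digits) != 11 or not digits.isdigit():
        return False
    # Weighted sum of the first nine digits with weights 9, 8, ..., 1.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(9, 0, -1)))
    control = total % 101
    if control == 100:
        control = 0
    return control == int(digits[9:])


def find_snils(text: str) -> List[Tuple[int, int, str, bool]]:
    """Return (start, end, matched_text, checksum_valid) for each candidate."""
    results = []
    for match in SNILS_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        results.append((match.start(), match.end(), match.group(), snils_checksum_ok(digits)))
    return results


hits = find_snils("СНИЛС: 112-233-445 95 и 112-233-445 96")
```

Because the benchmark deliberately includes typos in document numbers, a candidate that fails the checksum may still be PII. It is probably safer to treat the checksum as a confidence signal than as a hard filter.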
