AIAny - Open-PerfectBlend

Why this matters Open-PerfectBlend packs diverse instruction-following data into a single public dataset to support supervised fine-tuning and RLHF-style experiments. By mixing chat, math word problems, code-oriented examples and instruction-response pairs from several existing collections, it gives researchers a broad training mixture without relying on a single source.

What Sets It Apart

Multi-source mixture: combines large public datasets (e.g., MetaMathQA, UltraInteract_sft, ultrachat, orca-math, ultrafeedback, evol-codealpaca, AutoIF, and ShareGPT-derived preference data) to produce ~1.42M train examples after deduplication. This yields wider task coverage (conversational, mathematical reasoning, coding, instruction following).
Licensing and format: distributed under Apache-2.0, stored in parquet and intended for use with Hugging Face Datasets and common processing stacks.
Practical notes: the published split contains precise counts and some known data-quality issues (a discussion noted ~62k rows with no assistant response in the train split and a deduplication step that removed ~88.1k samples). A separate decontaminated variant exists that removes a small number of contaminated documents.

Who It's For and Tradeoffs

Great fit if you need a single, license-clear mixture for instruction tuning or RLHF research and want varied example types (chat, math, code). Look elsewhere if you require guaranteed per-source provenance at the example level, fully cleaned assistant-only rows out of the box, or a dataset that includes the paper's withheld "harmful intent" category—the original reproducer documented differences and remaining prompt-only rows. Consider running your own decontamination and assistant-response filters before assistant-only SFT.

Where It Fits

Use it as a broad pre-fine-tuning mixture or as part of a training corpus ensemble for LLM instruction alignment. For very strict benchmarking or production deployments, augment with focused, high-quality task-specific datasets and explicit provenance/cleaning steps.

Open-PerfectBlend

Introduction

What Sets It Apart

Who It's For and Tradeoffs

Where It Fits

Information

Categories

Tags

More Items

olmOCR-bench

Vāgdhenu — Sanskrit Chant Corpus

AFTER