Overview
DataFlow is a data-centric AI system and open-source toolkit designed to automate and standardize the pipeline of preparing high-quality training and retrieval data for LLMs and downstream tasks. Built around a modular "operator" abstraction, DataFlow lets users combine rule-based methods, deep models, LLMs and external APIs into reusable operators and then compose those operators into pipelines suited for different data-engineering needs.
Core components
- Operators: DataFlow provides a rich operator taxonomy (reported as 80+ generic operators, 40+ domain-specific operators, and roughly 20 evaluation operators). Each operator accepts structured input (JSON/JSONL/CSV) and emits processed, higher-quality data. Operators cover tasks such as text cleaning, QA extraction, chain-of-thought synthesis, difficulty and category estimation, NL2SQL generation, and evaluation metrics.
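The operator abstraction can be illustrated with a minimal sketch: a function that maps a list of JSON-like records to a cleaned list of records. The interface below (`clean_text_operator`, the `text` field) is a hypothetical stand-in for illustration only; DataFlow's actual operator API may differ, so consult the documentation for the real base classes.

```python
def clean_text_operator(records):
    """Hypothetical rule-based cleaning operator (not DataFlow's real API):
    strips whitespace, drops empty texts, and removes case-insensitive
    duplicates from a list of JSON-like records."""
    seen = set()
    out = []
    for rec in records:
        text = rec.get("text", "").strip()
        if not text or text.lower() in seen:
            continue  # skip empty or duplicate entries
        seen.add(text.lower())
        out.append({**rec, "text": text})
    return out

raw = [{"text": "  Hello world "}, {"text": "hello world"}, {"text": ""}]
print(clean_text_operator(raw))  # → [{'text': 'Hello world'}]
```

The same record-in, record-out shape is what lets rule-based, model-based, and LLM-based operators be swapped and chained interchangeably.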
- Pipelines: Operators are composed into ready-to-use pipelines, including a Text Pipeline (mine QA pairs from plain text), a Reasoning Pipeline (extend QA with chain-of-thought, difficulty, and category labels), a Text2SQL Pipeline, a Knowledge-Base Cleaning Pipeline (extract and structure knowledge from PDFs, tables, and documents), and an Agentic RAG Pipeline (identify QA pairs that require external knowledge). These pipelines are configurable and intended for SFT, pretraining data filtering, RL training-data generation, and RAG knowledge preparation.
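Because every operator shares the records-in, records-out shape, composing them into a pipeline reduces to function chaining. The sketch below shows the composition idea with toy stand-in operators; the names (`compose`, `strip_ws`, `drop_short`) are illustrative assumptions, not DataFlow's pipeline API.

```python
def compose(*operators):
    """Chain operators into a pipeline: each operator maps a list of
    records to a list of records. A sketch of the composition idea only,
    not DataFlow's actual pipeline class."""
    def pipeline(records):
        for op in operators:
            records = op(records)
        return records
    return pipeline

# Toy stand-in operators (hypothetical, for illustration)
strip_ws = lambda recs: [{**r, "text": r["text"].strip()} for r in recs]
drop_short = lambda recs: [r for r in recs if len(r["text"]) >= 5]

text_pipeline = compose(strip_ws, drop_short)
print(text_pipeline([{"text": " hi "}, {"text": " long enough "}]))
# → [{'text': 'long enough'}]
```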
- DataFlow Agent: An agentic component that can analyze data needs, write or select operators, and dynamically orchestrate them into pipelines. The agent enables on-demand pipeline synthesis by recombining existing operators to meet specific objectives.
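The recombination idea behind the agent can be sketched as selecting and ordering operators from a registry according to a stated objective. Everything here (the registry contents, the objective names, the `synthesize_pipeline` function) is a toy assumption for illustration; DataFlow's real agent uses LLM-driven planning rather than a lookup table.

```python
# Toy registry of named operators (simplified stand-ins, not DataFlow's)
REGISTRY = {
    "clean": lambda recs: [r for r in recs if r.get("text")],
    "dedup": lambda recs: list({r["text"]: r for r in recs}.values()),
    "qa_extract": lambda recs: [{"question": r["text"], "answer": ""} for r in recs],
}

def synthesize_pipeline(objective):
    """Pick and order operators for a stated objective. A toy sketch of
    on-demand pipeline synthesis; the real agent plans dynamically."""
    plans = {
        "sft_qa": ["clean", "dedup", "qa_extract"],
        "pretrain_filter": ["clean", "dedup"],
    }
    steps = [REGISTRY[name] for name in plans[objective]]
    def run(records):
        for step in steps:
            records = step(records)
        return records
    return run

pipe = synthesize_pipeline("pretrain_filter")
print(pipe([{"text": "a"}, {"text": "a"}, {"text": ""}]))  # → [{'text': 'a'}]
```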
Usage and integration
- Packaging & install: DataFlow is published to PyPI as open-dataflow. The project supports Python >= 3.10. Quick install is pip install open-dataflow. A dataflow -v command is provided for version checks.
- Deployment: The project provides a Dockerfile and a prebuilt image (molyheci/dataflow:cu124) that bundles CUDA 12.4 and vLLM for GPU-accelerated local inference. Colab demos are available for immediate experimentation.
- Extensibility: Users can add custom operators, assemble bespoke pipelines, and integrate their own LLMs or inference backends (local or API-based). The modular design supports mixing rule-based, model-based, and LLM-based processing.
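One way to picture mixing rule-based and model-based processing is an operator factory that wraps any scoring function, whether a heuristic, a local model, or an LLM API call, into a filtering step. The factory name `quality_filter` and the toy length-based scorer are assumptions for illustration, not part of DataFlow's API.

```python
def quality_filter(score_fn, threshold=0.5):
    """Hypothetical factory for a model-based filtering operator.
    score_fn could wrap a local model or an LLM API; any callable
    mapping text to a score in [0, 1] fits the same slot."""
    def op(records):
        return [r for r in records if score_fn(r["text"]) >= threshold]
    return op

# Toy heuristic scorer standing in for a real quality model
toy_scorer = lambda t: min(len(t.split()) / 10, 1.0)

op = quality_filter(toy_scorer, threshold=0.3)
print(op([{"text": "short"},
          {"text": "this sentence has quite a few tokens in it"}]))
# → [{'text': 'this sentence has quite a few tokens in it'}]
```

Swapping `toy_scorer` for a model-backed callable changes the behavior without touching the pipeline structure, which is the point of the modular design.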
Empirical results & research
The repository documents experiments showing DataFlow’s benefits for pretraining filtering, SFT synthesis and filtering, reasoning dataset construction, and code instruction curation. Benchmarks reported in the README show improvements across math, code, and knowledge evaluations when training on DataFlow-curated datasets versus baseline or randomly sampled datasets. The project also links to a technical report on arXiv for deeper methodological details and experiments.
Community & provenance
DataFlow originates from the OpenDCAI / PKU-DCAI research team and lists collaborating institutions such as Peking University (PKU), HKUST, the Chinese Academy of Sciences (CAS), Shanghai AI Lab, Baichuan, and Ant Group. The GitHub repo (created 2024-10-13) offers community resources including a documentation site, video tutorials, Colab notebooks, an issue tracker, and contribution channels.
Typical use-cases
- Pretraining data filtering and selection to improve large-scale pretraining quality.
- SFT/RL training dataset synthesis and filtering for instruction-following and conversation models.
- Knowledge base cleaning and structuring from raw documents for RAG systems.
- Automated pipeline generation for recurring data-engineering tasks via the DataFlow Agent.
Getting started (example)
- Install: pip install open-dataflow.
- Run a simple text pipeline, or try the provided Colab demo to convert raw text into QA pairs.
- Extend by adding custom operators or composing built-in operators into new pipelines.
- For GPU workloads, use the provided Docker image or install optional extras (e.g., open-dataflow[vllm]).
Links & citation
- Repository: https://github.com/OpenDCAI/DataFlow
- Documentation / official site: https://OpenDCAI.github.io/DataFlow-Doc/
- Technical report (arXiv): linked from the repo README
Notes
DataFlow is targeted at practitioners and researchers focused on data-centric approaches to improving LLMs. Its strengths are modularity, a large operator library, ready-to-use pipelines for common tasks (text QA mining, reasoning enhancement, Text2SQL, knowledge-base cleaning, RAG preparation), and the ability to programmatically assemble pipelines via an agent. Users should consult the documentation for API details, operator specifications, and recommended best practices for large-scale data processing.
