
DataFlow

DataFlow is an open-source, LLM-driven data preparation and workflow automation system from the OpenDCAI / PKU-DCAI team. It composes modular operators and pipelines to parse, generate, process, and evaluate high-quality training data from noisy sources (PDFs, plain text, low-quality QA) to improve domain-specific LLM performance. DataFlow includes many ready-made pipelines (Text, Reasoning, Text2SQL, Knowledge-Base Cleaning, Agentic RAG), a DataFlow Agent that auto-assembles pipelines, and a large operator library for filtering, synthesis, evaluation, and more. It is distributed via GitHub and PyPI (open-dataflow) and comes with documentation, Colab demos, and Docker images for easy use.

Introduction

Overview

DataFlow is a data-centric AI system and open-source toolkit designed to automate and standardize the pipeline of preparing high-quality training and retrieval data for LLMs and downstream tasks. Built around a modular "operator" abstraction, DataFlow lets users combine rule-based methods, deep models, LLMs and external APIs into reusable operators and then compose those operators into pipelines suited for different data-engineering needs.

Core components
  • Operators: DataFlow provides a rich operator taxonomy (reported as 80+ generic operators, 40+ domain-specific operators, and ~20 evaluation operators). Each operator accepts structured inputs (JSON/JSONL/CSV) and outputs processed, higher-quality data. Operators cover tasks such as text cleaning, QA extraction, chain-of-thought synthesis, difficulty/category estimation, NL2SQL generation, and evaluation metrics.

  • Pipelines: Operators are composed into ready-to-use pipelines including the Text Pipeline (mine QA pairs from plain text), Reasoning Pipeline (extend QA with chain-of-thought, difficulty, and category labels), Text2SQL Pipeline, Knowledge-Base Cleaning Pipeline (extract and structure knowledge from PDFs, tables, and documents), and Agentic RAG Pipeline (identify QA pairs that require external knowledge). These pipelines are configurable and intended for SFT, pretraining data filtering, RL training data generation, and RAG knowledge preparation.

  • DataFlow Agent: An agentic component that can analyze data needs, write or select operators, and dynamically orchestrate them into pipelines. The agent enables on-demand pipeline synthesis by recombining existing operators to meet specific objectives.
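The operator-and-pipeline design above can be illustrated with a minimal sketch. Note this is a hedged, self-contained illustration of the idea (structured records in, higher-quality records out, operators chained into a pipeline); the class and function names are hypothetical and are not DataFlow's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical operator abstraction (illustrative only, not DataFlow's API):
# an operator wraps any record-transforming function behind a uniform run().
@dataclass
class Operator:
    name: str
    fn: Callable[[list[dict]], list[dict]]

    def run(self, records: list[dict]) -> list[dict]:
        return self.fn(records)

def make_pipeline(*ops: Operator) -> Callable[[list[dict]], list[dict]]:
    """Chain operators so each stage's output feeds the next."""
    def run(records: list[dict]) -> list[dict]:
        for op in ops:
            records = op.run(records)
        return records
    return run

# A rule-based dedup operator and a stand-in difficulty estimator
# (a real pipeline might back the second with an LLM call).
dedupe = Operator("dedupe", lambda rs: list({r["text"]: r for r in rs}.values()))
tag_difficulty = Operator(
    "tag_difficulty",
    lambda rs: [{**r, "difficulty": "easy" if len(r["text"]) < 40 else "hard"}
                for r in rs],
)

pipeline = make_pipeline(dedupe, tag_difficulty)
data = [
    {"text": "What is 2 + 2?"},
    {"text": "What is 2 + 2?"},
    {"text": "Prove that the square root of 2 is irrational."},
]
result = pipeline(data)
print([r["difficulty"] for r in result])  # ['easy', 'hard']
```

The same recombination is what the DataFlow Agent automates: given an objective, it selects and orders operators rather than requiring the user to wire them by hand.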

Usage and integration
  • Packaging & install: DataFlow is published to PyPI as open-dataflow. The project supports Python >= 3.10. Quick install is pip install open-dataflow; a dataflow -v command is provided to check the installed version.

  • Deployment: The project provides a Dockerfile and a prebuilt image (molyheci/dataflow:cu124) that bundles CUDA 12.4 and vLLM for GPU-accelerated local inference. Colab demos are available for immediate experimentation.
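For GPU deployment, the steps implied above look roughly like the following. The image name comes from the text; the --gpus flag assumes the NVIDIA container toolkit is installed on the host.

```shell
# Pull the prebuilt image (bundles CUDA 12.4 and vLLM, per the README).
docker pull molyheci/dataflow:cu124

# Start an interactive GPU-enabled container for local inference.
docker run --gpus all -it molyheci/dataflow:cu124
```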

  • Extensibility: Users can add custom operators, assemble bespoke pipelines, and integrate their own LLMs or inference backends (local or API-based). The modular design supports mixing rule-based, model-based, and LLM-based processing.
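A pluggable inference backend could be sketched as below. This is an assumption-laden illustration of the extensibility point, not DataFlow's real interface: the Protocol, the EchoBackend stand-in, and qa_synthesis_operator are all hypothetical names.

```python
from typing import Protocol

# Hypothetical seam for swapping LLM backends (local vLLM, remote API, ...).
# Illustrative only; DataFlow's actual backend interface may differ.
class LLMBackend(Protocol):
    def generate(self, prompt: str) -> str: ...

class EchoBackend:
    """Stand-in backend for testing; a real one would call a model."""
    def generate(self, prompt: str) -> str:
        return f"[answer to: {prompt}]"

def qa_synthesis_operator(records: list[dict], backend: LLMBackend) -> list[dict]:
    # An LLM-backed operator: turn each raw text record into a QA pair.
    return [{"question": r["text"], "answer": backend.generate(r["text"])}
            for r in records]

out = qa_synthesis_operator([{"text": "Define entropy."}], EchoBackend())
print(out[0]["answer"])  # [answer to: Define entropy.]
```

Coding against a small interface like this is what lets one pipeline mix rule-based, model-based, and LLM-based stages without changing the surrounding plumbing.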

Empirical results & research

The repository documents experiments showing DataFlow's benefits on pretraining filtering, SFT synthesis and filtering, reasoning dataset construction, and code instruction curation. Reported benchmarks (in the README) include improvements across math, code, and knowledge evaluations when training on DataFlow-curated datasets versus baseline or randomly sampled datasets. The project also links to a technical report on arXiv for deeper methodological details and experiments.

Community & provenance

DataFlow originates from the OpenDCAI / PKU-DCAI research team and lists collaborating institutions such as Peking University (PKU), HKUST, Chinese Academy of Sciences (CAS), Shanghai AI Lab, Baichuan, and Ant Group. The GitHub repo (created 2024-10-13) shows community resources including documentation site, video tutorials, Colab notebooks, issue tracker, and contribution channels.

Typical use-cases
  • Pretraining data filtering and selection to improve large-scale pretraining quality.
  • SFT/RL training dataset synthesis and filtering for instruction-following and conversation models.
  • Knowledge base cleaning and structuring from raw documents for RAG systems.
  • Automated pipeline generation for recurring data-engineering tasks via the DataFlow Agent.

Getting started (example)
  1. Install: pip install open-dataflow.
  2. Run a simple text pipeline or try the provided Colab demo to convert raw text into QA pairs.
  3. Extend by adding custom operators or composing built-in operators into new pipelines.
  4. For GPU workloads, use the provided Docker image or install optional extras (e.g., open-dataflow[vllm]).
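Since operators consume and produce structured records (JSON/JSONL/CSV), the plumbing around step 2 looks roughly like this. The snippet is illustrative glue code, not DataFlow's API; it uses in-memory streams so it runs anywhere, where a real run would read and write .jsonl files.

```python
import io
import json

# Minimal JSONL round trip: load records, apply a cleaning operator,
# serialize the survivors back to JSONL. Illustrative plumbing only.
raw = io.StringIO(
    '{"text": "ok record with enough content here"}\n'
    '{"text": ""}\n'
)
records = [json.loads(line) for line in raw if line.strip()]

cleaned = [r for r in records if r["text"].strip()]  # drop empty texts

out = io.StringIO()
for r in cleaned:
    out.write(json.dumps(r) + "\n")
print(out.getvalue().count("\n"))  # 1 surviving record
```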

Notes

DataFlow is targeted at practitioners and researchers focused on data-centric approaches to improving LLMs. Its strengths are modularity, a large operator library, ready-to-use pipelines for common tasks (text QA mining, reasoning enhancement, Text2SQL, KB cleaning, RAG), and the ability to programmatically assemble pipelines via an agent. Users should consult the documentation for API details, operator specifications, and recommended best practices for large-scale data processing.

Information

  • Website: github.com
  • Authors: OpenDCAI / PKU-DCAI Research Team; Peking University (PKU), HKUST, Chinese Academy of Sciences (CAS), Shanghai AI Lab, Baichuan, Ant Group
  • Published date: 2024/10/13