DataFlow is an open-source, LLM-driven data preparation and workflow automation system from the OpenDCAI / PKU-DCAI team. It composes modular operators into pipelines that parse, generate, process, and evaluate data, turning noisy sources (PDFs, plain text, low-quality QA) into high-quality training data that improves domain-specific LLM performance. DataFlow ships with ready-made pipelines (Text, Reasoning, Text2SQL, Knowledge-Base Cleaning, Agentic RAG), a DataFlow Agent that assembles pipelines automatically, and a large operator library covering filtering, synthesis, evaluation, and related tasks. It is distributed via GitHub and PyPI (package name: open-dataflow) and is accompanied by documentation, Colab demos, and Docker images.
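To make the operator-and-pipeline idea concrete, here is a minimal sketch in plain Python. It does not use DataFlow's actual API: `Pipeline`, `normalize_whitespace`, `drop_short`, and `add_length_score` are hypothetical names invented for this illustration, and the sample records are made up. It only shows the general pattern of chaining small, single-purpose operators over a list of records.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# NOTE: everything below is a hypothetical illustration, not DataFlow's API.
Record = Dict[str, object]
Operator = Callable[[List[Record]], List[Record]]


def normalize_whitespace(records: List[Record]) -> List[Record]:
    """Processing operator: collapse runs of whitespace in each text field."""
    return [{**r, "text": " ".join(str(r["text"]).split())} for r in records]


def drop_short(min_words: int) -> Operator:
    """Filtering operator factory: keep records with at least min_words words."""
    def op(records: List[Record]) -> List[Record]:
        return [r for r in records if len(str(r["text"]).split()) >= min_words]
    return op


def add_length_score(records: List[Record]) -> List[Record]:
    """Toy evaluation operator: attach the word count as a 'score' field."""
    return [{**r, "score": len(str(r["text"]).split())} for r in records]


@dataclass
class Pipeline:
    """Runs operators in order, feeding each one the previous output."""
    operators: List[Operator] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op(records)
        return records


# Made-up noisy input records.
noisy = [
    {"text": "  What   is\ta  primary key? "},
    {"text": "ok"},
    {"text": "A primary key uniquely identifies each row in a table."},
]

pipeline = Pipeline([normalize_whitespace, drop_short(3), add_length_score])
cleaned = pipeline.run(noisy)

for record in cleaned:
    print(record)
```

Because every operator shares the same signature (a list of records in, a list of records out), operators can be reordered, swapped, or reused across pipelines without changing the runner. In a real system an LLM-backed generation or scoring step would presumably sit behind the same kind of interface.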