Reliable data collection at scale is a prerequisite for modern ML pipelines and retrieval-augmented applications, yet building crawlers that are both robust and stealthy often takes substantial engineering effort. This project consolidates browser-based and HTTP crawling into a single TypeScript-first toolkit so teams can focus on what to collect, not how to keep the crawler running.
What Sets It Apart
- Unified browser + HTTP interface: lets you switch between raw HTTP, Cheerio/JSDOM parsing, and real-browser rendering (Playwright/Puppeteer) with a common API — so you can handle JS-heavy pages and lightweight endpoints without separate stacks.
- Operational primitives out of the box: persistent request queues, automatic retries, session management, proxy rotation, and built-in storage for datasets and files — meaning less glue code to make crawls repeatable and resumable.
- Stealth and scale features: zero-config generation of browser-like headers and TLS fingerprints, plus headful/headless support and automatic browser management, which together reduce the manual fingerprinting work for long-running crawls.
- Designed for pipelines: storage adapters, dataset pushing, and integration-friendly hooks make it suitable for feeding downstream ML tasks (RAG corpus building, dataset curation) rather than one-off scraping scripts.
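To make the "operational primitives" point concrete, here is a minimal, dependency-free sketch of two of them: URL de-duplication and bounded automatic retries. It is an illustration under stated assumptions, not the toolkit's actual API; `RetryingQueue`, `QueuedRequest`, and `Handler` are hypothetical names, and the queue is in-memory rather than persistent.

```typescript
// Illustrative sketch only: a tiny in-memory request queue with URL
// de-duplication and bounded retries. All names are hypothetical and are
// NOT the toolkit's real API; a real persistent queue would also write
// its state to disk so a crawl can resume after a restart.

type QueuedRequest = { url: string; retryCount: number };
type Handler = (url: string) => Promise<string>;

class RetryingQueue {
  private pending: QueuedRequest[] = [];
  private seen = new Set<string>();
  private maxRetries: number;

  constructor(maxRetries: number) {
    this.maxRetries = maxRetries;
  }

  // Enqueue a URL once; repeated adds of the same URL are ignored.
  add(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push({ url, retryCount: 0 });
    return true;
  }

  // Process every request, re-queueing failures until maxRetries is used up.
  async run(
    handler: Handler,
  ): Promise<{ results: Map<string, string>; failed: string[] }> {
    const results = new Map<string, string>();
    const failed: string[] = [];
    while (this.pending.length > 0) {
      const req = this.pending.shift()!;
      try {
        results.set(req.url, await handler(req.url));
      } catch {
        if (req.retryCount < this.maxRetries) {
          // Put the request at the back of the queue for another attempt.
          this.pending.push({ url: req.url, retryCount: req.retryCount + 1 });
        } else {
          failed.push(req.url);
        }
      }
    }
    return { results, failed };
  }
}
```

With `maxRetries` set to 2, a handler that always throws for a given URL is attempted three times (the first try plus two retries) before that URL is reported in `failed`. This is the kind of glue code the bullets above describe the toolkit as providing out of the box.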
Who It's For & Trade-offs
Great fit if you need a production-grade Node.js crawler that must operate reliably against modern sites and integrate into data pipelines; teams collecting web data for search indexes, RAG, or analytics will benefit most. Look elsewhere if you need a lightweight Python-first scraper (there are Python alternatives and a separate crawlee-python project) or if you prefer minimal dependencies: real-browser crawling requires a heavier runtime (browsers, Playwright/Puppeteer) and a careful proxy and anti-bot strategy for large-scale targets. Also note that it targets Node.js (16+ required), so language preference is a practical constraint.
Where It Fits
Positioned between low-level HTTP clients and full browser automation frameworks, it reduces operational overhead by combining queueing, storage, and browser orchestration. For teams that need repeatable, scalable crawls for ML data pipelines, it replaces ad-hoc Puppeteer/Python scripts with a structured, production-ready foundation.
