Firecrawl — Web Data API for AI
Firecrawl is an API-first project that converts websites into clean, LLM-ready content and structured data. It’s designed for AI use cases like retrieval-augmented generation (RAG), knowledge base ingestion, automated data extraction, and monitoring site changes. Firecrawl provides both an open-source codebase and a hosted cloud offering, combining developer-friendly SDKs with robust crawling and extraction capabilities.
Key capabilities
- Scrape: fetch a single URL and return content in formats such as Markdown, HTML, screenshots, and JSON (including page metadata).
- Crawl: submit a job to crawl a URL and its accessible subpages (no sitemap required), with job polling and results retrieval.
- Map: rapidly discover and list URLs on a domain (site map / link discovery).
- Search + Scrape: run web searches and optionally scrape retrieved results in one operation.
- LLM Extraction: extract structured data from one or many pages using either a provided schema or a prompt (supports JSON schema / Pydantic / Zod integration in SDKs).
- Actions: interact with pages before scraping (click, scroll, input, wait, screenshot) — useful for JavaScript-driven or gated content (cloud-only for some actions).
- Batch scraping & async jobs: submit thousands of URLs or batch jobs and poll for results.
- Change tracking: monitor pages and detect content changes over time.
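The scrape capability above boils down to an authenticated HTTP POST. The sketch below builds (but does not send) such a request with the standard library; the payload field names (`url`, `formats`) and the hosted base URL are assumptions inferred from this overview, so verify them against the current API reference before relying on them.

```python
import json
from urllib.request import Request

# Hosted endpoint; self-hosters would substitute their own base URL.
API_URL = "https://api.firecrawl.dev/v2/scrape"

def build_scrape_request(url: str, formats=("markdown",), api_key: str = "fc-YOUR-KEY") -> Request:
    """Build (but do not send) a POST request for the /v2/scrape endpoint.

    The payload fields ("url", "formats") are assumptions based on the
    endpoint names in this document; confirm them in the live API docs.
    """
    payload = {"url": url, "formats": list(formats)}
    return Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen` (or swap in an SDK call); keeping request construction separate from transport makes the payload easy to inspect and unit-test.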
Integrations & SDKs
Firecrawl provides SDKs and integrations to make adoption straightforward:
- Official SDKs: Python (firecrawl-py), Node (@mendable/firecrawl-js), Go, Rust.
- Framework integrations: LangChain (Python & JS), LlamaIndex, and other popular toolchains.
- Low-code and connectors: Zapier, Pipedream, Dify, Flowise, Cargo, etc.
Usage scenarios
- Build a "chat with your website" assistant using RAG: scrape a site into chunked, LLM-ready markdown and index into a vector DB.
- Extract structured entities (product info, articles, company data) at scale using LLM extraction with schemas.
- Automate interactions to access dynamic or gated content (forms, paginated content) then scrape results.
- Monitor competitors' pages or documentation with change tracking.
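For the RAG scenario above, scraped Markdown is typically chunked before being indexed into a vector DB. A deliberately minimal heading-based chunker (plain Python, independent of Firecrawl itself) might look like this; real pipelines often use token-aware splitters from LangChain or LlamaIndex, both of which Firecrawl integrates with.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks: first on headings, then by size.

    A simple sketch for illustration, not a production splitter.
    """
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new section at each heading line.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Second pass: split any oversized section into max_chars pieces.
    chunks: list[str] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each chunk can then be embedded and stored alongside the page URL from the scrape metadata, so the assistant can cite its sources.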
Open source vs cloud
Firecrawl’s repo is open-source under AGPL-3.0 (with some components under MIT). A hosted Firecrawl Cloud (https://firecrawl.dev) adds managed features, scalability, and convenience. The project documents self-hosting, but notes that some features are cloud-only and that hosted capabilities remain under active development.
Notes & metadata
- Created: 2024-04-15 (GitHub repository creation date)
- License: AGPL-3.0 (with certain components under MIT)
- Typical outputs: Markdown, HTML, JSON (structured extraction), screenshots, metadata
- Common use-cases: RAG ingestion, knowledge-base building, automated web extraction, data engineering for LLMs
Getting started (high level)
- Sign up for an API key on the hosted site, or self-host the repo.
- Use an SDK (Python/Node) or curl to call the /v2/scrape, /v2/crawl, /v2/map, and /v2/extract endpoints.
- Optionally provide a schema or prompt for structured extraction, then integrate the output into your LLM pipeline.
