Turn the entire web into data your AI can use
Clean Markdown, structured JSON, and real-time extraction, built for AI pipelines, RAG systems, and intelligent agents
Everything your AI pipeline needs
Production-ready web extraction with caching, safety, and async processing built in.
Clean Markdown Extraction
Automatically extracts readable content, strips boilerplate, and converts it to clean Markdown with metadata: titles, descriptions, word counts, and reading time.
"title": "..."
}]}
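To give a feel for the word-count and reading-time metadata, here is a minimal Python sketch. The function name `text_stats`, the output keys, and the 200 words-per-minute reading speed are illustrative assumptions, not the service's actual implementation.

```python
import math
import re

# Assumed average reading speed; the real service may use a different value.
WORDS_PER_MINUTE = 200


def text_stats(markdown: str) -> dict:
    """Return a word count and an estimated reading time in whole minutes."""
    # Treat runs of word characters (including inner apostrophes and hyphens) as words.
    words = re.findall(r"\w+(?:['’-]\w+)*", markdown)
    word_count = len(words)
    # Round up so that any non-empty text takes at least one minute.
    minutes = math.ceil(word_count / WORDS_PER_MINUTE) if word_count else 0
    return {"word_count": word_count, "reading_time_minutes": minutes}
```

Markdown punctuation such as `#` or `*` is not counted as a word, so the figures reflect the readable text only.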
Structured JSON with Schema
A Gemini-powered agent extracts structured data using your custom JSON schema definition.
Content Hash Caching
SHA-256 content hashing with multi-layer caching (Redis + DB) lets unchanged content skip expensive re-processing.
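The idea behind content-hash caching can be sketched in a few lines of Python. This is an illustration under stated assumptions: plain dictionaries stand in for Redis and the database, and `TwoLayerCache`, `content_key`, and `process_once` are hypothetical names, not part of the product.

```python
import hashlib
from typing import Callable, Optional


class TwoLayerCache:
    """A fast layer in front of a slower persistent layer.

    Plain dicts stand in for Redis (fast) and the database (slow).
    """

    def __init__(self) -> None:
        self.fast: dict = {}
        self.slow: dict = {}

    def get(self, key: str) -> Optional[str]:
        if key in self.fast:
            return self.fast[key]
        if key in self.slow:
            # Promote to the fast layer so the next lookup is cheaper.
            self.fast[key] = self.slow[key]
            return self.fast[key]
        return None

    def set(self, key: str, value: str) -> None:
        self.fast[key] = value
        self.slow[key] = value


def content_key(content: str) -> str:
    """Derive the cache key from the content itself, not from the URL."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def process_once(content: str, cache: TwoLayerCache,
                 process: Callable[[str], str]) -> str:
    """Run the expensive step only if this exact content has not been seen."""
    key = content_key(content)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = process(content)
    cache.set(key, result)
    return result
```

Because the key is a hash of the content, two URLs that serve identical pages share one cache entry, and a page that has not changed since the last fetch is never re-processed.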
SSRF Protection
Internal IP blocking, private network validation, and strict URL sanitization to prevent SSRF attacks.
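A rough Python sketch of this kind of check follows. It assumes a simple policy (only `http` and `https`, and every resolved address must be publicly routable); the function names are hypothetical and the real validation rules may differ.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname to its IP addresses with the system resolver."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not private, loopback, link-local)."""
    try:
        return ipaddress.ip_address(address).is_global
    except ValueError:
        return False


def is_safe_url(url: str,
                resolver: Callable[[str], Iterable[str]] = resolve_host) -> bool:
    """Reject URLs with unexpected schemes or hosts that resolve to internal IPs."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        addresses = list(resolver(parts.hostname))
    except OSError:
        # Covers DNS failures (socket.gaierror is a subclass of OSError).
        return False
    # Every resolved address must be public; one internal address is enough to reject.
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

A check like this only helps if the fetcher then connects to the address that was validated; resolving the hostname a second time at fetch time would reopen the door to DNS rebinding.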
robots.txt Compliance
Optional robots.txt checking ensures respectful crawling with per-domain rate limiting.
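As an illustration, both pieces can be sketched with Python's standard library. The user agent `example-bot`, the two-second interval, and the `DomainRateLimiter` class are assumptions made for this example rather than details of the service.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class DomainRateLimiter:
    """Space out requests so each domain is hit at most once per interval."""

    def __init__(self, min_interval: float, clock=time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self.next_allowed: dict = {}

    def wait_time(self, url: str) -> float:
        """Reserve the next slot for the URL's domain; return seconds to wait."""
        domain = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(domain, now))
        self.next_allowed[domain] = start + self.min_interval
        return start - now


# Usage: check permission first, then wait for the domain's next free slot.
robots = build_robots("User-agent: *\nDisallow: /private/\n")
limiter = DomainRateLimiter(min_interval=2.0)
target = "https://example.com/docs"
if robots.can_fetch("example-bot", target):
    delay = limiter.wait_time(target)  # 0.0 for the first request to a domain
```

Keeping the limiter keyed by domain means a slow crawl of one site never delays requests to another.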
Async Job Queue with Real-Time Polling
Long-running operations (site mapping, batch extraction) run as background jobs. Poll for progress with real-time status updates, page counts, and timing.
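On the client side, polling a background job might look like the Python sketch below. The status values (`completed`, `failed`) and the shape of the status dictionary are assumptions; `fetch_status` is a hypothetical callable that you would implement against the job-status endpoint.

```python
import time
from typing import Callable

# Assumed names for the states in which a job stops changing.
TERMINAL_STATES = {"completed", "failed"}


def poll_job(fetch_status: Callable[[], dict],
             interval: float = 1.0,
             timeout: float = 60.0,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status until the job reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)
```

Passing `fetch_status` in as a function keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.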
Cache performance
Extraction times
How it works
Send a URL + prompt
POST to any endpoint with your target URL and an optional prompt for structured extraction.
Distill extracts & cleans
Auto-detects when a page needs JS rendering, fetches the content, removes boilerplate, and converts it to Markdown.
Get structured data back
Receive clean Markdown, metadata, links, and optional structured JSON, ready for your pipeline.
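The first step of that flow could be sketched in Python as follows. Everything specific here is a placeholder: the base URL `https://api.example.com`, the `/scrape` path, the `url` and `prompt` field names, and the bearer-token header are assumptions for illustration, so check the API reference for the real values.

```python
import json
import urllib.request
from typing import Optional

# Placeholder base URL; not the service's real address.
API_BASE = "https://api.example.com"


def build_scrape_request(api_key: str, url: str,
                         prompt: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a URL and an optional prompt."""
    payload = {"url": url}
    if prompt:
        # The prompt is only included when structured extraction is wanted.
        payload["prompt"] = prompt
    return urllib.request.Request(
        f"{API_BASE}/scrape",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and decoding the JSON body of the response.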
Four powerful endpoints
Each endpoint is designed for a specific extraction pattern.
Scrape
Turn any URL into clean Markdown
Map
Crawl entire sites with BFS
Search
Web search with optional scrape
Agent Extract
Gemini-powered structured JSON
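For readers curious about the breadth-first crawl behind Map, here is a minimal Python sketch of BFS over a site's links. It assumes same-host links only and a page limit; `get_links` is a hypothetical callable that returns the links found on a page, and none of this is the service's actual crawler.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin, urlsplit


def bfs_crawl(start_url: str,
              get_links: Callable[[str], Iterable[str]],
              max_pages: int = 100) -> List[str]:
    """Visit pages level by level, staying on the start URL's host."""
    origin = urlsplit(start_url).netloc
    start = urldefrag(start_url).url
    seen = {start}
    queue = deque([start])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO order is what makes the traversal breadth-first
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == origin and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Breadth-first order reaches pages close to the start URL first, so a page limit cuts off the deepest pages rather than whole top-level sections.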
Start building in minutes
Get your API key and start extracting data. No credit card required.