AI Data Infrastructure

Turn the entire web into data your AI can use

Clean markdown, structured JSON, and real-time extraction — built for AI pipelines, RAG systems, and intelligent agents

Built for developers
curl -X POST https://api.distill.dev/api/v1/scrape \
  -H "X-API-Key: sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://openai.com/research",
    "use_playwright": "auto",
    "include_links": true
  }'

// Response preview:
{
  "success": true,
  "markdown": "# OpenAI Research\n\nOpenAI conducts...",
  "metadata": {
    "title": "OpenAI Research",
    "word_count": 1423,
    "cached": false
  },
  "request_id": "req_abc123"
}
Output

OpenAI Research

Latest Publications

OpenAI conducts research across multiple domains including language models, reinforcement learning, and AI safety. Their recent publications focus on scaling laws, alignment techniques, and multimodal understanding...
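For pipelines written in Python, the scrape request might be sketched as follows. It uses only the standard library and mirrors the endpoint, headers, and body fields from the curl example. The helper name build_scrape_request and the placeholder key are illustrative assumptions, and the request is built without being sent.

```python
import json
import urllib.request

API_URL = "https://api.distill.dev/api/v1/scrape"  # endpoint from the curl example


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "url": url,
        "use_playwright": "auto",
        "include_links": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical helper name; "sk_your_key" is the placeholder from the example.
req = build_scrape_request("https://openai.com/research", "sk_your_key")
# To send it: urllib.request.urlopen(req), then json.load() the response body.
```

Building the request separately from sending it keeps the payload easy to inspect and log before any network call is made.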

Everything your AI pipeline needs

Production-ready web extraction with caching, safety, and async processing built in.

# Title
## Section
Clean text...

Clean Markdown Extraction

Automatically extracts readable content, strips boilerplate, and converts to clean Markdown with metadata — titles, descriptions, word counts, and reading time.

Extract top stories
{ "stories": [{
  "title": "..."
}]}

Structured JSON with Schema

A Gemini-powered agent extracts structured data that matches your custom JSON schema.
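As a rough illustration, a schema for the "Extract top stories" preview could look like the sketch below. The request keys "prompt" and "schema", the example.com URL, and both helper names are assumptions made for this example, not confirmed field names of the API.

```python
import json

# JSON Schema for the shape hinted at in the "Extract top stories" preview.
STORIES_SCHEMA = {
    "type": "object",
    "properties": {
        "stories": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"title": {"type": "string"}},
                "required": ["title"],
            },
        }
    },
    "required": ["stories"],
}


def build_extract_payload(url: str, prompt: str, schema: dict) -> str:
    """Serialize a request body; the "prompt" and "schema" keys are assumed names."""
    return json.dumps({"url": url, "prompt": prompt, "schema": schema})


def matches_stories_schema(data: dict) -> bool:
    """Hand-rolled shape check for STORIES_SCHEMA (not a general validator)."""
    stories = data.get("stories")
    return isinstance(stories, list) and all(
        isinstance(s, dict) and isinstance(s.get("title"), str) for s in stories
    )


body = build_extract_payload(
    "https://example.com/news", "Extract top stories", STORIES_SCHEMA
)
```

Checking the returned JSON against the expected shape before passing it downstream is a cheap guard for any model-backed extraction step.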

🔴 MISS  req_001  1.2s
🟢 HIT   req_001  8ms

Content Hash Caching

SHA-256 content hashing with multi-layer caching (Redis + DB) skips expensive re-processing.
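The idea can be sketched in a few lines. Here two plain dictionaries stand in for Redis and the database, and the class and method names are hypothetical; this illustrates the lookup order and is not Distill's actual implementation.

```python
import hashlib
from typing import Callable, Dict, Tuple


def content_hash(raw: str) -> str:
    """SHA-256 hex digest of the fetched content."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class TwoLayerCache:
    """Fast layer checked first, durable layer second (dicts stand in for Redis and a DB)."""

    def __init__(self) -> None:
        self.fast: Dict[str, str] = {}
        self.durable: Dict[str, str] = {}

    def get_or_process(
        self, raw: str, process: Callable[[str], str]
    ) -> Tuple[str, str]:
        key = content_hash(raw)
        if key in self.fast:
            return self.fast[key], "HIT"
        if key in self.durable:
            self.fast[key] = self.durable[key]  # promote to the fast layer
            return self.fast[key], "HIT"
        result = process(raw)  # the expensive step runs only on a miss
        self.fast[key] = self.durable[key] = result
        return result, "MISS"
```

Because the key is derived from the content rather than the URL, two URLs that serve identical content share one cache entry in this sketch.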

http://169.254.169.254
Blocked

SSRF Protection

Internal IP blocking, private network validation, and strict URL sanitization to prevent SSRF attacks.
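A minimal sketch of this kind of check, using Python's ipaddress module, is shown below. The function names are hypothetical, and the sketch only inspects literal IP hosts; a production check would also need to resolve hostnames and re-validate after every redirect.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_ip(host: str) -> bool:
    """True if host is a literal IP in a private, loopback, link-local, or reserved range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; a hostname must be resolved and re-checked
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host is not a literal internal IP."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    return not is_internal_ip(parts.hostname)
```

With this sketch, the cloud metadata address http://169.254.169.254 is rejected because it falls in the link-local range.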

User-agent: *
Disallow: /private
Respected

robots.txt Compliance

Optional robots.txt checking ensures respectful crawling with per-domain rate limiting.
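Both ideas can be illustrated with the standard library. The sketch below parses the rules shown above with urllib.robotparser and adds a simple per-domain minimum-interval limiter. The limiter class, its names, and the example.com URLs are assumptions for illustration only.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self.next_slot: Dict[str, float] = {}

    def wait_time(self, domain: str, now: float) -> float:
        """Return seconds to wait before the next request and reserve that slot."""
        delay = max(0.0, self.next_slot.get(domain, now) - now)
        self.next_slot[domain] = now + delay + self.min_interval
        return delay


checker = make_checker(ROBOTS_TXT)
```

A crawler would call checker.can_fetch("*", url) before each request and sleep for wait_time(domain, time.monotonic()) seconds when the result is non-zero.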

job_xyz  running
job_abc  done ✓
job_def  queued

Async Job Queue with Real-Time Polling

Long-running operations (site mapping, batch extraction) run as background jobs. Poll for progress with real-time status updates, page counts, and timing.
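On the client side, polling might look like the sketch below. The status-fetching function is injected so the loop makes no assumptions about the actual job endpoint. The "failed" status, the "pages" field, and all names here are hypothetical; only queued, running, and done appear on this page.

```python
import time
from typing import Callable, Dict

# "failed" is an assumed terminal status; the page shows queued/running/done.
TERMINAL = {"done", "failed"}


def poll_job(
    job_id: str,
    fetch_status: Callable[[str], Dict],
    interval: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for attempt in range(max_polls):
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL:
            return status
        if attempt < max_polls - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely.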

Cache performance

Distill (cached)   8ms
Distill (fresh)    1.1s
Others             3.4s

Extraction times

URL                    Extract   Cached
openai.com/research    892ms     7ms
docs.anthropic.com     1.1s      9ms
arxiv.org/abs/...      743ms     6ms
github.com/trending    654ms     5ms

How it works

01

Send a URL + prompt

POST to any endpoint with your target URL and an optional prompt for structured extraction.

02

Distill extracts & cleans

Detects whether a page needs JavaScript rendering, fetches the content, removes boilerplate, and converts the result to Markdown.

03

Get structured data back

Receive clean markdown, metadata, links, and optional structured JSON — ready for your pipeline.

Four powerful endpoints

Each endpoint is designed for a specific extraction pattern.

Scrape

Turn any URL into clean Markdown

POST /api/v1/scrape

Map

Crawl entire sites with BFS

POST /api/v1/map
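For readers unfamiliar with breadth-first search (BFS), the traversal order can be sketched as follows. The link-fetching function is injected and the function name and max_pages cap are assumptions; this shows the general technique, not how the Map endpoint is implemented.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_map(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> List[str]:
    """Return pages in breadth-first order, visiting each URL at most once."""
    seen = {start}
    queue = deque([start])
    order: List[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order
```

BFS visits pages closest to the start URL first, so a page cap tends to keep top-level pages rather than deep, narrow branches.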

Search

Web search with optional scrape

POST /api/v1/search

Agent Extract

Gemini-powered structured JSON

POST /api/v1/agent/extract

Start building in minutes

Get your API key and start extracting data. No credit card required.