An async BFS website crawler framework built with Playwright.
This project demonstrates how to design a modular crawling system capable of:
- rendering modern websites with Playwright
- extracting page content and links
- exporting Markdown snapshots
- enforcing URL filtering and robots.txt rules
- persisting crawler state for resumable crawling
The crawler is designed as a reusable framework rather than a site-specific scraper, with configuration provided via config.yaml and CLI overrides.
Many simple crawlers rely on requests + BeautifulSoup, which works well for static websites but struggles with modern JavaScript-rendered pages.
This project explores a different design:
- Playwright rendering for dynamic content
- BFS traversal for predictable crawl coverage
- resumable state so long crawls can recover after interruption
- modular architecture to separate fetching, filtering, exporting, and state management
- configuration-driven crawling via YAML and CLI
The goal is not to compete with large frameworks like Scrapy, but to demonstrate how a crawler system can be structured from first principles.
- Async BFS crawl with persistent queue and resume support
- Page content exported to Markdown
- YAML-based configuration with CLI overrides
- Generic URL filtering via regex patterns (no site-specific logic)
- Optional `robots.txt` enforcement
- Retry with exponential backoff for transient failures
- Docker ready
The crawler is designed as a modular pipeline:
- CLI / config.yaml provides runtime configuration
- main.py initializes the crawler and loads state
- CrawlerCore coordinates crawling logic
- PageFetcher uses Playwright for rendering
- UrlFilter enforces domain, language, and robots rules
- CrawlState persists queue progress for resumable crawling
- MarkdownExporter saves processed content
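The `CrawlState` component above can be sketched as a BFS frontier plus visited set serialized to JSON. This is an illustrative sketch, the exact field names and file layout are assumptions:

```python
import json
from collections import deque
from pathlib import Path

class CrawlState:
    """BFS frontier with JSON persistence (illustrative sketch)."""

    def __init__(self, path="crawler_state.json"):
        self.path = Path(path)
        self.queue = deque()      # URLs waiting to be crawled (FIFO = BFS)
        self.visited = set()      # URLs already processed
        self.failed = set()       # URLs that permanently failed

    def add_urls(self, urls):
        # Enqueue only URLs we have not seen and not already queued
        for url in urls:
            if url not in self.visited and url not in self.queue:
                self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

    def mark_visited(self, url):
        self.visited.add(url)

    def save(self):
        # Persist everything needed to resume after Ctrl+C
        self.path.write_text(json.dumps({
            "queue": list(self.queue),
            "visited": sorted(self.visited),
            "failed": sorted(self.failed),
        }))

    def load(self):
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.queue = deque(data["queue"])
            self.visited = set(data["visited"])
            self.failed = set(data["failed"])
```

Because the whole frontier is written out on every save, a crawl interrupted mid-run picks up exactly where it left off.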
```mermaid
flowchart TD
    CLI[CLI / config.yaml] --> Main[main.py]
    Main --> Config[CrawlerConfig<br/>config.py]
    Config --> Crawler[Crawler<br/>core.py]
    Crawler --> State[CrawlState<br/>state.py]
    Crawler --> Filter[UrlFilter<br/>filters.py]
    Crawler --> Fetcher[PageFetcher<br/>fetcher.py]
    Crawler --> Exporter[MarkdownExporter<br/>exporter.py]
    Fetcher --> Browser[Playwright Chromium<br/>async API]
    State --> StateFile[crawler_state.json]
    Exporter --> Pages[output/pages/*.md]
    Filter --> Robots[robots.txt<br/>urllib.robotparser]
```
```mermaid
sequenceDiagram
    participant Main
    participant Crawler as Crawler (core.py)
    participant Fetcher as PageFetcher
    participant Browser as Playwright Browser
    participant Filter as UrlFilter
    participant State as CrawlState
    participant Disk

    Main->>State: load state (queue / visited / failed)
    Main->>Crawler: asyncio.run(crawler.run())
    loop BFS Queue not empty
        Crawler->>State: next_url()
        State-->>Crawler: url
        Crawler->>Fetcher: fetch(url)
        Fetcher->>Browser: goto(url)
        Browser-->>Fetcher: page loaded
        Fetcher->>Browser: inner_text("body")
        Fetcher->>Browser: evaluate all a[href] links
        Browser-->>Fetcher: text, html, [links]
        Fetcher-->>Crawler: FetchResult
        Crawler->>Disk: MarkdownExporter.write(url, text)
        loop Each extracted link
            Crawler->>Filter: check(link)
            Filter-->>Crawler: (True/False, reason)
        end
        Crawler->>State: add_urls(allowed_links)
        Crawler->>State: mark_visited(url)
        Crawler->>Disk: save crawler_state.json
    end
    Note over Crawler: Ctrl+C saves state and exits cleanly
```
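The loop in the sequence diagram condenses to roughly the following. This is a sketch: method names mirror the modules above, and the delay, retry, and robots handling shown elsewhere in this README are omitted for brevity:

```python
import asyncio

async def crawl(state, fetcher, url_filter, exporter, max_pages=None):
    """Condensed BFS loop (sketch; real code adds delays and retries)."""
    pages = 0
    while max_pages is None or pages < max_pages:
        url = state.next_url()
        if url is None:
            break                                   # frontier exhausted
        result = await fetcher.fetch(url)           # Playwright under the hood
        exporter.write(url, result.text)            # Markdown snapshot
        # Keep only links the filter allows, then grow the frontier
        allowed = [l for l in result.links if url_filter.check(l)[0]]
        state.add_urls(allowed)
        state.mark_visited(url)
        state.save()                                # resumable after Ctrl+C
        pages += 1
```

Saving state at the end of every iteration is what makes interruption at any point safe: the worst case is re-fetching the single page that was in flight.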
```
site-crawler/
├── crawler/
│   ├── config.py        # YAML loader + dataclasses
│   ├── filters.py       # URL filter (domain, extension, regex, robots)
│   ├── state.py         # Async BFS state with JSON persistence
│   ├── fetcher.py       # Playwright async page fetcher
│   ├── exporter.py      # Markdown file writer
│   └── core.py          # Main async crawl loop
├── tests/
│   ├── test_filters.py
│   ├── test_state.py
│   └── test_exporter.py
├── main.py              # CLI entry point
├── config.yaml          # Default configuration
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
```
```bash
python -m venv .venv

# Windows
.venv\Scripts\pip install -r requirements.txt
.venv\Scripts\playwright install chromium

# macOS / Linux
.venv/bin/pip install -r requirements.txt
.venv/bin/playwright install chromium
```

Edit config.yaml before running:
```yaml
crawler:
  start_url: "https://example.com"
  allowed_domain: "example.com"
  max_pages: null          # null = unlimited

delays:
  min: 1.0
  max: 3.0

filters:
  respect_robots: true
  exclude_patterns: []     # regex — any match blocks the URL
  include_patterns: []     # regex whitelist — empty means allow all
```

```bash
# Use config.yaml defaults
python main.py

# Override start URL and domain
python main.py --url https://example.com --domain example.com

# Limit to 100 pages
python main.py --max-pages 100

# Add exclude patterns at runtime
python main.py --exclude "/login/" "/admin/" "/cart/"

# Show browser window
python main.py --headless false

# Ignore robots.txt
python main.py --ignore-robots

# Clear state and restart
python main.py --reset
```

Press Ctrl+C at any time — state is saved and the crawl resumes on the next run.
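The YAML above maps onto a flat configuration object that CLI flags can then override. A minimal sketch of how `config.py` might do this, assuming the section nesting shown above (the dataclass field names are assumptions, not the project's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CrawlerConfig:
    """Flattened view of config.yaml (field names are illustrative)."""
    start_url: str = "https://example.com"
    allowed_domain: str = "example.com"
    max_pages: Optional[int] = None          # None = unlimited
    min_delay: float = 1.0
    max_delay: float = 3.0
    respect_robots: bool = True
    exclude_patterns: List[str] = field(default_factory=list)
    include_patterns: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw):
        # Each YAML section falls back to sensible defaults if absent
        crawler = raw.get("crawler", {})
        delays = raw.get("delays", {})
        filters = raw.get("filters", {})
        return cls(
            start_url=crawler.get("start_url", "https://example.com"),
            allowed_domain=crawler.get("allowed_domain", "example.com"),
            max_pages=crawler.get("max_pages"),
            min_delay=delays.get("min", 1.0),
            max_delay=delays.get("max", 3.0),
            respect_robots=filters.get("respect_robots", True),
            exclude_patterns=list(filters.get("exclude_patterns") or []),
            include_patterns=list(filters.get("include_patterns") or []),
        )

def load_config(path="config.yaml"):
    import yaml  # PyYAML, listed in requirements.txt
    with open(path) as f:
        return CrawlerConfig.from_dict(yaml.safe_load(f) or {})
```

Keeping the parsed config in one dataclass means CLI overrides are just attribute assignments on a single object.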
| Flag | Default | Description |
|---|---|---|
| `--config` | `config.yaml` | YAML config file path |
| `--url` | from config | Override start URL |
| `--domain` | from config | Override allowed domain |
| `--max-pages` | `null` | Stop after N pages |
| `--min-delay` | from config | Min delay between requests (s) |
| `--max-delay` | from config | Max delay between requests (s) |
| `--headless` | `true` | Pass `false` to show browser |
| `--exclude` | — | Additional regex exclude patterns |
| `--ignore-robots` | `false` | Ignore robots.txt |
| `--reset` | `false` | Clear state and restart |
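The flag table maps onto an `argparse` parser roughly like this. A sketch only: the actual parser in main.py may differ in defaults and help text:

```python
import argparse

def build_parser():
    """Illustrative parser matching the flag table above."""
    p = argparse.ArgumentParser(description="Async BFS site crawler")
    p.add_argument("--config", default="config.yaml", help="YAML config file path")
    p.add_argument("--url", help="Override start URL")
    p.add_argument("--domain", help="Override allowed domain")
    p.add_argument("--max-pages", type=int, help="Stop after N pages")
    p.add_argument("--min-delay", type=float, help="Min delay between requests (s)")
    p.add_argument("--max-delay", type=float, help="Max delay between requests (s)")
    p.add_argument("--headless", default="true", choices=["true", "false"],
                   help="Pass false to show the browser window")
    p.add_argument("--exclude", nargs="*", default=[],
                   help="Additional regex exclude patterns")
    p.add_argument("--ignore-robots", action="store_true", help="Ignore robots.txt")
    p.add_argument("--reset", action="store_true", help="Clear state and restart")
    return p
```

`--headless` takes a string value rather than `store_true` so the default from config.yaml can be kept when the flag is absent.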
UrlFilter evaluates links in this order:

1. Invalid — non-HTTP/HTTPS URLs or empty strings are rejected immediately
2. Domain — only URLs matching `allowed_domain` (`www.` normalised) are followed
3. Extension — configurable blocklist (`.jpg`, `.pdf`, `.zip`, etc.)
4. Exclude patterns — any URL matching a regex in `exclude_patterns` is blocked
5. Include patterns — if `include_patterns` is non-empty, the URL must match at least one
6. robots.txt — fetched once per domain and cached; checked if `respect_robots: true`
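The ordering above can be sketched as a single function that returns an `(allowed, reason)` pair, matching the tuple shown in the sequence diagram. The robots.txt stage is omitted here for brevity, and the extension blocklist is an illustrative subset:

```python
import re
from urllib.parse import urlparse

BLOCKED_EXTENSIONS = {".jpg", ".png", ".pdf", ".zip"}  # illustrative subset

def check(url, allowed_domain, exclude_patterns=(), include_patterns=()):
    """Apply the filter stages in order; returns (allowed, reason)."""
    parsed = urlparse(url)
    # 1. Invalid: only http/https with a host survive
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False, "invalid"
    # 2. Domain: compare with www. normalised away on both sides
    host = parsed.netloc.lower().removeprefix("www.")
    if host != allowed_domain.lower().removeprefix("www."):
        return False, "domain"
    # 3. Extension blocklist
    path = parsed.path.lower()
    if any(path.endswith(ext) for ext in BLOCKED_EXTENSIONS):
        return False, "extension"
    # 4. Exclude patterns: any match blocks
    if any(re.search(p, url) for p in exclude_patterns):
        return False, "excluded"
    # 5. Include patterns: if non-empty, at least one must match
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False, "not_included"
    return True, "allowed"
```

Returning the failing stage as the `reason` string is what feeds the per-page filtering stats described under logging below.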
| Error type | Behaviour |
|---|---|
| HTTP 429 | Long backoff (4× base delay), then re-queued |
| Playwright timeout | Retry with exponential backoff (up to max_retries) |
| Any other exception | Permanently failed — logged, not retried |
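The timeout row of the table corresponds to an exponential-backoff retry wrapper along these lines. A sketch under assumed names: a real 429 handler would re-queue the URL with the longer delay instead of retrying inline:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0):
    """Retry a fetch coroutine on timeout with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except asyncio.TimeoutError:
            if attempt == max_retries:
                raise                     # give up: caller marks URL failed
            # 1s, 2s, 4s, ... plus jitter so retries don't synchronise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```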
```bash
# Build
docker build -t site-crawler .

# Run (edit config.yaml first)
docker run -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml site-crawler

# Or with docker-compose
docker compose up
```

The crawler writes its results to:

```
output/
  pages/
    about_a1b2c3d4.md
    products_e5f6g7h8.md
    ...
  crawler_state.json    # resume state
  crawler.log           # full debug log
```
Each Markdown file has the format:

```
# https://example.com/about

<full page body text>
```

Each crawled page logs link filtering stats:

```
Links: total=134 allowed=22 reasons={'allowed': 22, 'robots': 80, 'domain': 32}
```

Use this to diagnose why links are being filtered out.

Run the test suite with:

```bash
pytest tests/ -v
```

All tests run without a real browser or network.
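File names like `about_a1b2c3d4.md` in the output tree suggest a path slug combined with a short URL hash, which keeps names readable while avoiding collisions between URLs that slugify identically. A sketch of such a scheme; the exact naming used by MarkdownExporter is an assumption:

```python
import hashlib
import re
from urllib.parse import urlparse

def page_filename(url):
    """Derive a Markdown filename from a URL: path slug + 8-char URL hash."""
    path = urlparse(url).path.strip("/")
    # Collapse anything non-alphanumeric into underscores
    slug = re.sub(r"[^a-z0-9]+", "_", path.lower()).strip("_") or "index"
    digest = hashlib.sha1(url.encode()).hexdigest()[:8]
    return f"{slug}_{digest}.md"
```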
- Playwright — async browser automation
- PyYAML — YAML config parsing
- pytest — unit tests
For educational and research use only. Always respect a website's terms of service and applicable laws before crawling.