Skip to content

Shen-po-heng/Site-Crawler

Repository files navigation

Site Crawler

An async BFS website crawler framework built with Playwright.

This project demonstrates how to design a modular crawling system capable of:

  • rendering modern websites with Playwright
  • extracting page content and links
  • exporting Markdown snapshots
  • enforcing URL filtering and robots.txt rules
  • persisting crawler state for resumable crawling

The crawler is designed as a reusable framework rather than a site-specific scraper, with configuration provided via config.yaml and CLI overrides.

Why This Project

Many simple crawlers rely on requests + BeautifulSoup, which works well for static websites but struggles with modern JavaScript-rendered pages.

This project explores a different design:

  • Playwright rendering for dynamic content
  • BFS traversal for predictable crawl coverage
  • resumable state so long crawls can recover after interruption
  • modular architecture to separate fetching, filtering, exporting, and state management
  • configuration-driven crawling via YAML and CLI

The goal is not to compete with large frameworks like Scrapy, but to demonstrate how a crawler system can be structured from first principles.

Features

  • Async BFS crawl with persistent queue and resume support
  • Page content exported to Markdown
  • YAML-based configuration with CLI overrides
  • Generic URL filtering via regex patterns (no site-specific logic)
  • Optional robots.txt enforcement
  • Retry with exponential backoff for transient failures
  • Docker ready

Architecture

The crawler is designed as a modular pipeline:

  • CLI / config.yaml provides runtime configuration
  • main.py initializes the crawler and loads state
  • CrawlerCore coordinates crawling logic
  • PageFetcher uses Playwright for rendering
  • UrlFilter enforces domain, language, and robots rules
  • CrawlState persists queue progress for resumable crawling
  • MarkdownExporter saves processed content
flowchart TD
    CLI[CLI / config.yaml] --> Main[main.py]
    Main --> Config[CrawlerConfig<br/>config.py]
    Config --> Crawler[Crawler<br/>core.py]
    Crawler --> State[CrawlState<br/>state.py]
    Crawler --> Filter[UrlFilter<br/>filters.py]
    Crawler --> Fetcher[PageFetcher<br/>fetcher.py]
    Crawler --> Exporter[MarkdownExporter<br/>exporter.py]
    Fetcher --> Browser[Playwright Chromium<br/>async API]
    State --> StateFile[crawler_state.json]
    Exporter --> Pages[output/pages/*.md]
    Filter --> Robots[robots.txt<br/>urllib.robotparser]
Loading

Data Flow

sequenceDiagram
    participant Main
    participant Crawler as Crawler (core.py)
    participant Fetcher as PageFetcher
    participant Browser as Playwright Browser
    participant Filter as UrlFilter
    participant State as CrawlState
    participant Disk

    Main->>State: load state (queue / visited / failed)
    Main->>Crawler: asyncio.run(crawler.run())

    loop BFS Queue not empty
        Crawler->>State: next_url()
        State-->>Crawler: url
        Crawler->>Fetcher: fetch(url)
        Fetcher->>Browser: goto(url)
        Browser-->>Fetcher: page loaded
        Fetcher->>Browser: inner_text("body")
        Fetcher->>Browser: evaluate all a[href] links
        Browser-->>Fetcher: text, html, [links]
        Fetcher-->>Crawler: FetchResult

        Crawler->>Disk: MarkdownExporter.write(url, text)

        loop Each extracted link
            Crawler->>Filter: check(link)
            Filter-->>Crawler: (True/False, reason)
        end

        Crawler->>State: add_urls(allowed_links)
        Crawler->>State: mark_visited(url)
        Crawler->>Disk: save crawler_state.json
    end

    Note over Crawler: Ctrl+C saves state and exits cleanly
Loading

Project Structure

site-crawler/
├── crawler/
│   ├── config.py      # YAML loader + dataclasses
│   ├── filters.py     # URL filter (domain, extension, regex, robots)
│   ├── state.py       # Async BFS state with JSON persistence
│   ├── fetcher.py     # Playwright async page fetcher
│   ├── exporter.py    # Markdown file writer
│   └── core.py        # Main async crawl loop
├── tests/
│   ├── test_filters.py
│   ├── test_state.py
│   └── test_exporter.py
├── main.py            # CLI entry point
├── config.yaml        # Default configuration
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

Setup

python -m venv .venv

# Windows
.venv\Scripts\pip install -r requirements.txt
.venv\Scripts\playwright install chromium

# macOS / Linux
.venv/bin/pip install -r requirements.txt
.venv/bin/playwright install chromium

Configuration

Edit config.yaml before running:

crawler:
  start_url: "https://example.com"
  allowed_domain: "example.com"
  max_pages: null          # null = unlimited

delays:
  min: 1.0
  max: 3.0

filters:
  respect_robots: true
  exclude_patterns: []     # regex — any match blocks the URL
  include_patterns: []     # regex whitelist — empty means allow all

Usage

# Use config.yaml defaults
python main.py

# Override start URL and domain
python main.py --url https://example.com --domain example.com

# Limit to 100 pages
python main.py --max-pages 100

# Add exclude patterns at runtime
python main.py --exclude "/login/" "/admin/" "/cart/"

# Show browser window
python main.py --headless false

# Ignore robots.txt
python main.py --ignore-robots

# Clear state and restart
python main.py --reset

Press Ctrl+C at any time — state is saved and the crawl resumes on the next run.

CLI Reference

Flag Default Description
--config config.yaml YAML config file path
--url from config Override start URL
--domain from config Override allowed domain
--max-pages null Stop after N pages
--min-delay from config Min delay between requests (s)
--max-delay from config Max delay between requests (s)
--headless true Pass false to show browser
--exclude Additional regex exclude patterns
--ignore-robots false Ignore robots.txt
--reset false Clear state and restart

URL Filtering

UrlFilter evaluates links in this order:

  1. Invalid — non-HTTP/HTTPS URLs or empty strings are rejected immediately
  2. Domain — only URLs matching allowed_domain (www. normalised) are followed
  3. Extension — configurable blocklist (.jpg, .pdf, .zip, etc.)
  4. Exclude patterns — any URL matching a regex in exclude_patterns is blocked
  5. Include patterns — if include_patterns is non-empty, URL must match at least one
  6. robots.txt — fetched once per domain and cached; checked if respect_robots: true

Error Handling

Error type Behaviour
HTTP 429 Long backoff (4× base delay), then re-queued
Playwright timeout Retry with exponential backoff (up to max_retries)
Any other exception Permanently failed — logged, not retried

Docker

# Build
docker build -t site-crawler .

# Run (edit config.yaml first)
docker run -v $(pwd)/output:/app/output -v $(pwd)/config.yaml:/app/config.yaml site-crawler

# Or with docker-compose
docker compose up

Output

output/
  pages/
    about_a1b2c3d4.md
    products_e5f6g7h8.md
    ...
crawler_state.json    # resume state
crawler.log           # full debug log

Each Markdown file has the format:

# https://example.com/about

<full page body text>

Log Diagnostics

Each crawled page logs link filtering stats:

Links: total=134 allowed=22 reasons={'allowed': 22, 'robots': 80, 'domain': 32}

Use this to diagnose why links are being filtered out.

Tests

pytest tests/ -v

All tests run without a real browser or network.

Tech Stack

Disclaimer

For educational and research use only. Always respect a website's terms of service and applicable laws before crawling.

About

Modular async web crawler framework using Playwright, BFS traversal, resumable state, and Markdown export pipeline.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors