A production-oriented bulk scanning utility for AdOps workflows that verifies the availability and integrity signals of ads.txt and app-ads.txt endpoints across large domain lists.
Note
This project is delivered as a Streamlit application focused on operational scanning and reporting rather than a standalone importable Python package.
- Features
- Tech Stack & Architecture
- Getting Started
- Testing
- Deployment
- Usage
- Configuration
- License
- Contacts & Community Support
- Bulk domain ingestion via multi-line textarea input (ideal for Excel/Sheets copy-paste workflows).
- Automated target selection between ads.txt and app-ads.txt through sidebar radio controls.
- Input sanitization routine that normalizes domains by stripping protocol and path fragments.
- HTTPS endpoint probing with explicit timeout control and hardened browser-like User-Agent headers.
- Soft-404 detection heuristic (flags responses whose body is HTML even when the HTTP status is 200).
- Response classification into operational categories: Found, Soft 404, Missing/Error, Timeout, Connection Error, Error.
- Line-count extraction for valid text-based ad declaration files.
- Real-time progress visualization for long-running scans with completion counters.
- Results materialization in a pandas-backed tabular report rendered in Streamlit.
- Direct click-through links to scanned target URLs in the UI table.
- CSV export pipeline for downstream QA, BI, or trafficking audits.
- At-a-glance KPI summary metrics: total scanned, files found, missing/errors.
- Built-in de-duplication of domains to reduce redundant network calls.
- Light pacing delay (time.sleep(0.1)) to reduce aggressive burst behavior on large scans.
Important
The scanner currently enforces HTTPS target URLs (for example, https://example.com/ads.txt). Domains that only serve files over HTTP may be marked as unavailable.
- Language: Python 3
- Web UI Framework: Streamlit
- HTTP Client: Requests
- Data Processing: pandas
- Packaging: requirements.txt for dependency pinning/installation
.
├── app.py              # Main Streamlit application: UI, scanning logic, reporting/export
├── requirements.txt    # Runtime dependencies
├── LICENSE             # MIT license
└── README.md           # Project documentation
- Single-file application architecture (app.py)
  - Minimizes onboarding friction.
  - Optimized for quick operational deployment.
- Synchronous scanning model
  - Simplifies request lifecycle and error handling.
  - Predictable behavior in Streamlit execution model.
- Soft 404 detection
  - Prevents false positives where a branded error page returns HTTP 200.
  - Uses lightweight HTML tag heuristics.
- DataFrame-first reporting layer
  - Provides a clean bridge between raw checks, visual table rendering, and CSV export.
- UI-centered workflow
  - Prioritizes non-engineering AdOps users who need immediate batch validation without CLI complexity.
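The soft 404 heuristic mentioned above can be illustrated with a minimal sketch. The marker set, the 2048-character inspection window, the non-empty line counting rule, and the helper names looks_like_html and count_lines are all assumptions for illustration; the actual checks in app.py may differ.

```python
# Assumed marker set; the real app may look for different tags.
HTML_MARKERS = ("<!doctype html", "<html", "<head", "<body")


def looks_like_html(body: str) -> bool:
    """Return True when the start of the body contains a typical HTML tag."""
    head = body[:2048].lower()  # inspect only the beginning of the response
    return any(marker in head for marker in HTML_MARKERS)


def count_lines(body: str) -> int:
    """Count non-empty lines (an assumed counting rule)."""
    return sum(1 for line in body.splitlines() if line.strip())


# Made-up sample payloads, used only to exercise the helpers.
sample_ads_txt = "example-exchange.com, 12345, DIRECT\n# contact line\n"
sample_error_page = "<!DOCTYPE html><html><body>Page not found</body></html>"

print(looks_like_html(sample_ads_txt), count_lines(sample_ads_txt))
print(looks_like_html(sample_error_page))
```

A tag-based check like this is cheap but not exhaustive: a plain-text error message served with HTTP 200 would still be classified as a valid file.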
flowchart TD
A[User inputs domain list] --> B[Normalize input lines]
B --> C[De-duplicate domains]
C --> D[Select target file ads.txt or app-ads.txt]
D --> E[Iterate domains]
E --> F[HTTP GET with timeout and User-Agent]
F --> G{Response handling}
G -->|200 + text file| H[Status Found + line count]
G -->|200 + HTML markers| I[Status Soft 404]
G -->|Non-200| J[Status Missing/Error]
G -->|Timeout/Connection exception| K[Status Timeout/Connection Error]
H --> L[Append result]
I --> L
J --> L
K --> L
L --> M[Build pandas DataFrame]
M --> N[Render metrics and table]
N --> O[Export CSV]
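The branches of the flowchart can be sketched as a single classification function. This is an offline illustration rather than the code in app.py: scan_domain, the Fetcher type, and fake_fetch are hypothetical names, and Python's built-in TimeoutError and ConnectionError stand in for the exceptions the real HTTP client (Requests, per the tech stack) would raise.

```python
from typing import Callable, Dict, Tuple

# Hypothetical fetcher type: takes a URL, returns (status_code, body_text).
Fetcher = Callable[[str], Tuple[int, str]]


def scan_domain(domain: str, file_type: str, fetch: Fetcher) -> Dict[str, object]:
    """Classify one domain following the branches of the flowchart."""
    url = f"https://{domain}/{file_type}"
    result = {"Domain": domain, "Status": "Error", "Code": None, "Lines": 0, "URL": url}
    try:
        code, body = fetch(url)
    except TimeoutError:
        result["Status"] = "Timeout"
        return result
    except ConnectionError:
        result["Status"] = "Connection Error"
        return result
    result["Code"] = code
    if code != 200:
        result["Status"] = "Missing/Error"
    elif "<html" in body.lower():  # simplified soft 404 check
        result["Status"] = "Soft 404"
    else:
        result["Status"] = "Found"
        result["Lines"] = len(body.splitlines())
    return result


def fake_fetch(url: str) -> Tuple[int, str]:
    """Offline stand-in for an HTTP GET so the sketch runs without network access."""
    if "slow.example" in url:
        raise TimeoutError
    if "soft.example" in url:
        return 200, "<html><body>Not found</body></html>"
    if "missing.example" in url:
        return 404, ""
    return 200, "line one\nline two\n"


for name in ("ok.example", "soft.example", "missing.example", "slow.example"):
    print(scan_domain(name, "ads.txt", fake_fetch))
```

Passing the fetcher as a parameter keeps the classification logic testable without network calls; in the application the fetcher would wrap the real HTTP GET with its timeout and User-Agent header.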
Tip
For very large lists, split scans into logical batches (for example, by account, GEO, or inventory type) to keep UI sessions responsive and reporting easier to audit.
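When no natural grouping such as account or GEO is available, fixed-size chunks are one simple alternative. The sketch below assumes a hypothetical helper, split_into_batches, that is not part of app.py; each batch would then be pasted and scanned separately.

```python
from typing import Iterator, List, Sequence


def split_into_batches(domains: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size domains."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(domains), batch_size):
        yield list(domains[start:start + batch_size])


# Placeholder domains, used only to show the chunking behavior.
domains = [f"site{i}.example" for i in range(7)]
for index, batch in enumerate(split_into_batches(domains, 3), start=1):
    print(index, batch)
```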
- Python 3.9+ (recommended).
- pip package manager.
- Outbound internet connectivity to target domains.
- Optional: virtual environment tooling (venv, virtualenv, or conda).
# 1) Clone repository
git clone https://github.com/<your-org>/ads.txt-app-ads.txt-Availability-Checker.git
cd ads.txt-app-ads.txt-Availability-Checker
# 2) (Recommended) Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows PowerShell
# 3) Install dependencies
pip install -r requirements.txt
# 4) Launch Streamlit app
streamlit run app.py

By default, Streamlit serves the app on http://localhost:8501.
The current repository snapshot has no dedicated automated test suite; the commands below still provide practical static and runtime sanity checks.
# Dependency integrity check
python -m pip check
# Basic syntax validation
python -m py_compile app.py
# Optional linting (if installed in your environment)
flake8 app.py
# Run application manually for functional testing
streamlit run app.py

Recommended manual validation matrix:
- A valid domain hosting ads.txt should return Found with a positive line count.
- A domain returning custom HTML at the target path should return Soft 404.
- A non-existent path or domain should return Missing/Error or Connection Error.
- A large input set should show progressive scan updates and permit CSV export.
Warning
flake8 is not pinned in requirements.txt; install it separately if your CI pipeline enforces linting.
For production-grade usage, deploy as a Streamlit service with process supervision and environment isolation.
pip install -r requirements.txt
streamlit run app.py --server.address 0.0.0.0 --server.port 8501

- Place behind a reverse proxy (Nginx/Caddy) for TLS termination and access control.
- Add process management with systemd, Docker restart policies, or platform-native supervisors.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address", "0.0.0.0", "--server.port", "8501"]

- Run python -m py_compile app.py as a minimum build gate.
- Run python -m pip check to detect dependency conflicts.
- Add optional lint stage (flake8) and image scanning for container pipelines.
Caution
Large domain scans can generate high outbound request volume. Ensure your infrastructure, egress policies, and target-site compliance constraints are reviewed before scaling scan frequency.
streamlit run app.py

- Select target file type (ads.txt or app-ads.txt) in the sidebar.
- Paste domains (one per line) in the input area.
- Click Start Scan.
- Review summary metrics and detailed table.
- Export results via Download Report (CSV).
The scanner normalizes inputs by removing protocol/path components.
Input: https://example.com/path/page
Output: example.com
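A minimal sketch of this kind of normalization, combined with the de-duplication step, is shown here. The helper names normalize_domain and unique_domains are hypothetical, and the lowercasing step is an assumption; the behavior of format_url in app.py may differ in detail.

```python
def normalize_domain(raw: str) -> str:
    """Strip whitespace, protocol, path, query, and fragment from one input line."""
    value = raw.strip()
    # Drop the scheme if present (http://, https://, or any other "<scheme>://").
    if "://" in value:
        value = value.split("://", 1)[1]
    # Keep only the host part: cut at the first path, query, or fragment delimiter.
    for delimiter in ("/", "?", "#"):
        value = value.split(delimiter, 1)[0]
    return value.lower()  # assumed: hostnames are compared case-insensitively


def unique_domains(lines):
    """Normalize every line, drop empty results, and de-duplicate in input order."""
    normalized = (normalize_domain(line) for line in lines)
    return list(dict.fromkeys(d for d in normalized if d))


print(normalize_domain("https://example.com/path/page"))  # example.com
print(unique_domains(["https://example.com/a", "example.com", "", "Other.org/x"]))
```

Using dict.fromkeys keeps the first occurrence of each domain in its original position, so the report order matches the pasted list.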
If you want to repurpose internal logic in custom scripts, the current app exposes helper functions in app.py:
from app import format_url, check_single_domain
# Normalize raw input into domain
domain = format_url("https://www.example.com/anything")
# Validate ads.txt availability for normalized domain
result = check_single_domain(domain, "ads.txt")
print(result)
# Expected shape:
# {
# "Domain": "www.example.com",
# "Status": "Found" | "Soft 404" | "Missing/Error" | ...,
# "Code": 200,
# "Lines": 42,
# "URL": "https://www.example.com/ads.txt"
# }

Note
Importing from app.py in external scripts will also evaluate Streamlit setup statements. For clean library-style reuse, consider extracting core scanner functions into a dedicated module (for example, scanner.py).
The current implementation is code-configured (no .env parser or CLI flags yet). The table below summarizes adjustable behavior and where to change it.
| Configuration Surface | Current Value | Location | Purpose |
|---|---|---|---|
| Streamlit page title | Bulk Ads.txt Scanner | st.set_page_config(...) | Browser tab/app identity |
| Streamlit layout | wide | st.set_page_config(...) | Wider results table rendering |
| Supported file targets | ads.txt, app-ads.txt | Sidebar st.radio | Select scanning endpoint type |
| Request timeout | 5 seconds | requests.get(..., timeout=5) | Prevent hanging connections |
| User-Agent | Chrome-like UA string | headers in check_single_domain | Improve compatibility with strict servers |
| Scan pacing | 0.1 seconds per domain | time.sleep(0.1) | Reduces aggressive burst traffic |
| Output filename | ads_scan_results_<file_type>.csv | st.download_button(...) | Standardized report naming |
No environment variables are required by default.
If you extend this project, consider adding:
- REQUEST_TIMEOUT_SECONDS
- SCAN_SLEEP_SECONDS
- DEFAULT_FILE_TYPE
- MAX_BATCH_SIZE
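One possible way to wire such variables in is sketched below. Nothing here exists in the project today: ScanSettings and load_settings are hypothetical names, the defaults mirror the values in the table above, and the MAX_BATCH_SIZE default of 500 is an arbitrary placeholder.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    request_timeout_seconds: float = 5.0  # mirrors the current timeout=5
    scan_sleep_seconds: float = 0.1       # mirrors the current time.sleep(0.1)
    default_file_type: str = "ads.txt"
    max_batch_size: int = 500             # assumed placeholder, not a project value


def load_settings(env=None) -> ScanSettings:
    """Build settings from a mapping of environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = ScanSettings()
    file_type = env.get("DEFAULT_FILE_TYPE", defaults.default_file_type)
    if file_type not in ("ads.txt", "app-ads.txt"):
        raise ValueError(f"Unsupported DEFAULT_FILE_TYPE: {file_type!r}")
    return ScanSettings(
        request_timeout_seconds=float(
            env.get("REQUEST_TIMEOUT_SECONDS", defaults.request_timeout_seconds)
        ),
        scan_sleep_seconds=float(env.get("SCAN_SLEEP_SECONDS", defaults.scan_sleep_seconds)),
        default_file_type=file_type,
        max_batch_size=int(env.get("MAX_BATCH_SIZE", defaults.max_batch_size)),
    )


# A plain dict stands in for os.environ so the example is reproducible.
print(load_settings({"REQUEST_TIMEOUT_SECONDS": "10", "DEFAULT_FILE_TYPE": "app-ads.txt"}))
```

Accepting the environment as a parameter makes the loader easy to exercise without mutating the real process environment.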
Streamlit supports runtime flags for host/port and server behavior:
streamlit run app.py --server.address 0.0.0.0 --server.port 8501

This project is licensed under the MIT License. See LICENSE for full terms.
If you find this tool useful, consider leaving a star on GitHub or supporting the author directly.