This is the artifact for the paper "Can Old Tests do New Tricks for Resolving SWE Issues?", accepted at FSE 2026.
We apply for the Functional, Reusable, and Available badges.
- Functional: Reviewers can run the complete TestLoc pipeline end-to-end on a single demo SWE-bench instance (e.g., `django__django-16145`) in approximately 3-5 minutes by following the Quick Start instructions below (this does not include SWE-bench image building time). The pipeline produces all expected outputs: suspicious functions, selected test files, coverage artifacts, and the final `minimized_tests.json`. An example output is provided in `example_output/`.
- Reusable: We support multiple LLMs, and the pipeline works on both the SWE-bench Verified and Lite datasets. The minimized regression tests can be integrated into any coding agent framework through the provided tool bundle (`tools/regression_tests/`), with SWE-agent as a documented example and instructions for adapting to other agents. The code exposes multiple parameters (minimization strategy, top-k test files, parallel workers) to support different experimental setups.
- Available: The artifact is publicly available on GitHub at https://github.com/IBM/Issue-Test-Localizer. Pre-computed results for all experiments in the paper are included in the repository (`tests/`, `results_logs/`), with additional detailed logs available via Google Drive.
TestLoc takes a GitHub issue and its codebase as input, and produces a minimized set of regression tests that cover the buggy code. The pipeline has 4 steps:
- Suspicious Function Localization (`testloc.py`): An LLM identifies the most relevant source files and functions related to the issue description.
- Test File Retrieval (`test_file_selection.py`): The LLM selects the top-k (default 10) test files most likely to exercise the suspicious functions.
- Coverage Generation (`generate_cov_batch.py`): The selected tests are executed inside SWE-bench Docker containers with `coverage run` and dynamic contexts, producing per-line coverage data that maps each source line to the test(s) that exercise it.
- Test Minimization (`test_minimization.py`): A greedy algorithm selects a minimal set of tests that covers all lines of the suspicious functions, with LLM-based tie-breaking when multiple tests have equal coverage.
The final output is `minimized_tests.json`: a small set of regression tests reduced from thousands in the original test suite.
Since running the full pipeline on an entire dataset (500 instances for SWE-bench Verified) can take approximately two days, here we provide instructions to demo the complete pipeline on a single instance (e.g., django__django-16145). Reviewers can follow these exact commands to reproduce a complete run in about 3-5 minutes.
- Python 3.10+
- Docker
- An Anthropic API key (or OpenAI API key)
git clone https://github.com/IBM/Issue-Test-Localizer.git
cd Issue-Test-Localizer
python3 -m venv testloc_venv
source testloc_venv/bin/activate
pip install -r requirements.txt
export ANTHROPIC_API_KEY=your_anthropic_api_key

python3 src/run_pipeline.py \
--instance_id django__django-16145 \
--model claude-sonnet-4-20250514 \
--dataset princeton-nlp/SWE-bench_Verified \
    --log_dir output

This single command runs all 4 steps sequentially. It takes approximately 3-5 minutes (most of the time is spent on Docker container setup and coverage collection in Step 3).
The pipeline script supports skip flags for incremental runs:
- `--skip_step1`: skip Step 1 if `suspicious_funcs.json` already exists
- `--skip_step2`: skip Step 2 if `test_file_selection.json` already exists
- `--skip_coverage`: skip Step 3 if coverage data already exists
An example output from running on `django__django-16145` is provided in `example_output/`. After the pipeline completes, you will see:
output/
suspicious_funcs.json Step 1 result: suspicious functions per instance
princeton-nlp/SWE-bench_Verified/ Step 1 logs
<model>/
inspector_[instance_id].log detailed localization log
<model>_results/
all_results_..._[instance_id].json full localization output
model_test_files/SWE-bench_Verified/ Step 2 results
<model>/
test_file_selection.json selected test files (top-10)
model_selected.json full LLM prompt + response
confirmed_suspicious_funcs.json confirmed functions
coverage/SWE-bench_Verified_<model>/ Steps 3+4 results
[instance_id]/ per-instance coverage artifacts
[instance_id]_coverage .coverage SQLite DB
.coveragerc coverage config used
eval.sh exact script that ran in Docker
test_output.txt full container stdout
instance.log coverage generation log
minimization_logs/
[instance_id].log minimization decision log
minimized_tests.json FINAL OUTPUT
The final output is `minimized_tests.json`:
{
"[instance_id]": [
"test_module1.TestClass.test_method1",
"test_module2.TestClass.test_method2",
...
]
}

Note that due to LLM nondeterminism, running the pipeline multiple times may produce slightly different results.
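The JSON above can be consumed with a few lines of standard-library Python. A minimal sketch (the instance id matches the demo instance, but the test names below are made-up placeholders, not real output):

```python
import json

# Illustrative minimized_tests.json-style payload; test names are placeholders.
raw = """
{
  "django__django-16145": [
    "admin_scripts.tests.ManageRunserver.test_one",
    "admin_scripts.tests.ManageRunserver.test_two"
  ]
}
"""

minimized = json.loads(raw)
for instance_id, tests in minimized.items():
    print(f"{instance_id}: {len(tests)} regression test(s)")
    for t in tests:
        print(f"  {t}")
```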
You can inspect the .coverage file to confirm it contains per-test dynamic contexts:
import coverage
c = coverage.Coverage(data_file="output/coverage/.../django__django-16145/django__django-16145_coverage")
c.load()
data = c.get_data()
print(f"Measured files: {len(data.measured_files())}")
target = "/testbed/django/core/management/commands/runserver.py"
contexts = data.contexts_by_lineno(target)
for lineno, tests in sorted(contexts.items()):
    tests = [t for t in tests if t]  # drop the empty default context
    if tests:
        print(f"  Line {lineno}: {tests}")

This shows which test functions exercise which lines of the suspicious file, which is the basis for the greedy minimization in Step 4.
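To see how such per-line contexts feed minimization, the `{line: tests}` mapping can be inverted into `{test: covered lines}`, which is the shape a set-cover minimizer consumes. A sketch with toy data (the test names are hypothetical):

```python
from collections import defaultdict

# Toy stand-in for coverage's contexts_by_lineno() output:
# {line_number: [dynamic contexts (test ids) that executed it]}
contexts = {
    10: ["tests.test_a", ""],  # "" is coverage's empty default context
    11: ["tests.test_a", "tests.test_b"],
    12: ["tests.test_b"],
}

# Invert to {test: set of covered lines}.
lines_by_test = defaultdict(set)
for lineno, tests in contexts.items():
    for t in tests:
        if t:  # skip the empty default context
            lines_by_test[t].add(lineno)

print(dict(lines_by_test))
```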
To run the full pipeline on all 500 SWE-bench Verified instances (or 300 Lite instances):
python3 src/run_pipeline.py \
--model claude-sonnet-4-20250514 \
--dataset princeton-nlp/SWE-bench_Verified \
--log_dir output \
    --max_workers 4

Or use the shell wrappers for more control:
# Step 1: Suspicious function localization (all instances)
bash src/get_suspicious_funcs.sh princeton-nlp/SWE-bench_Verified output claude-sonnet-4-20250514
# Steps 2-4: Test selection, coverage, minimization
bash src/minimization.sh \
output/suspicious_funcs.json \
output/minimized_tests.json \
princeton-nlp/SWE-bench_Verified \
output \
claude-sonnet-4-20250514 \
    greedy_additional

After TestLoc produces `minimized_tests.json`, the minimized regression tests can be integrated into coding agents as a tool that agents can call during issue resolution. This enables the agent to run regression tests before and after making changes, ensuring no existing functionality is broken.
Below we use SWE-agent as an example. To reuse the tests in other agents, follow the same recipe: provide `minimized_tests.json` to the agent environment, and add `run_tests` / `list_tests` tool commands using the tool bundle and system prompt provided here.
The tools/regression_tests/ directory contains a ready-to-use tool bundle that can be plugged into any agent framework:
tools/
regression_tests/ # Tool bundle
config.yaml # Tool definitions (run_tests, list_tests)
bin/run_tests # Runs regression tests for the current instance
bin/list_tests # Lists available regression tests
lib/test_runner.py # Core test runner with per-framework support
swe_agent_config.yaml # SWE-agent config with testing workflow instructions
The test runner automatically detects the correct test framework for the project under test.
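The detection logic itself lives in `lib/test_runner.py`; as an illustration of the general idea only (not the actual implementation), framework detection can be as simple as probing for marker files:

```python
from pathlib import Path

def guess_test_framework(repo_dir: str) -> str:
    """Heuristic sketch: map repo marker files to a test framework.
    Illustrative only; not the logic used in lib/test_runner.py."""
    repo = Path(repo_dir)
    if (repo / "manage.py").exists() or (repo / "tests" / "runtests.py").exists():
        return "django"
    if (repo / "pytest.ini").exists() or (repo / "conftest.py").exists():
        return "pytest"
    return "unittest"  # conservative fallback

# Example with a temporary directory standing in for /testbed
import tempfile, os
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "manage.py"), "w").close()
    print(guess_test_framework(d))  # django
```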
The tool bundle provides two commands for the agent:
- `run_tests`: Run the related regression tests for the current instance. Returns pass/fail status and detailed output. Can also accept specific test names: `run_tests test1 test2`.
- `list_tests`: List the available related regression tests for the current instance.
To integrate TestLoc's minimized tests into a different agent framework:
1. Provide the test data: make `minimized_tests.json` available inside the agent's environment (e.g., copy it to the Docker container at `/root/regression_tests.json`).
2. Set environment variables: `SWE_INSTANCE_ID` (current instance) and `SWE_REPO_DIR` (repo path, typically `/testbed`).
3. Register the tools: add `run_tests` and `list_tests` as callable tools. The scripts in `tools/regression_tests/bin/` can be invoked directly: they read the environment variables and JSON file set up in steps 1-2.
4. Update the system prompt: instruct the agent to run `run_tests` before and after making changes. See `tools/swe_agent_config.yaml` for the exact prompt wording used in our experiments.
testloc/
|-- src/
| |-- run_pipeline.py # End-to-end pipeline runner (recommended entry point)
| |-- testloc.py # Step 1 entry point
| |-- localization.py # Step 1 core: Inspector class, call graph, LLM prompting
| |-- test_file_selection.py # Step 2: LLM-based test file retrieval (top-k)
| |-- generate_cov_batch.py # Step 3: Docker-based coverage generation
| |-- test_minimization.py # Step 4: Greedy set-cover minimization
| |-- get_suspicious_funcs.sh # Shell wrapper for Step 1
| |-- minimization.sh # Shell wrapper for Steps 2-4
| |-- call_graph_generation.py # Tree-sitter call graph construction + traversal
| |-- model_config.py # LLM API interface (Claude, OpenAI)
| |-- prompts.py # All LLM prompt templates
| |-- query_cov.py # Coverage data query utility
| |-- utils.py # Shared utilities (repo prep, file I/O, AST helpers)
| |-- locate_tests.py # AST-based test method finder
| |-- patch_selection_multi.py # Patch validation via Docker
| |-- bm25.py # BM25 baseline for test ranking
|
|-- tools/
| |-- regression_tests/ # Tool bundle for running minimized tests in agents
| | |-- config.yaml # Tool definitions (run_tests, list_tests)
| | |-- bin/run_tests # Run regression tests for current instance
| | |-- bin/list_tests # List available regression tests
| | |-- lib/test_runner.py # Core test runner (multi-framework support)
| |-- swe_agent_config.yaml # SWE-agent config with testing workflow and prompts
|
|-- tests/ # Pre-computed evaluation results
| |-- regression_tests/ # Coverage result JSONs (per model x dataset)
| |-- reproduction_test/ # Reproduction test results by strategy
|
|-- results_logs/
| |-- addtional_results/ # Evaluation results from other agents
| | |-- sweagent/ # SWE-agent baseline (patches, logs, reports)
| | |-- trae_agent/ # Trae agent baseline (patches, logs, reports)
| |-- patch_ranking_logs/ # Patch ranking and selection logs for agentless
| |-- all_logs/ # Execution logs per instance
| |-- all_patches/ # Generated patch files (.jsonl)
| |-- evaluation_logs/ # Test execution results per model/strategy
| |-- ranks/ # Ranking outputs
|-- requirements.txt # Python dependencies
TestLoc implements two greedy set-cover variants:
Greedy additional (`--strategy greedy_additional`): each iteration re-ranks tests by how many remaining uncovered lines they cover:

while uncovered lines remain:
    select test(s) covering the most uncovered lines
    if tie (>3 candidates): use LLM to break tie
    remove covered lines from uncovered set
Greedy total (`--strategy greedy_total`): tests are ranked once by total coverage across all suspicious function lines:

sort tests by total coverage (descending, fixed)
while tests remain and no_improve < 3:
    select test with highest total coverage
    if cumulative coverage improves: reset no_improve counter
    else: increment no_improve
Select via `--strategy greedy_additional` or `--strategy greedy_total`.
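The greedy-additional loop above can be sketched as plain Python. This is a minimal sketch with toy coverage data, not the pipeline's implementation: the LLM tie-breaker is replaced by deterministic name order, and the test names are hypothetical.

```python
def greedy_additional(lines_by_test, target_lines):
    """Greedy set cover: repeatedly pick the test covering the most
    still-uncovered target lines. Ties are broken by test name here;
    the real pipeline can defer ties to an LLM instead."""
    uncovered = set(target_lines)
    selected = []
    while uncovered:
        best = max(
            lines_by_test,
            key=lambda t: (len(lines_by_test[t] & uncovered), t),
        )
        gain = lines_by_test[best] & uncovered
        if not gain:  # remaining tests add no new coverage; stop
            break
        selected.append(best)
        uncovered -= gain
    return selected

# Toy example: hypothetical tests covering lines of a suspicious function
cov = {
    "test_a": {1, 2, 3},
    "test_b": {3, 4},
    "test_c": {4, 5},
}
print(greedy_additional(cov, {1, 2, 3, 4, 5}))  # ['test_a', 'test_c']
```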
| Model | Environment Variable | Example Model Name |
|---|---|---|
| Claude (Anthropic) | `ANTHROPIC_API_KEY` | `claude-sonnet-4-20250514` |
| GPT-4o (OpenAI) | `OPENAI_API_KEY` | `gpt-4o` |
| Custom endpoint | `OPENAI_API_KEY` + `MODEL_SERVING_URL` | Any model name |
- `princeton-nlp/SWE-bench_Verified`
- `princeton-nlp/SWE-bench_Lite`
Pre-computed results from the paper's experiments are included in the repository:
- Regression tests (`tests/regression_tests/`): minimized regression test results for Claude and GPT-4o on both the SWE-bench Verified and Lite datasets.
- Reproduction tests from Otter (`tests/reproduction_test/`): reproduction test generation results across different strategies (full, none, patchLoc, planner, regression, testLoc), organized by model and dataset.
- Agent results (`results_logs/addtional_results/`): evaluation results from the SWE-agent and Trae agent baselines, including generated patches, execution logs, and evaluation reports.
- Agentless results (`results_logs/patch_ranking_logs/`): patch ranking and selection logs, including generated patches, test execution results, and final rankings per model/strategy.
- Detailed logs (Google Drive): full execution logs and intermediate data for all experiments.