Dead-code detection is easy to make look good on a toy repository.
Write a function, never call it, run a scanner, print the finding. That demo tells you almost nothing about whether the tool is useful in a real codebase.
Real Python projects are different. They have framework entrypoints, pytest fixtures, plugin loading, dynamic imports, decorators, command-line scripts, optional dependencies, generated modules, and objects accessed through strings. A scanner that ignores those patterns will produce impressive-looking output and then waste maintainers' time.
So we tried a stricter test: find dead code in mature open-source repositories, submit small cleanup pull requests, and see what maintainers actually merge.
This is not a universal benchmark and it is not an endorsement claim. It is a practical test of whether selected findings can survive review outside our own repository.
The Result
Eight Skylos-assisted cleanup PRs were merged across seven mature open-source projects.
| Project | PR | Final GitHub diff | What was removed |
|---|---|---|---|
| Black | psf/black#5041 | 3 files, 0 additions, 24 deletions | unused internal parsing and node helpers |
| Black | psf/black#5052 | 3 files, 0 additions, 36 deletions | unused token helpers, parser debug methods, and a stale attribute |
| Flagsmith | Flagsmith/flagsmith#6953 | 10 files, 0 additions, 56 deletions | unused exceptions, serializers, response classes, and helper code |
| pypdf | py-pdf/pypdf#3685 | 1 file, 0 additions, 4 deletions | unused reverse encoding dictionaries |
| mitmproxy | mitmproxy/mitmproxy#8136 | 8 files, 2 additions, 44 deletions | unused console helpers, bit utilities, and stale imports |
| NetworkX | networkx/networkx#8572 | 5 files, 1 addition, 31 deletions | an unused private function and unused imports |
| Optuna | optuna/optuna#6547 | 5 files, 2 additions, 37 deletions | unused helpers, a method, a constant, and unpacked variables |
| beets | beetbox/beets#6473 | 4 files, 0 additions, 38 deletions | unused plugin helpers and a dead database type |
Final merged diff across those PRs:
| Merged PRs | Files changed | Additions | Deletions | Net change |
|---|---|---|---|---|
| 8 | 39 | 5 | 270 | -265 |
Those numbers are intentionally modest. The goal was not to open giant cleanup PRs. The goal was to find small changes maintainers could review quickly.
What This Proves, And What It Does Not
It proves that selected Skylos-assisted findings can turn into real merged cleanup work in mature repositories.
It does not prove that every finding is correct. It does not prove that Skylos is better than every specialized tool on every repository. It does not mean Black, NetworkX, Optuna, mitmproxy, pypdf, beets, or Flagsmith use or endorse Skylos.
That distinction matters. Static-analysis marketing often overclaims. Maintainer review is useful because it creates an external check, but it is still only one kind of evidence.
The stronger claim is narrower:
If the scanner can produce small dead-code candidates that maintainers accept, it is finding real maintenance debt, not only benchmark artifacts.
Why Dead-Code Detection Gets Noisy
Python makes static dead-code detection difficult because code can be reached without a direct textual call.
The standard library supports programmatic imports through importlib.import_module(). Plugin systems often discover code at runtime through package metadata or entrypoints. Frameworks call functions because a decorator, route table, migration runner, CLI config, or test runner knows about them, not because another Python file contains a normal function call.
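A minimal sketch of why this defeats textual reference counting: the plugin table below is invented for illustration, but it mirrors the pattern. The target module and function names live in data, so no source file ever contains a direct call.

```python
import importlib

# Hypothetical plugin table: module and function names are data,
# so a textual scan finds no call site for the targets.
PLUGINS = {
    "json_report": ("json", "dumps"),
}

def run_plugin(name, *args):
    module_name, func_name = PLUGINS[name]
    module = importlib.import_module(module_name)  # resolved at runtime
    func = getattr(module, func_name)              # no direct reference anywhere
    return func(*args)

print(run_plugin("json_report", {"ok": True}))  # → {"ok": true}
```

From the scanner's point of view, `json.dumps` here has zero textual callers, yet removing the indirection would break the program.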
The Vulture documentation shows the same problem in a small example: a method reached through getattr() can still be reported as unused, and the recommended fix is a whitelist. That is not a knock on Vulture. It is the central problem every Python dead-code tool has to face.
A practical scanner needs to assume that some findings are wrong until proven otherwise.
The False-Positive Traps We Checked
Before opening a PR, we looked for the common traps that make dead-code findings unsafe.
Dynamic dispatch
A method with no direct call can still be reached through getattr(obj, name), a command registry, a serializer map, a plugin loader, or a framework callback.
This is why a pure "zero textual references" rule is not enough.
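A toy example of the pattern (the class and method names are hypothetical): the handler method has no literal call site anywhere, but it is live because dispatch happens by string.

```python
class CommandHandler:
    # No file contains a literal "handler.cmd_status()" call,
    # yet the method is reachable through name-based dispatch.
    def cmd_status(self):
        return "ok"

    def dispatch(self, command):
        method = getattr(self, f"cmd_{command}", None)
        if method is None:
            raise ValueError(f"unknown command: {command}")
        return method()

handler = CommandHandler()
print(handler.dispatch("status"))  # reaches cmd_status with zero textual calls
```

A "zero textual references" rule would flag cmd_status; a safe workflow has to recognize the dispatch pattern first.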
Public exports
A private helper with no references is one thing. A public symbol exported through __all__, documented API surface, package metadata, or a compatibility layer is different. Removing public API can be a breaking change even if the repository itself does not call it.
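One cheap guard is to parse a module's __all__ before trusting a finding. The sketch below handles only the common literal-list case; real packages sometimes build __all__ dynamically, which this will miss.

```python
import ast

def exported_names(source):
    """Return names listed in a literal __all__ assignment, if any."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    if isinstance(node.value, (ast.List, ast.Tuple)):
                        return {
                            elt.value
                            for elt in node.value.elts
                            if isinstance(elt, ast.Constant)
                        }
    return set()

src = 'def helper(): ...\ndef api(): ...\n__all__ = ["api"]\n'
print(exported_names(src))  # {'api'}: helper is a candidate, api is protected
```

Anything in the returned set should be treated as public surface and escalated to maintainers rather than deleted outright.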
Framework entrypoints
Django serializers, FastAPI routes, pytest fixtures, Celery tasks, Alembic migrations, Click commands, and Pydantic validators can look unused if the scanner does not understand the framework.
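The common shape behind all of these is a decorator that registers the function somewhere the framework can find it. A stripped-down sketch (not any specific framework's API):

```python
# Minimal registry in the style of a web framework's route table.
ROUTES = {}

def route(path):
    def register(func):
        ROUTES[path] = func   # the framework, not user code, will call func
        return func
    return register

@route("/health")
def health_check():           # looks unused to a naive reference count
    return {"status": "ok"}

def handle_request(path):
    return ROUTES[path]()

print(handle_request("/health"))
```

A scanner that does not treat the decorator as a registration will report health_check as dead, which is exactly the false positive maintainers punish.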
Tests and generated paths
Some helpers exist only for tests, docs, examples, optional integrations, or generated files. Those are not always dead. Sometimes they are just outside the first scan path.
Near-miss names
A dead-code candidate can sit next to a live symbol with a similar name. In pypdf, for example, reverse encoding dictionaries looked unused, but the related _pdfdoc_encoding_rev needed to stay because it was still used and exported.
What Happened In The Merged PRs
Black: unused parser helpers
The two Black PRs removed unused parser helpers, token helpers, debug methods, and one stale attribute. The first PR removed matches_grammar(), lib2to3_unparse(), is_function_or_class(), and an unused Deprecated warning class. The second removed unused token helpers, parser debug functions, and a stale was_checked attribute.
The useful lesson from Black was not that the codebase was messy. It is not. The lesson is that even tightly maintained projects can retain internal helpers after refactors.
Flagsmith: framework-shaped dead code
Flagsmith was a good test because framework-heavy application code creates more false-positive risk than a small library. The merged PR removed unused exception classes, serializers, response classes, a Pydantic model, and a helper function.
This is the kind of code that often survives because it is harmless: no tests fail, no user sees it, and nobody wants to manually audit every serializer and exception class after a feature changes.
pypdf: the importance of keeping similar live code
pypdf had several reverse encoding dictionaries that were unused. The cleanup was small, but the important part was restraint: a similar dictionary, _pdfdoc_encoding_rev, was kept because it was still used.
That is exactly the kind of near-miss a dead-code workflow has to catch before a PR is opened.
mitmproxy: console helpers and utility code
The mitmproxy PR removed old console helpers, bit utility code, and stale imports. The final diff included 2 additions because not every first-pass removal survived review. That is normal. A useful workflow should expect maintainer feedback and keep the diff conservative.
NetworkX: private helpers and unused imports
NetworkX merged removal of one unused private function and several unused imports. The PR also restored one candidate after maintainer feedback pointed to a broader area that deserved separate review.
That is a good outcome. The purpose of the tool is not to bulldoze code; it is to surface candidates maintainers can reason about.
Optuna: visualizations and internal helpers
Optuna removed unused distribution and visualization helpers, an unused method on a label encoder, an unused constant, and unused unpacked variables. In the PR log, the initial scan had 28 findings and 11 survived after filtering out 17 false positives.
That ratio is important. It shows why raw finding counts are not the product. The product is the filtered set a maintainer can trust.
beets: plugin convenience wrappers
The beets PR removed unused ListenBrainz helper wrappers, an unused static helper, a superseded MusicBrainz collection helper, and a database type that was never instantiated.
Plugin-heavy repositories are exactly where naive dead-code detection can become noisy, so the PR had to stay small and specific.
The Workflow That Made The PRs Reviewable
The rough process was:
- Run the scanner.
- Group findings by risk.
- Check direct references.
- Check exports and package metadata.
- Check framework and plugin entrypoints.
- Remove only candidates that still looked dead.
- Keep PRs small enough for a maintainer to review.
- Treat maintainer feedback as part of the validation loop.
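The grouping step above can be sketched as a simple triage pass. The Finding fields here are invented for illustration; Skylos's actual output format may differ.

```python
from dataclasses import dataclass

# Hypothetical finding record, not Skylos's real schema.
@dataclass
class Finding:
    symbol: str
    is_public: bool       # exported via __all__, docs, or metadata
    framework_hit: bool   # matches a known decorator/entrypoint pattern

def triage(findings):
    """Group candidates by risk so only the safest reach a PR."""
    safe, review, skip = [], [], []
    for f in findings:
        if f.framework_hit:
            skip.append(f)       # likely false positive, do not submit
        elif f.is_public:
            review.append(f)     # needs explicit maintainer agreement
        else:
            safe.append(f)       # private, no dynamic path found
    return safe, review, skip

findings = [
    Finding("_old_helper", False, False),
    Finding("api_func", True, False),
    Finding("celery_task", False, True),
]
safe, review, skip = triage(findings)
print([f.symbol for f in safe])  # ['_old_helper']
```

Only the safe bucket becomes a PR; the review bucket becomes a question to maintainers, and the skip bucket stays out of the diff entirely.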
That last step matters. If a maintainer says a symbol is still part of public API, the correct response is to restore it, not argue that the graph says zero references.
Why We Did Not Submit Everything
Some scans produced many more findings than the final PRs. That is expected.
For example, the PR log for Optuna records 28 findings, with 11 confirmed after removing 17 false positives. The pending Celery work started from 300 findings, narrowed to 56 verified true positives, and then selected a much smaller PR-sized subset.
Raw findings are useful for exploration. They are not automatically useful as pull requests.
For open source maintainers, the right unit is a small reviewable diff:
- no broad rewrites,
- no public API removals unless maintainers agree,
- no changes that require trust in a tool,
- no cleanup mixed with style churn,
- clear explanation of why each symbol appears unused.
Lessons For Dead-Code Tools
1. Precision matters more than volume
A tool that prints 500 findings, 300 of them questionable, will not be trusted. A tool that prints fewer findings but explains why they are likely safe to remove is more useful.
2. Framework awareness is not optional
Python applications do not call everything directly. Static analysis has to understand decorators, config files, entrypoints, tests, migrations, serializers, and plugin conventions.
3. Public API must be treated differently
Internal dead code and public unused code are not the same thing. A library may intentionally keep a symbol for downstream users even if its own tests do not reference it.
4. Maintainer review is a better test than a synthetic demo
A benchmark can tell you whether a detector catches expected cases. A merged cleanup PR tells you whether the output was useful to someone who owns the code.
5. The best workflow is conservative
The goal is not to delete the most code. The goal is to delete code that is actually safe to remove.
How To Try This On Your Repo
Install Skylos and run a local scan:
```shell
pip install skylos
skylos .
```
For a PR gate, start with changed-code review instead of blocking the whole repository at once:
```shell
skylos . --diff origin/main
```
If you are evaluating dead-code findings, do not start by deleting everything. Start with a small branch and ask:
- Is the symbol private or public API?
- Is it exported through __all__, package metadata, or docs?
- Is it registered through a framework, CLI, test runner, migration system, or plugin entrypoint?
- Is it called through getattr(), importlib, a registry, or string dispatch?
- Does removing it keep tests green?
- Would a maintainer understand the diff in under five minutes?
- Would a maintainer understand the diff in under five minutes?
If the answer is unclear, keep the code or mark it for manual review.
The Bottom Line
Dead-code detection should not be judged by the biggest number in a terminal table.
The useful question is simpler:
Can the tool produce cleanup work that survives review in real repositories?
For these eight PRs, the answer was yes. That is not the end of the argument, but it is a stronger signal than another toy benchmark.
If you want to inspect the proof directly, the merged PR list is here: Real-world Skylos results.