feat(python): disk-based package scanning (dist-info metadata)#155
swarit-stepsecurity wants to merge 3 commits into
Read installed Python packages from *.dist-info/METADATA and *.egg-info/PKG-INFO instead of running pip. Disk scan is the default; the legacy pip path is kept behind --legacy-python-scan / use_legacy_python_scan.
Pull request overview
This PR switches Python package discovery from command-based (pip/conda/uv list) to disk-based parsing of installed package metadata (*.dist-info/METADATA and *.egg-info/PKG-INFO), while preserving a legacy command path behind a config/CLI switch. It wires the new disk scanner into both community scan output and enterprise telemetry, and adds tests for the new detector and disk-mode venv scanning.
Changes:
- Added PythonDistDetector to walk site-packages/venvs and parse Name/Version from on-disk metadata.
- Wired disk scanning into community scan + enterprise telemetry, with --legacy-python-scan / use_legacy_python_scan to fall back.
- Added tests for metadata parsing, size caps, skip behavior, deduping, and disk-mode project scanning.
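
For readers unfamiliar with the on-disk format: METADATA and PKG-INFO start with an RFC 822-style header block, followed by a blank line and the long description. A minimal sketch of extracting Name/Version might look like the following. `parseNameVersion` is a hypothetical helper written for illustration, not the PR's actual parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseNameVersion (hypothetical) reads the header block of a METADATA or
// PKG-INFO file and returns the first Name and Version values it finds.
// It stops at the first blank line, which separates headers from the body.
func parseNameVersion(metadata string) (name, version string) {
	sc := bufio.NewScanner(strings.NewReader(metadata))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" {
			break // end of the header block
		}
		if line[0] == ' ' || line[0] == '\t' {
			continue // folded continuation of the previous header
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a "Key: value" line
		}
		switch strings.ToLower(strings.TrimSpace(key)) {
		case "name":
			if name == "" {
				name = strings.TrimSpace(value)
			}
		case "version":
			if version == "" {
				version = strings.TrimSpace(value)
			}
		}
		if name != "" && version != "" {
			break // both fields found; no need to read further
		}
	}
	return name, version
}

func main() {
	sample := "Metadata-Version: 2.1\nName: requests\nVersion: 2.31.0\n\nName: not-a-header\n"
	name, version := parseNameVersion(sample)
	fmt.Println(name, version)
}
```

Stopping at the blank line keeps the parser from misreading `Name:`-like text inside the package description, and it means only the small header needs to be scanned.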
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/telemetry/telemetry.go | Switch enterprise global + project Python scanning to disk mode by default, with legacy fallback. |
| internal/scan/scanner.go | Switch community Python package listing + project scans to disk mode by default, with legacy fallback. |
| internal/detector/pythonscan.go | Add enterprise disk-based global package scan returning existing PythonScanResult shape. |
| internal/detector/pythonproject.go | Add WithDiskScan to use disk metadata for per-venv package listing. |
| internal/detector/pythondist.go | New disk-based metadata walker/parser and global root discovery. |
| internal/detector/pythondist_test.go | New unit tests for the dist-info/egg-info scanner and disk-mode project listing. |
| internal/config/config.go | Add persisted use_legacy_python_scan config and display it in ShowConfigure(). |
| internal/cli/cli.go | Add --legacy-python-scan and --disk-python-scan flags. |
| cmd/stepsecurity-dev-machine-guard/main.go | Plumb CLI override into global config.UseLegacyPythonScan. |
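
The table mentions two CLI flags plus a persisted setting, so some precedence rule is needed. The sketch below shows one plausible way to resolve them. The function name `resolveLegacyScan` and the choice to reject both flags together are assumptions for illustration; the PR does not state how a conflict is handled.

```go
package main

import (
	"flag"
	"fmt"
)

// resolveLegacyScan (hypothetical) decides whether the legacy scan runs.
// persisted mirrors the stored use_legacy_python_scan value; either flag
// overrides it for a single run. Treating both flags together as an error
// is an assumption of this sketch.
func resolveLegacyScan(persisted, legacyFlag, diskFlag bool) (bool, error) {
	switch {
	case legacyFlag && diskFlag:
		return false, fmt.Errorf("--legacy-python-scan and --disk-python-scan are mutually exclusive")
	case legacyFlag:
		return true, nil
	case diskFlag:
		return false, nil
	default:
		return persisted, nil
	}
}

func main() {
	fs := flag.NewFlagSet("scan", flag.ContinueOnError)
	legacy := fs.Bool("legacy-python-scan", false, "use the legacy command-based scan")
	disk := fs.Bool("disk-python-scan", false, "force the disk-based scan")
	if err := fs.Parse([]string{"--disk-python-scan"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The persisted config says "legacy", but the CLI flag wins for this run.
	useLegacy, err := resolveLegacyScan(true, *legacy, *disk)
	fmt.Println(useLegacy, err)
}
```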

    if home, err := os.UserHomeDir(); err == nil && home != "" {
        addGlob(filepath.Join(home, ".local", "lib", "python*", "site-packages"))
        add(filepath.Join(home, ".local", "share", "pipx", "venvs"))
        addGlob(filepath.Join(home, ".pyenv", "versions", "*", "lib", "python*", "site-packages"))
    }


    switch runtime.GOOS {
    case "darwin":
        addGlob("/opt/homebrew/lib/python*/site-packages")
        addGlob("/usr/local/lib/python*/site-packages")
        addGlob("/Library/Frameworks/Python.framework/Versions/*/lib/python*/site-packages")
        if home, err := os.UserHomeDir(); err == nil && home != "" {
            addGlob(filepath.Join(home, "Library", "Python", "*", "lib", "python", "site-packages"))
        }
    case "linux":
        addGlob("/usr/lib/python*/dist-packages")
        addGlob("/usr/lib/python*/site-packages")
        addGlob("/usr/lib/python3/dist-packages")
        addGlob("/usr/local/lib/python*/dist-packages")
        addGlob("/usr/local/lib/python*/site-packages")
    }


    // readBounded reads path through the executor and rejects files over the size
    // cap. The metadata header we parse is tiny; the cap only guards memory.
    func (d *PythonDistDetector) readBounded(path string) ([]byte, error) {
        data, err := d.exec.ReadFile(path)
        if err != nil {
            return nil, err
        }
        if d.maxFileSize > 0 && int64(len(data)) > d.maxFileSize {
            d.log.Debug("python dist scan: %s exceeds %d bytes — skipping", path, d.maxFileSize)
            return nil, fmt.Errorf("file %s exceeds max size %d", path, d.maxFileSize)
        }
        return data, nil
    }


    func (d *PythonDistDetector) ScanVenv(venvPath string) []model.PackageDetail {
        return d.ScanRoots([]string{venvPath})
    }

Addressed the review comments in d85e9c5:
- PythonGlobalRoots anchors per-user paths on the console user via executor.ResolveHome (falling back to os.UserHomeDir), so the root/launchd agent scans the logged-in user's ~/.local, ~/.pyenv, pipx.
- readBounded stats file size before reading to avoid large allocations, keeping the post-read length check as a race-safety fallback.
- ScanVenv limits its walk to the venv's site-packages dirs instead of the whole tree.

build/vet/test green.
Read installed Python packages from *.dist-info/METADATA and *.egg-info/PKG-INFO on disk instead of running pip list / conda list / uv pip list.
- PythonDistDetector walks site-packages + venvs and parses package name/version from metadata. No package-manager subprocess.
- Disk scan is the default; the legacy command path stays behind --legacy-python-scan / use_legacy_python_scan.
- Enterprise scanning returns the existing PythonScanResult shape (JSON in raw_stdout_base64), so no backend change.
- … .local/.pyenv/etc. now discovered) and removes the Apple-CLT-stub / --without-pip / timeout failure modes.

Tests added for the new detector and disk-mode project listing; build/vet/test green.