Description
Environment
SOMEF version: 0.9.13 (installed via pip install "somef>=0.9.0")
Python: 3.11.14
OS: Ubuntu (GitHub Actions ubuntu-latest)
Install method: pip install somef
Summary
When calling somef describe with --ignore_classifiers, SOMEF still validates that classifier model paths exist in ~/.somef/config.json and aborts with an error if they are missing. This makes it impossible to use --ignore_classifiers as a lightweight/portable mode that avoids the need to run somef configure first.
Steps to reproduce
1. Install SOMEF fresh (no prior somef configure run)
pip install somef
2. Create a minimal config with only the auth token
mkdir -p ~/.somef
echo '{"Authorization": "token <your_token>"}' > ~/.somef/config.json
3. Create a test repo directory
mkdir -p /tmp/test_repo
printf '# My Project\n\n## Installation\n\npip install myproject\n\n## Usage\n\nmyproject --help\n' > /tmp/test_repo/README.md
4. Run with --ignore_classifiers (should skip all classifier-related steps)
somef describe -l /tmp/test_repo -o /tmp/out.json -t 0.8 --ignore_classifiers
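The same setup can be scripted so that it does not touch the real home directory. This is a minimal sketch, not part of SOMEF: the helper name `prepare_repro` is hypothetical, and redirecting `HOME` assumes SOMEF resolves `~/.somef/config.json` through that environment variable, which has not been verified against its source.

```python
import json
import os
from pathlib import Path

# Same README content as in step 3
README = (
    "# My Project\n\n## Installation\n\npip install myproject\n\n"
    "## Usage\n\nmyproject --help\n"
)


def prepare_repro(root: Path):
    """Create a token-only config and a one-file test repo under root.

    Returns (command, env), ready to be passed to subprocess.run.
    """
    home = root / "home"
    repo = root / "test_repo"
    (home / ".somef").mkdir(parents=True, exist_ok=True)
    repo.mkdir(parents=True, exist_ok=True)

    # Config with only the auth token, no classifier paths
    config = {"Authorization": "token <your_token>"}
    (home / ".somef" / "config.json").write_text(json.dumps(config))
    (repo / "README.md").write_text(README)

    command = [
        "somef", "describe",
        "-l", str(repo),
        "-o", str(root / "out.json"),
        "-t", "0.8",
        "--ignore_classifiers",
    ]
    # Assumption: SOMEF locates ~/.somef via HOME
    env = dict(os.environ, HOME=str(home))
    return command, env
```

Passing the returned values to `subprocess.run(command, env=env)` inside a temporary directory should reproduce the abort described below, provided the `HOME` assumption holds.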
Expected behaviour
--ignore_classifiers should make SOMEF skip all classifier-related code paths, including the config.json validation for classifier paths. The run should complete successfully using only header-based extraction, CITATION.cff parsing, and license/language detection — without requiring classifier model files.
Actual behaviour
SOMEF processes the README (header extraction runs, text is split) and then aborts before writing the output file:
INFO-Extracting information using headers
INFO-Labeling headers.
INFO-Header information extracted.
INFO-Splitting text into valid excerpts for classification
INFO-Extraction of bibtex citation from readme completed.
INFO-Text Successfully split.
Error: Category description file path not present in config.json
No output file is created. Exit code is non-zero.
Key observation
The error fires after "Text Successfully split.", meaning SOMEF completes all the non-classifier extraction steps successfully and then hits the classifier config validation right before it would write the output file. The --ignore_classifiers flag appears to skip applying the classifiers but does not skip validating their config entries.
This makes --ignore_classifiers unusable in practice: if the classifier paths are already in config.json, you don't need --ignore_classifiers; but if they're missing (exactly the case where you'd want to use the flag), the run aborts anyway.
Workaround (currently using)
We locate the .pickle files bundled inside the installed SOMEF package and write their paths to config.json programmatically before running:
import json
from pathlib import Path

import somef

# Classifier pickles bundled inside the installed package
pkg_classifiers = Path(somef.__file__).parent / "classifiers"
config = {}
for cat in ["description", "invocation", "installation", "citation"]:
    pkl = pkg_classifiers / f"{cat}.pickle"
    if pkl.exists():
        config[cat] = str(pkl)
# Note: this overwrites any existing config.json
(Path.home() / ".somef" / "config.json").write_text(json.dumps(config))
This is fragile and relies on the .pickle files being bundled in the package distribution (which they currently are, but may not always be).
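The snippet above replaces config.json with only the classifier entries, so a token written earlier would be dropped. A slightly more defensive variant merges into the existing file instead. This is a sketch under the same assumptions as the workaround (the category keys and the `classifiers` directory layout); the helper name `add_classifier_paths` is hypothetical.

```python
import json
from pathlib import Path

# Category keys as used in the workaround; assumed to match SOMEF's config keys.
CATEGORIES = ("description", "invocation", "installation", "citation")


def add_classifier_paths(config_path: Path, classifier_dir: Path,
                         categories=CATEGORIES) -> dict:
    """Merge classifier pickle paths into config_path, keeping existing keys."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text())

    for cat in categories:
        pkl = classifier_dir / f"{cat}.pickle"
        if pkl.exists():
            # setdefault leaves an already configured path untouched
            config.setdefault(cat, str(pkl))

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2))
    return config
```

It would be called as `add_classifier_paths(Path.home() / ".somef" / "config.json", Path(somef.__file__).parent / "classifiers")`. It remains a workaround: it still depends on the pickle files shipping with the package.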
Suggested fix
In the CLI entry point (or wherever config.json is read), skip classifier path validation when --ignore_classifiers is True:
if not ignore_classifiers:
    if "description" not in config:
        click.echo("Error: Category description file path not present in config.json")
        return
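The same guard can be written as a small standalone helper that covers every classifier category, which also makes the behaviour easy to unit test. This is only an illustration: `missing_classifier_paths` is a hypothetical name, and the category list is taken from the workaround in this issue rather than from SOMEF's source.

```python
# Category keys as used in the workaround; assumed, not taken from SOMEF's source.
CLASSIFIER_CATEGORIES = ("description", "invocation", "installation", "citation")


def missing_classifier_paths(config, ignore_classifiers,
                             categories=CLASSIFIER_CATEGORIES):
    """Return the classifier categories whose path is absent from config.

    When ignore_classifiers is True nothing is validated, so the result
    is always an empty list.
    """
    if ignore_classifiers:
        return []
    return [cat for cat in categories if cat not in config]


token_only = {"Authorization": "token <your_token>"}

# With the flag set, a token-only config passes validation
print(missing_classifier_paths(token_only, ignore_classifiers=True))
# → []

# Without the flag, all four classifier entries are reported as missing
print(missing_classifier_paths(token_only, ignore_classifiers=False))
# → ['description', 'invocation', 'installation', 'citation']
```

The CLI would then abort only when the returned list is non-empty, so a token-only config.json is accepted whenever --ignore_classifiers is passed.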