Exception from Stage qc-discretize@synthetic-data-gensql.csv #213

@neeshjaa

The modeling run ends with an exception from the polars library in a stage I haven't seen before:

Running stage 'qc-discretize@synthetic-data-gensql.csv':
> mkdir -p data/discretized/ && python scripts/discretize.py --real data/ignored.csv --synthetic data/synthetic-data-gensql.csv --schema data/loom-schema.json --real-disc data/discretized/ignored.csv --synthetic-disc data/discretized/synthetic-data-gensql.csv
Traceback (most recent call last):
  File "/home/ubuntu/GenSQL.structure-learning/scripts/discretize.py", line 65, in <module>
    main()
  File "/home/ubuntu/GenSQL.structure-learning/scripts/discretize.py", line 55, in main
    df_real_discretized, df_synthetic_discretized = discretize_quantiles(
  File "/nix/store/bva7dx67y2r6d8jb2a0ziwv6kn9in3wy-python3.10-lpm_discretize-0.0.1/lib/python3.10/site-packages/lpm_discretize/discretize.py", line 441, in discretize_quantiles
    discretization_functions = {
  File "/nix/store/bva7dx67y2r6d8jb2a0ziwv6kn9in3wy-python3.10-lpm_discretize-0.0.1/lib/python3.10/site-packages/lpm_discretize/discretize.py", line 442, in <dictcomp>
    column_name: get_quantile_based_discretization_function(
  File "/nix/store/bva7dx67y2r6d8jb2a0ziwv6kn9in3wy-python3.10-lpm_discretize-0.0.1/lib/python3.10/site-packages/lpm_discretize/discretize.py", line 105, in get_quantile_based_discretization_function
    for quantile in pl.Series(column).qcut(quantiles, include_breaks=True)
TypeError: Series.qcut() got an unexpected keyword argument 'include_breaks'
ERROR: failed to reproduce 'qc-discretize@synthetic-data-gensql.csv': failed to run: mkdir -p data/discretized/ && python scripts/discretize.py --real data/ignored.csv --synthetic data/synthetic-data-gensql.csv --schema data/loom-schema.json --real-disc data/discretized/ignored.csv --synthetic-disc data/discretized/synthetic-data-gensql.csv, exited with 1

The line in question appears to be here:

    column = [value for value in column if _is_number(value)]
    assert len(set(column)) > 1
    cutoffs = sorted(
        set(
            [
                quantile["break_point"]
                for quantile in pl.Series(column).qcut(quantiles, include_breaks=True)  # line 105 in the traceback
            ]
        )
    )

However, the polars documentation for the qcut() method does list an optional include_breaks argument, so it's not clear what the problem is.
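
One way to narrow this down might be to check, from inside the same environment that runs the stage, which polars version is installed and whether its Series.qcut() accepts include_breaks. A mismatch between the installed version and the version the online documentation describes is a possible explanation, but I have not confirmed it. The sketch below is only a diagnostic; accepts_keyword and report_qcut_support are hypothetical helper names, not part of polars or lpm_discretize.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # An explicitly named parameter that may be passed by keyword.
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all would also accept the keyword.
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


def report_qcut_support():
    """Print the installed polars version and whether qcut takes include_breaks."""
    try:
        import polars as pl
    except ImportError:
        print("polars is not installed in this environment")
        return None
    print("polars version:", getattr(pl, "__version__", "unknown"))
    supported = accepts_keyword(pl.Series.qcut, "include_breaks")
    print("Series.qcut accepts include_breaks:", supported)
    return supported


if __name__ == "__main__":
    report_qcut_support()
```

If this reports False, the installed polars simply does not expose that keyword, and the fix would presumably be on the packaging side (the polars version provided to lpm_discretize) rather than in scripts/discretize.py.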
