Remove loom from the build. by LeifAndersen · Pull Request #219 · probcomp/GenSQL.structure-learning

LeifAndersen · 2025-08-25T21:32:14Z

Since it doesn't seem to work.

1. Does NOT use loom. Loom seems to be broken at the moment.
2. Requires the model name to be `0.edn`, not `sample.0.edn`.
3. Is only tested in one container.

Yes, it's non-standard, but our pipeline produces it.

They aren't valid JSON.

It requires NixOS to run, which is an unrealistic requirement.

In that case we can't learn anything, but that shouldn't bring the pipeline to a halt.

(Next commit will test this out.)

Still working on proper tests.

I logged it and switched the dependency to pivoted.csv.

The pivot script handles schemas for unpivoted data. Putting it there throws off the `AST` step.

A println inserted in the script was the problem. Fixed now.

This premodel (or shadow model) is not as accurate as the real one, but the key is that it's significantly faster to compute.

While OneHotEncoder is better, it doesn't work with the older version of scikit-learn used in the image. So for this simple purpose, we can get away with using `get_dummies` directly.

Previously, when we called `dropna`, the resulting index changes completely threw off the result. Now we're dropping the NA rows entirely.

We were filtering out the pre-pivoted schema. If we filter out the post-pivot one instead, the guessing process seems to be up to 50% faster.

The implementation for guess_schema is forthcoming.

On the upside, it's a lot faster, but it is not as robust as the Clojure method.

Otherwise we can't call the munge function.

When requested, the Python predictor can now short-circuit the Clojure schema predictor, taking run times from minutes (or hours) down to just a few seconds.

They were wrong: nominal -> numerical, categorical -> nominal.

The entire data structure was being converted to a string before being emitted, which caused an out-of-memory error on large (bigger than 1 GB) data structures. We can, however, simply write the structure out to the file rather than first converting it to a string in memory.
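The difference between the two approaches can be sketched in Python. This is only an illustration: the change itself is presumably in the pipeline's Clojure code, JSON stands in for whatever format the pipeline actually emits, and the function names `write_via_string` and `write_streaming` are hypothetical.

```python
import json
import os
import tempfile


def write_via_string(data, path):
    """Build the whole serialized document in memory, then write it."""
    text = json.dumps(data)  # the full string exists in memory here
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def write_streaming(data, path):
    """Serialize straight to the file, so no full-size string is built."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)  # writes the encoder's output chunk by chunk


data = {"rows": [{"id": i, "value": i * 0.5} for i in range(1000)]}
with tempfile.TemporaryDirectory() as tmp:
    string_path = os.path.join(tmp, "via_string.json")
    stream_path = os.path.join(tmp, "streamed.json")
    write_via_string(data, string_path)
    write_streaming(data, stream_path)
    with open(string_path, encoding="utf-8") as a, open(stream_path, encoding="utf-8") as b:
        same_output = a.read() == b.read()
    with open(stream_path, encoding="utf-8") as handle:
        round_tripped = json.load(handle)
```

Both functions produce the same file; the streaming version simply avoids holding a second, string-sized copy of the data while writing.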

While ensemble/ensemble is usually better, in this case we don't need the extra safety checks. Additionally, larger models will cause the checks to run out of memory.

Leif Andersen added 30 commits August 5, 2025 13:02

Get the vm working in a docker container:

9b0e4b3

1. Does NOT use loom. Loom seems to be broken at the moment.
2. Requires the model name to be `0.edn`, not `sample.0.edn`.
3. Is only tested in one container.

cheshire should allow NaN in json.

2068587

Yes, it's non-standard, but our pipeline produces it.

Handle NaN as strings.

58f41d2

They aren't valid JSON.
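The NaN issue behind these two commits can be illustrated with Python's standard `json` module. This is a sketch, not the pipeline's Clojure/cheshire code: by default Python emits a bare `NaN` token, which strict JSON parsers reject, and one workaround is to encode NaN as a string. The helper name `nan_to_str` is hypothetical.

```python
import json
import math


def nan_to_str(value):
    """Recursively replace float NaN with the string "NaN" (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    if isinstance(value, dict):
        return {key: nan_to_str(item) for key, item in value.items()}
    if isinstance(value, list):
        return [nan_to_str(item) for item in value]
    return value


record = {"age": float("nan"), "scores": [1.0, float("nan")]}

# Python's default output contains bare NaN tokens, which are not valid JSON.
lenient = json.dumps(record)

# With allow_nan=False the encoder refuses NaN instead of emitting it.
try:
    json.dumps(record, allow_nan=False)
    strict_failed = False
except ValueError:
    strict_failed = True

# After the conversion, the output passes the strict encoder.
strict = json.dumps(nan_to_str(record), allow_nan=False)
```

The cost of the string encoding is that consumers must know to map `"NaN"` back to a float themselves.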

Remove qc and vegalite from pipeline.

42919e4

It requires NixOS to run, which is an unrealistic requirement.

If the subsample is small, we should use all of it for the test.

183664c

Fix some bugs in the predict script

8871030

Script also properly detects features that should be ignored.

e1c4e14

Sometimes there is no nominal data in the train dataset.

17459e0

In that case we can't learn anything, but that shouldn't bring the pipeline to a halt.

If there is only one type of stats, we should evaluate to NaN.

92803e7

Removed the lock file.

e8ba795

UNTESTED: Add pivot script.

34c3ee1

(Next commit will test this out.)

Fixed a few obvious typos in pivot.

cf35d74

Still working on proper tests.

Pivot script runs as part of the pipeline now.

5274597

Need to thread the shrunk dataframe.

4a722a0

Validate script had a hidden dependency on data.csv.

4a05c71

I logged it and switched the dependency to pivoted.csv.

Allow for multiple keys and add ignore list.

beaef9e

Track the pivots in the schema

906b204

reverting changes to params that snuck in.

dab333c

add preschema to data's gitignore.

370450e

Shouldn't pivot columns that have been purged.

30de427

Actually use the seed passed in the script.

143383b

Change the function call too.

0a99e24

Should not put the pre-pivoted schema once it's been pivoted.

04ead69

The pivot script handles schemas for unpivoted data. Putting it there throws off the `AST` step.

fix: Re-add vegalite to the pipeline.

2661b83

A println inserted in the script was the problem. Fixed now.

fix: Don't cut off names in labels.

5725dc2

feat: Add a separate (optional) step to make a premodel

b4f1b13

This premodel (or shadow model) is not as accurate as the real one, but the key is that it's significantly faster to compute.

fix: Use get_dummies from pandas.

d2067f9

While OneHotEncoder is better, it doesn't work with the older version of scikit-learn used in the image. So for this simple purpose, we can get away with using `get_dummies` directly.
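For readers unfamiliar with the pandas route, a minimal sketch of one-hot encoding with `pd.get_dummies` follows. The toy data and column names are made up for illustration and are not taken from the premodel script.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# One-hot encode the nominal column; other columns pass through unchanged.
# dtype=int keeps the output as 0/1 integers across pandas versions.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

One trade-off to keep in mind: `get_dummies` derives its columns from whatever values appear in the frame it is given, so frames encoded separately may need their columns aligned afterwards.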

feat: update premodel to use isolationforest and localoutliers

289fa30

fix: Align premodel output rows.

f3235cd

Previously, when we called `dropna`, the resulting index changes completely threw off the result. Now we're dropping the NA rows entirely.
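One common way a `dropna` index change can misalign results is sketched below. Whether this is the exact failure that occurred in the premodel script is an assumption; the data and the `scores` list are stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0]})
clean = df.dropna()  # index is now [0, 2, 3], with a gap at 1

# Stand-in for per-row model output: one score per surviving row.
scores = [0.1, 0.3, 0.4]

# Pitfall: a fresh Series gets index [0, 1, 2], so assignment aligns by label
# and the scores land on the wrong rows (or become NaN).
misaligned = clean.assign(score=pd.Series(scores))

# Reusing the surviving rows' index keeps each score with its own row.
aligned = clean.assign(score=pd.Series(scores, index=clean.index))
```

Calling `reset_index(drop=True)` on the cleaned frame before scoring is another way to sidestep the gap, at the cost of losing the original row labels.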

chore: Use seed in premodel random_state.

d49a3bd

Leif Andersen added 11 commits September 12, 2025 13:09

fix: Speed up guess-schema.

f82be7b

We were filtering out the pre-pivoted schema. If we filter out the post-pivot one instead, the guessing process seems to be up to 50% faster.

feat: Add documentation for guess_schema option in params.

96f204d

The implementation for guess_schema is forthcoming.

feat: Can compute schema in python stage.

0cf614a

On the upside, it's a lot faster, but it is not as robust as the Clojure method.

fix: munge -> munge_keys

33ade06

Otherwise we can't call the munge function.

chore: Python schemapredictor short-circuits Clojure

781afb5

When requested, the Python predictor can now short-circuit the Clojure schema predictor, taking run times from minutes (or hours) down to just a few seconds.
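To give a sense of why a Python-side guess can be fast but less robust, here is a rough sketch of a per-column heuristic. Everything in it is an assumption made for illustration: the rules, the `max_categories` threshold, the type labels, and both function bodies are hypothetical and are not the PR's implementation.

```python
import csv
import io


def guess_column_type(values, max_categories=20):
    """Guess a type for one column from its string values.

    Hypothetical heuristic: all-numeric columns are "numerical", columns with
    few distinct values are "nominal", anything else is "ignore".
    """
    present = [v for v in values if v != ""]
    if not present:
        return "ignore"
    try:
        for v in present:
            float(v)
        return "numerical"
    except ValueError:
        pass
    if len(set(present)) <= max_categories:
        return "nominal"
    return "ignore"


def guess_schema(csv_text, max_categories=20):
    """Return {column name: guessed type} for CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {
        name: guess_column_type([row[name] for row in rows], max_categories)
        for name in columns
    }


sample = "id,age,color\na1,34,red\nb2,,blue\nc3,51,red\n"
schema = guess_schema(sample, max_categories=2)
```

A single pass with fixed rules like this is cheap, but it can misjudge columns such as numeric codes that are really categories, which is the kind of robustness a slower method may handle better.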

fix: Change gensql types in pivot script

f98fdd5

They were wrong: nominal -> numerical, categorical -> nominal.

fix: Pass params file into premodel script

ae24b44

fix: IsolationForest does not use random_state

851fbef

fix: Call constructor directly.

53612a1

While ensemble/ensemble is usually better, in this case we don't need the extra safety checks. Additionally, larger models will cause the checks to run out of memory.

Store the fully pivoted CSV file for future scripts.

6e2bc04

LeifAndersen wants to merge 41 commits into probcomp:main from LeifAndersen:dstop2