Open
41 commits
9b0e4b3
Get the vm working in a docker container:
Aug 5, 2025
2068587
cheshire should allow NaN in json.
Aug 5, 2025
58f41d2
Handle NaN as strings.
Aug 5, 2025
42919e4
Remove qc and vegalite from pipeline.
Aug 6, 2025
183664c
If the subsample is small, we should use all of it for the test.
Aug 7, 2025
8871030
Fix some bugs in the predict script
Aug 8, 2025
e1c4e14
Script properly also detects features that should be ignored.
Aug 8, 2025
17459e0
Sometimes there is no nominal data in the train dataset.
Aug 8, 2025
92803e7
If only one type of stats, we should evaluate to nan.
Aug 10, 2025
e8ba795
Removed the lock file.
Aug 19, 2025
34c3ee1
UNTESTED: Add pivot script.
Aug 20, 2025
cf35d74
Fixed a few obvious typos in pivot.
Aug 20, 2025
5274597
Pivot script runs as part of the pipeline now.
Aug 20, 2025
4a722a0
Need to thread the shrunk dataframe.
Aug 23, 2025
4a05c71
Validate script had hidden dependency to data.csv.
Aug 25, 2025
beaef9e
Allow for multiple keys and add ignore list.
Aug 25, 2025
906b204
Track the pivots in the schema
Aug 25, 2025
dab333c
reverting changes to params that snuck in.
Aug 25, 2025
370450e
add preschema to data's gitignore.
Aug 25, 2025
30de427
Shouldn't pivot columns that have been purged.
Aug 25, 2025
143383b
Actually use the seed passed in the script.
Aug 27, 2025
0a99e24
Change the function call too.
Aug 27, 2025
04ead69
Should not put the pre-pivoted schema once it's been pivoted.
Aug 27, 2025
2661b83
fix: Re-add vegalite to the pipeline.
Aug 28, 2025
5725dc2
fix: Don't cut off names in labels.
Sep 3, 2025
b4f1b13
feat: Add a separate (optional) step to make a premodel
Sep 6, 2025
d2067f9
fix: Use get_dummies from pandas.
Sep 6, 2025
289fa30
feat: update premodel to use isolationforest and localoutliers
Sep 11, 2025
f3235cd
fix: Align premodel output rows.
Sep 11, 2025
d49a3bd
chore: Use seed in premodel random_state.
Sep 11, 2025
f82be7b
fix: Speed up guess-schema.
Sep 12, 2025
96f204d
feat: Add documentation for guess_schema option in params.
Sep 12, 2025
0cf614a
feat: Can compute schema in python stage.
Sep 12, 2025
33ade06
fix: munge -> munge_keys
Sep 12, 2025
781afb5
chore: Python schemapredictor short-circuits clojure
Sep 12, 2025
f98fdd5
fix: Change gensql types in pivot script
Sep 16, 2025
ae24b44
fix: Pass params file into premodel script
Sep 17, 2025
851fbef
fix: IsolationForest does not use random_state
Sep 17, 2025
6c9afa3
chore: Increase memory limit of merge algorithms.
Sep 20, 2025
53612a1
fix: Call constructor directly.
Sep 22, 2025
6e2bc04
Store the fully pivoted CSV file for future scripts.
Nov 21, 2025
6 changes: 0 additions & 6 deletions .dvc/config
@@ -1,8 +1,2 @@
[core]
analytics = false
-remote = default-s3-remote
-site_cache_dir = ./.dvc/cache
-[cache]
-dir = cache
-['remote "default-s3-remote"']
-url = s3://lpm-research/default
2 changes: 1 addition & 1 deletion bin/validate.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash

-if INVALID_COLUMNS=$(xsv headers -j data/data.csv | grep -v -E "^[0-9a-zA-Z_\-]+$")
+if INVALID_COLUMNS=$(xsv headers -j data/pivoted.csv | grep -v -E "^[0-9a-zA-Z_\-]+$")
then
printf 'Column names may only include alphanumeric characters, dashes, or underscores.\n\n'
printf 'Invalid column names:\n\n%s\n' "$INVALID_COLUMNS"
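For reference, the check in this hunk can be sketched in Python. This is only an illustration of the rule stated in the error message (alphanumeric characters, dashes, or underscores), not code from this change. The function name `invalid_columns` is hypothetical, and the sketch reads the header row from a string rather than calling `xsv` on `data/pivoted.csv`.

```python
import csv
import io
import re

# Rule from the error message: alphanumerics, dashes, or underscores only.
VALID_COLUMN = re.compile(r"[0-9a-zA-Z_-]+")


def invalid_columns(csv_text):
    """Return the header names that break the column-name rule.

    Hypothetical helper: takes CSV content as a string and inspects
    only its first row, which is assumed to be the header.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [name for name in header if not VALID_COLUMN.fullmatch(name)]


print(invalid_columns("age,first name,zip-code\n1,2,3\n"))  # ['first name']
```

An empty result corresponds to the shell script taking no action, since `grep -v` finds no offending header there.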
3 changes: 3 additions & 0 deletions data/.gitignore
@@ -18,6 +18,7 @@
# schema artifacts
/cgpm-schema.edn
/mapping-table.edn
+/preschema.edn
/schema.edn

# loom artifacts
@@ -43,3 +44,5 @@
/fidelity.csv
/fidelity.json
/synthetic-data-gensql.csv
+/pivoted.csv
+/premodel.csv