Add vessel graph normalization step to emit node/edge parquet tables with Dask-distributed processing #168

Merged
akhanf merged 7 commits into main from copilot/create-nodes-and-edges-tables
May 18, 2026

Contributor

Copilot AI commented May 17, 2026

Adds vessel graph normalization that materializes explicit nodes and edges parquet tables from the skeleton edge-list parquet, and updates the conversion path to use Dask-distributed processing to reduce memory pressure on large inputs.

Changes Made

  • Added/maintained node/edge normalization outputs for vessel skeleton graphs:
    • ...+skeleton_nodes.parquet
    • ...+skeleton_edges.parquet
  • Reworked conversion to a Dask-based pipeline:
    • Build nodes first from endpoint columns using distributed dataframe operations.
    • Map edges to node IDs via distributed joins.
    • Write both outputs incrementally to parquet instead of fully materializing large intermediate tables in memory.
  • Preserved downstream schema compatibility:
    • nodes: node_id, channel, vox_x, vox_y, vox_z, x, y, z, radius
    • edges: edge_id, channel, src_node_id, dst_node_id, edge_length, src_vox_*, dst_vox_*
  • Updated targeted tests for the parquet conversion flow and schema validation.
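The schema lists above can be checked mechanically. The following is a minimal sketch, not the PR's actual test code: the function name `missing_columns` is hypothetical, and the `src_vox_*` / `dst_vox_*` wildcards are kept as glob patterns rather than expanded, since the PR description does not spell out the individual column names.

```python
from fnmatch import fnmatchcase

# Column lists copied from the PR description; the "*" entries stay as patterns.
NODE_COLUMNS = ["node_id", "channel", "vox_x", "vox_y", "vox_z",
                "x", "y", "z", "radius"]
EDGE_COLUMNS = ["edge_id", "channel", "src_node_id", "dst_node_id",
                "edge_length", "src_vox_*", "dst_vox_*"]


def missing_columns(actual, expected):
    """Return the expected names or glob patterns that no actual column matches."""
    return [
        pattern
        for pattern in expected
        if not any(fnmatchcase(column, pattern) for column in actual)
    ]
```

With pyarrow installed, the actual column names of an output file could be obtained from `pyarrow.parquet.read_schema(path).names` and passed as `actual`; an empty result would mean every expected column (or pattern) is present.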

Validation

  • Parallel validation completed (code review + CodeQL).
  • CodeQL reported no security alerts.

Member

akhanf commented May 17, 2026

@copilot: Peak memory consumption for this script is excessive (greater than 128 GB). Can a more memory-efficient alternative be implemented? Maybe do nodes first, then edges.

Contributor Author

Copilot AI commented May 17, 2026

@copilot: Peak memory consumption for this script is excessive (greater than 128 GB). Can a more memory-efficient alternative be implemented? Maybe do nodes first, then edges.

Implemented a low-memory two-pass conversion in c056a38: pass 1 streams the input parquet to build nodes.parquet; pass 2 streams again to map/write edges.parquet incrementally, avoiding full edge-table materialization in RAM. No UI changes in this PR (no screenshot applicable).
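The two-pass idea described above can be illustrated without any parquet library. This is a library-agnostic sketch under stated assumptions, not the code from c056a38: the row layout `(channel, src_voxel, dst_voxel, edge_length)` and the function names `build_nodes` and `iter_edges` are hypothetical, and each "batch" stands in for one chunk streamed from the input file.

```python
def build_nodes(batches):
    """Pass 1: assign a node_id to each distinct (channel, voxel) endpoint."""
    node_ids = {}
    for batch in batches:
        for channel, src, dst, _length in batch:
            for voxel in (src, dst):
                # len(node_ids) is evaluated before insertion, so ids run 0, 1, 2, ...
                node_ids.setdefault((channel, voxel), len(node_ids))
    return node_ids


def iter_edges(batches, node_ids):
    """Pass 2: yield edge rows one at a time with endpoints replaced by node ids."""
    edge_id = 0
    for batch in batches:
        for channel, src, dst, length in batch:
            yield (edge_id, channel,
                   node_ids[(channel, src)], node_ids[(channel, dst)], length)
            edge_id += 1


def read_batches():
    """Stand-in for re-opening the input and streaming it in chunks (toy data)."""
    return iter([
        [(0, (0, 0, 0), (1, 0, 0), 1.0)],
        [(0, (1, 0, 0), (2, 0, 0), 1.0)],
    ])
```

A caller would run `node_ids = build_nodes(read_batches())` and then consume `iter_edges(read_batches(), node_ids)`, writing each edge row out as it is produced. In this sketch the edge table is never held in memory, but the node lookup dictionary still grows with the number of distinct nodes.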

Copilot finished work on behalf of akhanf May 17, 2026 20:35
Copilot AI requested a review from akhanf May 17, 2026 20:35
Member

akhanf commented May 17, 2026

@copilot: This is still using over 128 GB of memory. Implement a solution using Dask distributed to avoid the memory load.

Contributor Author

Copilot AI commented May 17, 2026

@copilot: This is still using over 128 GB of memory. Implement a solution using Dask distributed to avoid the memory load.

Implemented in 3f95df6: the conversion now runs through a Dask-distributed pipeline (nodes first, then edges), with distributed joins and incremental parquet writes to avoid loading full node/edge tables into memory. No UI changes in this PR (no screenshot applicable).

Copilot AI changed the title from "Add vessel graph normalization step to emit node/edge parquet tables" to "Add vessel graph normalization step to emit node/edge parquet tables with Dask-distributed processing" May 17, 2026
Copilot finished work on behalf of akhanf May 17, 2026 21:31
akhanf marked this pull request as ready for review May 18, 2026 23:10
Member

akhanf commented May 18, 2026

Works now, running with 128 cores and 256 GB of memory.

akhanf merged commit 2e43241 into main May 18, 2026
2 checks passed