[SYSTEMDS-3900] Stream aware write operation for OOC #2301

Closed
j143 wants to merge 4 commits into apache:main from j143:SYSTEMDS-3900-write-ooc-operation

j143 commented Jul 31, 2025

Member
  • The existing CP WriteInstruction (VariableCPInstruction) is now "stream-aware."

  • It checks whether its input MatrixObject has an OOC stream handle. If a stream exists, the instruction acts as a synchronous terminal consumer: it reads blocks from the stream and writes them to separate part-files in an output directory.

  • Multi-block write: the OOC write logic handles multi-block matrices by writing to separate part-files, a common output layout in distributed systems.
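
The behavior described in the bullets above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR and does not use SystemDS classes: `Block`, `END_OF_STREAM`, `writeParts`, the `part-%05d` naming, and the on-disk layout are all hypothetical stand-ins, and the stream handle is assumed to behave like a blocking queue that the producer terminates with a sentinel.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch; none of these names are SystemDS classes or APIs.
final class PartFileWriteSketch {

    /** One matrix block: its block-row/block-column index and dense values. */
    static final class Block {
        final long rowIndex;
        final long colIndex;
        final double[] values;

        Block(long rowIndex, long colIndex, double[] values) {
            this.rowIndex = rowIndex;
            this.colIndex = colIndex;
            this.values = values;
        }
    }

    /** Sentinel (assumed convention) that the producer enqueues to end the stream. */
    static final Block END_OF_STREAM = new Block(-1, -1, new double[0]);

    /**
     * Synchronous terminal consumer: blocks on the stream until the sentinel
     * arrives and writes every block to its own part-file in outDir.
     * Returns the number of blocks written.
     */
    static int writeParts(BlockingQueue<Block> stream, Path outDir)
            throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        int count = 0;
        Block b;
        while ((b = stream.take()) != END_OF_STREAM) {
            Path part = outDir.resolve(String.format("part-%05d", count));
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(part)))) {
                // Illustrative layout: block indexes, value count, then the values.
                out.writeLong(b.rowIndex);
                out.writeLong(b.colIndex);
                out.writeInt(b.values.length);
                for (double v : b.values) {
                    out.writeDouble(v);
                }
            }
            count++;
        }
        return count;
    }
}
```

Because the consumer only ever holds one block at a time, memory use stays bounded by the block size rather than the matrix size, at the cost of producing one file per block.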

github-project-automation (bot) moved this to In Progress in SystemDS PR Queue on Jul 31, 2025
j143 changed the title from "Systemds 3900 write ooc operation" to "[SYSTEMDS-3900] Stream aware write operation for OOC" on Jul 31, 2025

j143 commented Jul 31, 2025

Member Author

Hi @mboehm7, could you please check whether this direction is OK? (The implementation is not yet complete.)

  1. Reusing the existing write instruction itself
  2. Writing matrix partition files into the output directory

mboehm7 commented Aug 2, 2025

Contributor

Sorry for the delay, and thanks for getting started on this task @j143.

  1. Integration: Instead of integrating this write into the VariableCPInstruction (where non-binary formats are written), I would recommend integrating the OOC write into the individual writers (with support for only the binary format), via a new method that is called if a MatrixObject indeed has an existing OOC stream of blocks.

  2. Core Write Logic: To yield the same output files as a normal (single-threaded) write, I recommend not creating part files for every single block, but streaming all these blocks into a single file. Once you extend the binary writer, you will see that this approach is even easier and results in files that can be processed much faster (it avoids creating too many files, which can be an issue on distributed file systems).
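
As a rough illustration of the second recommendation, the sketch below keeps one output stream open and appends every block from the stream to the same file. It is an assumption-laden example, not the change that eventually closed this PR: `Block`, `END_OF_STREAM`, `writeSingleFile`, `readAll`, and the record layout are hypothetical, and the layout is not the SystemDS binary block format.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch; none of these names are SystemDS classes or APIs.
final class SingleFileWriteSketch {

    /** One matrix block: its block-row/block-column index and dense values. */
    static final class Block {
        final long rowIndex;
        final long colIndex;
        final double[] values;

        Block(long rowIndex, long colIndex, double[] values) {
            this.rowIndex = rowIndex;
            this.colIndex = colIndex;
            this.values = values;
        }
    }

    /** Sentinel (assumed convention) that the producer enqueues to end the stream. */
    static final Block END_OF_STREAM = new Block(-1, -1, new double[0]);

    /** Opens the output once and appends each streamed block as one record. */
    static int writeSingleFile(BlockingQueue<Block> stream, Path outFile)
            throws IOException, InterruptedException {
        int count = 0;
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(outFile)))) {
            Block b;
            while ((b = stream.take()) != END_OF_STREAM) {
                // Illustrative record: block indexes, value count, then the values.
                out.writeLong(b.rowIndex);
                out.writeLong(b.colIndex);
                out.writeInt(b.values.length);
                for (double v : b.values) {
                    out.writeDouble(v);
                }
                count++;
            }
        }
        return count;
    }

    /** Reads all records back, showing that a single sequential pass suffices. */
    static List<Block> readAll(Path file) throws IOException {
        List<Block> blocks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                long row;
                try {
                    row = in.readLong();
                } catch (EOFException endOfFile) {
                    break; // no further records
                }
                long col = in.readLong();
                double[] values = new double[in.readInt()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = in.readDouble();
                }
                blocks.add(new Block(row, col, values));
            }
        }
        return blocks;
    }
}
```

Compared with one part-file per block, this variant differs only in where the output stream is opened: once outside the loop instead of once per block, so the number of files no longer grows with the number of blocks.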

mboehm7 closed this in 54a90ad on Aug 9, 2025
github-project-automation (bot) moved this from In Progress to Done in SystemDS PR Queue on Aug 9, 2025
