[SYSTEMDS-3900] Stream aware write operation for OOC (option 2) #2302

Closed
j143 wants to merge 12 commits into apache:main from j143:origin/SYSTEMDS-3900-write-ooc-operation-single-file-output

Conversation

@j143 j143 commented Aug 3, 2025

Member

..

I've chosen a two-pass method: write the data separately, hold the metadata in memory, and concatenate the two at the end.
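
The two-pass idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class `TwoPassWriter`, the `._data` suffix handling, and the header layout (rows, cols, non-zero count) are hypothetical, and plain `java.nio` stands in for HDFS.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TwoPassWriter {
    // Hypothetical layout of the final file: [rows][cols][nnz][all block values].
    public static void write(Path target, int rows, int cols, List<double[]> blocks)
            throws IOException {
        Path data = target.resolveSibling(target.getFileName() + "._data");
        long nnz = 0;

        // Pass 1: stream the block values to a side file and keep only
        // the running metadata (here: the non-zero count) in memory.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(data)))) {
            for (double[] block : blocks) {
                for (double v : block) {
                    out.writeDouble(v);
                    if (v != 0) nnz++;
                }
            }
        }

        // Pass 2: write the header once, then append the data file behind it.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(target)))) {
            out.writeInt(rows);
            out.writeInt(cols);
            out.writeLong(nnz);
            Files.copy(data, out);
        }
        Files.delete(data);
    }
}
```

The second pass copies the data once more, which is the cost of not knowing the metadata up front.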

j143 added 8 commits August 2, 2025 21:11
The while loop calls mb.write(dostream) for every block, which writes a header before each block's data. This results in a corrupted output file that looks like this:

[Header for Block 1] [Data for Block 1]
[Header for Block 2] [Data for Block 2]
[Header for Block 3] [Data for Block 3]
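
For contrast, a minimal sketch of a layout with a single header followed by the data of all blocks. The class `SingleHeaderWriter` and its header fields (rows, cols) are assumptions made for illustration and do not reflect the actual MatrixBlock serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

public class SingleHeaderWriter {
    // Hypothetical layout: [rows][cols][data of block 1][data of block 2]...
    public static byte[] writeAll(int rows, int cols, List<double[]> blocks)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // The header is written exactly once, outside the block loop.
            out.writeInt(rows);
            out.writeInt(cols);
            // Inside the loop only the raw values are appended.
            for (double[] block : blocks) {
                for (double v : block) {
                    out.writeDouble(v);
                }
            }
        }
        return bytes.toByteArray();
    }
}
```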
At present, concat is not supported by ChecksumFileSystem.
UnaryTest/
  in/
    .X.crc
    .X.mtd.crc
    X
    X.mtd
  out/
    .res._data.crc
    .res._header.crc
    .res.crc
    .res.mtd.crc
    res
    res._data
    res._header
    res.mtd
int blen = Integer.parseInt(getInput4().getName());
LocalTaskQueue<IndexedMatrixValue> stream = mo.getStreamHandle();

if (stream != null) {

@j143 j143 Aug 3, 2025

Member Author

Due to recompilation, this code is visited a second time. Since the stream is null the second time, it fails.

So I tried setting the values explicitly, but recompilation still happens.

mo.updateDataCharacteristics(mc);
HDFSTool.copyFileOnHDFS(fname, mo.getFileName());
mo.setDirty(false);

Contributor

I would recommend simply probing whether a stream exists, and not consuming the matrix if it does.
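
One possible reading of this suggestion, sketched under assumptions: check for the stream first, drain it only when it is present, and treat a later visit (for example after recompilation) as a no-op instead of a failure. The names `StreamProbe`, `Handle`, `hasStream`, `takeStream`, and `EOS` are hypothetical and are not SystemDS APIs; a `BlockingQueue` of strings stands in for the block stream.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class StreamProbe {
    // Hypothetical end-of-stream marker placed in the queue by the producer.
    public static final String EOS = "<eos>";

    // Hypothetical stand-in for a matrix object that may carry a one-shot stream.
    public static class Handle {
        private BlockingQueue<String> stream;

        public Handle(BlockingQueue<String> stream) { this.stream = stream; }

        public boolean hasStream() { return stream != null; }

        // Hands out the stream once; later probes see no stream.
        public BlockingQueue<String> takeStream() {
            BlockingQueue<String> s = stream;
            stream = null;
            return s;
        }
    }

    // Drains the stream into the sink if one exists; otherwise does nothing.
    public static void write(Handle h, List<String> sink) throws InterruptedException {
        if (!h.hasStream()) {
            return; // second visit: the blocks were already written
        }
        BlockingQueue<String> q = h.takeStream();
        for (String b = q.take(); !b.equals(EOS); b = q.take()) {
            sink.add(b);
        }
    }
}
```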

@j143 j143 Aug 3, 2025

Member Author

Hi, I haven't understood this part! :)

j143 commented Aug 3, 2025

Member Author

Hi @mboehm7, I would like to talk about the design choices here.

  1. Since part files are not preferred, I chose a two-file format (one data file and one lightweight metadata file, concatenated at the end).

    a. File directory:

        UnaryTest/
          in/
            .X.crc
            .X.mtd.crc
            X
            X.mtd
          out/
            .res._data.crc
            .res._header.crc
            .res.crc
            .res.mtd.crc
            res
            res._data
            res._header
            res.mtd

    b. On my machine (x86-64, Windows 11), HDFS concat doesn't seem to work. I have winutils set up, and the other JUnit tests work.

  2. I'm facing an issue with the recompilation step, due to which stream = null and the if condition fails.

  3. If I try to explicitly set the matrix characteristics to avoid recompilation, it still doesn't work.


Could you let me know what you think of the choices so far, and which ones are better?

mboehm7 commented Aug 3, 2025

Contributor

You don't have to invent anything new here: we take the stream of blocks and write them directly into a sequence file (like our existing binary writer, which actually takes the overall matrix and chunks it into 1k x 1k blocks). The resulting sequence files should be identical, and no file-system merge is needed.
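
The shape of that approach can be sketched as a loop that dequeues indexed blocks and appends each one as a self-contained record. This is a simplified stand-in, not the SystemDS writer: `BlockStreamWriter`, `Block`, and the `END` sentinel are hypothetical, a `BlockingQueue` replaces the task queue, and a `DataOutputStream` replaces the Hadoop sequence file writer.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;

public class BlockStreamWriter {
    // Hypothetical indexed block: block row/column index plus dense values.
    public static final class Block {
        final long rowIndex;
        final long colIndex;
        final double[] values;

        public Block(long rowIndex, long colIndex, double[] values) {
            this.rowIndex = rowIndex;
            this.colIndex = colIndex;
            this.values = values;
        }
    }

    // Hypothetical sentinel the producer enqueues after the last block.
    public static final Block END = new Block(-1, -1, new double[0]);

    // Appends one (index, block) record per dequeued block; returns the block count.
    public static int writeStream(BlockingQueue<Block> queue, DataOutputStream out)
            throws IOException, InterruptedException {
        int written = 0;
        Block b;
        while ((b = queue.take()) != END) {
            out.writeLong(b.rowIndex);       // key part 1
            out.writeLong(b.colIndex);       // key part 2
            out.writeInt(b.values.length);   // value: length, then payload
            for (double v : b.values) {
                out.writeDouble(v);
            }
            written++;
        }
        out.flush();
        return written;
    }
}
```

Because every record carries its own block index, blocks can be appended in arrival order and no merge step is required afterwards.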

@j143 j143 marked this pull request as ready for review August 3, 2025 17:55
j143 commented Aug 3, 2025

Member Author

Hi @mboehm7, thanks for the review and the feedback on the design.

I've tested this on the following matrix sizes (rows, cols):
1000, 1000
5000, 1000
5000, 5000

j143 commented Aug 9, 2025

Member Author

Hi @mboehm7, could you please review this one? I've tried to address most of the comments. Thank you.

I have the matrix-vector multiplication task in progress and will shortly raise a PR built on top of the write operation for testing.

mboehm7 commented Aug 9, 2025

Contributor

LGTM - thanks for the patch @j143. The revised code already looked pretty good; I just moved the core stream write into the export logic and fixed a few smaller issues.

Projects

Status: Done

2 participants