[SYSTEMDS-3900] Stream aware write operation for OOC (option 2) #2302
j143 wants to merge 12 commits into
The while loop calls mb.write(dostream) for every block, so each block is written with its own header. This results in a corrupted output file that looks like this: [Header for Block 1] [Data for Block 1] [Header for Block 2] [Data for Block 2] [Header for Block 3] [Data for Block 3]
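The interleaving described above can be reproduced with a small stand-alone sketch. It uses plain `java.io` streams and a made-up framing (`HEADER_MAGIC`, `writeBlock`, and `countHeaders` are hypothetical names for illustration), not the actual SystemDS `MatrixBlock` serialization format. It only shows that calling a header-emitting write once per block leaves one header per block in the file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Toy framing only; this is NOT the SystemDS MatrixBlock serialization format.
final class BlockFramingSketch {
    static final int HEADER_MAGIC = 0xCAFE; // hypothetical header marker

    // Stand-in for a per-block write() that always emits its own header first.
    static void writeBlock(DataOutputStream out, double[] block) throws IOException {
        out.writeInt(HEADER_MAGIC);
        out.writeInt(block.length);
        for (double v : block) out.writeDouble(v);
    }

    // The problematic loop: one write() per block, hence one header per block.
    static byte[] writePerBlock(List<double[]> blocks) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (double[] b : blocks) writeBlock(out, b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Walks the file frame by frame and counts how many headers it contains.
    static int countHeaders(byte[] file) {
        int headers = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            while (in.available() > 0) {
                if (in.readInt() != HEADER_MAGIC) throw new IllegalStateException("not a header");
                int cells = in.readInt();
                in.skipBytes(cells * Double.BYTES); // skip this block's payload
                headers++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headers;
    }

    public static void main(String[] args) {
        List<double[]> blocks = Arrays.asList(
            new double[]{1, 2}, new double[]{3, 4}, new double[]{5, 6});
        System.out.println("headers in file: " + countHeaders(writePerBlock(blocks)));
    }
}
```

A reader that expects a single header followed by all the data would misinterpret the second and third headers as payload, which is the corruption reported here.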
At present, concat is not supported by ChecksumFileSystem.
int blen = Integer.parseInt(getInput4().getName());
LocalTaskQueue<IndexedMatrixValue> stream = mo.getStreamHandle();

if (stream != null) {
Due to recompilation, this code is visited a second time. Since the stream is null the second time, it fails.
So I tried setting the values explicitly, but recompilation is still happening:
mo.updateDataCharacteristics(mc);
HDFSTool.copyFileOnHDFS(fname, mo.getFileName());
mo.setDirty(false);
I would recommend simply probing whether a stream exists and not consuming the matrix if it does.
Hi, I've not understood this part! :)
Hi @mboehm7, I would like to talk about the design choices here.
UnaryTest/
  in/
    .X.crc
    .X.mtd.crc
    X
    X.mtd
  out/
    .res._data.crc
    .res._header.crc
    .res.crc
    .res.mtd.crc
    res
    res._data
    res._header
    res.mtd

b. On my machine (x86-64, Windows 11), HDFS concat doesn't seem to work. I had winutils installed, and other JUnit tests work.
Could you let me know what you think of the choices so far and which ones are better?
You don't have to invent any new things here: we take the stream of blocks and directly write them into a sequence file (like our existing binary writer, which actually takes the overall matrix and chunks it into 1k x 1k blocks). The resulting sequence files should be identical, and there is no file-system merge needed.
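The single-pass approach described above can be sketched as a loop that drains a queue of indexed blocks and appends each one as a key/value record to one output. This is a minimal stand-alone sketch using only the JDK: `IndexedBlock`, `END`, and `writeStream` are hypothetical names, a `BlockingQueue` stands in for the stream handle, and a `DataOutputStream` stands in for the sequence-file writer. It does not reproduce the real sequence-file layout.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class StreamWriteSketch {
    // Hypothetical indexed block: a (row, col) block index plus its values.
    static final class IndexedBlock {
        final long row, col;
        final double[] values;
        IndexedBlock(long row, long col, double[] values) {
            this.row = row; this.col = col; this.values = values;
        }
    }

    // Hypothetical end-of-stream marker that the producer enqueues last.
    static final IndexedBlock END = new IndexedBlock(-1, -1, new double[0]);

    // Drains the queue until END and appends every block as one record.
    // Returns the number of records written; no temporary files, no merge.
    static int writeStream(BlockingQueue<IndexedBlock> stream, DataOutputStream out) {
        int records = 0;
        try {
            for (IndexedBlock b = stream.take(); b != END; b = stream.take()) {
                out.writeLong(b.row);            // key: block row index
                out.writeLong(b.col);            // key: block column index
                out.writeInt(b.values.length);   // value: cell count ...
                for (double v : b.values) out.writeDouble(v); // ... and cells
                records++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for blocks", e);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<IndexedBlock> queue = new LinkedBlockingQueue<>();
        // The producer runs concurrently, as an upstream operator would.
        Thread producer = new Thread(() -> {
            queue.add(new IndexedBlock(1, 1, new double[]{1, 2}));
            queue.add(new IndexedBlock(2, 1, new double[]{3, 4}));
            queue.add(END);
        });
        producer.start();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int n = writeStream(queue, new DataOutputStream(buf));
        producer.join();
        System.out.println(n + " records, " + buf.size() + " bytes");
    }
}
```

Because every block carries its own index as the key, the writer needs no global header and the blocks can arrive in any order, which is why no second pass or file-system concat is required in this design.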
Hi @mboehm7, thanks for the review and feedback on the design. I've tested this on
Hi @mboehm7, could you please review this one? I've tried to address most of the comments. Thank you. I have the matrix-vector multiplication task in progress and will shortly raise a PR built on top of the write operation for testing.
LGTM - thanks for the patch, @j143. The revised code already looked pretty good; I just moved the core stream write into the export logic and fixed a few smaller issues.
I've chosen a two-pass method: write the data separately, hold the metadata in memory, and concat the two at the end.
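For comparison, the two-pass option can be sketched on the local file system as follows. This is an illustrative sketch only: `TwoPassWriteSketch`, its methods, and the 8-byte cell-count header are assumptions made for the example, and the final step copies bytes with `java.nio.file.Files` rather than calling HDFS concat.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class TwoPassWriteSketch {
    // Writes <name>._data and <name>._header, then concatenates them into <name>.
    static Path write(Path dir, String name, List<double[]> blocks) {
        Path data = dir.resolve(name + "._data");
        Path header = dir.resolve(name + "._header");
        Path result = dir.resolve(name);
        try {
            long cells = 0; // metadata held in memory while the data streams out
            // Pass 1: append the raw block data to the _data file.
            try (OutputStream out = Files.newOutputStream(data)) {
                for (double[] b : blocks) {
                    ByteBuffer buf = ByteBuffer.allocate(b.length * Double.BYTES);
                    for (double v : b) buf.putDouble(v);
                    out.write(buf.array());
                    cells += b.length;
                }
            }
            // Pass 2: the header is only known now; write it, then concatenate.
            Files.write(header, ByteBuffer.allocate(Long.BYTES).putLong(cells).array());
            try (OutputStream out = Files.newOutputStream(result)) {
                Files.copy(header, out); // header first
                Files.copy(data, out);   // then all the data
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: writes into a fresh temp directory, returns the file size.
    static long writeAndMeasure(List<double[]> blocks) {
        try {
            Path dir = Files.createTempDirectory("two-pass-sketch");
            return Files.size(write(dir, "res", blocks));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<double[]> blocks = Arrays.asList(new double[]{1, 2}, new double[]{3});
        System.out.println("result size in bytes: " + writeAndMeasure(blocks));
    }
}
```

The extra `_data` and `_header` files and the final merge step are the cost of this option; the merge is also the part that depends on concat support in the underlying file system.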