reuse container when reading parquet records #1522
|
@rdblue, I see that the Spark Parquet reader doesn't reuse the container, while the vectorized code path does. Was there any particular consideration behind that? |
|
I noticed this a few months ago and tested the performance of the two paths. I didn't really see a speedup in my initial test, so I didn't change the default. We just need to evaluate the performance to know whether to do this. |
|
Would the values produced by the ParquetReader iterable ever be returned directly to Spark? Say in the distributed planning that folks are considering? Because if so we should check to make sure the code path is one of the ones in Spark where it's ok. |
Yes, we return an iterator with the reused containers. I believe that this is okay because Spark generally converts to unsafe immediately. What code paths can't handle reused containers? |
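To make the concern concrete, here is a minimal sketch of what goes wrong when a consumer buffers rows from an iterator that reuses its container, and why copying (or converting) each row immediately avoids it. The `Row` class and the method names are hypothetical stand-ins for illustration; this is not Iceberg or Spark code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {
    /** Hypothetical mutable row container, standing in for a reused record object. */
    public static final class Row {
        int value;

        Row copy() {
            Row r = new Row();
            r.value = value;
            return r;
        }
    }

    /** Iterator that returns the same Row instance from every call to next(). */
    public static Iterator<Row> reusingIterator(final int[] values) {
        final Row container = new Row();
        return new Iterator<Row>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < values.length;
            }

            @Override
            public Row next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                container.value = values[pos++]; // overwrite the shared container
                return container;
            }
        };
    }

    /** Unsafe consumer: buffers references, so every element aliases one object. */
    public static List<Integer> bufferWithoutCopy(int[] values) {
        List<Row> buffered = new ArrayList<>();
        Iterator<Row> it = reusingIterator(values);
        while (it.hasNext()) {
            buffered.add(it.next());
        }
        return valuesOf(buffered);
    }

    /** Safe consumer: copies each row before the iterator advances. */
    public static List<Integer> bufferWithCopy(int[] values) {
        List<Row> buffered = new ArrayList<>();
        Iterator<Row> it = reusingIterator(values);
        while (it.hasNext()) {
            buffered.add(it.next().copy());
        }
        return valuesOf(buffered);
    }

    private static List<Integer> valuesOf(List<Row> rows) {
        List<Integer> out = new ArrayList<>();
        for (Row r : rows) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3};
        System.out.println(bufferWithoutCopy(values)); // [3, 3, 3]
        System.out.println(bufferWithCopy(values));    // [1, 2, 3]
    }
}
```

Any path that holds on to rows across calls to `next()` (sorting, caching, collecting) would hit the first behavior unless something copies or converts the row first.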
|
I ran some Spark JMH benchmarks, once without reusing the container and once with reuse. The results show a slight benefit when reusing the container. @rdblue, does that make sense as a Spark-side change? I will try to write a JMH benchmark for the Flink input format and run it again. Plus, I found two issues:
We need to delete the created temp file first. I will fix the issues I found tomorrow. |
|
Thanks for running the benchmarks, @chenjunjiedada! I think it looks like we should commit this. @aokolnychyi, what do you think? |
|
I'm restarting the failed JDK 8 tests because the failure was flaky Hive tests. |
Following up on this, it is safe for Spark in the v2 path because Spark ensures that there is a projection that converts to unsafe rows. That's because some Spark exec nodes expect unsafe. |
This is up for discussion here. I haven't added a unit test since the existing end-to-end tests should cover this.