Update Arrow to 0.15.1 and fix Broadcast and GroupedMapUdf Tests for Spark-3.0.0 #653
Can you resolve conflicts? Thanks!
enable tests for spark 3.0
Merged and resolved conflicts. Re-enabled some tests.
fix exception message.
@suhsteve is this ready for review?
Yeah, it looks like the tests are passing.
backwardCompatibleRelease: '0.9.0'
TestsToFilterOut: "(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf)&\
This is a breaking change, since these are not newly introduced APIs, no?
Moved to spark 3.0.0 tests only.
imback82 left a comment:
LGTM except for a few nit comments.
Btw, I think this is a breaking change, but it can be addressed as a follow up PR.
env:
SPARK_HOME: $(Build.BinariesDirectory)\spark-2.4.6-bin-hadoop2.7

# Spark 3.0.0 uses Arrow 0.15.1, which contains a new Arrow spec. This breaks backward
Is it easy to track if we have different backward compatibility for different Spark versions?
At the moment there are no published workers that are backward compatible with Spark 3.0 (the previous workers only use Arrow 0.14.1 and aren't aware of the new spec). But I agree that this is a breaking change.
For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or have one Worker version that we declare backward compatible for all Spark versions?
This can be addressed in a separate PR if needed.
> For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or have one Worker version that we declare backward compatible for all Spark versions?
I would say the latter.
Then I think we will have to wait until the next official Worker release before we can remove these extra filters.
... since we have one worker binary to support all spark versions.
Well, you can remove the backward compatibility test (since this is a breaking change), then add it back when the new Worker is released.
Do you want me to remove the extra filters for 3.0 and add the skip attribute in the unit tests? Or just remove the Spark 3.0.0 section from the backward compatibility tests?
Removed 3.0 backward compatibility testing.
@suhsteve Can you update the title/description with more details? This is more like upgrading Arrow version.
Addressed the comments and updated title/description.
This PR updates the Arrow library from 0.14.1 to 0.15.1 and also addresses the Broadcast and GroupedMapUdf tests for Spark-3.0.0. GroupedMapUdf support on Spark-3.0.0 is currently blocked/unsupported, so we skip these tests.
See https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format for the new message format, which adds the 0xFFFFFFFF continuation marker (end-of-stream is written as 0xFFFFFFFF 0x00000000).
This is a part of the effort to bring in CI for Spark 3.0: #348
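
For reviewers unfamiliar with the format change mentioned in the description, here is a minimal Python sketch of how a reader might tell the newer message prefix (0xFFFFFFFF followed by a 4-byte little-endian metadata length) apart from the older 4-byte length-only prefix. The function names `read_message_prefix` and `is_end_of_stream` are hypothetical and written only for illustration; this is not the code in this PR, nor the API of any Arrow library.

```python
import struct

# Continuation marker that precedes the metadata length in the newer IPC format.
CONTINUATION_MARKER = 0xFFFFFFFF


def read_message_prefix(buf: bytes, offset: int = 0):
    """Parse one message prefix (hypothetical helper, illustration only).

    Returns (metadata_length, next_offset, is_legacy).
    """
    (first_word,) = struct.unpack_from("<I", buf, offset)
    if first_word == CONTINUATION_MARKER:
        # Newer format: marker, then a 4-byte little-endian metadata length.
        (length,) = struct.unpack_from("<i", buf, offset + 4)
        return length, offset + 8, False
    # Older format: the first 4 bytes are the metadata length itself.
    return first_word, offset + 4, True


def is_end_of_stream(buf: bytes, offset: int = 0) -> bool:
    """A zero metadata length signals end-of-stream in both formats."""
    length, _, _ = read_message_prefix(buf, offset)
    return length == 0


if __name__ == "__main__":
    new_eos = bytes.fromhex("ffffffff00000000")
    legacy_eos = bytes(4)
    print(read_message_prefix(new_eos), is_end_of_stream(new_eos))
    print(read_message_prefix(legacy_eos), is_end_of_stream(legacy_eos))
```

A worker that only understands the older prefix would read 0xFFFFFFFF as a length rather than a marker, which is consistent with the discussion above that previously published workers cannot be used for backward compatibility tests against Spark 3.0.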