Update Arrow to 0.15.1 and fix Broadcast and GroupedMapUdf Tests for Spark-3.0.0 #653

Merged: imback82 merged 25 commits into dotnet:master from suhsteve:spark3_r on Sep 11, 2020

Conversation

suhsteve (Member) commented Sep 4, 2020

This PR updates the Arrow library from 0.14.1 to 0.15.1 and also addresses the Broadcast and GroupedMapUdf tests for Spark 3.0.0. GroupedMapUdf is currently blocked/unsupported on Spark 3.0.0, so we skip those tests.

This is a part of the effort to bring in CI for Spark 3.0: #348

suhsteve self-assigned this Sep 5, 2020
imback82 mentioned this pull request Sep 6, 2020 (6 tasks)

imback82 (Contributor) commented Sep 8, 2020

Can you resolve conflicts? Thanks!

suhsteve (Member, Author) commented Sep 8, 2020

> Can you resolve conflicts? Thanks!

Merged and resolved conflicts. Re-enabled some tests.

imback82 (Contributor) commented Sep 8, 2020

@suhsteve is this ready for review?

suhsteve requested a review from imback82 September 8, 2020 22:05

suhsteve (Member, Author) commented Sep 8, 2020

> @suhsteve is this ready for review?

Yes, it looks like the tests are passing.

Comment thread azure-pipelines.yml Outdated
backwardCompatibleRelease: '0.9.0'
TestsToFilterOut: "(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf)&\
(FullyQualifiedName!=Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestGroupedMapUdf)&\

imback82 (Contributor):

Isn't this a breaking change, since these are not newly introduced APIs?

suhsteve (Member, Author):

Moved the filters to the Spark 3.0.0 tests only.
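
For readers unfamiliar with the filter quoted in this thread: each `(FullyQualifiedName!=...)` clause excludes one test, and `&` combines the clauses so that every listed test is skipped. Below is a minimal sketch in Python of how such an expression could be assembled. The helper name `build_exclusion_filter` is hypothetical and is not part of this repository or of the pipeline definition; only test names quoted above are used.

```python
def build_exclusion_filter(test_names):
    """Build a test filter expression that excludes every given test.

    Each name becomes a "(FullyQualifiedName!=<name>)" clause, and the
    clauses are joined with "&" so that all exclusions apply together.
    """
    clauses = [f"(FullyQualifiedName!={name})" for name in test_names]
    return "&".join(clauses)


# Test names taken from the snippet quoted in the review thread.
skipped_tests = [
    "Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameGroupedMapUdf",
    "Microsoft.Spark.E2ETest.IpcTests.DataFrameTests.TestDataFrameVectorUdf",
]

if __name__ == "__main__":
    print(build_exclusion_filter(skipped_tests))
```

Keeping the names in a list, as assumed here, would make it straightforward to keep a separate exclusion list per Spark version, which is the direction the thread settles on.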

Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs Outdated
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs Outdated
Comment thread src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs Outdated
suhsteve requested a review from imback82 September 10, 2020 23:13

imback82 (Contributor) left a comment

LGTM except for a few nit comments.

Btw, I think this is a breaking change, but it can be addressed in a follow-up PR.

Comment thread azure-pipelines.yml
env:
SPARK_HOME: $(Build.BinariesDirectory)\spark-2.4.6-bin-hadoop2.7

# Spark 3.0.0 uses Arrow 0.15.1, which contains a new Arrow spec. This breaks backward

imback82 (Contributor):

Is it easy to track if we have different backward compatibility for different Spark versions?

suhsteve (Member, Author):

At the moment there are no published workers that are backward compatible with 3.0 (since the previous workers only use 0.14.1 and aren't aware of the new spec). But I agree that this is a breaking change.

For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or have one Worker version that we say is backward compatible with all Spark versions?

This can be addressed in a separate PR if needed.

imback82 (Contributor):

> For backward compatibility, do we want to differentiate between Spark versions and test them against different Worker versions? Or have one Worker version that we say is backward compatible with all Spark versions?

I would say the latter.

suhsteve (Member, Author):

Then I think we will have to wait until the next official Worker release before we can remove these extra filters.

imback82 (Contributor):

... since we have one worker binary to support all spark versions.

imback82 (Contributor), Sep 11, 2020:

Well, you can remove the backward compatibility test (breaking change) then add it back when the new one is released.

suhsteve (Member, Author):

Do you want me to remove the extra filters for 3.0 and add the skip attribute in the unit tests? Or just remove the Spark 3.0.0 section in the backward compatibility tests?

suhsteve (Member, Author):

Removed 3.0 backward compatibility testing.

Comment thread src/csharp/Microsoft.Spark.Worker.UnitTest/CommandExecutorTests.cs Outdated
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
Comment thread src/csharp/Microsoft.Spark.Worker/Command/SqlCommandExecutor.cs
imback82 (Contributor) commented:

@suhsteve Can you update the title/description with more details? This is more like upgrading the Arrow version.

suhsteve changed the title from "Broadcast and GroupedMapUdf Tests for Spark-3.0.0" to "Update Arrow to 0.15.1 and fix Broadcast and GroupedMapUdf Tests for Spark-3.0.0" on Sep 11, 2020

suhsteve (Member, Author) commented:

> @suhsteve Can you update the title/description with more details? This is more like upgrading the Arrow version.

Addressed the comments and updated title/description.

imback82 previously approved these changes Sep 11, 2020

imback82 (Contributor) left a comment

LGTM, thanks @suhsteve!

imback82 merged commit a5f707c into dotnet:master Sep 11, 2020
suhsteve added this to the 1.0.0 milestone Sep 24, 2020