Skip to content

[SPARK-27812][K8S] Bump K8S client version to 4.6.1#26093

Closed
igorcalabria wants to merge 6 commits into
apache:masterfrom
igorcalabria:k8s-client-update
Closed

[SPARK-27812][K8S] Bump K8S client version to 4.6.1#26093
igorcalabria wants to merge 6 commits into
apache:masterfrom
igorcalabria:k8s-client-update

Conversation

@igorcalabria

@igorcalabria igorcalabria commented Oct 11, 2019

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Updated kubernetes client.

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need this fix fabric8io/kubernetes-client#1768 that was released on version 4.6 of the client. The root cause of the problem is better explained in #25785

Does this PR introduce any user-facing change?

Nope, it should be transparent to users

How was this patch tested?

This patch was tested manually using a simple pyspark job

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()

The expected behaviour of this "job" is that both python's and jvm's process exit automatically after the main runs. This is the case for spark versions <= 2.4. On version 2.4.3, the jvm process hangs because there's a non daemon thread running

"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]

This is caused by a bug on kubernetes-client library, which is fixed on the version that we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.3 is restored and both processes terminate successfully

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @igorcalabria .
Currently, our testing Jenkins server is down. I'll review this PR after Jenkins is back.

@ifilonenko

Copy link
Copy Markdown
Contributor

@igorcalabria thank you for writing this patch

I have ran this internally on my cluster and can confirm that non-daemon threads no longer block the JVM from exiting.

Against branch 2.4 and master:

I see the following non-daemon threads exist when running
spark-submit against the following pyspark job:

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()

After the python process exited, the JVM process remained and I see the following using jstack:

"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]

Against this patch:

I no longer see the following non-daemon threads and the JVM is able to exit after the completion of the python process.

@mccheah @erikerlandson @felixcheung @holdenk we should backport this fix to branch 2.4 as well and cut a new release with this patch.

@ifilonenko ifilonenko left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes regarding consistency of kubernetes-client version and the bump of okhttp as well

<properties>
<sbt.project.name>kubernetes</sbt.project.name>
<kubernetes.client.version>4.4.2</kubernetes.client.version>
<kubernetes.client.version>4.6.0</kubernetes.client.version>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to change the kubernetes.client.version in resource-managers/kubernetes/integration-tests/pom.xml as well.

and bump:

    <dependency>
      <groupId>com.squareup.okhttp3</groupId>
      <artifactId>okhttp</artifactId>
      - <version>3.8.1</version>
      + <version>3.12.0</version>
    </dependency>

as I believe there is a dependency here https://mvnrepository.com/artifact/io.fabric8/kubernetes-client/4.6.0

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Is there a reason why okhttp is added as an explicit dependency(kubernetes-client already brings it)? If we really need a specific version, shouldn't we at least exclude it from kubernetes-client? I'm saying this because it's very easy to end up with a unexpected version on the classpath. kubernetes-client uses okhttp's version 3.12 since version 4.1.2 but we're only updating it now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may have been to make sure that another dependency on okhttp didn't win out and override to a lower version. It can just be updated here, or you can try removing it and examining the transitive dependencies afterwards. If it's already at 3.12.0, then it can likely be omitted.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, @mccheah do we have an opinion on this? I am okay with removing the explicit dependency since kubernetes-client already brings it in. If there was a reason, I can't seem to remember, to explicitly state, let's exclude it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks okay to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed we should remove explicit dep to avoid a mismatch failure in the future

@dongjoon-hyun

Copy link
Copy Markdown
Member

ok to test

@dongjoon-hyun

dongjoon-hyun commented Oct 12, 2019

Copy link
Copy Markdown
Member

@igorcalabria . The following claims is not matched with your PR description.

Standard kubernetes tests should validate this patch.

Please describe how to test this for the following for the reviewer. (Technically, this PR need a test case for that. Or, at least, manual testing process should be described in the PR description and should be the commit log.)

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

@SparkQA

SparkQA commented Oct 13, 2019

Copy link
Copy Markdown

Test build #111979 has finished for PR 26093 at commit 7f15bac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@igorcalabria

Copy link
Copy Markdown
Contributor Author

@igorcalabria . The following claims is not matched with your PR description.

Standard kubernetes tests should validate this patch.

Please describe how to test this for the following for the reviewer. (Technically, this PR need a test case for that. Or, at least, manual testing process should be described in the PR description and should be the commit log.)

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

@dongjoon-hyun A manual test is described above by @ifilonenko. If you want, I could add an integration test for this specific case. I skipped it because other PRs updating kubernetes-client version didn't add any tests and the README in kubernetes/integration-tests states that the framework is under review and subject to major changes

this matches the version used by kubernetes-client
@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Test build #112032 has finished for PR 26093 at commit 5d6bcd7.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Oct 14, 2019

Copy link
Copy Markdown
Member

If it's easy to write a test for this, OK. If it's complex or would take a long time to run, I can see just verifying the small behavior change with a manual test.

@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17039/

@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17039/

@mccheah

mccheah commented Oct 14, 2019

Copy link
Copy Markdown
Contributor

It's pretty difficult to cover all the code paths for changes in a downstream library. @dongjoon-hyun @srowen is there a standard protocol requiring an additional test for downstream dependencies? In this case, we're picking up a fix to an internal implementation detail in the downstream library, and I'd imagine we'd trust the downstream library to have already done its due diligence to test their fix. Regression tests exist to catch new problems introduced by the dependency bumps as we go along.

I'd agree that a manual test demonstrating the desired end behavior is sufficient here.

@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17042/

@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17042/

@dongjoon-hyun

Copy link
Copy Markdown
Member

Hi, @igorcalabria and @mccheah .
I'm okay as long as the manual description becomes the PR description and the commit log.

Please describe how to test this for the following for the reviewer. (Technically, this PR need a test case for that. Or, at least, manual testing process should be described in the PR description and should be the commit log.)

@SparkQA

SparkQA commented Oct 14, 2019

Copy link
Copy Markdown

Test build #112035 has finished for PR 26093 at commit b31b375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ifilonenko

Copy link
Copy Markdown
Contributor

+1 on my end, I have tested using manual tests. An integration test seems like a bit of overkill, although a simple jstack check in the test would be sufficient.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, All.

Upgrading versions are okay, but the problem of this PR is to claim too much without a concrete background. This PR doesn't provide the way how to verify the claims in the PR description at all. Since the PR description will be a permanent commit log, we should not merge this PR AS-IS.

I'd like to recommend to remove all the claims from the PR description and narrow down the focus to the bug fixes inside the client library itself. Then, this PR can be considered as a proper proposal.

@erikerlandson

Copy link
Copy Markdown
Contributor

@mccheah @erikerlandson @felixcheung @holdenk we should backport this fix to branch 2.4 as well and cut a new release with this patch.

Will there be a 2.4.5 ?

@ifilonenko

Copy link
Copy Markdown
Contributor

@erikerlandson I think there should be. This bug will cause pyspark resources to hang unless some external process does resource reaping.

@dongjoon-hyun I agree, the PR description should change

@srowen srowen changed the title [SPARK-27812][K8s] Bump client version [SPARK-27812][K8S] Bump K8S client version to 4.6.0 Oct 15, 2019
@srowen

srowen commented Oct 15, 2019

Copy link
Copy Markdown
Member

There will be a 2.4.5 eventually, I'm sure. 2.4.x is a sort of "LTS" branch.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Of course, @erikerlandson . As @srowen said, it's the LTS branch.

Since 2.4.0 was released on November, 2018, branch-2.4 should be maintained at least May, 2020 (18 months). After that, as a LTS branch, I believe branch-2.4 will be maintained for a while.

Since 2.4.4 was released on September, 2019. 2.4.5 will arrive early 2020 and so on. I hope more regular releases. But, the 2.4.x release schedule heavily depends on the patches on branche-2.4 and the volunteer for release manager.

@igorcalabria

Copy link
Copy Markdown
Contributor Author

@dongjoon-hyun I have updated the description to include a manual test. I believe that both jira issues fix claims are now supported by the manual test case provided in the description. Please let me know if you need anything else.

Cheers,

@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you for updating, @igorcalabria ! I'll follow the verification steps and try to land this.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, during reviews, I found that this became outdated already. Could you update this PR with the latest version, @igorcalabria .

@dongjoon-hyun

Copy link
Copy Markdown
Member

Also, cc @jiangxb1987 since he is a release manager for 3.0.0-preview.

@holdenk

holdenk commented Oct 17, 2019

Copy link
Copy Markdown
Contributor

I'm +1 on backporting even though it's a version change I think we need this in 2.4.X

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending removal of explicit dep & jenkins.
Thanks for doing the manual verification @ifilonenko :)

@holdenk

holdenk commented Oct 17, 2019

Copy link
Copy Markdown
Contributor

Actually if were going to update to 4.6.1 (and it looks like might as well) @ifilonenko would you have the cycles to manually verify with that version as well?

kubernetes-client already brings it and we don't need a specific version
@igorcalabria

Copy link
Copy Markdown
Contributor Author

@dongjoon-hyun bumped version to 4.6.1. Following other reviewers suggestions, I also removed explicit okhttp dep.

@srowen srowen changed the title [SPARK-27812][K8S] Bump K8S client version to 4.6.0 [SPARK-27812][K8S] Bump K8S client version to 4.6.1 Oct 17, 2019

@srowen srowen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK pending tests

@SparkQA

SparkQA commented Oct 17, 2019

Copy link
Copy Markdown

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17209/

@SparkQA

SparkQA commented Oct 17, 2019

Copy link
Copy Markdown

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17209/

@SparkQA

SparkQA commented Oct 17, 2019

Copy link
Copy Markdown

Test build #112221 has finished for PR 26093 at commit a5eb702.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Is this tested with Python 3 on the master branch?

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you so much everyone.
Especially, welcome, @igorcalabria . I added you to the Apache Spark contributor group.
Could you make a backporting PR against branch-2.4, @igorcalabria ?

dongjoon-hyun pushed a commit that referenced this pull request Oct 18, 2019
# What changes were proposed in this pull request?

Backport of #26093 to `branch-2.4`

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need this fix fabric8io/kubernetes-client#1768 that was released on version 4.6 of the client. The root cause of the problem is better explained in #25785

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both python's and jvm's process exit automatically after the main runs. This is the case for spark versions <= 2.4. On version 2.4.3, the jvm process hangs because there's a non daemon thread running

```
"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug on `kubernetes-client` library, which is fixed on the version that we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.0 is restored and both processes terminate successfully

Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants