[SPARK-7736] [core] Fix a race introduced in PythonRunner.#8258

Closed
vanzin wants to merge 4 commits into apache:master from vanzin:SPARK-7736

Conversation

vanzin commented Aug 17, 2015

Contributor

The fix for SPARK-7736 introduced a race where a port value of "-1"
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.

vanzin commented Aug 17, 2015

Contributor Author

Example of the error:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41051/artifact/yarn/target/unit-tests.log

File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 425, in start
OverflowError: getsockaddrarg: port must be 0-65535.
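
For readers who want to see why a port of -1 produces this particular error, here is a minimal sketch in plain Python. It is not py4j or Spark code: `try_connect` is a hypothetical helper, and the exact wording of the message varies between Python versions. It only shows that the socket layer rejects an out-of-range port with `OverflowError` before any connection is attempted.

```python
import socket


def try_connect(port):
    """Try a TCP connect to localhost on `port`.

    Returns None on success, otherwise the name of the exception type.
    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("127.0.0.1", port))
        return None
    except (OverflowError, OSError) as exc:
        # An out-of-range port is rejected while the address is parsed,
        # so no network traffic is involved in that case.
        return type(exc).__name__
    finally:
        sock.close()


result = try_connect(-1)
print(result)
```

A port of -1 is the kind of "not yet assigned" sentinel the description refers to, so seeing this error on the Python side suggests the value was read before the JVM side had a real port to report.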

vanzin commented Aug 18, 2015

Contributor Author

retest this please

@andrewor14

Contributor

@vanzin is this the right JIRA?

@andrewor14

Contributor

also, I've seen this non-determinism from the user list. It would definitely be good to fix it.

vanzin commented Aug 18, 2015

Contributor Author

Yes; this bug was introduced by a change that I pushed this morning (to fix the same bug this PR mentions; see #7751).

SparkQA commented Aug 18, 2015


Test build #41074 has finished for PR 8258 at commit 30b0ee5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 18, 2015


Test build #41077 has finished for PR 8258 at commit cfef35d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 18, 2015


Test build #41089 timed out for PR 8258 at commit d8831a2 after a configured wait of 175m.

Member

LGTM. I was going to ask whether there is some concurrency utility for this; you could use a task, future, or semaphore, but it might not be any less code.

Contributor Author

I wish there was something, but the GatewayServer API is super weird.
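
To make the "concurrency utility" idea in this exchange concrete, here is a hedged sketch in Python of the general pattern: a server binds its port on a background thread, and callers block on an event until the real port is known instead of reading a -1 sentinel. This is not the change made in this PR and not the GatewayServer API. `PortPublisher`, `wait_for_port`, and the 5 second timeout are all hypothetical names and values chosen for illustration.

```python
import socket
import threading


class PortPublisher:
    """Hypothetical stand-in for a server whose port is only known after startup."""

    def __init__(self):
        self.port = -1                     # sentinel meaning "not bound yet"
        self._sock = None
        self._ready = threading.Event()

    def start(self):
        # Startup happens on another thread, so a caller that reads
        # `self.port` right after start() may still see -1.
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 0))        # port 0 asks the OS for a free port
        sock.listen(1)
        self._sock = sock
        self.port = sock.getsockname()[1]
        self._ready.set()                  # signal only after the port is real

    def wait_for_port(self, timeout=5.0):
        # Block until the background thread has published a valid port.
        if not self._ready.wait(timeout):
            raise TimeoutError("server did not bind in time")
        return self.port

    def close(self):
        if self._sock is not None:
            self._sock.close()


server = PortPublisher()
server.start()
port = server.wait_for_port()   # never returns the -1 sentinel
print(port)
server.close()
```

The point of the sketch is the ordering: the event is set only after the port field holds a real value, so any reader that waits on the event cannot observe the sentinel.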

vanzin commented Aug 18, 2015

Contributor Author

I'll try tests again but I'm inclined to merge this soon. retest this please

SparkQA commented Aug 18, 2015


Test build #41137 has finished for PR 8258 at commit d8831a2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

vanzin commented Aug 18, 2015

Contributor Author

The PySpark failure is the same flaky test that has been failing on and off for a long time. I'm merging this.

asfgit closed this in c1840a8 Aug 18, 2015
vanzin deleted the SPARK-7736 branch August 18, 2015 18:42
asfgit pushed a commit that referenced this pull request Sep 9, 2015
The fix for SPARK-7736 introduced a race where a port value of "-1"
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8258 from vanzin/SPARK-7736.

(cherry picked from commit c1840a8)
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
The fix for SPARK-7736 introduced a race where a port value of "-1"
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#8258 from vanzin/SPARK-7736.

(cherry picked from commit c1840a8)

4 participants