
Conversation

@kevinjqliu (Contributor) commented Sep 28, 2025

Rationale for this change

Closes #1527

This PR modifies dev/Dockerfile to use the apache/spark image as the base. This should be better than downloading the Spark distribution from the Apache CDN, and likely faster on GitHub runners.

This PR also:

  • modifies provision.py to use Spark Connect (a sketch follows below)
  • adds a healthcheck to the Spark Docker container
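
For context, a minimal sketch of what the Spark Connect side of provisioning can look like. This is not the PR's actual code: the host name and the readiness helper are illustrative assumptions (15002 is the default Spark Connect port), standing in for the container healthcheck the PR adds.

    import socket
    import time

    from pyspark.sql import SparkSession

    def wait_for_port(host: str, port: int, timeout: float = 60.0) -> None:
        """Poll until the port accepts TCP connections, a rough stand-in
        for waiting on the container healthcheck."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"{host}:{port} is not reachable")
                time.sleep(1)

    # Hypothetical host; 15002 is the default Spark Connect port.
    wait_for_port("localhost", 15002)

    # Spark Connect client session: no local JVM is started, all work
    # runs inside the container.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()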

Are these changes tested?

Are there any user-facing changes?

@kevinjqliu kevinjqliu marked this pull request as ready for review September 28, 2025 07:29
@kevinjqliu kevinjqliu requested a review from Fokko September 28, 2025 07:29
Comment on lines -33 to -34
.config("spark.sql.shuffle.partitions", "1")
.config("spark.default.parallelism", "1")
Contributor

I think we've set these to avoid creating multiple files with just one row. This way, when a shuffle is performed, it will be coalesced into a single file. This affects tests such as positional deletes, because when all the rows in a single file are marked for deletion, the whole file is dropped instead of creating merge-on-read deletes.
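
To make that concrete, a hypothetical illustration (not code from this PR): it assumes a SparkSession named `spark` and a format-version-2 table configured for merge-on-read deletes, reusing the test table name from the diff below. Row 9 is the one the existing test expects to be deleted.

    # With rows 1..12 packed into one data file, deleting a single row
    # forces Spark to write a positional delete file. If every row sits in
    # its own one-row file, Spark drops that file outright instead, and no
    # delete file is ever created.
    spark.sql("DELETE FROM default.test_positional_mor_deletes_v2 WHERE number = 9")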

Contributor Author

Yeah, I remember these. I can add them to spark-defaults, but the tests are passing now without them 🤷

Contributor

The tests are passing, but we're no longer testing the positional deletes, since Spark will throw away the whole file instead of creating positional delete files:

The following test illustrates the problem:

diff --git a/tests/integration/test_reads.py b/tests/integration/test_reads.py
index 375eb35b2..ed6e805e3 100644
--- a/tests/integration/test_reads.py
+++ b/tests/integration/test_reads.py
@@ -432,6 +432,11 @@ def test_pyarrow_deletes(catalog: Catalog, format_version: int) -> None:
     #  (11, 'k'),
     #  (12, 'l')
     test_positional_mor_deletes = catalog.load_table(f"default.test_positional_mor_deletes_v{format_version}")
+
+    if format_version == 2:
+        files = test_positional_mor_deletes.scan().plan_files()
+        assert all([len(file.delete_files) > 0 for file in files])
+
     arrow_table = test_positional_mor_deletes.scan().to_arrow()
     assert arrow_table["number"].to_pylist() == [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]

This one passes on main but fails on this branch.

Contributor Author

Looks like Spark Connect doesn't support these options:

        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.default.parallelism", "1")

And INSERT INTO writes one data file per row. To force a single data file, I'm using:

.coalesce(1).writeTo(identifier).append()
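
For illustration, a hedged sketch of that single-file write in context. The DataFrame contents are invented, the table name is the one from the test diff above, and `spark` is assumed to be a Spark Connect session.

    # Assumes `spark` is a Spark Connect session and the table exists.
    identifier = "default.test_positional_mor_deletes_v2"

    df = spark.createDataFrame([(n,) for n in range(1, 13)], ["number"])

    # coalesce(1) collapses the plan to one partition, so the append writes
    # a single data file even without the session-level parallelism configs.
    df.coalesce(1).writeTo(identifier).append()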

@kevinjqliu kevinjqliu requested a review from Fokko October 1, 2025 17:12
@Fokko Fokko (Contributor) left a comment

Looks good, thanks @kevinjqliu for adding the checks 👍

@Fokko Fokko merged commit 5ee5eea into apache:main Oct 2, 2025
10 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/use-spark-docker branch October 3, 2025 03:18
Fokko pushed a commit that referenced this pull request Oct 3, 2025
Follow up to #2540
Related to #1527

# Rationale for this change
Put all Spark-related files inside `dev/spark/`.

## Are these changes tested?

## Are there any user-facing changes?

kevinjqliu added a commit to apache/iceberg-rust that referenced this pull request Jan 19, 2026
## Which issue does this PR close?


- Closes #2041

## What changes are included in this PR?

We made some upgrades to the Spark Dockerfile in pyiceberg (apache/iceberg-python#2540), which I think iceberg-rust's Dockerfile previously copied. Porting those changes over:
- Use `apache/spark` as the base image (should be faster than downloading Spark from the Apache CDN)
- Upgrade to Spark 4.0
- Use Spark Connect for provisioning


## Are these changes tested?
Yes
