
Conversation

@kevinjqliu (Contributor) commented Sep 28, 2025

Rationale for this change

Closes #1527

This PR modifies dev/Dockerfile to use the apache/spark image as the base. This should be better than downloading the Spark distribution from the Apache CDN, and likely faster on GitHub runners.

This PR also:

  • modifies provision.py to use Spark Connect (a sketch follows below)
  • adds a healthcheck to the Spark Docker container
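
For context, a minimal sketch of what the Spark Connect side of provisioning can look like. This is not the PR's actual code: the host name and the readiness helper are illustrative assumptions (15002 is the default Spark Connect port), standing in for the container healthcheck the PR adds.

    import socket
    import time

    from pyspark.sql import SparkSession

    def wait_for_port(host: str, port: int, timeout: float = 60.0) -> None:
        """Poll until the port accepts TCP connections, a rough stand-in
        for waiting on the container healthcheck."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                with socket.create_connection((host, port), timeout=2):
                    return
            except OSError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"{host}:{port} is not reachable")
                time.sleep(1)

    # Hypothetical host; 15002 is the default Spark Connect port.
    wait_for_port("localhost", 15002)

    # Spark Connect client session: no local JVM is started, all work
    # runs inside the container.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()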

Are these changes tested?

Are there any user-facing changes?

@kevinjqliu kevinjqliu marked this pull request as ready for review September 28, 2025 07:29
@kevinjqliu kevinjqliu requested a review from Fokko September 28, 2025 07:29
Comment on lines -33 to -34
.config("spark.sql.shuffle.partitions", "1")
.config("spark.default.parallelism", "1")
Contributor

I think we've set these to avoid creating multiple files with just one row. This way, when a shuffle is performed, it will be coalesced into a single file. This affects tests such as positional deletes, because when all the rows in a single file are marked for deletion, the whole file is dropped instead of creating merge-on-read deletes.
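
To make that concrete, a hypothetical illustration (not code from this PR): it assumes a SparkSession named `spark` and a format-version-2 table configured for merge-on-read deletes, reusing the test table name from the diff below. Row 9 is the one the existing test expects to be deleted.

    # With rows 1..12 packed into one data file, deleting a single row
    # forces Spark to write a positional delete file. If every row sits in
    # its own one-row file, Spark drops that file outright instead, and no
    # delete file is ever created.
    spark.sql("DELETE FROM default.test_positional_mor_deletes_v2 WHERE number = 9")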

Contributor Author

Yeah, I remember these. I can add them to spark-defaults, but the tests are passing now without them 🤷

Contributor

The tests are passing, but we're no longer testing the positional deletes, since Spark will throw away the whole file instead of creating positional delete files:

The following test illustrates the problem:

diff --git a/tests/integration/test_reads.py b/tests/integration/test_reads.py
index 375eb35b2..ed6e805e3 100644
--- a/tests/integration/test_reads.py
+++ b/tests/integration/test_reads.py
@@ -432,6 +432,11 @@ def test_pyarrow_deletes(catalog: Catalog, format_version: int) -> None:
     #  (11, 'k'),
     #  (12, 'l')
     test_positional_mor_deletes = catalog.load_table(f"default.test_positional_mor_deletes_v{format_version}")
+
+    if format_version == 2:
+        files = test_positional_mor_deletes.scan().plan_files()
+        assert all([len(file.delete_files) > 0 for file in files])
+
     arrow_table = test_positional_mor_deletes.scan().to_arrow()
     assert arrow_table["number"].to_pylist() == [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]

This one passes on main but fails on this branch.

Contributor Author

Looks like Spark Connect doesn't support these options:

        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.default.parallelism", "1")

And INSERT INTO writes one data file per row. To force a single data file, I'm using:

.coalesce(1).writeTo(identifier).append()
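
For illustration, a hedged sketch of that single-file write in context. The DataFrame contents are invented, the table name is the one from the test diff above, and `spark` is assumed to be a Spark Connect session.

    # Assumes `spark` is a Spark Connect session and the table exists.
    identifier = "default.test_positional_mor_deletes_v2"

    df = spark.createDataFrame([(n,) for n in range(1, 13)], ["number"])

    # coalesce(1) collapses the plan to one partition, so the append writes
    # a single data file even without the session-level parallelism configs.
    df.coalesce(1).writeTo(identifier).append()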

@kevinjqliu kevinjqliu requested a review from Fokko October 1, 2025 17:12
@Fokko Fokko (Contributor) left a comment

Looks good, thanks @kevinjqliu for adding the checks 👍

@Fokko Fokko merged commit 5ee5eea into apache:main Oct 2, 2025
10 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/use-spark-docker branch October 3, 2025 03:18
Fokko pushed a commit that referenced this pull request Oct 3, 2025
Follow up to #2540
Related to #1527

# Rationale for this change
Put all Spark-related files inside `dev/spark/`.

## Are these changes tested?

## Are there any user-facing changes?

kevinjqliu added a commit to apache/iceberg-rust that referenced this pull request Jan 19, 2026
## Which issue does this PR close?


- Closes #2041

## What changes are included in this PR?

We made some upgrades to the Spark Dockerfile in pyiceberg (apache/iceberg-python#2540), which I think iceberg-rust's Dockerfile previously copied. Porting those changes over:
- Use `apache/spark` as the base image (should be faster than downloading Spark from the Apache CDN)
- Upgrade to Spark 4.0
- Use Spark Connect for provisioning


## Are these changes tested?
Yes
