Skip to content

[BEAM-11929] Rely on py3.6+ dictionary ordering in beam.Row#14156

Merged
TheNeuralBit merged 5 commits intoapache:masterfrom
TheNeuralBit:BEAM-11929
Mar 24, 2021
Merged

[BEAM-11929] Rely on py3.6+ dictionary ordering in beam.Row#14156
TheNeuralBit merged 5 commits intoapache:masterfrom
TheNeuralBit:BEAM-11929

Conversation

@TheNeuralBit
Copy link
Copy Markdown
Member

Previously fields were sorted by name, which lead to a mismatch between the inferred schema and some interactions with the Row object (e.g. iterating), leading to bugs like BEAM-11929. With this change we just rely on the consistent dictionary ordering provided in python 3.6+ to provide a consistent field order.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@TheNeuralBit
Copy link
Copy Markdown
Member Author

R: @udim
CC: @robertwb

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 5, 2021

Codecov Report

Merging #14156 (4b63f9c) into master (03017b6) will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14156      +/-   ##
==========================================
- Coverage   83.34%   83.34%   -0.01%     
==========================================
  Files         468      468              
  Lines       58800    58800              
==========================================
- Hits        49009    49006       -3     
- Misses       9791     9794       +3     
Impacted Files Coverage Δ
...apache_beam/examples/complete/juliaset/__init__.py
...s/python/apache_beam/io/aws/clients/s3/__init__.py
.../srcs/sdks/python/apache_beam/io/gcp/bigtableio.py
...srcs/sdks/python/apache_beam/io/localfilesystem.py
...cs/sdks/python/apache_beam/dataframe/transforms.py
...s/python/apache_beam/examples/wordcount_minimal.py
.../srcs/sdks/python/apache_beam/coders/observable.py
...python/apache_beam/examples/complete/distribopt.py
...ython/apache_beam/typehints/decorators_test_py3.py
...he_beam/transforms/combinefn_lifecycle_pipeline.py
... and 926 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03017b6...4b63f9c. Read the comment docs.

@udim
Copy link
Copy Markdown
Member

udim commented Mar 10, 2021

Copy link
Copy Markdown
Member

@udim udim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should be mentioned in CHANGES.md if it breaks with existing behavior. For instance, if existing uses rely on fields being sorted by name.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running this interactively but got the same hash:

>>> hash(type({}.items()))
5913863196444
>>> hash(type({1:2}.items()))
5913863196444

Is it working as intended?

I guess this is technically okay according to the docs, but probably not what you wanted: The only required property is that objects which compare equal have the same hash value.
ref

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write the original __hash__ implementation that uses type(). You're right that it seems like an odd choice though. It seems likely this was a copy-paste error or a typo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a bug.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, went ahead and pushed a commit to fix this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing dicts ignores order. Is that that's intentional?

>>> {1:2, 3:4} == {1:2, 3:4}
True
>>> {1:2, 3:4} == {3:4, 1:2}
True

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm that's consistent with the current implementation, but IMO it makes sense to make this sensitive to field order. That will be consistent with Java, where two Rows with different schemas are considered unequal.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@TheNeuralBit
Copy link
Copy Markdown
Member Author

Run Python PreCommit

@TheNeuralBit
Copy link
Copy Markdown
Member Author

There's a failing test: https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/17554/testReport/junit/apache_beam.dataframe.io_test/IOTest/test_read_write_parquet/

That's a weird failure. It doesn't seem related to this change and I can't repro it locally. But it's also not a flake that shows up on Python PreCommit Cron. Re-running PreCommit on unmodified code to see if it's reproducible.

@TheNeuralBit
Copy link
Copy Markdown
Member Author

Ok, I think I've addressed all the comments, PTAL @udim

@TheNeuralBit
Copy link
Copy Markdown
Member Author

ping @ehudm do you have time to review this?

Copy link
Copy Markdown
Member

@udim udim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, and sorry for the delay!

@TheNeuralBit TheNeuralBit merged commit 377f4b2 into apache:master Mar 24, 2021
TheNeuralBit added a commit to TheNeuralBit/beam that referenced this pull request Mar 25, 2021
…4156)

* Add (failing) test of beam Row -> DataFrame

* Rely on py3.6+ dict ordering rather than sorting

* Fix __hash__ typo

* Make __eq__ sensitive to field order

* Update docs, add CHANGES entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants