[BEAM-11929] Rely on py3.6+ dictionary ordering in beam.Row#14156
TheNeuralBit merged 5 commits into apache:master
udim
left a comment
This change should be mentioned in CHANGES.md if it breaks with existing behavior. For instance, if existing uses rely on fields being sorted by name.
sdks/python/apache_beam/pvalue.py
Outdated
I tried running this interactively but got the same hash:
>>> hash(type({}.items()))
5913863196444
>>> hash(type({1:2}.items()))
5913863196444
Is it working as intended?
I guess this is technically okay according to the docs, but it's probably not what you wanted: "The only required property is that objects which compare equal have the same hash value."
I didn't write the original __hash__ implementation that uses type(). You're right that it seems like an odd choice though. It seems likely this was a copy-paste error or a typo.
This is definitely a bug.
Ok, went ahead and pushed a commit to fix this.
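For readers following the thread, the problem can be sketched in isolation. `RowSketch` and `buggy_hash` below are hypothetical names used only for illustration; this is not Beam's actual `Row` code, just a minimal sketch assuming field values are hashable.

```python
class RowSketch:
    """Hypothetical stand-in for a row-like object; not Beam's actual code."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __eq__(self, other):
        return type(self) is type(other) and self.__dict__ == other.__dict__

    def buggy_hash(self):
        # Hashes the dict_items *type*, so every instance gets the same value.
        return hash(type(self.__dict__.items()))

    def __hash__(self):
        # Hashes the (name, value) pairs themselves, in insertion order.
        return hash(tuple(self.__dict__.items()))


a = RowSketch(x=1, y=2)
b = RowSketch(x=3, y=4)

# The type of dict.items() never changes, so the buggy hash is constant.
same_buggy = a.buggy_hash() == b.buggy_hash()

# Equal rows still hash equal with the items-based hash.
equal_rows_hash_equal = hash(RowSketch(x=1, y=2)) == hash(a)
```

A constant hash is legal, as noted above, but it would put every row in the same bucket of a dict or set.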
sdks/python/apache_beam/pvalue.py
Outdated
Comparing dicts ignores order. Is that intentional?
>>> {1:2, 3:4} == {1:2, 3:4}
True
>>> {1:2, 3:4} == {3:4, 1:2}
True
Hm that's consistent with the current implementation, but IMO it makes sense to make this sensitive to field order. That will be consistent with Java, where two Rows with different schemas are considered unequal.
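One way an order-sensitive comparison could look, as a rough sketch only: `fields_equal_ordered` is a hypothetical helper, not the function used in this PR. It compares the item sequences instead of the dicts themselves.

```python
def fields_equal_ordered(left, right):
    """Compare two dicts by both content and insertion order."""
    return list(left.items()) == list(right.items())


same_order = fields_equal_ordered({1: 2, 3: 4}, {1: 2, 3: 4})
swapped = fields_equal_ordered({1: 2, 3: 4}, {3: 4, 1: 2})

# Plain dict equality, for contrast, ignores insertion order.
plain_dict_eq = {1: 2, 3: 4} == {3: 4, 1: 2}
```

Here `same_order` is true and `swapped` is false, while `plain_dict_eq` stays true.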
Run Python PreCommit
That's a weird failure. It doesn't seem related to this change and I can't repro it locally. But it's also not a flake that shows up on Python PreCommit Cron. Re-running PreCommit on unmodified code to see if it's reproducible.
1f9fa77 to 742414b
Ok, I think I've addressed all the comments, PTAL @udim
44ee638 to 4b63f9c
ping @ehudm do you have time to review this?
udim
left a comment
Thanks, and sorry for the delay!
…4156)
* Add (failing) test of beam Row -> DataFrame
* Rely on py3.6+ dict ordering rather than sorting
* Fix __hash__ typo
* Make __eq__ sensitive to field order
* Update docs, add CHANGES entry
Previously, fields were sorted by name, which led to a mismatch between the inferred schema and some interactions with the Row object (e.g. iterating), causing bugs like BEAM-11929. With this change we rely on the dictionary ordering provided in Python 3.6+ to give a consistent field order.
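The kind of mismatch described here can be sketched with a plain dict. This is an illustration only: the field names are made up, and it does not reproduce the exact code path behind BEAM-11929.

```python
# Fields declared as name first, then id (hypothetical example data).
fields = {"name": "a", "id": 1}

sorted_names = sorted(fields)       # names ordered alphabetically
insertion_names = list(fields)      # names in declaration order
values_in_insertion_order = list(fields.values())

# Pairing sorted names with insertion-ordered values mislabels the columns.
mismatched = dict(zip(sorted_names, values_in_insertion_order))

# Using insertion order on both sides keeps names and values aligned.
aligned = dict(zip(insertion_names, values_in_insertion_order))
```

With sorted names, `"id"` ends up paired with `"a"` and `"name"` with `1`; using insertion order throughout avoids that.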