GH-44501: [C#] Use an index lookup for O(1) field index access #44633
Opened by vthemelis (1 commit)
Force-pushed from ce4dda4 to c7a0c58.
@CurtHagenlocher it looks like my tests are failing in CI, but I can't replicate this locally.
Force-pushed from c7a0c58 to 21a86cd.
Oh, I see the issue. The CI does a merge into The PR above has changed the lookup behaviour when the field is not matched.
Force-pushed from e6d024d to 52b374a.
@CurtHagenlocher and @georgevanburgh I've reverted to the original behaviour (throwing an
Force-pushed from 52b374a to d099cdf.
I'm a bit tied up with work but will look at this by Friday.
Thank you very much!
CurtHagenlocher left a comment:
Thanks for the change! (And sorry for the delay.) I left a few suggestions.
This is a breaking change. Is there a specific reason for it?
This is actually fixing a regression that was added in #44576
The function would previously throw InvalidOperationException due to .First(...). The description of the PR is erroneous in asserting that -1 is returned in the case of no match as the exception would first be thrown by the argument of the .IndexOf(...) function.
I noticed this because I wrote my unit tests before that PR was merged. Are you happy with keeping the return -1 regression?
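To make the point about `.First(...)` concrete, here is a minimal sketch. It assumes the earlier lookup had roughly the shape `list.IndexOf(list.First(predicate))`; the `names` list and the `IndexViaFirst` helper are hypothetical stand-ins, not the Arrow source. `Enumerable.First` throws when nothing matches, so `IndexOf` never gets the chance to return -1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a schema's field names; not the Arrow types.
var names = new List<string> { "f0", "f1", "f2" };

// Assumed shape of the old lookup: find the element first, then ask for its index.
int IndexViaFirst(string name) =>
    names.IndexOf(names.First(n => StringComparer.Ordinal.Equals(n, name)));

// A present field resolves to its position.
Console.WriteLine(IndexViaFirst("f1")); // prints 1

// A missing field never reaches IndexOf: First(...) throws when nothing matches.
string outcome;
try
{
    outcome = IndexViaFirst("missing").ToString();
}
catch (InvalidOperationException)
{
    outcome = "InvalidOperationException";
}
Console.WriteLine(outcome); // prints InvalidOperationException
```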
I think for new code, we should use Ordinal. I don't think CurrentCulture really makes sense here.
If I change to Ordinal then this would be a breaking change as this is used in GetFieldIndex(string). Are you okay with changing the behaviour?
Of course, I could then use the lookup only when the user has provided Ordinal as the comparator, but this would defeat the purpose of this PR, which is to make lookup of columns in RecordBatch O(1).
Or I could change this in a separate PR that implements your suggestions here #44650 (comment) ?
Can we also expose this in a way that lets users get all the indexes for a particular field name?
Sure, I'm happy to add this, though it feels like it's adding further to the scope of the PR. I'm happy to implement any API suggestions that you may have; what would be your preferred way?
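One possible shape for "all the indexes for a particular field name" is sketched below. This is only an illustration under stated assumptions: `names`, `indexesByName`, and `GetFieldIndexes` are hypothetical, the choice of `StringComparer.Ordinal` is an assumption, and nothing here is a proposed Arrow API. It groups positions by name once with LINQ's `ToLookup`, so each later query is a single lookup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical field names with a duplicate; not the Arrow Schema type.
var names = new List<string> { "id", "value", "id" };

// Group positions by name once; the comparer choice here is an assumption.
ILookup<string, int> indexesByName = names
    .Select((name, index) => (name, index))
    .ToLookup(p => p.name, p => p.index, StringComparer.Ordinal);

// Hypothetical helper: every position for a name, empty when the name is absent.
int[] GetFieldIndexes(string name) => indexesByName[name].ToArray();

Console.WriteLine(string.Join(",", GetFieldIndexes("id"))); // prints 0,2
Console.WriteLine(GetFieldIndexes("missing").Length);       // prints 0
```

An `ILookup` returns an empty sequence for an unknown key, which avoids the throw-versus-return--1 question for the multi-index case.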
Thank you very much for your review, @CurtHagenlocher! I added some replies to your comments. Let me know your thoughts and I will proceed with the changes.
Hi @CurtHagenlocher, do you have some time to look into this?
Force-pushed from d099cdf to a34abec.
Force-pushed from a34abec to 92160cb.
Hi @CurtHagenlocher, I've undone the style change as you suggested. Let me know your views on the other questions and I'm happy to change this PR accordingly.
If you'd prefer, I can arrange separate PRs for the aspects suggested.
Hi @CurtHagenlocher, can you still have a look at this?
@CurtHagenlocher would it be possible to review this? We hit this recently, and it inadvertently introduced a large slowdown in our code.
It would be nice to get this feature added to speed up field and column lookups, but from my understanding there are two main issues:
I think Curt's suggestion to deprecate existing APIs and introduce new better APIs (#44650 (comment)) makes sense. This would be fairly disruptive though as it would also require deprecating How about this as a potential way forward?
I should add that the above changes could all be done as a follow-up to this PR, so I think we could merge this nearly as-is to get the performance improvement, and then follow up later to fix the comparison inconsistency. Given the breaking change to return -1 instead of throw
namespace Apache.Arrow.Tests;
public class SchemaTests
Can we add a test that verifies we can use a different comparer from the default, e.g. getting a field named "f0" using "F0" and StringComparer.OrdinalIgnoreCase?
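The check requested above could look roughly like the following sketch. It is self-contained rather than a real `SchemaTests` case: `names` and the local `GetFieldIndex` are hypothetical stand-ins for the schema and its lookup, and returning -1 for a miss is an assumption made for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the schema's fields; not the Arrow Schema type.
var names = new List<string> { "f0", "f1" };

// Stand-in lookup: first index whose name matches under the given comparer, else -1.
int GetFieldIndex(string name, IEqualityComparer<string> comparer)
{
    for (int i = 0; i < names.Count; i++)
    {
        if (comparer.Equals(names[i], name)) return i;
    }
    return -1;
}

// The case differs, so an ordinal comparison does not match...
Console.WriteLine(GetFieldIndex("F0", StringComparer.Ordinal));           // prints -1
// ...while the case-insensitive comparer finds the field at position 0.
Console.WriteLine(GetFieldIndex("F0", StringComparer.OrdinalIgnoreCase)); // prints 0
```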
Closing as the C# implementation has moved to a new repository.
Closes #301.

Ports the optimization from the closed PR at apache/arrow#44633 into the new .NET-specific repository. The original PR was closed on November 18, 2025 with the note that the C# implementation had moved to a new repository.

This version keeps the current `arrow-dotnet` behavior intact:
- `GetFieldIndex(..., comparer: null)` and the default path now use a cached `CurrentCulture` index lookup for the common case.
- Missing fields still return `-1`.
- Duplicate field names still return the first match.
- Non-default comparers still fall back to the existing linear scan.

I also added dedicated schema tests covering:
- `null`, `Ordinal`, `OrdinalIgnoreCase`, and `CurrentCulture` comparers
- duplicate-name lookup returning the first match
- missing-name behavior for each comparer

Local verification:
- `dotnet build test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj`
- `DOTNET_ROLL_FORWARD=Major DOTNET_ROLL_FORWARD_TO_PRERELEASE=1 dotnet test test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj --no-restore --logger 'console;verbosity=minimal'`
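The behavior listed above can be sketched as follows. This is a simplified illustration, not the `arrow-dotnet` code: `names`, `cache`, and the local `GetFieldIndex` are hypothetical, and treating a comparer equal to `StringComparer.CurrentCulture` as the default path is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical field names, including a duplicate; not the Arrow Schema type.
var names = new List<string> { "a", "b", "a" };

// Build the default-comparer cache once; TryAdd keeps the first index for duplicates.
var cache = new Dictionary<string, int>(StringComparer.CurrentCulture);
for (int i = 0; i < names.Count; i++)
{
    cache.TryAdd(names[i], i);
}

int GetFieldIndex(string name, IEqualityComparer<string>? comparer = null)
{
    // Default path (assumed): a null or CurrentCulture comparer uses the cached index.
    if (comparer is null || comparer.Equals(StringComparer.CurrentCulture))
    {
        return cache.TryGetValue(name, out int cached) ? cached : -1;
    }

    // Any other comparer falls back to a linear scan, returning the first match.
    for (int i = 0; i < names.Count; i++)
    {
        if (comparer.Equals(names[i], name)) return i;
    }
    return -1;
}

Console.WriteLine(GetFieldIndex("a"));                                   // prints 0
Console.WriteLine(GetFieldIndex("missing"));                             // prints -1
Console.WriteLine(GetFieldIndex("B", StringComparer.OrdinalIgnoreCase)); // prints 1
```

The cache only helps the default path, so callers passing another comparer still pay the O(N) scan, which matches the trade-off discussed in the review.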
Fixes #44501
Rationale for this change
This should speed up the reasonably common operation of looking up a column via name in Schema or RecordBatch. I would argue that the fact that this operation was previously O(N) (now O(1)) is unintuitive and could easily lead to accidental performance woes.
Note that I would also like to replace the existing lookups with signature string -> Field, but unfortunately those use StringComparer.Default instead of StringComparer.CurrentCulture (which I'm using to make the changed public function backwards compatible). Not sure if this is intentional.
What changes are included in this PR?
This PR simply memoizes the index lookup of the fields. It should not break any existing behaviour.
It also fixes a recent regression in the behaviour of the changed function when a custom comparator is used and there are no matches in the schema.
Are these changes tested?
All code paths changed are covered by new unit tests.
Are there any user-facing changes?
Yes, the lookup function throws an exception when there are no matching fields rather than returning -1. This was a regression introduced O(days) ago.
Column(string) method in RecordBatch is linear to the number of columns #44501