Skip to content

[Rust] StructArray: handling duplicate field names #27083

Description

@asfimport

The arrow spec leaves the solution of duplicate field names to implementors.

The C++'s solution: ignore or raise error, the Java's solution: ignore, append, replace or raise error. Both use ignore as the default. Here is the references:

  • https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc

  • https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java

    I'm not expert at database or data science, but as far as I know, in the traditional RDBMS domain, it's unusual to allow duplicate field names. Further more, in the data analysis domain, perhaps it's usual to normalize/clean various kind of bad/dirty data interactively with tools like pandas?

    Back to the problem, I have an example: given duplicate field names A A A B B, the user who knows actual data MAY choose to: replace first A with second A and append third A, and ignore second B. Or the duplication was just mistake?

    Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly". Is not acceptable if we silently append/ignore/replace duplicate fields, resulting unexpected results that user does not aware at all.

    If we choose to support replace, ignore or append, at least we must let user control the exact behavior.  For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest just record this problem here, keep raising error until it's really necessary to support other solutions.

Reporter: Qingyou Meng / @mqy

Note: This issue was originally created as ARROW-11178. Please see the migration documentation for further details.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions