Skip to content

Add extension type support and implement GuidArray#268

Merged
CurtHagenlocher merged 5 commits into
apache:mainfrom
CurtHagenlocher:GuidArrayTake2
Feb 23, 2026
Merged

Add extension type support and implement GuidArray#268
CurtHagenlocher merged 5 commits into
apache:mainfrom
CurtHagenlocher:GuidArrayTake2

Conversation

@CurtHagenlocher
Copy link
Copy Markdown
Contributor

What's Changed

Adds generic support for extension types including an extension type registry. This has been done in what should be a fairly compatible fashion. No extension types are enabled by default, and different consumers of Arrow inside the same process can explicitly designate which extension types they want to make use of.

Packages several extensibility points into a single ArrowContext: the compression factory, the allocator and the extension registry.

Implements the GuidArray extension type.

Adds tests including interop tests with Python for GuidArray.

Partially implements #182.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds comprehensive support for Apache Arrow extension types to the C# implementation, addressing part of issue #182. Extension types allow users to define custom logical types layered on top of Arrow's built-in storage types, enabling domain-specific type systems while maintaining compatibility with the Arrow format.

Changes:

  • Introduces extension type infrastructure with ExtensionType, ExtensionDefinition, ExtensionArray, and ExtensionTypeRegistry classes
  • Implements GuidArray as the first concrete extension type, representing UUIDs using RFC 4122 big-endian byte layout
  • Adds ArrowContext to bundle configuration (memory allocator, compression factory, and extension registry) for cleaner API design
  • Updates IPC readers/writers and C Data Interface to properly serialize/deserialize extension type metadata
  • Includes comprehensive tests and Python interop tests

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/Apache.Arrow/ExtensionType.cs Core base classes for extension types and arrays
src/Apache.Arrow/ExtensionTypeRegistry.cs Thread-safe registry for extension type definitions with cloning support
src/Apache.Arrow/ArrowContext.cs Context object bundling allocator, compression factory, and extension registry
src/Apache.Arrow/Arrays/GuidArray.cs Implementation of UUID/GUID extension type with RFC 4122 conversion
src/Apache.Arrow/Arrays/FixedSizeBinaryArray.cs Performance optimization for AppendNull using stack allocation
src/Apache.Arrow/Arrays/ArrowArrayFactory.cs Factory support for building extension arrays
src/Apache.Arrow/Arrays/ArrowArrayBuilderFactory.cs Marks extension types as not supporting generic builders
src/Apache.Arrow/Types/IArrowType.cs Adds Extension type ID to enum
src/Apache.Arrow/Types/ArrowTypeExtensions.cs Helper method to unwrap extension types to storage types
src/Apache.Arrow/Ipc/MessageSerializer.cs Schema deserialization with extension type resolution
src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Automatic injection of extension metadata during serialization
src/Apache.Arrow/Ipc/ArrowStreamReader.cs New constructors accepting ArrowContext
src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs Extension registry support in stream reader
src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs Storage type unwrapping in array loading
src/Apache.Arrow/Ipc/ArrowMemoryReaderImplementation.cs Extension registry support in memory reader
src/Apache.Arrow/Ipc/ArrowFileReader.cs New constructor accepting ArrowContext
src/Apache.Arrow/Ipc/ArrowFileReaderImplementation.cs Extension registry support in file reader
src/Apache.Arrow/Ipc/ArrowFooter.cs Extension registry support in footer deserialization
src/Apache.Arrow/C/CArrowSchemaImporter.cs Extension type resolution during C Data Interface import
src/Apache.Arrow/C/CArrowSchemaExporter.cs Extension metadata injection during C Data Interface export
src/Apache.Arrow/C/CArrowArrayImporter.cs Storage type handling during array import
src/Apache.Arrow/C/CArrowArrayExporter.cs Storage type handling during array export
test/Apache.Arrow.Tests/ExtensionTypeTests.cs Comprehensive unit tests for extension type functionality
test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs Python interop tests for GuidArray
Comments suppressed due to low confidence (5)

src/Apache.Arrow/Ipc/ArrowFileReader.cs:69

  • Inconsistent null handling for ArrowContext parameter. The ArrowStreamReader constructor (lines 74-82 in ArrowStreamReader.cs) throws ArgumentNullException when context is null, but this ArrowFileReader constructor allows null context by using the null-conditional operator. While the underlying implementation can handle null values (they default to MemoryAllocator.Default.Value and ExtensionTypeRegistry.Default), this inconsistency in the public API could be confusing to users. Consider either adding a null check here for consistency, or updating ArrowStreamReader to allow null context as well.
            : base(new ArrowFileReaderImplementation(stream, context?.Allocator, context?.CompressionCodecFactory, leaveOpen, context?.ExtensionRegistry))

test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs:1202

  • Missing CArrowSchema.Free call for cSchema. This is a resource leak - the allocated cSchema should be freed after use. All other similar tests in this file properly free both cArray and cSchema.
            CArrowArray.Free(cArray);
        }

src/Apache.Arrow/Arrays/GuidArray.cs:75

  • This TODO comment appears to be outdated - a Builder class is already implemented below on lines 83-123. The TODO should be removed as the functionality is complete.
        // TODO: construct builder

test/Apache.Arrow.Tests/ExtensionTypeTests.cs:51

  • Redundant extension metadata in BuildGuidRecordBatch. The Field is created with explicit ARROW:extension:name and ARROW:extension:metadata entries, but these are automatically injected by ArrowStreamWriter when it encounters a GuidType. This is inconsistent with the approach used in CDataInterfacePythonTests.cs line 1120 where the Field is created without explicit metadata. Consider removing the explicit metadata dictionary from this helper method to avoid redundancy and ensure consistency across tests.
            var field = new Field("guids", GuidType.Default, true,
                new Dictionary<string, string>
                {
                    ["ARROW:extension:name"] = "arrow.uuid",
                    ["ARROW:extension:metadata"] = ""
                });

test/Apache.Arrow.Tests/ExtensionTypeTests.cs:268

  • Test comment is misleading. The comment states "Mutating clone should not affect original" but the test creates a new registry (newRegistry) instead of mutating the clone. To properly test isolation, the code should unregister from the clone and verify the original still has it registered. However, the current test does verify that cloning works and that a new registry doesn't have registrations from another registry.
            // Mutating clone should not affect original
            var newRegistry = new ExtensionTypeRegistry();
            Assert.False(newRegistry.TryGetDefinition("arrow.uuid", out _));

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great thanks @CurtHagenlocher! I just have some pretty minor comments.

The new ArrowContext type is a nice touch, and it's good that registries can be isolated to support large applications that might use Arrow in different places.

Comment thread src/Apache.Arrow/Arrays/GuidArray.cs Outdated
extensionRegistry.TryGetDefinition(extName, out ExtensionDefinition extDef))
{
metadata.TryGetValue("ARROW:extension:metadata", out string extMeta);
if (extDef.TryCreateType(type, extMeta, out ExtensionType extType))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A failure here seems like something that should be logged, but we don't have any logging infrastructure in the library so I guess there isn't really anything that can be done. I think it's probably right that failure shouldn't throw an exception.

}
}

[SkippableFact]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm looks like the Python integration tests aren't running in CI: https://github.com/apache/arrow-dotnet/actions/runs/22278848542/job/64445654891?pr=268#step:6:163

I made #270 to follow up on this. Looks like it might be caused by Python 3.14 being used in CI, which PythonNet doesn't seem to support yet.

Not a blocker for this PR though, I ran the tests on my machine with Python 3.12.

Comment thread test/Apache.Arrow.Tests/ExtensionTypeTests.cs Outdated
Comment thread test/Apache.Arrow.Tests/ExtensionTypeTests.cs Outdated
Comment thread test/Apache.Arrow.Tests/ExtensionTypeTests.cs Outdated
Comment thread test/Apache.Arrow.Tests/ExtensionTypeTests.cs Outdated
@CurtHagenlocher CurtHagenlocher merged commit 59cc1af into apache:main Feb 23, 2026
14 checks passed
@CurtHagenlocher CurtHagenlocher deleted the GuidArrayTake2 branch February 23, 2026 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants