Add extension type support and implement GuidArray#268
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request adds comprehensive support for Apache Arrow extension types to the C# implementation, addressing part of issue #182. Extension types allow users to define custom logical types layered on top of Arrow's built-in storage types, enabling domain-specific type systems while maintaining compatibility with the Arrow format.
Changes:
- Introduces extension type infrastructure with
ExtensionType,ExtensionDefinition,ExtensionArray, andExtensionTypeRegistryclasses - Implements
GuidArrayas the first concrete extension type, representing UUIDs using RFC 4122 big-endian byte layout - Adds
ArrowContextto bundle configuration (memory allocator, compression factory, and extension registry) for cleaner API design - Updates IPC readers/writers and C Data Interface to properly serialize/deserialize extension type metadata
- Includes comprehensive tests and Python interop tests
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/Apache.Arrow/ExtensionType.cs | Core base classes for extension types and arrays |
| src/Apache.Arrow/ExtensionTypeRegistry.cs | Thread-safe registry for extension type definitions with cloning support |
| src/Apache.Arrow/ArrowContext.cs | Context object bundling allocator, compression factory, and extension registry |
| src/Apache.Arrow/Arrays/GuidArray.cs | Implementation of UUID/GUID extension type with RFC 4122 conversion |
| src/Apache.Arrow/Arrays/FixedSizeBinaryArray.cs | Performance optimization for AppendNull using stack allocation |
| src/Apache.Arrow/Arrays/ArrowArrayFactory.cs | Factory support for building extension arrays |
| src/Apache.Arrow/Arrays/ArrowArrayBuilderFactory.cs | Marks extension types as not supporting generic builders |
| src/Apache.Arrow/Types/IArrowType.cs | Adds Extension type ID to enum |
| src/Apache.Arrow/Types/ArrowTypeExtensions.cs | Helper method to unwrap extension types to storage types |
| src/Apache.Arrow/Ipc/MessageSerializer.cs | Schema deserialization with extension type resolution |
| src/Apache.Arrow/Ipc/ArrowStreamWriter.cs | Automatic injection of extension metadata during serialization |
| src/Apache.Arrow/Ipc/ArrowStreamReader.cs | New constructors accepting ArrowContext |
| src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs | Extension registry support in stream reader |
| src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs | Storage type unwrapping in array loading |
| src/Apache.Arrow/Ipc/ArrowMemoryReaderImplementation.cs | Extension registry support in memory reader |
| src/Apache.Arrow/Ipc/ArrowFileReader.cs | New constructor accepting ArrowContext |
| src/Apache.Arrow/Ipc/ArrowFileReaderImplementation.cs | Extension registry support in file reader |
| src/Apache.Arrow/Ipc/ArrowFooter.cs | Extension registry support in footer deserialization |
| src/Apache.Arrow/C/CArrowSchemaImporter.cs | Extension type resolution during C Data Interface import |
| src/Apache.Arrow/C/CArrowSchemaExporter.cs | Extension metadata injection during C Data Interface export |
| src/Apache.Arrow/C/CArrowArrayImporter.cs | Storage type handling during array import |
| src/Apache.Arrow/C/CArrowArrayExporter.cs | Storage type handling during array export |
| test/Apache.Arrow.Tests/ExtensionTypeTests.cs | Comprehensive unit tests for extension type functionality |
| test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs | Python interop tests for GuidArray |
Comments suppressed due to low confidence (5)
src/Apache.Arrow/Ipc/ArrowFileReader.cs:69
- Inconsistent null handling for ArrowContext parameter. The ArrowStreamReader constructor (lines 74-82 in ArrowStreamReader.cs) throws ArgumentNullException when context is null, but this ArrowFileReader constructor allows null context by using the null-conditional operator. While the underlying implementation can handle null values (they default to MemoryAllocator.Default.Value and ExtensionTypeRegistry.Default), this inconsistency in the public API could be confusing to users. Consider either adding a null check here for consistency, or updating ArrowStreamReader to allow null context as well.
: base(new ArrowFileReaderImplementation(stream, context?.Allocator, context?.CompressionCodecFactory, leaveOpen, context?.ExtensionRegistry))
test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs:1202
- Missing CArrowSchema.Free call for cSchema. This is a resource leak - the allocated cSchema should be freed after use. All other similar tests in this file properly free both cArray and cSchema.
CArrowArray.Free(cArray);
}
src/Apache.Arrow/Arrays/GuidArray.cs:75
- This TODO comment appears to be outdated - a Builder class is already implemented below on lines 83-123. The TODO should be removed as the functionality is complete.
// TODO: construct builder
test/Apache.Arrow.Tests/ExtensionTypeTests.cs:51
- Redundant extension metadata in BuildGuidRecordBatch. The Field is created with explicit ARROW:extension:name and ARROW:extension:metadata entries, but these are automatically injected by ArrowStreamWriter when it encounters a GuidType. This is inconsistent with the approach used in CDataInterfacePythonTests.cs line 1120 where the Field is created without explicit metadata. Consider removing the explicit metadata dictionary from this helper method to avoid redundancy and ensure consistency across tests.
var field = new Field("guids", GuidType.Default, true,
new Dictionary<string, string>
{
["ARROW:extension:name"] = "arrow.uuid",
["ARROW:extension:metadata"] = ""
});
test/Apache.Arrow.Tests/ExtensionTypeTests.cs:268
- Test comment is misleading. The comment states "Mutating clone should not affect original" but the test creates a new registry (newRegistry) instead of mutating the clone. To properly test isolation, the code should unregister from the clone and verify the original still has it registered. However, the current test does verify that cloning works and that a new registry doesn't have registrations from another registry.
// Mutating clone should not affect original
var newRegistry = new ExtensionTypeRegistry();
Assert.False(newRegistry.TryGetDefinition("arrow.uuid", out _));
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
adamreeve
left a comment
There was a problem hiding this comment.
This looks great thanks @CurtHagenlocher! I just have some pretty minor comments.
The new ArrowContext type is a nice touch, and it's good that registries can be isolated to support large applications that might use Arrow in different places.
| extensionRegistry.TryGetDefinition(extName, out ExtensionDefinition extDef)) | ||
| { | ||
| metadata.TryGetValue("ARROW:extension:metadata", out string extMeta); | ||
| if (extDef.TryCreateType(type, extMeta, out ExtensionType extType)) |
There was a problem hiding this comment.
A failure here seems like something that should be logged, but we don't have any logging infrastructure in the library so I guess there isn't really anything that can be done. I think it's probably right that failure shouldn't throw an exception.
| } | ||
| } | ||
|
|
||
| [SkippableFact] |
There was a problem hiding this comment.
Hmm looks like the Python integration tests aren't running in CI: https://github.com/apache/arrow-dotnet/actions/runs/22278848542/job/64445654891?pr=268#step:6:163
I made #270 to follow up on this. Looks like it might be caused by Python 3.14 being used in CI, which PythonNet doesn't seem to support yet.
Not a blocker for this PR though, I ran the tests on my machine with Python 3.12.
What's Changed
Adds generic support for extension types including an extension type registry. This has been done in what should be a fairly compatible fashion. No extension types are enabled by default, and different consumers of Arrow inside the same process can explicitly designate which extension types they want to make use of.
Packages several extensibility points into a single
ArrowContext: the compression factory, the allocator and the extension registry.Implements the
GuidArrayextension type.Adds tests including interop tests with Python for
GuidArray.Partially implements #182.