ARROW-6870: [C#] Add Support for Dictionary Arrays and Dictionary Encoding #10527

Closed
HashidaTKS wants to merge 15 commits into apache:master from HashidaTKS:ARROW-6870

HashidaTKS commented Jun 14, 2021
Contributor

This is an implementation of DictionaryBatch (de)serialization for the streaming format.

The following features are missing for now; I plan to implement them in a future PR.

  • (De)serialization for the file format and Flight is not implemented yet
  • isDelta is not supported yet

- Implement Dictionary serialization for ArrowStreamReader/Writer
@HashidaTKS

Contributor Author

cc: @eerhardt
Would you please review this when you have time?

@eerhardt eerhardt left a comment

Contributor

This looks like a good start. Thanks @HashidaTKS!

Here's my first round of comments.


public ArrowReaderImplementation()
{
_dictionaryMemo = new DictionaryMemo();

Contributor

Can we lazy load this until it is needed? It currently allocates 3 Dictionary<K, V> objects, which are unnecessary if the payload doesn't contain any dictionaries.

@HashidaTKS HashidaTKS Jun 17, 2021

Contributor Author

I have created a LazyCreator class and changed DictionaryMemo _dictionaryMemo to LazyCreator<DictionaryMemo> _lazyDictionaryMemo.
However, I am wondering whether this is the best way to lazy load, because it needs to allocate a LazyCreator object.
Is there a better approach?

Contributor

  1. I wouldn't create a new class for this; .NET already has Lazy<T> (https://docs.microsoft.com/dotnet/api/system.lazy-1) for it.
  2. Even better is to not use Lazy<T> at all, and just check for null before using it. You can make a private property to make this really easy:
private DictionaryMemo _dictionaryMemo;

private DictionaryMemo DictionaryMemo => _dictionaryMemo ??= new DictionaryMemo();

// later, in code that needs to use DictionaryMemo:

DictionaryMemo.GetDictionaryType(id);

This way, nothing new is allocated when we don't see a Dictionary batch in the payload.

Contributor Author

Thank you, I reimplemented it using approach 2.

Comment thread csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs Outdated
Comment thread csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs Outdated
Comment thread csharp/src/Apache.Arrow/RecordBatch.cs Outdated

  private readonly IMemoryOwner<byte> _memoryOwner;
- private readonly IList<IArrowArray> _arrays;
+ internal readonly IReadOnlyList<IArrowArray> _arrays;

Contributor

Instead of exposing this field, can you make an internal property?

Suggested change
- internal readonly IReadOnlyList<IArrowArray> _arrays;
+ private readonly IList<IArrowArray> _arrays;
+ internal IReadOnlyList<IArrowArray> Arrays => _arrays;

Contributor Author

Fixed it.

There was already a public property named Arrays so I named the internal property _Arrays.

Comment thread csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs Outdated
Comment thread csharp/src/Apache.Arrow/Arrays/ArrayData.cs

_fieldTypeBuilder = new ArrowTypeFlatbufferBuilder(Builder);
_options = options ?? IpcOptions.Default;
_dictionaryMemo = new DictionaryMemo();

Contributor

Can this be lazy-initialized until it is needed? That way every ArrowStreamWriter doesn't need to create it, if they don't use Dictionaries.

Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
_idToDictionary[id] = dictionary;
}

public void AddDictionaryDelta(long id, IArrowArray dictionary)

Contributor

Can this be deleted until it is supported? It is dead code.

Contributor Author

Fixed it.

Contributor

This hasn't been deleted (yet)

Contributor Author

Sorry, I didn't commit the change.

- Separate ArrayData constructors
- Remove unused methods / constructors
- Load DictionaryMemo lazily
- Hide _arrays of RecordBatch
- Fix tests
reader.ReadNextRecordBatch();

- Assert.Equal(1, memoryPool.Statistics.Allocations);
+ Assert.Equal(2, memoryPool.Statistics.Allocations);

Contributor Author

Note:
_allocator is called when reading both dictionary batches and record batches,
so the expected value of memoryPool.Statistics.Allocations is 2 in this context.

Contributor

Why change this test at all? If you want to add a new test for dictionaries - we should add a new one, and leave the existing test.

Contributor Author

I fixed it to use [Theory].

Comment thread csharp/src/Apache.Arrow/Arrays/ArrayData.cs Outdated
@HashidaTKS

Contributor Author

@eerhardt
Thank you for the review!

I have addressed the feedback.

- Change a LazyCreator constructor position
@HashidaTKS HashidaTKS requested a review from eerhardt June 17, 2021 18:37
- Slight refactoring for LazyCreator.Instance
@HashidaTKS

HashidaTKS commented Jun 19, 2021

Contributor Author

Added some minor changes.

- refactor ArrayData constructors
@HashidaTKS

Contributor Author

Modified ArrayData constructors.

I'm sorry for pushing so many commits after requesting a review.

@eerhardt

Contributor

Sorry for the delay @HashidaTKS. I didn't forget about this. I will get back to it this week, hopefully sometime today.

@HashidaTKS

Contributor Author

@eerhardt
No problem, I'm in no hurry. Please take your time.

Comment thread csharp/src/Apache.Arrow/RecordBatch.cs Outdated
Comment on lines 31 to 34
internal IReadOnlyList<IArrowArray> _Arrays => (IReadOnlyList<IArrowArray>)_arrays;

private readonly IMemoryOwner<byte> _memoryOwner;
private readonly IList<IArrowArray> _arrays;

Contributor

Suggested change
- internal IReadOnlyList<IArrowArray> _Arrays => (IReadOnlyList<IArrowArray>)_arrays;
- private readonly IMemoryOwner<byte> _memoryOwner;
- private readonly IList<IArrowArray> _arrays;
+ internal IReadOnlyList<IArrowArray> ArrayList => _arrays;
+ private readonly IMemoryOwner<byte> _memoryOwner;
+ private readonly List<IArrowArray> _arrays;
  1. Using the coding style from https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/coding-style.md, only fields should have _ prefix.
  2. We shouldn't need to cast here. Just have the private field as a List<IArrowArray>.

Contributor Author

I fixed it.

public Stream BaseStream { get; }
private readonly bool _leaveOpen;
private readonly MemoryAllocator _allocator;
private protected bool HasReadInitialDictionary { get; set; }

Contributor

Suggested change
- private protected bool HasReadInitialDictionary { get; set; }
+ private bool HasReadInitialDictionary { get; set; }

Nothing outside this class uses this property.

Contributor Author

I fixed it.

Comment on lines +32 to +33

public ArrowStreamReaderImplementation(Stream stream, MemoryAllocator allocator, bool leaveOpen) : base()

Contributor

Suggested change
- public ArrowStreamReaderImplementation(Stream stream, MemoryAllocator allocator, bool leaveOpen) : base()
+ public ArrowStreamReaderImplementation(Stream stream, MemoryAllocator allocator, bool leaveOpen)
  1. Needless extra line.
  2. No need for the explicit call to base.

Contributor Author

I fixed it.

}
}

protected void ReadInitialDictionaries()

Contributor

Suggested change
- protected void ReadInitialDictionaries()
+ private void ReadInitialDictionaries()

Contributor Author

I fixed it.

Comment thread csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs
childFields[i] = FieldFromFlatbuffer(childFlatbufField.Value, lazyDictionaryMemo);
}

Flatbuf.DictionaryEncoding? de = flatbufField.Dictionary;

Contributor

(nit) Can you spell out de? It doesn't make a great reading experience to figure out what de means.

Suggested change
- Flatbuf.DictionaryEncoding? de = flatbufField.Dictionary;
+ Flatbuf.DictionaryEncoding? dictionaryEncoding = flatbufField.Dictionary;

Contributor Author

I fixed it.

Flatbuf.Int? indexTypeAsInt = de.Value.IndexType;
if (!indexTypeAsInt.HasValue)
{
throw new InvalidDataException("Dictionary type not defined");

Contributor

Would this be a more correct message?

Suggested change
- throw new InvalidDataException("Dictionary type not defined");
+ throw new InvalidDataException("Dictionary IndexType not defined");

Contributor Author

I fixed it.

}

- internal static Schema GetSchema(Flatbuf.Schema schema)
+ internal static Schema GetSchema(Flatbuf.Schema schema, LazyCreator<DictionaryMemo> lazyDictionaryMemo)

Contributor

Along with the comment about removing LazyCreator, this could be changed to:

Suggested change
- internal static Schema GetSchema(Flatbuf.Schema schema, LazyCreator<DictionaryMemo> lazyDictionaryMemo)
+ internal static Schema GetSchema(Flatbuf.Schema schema, ref DictionaryMemo dictionaryMemo)

Then, if this code needs to use the dictionaryMemo and it is null, it gets created on demand and assigned back through the ref.

Contributor Author

I fixed it.

{
- public static RecordBatch CreateSampleRecordBatch(int length)
+ //TODO: Remove the createDictionaryArray argument after all writer/reader supports DictionaryType serialization
+ public static RecordBatch CreateSampleRecordBatch(int length, bool createDictionaryArray = false)

Contributor

A general comment about createDictionaryArray - I think the existing tests should be left alone. That way we are still testing as much as we can without any dictionaries. Then we create new tests for dictionary support.

An easy way to do this is to change a bunch of tests from [Fact] to [Theory] and pass both true and false into the test for bool createDictionaryArray. That way the test runs twice, once with a dictionary and once without. That should give us the best test coverage.

Contributor Author

I understand.
I changed the tests to use [Theory].

- Fix access modifiers
- Fix tests
- Change the way of the lazy creation for dictionaryMemo

@HashidaTKS HashidaTKS left a comment

Contributor Author

@eerhardt
Thank you for the review!
I have addressed the points you raised.

  • Fix access modifiers and the code format
    • Sorry for my carelessness...
  • Fix tests
  • Change the way of the lazy creation for dictionaryMemo

Comment thread csharp/src/Apache.Arrow/Arrays/DictionaryArray.cs
public static class TestData
{
- public static RecordBatch CreateSampleRecordBatch(int length)
+ //TODO: Remove the createDictionaryArray argument after all writer/reader supports DictionaryType serialization

Contributor Author

I was planning to always create a dictionary array once ArrowFileWriter/Reader support DictionaryType serialization.
However, considering your feedback, I now think it is better to keep createDictionaryArray even after that.
I will remove this TODO.

public static class TestData
{
- public static RecordBatch CreateSampleRecordBatch(int length)
+ //TODO: Remove the createDictionaryArray argument after all writer/reader supports DictionaryType serialization

Contributor Author

I removed the TODO comment.

public void Ctor_LeaveOpenDefault_StreamClosedOnDispose()
{
- RecordBatch originalBatch = TestData.CreateSampleRecordBatch(length: 100);
+ RecordBatch originalBatch = TestData.CreateSampleRecordBatch(length: 100, createDictionaryArray: true);

Contributor Author

I fixed it.

- change the place of the HasCreatedDictionaryMemo property
- Remove needless extra lines
@HashidaTKS HashidaTKS requested a review from eerhardt July 8, 2021 14:48
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowReaderImplementation.cs Outdated
HasReadInitialDictionary = true;
}

private async ValueTask ReadInitialDictionariesAsync(CancellationToken cancellationToken)

Contributor

(nit) the other methods in this file have the Async method first, and then the synchronous method after it. Can this go above ReadInitialDictionaries()?

Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs Outdated

ReadInitialDictionaries();

return ReadArrowObject();

Contributor

The spec says

DictionaryBatch and RecordBatch messages may be interleaved

This appears to only support DictionaryBatches above RecordBatches. If we read all the "initial" dictionaries, and then read a RecordBatch, the next time ReadRecordBatch() is called, if there is a DictionaryBatch, null will be returned. Is that intentional?
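
To illustrate the interleaving concern in isolation, here is a minimal Python sketch of a reader loop that applies dictionary messages whenever they appear, rather than only before the first record batch. The tuple-based message shape and the name read_record_batches are hypothetical; this is a toy model under those assumptions, not the Arrow IPC format and not the C# reader's API.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical message shape: (kind, dictionary_id, payload).
# "dictionary" payloads hold dictionary values; "record" payloads hold indices.
Message = Tuple[str, int, list]


def read_record_batches(messages: Iterable[Message]) -> Iterator[List[Optional[str]]]:
    """Yield decoded record batches, applying dictionary messages as they arrive."""
    dictionaries: Dict[int, List[str]] = {}
    for kind, dictionary_id, payload in messages:
        if kind == "dictionary":
            # A later dictionary with the same id replaces the earlier one
            # (delta dictionaries are deliberately not modeled here).
            dictionaries[dictionary_id] = list(payload)
        elif kind == "record":
            values = dictionaries[dictionary_id]
            yield [None if index is None else values[index] for index in payload]
        else:
            raise ValueError(f"unknown message kind: {kind!r}")


stream = [
    ("dictionary", 0, ["foo"]),
    ("record", 0, [0, 0, None]),
    ("dictionary", 0, ["foo", "bar"]),  # interleaved after a record batch
    ("record", 0, [0, 1, 0]),
]
decoded = list(read_record_batches(stream))
```

Because dictionary messages are handled inside the same loop as record batches, the third message simply updates the memo and the loop continues to the next record batch instead of stopping.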

Contributor Author

Sorry, it slipped my mind. That is not intentional.
I modified the reader so that it can read interleaved dictionaries.

During the modification, I realized that ReadInitialDictionariesAsync and ReadInitialDictionaries are unnecessary, so I removed them.

Related to this, ArrowStreamWriter does not support writing interleaved dictionaries for now.
I would like to implement it in another PR because this PR will become too large.
How is that?

Contributor Author

I tested reading interleaved dictionaries by creating a test file with python and reading the test file with C#, and it seemed to work fine.
The code for creating the test file is below; it is based on the test code of the Python implementation.

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L417

import pyarrow as pa

# Dictionary-encoded string column with int8 indices.
ty = pa.dictionary(pa.int8(), pa.utf8())
# Each batch uses a different set of values, so the dictionary changes between batches.
data = [
    ["foo", "foo", None],
    ["foo", "bar", "foo"],
    ["foo", "bar"],
    ["foo", None, "bar", "quux"],
    ["bar", "quux"],
]
batches = [
    pa.RecordBatch.from_arrays([pa.array(v, type=ty)], names=["dicts"])
    for v in data
]
schema = batches[0].schema


def write_batches():
    # Write every batch to a single stream file for the C# reader to consume.
    with pa.RecordBatchStreamWriter("./dictionary_batch_test.batch", schema=schema) as writer:
        for batch in batches:
            writer.write_batch(batch)


write_batches()

We should add C# tests for this after ArrowStreamWriter supports writing interleaved dictionaries.

Contributor

I would like to implement it in another PR because this PR will become too large.
How is that?

That is perfectly reasonable to me. I think the current functionality is sufficient for the first PR.

Contributor

I tested reading interleaved dictionaries by creating a test file with python

In my spare time, I've been trying to set up the integration tests, so we can test the C# implementation against the other languages. Hopefully I can get the initial PR for that up soon.

Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs
- Avoid a needless allocation
- Remove a needless ctor
- Support reading the replacement dictionaries
- Add tests for writing and reading dictionaries used in NestedType arrays
- Fix a bug when reading dictionaries used in NestedType arrays
- Regard indexType as signed int32 if it is null
- DictionaryArray with children
  - Add a test for this
  - Fix a serialization bug about this

@HashidaTKS HashidaTKS left a comment

Contributor Author

@eerhardt
Thank you!

I have addressed the points you raised.

  • Avoid needless allocation
  • Remove a needless ctor
  • Support reading the replacement dictionaries
  • Add tests for writing and reading dictionaries used in NestedType arrays
  • Add tests for writing and reading ListType dictionaries
  • Fix a bug when reading dictionaries used in NestedType arrays
  • Fix a bug when writing ListType dictionaries
  • Regard indexType as signed int32 if it is null


Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/MessageSerializer.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs

@HashidaTKS HashidaTKS requested a review from eerhardt July 13, 2021 00:42
@HashidaTKS

Contributor Author

Cc: @eerhardt

Would you please re-review this when you have time?

Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs Outdated
Comment thread csharp/src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs Outdated

@eerhardt eerhardt left a comment

Contributor

I just had a couple very minor comments. Once those are addressed, I think we can merge this.

Thank you for the great work here, @HashidaTKS! And thank you for your patience (I had been busy).

@HashidaTKS

Contributor Author

@eerhardt
I have addressed the issues raised in the comments.

I appreciate your review and support!

- do-while -> while