[SPARK-21046][SQL] simplify the array offset and length in ColumnVector#18260
[SPARK-21046][SQL] simplify the array offset and length in ColumnVector#18260cloud-fan wants to merge 7 commits into
Conversation
| if (this.arrayLengths != null) { | ||
| System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity); | ||
| System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); | ||
| // need 2 ints as offset and length for each array. |
There was a problem hiding this comment.
It would be good to add a comment regarding that intData[] is used for other purpose compared to the original intention (i.e. storing int column data).
| * Returns the offset of the array at rowid. | ||
| */ | ||
| public abstract int getArrayOffset(int rowId); | ||
| public void arrayWriteEnd(int rowId, int offset, int length) { |
There was a problem hiding this comment.
I assumed that there is arrayWriteStart since we newly created arrayWriteEnd.
After reading a comment, I understood your intention. Is there any other method name (e.g. arraySetColumn)?
There was a problem hiding this comment.
no there is no arrayWriteStart.
Maybe I should pick another name, how about putArrayOffsetAndLength or just keep the original name?
|
Test build #77874 has finished for PR 18260 at commit
|
| // The data stored in these two allocations need to maintain binary compatible. We can | ||
| // directly pass this buffer to external components. | ||
| private long nulls; | ||
| // The actually data of this column vector will be store here. If it's an array column vector, |
There was a problem hiding this comment.
nit: will be store -> will be stored ?
| // Array for each type. Only 1 is populated for any type. | ||
| private byte[] byteData; | ||
| private short[] shortData; | ||
| // This is not used used to store data for int column vector, but also can store offsets and |
There was a problem hiding this comment.
nit: This is not used used to store -> This is not used to store ?
| Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4); | ||
| // need 2 ints as offset and length for each array. | ||
| this.data = Platform.reallocateMemory(data, oldCapacity * 8, newCapacity * 8); | ||
| putInt(0, 0); |
There was a problem hiding this comment.
actually we don't, as the new memory region should be filled with 0. This is just a safe guard to be more robust about the underlying memory allocation details.
There was a problem hiding this comment.
I had second thought that putArrayOffsetAndLength(0,0,0) is better for the purpose of the guard?
| } | ||
| arrayLengths = newLengths; | ||
| arrayOffsets = newOffsets; | ||
| putInt(0, 0); |
| // need 2 ints as offset and length for each array. | ||
| if (intData == null || intData.length < newCapacity * 2) { | ||
| int[] newData = new int[newCapacity * 2]; | ||
| if (intData != null) System.arraycopy(intData, 0, newData, 0, capacity * 2); |
There was a problem hiding this comment.
I guess we should pass intData.length instead of capacity * 2 ?
nvm: here intData.length should be capacity * 2.
| column.putArray(3, 3, 3) | ||
| column.putArrayOffsetAndLength(0, 0, 1) | ||
| column.putArrayOffsetAndLength(1, 1, 2) | ||
| column.putArrayOffsetAndLength(2, 2, 0) |
There was a problem hiding this comment.
This should be column.putArrayOffsetAndLength(2, 3, 0) ?
|
LGTM except for one comment, pending Jenkins. |
| Platform.reallocateMemory(lengthData, oldCapacity * 4, newCapacity * 4); | ||
| this.offsetData = | ||
| Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4); | ||
| // need 2 ints as offset and length for each array. |
There was a problem hiding this comment.
nit: for each array -> for each array element.
There was a problem hiding this comment.
Oh. This's quite a bit ambiguous. Nvm. for each array is good.
|
Test build #77902 has finished for PR 18260 at commit
|
|
LGTM except for one question regarding array capacity after this change. |
| private short[] shortData; | ||
| // This is not only used to store data for int column vector, but also can store offsets and | ||
| // lengths for array column vector. | ||
| private int[] intData; |
There was a problem hiding this comment.
One question I just have is, the capacity of ColumnVector is bound by MAX_CAPACITY. Previously we store offset and length individually, so we can have MAX_CAPACITY arrays at most. Now we store offset and length together in data/intData which is bound to MAX_CAPACITY, doesn't it say we can just have MAX_CAPACITY / 2 arrays at most?
There was a problem hiding this comment.
Good catch. Is it possible to use longData, which has a pair of 32-bit offset and length, to keep MAX_CAPACITY array length?
There was a problem hiding this comment.
Oh. I see. We only check the limit of MAX_CAPACITY before actually going into reserveInternal.
Even we pass this check, we still face problem when allocating intData, see the comment below.
There was a problem hiding this comment.
@kiszk Do you meant we store a pair of offset/length together as an element in longData?
|
Test build #77903 has finished for PR 18260 at commit
|
| System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); | ||
| // need 2 ints as offset and length for each array. | ||
| if (intData == null || intData.length < newCapacity * 2) { | ||
| int[] newData = new int[newCapacity * 2]; |
There was a problem hiding this comment.
newCapacity here can be MAX_CAPACITY at most. When newCapacity is more than MAX_CAPACITY / 2, seems this allocation would cause problem?
|
Test build #77900 has finished for PR 18260 at commit
|
|
Test build #77905 has finished for PR 18260 at commit
|
| private byte[] byteData; | ||
| private short[] shortData; | ||
| private int[] intData; | ||
| // This is not only used to store data for int column vector, but also can store offsets and |
There was a problem hiding this comment.
int column vector -> long column vector.
| */ | ||
| public abstract int getArrayOffset(int rowId); | ||
| public void putArrayOffsetAndSize(int rowId, int offset, int size) { | ||
| long offsetAndSize = (offset << 32) | size; |
There was a problem hiding this comment.
offset should be converted to long before shifting?
|
LGTM except for two comments. |
|
Test build #77912 has finished for PR 18260 at commit
|
|
Test build #77933 has finished for PR 18260 at commit
|
|
|
||
| reference.zipWithIndex.foreach { v => | ||
| assert(v._1.length == column.getArrayLength(v._2), "MemoryMode=" + memMode) | ||
| assert(v._1.length == column.getInt(v._2 * 2 + 1), "MemoryMode=" + memMode) |
|
Test build #77936 has finished for PR 18260 at commit
|
|
The spark R test failure is not related, I'm merging it to master, thanks for your review! |
|
Why are we doing this? Isn't it better potentially for compression to store them separately? We can also easily remove the offset for fixed length arrays. |
|
Sorry I missed this part. Currently we don't externalize |
## What changes were proposed in this pull request? Currently when a `ColumnVector` stores array type elements, we will use 2 arrays for lengths and offsets and implement them individually in on-heap and off-heap column vector. In this PR, we use one array to represent both offsets and lengths, so that we can treat it as `ColumnVector` and all the logic can go to the base class `ColumnVector` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#18260 from cloud-fan/put.
What changes were proposed in this pull request?
Currently when a
ColumnVectorstores array type elements, we will use 2 arrays for lengths and offsets and implement them individually in on-heap and off-heap column vector.In this PR, we use one array to represent both offsets and lengths, so that we can treat it as
ColumnVectorand all the logic can go to the base classColumnVectorHow was this patch tested?
existing tests.