-
Notifications
You must be signed in to change notification settings - Fork 4.2k
ARROW-9010: [Java] Framework and interface changes for RecordBatch IPC buffer compression #7326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
c57edb0
ARROW-9010: [Java] Framework and interface changes for RecordBatch IP…
liyafan82 48d5365
ARROW-9010: [Java] Make compression body optional
liyafan82 75bfb52
ARROW-9010: [Java] Revise the implementation of ArrowCompressionBody
liyafan82 c2d14e6
ARROW-9010: [Java] Simplify the compression codec interface
liyafan82 a842ce6
ARROW-9010: [Java] Provide default compression codec
liyafan82 bb9fab5
ARROW-9010: [Java] Resolve comments
liyafan82 5573352
ARROW-9010: [Java] Stop using no-compression decoder during IPC
liyafan82 a1a0800
ARROW-9010: [Java] Continue to use no-compression decoder outside IPC
liyafan82 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
51 changes: 51 additions & 0 deletions
51
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionCodec.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.arrow.vector.compression; | ||
|
|
||
| import org.apache.arrow.memory.ArrowBuf; | ||
| import org.apache.arrow.memory.BufferAllocator; | ||
|
|
||
| /** | ||
| * The codec for compression/decompression. | ||
| */ | ||
| public interface CompressionCodec { | ||
|
|
||
| /** | ||
| * Compress a buffer. | ||
| * @param allocator the allocator for allocating memory for compressed buffer. | ||
| * @param unCompressedBuffer the buffer to compress. | ||
| * Implementation of this method should take care of releasing this buffer. | ||
| * @return the compressed buffer. | ||
| */ | ||
| ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer); | ||
|
|
||
| /** | ||
| * Decompress a buffer. | ||
| * @param allocator the allocator for allocating memory for decompressed buffer. | ||
| * @param compressedBuffer the buffer to be decompressed. | ||
| * Implementation of this method should take care of releasing this buffer. | ||
| * @return the decompressed buffer. | ||
| */ | ||
| ArrowBuf decompress(BufferAllocator allocator, ArrowBuf compressedBuffer); | ||
|
|
||
| /** | ||
| * Gets the name of the codec. | ||
| * @return the name of the codec. | ||
| */ | ||
| String getCodecName(); | ||
| } | ||
60 changes: 60 additions & 0 deletions
60
java/vector/src/main/java/org/apache/arrow/vector/compression/CompressionUtil.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.arrow.vector.compression; | ||
|
|
||
| import org.apache.arrow.flatbuf.BodyCompressionMethod; | ||
| import org.apache.arrow.flatbuf.CompressionType; | ||
| import org.apache.arrow.vector.ipc.message.ArrowBodyCompression; | ||
|
|
||
| /** | ||
| * Utilities for data compression/decompression. | ||
| */ | ||
| public class CompressionUtil { | ||
|
|
||
| private CompressionUtil() { | ||
| } | ||
|
|
||
| /** | ||
| * Creates the {@link ArrowBodyCompression} object, given the {@link CompressionCodec}. | ||
| * The implementation of this method should depend on the values of {@link CompressionType#names}. | ||
| */ | ||
| public static ArrowBodyCompression createBodyCompression(CompressionCodec codec) { | ||
| switch (codec.getCodecName()) { | ||
| case "default": | ||
| return NoCompressionCodec.DEFAULT_BODY_COMPRESSION; | ||
| case "LZ4_FRAME": | ||
| return new ArrowBodyCompression(CompressionType.LZ4_FRAME, BodyCompressionMethod.BUFFER); | ||
| case "ZSTD": | ||
| return new ArrowBodyCompression(CompressionType.ZSTD, BodyCompressionMethod.BUFFER); | ||
| default: | ||
| throw new IllegalArgumentException("Unknown codec: " + codec.getCodecName()); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Creates the {@link CompressionCodec} given the compression type. | ||
| */ | ||
| public static CompressionCodec createCodec(byte compressionType) { | ||
| switch (compressionType) { | ||
| case NoCompressionCodec.COMPRESSION_TYPE: | ||
| return NoCompressionCodec.INSTANCE; | ||
| default: | ||
| throw new IllegalArgumentException("Compression type not supported: " + compressionType); | ||
| } | ||
| } | ||
| } |
54 changes: 54 additions & 0 deletions
54
java/vector/src/main/java/org/apache/arrow/vector/compression/NoCompressionCodec.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.arrow.vector.compression; | ||
|
|
||
| import org.apache.arrow.flatbuf.BodyCompressionMethod; | ||
| import org.apache.arrow.memory.ArrowBuf; | ||
| import org.apache.arrow.memory.BufferAllocator; | ||
| import org.apache.arrow.vector.ipc.message.ArrowBodyCompression; | ||
|
|
||
| /** | ||
| * The default compression codec that does no compression. | ||
| */ | ||
| public class NoCompressionCodec implements CompressionCodec { | ||
|
|
||
| public static final NoCompressionCodec INSTANCE = new NoCompressionCodec(); | ||
|
|
||
| public static final byte COMPRESSION_TYPE = -1; | ||
|
|
||
| public static final ArrowBodyCompression DEFAULT_BODY_COMPRESSION = | ||
| new ArrowBodyCompression(COMPRESSION_TYPE, BodyCompressionMethod.BUFFER); | ||
|
|
||
| private NoCompressionCodec() { | ||
| } | ||
|
|
||
| @Override | ||
| public ArrowBuf compress(BufferAllocator allocator, ArrowBuf unCompressedBuffer) { | ||
| return unCompressedBuffer; | ||
| } | ||
|
|
||
| @Override | ||
| public ArrowBuf decompress(BufferAllocator allocator, ArrowBuf compressedBuffer) { | ||
| return compressedBuffer; | ||
| } | ||
|
|
||
| @Override | ||
| public String getCodecName() { | ||
| return "default"; | ||
| } | ||
| } |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thinking through this interface, something occurred to me. Given the way the current Message serialization code works is it forces a copy from an underlying channel. Given this, it seem like we would now have two copy like operations:
I was wondering what you thought about deferring compression until writing out the batch and doing decompression eagerly when reading in the batch to avoid one of these copies. I think one thing that would potentially help us decide on the course to take would be look at the possible libraries we would use for compression/decompression and whether they support off-heap, InputStream or some other interface for accomplishing the compression
@liyafan82 have you done more research in this area?
@rymurr do you have anything thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Thank you for starting this discussion and sharing your good ideas.
Your reasoning makes sense to me.
I guess I was looking at the problem from a different perspective.
IMO, the bottleneck of a compressing codec is the CPU resource, and the main purpose of compressing is to reduce memory/network bandwidth consumption.
Given the above assumptions, we should try to do the compression as early as possible. The earliest possible place should be in the
getFieldBuffersmethod. In this PR, we do it inVectorUnLoader, which is not the best, but close enough to the best. Similarly, we should try to do the decompression as late as possible. In this PR, we do it inVectorLoader, which is close to the optimal.Admittedly, we have additional copies after introducing the compression framework. However, both additional copies are based on the compressed data, with reduced data size, so the overhead should be small.
The above reasoning is based on the assumption that the compression codec could effectively reduce the data size, which is not always true in practice. So I think we can make the decision based on the specific compression codec, and real benchmark data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liyafan82 good points. I think the other use-case I didn't consider is Flight which I think will already be in off-heap memory.
In terms of copies, the should be smaller in general, but it would be good to to understand if there are any limitations. It looks like at least airlift supports off-heap decompression, so I think this API is probably OK from that perspective.
@rymurr did you have any other comments otherwise I can take another look and then I think we can merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emkornfield Do you think we can merge this now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think so. i'll merge