HADOOP-17124. Support LZO Codec using aircompressor#2159
Conversation
| * limitations under the License. | ||
| */ | ||
|
|
||
| package com.hadoop.compression.lzo; |
There was a problem hiding this comment.
hmm, why we need this bridging class in Hadoop repo while the class is from hadoop-lzo library?
There was a problem hiding this comment.
Historically, Hadoop included LZO until it was removed in HADOOP-4874 due to GPL licensing concern. Then the GPL LZO codec was maintained as a separate project in https://github.com/twitter/hadoop-lzo with new codec class com.hadoop.compression.lzo.LzoCodec. In Hadoop's sequence file, the first couple bytes of file includes the class of the compression codec used when writing, and hadoop uses this information to pick up the right codec to read the data. As a result, we need to bridge it in order to enable Hadoop to read the old LZO compressed data.
There was a problem hiding this comment.
@steveloughran this might answer why Not sure why the com.hadoop classes are there at all..
There was a problem hiding this comment.
OK. In that case the classes should have the reason explained, but tag the new classes as Deprecated
| @Override | ||
| public Class<? extends Compressor> getCompressorType() | ||
| { | ||
| return LzopCodec.HadoopLzopCompressor.class; |
There was a problem hiding this comment.
I see that createCompressor returns HadoopLzoCompressor(). Should we keep these two in sync?
There was a problem hiding this comment.
Good catch. Ideally yes, but HadoopLzoCompressor is private in airlift, so I can not easily do it. I added a new override
@Override
public Compressor createCompressor()
{
return new HadoopLzopCompressor();
}
to keep them in sync, and I will try to work with aircompressor to move this to their codebase.
steveloughran
left a comment
There was a problem hiding this comment.
This is a bit beyond my area of expertise. All I know about codecs tends to be related to stack traces
We cannot add any new jars to the classpath of hadoop-common unless they are absolutely essential. See #1948 as an example of the problems we've hit there.
This means that
- the codec needs to use reflection to load classes; fail with some useful message (or at least the docs mention it in troubleshooting)
- or compile time import of aircompressor, tagging as
provided, and then requiring people who want to use this codec to add the JAR and any (documented) dependencies.
Other than this
- to use a codec you have to especially ask for it -correct? Which means that: people need to know of it. Wherever codecs are documented this will have to be added.
- Not sure why the com.hadoop classes are there at all.
| <artifactId>jna</artifactId> | ||
| <version>${jna.version}</version> | ||
| </dependency> | ||
| <dependency> |
There was a problem hiding this comment.
does this add any transient dependencies?
There was a problem hiding this comment.
Looks okay to me https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
No compile time dependency. A shaded Hadoop 2 jar in provided dependency.
What's tricky is it has test scope dependency on jmh-core, which is GPL 2.0.
There was a problem hiding this comment.
Since we don't depend on the test scope, I think we should be fine.
There was a problem hiding this comment.
@steveloughran we would like to actually bundle this jar into common since this is a very clean jar and used in many projects such as presto and orc to provide their compression codecs.
In fact, we would like to have snappy to fallback to aircompressor if no native lib is provided. For many of our devs, it's non-trivial to setup native libs in their development env since it requires to compile many native libs from sources and install them into LD path. If we can fallback to pure java snappy implementation in aircompressor, it will make the developers' life a way easier.
There was a problem hiding this comment.
My point was that aircompressor has non-optional dependency on GPL artifact and thus making it GPL as well.
That said, I just verified mjh-core is not in the Hadoop dependency after the patch.
We should talk to aircompressor developers to clarify the license issue.
There was a problem hiding this comment.
I would like to mention that other apache projects such as Presto and Spark all include this as dep, so would be great to get the clarification since it's heavily used in Apache community..
| import org.apache.hadoop.io.compress.CompressionOutputStream; | ||
| import org.apache.hadoop.io.compress.Compressor; | ||
|
|
||
| import org.apache.commons.logging.Log; |
| public class LzoCodec extends org.apache.hadoop.io.compress.LzoCodec { | ||
| private static final Log LOG = LogFactory.getLog(LzoCodec.class); | ||
|
|
||
| static final String gplLzoCodec = LzoCodec.class.getName(); |
| @Override | ||
| public CompressionOutputStream createOutputStream(OutputStream out, | ||
| Compressor compressor) throws IOException { | ||
| if (!warned) { |
There was a problem hiding this comment.
there's a risk of >1 warning in a multithread env, but not something I'm worried about. AtomicBoolean would be the purist way to do it.
| */ | ||
| @DoNotPool | ||
| static class HadoopLzopCompressor | ||
| implements Compressor |
There was a problem hiding this comment.
K&R / java { placement please.
There was a problem hiding this comment.
Done. I'm using intelji with default java formatter. Does hadoop provide a codestyle formatter that Intelji can use? Thanks.
| * limitations under the License. | ||
| */ | ||
|
|
||
| package com.hadoop.compression.lzo; |
There was a problem hiding this comment.
I think this is to offer a bridge for those who are using hadoop-lzo library.
| } | ||
|
|
||
| /** | ||
| * No Hadoop code seems to actually use the compressor, so just return a dummy one so the createOutputStream method |
There was a problem hiding this comment.
does anyone know about downstream uses?
There was a problem hiding this comment.
I only know Presto uses Lzop codec. Maybe we can skip this for this PR and focs on LzoCodec @dbtsai?
There was a problem hiding this comment.
The test code is calling createCompressor without using the actual implementation. Without this, the test will not pass.
|
💔 -1 overall
This message was automatically generated. |
|
I I can see why adding this to hadoop-common appeals in terms of ease of redistribution, but I want it in its own package, such as hadoop-tools/hadoop-compression
There is a cost to this: you don't get it without changing your pom/sbt/gradle dependencies. We could think about including a stub hadoop-compression module even in those releases which don't get a backport. There is a variant option: add it to hadoop-extras, which has been around for a long time and currently it doesn't do much. Add it there and while downstream apps have to extend their pom, everything will build against older versions if they do that? The more I think about that the more I like it. |
|
cc @viirya @steveloughran We are thinking the same. For example, if we add |
|
hadoop-compression would be cleanest. I was looking @ hadoop-extras as it is already bundled everywhere, so if a project declares a dependency on it, it will still build against older releases -just lack the new codes |
|
OK, so what to do?
|
|
@steveloughran +1 on tools/hadoop-compression, and we will work on it. Since #2297 is almost ready to merge, is it okay we merge it first, and work on creating cc @viirya |
|
@dbtsai yes, lets do snappy first |
|
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
See https://issues.apache.org/jira/browse/HADOOP-17124 for details.