[BEAM-10292] DefaultFilenamePolicy.ParamsCoder is not able to decode directory on the local file system#12050
Conversation
|
R: @dmvk |
dmvk
left a comment
There was a problem hiding this comment.
This makes total sense, nice catch 👍
Unfortunatelly backward incompatible changes of coders are blockers for upgrades of streaming pipelines. Beam currently doesn't have any mechanism for coder versioning, so this may be tricky to implement correctly.
In this case, it may be possible to simply append a boolean coder and use it only when more bytes are available for consuption, thus making it backward compatible, eg.:
ResourceId prefix =
FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));
String shardTemplate = stringCoder.decode(inStream);
String suffix = stringCoder.decode(inStream);
boolean isDirectory = false; // or whatever the default is
if (inStream.available() > 0) {
isDirectory = booleanCoder.decode(inStream);
}Thanks for the contribution!
@lukecwik any thoughts about this one?
|
Thanks for the review @dmvk ! You're right, I forgot to ensure backward compatibility. I tried to implement the requested changes, will you take a look at it please? |
|
Run Java PreCommit |
There was a problem hiding this comment.
Would it be possible to add test case for bw compatible code path?
There was a problem hiding this comment.
ok, I added the requested test for bw compatibility
btw you can see the buggy behavior there (lost information whether baseFilename is file/directory)
There was a problem hiding this comment.
convertToFileResourceIfPossible attempts to match a file and if that fails attempts to match a directory.
Is the issue that the underlying filesystem erroneously says something is a file when really it is a directory?
There was a problem hiding this comment.
Exactly, that's the issue, for example LocalFileSystem matches path /tmp/dirA/ as file..
How about something like this?
public static class ParamsCoder extends AtomicCoder<Params> {
private static final ParamsCoder INSTANCE = new ParamsCoder();
private static final Coder<String> STRING_CODER = StringUtf8Coder.of();
public static ParamsCoder of() {
return INSTANCE;
}
@Override
public void encode(Params value, OutputStream outStream) throws IOException {
if (value == null) {
throw new CoderException("cannot encode a null value");
}
ResourceId baseFilename = value.getBaseFilename().get();
String baseFilenameString =
(baseFilename.isDirectory() && !baseFilename.toString().endsWith("/"))
? baseFilename.toString() + "/"
: baseFilename.toString();
STRING_CODER.encode(baseFilenameString, outStream);
STRING_CODER.encode(value.getShardTemplate(), outStream);
STRING_CODER.encode(value.getSuffix(), outStream);
}
@Override
public Params decode(InputStream inStream) throws IOException {
String baseFilenameString = STRING_CODER.decode(inStream);
ResourceId prefix =
FileSystems.matchNewResource(baseFilenameString, baseFilenameString.endsWith("/"));
String shardTemplate = STRING_CODER.decode(inStream);
String suffix = STRING_CODER.decode(inStream);
return new Params()
.withBaseFilename(prefix)
.withShardTemplate(shardTemplate)
.withSuffix(suffix);
}
}But I'm not sure how to handle the suffix (/) to ensure compatibility among (file) systems
There was a problem hiding this comment.
I think that there are a couple of options here:
- Take the breaking change to the coder because it is a fix, make sure it is documented in the release notes
- Try to fix the underlying filesystem to do a better job of file/dir matching
- Deprecate this filename policy, create a new one (DefaultFilenamePolicy2) and tell people to use it in new code.
I'm for 1 but would ask for consensus on the mailing list.
Also, any / hacking will make things worse since different file systems use different path separator characters (e.g linux vs windows)
There was a problem hiding this comment.
OK, I agree with the first solution - taking the breaking change. I will ask on the mailing list.
BTW I did workaround in my project where I use only files (not directories) as Params base filenames and then everything works fine. That means I no longer needs this fix but it still seems to me like a bug which is worth fixing.
(P.S.: The suffix wouldn't have to be the path separator (/) but e.g. sometotallyrandomstring123 - but it's very ugly...)
There was a problem hiding this comment.
There was a problem hiding this comment.
Hi @lukecwik, finally after the discussion in the mailing list I chose the second proposed option and fixed the underlying filesystem. Turns out that only LocalFileSystem is broken, on the other hand e.g. HDFS and S3 look fine and already have this kind of check.
I still think that the proper solution should be to use the ResourceIdCoder instead of StringUtf8Coder but I understand the consequences.
lukecwik
left a comment
There was a problem hiding this comment.
This makes total sense, nice catch 👍
Unfortunatelly backward incompatible changes of coders are blockers for upgrades of streaming pipelines. Beam currently doesn't have any mechanism for coder versioning, so this may be tricky to implement correctly.
In this case, it may be possible to simply append a boolean coder and use it only when more bytes are available for consuption, thus making it backward compatible, eg.:
ResourceId prefix = FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream)); String shardTemplate = stringCoder.decode(inStream); String suffix = stringCoder.decode(inStream); boolean isDirectory = false; // or whatever the default is if (inStream.available() > 0) { isDirectory = booleanCoder.decode(inStream); }Thanks for the contribution!
@lukecwik any thoughts about this one?
This won't work since you can have multiple encoded Params concatentated together on the input stream so you'll read the first byte of the next Params with this change.
There was a problem hiding this comment.
convertToFileResourceIfPossible attempts to match a file and if that fails attempts to match a directory.
Is the issue that the underlying filesystem erroneously says something is a file when really it is a directory?
3f4aa11 to
7b440f2
Compare
…gumentException when the first 'singleResourceSpec' parameter ends with filesystem separator and the second provided parameter indicates that it isn't directory. Fixed due to DefaultFilenamePolicy.ParamsCoder was not able to decode directory on the local file system because it calls FileBasedSink.convertToFileResourceIfPossible method which internally calls FileSystem.matchNewResource method. Other file systems (HDFS, S3) already have this kind of check.
lukecwik
left a comment
There was a problem hiding this comment.
Please perform the one fix up and then this is mergeable.
| return constructed.toString(); | ||
| } | ||
|
|
||
| private static DefaultFilenamePolicy.Params encodeDecodeParams( |
There was a problem hiding this comment.
thanks for the advice 👍
|
|
||
| @Override | ||
| protected LocalResourceId matchNewResource(String singleResourceSpec, boolean isDirectory) { | ||
| if (singleResourceSpec.endsWith(File.separator) && !isDirectory) { |
There was a problem hiding this comment.
Do you want to add the file separator if it doesn't exist if isDirectory == true. This will help make it less error prone for users. Similar to GcsFileSystem.
There was a problem hiding this comment.
sounds good, I added it
lukecwik
left a comment
There was a problem hiding this comment.
We should update LocalResourceId and drop the isDirectory field
|
|
||
| @Override | ||
| protected LocalResourceId matchNewResource(String singleResourceSpec, boolean isDirectory) { | ||
| if (singleResourceSpec.endsWith(File.separator) && !isDirectory) { |
There was a problem hiding this comment.
What if the character is escaped and being used in the name component of the path string?
There was a problem hiding this comment.
Is it even possible? For example according to https://stackoverflow.com/questions/9847288/is-it-possible-to-use-in-a-filename it's not. Or do you mean something different?
There was a problem hiding this comment.
Your right, it seems as though Linux doesn't allow /, Mac OS X doesn't allow :, Windows doesn't allow \.
- adding file separator if it doesn't exist if isDirectory == true in LocalFileSystem.matchNewResource - dropped isDirectory field in LocalResourceId - using CoderUtils.clone method in DefaultFilenamePolicyTest.testParamsCoder test
I removed the |
Makes sense since LocalResourceId constructor adds |
|
@dmvk The backwards compat issues have been addressed. Will merge, feel free to add additional comments if there was something that was not addressed. |
DefaultFilenamePolicy.ParamsCoderusedStringUtf8Coderfor encoding/decodingbaseFilenameand therefore information whetherbaseFilename resourceIDis file or directory was lost. UsingResourceIdCoderinstead fixes this issue.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.