Skip to content

[BEAM-10292] DefaultFilenamePolicy.ParamsCoder is not able to decode directory on the local file system#12050

Merged
lukecwik merged 2 commits intoapache:masterfrom
davidak09:BEAM-10292
Aug 20, 2020
Merged

[BEAM-10292] DefaultFilenamePolicy.ParamsCoder is not able to decode directory on the local file system#12050
lukecwik merged 2 commits intoapache:masterfrom
davidak09:BEAM-10292

Conversation

@davidak09
Copy link
Copy Markdown
Contributor

@davidak09 davidak09 commented Jun 22, 2020

DefaultFilenamePolicy.ParamsCoder used StringUtf8Coder for encoding/decoding baseFilename and therefore information whether baseFilename resourceID is file or directory was lost. Using ResourceIdCoder instead fixes this issue.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark
Go Build Status --- Build Status --- Build Status
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status
XLang --- --- Build Status --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@davidak09
Copy link
Copy Markdown
Contributor Author

R: @dmvk

Copy link
Copy Markdown
Member

@dmvk dmvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes total sense, nice catch 👍

Unfortunatelly backward incompatible changes of coders are blockers for upgrades of streaming pipelines. Beam currently doesn't have any mechanism for coder versioning, so this may be tricky to implement correctly.

In this case, it may be possible to simply append a boolean coder and use it only when more bytes are available for consuption, thus making it backward compatible, eg.:

ResourceId prefix =
  FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));
String shardTemplate = stringCoder.decode(inStream);
String suffix = stringCoder.decode(inStream);
boolean isDirectory = false; // or whatever the default is
if (inStream.available() > 0) {
  isDirectory = booleanCoder.decode(inStream);
}

Thanks for the contribution!

@lukecwik any thoughts about this one?

@davidak09
Copy link
Copy Markdown
Contributor Author

Thanks for the review @dmvk !

You're right, I forgot to ensure backward compatibility. I tried to implement the requested changes, will you take a look at it please?

@dmvk
Copy link
Copy Markdown
Member

dmvk commented Jun 24, 2020

Run Java PreCommit

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add test case for bw compatible code path?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I added the requested test for bw compatibility

btw you can see the buggy behavior there (lost information whether baseFilename is file/directory)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertToFileResourceIfPossible attempts to match a file and if that fails attempts to match a directory.

Is the issue that the underlying filesystem erroneously says something is a file when really it is a directory?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, that's the issue, for example LocalFileSystem matches path /tmp/dirA/ as file..

How about something like this?

public static class ParamsCoder extends AtomicCoder<Params> {
    private static final ParamsCoder INSTANCE = new ParamsCoder();
    private static final Coder<String> STRING_CODER = StringUtf8Coder.of();

    public static ParamsCoder of() {
      return INSTANCE;
    }

    @Override
    public void encode(Params value, OutputStream outStream) throws IOException {
      if (value == null) {
        throw new CoderException("cannot encode a null value");
      }
      ResourceId baseFilename = value.getBaseFilename().get();
      String baseFilenameString =
          (baseFilename.isDirectory() && !baseFilename.toString().endsWith("/"))
              ? baseFilename.toString() + "/"
              : baseFilename.toString();
      STRING_CODER.encode(baseFilenameString, outStream);
      STRING_CODER.encode(value.getShardTemplate(), outStream);
      STRING_CODER.encode(value.getSuffix(), outStream);
    }

    @Override
    public Params decode(InputStream inStream) throws IOException {
      String baseFilenameString = STRING_CODER.decode(inStream);
      ResourceId prefix =
          FileSystems.matchNewResource(baseFilenameString, baseFilenameString.endsWith("/"));
      String shardTemplate = STRING_CODER.decode(inStream);
      String suffix = STRING_CODER.decode(inStream);
      return new Params()
          .withBaseFilename(prefix)
          .withShardTemplate(shardTemplate)
          .withSuffix(suffix);
    }
  }

But I'm not sure how to handle the suffix (/) to ensure compatibility among (file) systems

Copy link
Copy Markdown
Member

@lukecwik lukecwik Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there are a couple of options here:

  1. Take the breaking change to the coder because it is a fix, make sure it is documented in the release notes
  2. Try to fix the underlying filesystem to do a better job of file/dir matching
  3. Deprecate this filename policy, create a new one (DefaultFilenamePolicy2) and tell people to use it in new code.

I'm for 1 but would ask for consensus on the mailing list.

Also, any / hacking will make things worse since different file systems use different path separator characters (e.g linux vs windows)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I agree with the first solution - taking the breaking change. I will ask on the mailing list.

BTW I did workaround in my project where I use only files (not directories) as Params base filenames and then everything works fine. That means I no longer needs this fix but it still seems to me like a bug which is worth fixing.

(P.S.: The suffix wouldn't have to be the path separator (/) but e.g. sometotallyrandomstring123 - but it's very ugly...)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lukecwik, finally after the discussion in the mailing list I chose the second proposed option and fixed the underlying filesystem. Turns out that only LocalFileSystem is broken, on the other hand e.g. HDFS and S3 look fine and already have this kind of check.

I still think that the proper solution should be to use the ResourceIdCoder instead of StringUtf8Coder but I understand the consequences.

Copy link
Copy Markdown
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes total sense, nice catch 👍

Unfortunatelly backward incompatible changes of coders are blockers for upgrades of streaming pipelines. Beam currently doesn't have any mechanism for coder versioning, so this may be tricky to implement correctly.

In this case, it may be possible to simply append a boolean coder and use it only when more bytes are available for consuption, thus making it backward compatible, eg.:

ResourceId prefix =
  FileBasedSink.convertToFileResourceIfPossible(stringCoder.decode(inStream));
String shardTemplate = stringCoder.decode(inStream);
String suffix = stringCoder.decode(inStream);
boolean isDirectory = false; // or whatever the default is
if (inStream.available() > 0) {
  isDirectory = booleanCoder.decode(inStream);
}

Thanks for the contribution!

@lukecwik any thoughts about this one?

This won't work since you can have multiple encoded Params concatentated together on the input stream so you'll read the first byte of the next Params with this change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convertToFileResourceIfPossible attempts to match a file and if that fails attempts to match a directory.

Is the issue that the underlying filesystem erroneously says something is a file when really it is a directory?

@davidak09 davidak09 force-pushed the BEAM-10292 branch 2 times, most recently from 3f4aa11 to 7b440f2 Compare August 3, 2020 11:27
…gumentException when the first 'singleResourceSpec' parameter ends with filesystem separator and the second provided parameter indicates that it isn't directory.

Fixed due to DefaultFilenamePolicy.ParamsCoder was not able to decode directory on the local file system because it calls FileBasedSink.convertToFileResourceIfPossible method which internally calls FileSystem.matchNewResource method.
Other file systems (HDFS, S3) already have this kind of check.
@davidak09 davidak09 changed the title [BEAM-10292] DefaultFilenamePolicy.ParamsCoder uses ResourceIdCoder [BEAM-10292] DefaultFilenamePolicy.ParamsCoder is not able to decode directory on the local file system Aug 11, 2020
@davidak09 davidak09 requested review from dmvk and lukecwik August 11, 2020 13:17
Copy link
Copy Markdown
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please perform the one fix up and then this is mergeable.

return constructed.toString();
}

private static DefaultFilenamePolicy.Params encodeDecodeParams(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the advice 👍


@Override
protected LocalResourceId matchNewResource(String singleResourceSpec, boolean isDirectory) {
if (singleResourceSpec.endsWith(File.separator) && !isDirectory) {
Copy link
Copy Markdown
Member

@lukecwik lukecwik Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to add the file separator if it doesn't exist if isDirectory == true. This will help make it less error prone for users. Similar to GcsFileSystem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, I added it

Copy link
Copy Markdown
Member

@lukecwik lukecwik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update LocalResourceId and drop the isDirectory field


@Override
protected LocalResourceId matchNewResource(String singleResourceSpec, boolean isDirectory) {
if (singleResourceSpec.endsWith(File.separator) && !isDirectory) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the character is escaped and being used in the name component of the path string?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it even possible? For example according to https://stackoverflow.com/questions/9847288/is-it-possible-to-use-in-a-filename it's not. Or do you mean something different?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your right, it seems as though Linux doesn't allow /, Mac OS X doesn't allow :, Windows doesn't allow \.

- adding file separator if it doesn't exist if isDirectory == true in LocalFileSystem.matchNewResource
- dropped isDirectory field in LocalResourceId
- using CoderUtils.clone method in DefaultFilenamePolicyTest.testParamsCoder test
@davidak09
Copy link
Copy Markdown
Contributor Author

We should update LocalResourceId and drop the isDirectory field

I removed the isDirectory field, nevertheless I left isDirectory parameter in the constructor and the factory method untouched. WDYT?

@lukecwik
Copy link
Copy Markdown
Member

lukecwik commented Aug 20, 2020

We should update LocalResourceId and drop the isDirectory field

I removed the isDirectory field, nevertheless I left isDirectory parameter in the constructor and the factory method untouched. WDYT?

Makes sense since LocalResourceId constructor adds File.separator onto the path.

@lukecwik
Copy link
Copy Markdown
Member

@dmvk The backwards compat issues have been addressed. Will merge, feel free to add additional comments if there was something that was not addressed.

@lukecwik lukecwik merged commit d86eccd into apache:master Aug 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants