Skip to content

[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths#15379

Closed
BryanCutler wants to merge 1 commit into
apache:masterfrom
BryanCutler:sql-readtext-paths-SPARK-17805
Closed

[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths#15379
BryanCutler wants to merge 1 commit into
apache:masterfrom
BryanCutler:sql-readtext-paths-SPARK-17805

Conversation

@BryanCutler

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

If given a list of paths, pyspark.sql.readwriter.text will attempt to use an undefined variable paths. This change checks if the param paths is a basestring and then converts it to a list, so that the same variable paths can be used for both cases

How was this patch tested?

Added unit test for reading list of files

@SparkQA

SparkQA commented Oct 6, 2016

Copy link
Copy Markdown

Test build #66455 has finished for PR 15379 at commit ade9823.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler

Copy link
Copy Markdown
Member Author

ping @davies @yanboliang

path = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
paths = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))

@HyukjinKwon HyukjinKwon Oct 7, 2016

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super minor but I think it'd be nicer to match up the variable name to path if this makes sense. For parquet, it takes non-keyword arguments so it seems paths but for others, it seems take a single argument, path.

@holdenk holdenk Oct 7, 2016

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree keeping path here kind of makes sense.

Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).

@HyukjinKwon

HyukjinKwon commented Oct 7, 2016

Copy link
Copy Markdown
Member

+1 for this PR and please allow me to cc @holdenk here.

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @BryanCutler and thanks for pointing it to me @HyukjinKwon. Definitetly a good thing to fix and it does helpfully point out some of our API inconsistency (although I'm not 100% sure if fright now is the best time to fix the named parameter difference - but if it isn't we should make a follow up task to clean it up the next time we are more ok with making breaking changes).

path = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(path)))
paths = [paths]
return self._df(self._jreader.text(self._spark._sc._jvm.PythonUtils.toSeq(paths)))

@holdenk holdenk Oct 7, 2016

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree keeping path here kind of makes sense.

Its unfortunate we didn't catch the difference in the named parameter difference between these reader functions back during 2.0. At this point changing the named parameter from paths to path we need to be a bit careful with incase people are using named params (if we did that we would need to add a version changed note and be careful). We could also have it (transitionally) take a kwargs work with either for a version (while updating the pydoc of course).

@rxin

rxin commented Oct 7, 2016

Copy link
Copy Markdown
Contributor

Thanks - the argument renaming issue is largely orthogonal and I don't think we can break it now. I'm going to merge this in master/2.0.

@asfgit asfgit closed this in bcaa799 Oct 7, 2016
asfgit pushed a commit that referenced this pull request Oct 7, 2016
…of paths

## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`.  This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases

## How was this patch tested?
Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.

(cherry picked from commit bcaa799)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@BryanCutler

Copy link
Copy Markdown
Member Author

Thanks @rxin, @HyukjinKwon and @holdenk for reviewing!

@BryanCutler BryanCutler deleted the sql-readtext-paths-SPARK-17805 branch December 2, 2016 01:01
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…of paths

## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`.  This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used for both cases

## How was this patch tested?
Added unit test for reading list of files

Author: Bryan Cutler <cutlerb@gmail.com>

Closes apache#15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants