[SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API by viirya · Pull Request #20464 · apache/spark

viirya · 2018-02-01T03:41:19Z

What changes were proposed in this pull request?

R's substr API appears to treat Scala's substr API as zero-based, so it subtracts 1 from the given starting position.

Because Scala's substr API also accepts 0 as a starting position (treating it as the first element), the existing R substr tests still produce correct results: they all use 1 as the starting position.

How was this patch tested?

Modified tests.

viirya · 2018-02-01T03:45:01Z

One more thing to note is that the two parameters of R's substr API (starting and ending positions) are also not aligned with Scala's substr, which takes a starting position and a substring length.

viirya · 2018-02-01T03:45:23Z

cc @felixcheung @HyukjinKwon

srowen · 2018-02-01T03:47:18Z

Also @shivaram

shivaram · 2018-02-01T03:52:12Z

One thing to keep in mind is what the user's perception of the API is. If R users are going to use 1-based indexing then this might not be the right fix ? http://stat.ethz.ch/R-manual/R-devel/library/base/html/substr.html is the base R function FWIW

viirya · 2018-02-01T04:00:21Z

@shivaram This fix makes it correctly 1-based. Previously the SparkR substr API subtracted 1 from the starting position, so it became zero-based.

This fix matches R's substr in the link above, as I tested:

> substr("Michael", 4, 6)
[1] "hae"

Before this fix, SparkR's substr returns "cha".
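
The difference can be reproduced without a Spark session. The following is a minimal base-R sketch, not SparkR code: `spark_substr`, `old_wrapper` and `new_wrapper` are hypothetical names, and `spark_substr` only models the JVM-side `substr(pos, len)` for `pos >= 1`, where the position is 1-based.

```r
# Hypothetical model of the JVM-side Column.substr(pos, len) for pos >= 1:
# a 1-based start position plus a length.
spark_substr <- function(s, pos, len) {
  stopifnot(pos >= 1)
  substr(s, pos, pos + len - 1)
}

# SparkR wrapper before this fix: start is reduced by 1.
old_wrapper <- function(s, start, stop) {
  spark_substr(s, start - 1, stop - start + 1)
}

# SparkR wrapper after this fix: start is passed through unchanged.
new_wrapper <- function(s, start, stop) {
  spark_substr(s, start, stop - start + 1)
}

old_result  <- old_wrapper("Michael", 4, 6)  # "cha"
new_result  <- new_wrapper("Michael", 4, 6)  # "hae"
base_result <- substr("Michael", 4, 6)       # "hae"
print(c(old = old_result, new = new_result, base = base_result))
```

Only the first argument changes between the two wrappers; the length `stop - start + 1` was already correct.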

shivaram · 2018-02-01T04:03:55Z

Thanks for clarifying @viirya. Is the PR description accurate ? I read it as ..SQL's substr also accepts zero-based starting position while R uses a 1-based starting position.

SparkQA · 2018-02-01T04:25:43Z

Test build #86908 has finished for PR 20464 at commit a2ffdc1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-02-01T04:38:26Z

@shivaram Thanks for pointing it out. I have changed the description; hopefully it is clearer now. Basically I just wanted to clarify why R's substr tests were passing previously.

HyukjinKwon · 2018-02-01T04:57:31Z

I was just manually double checking both substr in R and this. It seems correct; however, I think we should add a note in the doc and release note ...

One follow-up question, though: would it be difficult to match the behaviour of substr in R when the index is 0 or negative? If I understood #20464 (comment) correctly, it sounds better to match substr's behaviour in R. I took a quick look/test and it seems we can just set start to 1 for both cases.

If this followup question is something we are not sure yet, I think we might be okay as is.

HyukjinKwon · 2018-02-01T04:59:47Z

Just in case, I am testing with:

df <- createDataFrame(list(list(a="abcdef")))
collect(select(df, substr(df$a, 4, 5)))
substr("abcdef", 4, 5)

just in case it helps to check and reproduce.

viirya · 2018-02-01T05:22:15Z

@HyukjinKwon wrote: "One followup question is though, would it be difficult to match the behaviour with substr in R when the index is 0 or minus? If i understood #20464 (comment) correctly, it sounds better to match it to substr's behaviour in R. Took a quick look/test and seems we can just set start to 1 for both cases."

If we consider both the starting and ending indices, setting them to 1 is not enough. E.g.,

> substr("abcdef", -2, -3)
[1] ""
> substr("abcdef", 1, 1)
[1] "a"

When only the ending position is zero/negative, the result is an empty string no matter what the starting position is.

When only the starting position is zero/negative, we can set it to 1.

When both are zero/negative, the result is an empty string.

We can address this in another PR.
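
The three cases above could be handled roughly as follows. This is only a sketch of the possible follow-up, not code from this PR: `normalize_substr_args` is a hypothetical helper, it assumes scalar numeric arguments, and it is checked here against base R's `substr` rather than against Spark.

```r
# Hypothetical helper: map base-R style (start, stop) to a (pos, len) pair
# following the three cases listed above. Assumes scalar numeric arguments.
normalize_substr_args <- function(start, stop) {
  if (stop <= 0) {
    # Ending position is zero/negative: the result is always empty.
    return(list(pos = 1L, len = 0L))
  }
  # Only the starting position can be zero/negative here: clamp it to 1.
  start <- max(start, 1)
  list(pos = as.integer(start), len = as.integer(max(stop - start + 1, 0)))
}

# Compare against base R's substr on a small grid of indices.
# For pos >= 1, taking len characters from pos is substr(s, pos, pos + len - 1).
s <- "abcdef"
mismatches <- 0
for (start in -3:7) {
  for (stop in -3:7) {
    a <- normalize_substr_args(start, stop)
    got <- substr(s, a$pos, a$pos + a$len - 1)
    if (!identical(got, substr(s, start, stop))) mismatches <- mismatches + 1
  }
}
print(mismatches)
```

Because the normalized `pos` is always at least 1, the pair can be passed to a 1-based `substr(pos, len)` without relying on how zero or negative positions are treated there.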

felixcheung · 2018-02-01T05:39:10Z

 setMethod("substr", signature(x = "Column"),
          function(x, start, stop) {
-            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
+            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))


I'm a bit concerned about changing this. As you can see it's been like this from the very beginning...

The current API behavior should be considered wrong, and it is inconsistent: for starting position 1 we get the substring from the 1st element, but for position 2 we still get the substring from the 1st element. So we get the following inconsistent results:

> collect(select(df, substr(df$a, 1, 5)))
  substring(a, 0, 5)
1              abcde
> collect(select(df, substr(df$a, 2, 5)))
  substring(a, 1, 4)
1               abcd

For such change, we might need to add a note in the doc as @HyukjinKwon suggested.

question:

is there a way to make the behavior the same before this change for any caller calling substr with common index like 0

why consider other changes as a follow up and not here?
[SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API #20464 (comment)

why consider other changes as a follow up and not here? #20464 (comment)

Just because I think the 0/negative indices are a separate issue. I can deal with it here if you strongly feel that is better.

is there a way to make the behavior the same before this change for any caller calling substr with common index like 0

Should we keep the behavior when calling substr with 0 as start index?

> df <- createDataFrame(list(list(a="abcdef")))
> collect(select(df, substr(df$a, 0, 5)))
  substring(a, -1, 6)
1                   f
> substr("abcdef", 0, 5)
[1] "abcde"

I think the previous behavior is pretty unreasonable..
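
The outputs quoted in this thread can be reproduced with a small base-R model. This is a sketch under assumptions: `sql_substring`, `old_substr` and `new_substr` are hypothetical names, and the treatment of zero and negative positions is inferred from the results shown above (0 means the first character, a negative position counts back from the end). It is only valid while `abs(pos)` does not exceed the string length.

```r
# Hypothetical base-R model of Spark SQL's substring(str, pos, len):
#   pos > 0  -> 1-based position from the left
#   pos == 0 -> treated as the first character
#   pos < 0  -> counted back from the end of the string
# Only valid while abs(pos) <= nchar(s).
sql_substring <- function(s, pos, len) {
  n <- nchar(s)
  stopifnot(abs(pos) <= n)
  start0 <- if (pos > 0) pos - 1 else if (pos < 0) n + pos else 0  # zero-based start
  end0 <- min(start0 + len, n)                                    # exclusive end
  substr(s, start0 + 1, end0)
}

# Wrapper before and after this change.
old_substr <- function(s, start, stop) sql_substring(s, start - 1, stop - start + 1)
new_substr <- function(s, start, stop) sql_substring(s, start, stop - start + 1)

s <- "abcdef"
old_results  <- c(old_substr(s, 1, 5), old_substr(s, 2, 5), old_substr(s, 0, 5))
new_results  <- c(new_substr(s, 1, 5), new_substr(s, 2, 5), new_substr(s, 0, 5))
base_results <- c(substr(s, 1, 5), substr(s, 2, 5), substr(s, 0, 5))
print(rbind(old = old_results, new = new_results, base = base_results))
```

Under this model the new wrapper agrees with base R for positive starting positions, while a start of 0 still differs from base R, which is the 0/negative index question left for a follow-up.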

SparkQA · 2018-02-01T09:47:30Z

Test build #86924 has finished for PR 20464 at commit 95c8a4e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-02-02T07:01:09Z

+
+## Upgrading to Spark 2.4.0
+
+ - The first parameter of `substr` method was wrongly subtracted by one, previously. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been corrected.


instead of The first parameter of -> The ``start`` parameter of...

felixcheung · 2018-02-02T07:09:52Z

 setMethod("substr", signature(x = "Column"),
          function(x, start, stop) {
-            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
+            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))


question:

is there a way to make the behavior the same before this change for any caller calling substr with common index like 0

why consider other changes as a follow up and not here?
[SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API #20464 (comment)

SparkQA · 2018-02-02T09:25:13Z

Test build #86985 has finished for PR 20464 at commit d994d76.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

I admit it's correct .. so LGTM but let me leave it to @felixcheung and @shivaram ..

shivaram · 2018-02-07T08:03:15Z

I think @felixcheung has the most context here, so I'd suggest we wait for his comments.

viirya · 2018-02-15T14:37:44Z

ping @felixcheung

felixcheung · 2018-02-16T04:33:28Z

Sorry, I'm a bit occupied with testing 2.3 RC, will get back to this after.

viirya · 2018-02-16T08:46:40Z

@felixcheung Thanks!

viirya · 2018-03-02T04:50:48Z

Now that 2.3 is released, ping @felixcheung again

felixcheung · 2018-03-04T00:22:43Z

 setMethod("substr", signature(x = "Column"),
          function(x, start, stop) {
-            jc <- callJMethod(x@jc, "substr", as.integer(start - 1), as.integer(stop - start + 1))
+            jc <- callJMethod(x@jc, "substr", as.integer(start), as.integer(stop - start + 1))


I think we should do two things:

add to the func doc that the start param should be 0-base and to add to the example with the result
collect(select(df, substr(df$a, 0, 5))) # this should give you...

I think you mean 1-base.

Added to the func doc.

felixcheung · 2018-03-04T00:23:39Z

+
+## Upgrading to Spark 2.4.0
+
+ - The `start` parameter of `substr` method was wrongly subtracted by one, previously. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been corrected.


in the migration guide we should give a concrete example with non-0 start index, eg.
substr(df$a, 1, 6) should be changed to substr(df$a, 0, 5)

SparkQA · 2018-03-06T05:37:34Z

Test build #87993 has finished for PR 20464 at commit 0ebdf74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-03-06T17:17:46Z

appveyor tests failed, could you close and reopen this PR to trigger it.

strange, I haven't seen anything like this on appveyor in a long time.

1. Error: create DataFrame with complex types (@test_sparkSQL.R#535) -----------
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 1 times, most recent failure: Lost task 0.0 in stage 114.0 (TID 116, localhost, executor driver): java.net.SocketTimeoutException: Accept timed out

felixcheung

LG, pending tests.
one small comment for clarity. thanks!

felixcheung · 2018-03-07T06:24:49Z

+
+## Upgrading to Spark 2.4.0
+
+ - The `start` parameter of `substr` method was wrongly subtracted by one, previously. In other words, the index specified by `start` parameter was considered as 0-base. This can lead to inconsistent substring results and also does not match with the behaviour with `substr` in R. It has been fixed so the `start` parameter of `substr` method is now 1-base, e.g., `substr(df$a, 2, 5)` should be changed to `substr(df$a, 1, 4)`.


could you add
method is now 1-base, e.g., therefore to get the same result as substr(df$a, 2, 5), it should be changed to substr(df$a, 1, 4)

Yes. Added.
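
The migration example can be checked by comparing the arguments each version of the wrapper passes to the JVM. This is a minimal sketch: the two `*_jvm_args` functions restate the diff above, and `migrate_call` is a hypothetical helper, not part of SparkR.

```r
# Arguments the SparkR wrapper passes to the JVM-side substr(pos, len),
# taken from the diff above.
old_jvm_args <- function(start, stop) c(pos = start - 1, len = stop - start + 1)
new_jvm_args <- function(start, stop) c(pos = start, len = stop - start + 1)

# Hypothetical migration helper: given a pre-2.4.0 call substr(x, start, stop),
# return the (start, stop) that produces the same JVM call after the fix.
migrate_call <- function(start, stop) c(start = start - 1, stop = stop - 1)

migrated <- migrate_call(2, 5)  # start = 1, stop = 4
same_jvm_call <- all(
  old_jvm_args(2, 5) == new_jvm_args(migrated[["start"]], migrated[["stop"]])
)
print(same_jvm_call)
```

Shifting both `start` and `stop` down by one keeps the length `stop - start + 1` unchanged, so the old and new calls reach the JVM with identical arguments.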

SparkQA · 2018-03-07T07:12:18Z

Test build #88039 has finished for PR 20464 at commit 8c1a8ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-03-07T17:39:36Z

merged to master, thanks!

…by 1 when calling Scala API

## What changes were proposed in this pull request?

Seems R's substr API treats Scala substr API as zero based and so subtracts the given starting position by 1.

Because Scala's substr API also accepts zero-based starting position (treated as the first element), so the current R's substr test results are correct as they all use 1 as starting positions.

## How was this patch tested?

Modified tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#20464 from viirya/SPARK-23291.

R's substr should not reduce starting position by 1 when calling Scal…

a2ffdc1

…a API.

felixcheung reviewed Feb 1, 2018

Add a note to migration guide of R doc.

95c8a4e

felixcheung reviewed Feb 2, 2018

Fix doc.

d994d76

HyukjinKwon approved these changes Feb 5, 2018

felixcheung reviewed Mar 4, 2018

Improve doc.

0ebdf74

viirya closed this Mar 6, 2018

viirya reopened this Mar 6, 2018

felixcheung approved these changes Mar 7, 2018

Improve doc clarity.

8c1a8ec

asfgit closed this in 53561d2 Mar 7, 2018

viirya deleted the SPARK-23291 branch December 27, 2023 18:21

