SPARK-4963 [SQL] Add copy to SQL's Sample operator #3827
|
Can one of the admins verify this patch? |
|
add to whitelist |
|
ok to test |
|
Test build #24866 has started for PR 3827 at commit
|
|
Test build #24866 has finished for PR 3827 at commit
|
|
Test PASSed. |
|
Hey @yanbohappy, as I've commented in the JIRA, would you mind running a micro benchmark using the code in #758 to see whether this fix introduces a noticeable performance regression? |
|
@yanbohappy Actually, we can move the |
6eaee5e to cea7e2e
|
Test build #24888 has started for PR 3827 at commit
|
|
@liancheng I agree with moving the copy call to execution.Sample.execute and have added new commits. |
|
Test build #24888 has finished for PR 3827 at commit
|
|
Test PASSed. |
Use checkAnswer here instead, as it gives better output when there is an exception or the answer is wrong.
    (1 to 10).foreach { i =>
      checkAnswer(
        sql("SELECT * FROM sampled WHERE key % 2 = 1"),
        Seq.empty)
    }
|
Test build #24942 has started for PR 3827 at commit
|
|
Changed the test for better output and moved it to another, more suitable test file. |
|
Test build #24942 has finished for PR 3827 at commit
|
|
Test FAILed. |
|
Test build #24947 has started for PR 3827 at commit
|
|
Test build #24947 has finished for PR 3827 at commit
|
|
Test PASSed. |
|
Can anyone verify and merge this patch? This bug shows up frequently, so it would be better to fix it as soon as possible. @marmbrus |
|
Sorry for the late response. This LGTM except for a minor styling issue. Thanks! |
|
Test build #25293 has started for PR 3827 at commit
|
|
Test build #25293 has finished for PR 3827 at commit
|
|
Test PASSed. |
You don't need to create the src table. Our test harness does that automatically whenever the test tables are referenced. I can remove this when merging.
|
Thanks! I've merged this to master. |
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() returns wrong results because GapSamplingIterator operates on mutable rows.
HiveTableScan builds an RDD of SpecificMutableRow, and SchemaRDD.sample() returns a GapSamplingIterator to iterate over it.
    override def next(): T = {
      val r = data.next()
      advance
      r
    }
GapSamplingIterator.next() takes the current underlying element and assigns it to r.
However, if the underlying iterator yields a reused mutable row, as HiveTableScan does, the row held by the underlying iterator and r point to the same object.
The advance operation then skips some underlying elements, which also overwrites r unexpectedly, so we return a value different from the one r initially held.
To fix this issue, the most direct way is to make HiveTableScan return a copy of each mutable row, as in my initial commit. With this solution HiveTableScan cannot take full advantage of the reusable MutableRow, but the sample operation returns correct results.
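The aliasing and the copy-based fix can be reproduced outside Spark with a minimal sketch. Everything here is a hypothetical stand-in: `MutableRow` is not Spark's SpecificMutableRow, `reusingScan` only mimics a scan that reuses one row object, and `SkipOneIterator` uses a fixed gap of one element instead of the random gap that GapSamplingIterator draws. Only the shape of `next()` is taken from the snippet above.

```scala
object SampleAliasingDemo {
  // Hypothetical stand-in for a reusable mutable row (not Spark's SpecificMutableRow).
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // Mimics a scan that writes every record into the same row object.
  def reusingScan(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  // Simplified sampler: returns one element, then drops the next one.
  // The fixed gap replaces the random gap of the real iterator (assumption).
  final class SkipOneIterator[T](data: Iterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = data.hasNext
    override def next(): T = {
      val r = data.next()
      if (data.hasNext) data.next() // "advance": overwrites the shared row
      r
    }
  }

  // Without a copy, r aliases the shared row, so the skipped values leak out.
  val buggy: List[Int] =
    new SkipOneIterator(reusingScan(1 to 6)).map(_.value).toList

  // Copying each row before it reaches the sampler keeps r stable.
  val fixed: List[Int] =
    new SkipOneIterator(reusingScan(1 to 6).map(_.copy())).map(_.value).toList

  def main(args: Array[String]): Unit = {
    println(s"buggy = $buggy") // List(2, 4, 6)
    println(s"fixed = $fixed") // List(1, 3, 5)
  }
}
```

The copy costs one allocation per row that reaches the sampler, including rows that are later skipped, which is the performance trade-off mentioned above.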
Furthermore, we need to investigate whether GapSamplingIterator.next() can perform the copy itself. To achieve this, every element type that an RDD can store would have to implement something like a cloneable interface, which would be a huge change.
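A rough sketch of what copying inside the sampler could look like, to show why it touches every element type. The `copy: T => T` parameter and all names below are assumptions for illustration only; nothing like this exists in GapSamplingIterator, and the fixed one-element gap again replaces the real random gap.

```scala
object CopyingSamplerSketch {
  // Hypothetical stand-in for a reusable mutable row.
  final class MutableRow(var value: Int)

  // Mimics a scan that writes every record into the same row object.
  def reusingScan(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  // Hypothetical sampler that snapshots the element before advancing.
  // Every caller must supply a way to copy T, whatever T is.
  final class CopyingSkipOneIterator[T](data: Iterator[T], copy: T => T)
      extends Iterator[T] {
    override def hasNext: Boolean = data.hasNext
    override def next(): T = {
      val r = copy(data.next())     // copy first, so advancing cannot change r
      if (data.hasNext) data.next() // fixed gap of one element (assumption)
      r
    }
  }

  // Mutable rows need a real copy function.
  val rows: List[Int] =
    new CopyingSkipOneIterator[MutableRow](
      reusingScan(1 to 6), r => new MutableRow(r.value)
    ).map(_.value).toList

  // Immutable elements still have to pass something, here identity.
  val ints: List[Int] =
    new CopyingSkipOneIterator[Int](Iterator(1, 2, 3, 4, 5, 6), identity).toList
}
```

Both `rows` and `ints` evaluate to `List(1, 3, 5)`. The need to thread a copy function (or a cloneable bound) through every RDD element type is what makes this option so invasive compared with copying in the SQL operator.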