Skip to content

[SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet#8956

Closed
viirya wants to merge 5 commits into
apache:masterfrom
viirya:parquet-stringfilter-pushdown
Closed

[SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet#8956
viirya wants to merge 5 commits into
apache:masterfrom
viirya:parquet-stringfilter-pushdown

Conversation

@viirya

@viirya viirya commented Oct 1, 2015

Copy link
Copy Markdown
Member

@SparkQA

SparkQA commented Oct 1, 2015

Copy link
Copy Markdown

Test build #43149 has finished for PR 8956 at commit f27288e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@rxin

rxin commented Oct 3, 2015

Copy link
Copy Markdown
Contributor

cc @liancheng

@SparkQA

SparkQA commented Oct 3, 2015

Copy link
Copy Markdown

Test build #43213 has finished for PR 8956 at commit 4d00ed0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@viirya

viirya commented Oct 3, 2015

Copy link
Copy Markdown
Member Author

retest this please.

@SparkQA

SparkQA commented Oct 3, 2015

Copy link
Copy Markdown

Test build #43215 has finished for PR 8956 at commit 4d00ed0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@SparkQA

SparkQA commented Oct 5, 2015

Copy link
Copy Markdown

Test build #43238 has finished for PR 8956 at commit eb134b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove ()

@liancheng

Copy link
Copy Markdown
Contributor

Off topic but related, SetInFilter.canDrop can also leverage statistics information:

    private val min = valueSet.min
    private val max = valueSet.max

    override def canDrop(statistics: Statistics[T]): Boolean = {
      statistics.getMax.compareTo(min) < 0 || max.compareTo(statistics.getMin) < 0
    }

(And we probably should rename SetInFilter to InSetFilter.)

@viirya viirya changed the title [SPARK-10895][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet Oct 17, 2015
@viirya viirya changed the title [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet Oct 17, 2015
@viirya

viirya commented Oct 17, 2015

Copy link
Copy Markdown
Member Author

@liancheng Thank you for your detailed comments. I've updated this patch. When the tests are passed, please review it again to see if there is any problem.

@SparkQA

SparkQA commented Oct 17, 2015

Copy link
Copy Markdown

Test build #43878 has finished for PR 8956 at commit 02bbab8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@viirya

viirya commented Oct 17, 2015

Copy link
Copy Markdown
Member Author

retest this please.

@SparkQA

SparkQA commented Oct 17, 2015

Copy link
Copy Markdown

Test build #43880 has finished for PR 8956 at commit 02bbab8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@rxin

rxin commented Oct 17, 2015

Copy link
Copy Markdown
Contributor

Are there any performance improvements by pushing this down?

@viirya

viirya commented Oct 17, 2015

Copy link
Copy Markdown
Member Author

I can run some performance tests later.

@rxin

rxin commented Oct 17, 2015

Copy link
Copy Markdown
Contributor

Thanks - that'd be great.

@viirya

viirya commented Oct 19, 2015

Copy link
Copy Markdown
Member Author

Sorry I am in travel. I will submit the test few days after.

@rxin

rxin commented Oct 19, 2015

Copy link
Copy Markdown
Contributor

@viirya do you mind closing this and reopening it when it's ready?

@viirya

viirya commented Oct 20, 2015

Copy link
Copy Markdown
Member Author

Sure.

@viirya viirya closed this Oct 20, 2015
@viirya

viirya commented Oct 25, 2015

Copy link
Copy Markdown
Member Author

@rxin I am curious that although I don't observe significant performance improvement from a simple projection + filter operation by now with simple experiment, by making this filters pushed down to Parquet side, do we retrieve less data and reduce the memory footprint? If so, even under the same performance level, is this patch still worth merging?

@rxin

rxin commented Oct 25, 2015

Copy link
Copy Markdown
Contributor

If we don't observe performance improvements, it's definitely not worth it. Can you post your how you measured it, and performance results? Thanks.

@viirya

viirya commented Oct 25, 2015

Copy link
Copy Markdown
Member Author

ok. Thanks. Because we found that with pushdown filters, we can avoid the OOM problem when processing large data in our daily usage. I am wondering if it is helpful to others too.

I will post the the performance test later.

@rxin

rxin commented Oct 25, 2015

Copy link
Copy Markdown
Contributor

How does pushdown avoid OOM?

@viirya

viirya commented Oct 25, 2015

Copy link
Copy Markdown
Member Author

Because we can pre-filtering the data? Without pushdown, the whole data will be loaded into memory and then has been filtered later.

@rxin

rxin commented Oct 25, 2015

Copy link
Copy Markdown
Contributor

Is that the case? I thought we load them one by one (or small batch at a time) and then apply the filter directly on them?

@viirya

viirya commented Oct 25, 2015

Copy link
Copy Markdown
Member Author

Hmm, I am not sure about that. Because I supposed that Parquet relation will read all data first if no pushdown filters are applied. Then Spark SQL's Filter operation will be applied later. Maybe @liancheng can answer this?

@liancheng

Copy link
Copy Markdown
Contributor

Well, it depends. The situation is a little bit tricky to explain. In general there are two cases:

  1. String filters with high selectivity, namely most records can be dropped
  • Performance

    Usually, I'd expect there's no noticeable performance gain, because each record is checked against the filter pushed down, and string operations themselves are CPU bound. So the performance should be similar to the case no filter is pushed down at all.

    However, a properly implemented StringStartsWithFilter.canDrop (as I mentioned in this comment) can bring big performance win since it can drop entire row groups whenever possible. But this requires us to bump parquet-mr to 1.8+ first, which is done in [SPARK-9876] [SQL] Bumps parquet-mr to 1.8.1 #9225.

  • Memory footprint

    What @viirya observed is reasonable. One benefit of Parquet filters is that, they are performed before record assembling, namely we can drop a record before converting the underlying Parquet column values into an InternalRow. I think that's the reason why @viirya observed that OOM was gone.

    (BTW, ParquetRelation processes all the data using iterators, so we don't read all the data first and then apply the filters. My theory is that it's the InternalRow materialization costs more memory.)

  1. String filters with low selectivity, namely most records can NOT be dropped
  • Performance

    In this case, I'd expect performance regression. This is because currently Spark SQL tends to be pessimistic, and always applies all the filters again even if some of them are pushed down. In this case, almost all records are filtered twice. Since string operations are CPU bound, this can be time consuming.

  • Memory footprint

    Since the string filters in this PR "steal" the underlying byte arrays without copying them, I'd expect the memory footprint is similar to the normal case.

@viirya

viirya commented Oct 26, 2015

Copy link
Copy Markdown
Member Author

Thank you @liancheng for clear explanation!

So looks like the only benefit of this patch is the reduced memory footprint under certain cases. If you all think it is not worth merging this, we should keep it closed.

@liancheng

Copy link
Copy Markdown
Contributor

@viirya I think we can add StringStartsWithFilter later after #9225 is merged. Also we are considering removing the defensive filtering. But yeah, for now let's keep this one closed.

@viirya

viirya commented Oct 27, 2015

Copy link
Copy Markdown
Member Author

@liancheng ok. Thank you.

@viirya viirya deleted the parquet-stringfilter-pushdown branch December 27, 2023 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants