PG: join new vertex data by vertex ids#2796
rapids-bot[bot] merged 9 commits into rapidsai:branch-22.12
Conversation
Fixes rapidsai#2793. This relies on the assumption that there is a single row for each vertex id.
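The intended semantics — one row per vertex id, with new columns and new vertices joined in by id — can be sketched with pandas as a stand-in for cuDF (the frame contents here are made up for illustration):

```python
import pandas as pd

# Existing vertex data, keyed by vertex id: exactly one row per vertex.
existing = pd.DataFrame({"vertex": [0, 1, 2], "x": [10, 11, 12]}).set_index("vertex")

# New data keyed by vertex id; it may add columns and/or new vertices.
new = pd.DataFrame({"vertex": [1, 2, 3], "y": [21, 22, 23]}).set_index("vertex")

# Outer-joining on the vertex-id index merges new columns and new vertices
# while preserving the one-row-per-vertex invariant.
combined = existing.join(new, how="outer")
assert combined.index.is_unique
```

Because the join is on a unique key, the result cannot fan out into duplicate rows the way a many-to-many merge would.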
Codecov Report
Additional details and impacted files

@@ Coverage Diff @@
##           branch-22.12    #2796   +/- ##
===============================================
  Coverage             ?   62.60%
===============================================
  Files                ?      118
  Lines                ?     6570
  Branches             ?        0
===============================================
  Hits                 ?     4113
  Misses               ?     2457
  Partials             ?        0

☔ View full report at Codecov.
This should fix the quadratic scaling we're seeing when adding new data. CC @VibhuJawa. I'm still trying to improve the merges for MG to be like #2796, but I'm encountering issues.

Authors:
- Erik Welch (https://github.com/eriknw)

Approvers:
- Vibhu Jawa (https://github.com/VibhuJawa)
- Rick Ratzel (https://github.com/rlratzel)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #2805
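The quadratic scaling mentioned above is the classic grow-by-concatenation pattern: each incremental concat copies the entire accumulated frame, so n additions cost O(n^2) total, while collecting the pieces and concatenating once is linear in the total size. A minimal illustration (pandas as a stand-in for cuDF; function names are hypothetical, not from the PR):

```python
import pandas as pd

def add_one_at_a_time(n):
    # Each pd.concat copies all rows accumulated so far -> O(n^2) overall.
    df = pd.DataFrame({"vertex": [0], "x": [0]})
    for i in range(1, n):
        df = pd.concat([df, pd.DataFrame({"vertex": [i], "x": [i]})],
                       ignore_index=True)
    return df

def add_in_one_batch(n):
    # Collect pieces first, concatenate once -> linear in total size.
    pieces = [pd.DataFrame({"vertex": [i], "x": [i]}) for i in range(n)]
    return pd.concat(pieces, ignore_index=True)

slow = add_one_at_a_time(100)
fast = add_in_one_batch(100)
assert slow.equals(fast)
```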
This PR needs rapidsai/cudf#11998 to pass tests. I think we ought to benchmark this before merging. Here is the mini-benchmark, and here is the benchmark for the current branch. We shouldn't read too much into these numbers yet, but this PR actually isn't as fast as I was expecting. I'll investigate further and experiment with the MAG240 dataset. @rlratzel, this reworks the merge in
VibhuJawa
left a comment
Minor clarification. Looks good otherwise.
    if df.npartitions > 2 * self.__num_workers:
        # TODO: better understand behavior of npartitions argument in join
        df = df.repartition(npartitions=self.__num_workers).persist()
With dask_cudf merges/operations, I think we should not decrease the number of partitions to less than or equal to _num_workers unless we actually need to, because it decreases parallelization. The shuffle cost of the merge is O(n log n), but if the input has fewer partitions than available workers, worker starvation can become a problem.
Maybe let's pick 2 * _num_workers or something?
That sounds reasonable. Done. I think we may want to explore these values as we scale out.
I know it's bad, though, if we don't repartition and the number of partitions grows too large.
Agreed. Yup, it becomes really slow as you increase partitions. In case you are curious how the shuffle works under the hood, below is my favorite explanation by rjzamora, which I think still holds true.
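The repartitioning policy the thread settles on can be sketched as a pure function over partition counts (the names and the exact target of 2 * num_workers are assumptions based on the discussion above, not a quote from the merged code):

```python
def target_npartitions(npartitions: int, num_workers: int) -> int:
    """Pick a partition count for a dask_cudf frame before a merge.

    Only coalesce when the partition count has grown well beyond the
    worker count, and never drop below the worker count itself:
    fewer partitions than workers leaves workers starved (idle),
    while far too many partitions makes the shuffle slow.
    """
    if npartitions > 2 * num_workers:
        return 2 * num_workers
    return npartitions

print(target_npartitions(100, 8))  # 16
print(target_npartitions(10, 8))   # 10 (left alone; not worth a shuffle)
```

In the actual code this decision feeds df.repartition(npartitions=...).persist() as shown in the diff above.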
@gpucibot merge
Fixes #2793
Fixes #2794
This relies on the assumption that there is a single row for each vertex id.
This does not yet handle MG. Let's figure out how we want SG to behave first.