
PG: join new vertex data by vertex ids #2796

Merged
rapids-bot[bot] merged 9 commits into rapidsai:branch-22.12 from eriknw:join_vertex_ids
Nov 9, 2022

Conversation

@eriknw (Contributor) commented Oct 12, 2022

Fixes #2793
Fixes #2794

We are using the assumption that there should be a single row for each vertex id.

This does not yet handle MG. Let's figure out how we want SG to behave first.
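
For illustration, here is a minimal pandas sketch of that upsert-by-vertex-id idea (column names are hypothetical, and the real code operates on cudf/dask_cudf dataframes, so this shows the concept rather than the PR's implementation):

    import pandas as pd

    # Existing vertex property table: one row per vertex id.
    existing = pd.DataFrame(
        {"vertex": [0, 1, 2], "color": ["red", "blue", "green"]}
    ).set_index("vertex")

    # New data for two existing vertices plus one new vertex.
    new = pd.DataFrame(
        {"vertex": [1, 2, 3], "size": [10, 20, 30]}
    ).set_index("vertex")

    # Outer join on vertex id: existing rows gain the new columns and
    # vertex 3 is added; no vertex ends up with duplicate rows.
    combined = existing.join(new, how="outer")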

@codecov-commenter commented Oct 12, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.12@e4ed302).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12    #2796   +/-   ##
===============================================
  Coverage                ?   62.60%           
===============================================
  Files                   ?      118           
  Lines                   ?     6570           
  Branches                ?        0           
===============================================
  Hits                    ?     4113           
  Misses                  ?     2457           
  Partials                ?        0           


@eriknw marked this pull request as draft October 13, 2022 00:27
@alexbarghi-nv added the non-breaking (Non-breaking change) and Fix labels Oct 13, 2022
@alexbarghi-nv added this to the 22.12 milestone Oct 13, 2022
@alexbarghi-nv added the improvement (Improvement / enhancement to an existing function) label Oct 13, 2022
rapids-bot bot pushed a commit that referenced this pull request Oct 24, 2022
This should fix the quadratic scaling we're seeing when adding new data.

CC @VibhuJawa. I'm still trying to improve the merges for MG to be like #2796, but I'm encountering issues.

Authors:
  - Erik Welch (https://github.com/eriknw)

Approvers:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Rick Ratzel (https://github.com/rlratzel)
  - Alex Barghi (https://github.com/alexbarghi-nv)

URL: #2805
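
For intuition on that quadratic scaling (a generic pandas illustration, not the repo's code): if each add re-concatenates the new chunk onto the full accumulated table, adding k chunks copies the accumulated data k times, so total work grows quadratically with the number of adds.

    import pandas as pd

    chunks = [pd.DataFrame({"v": range(i * 1_000, (i + 1) * 1_000)}) for i in range(50)]

    # Quadratic: every append re-copies everything accumulated so far.
    acc = chunks[0]
    for chunk in chunks[1:]:
        acc = pd.concat([acc, chunk], ignore_index=True)

    # Linear in total size: batch the pieces and concatenate once.
    acc_fast = pd.concat(chunks, ignore_index=True)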
@eriknw (Contributor, Author) commented Oct 26, 2022

This PR needs rapidsai/cudf#11998 to pass tests.

I think we ought to try to benchmark this before merging.

Here is the mini-benchmark bench_add_edges_cyber for the "branch-22.12" branch (I added [50] for kicks):

------------------------------------------------------------- benchmark: 5 tests ------------------------------------------------------------
Name (time in s, mem in bytes)        Mean                GPU mem            GPU Leaked mem            Rounds            GPU Rounds
---------------------------------------------------------------------------------------------------------------------------------------------
bench_add_edges_cyber[1]            1.3833 (1.0)      273,342,072 (1.0)                 144 (1.0)           1           1
bench_add_edges_cyber[3]            2.8139 (2.03)     273,342,968 (1.00)              1,032 (7.17)          1           1
bench_add_edges_cyber[10]           6.7457 (4.88)     273,343,656 (1.00)              1,712 (11.89)         1           1
bench_add_edges_cyber[30]          18.8487 (13.63)    273,344,440 (1.00)              2,528 (17.56)         1           1
bench_add_edges_cyber[50]          30.2863 (21.89)    273,346,992 (1.00)              5,048 (35.06)         1           1
---------------------------------------------------------------------------------------------------------------------------------------------

and here is the benchmark for the current branch:

--------------------------------------------------------------- benchmark: 5 tests ---------------------------------------------------------------
Name (time in ms, mem in bytes)            Mean                GPU mem            GPU Leaked mem            Rounds            GPU Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------
bench_add_edges_cyber[1]               801.3284 (1.0)      273,342,224 (1.0)                 280 (1.0)           1           1
bench_add_edges_cyber[3]             1,723.6213 (2.15)     273,342,672 (1.00)                496 (1.77)          1           1
bench_add_edges_cyber[10]            4,731.0946 (5.90)     273,343,672 (1.00)              1,360 (4.86)          1           1
bench_add_edges_cyber[30]           13,608.0636 (16.98)    273,343,448 (1.00)              1,256 (4.49)          1           1
bench_add_edges_cyber[50]           22,819.4191 (28.48)    273,348,912 (1.00)              6,488 (23.17)         1           1
--------------------------------------------------------------------------------------------------------------------------------------------------

We shouldn't read too much into these numbers yet, but this PR actually isn't as fast as I was expecting. I'll investigate further and experiment with the MAG240 dataset.
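
For context, a parametrized mini-benchmark of this shape could look roughly like the following (an illustrative sketch, not the repo's actual bench_add_edges_cyber: a synthetic edge list and pytest-benchmark's stock benchmark fixture stand in for the cyber dataset and the repo's GPU-aware fixture):

    import pandas as pd
    import pytest
    from cugraph.experimental import PropertyGraph

    @pytest.mark.parametrize("n_chunks", [1, 3, 10, 30, 50])
    def bench_add_edges_cyber(benchmark, n_chunks):
        # Synthetic stand-in for the cyber edge list, split into n_chunks.
        df = pd.DataFrame({"src": range(10_000), "dst": range(1, 10_001)})
        size = len(df) // n_chunks
        chunks = [df.iloc[i * size : (i + 1) * size] for i in range(n_chunks)]

        def add_all():
            pg = PropertyGraph()
            for chunk in chunks:
                pg.add_edge_data(chunk, vertex_col_names=("src", "dst"))

        benchmark(add_all)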

@rlratzel, this reworks the merge in add_vertex_data and add_edge_data, which seems to be a particularly sensitive part of the code base. I think review of this PR can begin even though we're blocked on the cudf PR for the tests.

@eriknw marked this pull request as ready for review November 3, 2022 04:53
@VibhuJawa (Member) left a comment


Minor clarification. Looks good otherwise.

Comment on lines +456 to +458
if df.npartitions > 2 * self.__num_workers:
    # TODO: better understand behavior of npartitions argument in join
    df = df.repartition(npartitions=self.__num_workers).persist()
@VibhuJawa (Member) commented Nov 7, 2022


With dask_cudf merges/operations, I think we should not decrease the number of partitions to less than or equal to _num_workers unless we actually need to, because that reduces parallelization and worker starvation can become a problem.

The shuffle cost of the merge is n log n, but if the input partitions are fewer than the available workers, worker starvation can become a problem.

Maybe let's pick 2 * _num_workers or something?

@eriknw (Contributor, Author) replied

That sounds reasonable. Done. I think we may want to explore these values as we scale out.

I do know it's bad, though, if we don't repartition and the number of partitions grows too large.

@VibhuJawa (Member) replied

Agreed. Yup, it becomes really slow as you increase partitions. In case you are curious about how the shuffle works under the hood, below is my favorite explanation by rjzamora, which I think still holds true.

rapidsai/cudf#4308 (comment)
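
A minimal sketch of the heuristic settled on in this thread (generic dask code; maybe_repartition and the factor parameter are illustrative names, with the factor-of-two target following the suggestion above):

    def maybe_repartition(df, num_workers, factor=2):
        # Only shrink once the partition count has grown well past the
        # worker pool; target a small multiple of num_workers so no
        # worker is starved of partitions during the shuffle.
        if df.npartitions > factor * num_workers:
            df = df.repartition(npartitions=factor * num_workers).persist()
        return df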

@VibhuJawa (Member) left a comment

LGTM

@BradReesWork (Member) commented

@gpucibot merge

rapids-bot[bot] merged commit 7387fbc into rapidsai:branch-22.12 Nov 9, 2022

Labels

improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)


Development

Successfully merging this pull request may close these issues.

[BUG] Can not add data to the same edges twice
[BUG] Adding data to same nodes twice in a PG results in incorrect __vertex_prop_dataframe
