
[REVIEW] Optimize dask.uniform_neighbor_sample#2887

Merged
rapids-bot[bot] merged 11 commits into rapidsai:branch-22.12 from VibhuJawa:optimize_dask_uniform_sampling
Nov 18, 2022

Conversation

@VibhuJawa
Member

@VibhuJawa VibhuJawa commented Nov 3, 2022

This PR closes #2872 and addresses part of https://github.com/rapidsai/graph_dl/issues/74.

In a preliminary benchmark I am seeing 6x and 3.5x improvements, and I expect larger gains on bigger clusters.

Before PR :

%%timeit
df = cugraph.dask.uniform_neighbor_sample(dask_g, start_list[:10_000], [20])
187 ms ± 40.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After PR

%%timeit
df = cugraph.dask.uniform_neighbor_sample(dask_g, start_list[:10_000], [20])
50.1 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Updated benchmark

Before PR:

---------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------
Name (time in ms, mem in bytes)                            Mean              GPU mem            GPU Leaked mem            Rounds            GPU Rounds          
----------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_uniform_neigbour_sample_email_eu_core[1000]      170.2833 (1.0)        300,224 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[5000]      191.7691 (1.13)     1,200,944 (4.00)                  0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[10000]     194.8658 (1.14)     1,200,944 (4.00)                  0 (1.0)           1           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------

After PR:

--------------------------------------------------------------------- benchmark: 3 tests --------------------------------------------------------------------
Name (time in ms, mem in bytes)                           Mean            GPU mem            GPU Leaked mem            Rounds            GPU Rounds          
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_uniform_neigbour_sample_email_eu_core[1000]      54.0802 (1.0)            0 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[5000]      63.0271 (1.17)           0 (1.0)                   0 (1.0)           1           1
bench_uniform_neigbour_sample_email_eu_core[10000]     70.1445 (1.30)           0 (1.0)                   0 (1.0)           1           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------

We should probably apply a similar approach to other algorithms as well.
CC: @jnke2016, @rlratzel, @alexbarghi-nv.

@VibhuJawa VibhuJawa requested a review from a team as a code owner November 3, 2022 22:24
@VibhuJawa VibhuJawa added this to the 22.12 milestone Nov 3, 2022
@VibhuJawa VibhuJawa self-assigned this Nov 3, 2022
@VibhuJawa VibhuJawa added 2 - In Progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 3, 2022
@VibhuJawa VibhuJawa changed the title [WIP] Optimize dask.uniform_neighbor_sample [REVIEW] Optimize dask.uniform_neighbor_sample Nov 4, 2022
@codecov-commenter

codecov-commenter commented Nov 4, 2022

Codecov Report

Base: 60.43% // Head: 61.71% // Increases project coverage by +1.27% 🎉

Coverage data is based on head (632d584) compared to base (d86c933).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12    #2887      +/-   ##
================================================
+ Coverage         60.43%   61.71%   +1.27%     
================================================
  Files               114      123       +9     
  Lines              6547     7543     +996     
================================================
+ Hits               3957     4655     +698     
- Misses             2590     2888     +298     
Impacted Files Coverage Δ
python/cugraph/cugraph/components/connectivity.py 94.11% <ø> (-0.89%) ⬇️
python/cugraph/cugraph/dask/__init__.py 100.00% <ø> (ø)
...on/cugraph/cugraph/dask/components/connectivity.py 31.03% <ø> (+5.10%) ⬆️
...thon/cugraph/cugraph/dask/sampling/random_walks.py 22.91% <ø> (ø)
...h/cugraph/dask/sampling/uniform_neighbor_sample.py 25.42% <ø> (+4.26%) ⬆️
...ugraph/cugraph/dask/structure/mg_property_graph.py 14.12% <ø> (+0.46%) ⬆️
python/cugraph/cugraph/dask/traversal/bfs.py 25.39% <ø> (-0.84%) ⬇️
...ph/cugraph/experimental/pyg_extensions/__init__.py 100.00% <ø> (ø)
python/cugraph/cugraph/gnn/__init__.py 100.00% <ø> (ø)
...h/cugraph/gnn/dgl_extensions/base_cugraph_store.py 84.21% <ø> (ø)
... and 47 more


wait(cudf_result)

ddf = dask_cudf.from_delayed(cudf_result).persist()
# Send tasks all at once
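For readers following along: the reviewed lines wait on per-worker results and only then build a `dask_cudf` DataFrame from them, after all tasks have been sent at once. A CPU-only analogy of that "submit everything up front, then wait, then combine" pattern, using only the standard library (all names below are illustrative, not the PR's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def sample_partition(part):
    # Stand-in for the per-worker sampling call; just squares each element.
    return [x * x for x in part]

partitions = [[1, 2], [3, 4], [5, 6]]

with ThreadPoolExecutor(max_workers=3) as ex:
    # Send all tasks at once so every worker starts immediately...
    futures = [ex.submit(sample_partition, p) for p in partitions]
    # ...then block until all tasks have finished before combining results.
    wait(futures)
    combined = [row for f in futures for row in f.result()]

print(combined)  # [1, 4, 9, 16, 25, 36]
```

The key point is that no task waits on another's result before being submitted, which is what removes the per-task round-trip overhead this PR targets.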
Member

@alexbarghi-nv alexbarghi-nv Nov 9, 2022

Just one comment; it would be nice to have a helper function for this so we can apply it to more algorithms. But I suggest we leave that to a future PR given how much work we have left before burndown.

To clarify, this was referencing lines 174-180
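A hedged sketch of what such a reusable helper might look like. The function name, signature, and stub client below are hypothetical, not part of cugraph; a real version would take an actual dask.distributed.Client:

```python
def run_on_each_worker(client, func, per_worker_args):
    """Submit ``func`` once per worker, pinned to that worker.

    ``client`` is duck-typed after ``dask.distributed.Client``: it only needs
    a ``submit(..., workers=..., allow_other_workers=...)`` method.
    ``per_worker_args`` maps a worker address to that worker's argument tuple.
    Returns the futures in submission order.
    """
    return [
        client.submit(func, *args, workers=[worker], allow_other_workers=False)
        for worker, args in per_worker_args.items()
    ]

class _StubFuture:
    def __init__(self, value):
        self._value = value
    def result(self):
        return self._value

class _StubClient:
    """Minimal stand-in for dask.distributed.Client (illustration only)."""
    def submit(self, func, *args, workers=None, allow_other_workers=True):
        return _StubFuture(func(*args))

client = _StubClient()
futs = run_on_each_worker(client, lambda a, b: a + b,
                          {"tcp://w0": (1, 2), "tcp://w1": (3, 4)})
print([f.result() for f in futs])  # [3, 7]
```

With a helper like this, each multi-GPU algorithm would only need to supply its per-worker function and arguments rather than repeating the submit/wait plumbing.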

Member

@alexbarghi-nv alexbarghi-nv left a comment

👍

Contributor

@jnke2016 jnke2016 left a comment

This implementation experiences a hang on 2, 3 and 4 GPUs, similar to this issue that was closed a few months ago by this PR. To reproduce, just run this implementation of uniform_neighbor_sample several times (fewer than 10) on 2 GPUs.

@VibhuJawa
Member Author

This implementation experiences a hang on 2, 3 and 4 GPUs, similar to this issue that was closed a few months ago by this PR. To reproduce, just run this implementation of uniform_neighbor_sample several times (fewer than 10) on 2 GPUs.

Thanks for testing this thoroughly @jnke2016, I really appreciate it. Can you post a link to the code you ran to trigger the hang? I could not get it to hang in my testing on 8 GPUs.

We can probably add those statements back and give back ~10 ms of perf on 8 GPUs, but that is not a real solution, as it will be much worse on a larger cluster (say 128 GPUs).

I am going to look into a real fix for the problem today.

@jnke2016
Contributor

jnke2016 commented Nov 9, 2022

You can run this on 2 GPUs.

import time

import cugraph
import cugraph.dask as dcg  # keeps cugraph.dask importable, as in the original script
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cugraph.dask.comms import comms as Comms
from cugraph.dask.common.mg_utils import get_visible_devices
from cugraph.generators import rmat

def setup():
    cluster = LocalCUDACluster()
    client = Client(cluster)
    client.wait_for_workers(len(get_visible_devices()))

    Comms.initialize(p2p=True)
    return (client, cluster)

def teardown(client, cluster=None):
    Comms.destroy()
    client.close()
    if cluster:
        cluster.close()

def generate_edgelist(scale,
                      edgefactor,
                      seed=None,
                      unweighted=False,
                     ):

    df = rmat(
        scale,
        (2**scale)*edgefactor,
        0.57,  # from Graph500
        0.19,  # from Graph500
        0.19,  # from Graph500
        seed or 42,
        clip_and_flip=False,
        scramble_vertex_ids=True,
        create_using=None,  # return edgelist instead of Graph instance
        mg=True
    )

    return df

if __name__ == "__main__":

    setup_objs = setup()
    client = setup_objs[0]
    num_workers = len(client.scheduler_info()['workers'])

    df = generate_edgelist(scale=10, edgefactor=16, seed=5)

    m_G = cugraph.Graph()
    m_G.from_dask_cudf_edgelist(
        df, source='src', destination='dst', renumber=True, legacy_renum_only=True)

    vertices = m_G.nodes().persist()
    print("number of nodes are ", len(vertices))
    print("number of edges is ", len(m_G.input_df))
    vertices = vertices.compute().reset_index(drop=True).iloc[:10000]


    time_list = []
    for i in range(10):
        print("*****iteration ", i, "*****")
        t0 = time.time()
        results = cugraph.dask.uniform_neighbor_sample(m_G, vertices, [2])
        t1 = time.time()
        total = t1-t0
        time_list.append(total)


    import statistics
    print("the mean is ", statistics.mean(time_list))
    print("the standard deviation is ", statistics.stdev(time_list))
    print("number of GPUs is ", num_workers)
    
    teardown(*setup_objs)

@VibhuJawa
Member Author

@jnke2016 , Thanks for this script. Let me investigate more :-)

@VibhuJawa VibhuJawa changed the title [REVIEW] Optimize dask.uniform_neighbor_sample [WIP] Optimize dask.uniform_neighbor_sample Nov 10, 2022
@VibhuJawa
Member Author

VibhuJawa commented Nov 17, 2022

@jnke2016 ,

Swapping client.submit to client.map is running into issues with work-stealing (see internal thread). Swapping it back gets rid of the hang.

We still see speedups (see the updated benchmarks in the PR description, measured on an 8-GPU cluster), but we will not be as fast as we should be on large clusters. (We currently lose ~5 ms per worker.)

Anyway, I have verified that your script runs on 1 through 8 GPUs with 50 iterations each without hanging.

I will make the cluster-agnostic changes in 23.02, as they are more involved.

Please feel free to review again.
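For readers unfamiliar with the trade-off discussed here: with one `client.submit` call per worker, each task can be pinned via `workers=` and `allow_other_workers=False`, so the scheduler cannot work-steal it onto another GPU. The toy sketch below mimics only that call shape with a stub client (all names are illustrative, not the PR's code; a real `dask.distributed.Client` does the actual scheduling):

```python
class StubClient:
    """Records where each task was pinned; mimics only the call shape."""
    def __init__(self):
        self.placements = []  # (worker_address, result) pairs

    def submit(self, func, *args, workers=None, allow_other_workers=True):
        # With allow_other_workers=False a real scheduler may only run the
        # task on `workers`, so work stealing cannot move it elsewhere.
        assert workers and not allow_other_workers
        result = func(*args)
        self.placements.append((workers[0], result))
        return result

def sample_fn(part):
    # Stand-in for the per-worker sampling call.
    return sum(part)

client = StubClient()
partitions = {"tcp://w0": [1, 2], "tcp://w1": [3, 4]}
for addr, part in partitions.items():
    client.submit(sample_fn, part, workers=[addr], allow_other_workers=False)

print(client.placements)  # [('tcp://w0', 3), ('tcp://w1', 7)]
```

The per-worker submit loop trades a small fixed cost per worker for deterministic placement, which matches the ~5 ms-per-worker overhead mentioned above.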

@VibhuJawa VibhuJawa changed the title [WIP] Optimize dask.uniform_neighbor_sample [REVIEW] Optimize dask.uniform_neighbor_sample Nov 17, 2022
@VibhuJawa VibhuJawa requested review from a team as code owners November 17, 2022 03:59

@VibhuJawa VibhuJawa changed the base branch from branch-22.12 to branch-22.10 November 17, 2022 04:00
@VibhuJawa VibhuJawa changed the base branch from branch-22.10 to branch-22.12 November 17, 2022 04:01
@VibhuJawa
Member Author

@rerun tests

Contributor

@wangxiaoyunNV wangxiaoyunNV left a comment

It looks good.

@wangxiaoyunNV wangxiaoyunNV reopened this Nov 17, 2022
Contributor

@jnke2016 jnke2016 left a comment

This looks good to me.

@BradReesWork
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 34e3dd7 into rapidsai:branch-22.12 Nov 18, 2022

Labels

improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA]: Reduce the dask overhead in uniform_neighbor_sample

6 participants