Speeding up GPU clustering using smarter download strategy and memory allocations by FabianSchuetze · Pull Request #4677 · PointCloudLibrary/pcl

FabianSchuetze · 2021-03-26T15:58:54Z

I stumbled upon the same issue mentioned in #2703 while familiarizing myself with the GPU-related codebase of PCL. A CPU flamegraph showed about 1/3 of the runtime is spent on resizing a vector:

The resize function is located in device_array.hpp. The arrays are constructed inside the loop (line 100) of gpu_extract_clusters.hpp. Pre-allocating the array to the maximum possible size eliminates most of the memory allocations, as the flamegraph of the revised code documents:
I also tried to speed up the CUDA memory copies, which take significant time as well, by using pinned host memory, but that did not lead to a noticeable improvement. The GPU code is still significantly slower than the CPU version, though. I believe this is due to the sequential nature of the program and the data copies between host and device memory.

I would be happy to work more on this issue if further PRs are welcome in this field. Maybe somebody also has an idea for improving GPU-based segmentation or other ideas to work on the GPU-related codebase of PCL!

P.S. The function pcl::gpu::extractEuclideanClusters is verbose by default. Should we maybe change this as well?

mvieth · 2021-03-27T15:56:49Z

Nice idea, and the flame graphs look promising

Could you do some simple benchmarking to find out how much faster than the current implementation your proposal is?
What do you think about moving the creation of the vectors out of the for-loop (moving it to line ~70)? Could that further reduce the number of allocations?
I am not sure the value you use for reserve is the correct one. I think data is queries_host.size() * max_answers large, so up to host_cloud_->size() * max_answers. But that can be very large (several gigabytes), and I wonder how much difference the reserve makes anyway? Some simple benchmarks here would be great as well
Please also move the comment "Host buffer for results"
Yes, if you like, you can also change the logging behaviour, e.g. with the PCL_DEBUG macro
If you like to further work on the GPU clustering, you could check if the value "10" (line 103) is the best option or if other options make the clustering faster. You could also make the value settable as the @todo suggests

FabianSchuetze · 2021-03-27T17:47:17Z

Thank you for the kind and detailed feedback, @mvieth - that was very helpful to me! I will address the comments and update the PR.

FabianSchuetze · 2021-03-29T11:36:34Z

Thank you again for your detailed comments! I tried to address these comments as follows:

With a point cloud of my living room containing 198,835 points, the computation time drops from 9.40 s to 4.23 s. Similarly, for the point cloud "five_people.pcd" from PCL's test folder (comprising 307,200 points), the duration falls from 10.63 s to 7.09 s.
Moving the vectors one scope up and not calling reserve was an excellent suggestion - thank you! With the persistent vector, calling resize leads to few memory allocations.
I moved the comment "Host buffer for results" and moved the logging into the PCL_DEBUG macro.
I also experimented with the threshold for offloading the computation to the GPU and compared the values 10, 50, 100, and 1000. This is a losing battle for the GPU at the moment: increasing the threshold reduces total computation time. As documented by the flamegraph, the latency of copying data between host and device biases the comparison. I would increase the threshold to 100, but I am not sure if the comparison is entirely fair.
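To illustrate why the persistent vector needs so few allocations, here is a minimal host-only sketch. It is not the PR's code: the function name and the `sizes_per_iteration` input are hypothetical and only stand in for the per-iteration result sizes. The sketch counts how often the capacity of a vector changes when the vector lives outside the loop and is resized once per iteration.

```cpp
#include <cstddef>
#include <vector>

// Counts how often a persistent vector has to reallocate when it is
// resized once per loop iteration. A capacity change implies a reallocation.
// `sizes_per_iteration` is a hypothetical stand-in for the result sizes.
std::size_t count_reallocations_persistent(
    const std::vector<std::size_t>& sizes_per_iteration)
{
  std::vector<int> data;  // declared once, outside the loop
  std::size_t reallocations = 0;
  for (std::size_t n : sizes_per_iteration) {
    const std::size_t old_capacity = data.capacity();
    data.resize(n);  // shrinking keeps the capacity; only growing may reallocate
    if (data.capacity() != old_capacity)
      ++reallocations;
  }
  return reallocations;
}
```

For sizes such as 1000, 10, 500, 1000, 20 the vector allocates only for the first resize, whereas a vector constructed inside the loop would allocate in every iteration.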

I hope this addresses all the comments you made and that I did not forget anything. Is there anything else you suggest looking at? I also thought more about the problem and have two other ideas for improving it:

According to the flamegraph, 9 percent of the time is spent in malloc inside the create function of the DeviceArray. The create function destroys (if needed) memory and fully allocates new memory again. We have to call the create function because the device array is always local to the scope and because the device array does not have a resize function. Such a resize function, for example, is also part of thrust. Do you think it would be interesting for us to expand the API of the DeviceArray to permit resizing the array?

Apart from resizing the array, copying data from the device to the host requires the most time (about 2/3 in the current form). Looking at lines 149ff, I think we could avoid copying the data as the only relevant computation is obtaining the indices from data[qp_r + qp * max_answers]. I guess that we could do this all on the GPU threads.
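The first idea, a resize that reuses existing memory, could look roughly like the following host-side model. This is only a sketch under stated assumptions: `BufferModel` is a hypothetical class, not pcl::gpu::DeviceArray, and `new[]` stands in for the device allocation. Contents are not preserved when the buffer grows, which keeps the example short.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical host-side model of a buffer offering create() and resize().
// It is NOT pcl::gpu::DeviceArray; new[] stands in for a device allocation.
class BufferModel {
public:
  // Mirrors the behaviour described above: always release and allocate again.
  void create(std::size_t n)
  {
    storage_.reset(new int[n]);
    capacity_ = n;
    size_ = n;
    ++allocations_;
  }

  // Reuses the existing storage whenever it is already large enough.
  void resize(std::size_t n)
  {
    if (n > capacity_) {
      create(n);  // growing: contents are not preserved in this sketch
      return;
    }
    size_ = n;
  }

  std::size_t size() const { return size_; }
  std::size_t allocations() const { return allocations_; }

private:
  std::unique_ptr<int[]> storage_;
  std::size_t capacity_ = 0;
  std::size_t size_ = 0;
  std::size_t allocations_ = 0;
};
```

Calling resize with 100, 10, 100, and 50 elements allocates once in this model, while four calls to create would allocate four times.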

Do you think any of these ideas are worthwhile to pursue? Or would you alternatively suggest working on other areas of the GPU section?

FabianSchuetze · 2021-03-29T14:57:35Z

I'm sorry to see the build partially failing. Could somebody kindly help me understand why that happened? I looked through the logs and found only one statement that might report an error:

2021-03-29T12:42:13.4574189Z   LINK : the 32-bit linker (C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX86\x86\link.exe) ran out of heap space and is going to restart linking with a 64-bit linker
2021-03-29T12:42:13.5194291Z   LINK : restarting link with 64-bit linker `C:\Program Files\Git\usr\bin\link.exe'
2021-03-29T12:42:18.9432564Z   /c/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.16.27023/bin/HostX86/x86/link: cannot create link 'Â â– /' to '/ERRORREPORT:QUEUE': No such file or directory

However, the build continues for a while afterwards and I suspect I misinterpret the error. I am grateful for any hints or suggestions to understand the build results!

mvieth · 2021-03-29T15:09:17Z

So a speed-up of 2.2 and 1.5 respectively - that is really nice.
Regarding the offloading threshold: can you give an estimate of how much 10 vs 100 changes? I made some quick tests and it seems to barely change the time.
Regarding adding a resize function to DeviceArray: you can try that out if you like, but I don't know how much impact it would make if it only accounts for 9% of the time spent.
Regarding copying the data from device to host: I am not sure how you would completely avoid copying the data - by calling ptr() and accessing the data from there? Not sure if that works/improves the performance, but it might be worth a try. Another idea to reduce copying: there are many elements in data that are copied but never accessed, since there are blocks of size max_answers that might not be filled completely. If we first downloaded sizes, and then somehow only downloaded those chunks of data that we actually need, that might improve performance.
@larshg If you like, have a look at our discussion. Since you worked on the GPU clustering before, you might have a good suggestion
P.S.: Regarding the failed test: the Windows CI sometimes runs out of memory - I restarted it and the check should probably pass now.

FabianSchuetze · 2021-03-29T17:33:18Z

Thank you very much, @mvieth, for your detailed and thoughtful comments - they again helped me understand the problem much better!

I'm surprised you did not experience a significant change in run-time. With my configuration, the run-time changed almost monotonically. I will report detailed benchmarks again.

Your comment that we download more data than necessary was striking to me; I did not realize this at first. For a sampled iteration of the nested for loop, the selected indices qp_r + qp * max_answers are: 0, 1, ..., 20, 25000, 25001, ..., 25024, 50000, .... Although only a small fraction of the indices of data are accessed, we download the entire array. We could indeed try to filter the array on the device before downloading it. That is a fascinating challenge!
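The access pattern above (21 used entries in the first block, 25 in the second, with a stride of 25000) suggests copying only the filled prefix of each block. The following is a minimal host-only sketch of that idea, not the PR's implementation: `gather_filled` is a hypothetical name, and plain host vectors stand in for the device buffers, so each `insert` corresponds to what would be one small device-to-host copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies only the filled prefix of each block of `max_answers` entries.
// `data` holds one block per query; `sizes[q]` is the number of valid
// entries in block q. Host vectors stand in for the device buffers.
std::vector<int> gather_filled(const std::vector<int>& data,
                               const std::vector<int>& sizes,
                               std::size_t max_answers)
{
  assert(data.size() >= sizes.size() * max_answers);
  std::vector<int> compact;
  for (std::size_t q = 0; q < sizes.size(); ++q) {
    assert(sizes[q] >= 0 && static_cast<std::size_t>(sizes[q]) <= max_answers);
    // Start of block q inside the large, mostly unused buffer.
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(q * max_answers);
    compact.insert(compact.end(), first, first + sizes[q]);
  }
  return compact;
}
```

With the numbers quoted above, the first two blocks would contribute 46 integers instead of 50,000.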

P.S. Thanks for running the build again - it's reassuring to see that my computer is not the only one that runs out of memory when building PCL with all available cores.

larshg · 2021-03-29T20:43:39Z

I'm just testing it out as well.
However after the offloading threshold was set to 100, I didn't get any clusters. Seems like the CPU version is faulty here:

pcl/gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp

Lines 105 to 110 in 97cb436

    
    std::cout << " CPU: ";
    for(std::size_t p = 0; p < queries_host.size (); p++)
    {
      // Execute the radiusSearch on the host
      tree->radiusSearchHost(queries_host[p], tolerance, data, max_answers);
    }

For each query point, the radius search is performed, but the results are not accumulated, and hence it never finds clusters of the required size.

Also, it seems, with my preliminary tests of setting it to 200 (no GPU activity with the point cloud / parameters I use), the "CPU implementation of the GPU algorithm" is faster than the Euclidean clustering in pcl_segmentation (using a K-D tree).

Some more investigation is required.

How is #4506 progressing, @mvieth? :)

FabianSchuetze · 2021-03-30T06:00:41Z

Thanks for the detailed and very helpful comments, @larshg! Your comments about the difference between the CPU and GPU version are indeed intriguing and I am happy to investigate it more thoroughly!

FabianSchuetze · 2021-03-30T13:17:22Z

@larshg Thank you again for pointing out that the obtained clusters are a function of the GPU offloading threshold! I did verify this, which left me puzzled for a while. I adjusted the scope of the loop you pointed out, but that didn't help. I will try to figure out what is going on and then write again.

@mvieth I refrained from benchmarking the run-time of the algorithm as a function of the offloading threshold. I will first try to understand why changing the threshold causes the results to differ and then measure run-time performances.

@kunaltyagi Thanks for the comments! I did adjust the code in response to your observations, and I think it improves the code.

To all three of you: Thank you so much for taking a look at this PR and for guiding me through the process - this is indeed a pleasure!

FabianSchuetze · 2021-04-03T10:05:09Z

First: Thank you all again for the excellent feedback so far! It is really a pleasure to work on this.

I continued working on the issue and made progress on the program's efficiency. As @mvieth suggested, we currently download potentially too much data from the device to host. A more judicious download gives us speed gains of a factor of 3. I updated the PR to document the ideas in code. However, I do not think this code is pretty, and I am not sure if it's a good idea to merge it. To keep the discussion focused, I opened another issue to discuss expanding the API of the DeviceArray to allow users to download data more effectively.

Here are some performance benchmarks, taken on the PCD test data with the CPU branch of the GPU code turned off (i.e. if (queries_host.size() <= 0) go to CPU; else GPU):

Dataset	rops_cloud (804k)	bunny (11K)	cturtle (771K)
CPU	1.31	0.01	3.18
GPU	0.49	0.08	1.63
Master branch	1.61	0.07	8.63

The flamegraphs reflect the timing updates too. The graph still highlights further possibilities for improvement of the GPU version. However, the CPU code dominates the timing for the first time :-).

I realize that the problem of different results depending on the threshold parameter persists! Thanks again for highlighting this, @larshg! Although I am thinking about it, I am still groping in the dark a bit, but I will keep trying. What do you think about the changes, and what are open questions for you?

P.S. I cannot use a few of the test datasets because FLANN fails on the CPU, namely table_scene_mug_stereo_textured, office1d, and five_people. The (abbreviated) error is a failed assertion with the message "Invalid (NaN, Inf) point coordinates given to radiusSearch!". I looked for an existing issue but didn't find one. Does somebody know of one?

FabianSchuetze · 2021-04-04T14:32:35Z

Ha! I think things are falling in place!

Thank you again, @larshg, for identifying that the number of clusters changes as the offloading parameter changes. As you speculated, not all indices found inside the method radiusSearchHost were propagated through the program. The search method clears the vector of previously found indices, and thus these indices got lost. I now obtain the same results on the dataset five_people for a broad range of offloading parameters. @larshg, can you kindly check the new program on the dataset on which you first identified the problem? That would be wonderful!
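The lost-indices problem and its fix can be sketched on the host as follows. This is an illustration under stated assumptions, not the PR's code: `search_one` is a hypothetical stand-in for a search that clears its output vector on every call, and its fake "neighbours" are made up for the example.

```cpp
#include <vector>

// Hypothetical stand-in for a search that, as described above, clears the
// output vector before writing the neighbours of a single query.
void search_one(int query, std::vector<int>& out)
{
  out.clear();
  out.push_back(query);      // made-up neighbours, for illustration only
  out.push_back(query + 1);
}

// Appends each query's result to `all` before the next call overwrites
// `scratch`, so no indices are lost between iterations.
std::vector<int> search_all(const std::vector<int>& queries)
{
  std::vector<int> all;
  std::vector<int> scratch;  // reused across iterations
  for (int q : queries) {
    search_one(q, scratch);
    all.insert(all.end(), scratch.begin(), scratch.end());
  }
  return all;
}
```

Without the `insert` step, only the neighbours of the last query would survive the loop.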

larshg · 2021-04-04T15:45:16Z

Yes, that's pretty much the same thing I thought would be required. Not sure if you can use a move_inserter to avoid making a copy?

And maybe use pcl::Indices instead of std::vector if searchHost returns indices? (It is currently defined as the same type, but it makes it easier to change the index type later.)

FabianSchuetze · 2021-04-05T14:29:24Z

Thank you, @kunaltyagi, for reviewing the PR - that was very helpful indeed! I updated the PR accordingly.

Thank you also very much, @larshg! I replaced all instances of std::vector<int> with pcl::Indices. I think this is a better solution - thank you! Moving the elements of a vector<int> induces a copy because the values from one contiguous memory location have to be placed in the other contiguous memory location. If we had a vector of strings (or any other type that manages a pointer to a memory location), move inserters would be perfect.
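A small sketch of this point, assuming "move inserter" refers to something like std::make_move_iterator from the standard library: appending int elements through move iterators still copies every value, and the source vector is left unchanged. The function name `append_with_move` is hypothetical.

```cpp
#include <iterator>
#include <vector>

// Appends the elements of `src` to `dst` through move iterators.
// For int elements a "move" is a plain copy, so `src` keeps its values.
std::vector<int> append_with_move(std::vector<int>& src, std::vector<int> dst)
{
  dst.insert(dst.end(),
             std::make_move_iterator(src.begin()),
             std::make_move_iterator(src.end()));
  return dst;
}
```

For element types that own heap memory, such as std::string, the same call would transfer each element's buffer instead of duplicating it.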

What are your opinions about placing the cuda interaction into the DeviceArray class? I think the following lines of code could be wrapped inside a download function of DeviceArray:

#include <cuda_runtime_api.h>
#include <cuda.h>
...
const std::size_t bytes = (sizes[qp]) * sizeof(int);
cudaMemcpy(&tmp[0], pdata, bytes, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();

I think all these lines could be placed inside a function overloading the download member function of the DeviceArray. I tried to describe it in a ticket #4689 .

FabianSchuetze · 2021-04-06T17:50:14Z

Thank you, @kunaltyagi , for your comments - they were very helpful for me!

kunaltyagi

Just some thoughts about 1 final memory allocation

kunaltyagi · 2021-04-11T11:36:29Z

+      for(int idx : data)
        {
-          for(int qp_r = 0; qp_r < sizes[qp]; qp_r++)
-          {
-            if(processed[data[qp_r + qp * max_answers]])
-              continue;
-            processed[data[qp_r + qp * max_answers]] = true;
-            queries_host.push_back ((*host_cloud_)[data[qp_r + qp * max_answers]]);
-            found_points++;
-            r.indices.push_back(data[qp_r + qp * max_answers]);
-          }
-        }
+        if(processed[idx])
+          continue;
+        processed[idx] = true;
+        queries_host.push_back ((*host_cloud_)[idx]);
+        found_points++;
+        r.indices.push_back(idx);
      }


The braces need formatting

Oh, sorry, I tried to fix it! Is it correct now? I have a question though: when I clang-format the file gpu_extract_clusters.hpp, many lines of the file change. Am I doing something wrong, or does the same happen to you too?

gpu module isn't formatted yet. So your experience is correct. The plugin I use allows me to format only a selection of lines instead of all the lines in the file.

You can either:

use similar settings for some plugin in your fav editor

create an additional formatting commit (at the start or end) so we can skip that/select that easily during code-review

Ha - thanks for pointing out this feature in vim clang-format! I always only formatted the entire buffer, never a few lines. Shall I maybe make a batch PR clang-formatting the GPU codebase?

kunaltyagi

LGTM otherwise :)

Thanks @FabianSchuetze for bearing with us

FabianSchuetze · 2021-04-12T16:33:24Z

LGTM otherwise :)

Thanks @FabianSchuetze for bearing with us

It was a pleasure to work on it - thank you so much for your support @kunaltyagi !

I have another question as this issue comes to an end: I would like to continue working on the GPU code. Issues #4443 or #2218 seem interesting to me. Alternatively, the GPU code seems to lack a correspondence estimation, and I would be happy to work on this. I would be thrilled to see a GPU version of the ICP algorithm and think this should be the next step. Do you have an idea of which feature/issue to prioritize?

FabianSchuetze · 2021-06-22T10:50:23Z

Phew! I think we are getting closer to an end. Thanks to the expansion of the device array API, we can avoid the CUDA calls in the segmentation module spotted by Lars.

kunaltyagi

LGTM

larshg · 2021-06-22T14:01:50Z

Apart from 3 really minor things, LGTM.

FabianSchuetze · 2021-06-23T17:18:48Z

Thank you, Kunal and Lars, for the review! Lars, you are eagle-eyed! I did apply most of the changes you suggested - thank you.

larshg · 2021-06-24T13:44:03Z

Just tested it. I went from something like 38 seconds, loads of spam, and faulty clusters to 8 seconds, minimal spam, and correct clusters, with my test point cloud of about 120k points that is segmented into 28 clusters of 3,000-10,000 points each.
However, the CPU version using a Kd-tree still does it in about 300 ms.

I noticed that this line is still verbose:

pcl/gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp

Line 210 in 2675643

std::cout << "INFO: end of extractEuclideanClusters " << std::endl;

Could you fix this one as well 😄 ?

FabianSchuetze · 2021-06-25T17:11:39Z

Thank you, Lars, for testing the program and for your feedback. Your message is bittersweet! I'm glad the results of the program itself were OK. Nevertheless, it would be wonderful to have a GPU version that performs faster than the CPU version. I will take a look at some GPU KDTree implementations for inspiration. Anyway - the noisy info got banished to the PCL_DEBUG macro, and it shan't be seen anymore during normal operation. Thanks for the feedback!

kunaltyagi · 2021-06-26T07:52:57Z

@mvieth Do you want to take a look? Or we can go ahead and squash-merge this

FabianSchuetze · 2021-06-27T09:42:50Z

Thanks for approving, Lars and Kunal!

mvieth · 2021-06-27T15:14:10Z

I had a quick look over the code, everything seems fine. And it definitely promises a great speedup, even if it might still not be able to compete with the pure CPU version. Thank you for working on this!

… allocations (PointCloudLibrary#4677) * remove costly memory allocation * addresses comments * stylistic changes * economical download of data from device to host * tries to resolve bug of different cluster sizes * removed comments and address PR review * try to address review comments * exploiting symmetry * formatting and auto * placed function in source file * placed function again in namespace pcl::detail * moved declaration to hpp file * compatible with new device array api * remove duplicate function - compiles but segfault * runs without segfault * cosmetic changes * removed noisy info * Add newline for the debug macro Co-authored-by: Kunal Tyagi <tyagi.kunal@live.com>

mvieth added changelog: enhancement Meta-information for changelog generation module: gpu labels Mar 27, 2021

larshg reviewed Mar 29, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

kunaltyagi reviewed Mar 30, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp

FabianSchuetze mentioned this pull request Apr 3, 2021

[GPU::DeviceArray] Expand the API of the DeviceArray to ease data transfer and allocations #4689

Open

kunaltyagi reviewed Apr 5, 2021

kunaltyagi linked an issue Apr 5, 2021 that may be closed by this pull request

gpu seg is toooo..o slower than cpu #2703

Closed

kunaltyagi changed the title ~~Remove costly memory allocation for GPU related clustering [Tries to address Issue #2703]~~ Remove costly memory allocation to speed up GPU clustering Apr 5, 2021

kunaltyagi requested changes Apr 6, 2021

kunaltyagi reviewed Apr 7, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

kunaltyagi reviewed Apr 10, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

kunaltyagi reviewed Apr 11, 2021

kunaltyagi reviewed Apr 12, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

FabianSchuetze mentioned this pull request Apr 12, 2021

GPU code is slower than CPU #2871

Closed

kunaltyagi linked an issue Apr 13, 2021 that may be closed by this pull request

GPU code is slower than CPU #2871

Closed

runs without segfault

e778950

kunaltyagi previously approved these changes Jun 22, 2021

larshg reviewed Jun 22, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/gpu_extract_clusters.h Outdated

larshg reviewed Jun 22, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

larshg reviewed Jun 22, 2021

Comment thread gpu/segmentation/src/extract_clusters.cpp

cosmetic changes

2675643

FabianSchuetze dismissed kunaltyagi’s stale review via 2675643 June 23, 2021 17:13

kunaltyagi previously approved these changes Jun 24, 2021

removed noisy info

4c84905

FabianSchuetze dismissed kunaltyagi’s stale review via 4c84905 June 25, 2021 17:03

larshg previously approved these changes Jun 25, 2021

kunaltyagi reviewed Jun 26, 2021

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated

Add newline for the debug macro

2e952bd

kunaltyagi dismissed larshg’s stale review via 2e952bd June 26, 2021 03:22

kunaltyagi approved these changes Jun 26, 2021

kunaltyagi requested a review from larshg June 26, 2021 03:22

larshg approved these changes Jun 26, 2021

FabianSchuetze mentioned this pull request Jun 27, 2021

[custom] Possibilitiy of adding fast KDtree (or other clustering algorithm) to the GPU module? #4817

Open

mvieth merged commit f9927b9 into PointCloudLibrary:master Jun 27, 2021

FabianSchuetze mentioned this pull request Jun 28, 2021

Clang format gpu/segmentation #4819

Merged
