
Speeding up GPU clustering using smarter download strategy and memory allocations #4677

Merged: mvieth merged 18 commits into PointCloudLibrary:master from FabianSchuetze:remove_memory_allocation on Jun 27, 2021

Conversation

@FabianSchuetze
Contributor

I stumbled upon the same issue mentioned in #2703 while familiarizing myself with the GPU-related codebase of PCL. A CPU flamegraph showed that about 1/3 of the runtime is spent on resizing a vector:
[Image: perf_orig]
The resize function is located in device_array.hpp. The arrays are constructed in the loop (line 100) of gpu_extract_clusters.hpp. Pre-allocating the array to the maximum possible size mostly eliminates the memory allocations, as documented by the flamegraph of the revised code:
[Image: perf]

I tried to optimize the CUDA memory copies, which also take significant time, by using pinned host memory, but that did not lead to a noticeable improvement. The GPU code is still significantly slower than the CPU version, though. I believe this is due to the sequential nature of the program and the data copies between host and device memory.

I would be happy to work more on this issue if further PRs are welcome in this field. Maybe somebody also has an idea for improving GPU-based segmentation or other ideas to work on the GPU-related codebase of PCL!

P.S. The function pcl::gpu::extractEuclideanClusters is verbose by default. Should we also change this?

@mvieth mvieth added the changelog: enhancement and module: gpu labels Mar 27, 2021
@mvieth
Member

mvieth commented Mar 27, 2021

Nice idea, and the flame graphs look promising

  • Could you do some simple benchmarking to find out how much faster than the current implementation your proposal is?
  • What do you think about moving the creation of the vectors out of the for-loop (moving it to line ~70)? Could that further reduce the number of allocations?
  • I am not sure the value you use for reserve is the correct one. I think data is queries_host.size() * max_answers large, so up to host_cloud_->size() * max_answers. But that can be very large (several gigabytes), and I wonder how much difference the reserve makes anyway? Some simple benchmarks here would be great as well
  • Please also move the comment "Host buffer for results"
  • Yes, if you like, you can also change the logging behaviour, e.g. with the PCL_DEBUG macro
  • If you like to further work on the GPU clustering, you could check if the value "10" (line 103) is the best option or if other options make the clustering faster. You could also make the value settable as the @todo suggests
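
To illustrate the second bullet, here is a minimal sketch of why hoisting a host vector out of the loop can cut allocations. It is not the PR's code: the function name countAllocations and the size sequence are made up for illustration, and a change in capacity() is used as a proxy for a heap allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often a std::vector<int> acquires a new heap block while a loop
// repeatedly needs a buffer of varying size (hypothetical helper).
// hoist == false: a fresh vector is constructed in every iteration.
// hoist == true : one vector lives outside the loop and is only resized.
std::size_t countAllocations(const std::vector<std::size_t>& sizes, bool hoist)
{
  std::size_t allocations = 0;
  std::vector<int> persistent;  // survives across iterations
  for (const std::size_t n : sizes) {
    if (hoist) {
      const std::size_t old_capacity = persistent.capacity();
      persistent.resize(n);  // reuses the existing block when n <= capacity
      assert(persistent.size() == n);
      if (persistent.capacity() != old_capacity)
        ++allocations;
    }
    else {
      std::vector<int> local(n);  // allocates whenever n > 0
      if (local.capacity() > 0)
        ++allocations;
    }
  }
  return allocations;
}
```

With the sizes 1000, 200, 800, 1000, 50, the per-iteration vector allocates five times, while the hoisted one allocates only once, because shrinking a std::vector never releases its capacity.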

@FabianSchuetze
Contributor Author

Thank you for the kind and detailed feedback, @mvieth - that was very helpful to me! I will address the comments and update the PR.

@FabianSchuetze
Contributor Author

Thank you again for your detailed comments! I tried to address these comments as follows:

  • With a pointcloud from my living room containing 198,835 points, the computation time drops from 9.40 s to 4.23 s. Similarly, for the pointcloud "five_people.pcd" from PCL's test folder (comprising 307,200 points), the duration falls from 10.63 s to 7.09 s.
  • Moving the vectors one scope up and not calling reserve was an excellent suggestion - thank you! With the persistent vector, calling resize leads to few memory allocations.
  • I did move the comment "Host buffer for results" and moved the logging into the PCL_DEBUG macro.
  • I also experimented with changing the threshold for offloading the computation to the GPU and compared values of 10, 50, 100, and 1000. This is a losing battle for the GPU at the moment: increasing the threshold reduces total computation time. As documented by the flamegraph, the latency of copying data between host and device biases the comparison. I would increase the threshold to 100, but I am not sure if the comparison is entirely fair.

I hope this addresses all the comments you made and that I did not forget anything. Is there anything else you suggest looking at? I also thought more about the problem and have two other ideas for improving it:

According to the flamegraph, 9 percent of the time is spent in malloc inside the create function of the DeviceArray. The create function destroys (if needed) memory and fully allocates new memory again. We have to call the create function because the device array is always local to the scope and because the device array does not have a resize function. Such a resize function, for example, is also part of thrust. Do you think it would be interesting for us to expand the API of the DeviceArray to permit resizing the array?
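
A rough host-side sketch of the resize idea, to make the difference to create concrete. GrowOnlyBuffer is a hypothetical stand-in, not PCL's DeviceArray; its create only mimics the behaviour described above (release, then allocate again), and ordinary heap memory takes the place of device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical host-side stand-in for a device buffer; not PCL's DeviceArray.
class GrowOnlyBuffer {
public:
  // Mimics the behaviour described above: drop the old block, allocate anew.
  void create(std::size_t n)
  {
    data_.reset(new int[n]());
    size_ = capacity_ = n;
    ++allocations_;
  }

  // Sketched alternative: reallocate only when the request exceeds the
  // capacity. Contents are not preserved on growth, matching create().
  void resize(std::size_t n)
  {
    if (n > capacity_)
      create(n);
    else
      size_ = n;
    assert(size_ <= capacity_);
  }

  std::size_t size() const { return size_; }
  std::size_t allocations() const { return allocations_; }

private:
  std::unique_ptr<int[]> data_;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
  std::size_t allocations_ = 0;
};
```

For the request sequence 500, 100, 400, 500, calling create each time allocates four times, whereas resize allocates once. Whether the same saving carries over to device memory would have to be measured.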

Apart from resizing the array, copying data from the device to the host requires the most time (about 2/3 in the current form). Looking at lines 149ff, I think we could avoid copying the data as the only relevant computation is obtaining the indices from data[qp_r + qp * max_answers]. I guess that we could do this all on the GPU threads.

Do you think any of these ideas are worthwhile to pursue? Or would you alternatively suggest working on other areas of the GPU section?

@FabianSchuetze
Contributor Author

I'm sorry to see the build partially failing. Could somebody kindly help me understand why that happened? I looked through the logs and found only one statement that might report an error:

2021-03-29T12:42:13.4574189Z   LINK : the 32-bit linker (C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Tools\MSVC\14.16.27023\bin\HostX86\x86\link.exe) ran out of heap space and is going to restart linking with a 64-bit linker
2021-03-29T12:42:13.5194291Z   LINK : restarting link with 64-bit linker `C:\Program Files\Git\usr\bin\link.exe'
2021-03-29T12:42:18.9432564Z   /c/Program Files (x86)/Microsoft Visual Studio/2017/Enterprise/VC/Tools/MSVC/14.16.27023/bin/HostX86/x86/link: cannot create link 'Â â– /' to '/ERRORREPORT:QUEUE': No such file or directory

However, the build continues for a while afterwards, and I suspect I am misinterpreting the error. I am grateful for any hints or suggestions to understand the build results!

@mvieth
Member

mvieth commented Mar 29, 2021

So a speed-up of 2.2 and 1.5 respectively - that is really nice.
Regarding the offloading threshold: can you give an estimate of how much 10 vs 100 changes? I made some quick tests and it seems to barely change the time.
Regarding adding a resize function to DeviceArray: you can try that out if you like, but I don't know how much impact it would make if it only accounts for 9% of the time spent.
Regarding copying the data from device to host: I am not sure how you would completely avoid copying the data - by calling ptr() and accessing the data from there? Not sure if that works/improves the performance, but might be worth a try. Another idea to reduce copying: there are many elements in data that are copied, but never accessed, since there are blocks of size max_answers that might not be filled completely. If we would first download sizes, and then somehow only download those chunks of data that we actually need, that might improve performance.
@larshg If you like, have a look at our discussion. Since you worked on the GPU clustering before, you might have a good suggestion
P.S.: Regarding the failed test: the Windows CI sometimes runs out of memory - I restarted it and the check should probably pass now.

@FabianSchuetze
Contributor Author

Thank you very much, @mvieth, for your detailed and thoughtful comments - they again helped me understand the problem much better!

I'm surprised you did not experience a significant change in run-time. With my configuration, the run-time changed almost monotonically. I will report detailed benchmarks again.

Your comment that we download more data than necessary was striking to me; I did not realize this at first. For a sampled iteration of the nested for loop, the selected indices qp_r + qp * max_answers are: 0, 1, ..., 20, 25000, 25001, ..., 25024, 50000, .... Although only a small fraction of the indices of data are accessed, we download the entire array. We could indeed try to filter the array on the device before downloading it. That is a fascinating challenge!
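
The access pattern can be sketched on the host as follows. This is only an illustration under assumptions: gatherValid is a hypothetical helper, plain vectors stand in for the device buffers, and the range insert stands in for a per-block device-to-host copy of sizes[qp] elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// data is laid out in blocks of max_answers entries, one block per query.
// Only the first sizes[qp] entries of block qp are valid, so only those are
// copied into the compact result (hypothetical helper, host-side only).
std::vector<int> gatherValid(const std::vector<int>& data,
                             const std::vector<int>& sizes,
                             std::size_t max_answers)
{
  assert(data.size() >= sizes.size() * max_answers);
  std::vector<int> compact;
  for (std::size_t qp = 0; qp < sizes.size(); ++qp) {
    const auto count = static_cast<std::size_t>(sizes[qp]);
    assert(count <= max_answers);
    const auto block =
        data.begin() + static_cast<std::ptrdiff_t>(qp * max_answers);
    // Copy only the filled part of the block, skipping the unused tail.
    compact.insert(compact.end(), block,
                   block + static_cast<std::ptrdiff_t>(count));
  }
  return compact;
}
```

With max_answers = 4, sizes = {2, 1, 3} and twelve entries in data, only six values are copied instead of twelve. The saving grows with the gap between max_answers and the typical number of neighbours.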

P.S. Thanks for running the build again - it's reassuring to see that not only my computer runs out of memory when building PCL using all possible cores.

@larshg
Contributor

larshg commented Mar 29, 2021

I'm just testing it out as well.
However after the offloading threshold was set to 100, I didn't get any clusters. Seems like the CPU version is faulty here:

std::cout << " CPU: ";
for(std::size_t p = 0; p < queries_host.size (); p++)
{
// Execute the radiusSearch on the host
tree->radiusSearchHost(queries_host[p], tolerance, data, max_answers);
}

For each query point, the radius search is performed, but the results are not accumulated; hence, it never finds clusters of the required sizes.

Also, it seems from my preliminary tests with the threshold set to 200 (no GPU activity with the pointcloud / parameters I use) that the "CPU implementation of the GPU algorithm" is faster than the Euclidean clustering in pcl_segmentation (using a K-D tree).

Some more investigation is required.

How is #4506 progressing, @mvieth? :)

@FabianSchuetze
Contributor Author

Thanks for the detailed and very helpful comments, @larshg! Your comments about the difference between the CPU and GPU version are indeed intriguing and I am happy to investigate it more thoroughly!

@FabianSchuetze
Contributor Author

@larshg Thank you again for pointing out that the obtained clusters are a function of the GPU offloading threshold! I did verify this, which left me puzzled for a while. I adjusted the scope of the loop you pointed out, but that didn't help. I will try to figure out what is going on and then write again.

@mvieth I refrained from benchmarking the run-time of the algorithm as a function of the offloading threshold. I will first try to understand why changing the threshold causes the results to differ and then measure run-time performances.

@kunaltyagi Thanks for the comments! I did adjust the code in response to your observations, and I think it improves the code.

To all three of you: Thank you so much for taking a look at this PR and for guiding me through the process - this is indeed a pleasure!

@FabianSchuetze
Contributor Author

FabianSchuetze commented Apr 3, 2021

First: Thank you all again for the excellent feedback so far! It is really a pleasure to work on this.

I continued working on the issue and made progress on the program's efficiency. As @mvieth suggested, we currently download potentially too much data from the device to host. A more judicious download gives us speed gains of a factor of 3. I updated the PR to document the ideas in code. However, I do not think this code is pretty, and I am not sure if it's a good idea to merge it. To keep the discussion focused, I opened another issue to discuss expanding the API of the DeviceArray to allow users to download data more effectively.

Here are some performance benchmarks (taken for the pcd test data)
(with the CPU version of the GPU part turned off, i.e. if (queries_host.size() <= 0) go to CPU; else GPU)

Dataset         rops_cloud (804k)   bunny (11K)   cturtle (771K)
CPU             1.31                0.01          3.18
GPU             0.49                0.08          1.63
Master branch   1.61                0.07          8.63

The flamegraphs reflect the timing updates too. The graph still highlights further possibilities for improvement of the GPU version. However, the CPU code dominates the timing for the first time :-).
[Image: perf (flamegraph)]

I realize that the problem of different results depending on the threshold parameter persists! Thanks again for highlighting this, @larshg! Although I am thinking about it, I am still groping in the dark a bit but will keep trying. What do you think about the changes, and what are open questions for you?

P.S. I cannot use a few of the test datasets because FLANN fails on the CPU. These are table_scene_mug_stereo_textured, office1d, and five_people. The (abbreviated) error is "Invalid (NaN, Inf) point coordinates given to radiusSearch!"' failed. I have looked for an existing issue but didn't find one. Does anybody know of one?

@FabianSchuetze
Contributor Author

Ha! I think things are falling in place!

Thank you again, @larshg, for identifying that the number of clusters changes as the offloading parameter changes. As you speculated, not all indices found inside the method radiusSearchHost got propagated through the program. The search method clears the vector of previously found indices, and thus these indices got lost. I now obtain the same results on the dataset five_people for a broad range of offloading parameters. @larshg, can you kindly check the new program on the dataset on which you identified the problem at first? That would be wonderful!
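
The pitfall and the fix can be sketched in a few lines. searchStub below is a made-up stand-in for a search routine that clears its output vector, as described above; it is not radiusSearchHost, and the neighbour values are arbitrary.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a radius search: like the behaviour described
// above, it clears `out` before writing the neighbours of `query`.
void searchStub(int query, std::vector<int>& out)
{
  out.clear();
  out.push_back(query);
  out.push_back(query + 100);
}

// Appends each per-query result to a separate accumulator, so results from
// earlier queries survive the clear() inside the search.
std::vector<int> searchAll(const std::vector<int>& queries)
{
  std::vector<int> all;
  std::vector<int> per_query;  // reused scratch buffer
  for (const int q : queries) {
    searchStub(q, per_query);
    all.insert(all.end(), per_query.begin(), per_query.end());
  }
  assert(all.size() == 2 * queries.size());
  return all;
}
```

Passing the same vector to every search call without the accumulator would leave only the neighbours of the last query, which matches the symptom of clusters never reaching the required size.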

@larshg
Contributor

larshg commented Apr 4, 2021

Yes, that's pretty much the same thing I thought would be required. Not sure if you can use a move_inserter to avoid making a copy?

And maybe use pcl::Indices (it's currently defined as the same type, but it makes it easier to change the type of indices) instead of std::vector if searchHost returns indices?

@kunaltyagi kunaltyagi linked an issue Apr 5, 2021 that may be closed by this pull request
@kunaltyagi kunaltyagi changed the title Remove costly memory allocation for GPU related clustering [Tries to address Issue #2703] Remove costly memory allocation to speed up GPU clustering Apr 5, 2021
@FabianSchuetze
Contributor Author

Thank you, @kunaltyagi, for reviewing the PR - that was very helpful indeed! I updated the PR accordingly.

Thank you also very much, @larshg! I replaced all instances of std::vector<int> with pcl::Indices. I think this is a better solution - thank you! Moving the elements of a vector<int> still induces a copy because the values from one contiguous memory location have to be placed in the other contiguous memory location. If we had a vector of strings (or any other type that manages a pointer to a memory location), move inserters would be perfect.
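
A small sketch of the point about moving: appending through move iterators compiles for any element type, but for int it is just a copy. appendByMove is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Appends the elements of src to dst through move iterators. For element
// types that own heap memory (e.g. std::string) this can steal the buffers;
// for int a "move" is an ordinary copy of the value.
template <typename T>
void appendByMove(std::vector<T>& dst, std::vector<T>& src)
{
  const std::size_t expected = dst.size() + src.size();
  dst.insert(dst.end(),
             std::make_move_iterator(src.begin()),
             std::make_move_iterator(src.end()));
  assert(dst.size() == expected);
}
```

After appending a vector<int>, the source still holds its original values, which shows that nothing was saved. With a vector of strings, the source elements are left in a valid but unspecified state because their contents may have been transferred.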

What are your opinions about placing the CUDA interaction into the DeviceArray class? I think the following lines of code could be wrapped inside a download function of DeviceArray:

#include <cuda_runtime_api.h>
#include <cuda.h>
...
const std::size_t bytes = (sizes[qp]) * sizeof(int);
cudaMemcpy(&tmp[0], pdata, bytes, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();

All of this could go into an overload of the download member function of DeviceArray. I tried to describe it in ticket #4689.

@FabianSchuetze
Contributor Author

Thank you, @kunaltyagi , for your comments - they were very helpful for me!

Member

@kunaltyagi kunaltyagi left a comment

Just some thoughts about 1 final memory allocation

Comment thread gpu/segmentation/include/pcl/gpu/segmentation/impl/gpu_extract_clusters.hpp Outdated
Comment on lines 165 to 177
for(int idx : data)
{
for(int qp_r = 0; qp_r < sizes[qp]; qp_r++)
{
if(processed[data[qp_r + qp * max_answers]])
continue;
processed[data[qp_r + qp * max_answers]] = true;
queries_host.push_back ((*host_cloud_)[data[qp_r + qp * max_answers]]);
found_points++;
r.indices.push_back(data[qp_r + qp * max_answers]);
}
}
if(processed[idx])
continue;
processed[idx] = true;
queries_host.push_back ((*host_cloud_)[idx]);
found_points++;
r.indices.push_back(idx);
}
Member

The braces need formatting

Contributor Author

Oh, sorry, I tried to fix it! Is it correct now? I have a question though: when I clang-format the file gpu_extract_clusters.hpp, many lines of the file change. Am I doing something wrong, or does the same happen to you too?

Member

gpu module isn't formatted yet. So your experience is correct. The plugin I use allows me to format only a selection of lines instead of all the lines in the file.

You can either:

  • use similar settings for some plugin in your fav editor
  • create an additional formatting commit (at the start or end) so we can skip that/select that easily during code-review

Contributor Author

@FabianSchuetze FabianSchuetze Apr 12, 2021

Ha - thanks for pointing out this feature in vim clang-format! I always only formatted the entire buffer, never a few lines. Shall I maybe make a batch PR clang-formatting the GPU codebase?

Member

@kunaltyagi kunaltyagi left a comment

LGTM otherwise :)

Thanks @FabianSchuetze for bearing with us

@FabianSchuetze
Contributor Author

> LGTM otherwise :)
>
> Thanks @FabianSchuetze for bearing with us

It was a pleasure to work on it - thank you so much for your support @kunaltyagi !

I have another question as this issue comes to an end: I would like to continue working on the GPU code. Issues #4443 or #2218 seem interesting to me. Alternatively, the GPU code seems to lack a correspondence estimation, and I would be happy to work on this. I would be thrilled to see a GPU version of the ICP algorithm and think this should be the next step. Do you have an idea of which feature/issue to prioritize?

@kunaltyagi kunaltyagi linked an issue Apr 13, 2021 that may be closed by this pull request
@FabianSchuetze
Contributor Author

Phew! I think we are getting closer to the end. Thanks to the expansion of the device array API, we can avoid the CUDA calls in the segmentation module spotted by Lars.

kunaltyagi previously approved these changes Jun 22, 2021
Member

@kunaltyagi kunaltyagi left a comment

LGTM

@larshg
Contributor

larshg commented Jun 22, 2021

Apart from 3 really minor things, LGTM.

@FabianSchuetze
Contributor Author

Thank you, Kunal and Lars, for the review! Lars, you are eagle-eyed! I did apply most of the changes you suggested - thank you.

kunaltyagi previously approved these changes Jun 24, 2021
@larshg
Contributor

larshg commented Jun 24, 2021

Just tested it. I went from something like 38 seconds, loads of spam, and faulty clusters to 8 seconds, minimal spam, and correct clusters with my test pointcloud of about 120k points, which is segmented into 28 clusters with sizes of 3,000-10,000 points.
However, the CPU version using a Kd-tree still does it in about 300 ms.

I noticed that this line is still verbose:

std::cout << "INFO: end of extractEuclideanClusters " << std::endl;

Could you fix this one as well 😄 ?

@FabianSchuetze
Contributor Author

Thank you, Lars, for testing the program and for your feedback. Your message is bittersweet! I'm glad the results of the program itself were OK. Nevertheless, it would be wonderful to have a GPU version that performs faster than the CPU version. I will take a look at some GPU KDTree implementations for inspiration. Anyway - the noisy info got banished to the PCL_DEBUG macro, and it shan't be seen anymore during normal operation. Thanks for the feedback!

larshg previously approved these changes Jun 25, 2021
@kunaltyagi kunaltyagi requested a review from larshg June 26, 2021 03:22
@kunaltyagi
Member

@mvieth Do you want to take a look? Or we can go ahead and squash-merge this

@FabianSchuetze
Contributor Author

Thanks for approving, Lars and Kunal!

@mvieth
Member

mvieth commented Jun 27, 2021

I had a quick look over the code, everything seems fine. And it definitely promises a great speedup, even if it might still not be able to compete with the pure CPU version. Thank you for working on this!

@mvieth mvieth merged commit f9927b9 into PointCloudLibrary:master Jun 27, 2021
tin1254 pushed a commit to tin1254/pcl that referenced this pull request Aug 10, 2021
… allocations (PointCloudLibrary#4677)

* remove costly memory allocation

* addresses comments

* stylistic changes

* economical download of data from device to host

* tries to resolve bug of different cluster sizes

* removed comments and address PR review

* try to address review comments

* exploiting symmetry

* formatting and auto

* placed function in source file

* placed function again in namespace pcl::detail

* moved declaration to hpp file

* compatible with new device array api

* remove duplicate function - compiles but segfault

* runs without segfault

* cosmetic changes

* removed noisy info

* Add newline for the debug macro

Co-authored-by: Kunal Tyagi <tyagi.kunal@live.com>
mvieth pushed a commit to mvieth/pcl that referenced this pull request Dec 27, 2021
… allocations (PointCloudLibrary#4677)


Labels

changelog: enhancement Meta-information for changelog generation module: gpu

Development

Successfully merging this pull request may close these issues.

GPU code is slower than CPU
gpu seg is toooo..o slower than cpu

4 participants