Convert DG-RePlAce algorithm to Kokkos #5352
warning: 'gpl2/MakeDgReplace.h' file not found [clang-diagnostic-error]
#include "gpl2/MakeDgReplace.h"
^
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include <Kokkos_Core.hpp>
^
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include <Kokkos_Core.hpp>
^
warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- void dct_2d_fft(const int M,
+ void dct_2d_fft(int M,
warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- const int N,
+ int N,
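For context on the check above, here is a minimal sketch of why the const can be dropped from the declaration. The function name scale_dims is hypothetical and not taken from this PR. Top-level const on a by-value parameter is not part of the function type, so the declaration and the definition may spell it differently.

```cpp
#include <cassert>
#include <type_traits>

// Declaration: top-level const on by-value parameters is omitted, which is
// what readability-avoid-const-params-in-decls asks for.
// (scale_dims is a hypothetical name used only for illustration.)
int scale_dims(int M, int N);

// Definition: const here only stops the body from modifying its local copies.
int scale_dims(const int M, const int N)
{
  return M * N;
}

// Both spellings denote the same function type, so callers see no difference.
static_assert(std::is_same<int(int, int), int(const int, const int)>::value,
              "top-level const does not change the signature");
```

Callers are unaffected either way, e.g. scale_dims(3, 4) resolves to the same function regardless of which spelling the header uses.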
warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]
src/gpl2/src/placerBase.cpp:40:
- #include <cstdio>
+ #include <cmath>
+ #include <cstdio>
- binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
+ binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]
- binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
+ binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
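A small sketch of what the std::ceil suggestion buys, assuming a standalone helper (bin_size is a hypothetical name that only mirrors the shape of the binSizeX_ computation): with <cmath> included, std::ceil has a float overload, so the float argument is not promoted to double.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// Hypothetical helper mirroring the binSizeX_ / binSizeY_ computation.
int bin_size(int lo, int hi, int bin_count)
{
  const float span = static_cast<float>(hi - lo);
  // std::ceil(float) returns float, so the computation stays in single precision.
  static_assert(
      std::is_same<decltype(std::ceil(span / bin_count)), float>::value,
      "float overload of std::ceil is selected");
  return static_cast<int>(std::ceil(span / bin_count));
}
```

For example, bin_size(0, 10, 4) rounds 2.5 up and yields 3.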
warning: 'db_sta/dbNetwork.hh' file not found [clang-diagnostic-error]
#include "db_sta/dbNetwork.hh"
^
warning: call to 'round' promotes float to double [performance-type-promotion-in-math-fn]
src/gpl2/src/placerBase.h:38:
- #include <memory>
+ #include <cmath>
+ #include <memory>
-     + static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
+     + static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));
warning: member initializer for 'inst_' is redundant [modernize-use-default-member-init]
- : inst_(nullptr),
+ : ,
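Note that the automated replacement for the last warning leaves a dangling ": ,", which would not compile if applied verbatim. As a hedged sketch of the intended end state (the class PinSketch and its members are hypothetical, not the PR's classes): the default lives on the member declaration and constructors list only what differs.

```cpp
#include <cassert>

struct Inst {};  // stand-in type, only for illustration

class PinSketch
{
 public:
  PinSketch() = default;
  // Only the member that differs from its default appears in the initializer list.
  explicit PinSketch(int offset) : offset_(offset) {}

  const Inst* inst() const { return inst_; }
  bool isFixed() const { return is_fixed_; }
  int offset() const { return offset_; }

 private:
  // Default member initializers replace inst_(nullptr), isFixed_(false), etc.
  const Inst* inst_ = nullptr;
  bool is_fixed_ = false;
  int offset_ = 0;
};
```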
|
Earlier the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected? |
Earlier measurements were taken when some parts were still using native CUDA, and on a different design ( I'd expect it should be possible to achieve a similar runtime using Kokkos. These results might suggest that there are some unnecessary memory copies between host and device, but this needs to be investigated further. |
|
Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding. Do all the various versions produce the same result? That is also important. |
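One way to make such a comparison more precise is to repeat the run several times and report mean and spread rather than a single wall-clock number. The sketch below is only an assumption about how that could be scripted; time_runs and summarize are hypothetical helpers, and for GPU work the workload would have to include device synchronization so that the full run is measured.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>
#include <vector>

struct RunStats
{
  double mean;
  double stddev;
};

// Mean and population standard deviation of a set of run times (seconds).
RunStats summarize(const std::vector<double>& seconds)
{
  if (seconds.empty()) {
    return {0.0, 0.0};
  }
  double sum = 0.0;
  for (double s : seconds) {
    sum += s;
  }
  const double mean = sum / seconds.size();
  double sq = 0.0;
  for (double s : seconds) {
    sq += (s - mean) * (s - mean);
  }
  return {mean, std::sqrt(sq / seconds.size())};
}

// Time `runs` invocations of a workload with a monotonic clock.
// `runs` is assumed to be non-negative.
std::vector<double> time_runs(const std::function<void()>& workload, int runs)
{
  std::vector<double> seconds;
  seconds.reserve(runs);
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    workload();
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}
```

Comparing summarize(time_runs(...)) for each variant shows whether a gap is larger than the run-to-run noise.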
|
What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option). |
I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant if you compare it to the overall speedup achieved. We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.
Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results. |
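To illustrate the ordering point: float addition is not associative, so a reduction whose accumulation order depends on the backend or thread schedule can give slightly different sums. A minimal sketch of one possible remedy, assuming the values can be gathered first (deterministic_sum is a hypothetical helper, not code from this branch), is to sum in a canonical order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sum values in a canonical (sorted) order so the result does not depend on
// the order in which parallel workers happened to produce them.
float deterministic_sum(std::vector<float> values)
{
  std::sort(values.begin(), values.end());
  float total = 0.0f;
  for (float v : values) {
    total += v;
  }
  return total;
}
```

Two permutations of the same inputs then produce bit-identical results, at the cost of a sort. Whether that cost is acceptable inside a GPU kernel is a separate question.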
You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor. A 50% overhead is worth exploring to at least understand, if not eliminate. |
I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.
+1. I just want to point out that if this is the fastest we can go, that seems fast enough to me. |
No, they don't, and it was quite surprising, as I expected that the original code and Kokkos with the CUDA backend would produce the same result. NVCC should do pre-processing and compilation for device code, produce a CUDA binary, and leave host code for the host compiler. We checked that when I suspect that this issue isn't only related to Eigen: when I disabled initial placement, the runtimes of Kokkos and the original code were almost the same, but the results were still different (I haven't investigated the reason for this).
kokkos-fft is a header-only interface library that translates FFT calls to the proper backend by detecting the backends enabled in Kokkos, but I agree that, if preferred, both kokkos and kokkos-fft could be dependencies.
I think this overhead is due to a different initial placement; when initial placement is disabled, the runtime is very similar:
I also did precise measurements using RTX 3080, 8 vCPU i9-12900 @ 2.42 GHz and 32GB of RAM with 10 runs using
|
|
Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is |
Arguments that are passed to |
|
Another possibility is that it is invoking a different g++ binary from another path. |
|
Converted to a draft due to no progress. |
04d428f to 925dd93
|
I've rebased this branch onto latest
I've found that to not be the case. Earlier, I recreated the same condition (where Eigen was running slowly) using
To prioritize merging of GPU-accelerated placement, the focus was to get the branch issue-free before optimizing. In my testing, Kokkos-based algorithm on
Future / subsequent work:
|
|
I added a configuration option to |
|
I would prefer to see kokkos as part of the dependency installer rather than as a submodule. There should be no need to compile it for each workspace on a machine. |
|
With the current setup, it would be possible to support both compilation schemes, with the priority set towards the |
|
If someone wants to put a local copy in-tree that's fine but I'd like to avoid having a submodule. |
|
I'll add support for |
072e3b1 to 2dcac77
2dcac77 to 960ec72
|
I've added nested parallelism to the most time consuming kernel -
Additionally, a concern was raised regarding non-deterministic results returned from Kokkos, depending on the compute device used for processing. To validate the flow, each variant was subjected to a run from synthesis to the
Test subjects were:
Metrics collected were taken from the final report and log, and were:
Results:
|
|
Very nice! How is the cpu vs gpu runtime with your latest changes? Is this ready for review? |
warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- void dct_2d_fft(const int M,
+ void dct_2d_fft(int M,
warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- const int N,
+ int N,
warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- void idct_2d_fft(const int M,
+ void idct_2d_fft(int M,
warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- const int N,
+ int N,
warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- void idxst_idct(const int M,
+ void idxst_idct(int M,
warning: member initializer for 'isFixed_' is redundant [modernize-use-default-member-init]
- isFixed_(false)
warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]
int ux = lx + floor(bbox->getDX() / 2) * 2;
^
warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]
int uy = ly + floor(bbox->getDY() / 2) * 2;
^
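On the integer-division warnings: if getDX() and getDY() return integers, the division already truncates before floor() sees the value, so floor() does nothing there. A minimal sketch, assuming the intent is to round a non-negative extent down to an even number (round_down_to_even is a hypothetical helper), keeps everything in integer arithmetic.

```cpp
#include <cassert>

// Round a non-negative width down to the nearest even number using integer
// arithmetic only; no floor() call or int -> double conversion is involved.
// For negative inputs integer division truncates toward zero, which this
// sketch does not try to handle.
int round_down_to_even(int width)
{
  return (width / 2) * 2;
}
```

With such a helper, the flagged line could read int ux = lx + round_down_to_even(bbox->getDX()); under the same assumption about the intent.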
warning: the parameter 'ps' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]
- void Instance::dbSetPlacementStatus(odb::dbPlacementStatus ps)
+ void Instance::dbSetPlacementStatus(const odb::dbPlacementStatus& ps)
src/gpl2/src/placerObjects.h:105:
- void dbSetPlacementStatus(odb::dbPlacementStatus ps);
+ void dbSetPlacementStatus(const odb::dbPlacementStatus& ps);
warning: member initializer for 'pin_' is redundant [modernize-use-default-member-init]
- : pin_(nullptr),
+ : ,
|
Yes, it's ready for review. I've applied the suggested clang-tidy fixes and added the missing RockyLinux9 package. The performance difference between
The test setup is an Intel i7-8700 and an NVIDIA GTX 1080Ti |
a1b101b to 1d136de
86ae55d to 32d2b9d
|
Looks like we will soon also need to bring kokkos and fftw into the BCR to get it nicely compiled in the new bazel build. |
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include "Kokkos_Core.hpp"
^
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include "Kokkos_Core.hpp"
^
…guides Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>
32d2b9d to be2d8bd
|
@jbylicki To confirm, is this PR ready for review? |
|
@mikesinouye Yes, it is |
|
I tried to build this but I get: Have you run into this? |
if [[ ${gpuDeps} == "nvidia" ]]; then
    RELEASE_CODENAME=$(lsb_release -c | awk '{print $2}')

    NEW_LINES="deb http://deb.debian.org/debian/ $RELEASE_CODENAME main contrib non-free
deb-src http://deb.debian.org/debian/ $RELEASE_CODENAME main contrib non-free"

    if ! grep -q "$NEW_LINES" /etc/apt/sources.list; then
        echo "$NEW_LINES" | tee -a /etc/apt/sources.list > /dev/null
    fi
    apt-get update
    apt-get -y install --no-install-recommends libcu++-dev nvidia-cuda-toolkit
fi
Ok, there are some issues with this setup for installing nvidia-cuda-toolkit.
- The command lsb_release -c on Ubuntu returns an Ubuntu codename like jammy or focal. The script then tries to use this codename to access a URL on the Debian server, http://deb.debian.org/debian/jammy, which does not exist.
- Even if the URL existed, installing packages from a Debian repository on an Ubuntu system is highly discouraged. Ubuntu is based on Debian but has its own versions of libraries and packages. Mixing them can cause critical dependency conflicts that can break the package manager and other essential system services.
I'm currently running Debian 12 bookworm, and it does need non-free enabled in order to find nvidia-cuda-toolkit. On Ubuntu, though, the equivalent component is called multiverse instead of non-free.
Also, libcu++-dev should already be included with nvidia-cuda-toolkit package.
Neither kokkos nor gpl2 itself requires it. I guess that it was used for some debugging in the past and can be safely dropped. Besides, in other OpenROAD modules we already use spdlog's format (which is a proxy to either std::format or a bundled copy of fmtlib).
Build it with the same flags as in cmake.
I suspect that you have an older version of KokkosFFT (before introduction of
The current setup is very complex:
The version mismatch is not the only issue with using
Building Kokkos solely in-tree would solve these problems and decrease maintenance burden related to having two independent methods of installing Kokkos. |
|
I think we should finish this off |
|
If this continues, I can work on making kokkos build with bazel (and once it works also put on BCR) |
|
That would be great. I've been holding off until we can retire cmake to avoid having to deal with it. |
|
Late-comer thought — I've spent time reviewing gpl and have an RTX 5090 on hand. After reading both gpl and gpl2, much of gpl2 (Nesterov body, WA wirelength, BiCGSTAB) looks like the same algorithms as gpl. So instead of a separate gpl2 library, you could keep gpl and add Kokkos kernels behind an optional build flag, one function per PR. The CMake side can reuse -DGPU=ON in etc/Build.sh now; the bazel side will work once @hzeller finishes putting Kokkos on BCR. First PR could be getHpwl to set up the option and Kokkos dependency. WA gradient and a Poisson solver using Kokkos-FFT would follow. @sgizler's determinism fixes in this branch apply to either approach. I'm raising this because the dual-codebase concern keeps coming up, and this approach might address it. Happy to try the first PR if it's useful, or to drop it and let #5352 continue as planned. |
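For readers unfamiliar with the proposed first kernel: HPWL (half-perimeter wirelength) sums, over all nets, the width plus height of the bounding box of each net's pins. The sketch below is a plain serial CPU version under assumed data structures (PinXY, net_hpwl and total_hpwl are hypothetical names, not gpl's actual getHpwl). A Kokkos port would presumably turn the per-net loop into a parallel reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pin position; the real gpl data structures differ.
struct PinXY
{
  int x;
  int y;
};

// Half-perimeter wirelength of one net: width plus height of the bounding
// box enclosing all of its pins. Nets with no pins contribute zero.
std::int64_t net_hpwl(const std::vector<PinXY>& pins)
{
  if (pins.empty()) {
    return 0;
  }
  int min_x = pins[0].x, max_x = pins[0].x;
  int min_y = pins[0].y, max_y = pins[0].y;
  for (const PinXY& p : pins) {
    min_x = std::min(min_x, p.x);
    max_x = std::max(max_x, p.x);
    min_y = std::min(min_y, p.y);
    max_y = std::max(max_y, p.y);
  }
  return (static_cast<std::int64_t>(max_x) - min_x)
         + (static_cast<std::int64_t>(max_y) - min_y);
}

// Total HPWL over all nets; integer accumulation keeps the result
// independent of summation order.
std::int64_t total_hpwl(const std::vector<std::vector<PinXY>>& nets)
{
  std::int64_t total = 0;
  for (const auto& net : nets) {
    total += net_hpwl(net);
  }
  return total;
}
```

Because the accumulation is over integers, this particular metric would not be affected by the reduction-order concerns raised earlier in the thread.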
|
The goal would be to have a single gpl. This is a separate code base as that is how it was developed academically but not the preferred end state. I am glad to take PRs for incrementally adding these ideas directly to gpl/. Dealing with Kokkos itself will be a significant first step. |
|
Thanks @maliberty — that's clarifying. I'll put together a first PR with the CMake option and a Kokkos-backed getHpwl, leaving the CPU path as default. If the build-system side needs to align with @hzeller's BCR work, happy to coordinate before posting. |
This MR converts the DG-RePlAce algorithm, originally written for CUDA, to Kokkos.
Kokkos provides an abstraction for writing parallel code that can be translated into several backends, including CUDA, OpenMP and C++ threads.
Tested on a single run with an RTX 3090 and an i7-8700 CPU @ 3.20GHz using ariane133 design.