Convert DG-RePlAce algorithm to Kokkos #5352

Open
kamilrakoczy wants to merge 30 commits into The-OpenROAD-Project:master from antmicro:convert-gpl2-kokkos

@kamilrakoczy
Contributor

@kamilrakoczy kamilrakoczy commented Jul 8, 2024

This MR converts the DG-RePlAce algorithm, originally written for CUDA, to Kokkos.

Kokkos provides an abstraction for writing parallel code that can target several backends, including CUDA, OpenMP, and C++ threads.

Tested with a single run on an RTX 3090 and an i7-8700 CPU @ 3.20GHz, using the ariane133 design.

| | original placer | CUDA implementation | Kokkos (CUDA backend) | Kokkos (OpenMP backend) | Kokkos (Threads backend) |
|---|---|---|---|---|---|
| ariane133 global place time | 11:27.39 | 0:57.70 | 1:33.49 | 3:24.12 | 6:08.94 |

@github-actions github-actions Bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 52. Check the log or trigger a new build to see more.

Comment thread src/gpl2/src/MakeDgReplace.cpp Outdated

warning: 'gpl2/MakeDgReplace.h' file not found [clang-diagnostic-error]

#include "gpl2/MakeDgReplace.h"
         ^

Comment thread src/gpl2/src/dct.cpp Outdated

warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp>
         ^

Comment thread src/gpl2/src/dct.h Outdated

warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include <Kokkos_Core.hpp>
         ^

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- void dct_2d_fft(const int M,
+ void dct_2d_fft(int M,
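
As background for this check: top-level `const` on a by-value parameter is not part of a function's signature, so a declaration without it and a definition with it refer to the same function. A minimal sketch, using a hypothetical `scale_count` function (not part of this PR) rather than the real `dct_2d_fft`:

```cpp
// Declaration, as it would appear in a header: no top-level const needed.
int scale_count(int M, int N);

// Definition: const here only stops the body from modifying its own local
// copies of M and N. Callers see exactly the signature declared above.
int scale_count(const int M, const int N)
{
  return M * N;
}
```

Dropping `const` from the header therefore changes nothing for callers, and the definition can keep it if the author wants the extra protection inside the body.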

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- const int N,
+ int N,

Comment thread src/gpl2/src/placerBase.cpp Outdated

warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.cpp:40:

- #include <cstdio>
+ #include <cmath>
+ #include <cstdio>
Suggested change
- binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
+ binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
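
The reason behind this suggestion can be sketched as follows: unqualified `ceil` may resolve to the C function taking a `double`, whereas `std::ceil` from `<cmath>` has a `float` overload, so the value stays in single precision. The `bin_size` helper below is hypothetical and only mirrors the shape of the flagged expression.

```cpp
#include <cmath>
#include <type_traits>

// std::ceil has a float overload, so no float -> double promotion happens.
static_assert(std::is_same<decltype(std::ceil(1.0f)), float>::value,
              "std::ceil(float) returns float");

// Hypothetical helper mirroring the flagged expression: divide the region
// width by the bin count in float and round up to a whole number.
int bin_size(int lx, int ux, int bin_cnt)
{
  return static_cast<int>(std::ceil(static_cast<float>(ux - lx) / bin_cnt));
}
```

For example, `bin_size(0, 10, 3)` rounds 3.33 up to 4, while `bin_size(0, 9, 3)` stays at 3.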

Comment thread src/gpl2/src/placerBase.cpp Outdated

warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]

Suggested change
- binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
+ binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);

Comment thread src/gpl2/src/placerBase.h Outdated

warning: 'db_sta/dbNetwork.hh' file not found [clang-diagnostic-error]

#include "db_sta/dbNetwork.hh"
         ^

Comment thread src/gpl2/src/placerBase.h Outdated

warning: call to 'round' promotes float to double [performance-type-promotion-in-math-fn]

src/gpl2/src/placerBase.h:38:

- #include <memory>
+ #include <cmath>
+ #include <memory>
Suggested change
+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
+ static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: member initializer for 'inst_' is redundant [modernize-use-default-member-init]

Suggested change
- : inst_(nullptr),
+ : ,

@maliberty
Member

Earlier, the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

@kamilrakoczy
Contributor Author

> Earlier, the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected?

The earlier measurements were taken when some parts were still using native CUDA, and with a different design (black-parrot).
These measurements come from a single run on a local machine that was also being used for other things, so they are not very accurate.

I'd expect it to be possible to achieve a similar runtime using Kokkos. These results might suggest that there are some unnecessary memory copies between host and device, but this needs to be investigated further.

@maliberty
Member

Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

Do all the various versions produce the same result? That is also important.

@maliberty
Member

What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

@QuantamHD
Collaborator

QuantamHD commented Jul 9, 2024

> Please try to get a more precise measure of the runtime difference as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding.

I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant if you compare it to the overall speedup achieved.

We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.

> Do all the various versions produce the same result? That is also important.

Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results.
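
To illustrate why ordering matters here: floating-point addition is not associative, so the same values summed in a different order (as different backends or thread counts may do) can round differently. The sketch below is plain host C++, not Kokkos code, and the function names are hypothetical. Sorting into a canonical order before summing is only one possible way to make the result independent of input order; the thread does not say which approach would be used.

```cpp
#include <algorithm>
#include <vector>

// Sum in the order given. Float addition is not associative, so the result
// can depend on the order in which the terms are combined.
float sum_in_order(const std::vector<float>& values)
{
  float sum = 0.0f;
  for (float v : values) {
    sum += v;
  }
  return sum;
}

// Sum after sorting a copy into a canonical order, so that any permutation
// of the same values produces the same result.
float sum_sorted(std::vector<float> values)
{
  std::sort(values.begin(), values.end());
  return sum_in_order(values);
}
```

With the values `1e8f`, `-1e8f`, and `1.0f`, summing in that order gives exactly `1.0f`, while an order that adds `1.0f` to `1e8f` first will typically lose it to rounding. `sum_sorted` returns the same value for every permutation, at the cost of a sort.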

@maliberty
Member

> I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion.

You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.

A 50% overhead is worth exploring to at least understand if not eliminate.

@QuantamHD
Collaborator

QuantamHD commented Jul 9, 2024

> You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor.

I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.

> A 50% overhead is worth exploring to at least understand if not eliminate.

+1. I just want to point out that if this is the fastest we can go, that seems fast enough to me.

@kamilrakoczy
Contributor Author

> Do all the various versions produce the same result? That is also important.

No, they don't, which was quite surprising: I expected the original code and Kokkos with the CUDA backend to produce the same result.
We investigated this, and it turned out to be because Kokkos passes all files that depend on it through nvcc_wrapper. This wrapper converts host compiler (g++) options to nvcc options and uses nvcc to compile all Kokkos-dependent sources. This allows device code to live in a single .cpp file instead of a separate .cu file.

NVCC should preprocess and compile the device code into a CUDA binary and leave the host code to the host compiler.

We verified that when nvcc is used to compile InitialPlace, Eigen's solveWithGuess returns different results for exactly the same inputs compared to compiling with g++ directly.

I suspect that this issue isn't only related to Eigen: when I disabled initial placement, the runtimes of Kokkos and the original code were almost the same, but the results were still different (I haven't investigated the reason for this).

> What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).

kokkos-fft is a header-only interface library that translates FFT calls to the proper backend by detecting which backends are enabled in Kokkos. But I agree: if preferred, both kokkos and kokkos-fft could be dependencies.

> A 50% overhead is worth exploring to at least understand if not eliminate.

I think this overhead is due to the different initial placement; when initial placement is disabled, the runtime is very similar:

| | CUDA implementation | Kokkos (CUDA backend) |
|---|---|---|
| ariane133 global place time without initial placement | 0:55.52 | 0:58.25 |

I also took more precise measurements on an RTX 3080, an 8 vCPU i9-12900 @ 2.42 GHz, and 32GB of RAM, with 10 runs of the ariane133 design:

| | min time [m:ss] | avg time [m:ss] | med time [m:ss] | max time [m:ss] |
|---|---|---|---|---|
| CUDA implementation | 0:45 | 0:48 | 0:47 | 0:53 |
| Kokkos (CUDA backend) | 1:53 | 1:57 | 1:57 | 2:00 |
| Kokkos (OpenMP backend) | 1:50 | 2:04 | 1:54 | 2:37 |
| Kokkos (threads backend) | 3:42 | 3:43 | 3:43 | 3:45 |
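
For reference, the four statistics in the table above can be computed from a list of per-run wall-clock times as sketched below. This is a hypothetical helper, not the script used for the measurements, and it assumes times are already converted to seconds.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct RunStats {
  double min;
  double avg;
  double med;
  double max;
};

// Summarize wall-clock times (in seconds) from repeated runs.
// Precondition: `seconds` is non-empty.
RunStats summarize(std::vector<double> seconds)
{
  std::sort(seconds.begin(), seconds.end());
  const std::size_t n = seconds.size();
  const double total = std::accumulate(seconds.begin(), seconds.end(), 0.0);
  // Median: middle element for odd n, mean of the two middle ones for even n.
  const double med = (n % 2 == 1)
                         ? seconds[n / 2]
                         : (seconds[n / 2 - 1] + seconds[n / 2]) / 2.0;
  return RunStats{seconds.front(), total / n, med, seconds.back()};
}
```

Reporting the median alongside the average helps when one run is an outlier, as with the 2:37 maximum in the OpenMP row.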

@maliberty
Member

Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is nvcc calling g++ with different flags?

@kamilrakoczy
Contributor Author

> Is nvcc calling g++ with different flags?

The arguments passed to nvcc, which nvcc should pass on to g++, are the same.
I haven't yet investigated how (with what flags) g++ is invoked from nvcc.

@maliberty
Member

Another possibility is that it is invoking a different g++ binary from another path.

@maliberty maliberty marked this pull request as draft October 14, 2024 04:41
@maliberty
Member

Converted to a draft due to no progress.

@jbylicki jbylicki force-pushed the convert-gpl2-kokkos branch 2 times, most recently from 04d428f to 925dd93 Compare January 7, 2025 13:14
@jbylicki
Contributor

jbylicki commented Jan 7, 2025

I've rebased this branch onto the latest master and started resolving the mentioned issues:

  • Eigen’s solveWithGuess() behaves differently on the Kokkos branch (with a suggestion that this is caused by nvcc_wrapper, a part of Kokkos responsible for redirecting compilations that do not pertain to CUDA to the host compiler):

I've found that not to be the case. Earlier, I recreated the same condition (where Eigen was running slowly) using clang++ as the Kokkos compiler and confirmed that nvcc_wrapper was not used then. The problem was that Eigen, on detecting that CUDA was available, was trying to use it. Nevertheless, I saw no peak in GPU usage while initial_place was running, so I disabled it and saw the numbers return to baseline (the same as in the CUDA-native implementation).

  • What is the performance difference between Kokkos and CUDA-native implementations?

To prioritize merging of GPU-accelerated placement, the focus was on getting the branch issue-free before optimizing. In my testing, the Kokkos-based algorithm on black-parrot spends about 10 seconds in libcuda.so, whereas the CUDA-native implementation spends around 5. All other timings are comparable, making the entire run about 5 seconds longer.

Future / subsequent work:

  • Make Kokkos a submodule: Due to varying conditions on host machines, most Kokkos libraries available as packages ship without either CUDA or OMP support. Having a dependency that has to be manually compiled and configured correctly to get a functioning and fast implementation might introduce complexity for the end user. Therefore, I suggest not migrating kokkos-fft to be a dependency and instead using kokkos, which is already cloned as a submodule of kokkos-fft, as an in-tree library. The issue I'm currently facing is that internal deprecations of CMake symbols are triggered when Kokkos is compiled as a child project rather than as the parent.
  • Optimize memory accesses and the Kokkos implementation itself: I've confirmed that memory copying is one of the causes of the algorithm being slower, and fixes are in development, waiting for the more pressing issues to be resolved.

@jbylicki
Contributor

jbylicki commented Jan 9, 2025

I added a configuration option to etc/Build.sh, -use_gpl2, that includes the gpl2 subdirectory and launches the compilation of kokkos via kokkos-fft in CMake. I additionally assigned the -gpu flag from the build script to enable the CUDA backend in Kokkos.

@maliberty
Member

I would prefer to see kokkos as part of the dependency installer rather than as a submodule. There should be no need to compile it for each workspace on a machine.

@jbylicki
Contributor

jbylicki commented Jan 9, 2025

With the current setup, it would be possible to support both compilation schemes, with priority given to the DependencyInstaller: if a system-wide Kokkos installation is detected, it will be used during compilation. I would suggest keeping the option to use in-tree Kokkos and kokkos-fft (if kokkos-fft were also moved to be downloaded via DependencyInstaller), as the script is tailored only towards Ubuntu users. If a system-wide package is not detected, both dependencies can be installed via FetchContent and built in-tree.

@maliberty
Member

If someone wants to put a local copy in-tree that's fine but I'd like to avoid having a submodule.

@jbylicki
Contributor

jbylicki commented Jan 9, 2025

I'll add support for kokkos and kokkos-fft via the DependencyInstaller then. The submodule could be deleted while keeping in-tree support: if a system-wide package is absent, CMake would handle the download via the FetchContent directive, and the build would have conditionals in place to link correctly.

@jbylicki
Contributor

I've added nested parallelism to the most time-consuming kernel, computeBCPosNegKernel. After rebasing both branches onto the same base commit, the performance results for the black-parrot design with the CUDA backend are as follows:

  • CUDA-native: 24.606 seconds (total time: 114.50 s, skipped initial place: 94.49 s)
  • Kokkos: 23.614 seconds (total time: 114.42 s, skipped initial place: 95.07 s)

Additionally, a concern was raised regarding non-deterministic results returned from Kokkos, depending on the compute device used for processing. To validate the flow, each variant was run from synthesis to the final step. While the results do vary, the variation has minimal impact on the actual parameters of the finished flow. Moreover, the results are deterministic on a per-device basis, even when the compute device is under heavy external load (especially applicable to GPUs).

Test subjects were:

  • master branch commit 7e0fce872123, as baseline and base for other branches
  • cuda-native, the original CUDA-native implementation, rebased onto the same base as other branches
  • kokkos-cpu, the Kokkos-based flow, ran on the OpenMP backend
  • kokkos-gpu, the Kokkos-based flow, ran on the CUDA backend

Metrics collected were taken from the final report and log, and were:

  • Total Negative Slack (tns)
  • Worst Negative Slack (wns)
  • Total power
  • Design area and utilization

Results:

| Branch | TNS | WNS | Design area, utilization | Total Power |
|---|---|---|---|---|
| master | -2.42 | -2.42 | 760397 u^2, 45% utilization | 2.57e-01 W |
| cuda-native | -2.40 | -2.40 | 753511 u^2, 44% utilization | 2.49e-01 W |
| kokkos-cpu | -2.49 | -2.49 | 753608 u^2, 44% utilization | 2.50e-01 W |
| kokkos-gpu | -2.44 | -2.44 | 753674 u^2, 44% utilization | 2.50e-01 W |

@maliberty
Member

Very nice! How is the cpu vs gpu runtime with your latest changes? Is this ready for review?

@github-actions github-actions Bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 45. Check the log or trigger a new build to see more.

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- void dct_2d_fft(const int M,
+ void dct_2d_fft(int M,

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- const int N,
+ int N,

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- void idct_2d_fft(const int M,
+ void idct_2d_fft(int M,

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- const int N,
+ int N,

Comment thread src/gpl2/src/dct.h Outdated

warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]

Suggested change
- void idxst_idct(const int M,
+ void idxst_idct(int M,

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: member initializer for 'isFixed_' is redundant [modernize-use-default-member-init]

Suggested change
- isFixed_(false)

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]

  int ux = lx + floor(bbox->getDX() / 2) * 2;
                      ^
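
To make this warning concrete: in the flagged line the division is performed on integers before `floor` ever sees the value, so `floor` has nothing left to round. The sketch below contrasts the two possible intents using hypothetical helper names; it assumes `getDX()` returns an integer type, which the warning implies but the thread does not show.

```cpp
#include <cmath>

// Integer form: dx / 2 is already an integer division (truncating toward
// zero), so wrapping it in floor() changes nothing and only adds an
// int -> double -> int round trip.
int round_down_to_even_int(int dx)
{
  return (dx / 2) * 2;
}

// Floating-point form, if true floor semantics were intended: divide by 2.0
// so that a fractional part exists before floor() is applied.
int round_down_to_even_floor(int dx)
{
  return static_cast<int>(std::floor(dx / 2.0)) * 2;
}
```

The two versions agree for non-negative values (7 becomes 6 in both) and differ only for negative odd inputs, where truncation gives -6 for -7 and `floor` gives -8. Which behavior was intended is not stated in the thread.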

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: result of integer division used in a floating point context; possible loss of precision [bugprone-integer-division]

  int uy = ly + floor(bbox->getDY() / 2) * 2;
                      ^

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: the parameter 'ps' is copied for each invocation but only used as a const reference; consider making it a const reference [performance-unnecessary-value-param]

Suggested change
- void Instance::dbSetPlacementStatus(odb::dbPlacementStatus ps)
+ void Instance::dbSetPlacementStatus(const odb::dbPlacementStatus& ps)

src/gpl2/src/placerObjects.h:105:

-   void dbSetPlacementStatus(odb::dbPlacementStatus ps);
+   void dbSetPlacementStatus(const odb::dbPlacementStatus& ps);

Comment thread src/gpl2/src/placerObjects.cpp Outdated

warning: member initializer for 'pin_' is redundant [modernize-use-default-member-init]

Suggested change
- : pin_(nullptr),
+ : ,

@jbylicki
Contributor

Yes, it's ready for review. I've applied the suggested clang-tidy fixes and added the missing RockyLinux9 package.

The performance difference between CUDA and OpenMP backends on black_parrot is:

  • CUDA: 85.38 s (dg_global_place call time: 20.46 s)
  • OpenMP: 96.58 s (dg_global_place call time: 29.83 s)

The test setup is an Intel i7-8700 and an NVIDIA GTX 1080Ti.

@jbylicki jbylicki force-pushed the convert-gpl2-kokkos branch from a1b101b to 1d136de Compare February 13, 2025 19:44
@maliberty maliberty marked this pull request as ready for review February 14, 2025 05:34
@jbylicki jbylicki force-pushed the convert-gpl2-kokkos branch from 86ae55d to 32d2b9d Compare May 14, 2025 17:09
@hzeller
Collaborator

hzeller commented May 17, 2025

Looks like we will soon also need to bring kokkos and fftw into the BCR to get it nicely compiled in the new bazel build.

@github-actions github-actions Bot left a comment

clang-tidy made some suggestions

Comment thread src/gpl2/src/densityOp.h Outdated

warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include "Kokkos_Core.hpp"
         ^

Comment thread src/gpl2/src/placerObjects.h Outdated

warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]

#include "Kokkos_Core.hpp"
         ^

jbylicki added 4 commits May 21, 2025 11:57
…guides

Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>
Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>
Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>
Signed-off-by: Jan Bylicki <jbylicki@antmicro.com>
@jbylicki jbylicki force-pushed the convert-gpl2-kokkos branch from 32d2b9d to be2d8bd Compare May 21, 2025 10:08
@mikesinouye
Contributor

@jbylicki To confirm, is this PR ready for review?

@jbylicki
Contributor

@mikesinouye Yes, it is

@maliberty
Member

I tried to build this but I get:

/home/matt/OpenROAD/src/gpl2/src/dct.cpp: In function ‘void dct_2d_fft(int, int, const Kokkos::View<const Kokkos::complex<float>*>&, const Kokkos::View<const Kokkos::complex<float>*>&, const Kokkos::View<const float*>&, const Kokkos::View<float*>&, const Kokkos::View<Kokkos::complex<float>*>&, const Kokkos::View<float*>&)’:
/home/matt/OpenROAD/src/gpl2/src/dct.cpp:112:21: error: ‘Plan’ in namespace ‘KokkosFFT’ does not name a type
  112 |   static KokkosFFT::Plan fftplan(hostSpace,

Have you run into this?

Comment on lines +661 to +674
if [[ ${gpuDeps} == "nvidia" ]]; then
    RELEASE_CODENAME=$(lsb_release -c | awk '{print $2}')

    NEW_LINES="deb http://deb.debian.org/debian/ $RELEASE_CODENAME main contrib non-free
deb-src http://deb.debian.org/debian/ $RELEASE_CODENAME main contrib non-free"

    if ! grep -q "$NEW_LINES" /etc/apt/sources.list; then
        echo "$NEW_LINES" | tee -a /etc/apt/sources.list > /dev/null
    fi
    apt-get update
    apt-get -y install --no-install-recommends libcu++-dev nvidia-cuda-toolkit
fi


Contributor

Ok, there are some issues with this setup for installing nvidia-cuda-toolkit.

  1. The command lsb_release -c on Ubuntu returns an Ubuntu codename such as jammy or focal. The script then uses this codename to access a URL on the Debian server, http://deb.debian.org/debian/jammy, which does not exist.
  2. Even if the URL existed, installing packages from a Debian repository on an Ubuntu system is highly discouraged. Ubuntu is based on Debian but has its own versions of libraries and packages. Mixing them can cause critical dependency conflicts that can break the package manager and other essential system services.

I'm currently running Debian 12 bookworm, and it does need non-free in order to find nvidia-cuda-toolkit. On Ubuntu, though, the equivalent is called multiverse instead of non-free.
Also, libcu++-dev should already be included with the nvidia-cuda-toolkit package.

sgizler added 4 commits July 29, 2025 17:17
Neither kokkos, nor gpl2 itself requires it. I guess that it was used
for some debugging in past, and can be safely dropped.

Besides, in other OpenROAD modules we already use spdlog's format
(which is proxy to either std::format or bundled copy of fmtlib)
Build it with same flags as in cmake
@sgizler

sgizler commented Jul 29, 2025

> I tried to build this but I get:
>
> /home/matt/OpenROAD/src/gpl2/src/dct.cpp: In function ‘void dct_2d_fft(int, int, const Kokkos::View<const Kokkos::complex<float>*>&, const Kokkos::View<const Kokkos::complex<float>*>&, const Kokkos::View<const float*>&, const Kokkos::View<float*>&, const Kokkos::View<Kokkos::complex<float>*>&, const Kokkos::View<float*>&)’:
> /home/matt/OpenROAD/src/gpl2/src/dct.cpp:112:21: error: ‘Plan’ in namespace ‘KokkosFFT’ does not name a type
>   112 |   static KokkosFFT::Plan fftplan(hostSpace,
>
> Have you run into this?

I suspect that you have an older version of KokkosFFT (from before the introduction of KokkosFFT::Plan) installed on your system, and that it is being picked up. The quick workaround would probably be to remove /usr/local/include/kokkos and rerun DependencyInstaller.sh. We could add some kind of version check to the script, but maybe dropping Kokkos from DependencyInstaller and building it solely in-tree would be a better solution?

The current setup is very complex:

  • DependencyInstaller.sh either:
    • detects that KokkosFFT is present on host in which case it does nothing (without any version check)
    • if not present, builds and installs KokkosFFT system-wide
  • CMakeLists.txt either:
    • detects that KokkosFFT is present on host in which case it picks it up (not necessarily the one from DependencyInstaller.sh)
    • if not present, builds and installs KokkosFFT in-tree

The version mismatch is not the only issue with using DependencyInstaller.sh for KokkosFFT installation:

  • If you want to switch between using CPU and GPU, you have to remove the old installation of Kokkos, rerun DependencyInstaller with the proper args, and rebuild OpenROAD.
  • Kokkos seems to assume that you use the same compiler suite for building the library and for the code using it. This makes it complicated to build OpenROAD with clang.
  • Kokkos has to be built with non-default compile-time options in order to give deterministic results. We would have to add some extra (potentially complicated) checks for that too.
  • Even if we add version and flag checks in DependencyInstaller.sh, there is no guarantee that CMake will pick the correct instance of Kokkos if the user already had it installed globally.

Building Kokkos solely in-tree would solve these problems and reduce the maintenance burden of having two independent methods of installing Kokkos.

@maliberty maliberty added the Stale A stale PR or issue subject to automated closure. label Mar 24, 2026
@github-actions github-actions Bot removed the Stale A stale PR or issue subject to automated closure. label Mar 25, 2026
@maliberty maliberty added the Stale A stale PR or issue subject to automated closure. label Mar 25, 2026
@github-actions github-actions Bot closed this Apr 15, 2026
@maliberty
Member

I think we should finish this off

@maliberty maliberty reopened this Apr 15, 2026
@github-actions github-actions Bot removed the Stale A stale PR or issue subject to automated closure. label Apr 15, 2026
@hzeller
Collaborator

hzeller commented Apr 15, 2026

If this continues, I can work on making kokkos build with bazel (and once it works also put on BCR)

@maliberty
Member

That would be great. I've been holding off until we can retire cmake to avoid having to deal with it.

@ApeachM
Contributor

ApeachM commented May 6, 2026

Late-comer thought — I've spent time reviewing gpl and have an RTX 5090 on hand. After reading both gpl and gpl2, much of gpl2 (Nesterov body, WA wirelength, BiCGSTAB) looks like the same algorithms as gpl. So instead of a separate gpl2 library, you could keep gpl and add Kokkos kernels behind an optional build flag, one function per PR. The CMake side can reuse -DGPU=ON in etc/Build.sh now; the bazel side will work once @hzeller finishes putting Kokkos on BCR. First PR could be getHpwl to set up the option and Kokkos dependency. WA gradient and a Poisson solver using Kokkos-FFT would follow. @sgizler's determinism fixes in this branch apply to either approach.

I'm raising this because the dual-codebase concern keeps coming up, and this approach might address it. Happy to try the first PR if it's useful, or to drop it and let #5352 continue as planned.

@maliberty
Member

The goal would be to have a single gpl. This is a separate code base as that is how it was developed academically but not the preferred end state. I am glad to take PRs for incrementally adding these ideas directly to gpl/. Dealing with Kokkos itself will be a significant first step.

@ApeachM
Contributor

ApeachM commented May 8, 2026

Thanks @maliberty — that's clarifying. I'll put together a first PR with the CMake option and a Kokkos-backed getHpwl, leaving the CPU path as default. If the build-system side needs to align with @hzeller's BCR work, happy to coordinate before posting.
