-
Notifications
You must be signed in to change notification settings - Fork 212
Implement Principal Component Analysis (PCA) module #1086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1086 +/- ##
==========================================
- Coverage 68.69% 68.66% -0.04%
==========================================
Files 393 393
Lines 12720 12720
Branches 1376 1376
==========================================
- Hits 8738 8734 -4
- Misses 3982 3986 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
src/stats/stdlib_stats.fypp
Outdated
| end interface moment | ||
|
|
||
|
|
||
| #! Note: PCA uses SVD and EIGH which rely on LAPACK. LAPACK backends do not support extended (xdp) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not accurate. stdlib's modernized BLAS/LAPACK do support all kinds. The only caveat is that when linking against an optimized library such as OpenBLAS or MKL, they only provide support for simple and double precision, so for extended and quadruple, the internal stdlib version will be used.
src/stats/stdlib_stats_pca.fypp
Outdated
| x_centered = x | ||
| end if | ||
|
|
||
| res = matmul(x_centered, transpose(components)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advise to switch to using gemm instead to avoid temporal array allocations and ensuring use of optimized backends when linking against.
src/stats/stdlib_stats_pca.fypp
Outdated
|
|
||
|
|
||
| #:for k1, t1 in REAL_KINDS_TYPES | ||
| module function pca_transform_${k1}$(x, components, x_mean) result(res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally speaking, please preferer subroutines when returning arrays. Subroutines give more control and predictability on memory allocation/recycling. Functions (depending on the compiler options) might or might not create temp arrays or reallocate the left-hand-side of the assignment. You can always create a function on top of the subroutine to have a functional style, but for performance and safety subroutines are more commendable.
src/stats/stdlib_stats_pca.fypp
Outdated
|
|
||
| integer(ilp) :: i, n | ||
| n = size(x_reduced, 1, kind=ilp) | ||
| res = matmul(x_reduced, components) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here, please prefer gemm
src/stats/stdlib_stats_pca.fypp
Outdated
| mu = mean(x, dim=1) | ||
| if (present(x_mean)) x_mean = mu | ||
|
|
||
| err0 = linalg_state_type("pca", LINALG_SUCCESS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not necessary to initialize the state handler - leads to unnecessary overhead
| err0 = linalg_state_type("pca", LINALG_SUCCESS) |
src/stats/stdlib_stats_pca.fypp
Outdated
| do j = 1, p | ||
| idx(j) = j | ||
| end do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use array operations where straightforward
src/stats/stdlib_stats_pca.fypp
Outdated
| ! Simple bubble sort | ||
| do i = 1, p-1 | ||
| do j = i+1, p | ||
| if (lambda(idx(i)) < lambda(idx(j))) then | ||
| m = idx(i) | ||
| idx(i) = idx(j) | ||
| idx(j) = m | ||
| end if | ||
| end do | ||
| end do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use stdlib functions from the sorting module where already provided
src/stats/stdlib_stats_pca.fypp
Outdated
| err0 = linalg_state_type("pca", LINALG_ERROR, "Unknown method: "//method_) | ||
| end if | ||
|
|
||
| if (present(err)) err = err0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix state handling: should error stop on missing argument + error
| if (present(err)) err = err0 | |
| call err0%handle(err) |
|
@JAi-SATHVIK the build fails happen because in the CMakeLists.txt for the stats module you need to add something like target_link_libraries(stats PUBLIC blas lapack) Also, in my previous comment regarding the kinds support, what I meant was: please enable all kinds, stdlib provids the backends. Linking against optimized libraries is optional and would increase performance for simple and double precision. But all kinds are supported. |
|
Thank you @jalvesz , for the clarification! I've made the following updates: CMakeLists.txt: The Documentation: Updated the inline comments in both |
|
New fails are happening because of the dependency on the sorting module. This exposes an issue with the modularization of the library. The sorting module being at the root of src is not visible by the stats module within its own independent folder... we might need to reconsider next steps. One idea would be to make this library (pca) not a submodule of stats but a module in itself at the root of src such that it can use easily any other module such as stats, blas/lapack and sorting. Or there might be another approach to think about |
In this case, add
However, I agree with that. When I was working on #1081, I started to get "faked" circular dependencies.
If the CMake file is correctly written, a submodule "pca" should not be a problem. |
|
Thank you @jalvesz @jvdp1 for the insights. Regarding the immediate build failure, I will try adding ../stdlib_sorting.fypp to the src/stats/CMakeLists.txt as suggested to resolve the visibility issue with the sorting module. Regarding the structural change: I'm open to moving PCA to a standalone module in src/ if that aligns better with the library's goals for modularity and reducing compilation overhead. Should I proceed with the CMake fix first to verify the current logic, or would you prefer I start refactoring it into its own module now? |
|
I'll suggest to go step by step: first try to fix "as is", then let's continue the discussion on what would be the best strategy. |
#1080