Code optimization (first round) #528
…ata.table; in fact, using data.table will make the code slower)
jdblischak left a comment
Do you have some example code for benchmarking that is known to be slow in the app? When I just use gs_design_ahr() |> summary() |> as_gt(), I don't observe much of a difference.
# main
git2r::commits(n = 1)[[1]]
## [6b8d5ac] 2025-04-14: Merge pull request #529 from Merck/gt-latex-table-position
library("gsDesign2")
library("microbenchmark")
library("profvis")
system.time(
gs_design_ahr() |> summary() |> as_gt()
)
## user system elapsed
## 0.05 0.00 0.08
profvis(gs_design_ahr() |> summary() |> as_gt())
profvis(for (i in 1:10) gs_design_ahr() |> summary() |> as_gt())
microbenchmark(gs_design_ahr() |> summary() |> as_gt())
## Unit: milliseconds
## expr min lq mean median uq max neval
## as_gt(summary(gs_design_ahr())) 56.9447 59.9736 64.96923 62.94655 67.02665 111.5459 100

# PR
git2r::commits(n = 1)[[1]]
## [1d3675e] 2025-04-14: Merge branch 'main' into optimize-pw_info
library("gsDesign2")
library("microbenchmark")
library("profvis")
system.time(
gs_design_ahr() |> summary() |> as_gt()
)
## user system elapsed
## 0.08 0.00 0.06
profvis(gs_design_ahr() |> summary() |> as_gt())
profvis(for (i in 1:10) gs_design_ahr() |> summary() |> as_gt())
microbenchmark(gs_design_ahr() |> summary() |> as_gt())
## Unit: milliseconds
## expr min lq mean median uq max neval
## as_gt(summary(gs_design_ahr())) 52.4006 58.1329 62.64138 60.54605 63.06605 120.6613 100
Yes, …
To make …
For now, …
… unnecessary since the cumsum() result is guaranteed to be sorted and unique as long as `duration` is positive
should we use `<` instead of `<=` here?
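A minimal illustration of the claim above (with made-up durations): cumsum() over strictly positive interval lengths is strictly increasing, so its result is already sorted and contains no duplicates.

```r
duration <- c(3, 1, 5, 2)        # piecewise interval lengths, all positive
boundaries <- cumsum(duration)   # strictly increasing: 3 4 9 11
stopifnot(!is.unsorted(boundaries), !anyDuplicated(boundaries))
```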
…of filtering `start_time_fr` with `td` again
the rewrite here doesn't guarantee equivalence: I actually changed `<=` to `<`, which I feel makes more sense
since `start_time_fr` is a simple c() expression, let's just use the expression directly to save a variable name (since naming is always hard)
…mulate the data frames in a list and rbind them all at the end instead
…ll actually be a lot faster
a simple benchmark:
y = head(penguins)
system.time(for (i in 1:1e6) {
  x = vector('list', 10)  # preallocate the list
  for (j in seq_along(x)) x[[j]] = y
})
y = head(penguins)
system.time(for (i in 1:1e6) {
  x = list()  # grow the list instead of preallocating
  for (j in 1:10) x[[j]] = y
})
…table only for a simple column operation (i.e., adding a `treatment` column)
…l group the computation by `t` next (which will implicitly sort by `t`), and 2) the data is already ordered by `treatment` (via `rbind(control, experimental)` on the previous line, and `control` happens to be "smaller" than `experimental`)
My apologies. I had seen your comment earlier but then forgot about it when I was performing the benchmarking yesterday.
Confirmed! This PR improves the performance, especially for the worst-case scenario (…

# main
git2r::commits(n = 1)[[1]]
## [6b8d5ac] 2025-04-14: Merge pull request #529 from Merck/gt-latex-table-position
library("gsDesign2")
library("microbenchmark")
benchmark <- function() gs_design_ahr() |> to_integer() |> summary() |> as_gt()
microbenchmark(benchmark())
## Unit: milliseconds
## expr min lq mean median uq max neval
## benchmark() 421.8117 447.9604 598.8056 463.4538 482.0702 2983.017 100
# PR
git2r::commits(n = 1)[[1]]
## [1d3675e] 2025-04-14: Merge branch 'main' into optimize-pw_info
library("gsDesign2")
library("microbenchmark")
benchmark <- function() gs_design_ahr() |> to_integer() |> summary() |> as_gt()
microbenchmark(benchmark())
## Unit: milliseconds
## expr min lq mean median uq max neval
## benchmark() 347.3618 374.5193 397.7408 386.4939 413.2976 581.9334 100
Nice! @jdblischak, for benchmarking, maybe use a more demanding example like the app's default design. The code can be copied from the report tab; it needs to run with the latest development version of gsDesign2 from GitHub because of the recent changes to the spending function syntax.
… the data by `time` and `stratum`, which we have already done; the columns seem to be in this order already, so there is no need to reorder them
….frame instead (although it doesn't really make much difference)
…et by stratum first, then loop on total_duration, otherwise we may be repeatedly subsetting the data for each total_duration
LittleBeannie left a comment
Hi @yihui, thank you for your code optimization! Rather than calling this a review, I would describe it as a learning opportunity for me to pick up all the tricks! I sat at my desk for a couple of hours to digest and learn everything. Below, please find my minor comments for your consideration:
if (theta[n_analysis] < 0) {
  stop("gs_design_npe() or gs_power_npe(): final effect size must be > 0!")
}
check_theta <- function(theta, K) {
Please lower case the K, as all gsDesign2 source code is in snake case.
I can definitely make that change, but I feel we should occasionally break the snake case rule. In this particular case, I think it makes more sense to match the mathematical notation. In our math notation, we use K to denote the number of analyses, and lower case k as the subscript. What do you think?
BTW, this is an internal function not for end users, so the coding style consistency matters less.
  ),
  total_duration = 25,
  simple = TRUE) {
  # Check input values ----
After the re-coding of expected_event, how much running time can we expect to save? I am okay with the changes in the input checking, but I am cautious about the rest. The computation of the expected event is very complex, and I am concerned that a small change may impact areas we have not previously exposed.
If we decide to proceed with these edits, we should also update this vignette (https://merck.github.io/gsDesign2/articles/story-compute-expected-events.html#organizing-calculations-under-the-piecewise-model) to keep everything in sync. I would prefer not to postpone the merge of this PR. Would it be acceptable to leave the edits for expected_event for now and potentially create another PR if needed?
The code has been restored.
  H = cumsum(h * duration), # cumulative hazard
  survival = exp(-H) # survival
)
H <- cumulative_rate(x, duration, rate, last_(rate)) # cumulative hazard
Please avoid the upper case variable name H.
That's what we were using previously, although H was a column name instead of a variable name. I think mathematically it makes sense to denote hazard by h and cumulative hazard by H.
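For context, the piecewise-exponential relation the diff above works with: given interval lengths duration and piecewise hazard rates h, the cumulative hazard at each interval end is H = cumsum(h * duration) and survival is exp(-H). A toy check with made-up numbers:

```r
duration <- c(2, 3)         # interval lengths
h <- c(0.1, 0.2)            # piecewise hazard rates
H <- cumsum(h * duration)   # cumulative hazard: 0.2, 0.8
survival <- exp(-H)         # survival at interval ends
stopifnot(all.equal(survival, exp(-c(0.2, 0.8))))
```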
if (survival[1] > 1) stop("`survival` must not be greater than 1")
if (last_(survival) >= 1) stop("`survival` must have at least one value < 1")

ans <- tibble(Times = times, Survival = survival) %>%
I like the old way in s2pwe, which is more statistically clear. Since s2pwe is commonly used as a stand-alone function, would it be okay to keep the old implementation?
I understand that you may prefer the old tibble() %>% mutate() %>% select() structure, but let me explain the redundancy in the code.
First, the original code:
tibble(Times = times, Survival = survival) %>%
  mutate(
    duration = Times - fastlag(Times, first = 0),
    H = -log(Survival),
    rate = (H - fastlag(H, first = 0)) / duration
  ) %>%
  select(duration, rate)

Capitalizing names to Times and Survival seems to be unnecessary, so we simplify it to:
tibble(
  duration = times - fastlag(times, first = 0),
  H = -log(survival),
  rate = (H - fastlag(H, first = 0)) / duration
) %>%
  select(duration, rate)

I have created a simple helper function diff_one() (a shorthand for diff(c(0, x))) for the pattern x - fastlag(x, first = 0) (which appears in this package several times), so we call diff_one() instead:
tibble(
  duration = diff_one(times),
  H = -log(survival),
  rate = diff_one(H) / duration
) %>%
  select(duration, rate)

At this point, you realize the only reason we need select() is the intermediate column H, so why not just compute H first and create the tibble() directly? There is no need to create a column in tibble() and then remove it.
H <- -log(survival)
tibble(duration = diff_one(times), rate = diff_one(H) / duration)

As soon as you know what diff_one() means (as I said, this first-order diff operation is quite common throughout the package), the two lines of new code should be clearer than the seven lines of old code.
force-pushed from 35bd79b to 7fffe00
yihui left a comment
I've restored expected_event() as requested, and will create a separate PR (unless we are going to call gsDesign::eEvents() in it).
To me, the only minor issue left is the variable name K. I have renamed it to k but I feel we should break the snake case rule here and use capital K to match the math notation: #528 (comment)
@LittleBeannie If you don't have objections, I'll revert the commit 7c18ad1, and this PR will be ready to merge. If you have a strong preference for the lowercase k, the PR can be merged right now.
  ),
  total_duration = 25,
  simple = TRUE) {
  # Check input values ----
The code has been restored.
LittleBeannie left a comment
Thank you for explaining your thoughts to me, @yihui! I now have a better understanding of the simplified code. I appreciate the significant reduction in code complexity, and I’m grateful for your efforts in completing this extensive task!
Since some updates to …

# main
git2r::commits(n = 1)[[1]]
## [a22f40d] 2025-05-08: Merge pull request #543 from Merck/533-the-design-of-the-weight-argument-in-gs_xxx_wlr
system.time(benchmark())
## user system elapsed
## 1.91 0.06 1.96
microbenchmark(benchmark())
## Unit: seconds
## expr min lq mean median uq max neval
## benchmark() 1.819074 1.949909 3.403863 2.061791 2.6421 12.36416 100
# PR
git2r::commits(n = 1)[[1]]
## [380f6c4] 2025-05-07: revert 7c18ad1: k -> K
# cold
clrhash(gsDesign2:::fun_hash)
system.time(benchmark())
## user system elapsed
## 1.02 0.03 1.17
# cached
microbenchmark(benchmark())
## Unit: milliseconds
## expr min lq mean median uq max neval
## benchmark() 167.3944 173.5312 180.8217 176.8842 181.3281 276.1453 100
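The "cold" vs "cached" timings above reflect hash-based memoization; the internal details of gsDesign2's fun_hash and clrhash() are not shown here, so the following is only a hedged sketch of the general pattern, not the package's actual implementation (it also assumes the third-party {digest} package for hashing):

```r
cache <- new.env(parent = emptyenv())

memoize <- function(f) {
  function(...) {
    key <- digest::digest(list(...))  # hash the arguments ({digest} assumed)
    if (is.null(cache[[key]])) cache[[key]] <- f(...)  # compute once ("cold")
    cache[[key]]                                       # reuse later ("cached")
  }
}

clear_cache <- function() rm(list = ls(cache), envir = cache)  # like clrhash()
```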
jdblischak left a comment
Amazing work @yihui!
#' @importFrom stats pnorm qnorm setNames uniroot
#' @importFrom tibble tibble
#' @importFrom utils tail
#' @import utils
This would be a good discussion topic for a future group meeting. I prefer to explicitly list each imported function like we currently do for {data.table} and {dplyr}.
For base packages, I usually import them fully. For third-party packages, I usually import functions explicitly. I'm okay with selective imports for base packages, too. In fact, the stats package has to be imported selectively to avoid the conflict between stats::filter and dplyr::filter.
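For illustration, the selective-import style being discussed looks roughly like this in roxygen2; the point is that stats is imported without its filter, so dplyr::filter can be imported without a conflict:

```r
# In a package's R/ file: these roxygen2 tags generate importFrom()
# directives in NAMESPACE. Because stats is imported selectively
# (without filter), dplyr::filter does not collide with stats::filter.

#' @importFrom stats pnorm qnorm uniroot
#' @importFrom dplyr filter mutate
NULL
```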
Close #527