Bdt performance#100
Closed
behrenhoff wants to merge 15 commits into
Closed
Conversation
Improve performance: many lookups in the residuals map, but order is irrelevant.
eb26dbe to
20989f1
Compare
Performance analysis has shown that most CPU time for BDT calculation is spent in the exp(x) function. In particular, for a vector<Double_t> of size nClasses for every i,j the following is calculated: exp(fResiduals[*e].at(j)-fResiduals[*e].at(i)), i.e. in total there are O(nClasses**2) exp calculations. This can be replaced by precalculting O(nClasses) exp values and further using division. While this gives identical results in symbolic calculations, the results in numeric calculations using Double_t's might differ.
Some typecasts in the code were useless (i.e. casting T* to T*). For the dynamic_cast -> static_cast: The DecisionTree only contains DecisionTreeNode* pointers as nodes. Therefore one can safely use static_cast and avoid the runtime cost of dynamic_cast (this is a relevant cost factor here!).
…educe map lookups.
The function was too long to understand and was using a lot of C style code. Main features of the rewrite: * Got rid of all new and delete calls * The variables relevant for each input variable are encapsulated in the new class DecisionTreeVariableDetail, the variables relevant for each bin in the new class BinDetail. * Factored out the code for fisherCoeff calculation * Use foreach style C++11 loops where possible (unfortunately still many index-based loops left)
20989f1 to
aa5c81c
Compare
Avoids 2 checks when variables are known to be nullptr beforehand.
GradBoostRegression: - replace two maps with one unordered_map - use C++11 loops - remove unused variable i - calculate event weight once instead of twice GradBoost: - modernize code - map->unordered_map, reduce map lookups by a factor of 2
The TMVA::Timer::DrawProgressBar( Int_t icounts, const TString& comment) function used to be very expensive, it would redraw the bar every time it was called. Now the previous state of the progress bar is cached so the the progress bar is only redrawn when the output will look different. Now that the timer is fast, add it to the TestRegression method.
aa5c81c to
8f8f477
Compare
By reordering the loops we are much more cache friendly. Instead of looping over all trees nClasses times and taking only every nClasses tree, we now run over all trees once and index into the small local temp vector. Also initialize the vector in the beginning, thus avoid possible reallocation. In addition, do the "replace exp(a-b) by exp(a)/exp(b)" trick (does not help much here). For cases with 5000 trees and 18 classes, we get approx 30% runtime performance gain on an Intel i7-4790K.
Warning: this is experimental, works for me but should be better tested!
8f8f477 to
84b0896
Compare
A map of Node* -> vector<double> stores a sum and sort of a squared sum in the vector, i.e. it has either zero elements (on creation) or 2 elemnts when used. There is no need to use a vector here. We can simply use two named double variables instead. In addition to better runtime performance, we get better naming and less code for free (remove empty() branch).
It seems the default std::hash functor for these types is not optimal. This patch suggests to right shift the pointer value before passing it on to std::hash<size_t>. This improves speed on my system because the rightmost bits of the pointers are always zero and somehow the default hash implementation doesn't like that. There might be better ideas to improve the maps of these pointers... Warning: This is compiler/architecture dependent! Tested on 64 bit linux using gcc 4.8.4
Following a suggestion by Axel Naumann, I have moded the two variables into the TMVA::Event. This removes the need for map lookups when boosting. See: https://sft.its.cern.ch/jira/browse/ROOT-8006?focusedCommentId=73554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-73554 Currently using public members, but should be good enough for testing the correctness and the performance gain.
5bb1a0c to
c6fb614
Compare
Member
|
Hi, I know it's quite late but would you agree to close this PR given the recent merging which happened in the past week? |
|
Can one of the admins verify this patch? |
Contributor
Author
|
closed, sure. Even though I still want to port the threading code. It needs to be rebased / reworked anyway. |
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Increase the speed of BDT training.
For regression analysis with Grad boosting, the speed gain is almost 2x.
For multiclass the gain depends on the number of multiclasses.
For classification: haven't done the test.
Non BDT algorithms will also be faster (assuming the progress bar is enabled).