Skip to content

Bdt performance#100

Closed
behrenhoff wants to merge 15 commits into
root-project:masterfrom
behrenhoff:bdt-performance
Closed

Bdt performance#100
behrenhoff wants to merge 15 commits into
root-project:masterfrom
behrenhoff:bdt-performance

Conversation

@behrenhoff
Copy link
Copy Markdown
Contributor

Increase the speed of BDT training.

For regression analysis with Grad boosting, the speed gain is almost 2x.
For multiclass the gain depends on the number of multiclasses.
For classification: haven't done the test.

Non BDT algorithms will also be faster (assuming the progress bar is enabled).

Improve performance: many lookups in the residuals map,
but order is irrelevant.
Performance analysis has shown that most CPU time for BDT calculation
is spent in the exp(x) function.

In particular, for a vector<Double_t> of size nClasses for every i,j the
following is calculated: exp(fResiduals[*e].at(j)-fResiduals[*e].at(i)),
i.e. in total there are O(nClasses**2) exp calculations. This can be
replaced by precalculting O(nClasses) exp values and further using division.

While this gives identical results in symbolic calculations, the results
in numeric calculations using Double_t's might differ.
Some typecasts in the code were useless (i.e. casting T* to T*).

For the dynamic_cast -> static_cast:
The DecisionTree only contains DecisionTreeNode* pointers as nodes.
Therefore one can safely use static_cast and avoid the runtime cost of
dynamic_cast (this is a relevant cost factor here!).
The function was too long to understand and was using a lot
of C style code.

Main features of the rewrite:
* Got rid of all new and delete calls
* The variables relevant for each input variable are
  encapsulated in the new class DecisionTreeVariableDetail,
  the variables relevant for each bin in the new class
  BinDetail.
* Factored out the code for fisherCoeff calculation
* Use foreach style C++11 loops where possible (unfortunately
  still many index-based loops left)
Avoids 2 checks when variables are known to be nullptr beforehand.
GradBoostRegression:
- replace two maps with one unordered_map
- use C++11 loops
- remove unused variable i
- calculate event weight once instead of twice

GradBoost:
- modernize code
- map->unordered_map, reduce map lookups by a factor of 2
The TMVA::Timer::DrawProgressBar( Int_t icounts, const TString& comment) function
used to be very expensive, it would redraw the bar every time it was called.

Now the previous state of the progress bar is cached so the the progress bar is
only redrawn when the output will look different.

Now that the timer is fast, add it to the TestRegression method.
By reordering the loops we are much more cache friendly.
Instead of looping over all trees nClasses times and taking
only every nClasses tree, we now run over all trees once and
index into the small local temp vector.

Also initialize the vector in the beginning, thus avoid possible
reallocation.

In addition, do the "replace exp(a-b) by exp(a)/exp(b)" trick
(does not help much here).

For cases with 5000 trees and 18 classes, we get approx
30% runtime performance gain on an Intel i7-4790K.
Warning: this is experimental, works for me but should be better tested!
A map of Node* -> vector<double> stores a sum and sort of a squared
sum in the vector, i.e. it has either zero elements (on creation)
or 2 elemnts when used.

There is no need to use a vector here. We can simply use two named
double variables instead. In addition to better runtime performance,
we get better naming and less code for free (remove empty() branch).
It seems the default std::hash functor for these types is not optimal.
This patch suggests to right shift the pointer value before passing
it on to std::hash<size_t>. This improves speed on my system because
the rightmost bits of the pointers are always zero and somehow the
default hash implementation doesn't like that.

There might be better ideas to improve the maps of these pointers...

Warning: This is compiler/architecture dependent!
Tested on 64 bit linux using gcc 4.8.4
Following a suggestion by Axel Naumann, I have moded the two variables into
the TMVA::Event. This removes the need for map lookups when boosting.

See:
https://sft.its.cern.ch/jira/browse/ROOT-8006?focusedCommentId=73554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-73554

Currently using public members, but should be good enough for testing the
correctness and the performance gain.
@dpiparo
Copy link
Copy Markdown
Member

dpiparo commented Mar 6, 2017

Hi, I know it's quite late but would you agree to close this PR given the recent merging which happened in the past week?

@phsft-bot
Copy link
Copy Markdown

Can one of the admins verify this patch?

@behrenhoff behrenhoff closed this Mar 6, 2017
@behrenhoff
Copy link
Copy Markdown
Contributor Author

closed, sure. Even though I still want to port the threading code. It needs to be rebased / reworked anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants