Bdt performance by behrenhoff · Pull Request #100 · root-project/root

behrenhoff · 2015-10-14T15:29:22Z

Increase the speed of BDT training.

For regression analysis with Grad boosting, the speed gain is almost 2x.
For multiclass the gain depends on the number of classes.
For classification: I haven't tested this yet.

Non BDT algorithms will also be faster (assuming the progress bar is enabled).

Improve performance: many lookups in the residuals map, but order is irrelevant.

Performance analysis has shown that most CPU time for BDT calculation is spent in the exp(x) function. In particular, for a vector<Double_t> of size nClasses, the following is calculated for every pair i,j: exp(fResiduals[*e].at(j)-fResiduals[*e].at(i)), i.e. in total there are O(nClasses**2) exp calculations. This can be replaced by precalculating O(nClasses) exp values and then using division. While this gives identical results in symbolic calculations, the results in numeric calculations using Double_t values might differ.
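A minimal sketch of the idea, not the actual TMVA code: it assumes the pairwise exponentials are summed into a softmax-style normalisation, and the function names ProbPairwise and ProbPrecomputed are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// O(n^2) exp calls: p_i = 1 / sum_j exp(r_j - r_i)
std::vector<double> ProbPairwise(const std::vector<double>& r) {
    assert(!r.empty());
    std::vector<double> p(r.size());
    for (std::size_t i = 0; i < r.size(); ++i) {
        double norm = 0.0;
        for (std::size_t j = 0; j < r.size(); ++j)
            norm += std::exp(r[j] - r[i]);
        p[i] = 1.0 / norm;
    }
    return p;
}

// O(n) exp calls: p_i = exp(r_i) / sum_j exp(r_j)
std::vector<double> ProbPrecomputed(const std::vector<double>& r) {
    assert(!r.empty());
    std::vector<double> p(r.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < r.size(); ++i) {
        p[i] = std::exp(r[i]);  // one exp per class
        sum += p[i];
    }
    for (double& v : p) v /= sum;  // division replaces the remaining exp calls
    return p;
}
```

The two versions agree up to rounding for moderate inputs. For very large residuals exp(r_i) can overflow where exp(r_j - r_i) would not, which is one way the numeric results might differ.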

Some typecasts in the code were useless (i.e. casting T* to T*). For the dynamic_cast -> static_cast: The DecisionTree only contains DecisionTreeNode* pointers as nodes. Therefore one can safely use static_cast and avoid the runtime cost of dynamic_cast (this is a relevant cost factor here!).
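To illustrate the cast change, here is a small sketch with hypothetical Node and TreeNode types standing in for the real TMVA classes. The static_cast variant is only valid under the assumption stated above, namely that every node in the tree really has the derived type.

```cpp
#include <cassert>

// Hypothetical stand-ins for the base node and the decision tree node.
struct Node {
    virtual ~Node() = default;
    Node* left = nullptr;
};
struct TreeNode : Node {
    double response = 0.0;
};

// Checked downcast: performs a runtime type check on every call.
inline TreeNode* LeftChecked(const Node& n) {
    return dynamic_cast<TreeNode*>(n.left);
}

// Unchecked downcast: no runtime check in release builds.
inline TreeNode* LeftUnchecked(const Node& n) {
    // Debug-only sanity check that the assumption about the node type holds.
    assert(n.left == nullptr || dynamic_cast<TreeNode*>(n.left) != nullptr);
    return static_cast<TreeNode*>(n.left);
}
```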

…educe map lookups.

The function was too long to understand and was using a lot of C style code. Main features of the rewrite:
* Got rid of all new and delete calls
* The variables relevant for each input variable are encapsulated in the new class DecisionTreeVariableDetail, the variables relevant for each bin in the new class BinDetail.
* Factored out the code for fisherCoeff calculation
* Use foreach style C++11 loops where possible (unfortunately still many index-based loops left)

Avoids 2 checks when variables are known to be nullptr beforehand.

GradBoostRegression:
- replace two maps with one unordered_map
- use C++11 loops
- remove unused variable i
- calculate event weight once instead of twice

GradBoost:
- modernize code
- map->unordered_map, reduce map lookups by a factor of 2
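The lookup reduction can be sketched as follows. Event, ResidualMap and UpdateResidual are hypothetical names for illustration, not the TMVA types; the point is only that the reference returned by one lookup is reused instead of hashing the key a second time.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-in for an event with a weight and a regression target.
struct Event {
    double weight;
    double target;
};

using ResidualMap = std::unordered_map<const Event*, double>;

// Stores the new residual and returns the weighted residual.
double UpdateResidual(ResidualMap& residuals, const Event* ev, double prediction) {
    assert(ev != nullptr);
    double& r = residuals[ev];  // single lookup; the reference is reused below
    r = ev->target - prediction;
    return r * ev->weight;      // no second residuals[ev] lookup needed
}
```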

The TMVA::Timer::DrawProgressBar( Int_t icounts, const TString& comment) function used to be very expensive because it redrew the bar every time it was called. Now the previous state of the progress bar is cached, so the bar is only redrawn when the output would look different. Now that the timer is fast, add it to the TestRegression method.
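A rough sketch of the caching idea, independent of the real TMVA::Timer: the class CachedProgressBar and its members are hypothetical, and the visible state is assumed to consist of the number of filled cells plus an integer percentage.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Hypothetical progress bar that remembers what it last drew.
class CachedProgressBar {
public:
    explicit CachedProgressBar(int total, int width = 50)
        : fTotal(total), fWidth(width) {
        assert(total > 0 && width > 0);
    }

    // Recomputes the visible state; returns true only if it changed.
    bool Update(int count) {
        if (count < 0) count = 0;
        if (count > fTotal) count = fTotal;
        const int filled  = static_cast<int>(static_cast<long long>(count) * fWidth / fTotal);
        const int percent = static_cast<int>(static_cast<long long>(count) * 100 / fTotal);
        if (filled == fFilled && percent == fPercent) return false;
        fFilled  = filled;
        fPercent = percent;
        return true;
    }

    // Writes to the stream only when Update() reports a visible change.
    void Draw(int count, std::ostream& os = std::clog) {
        if (!Update(count)) return;
        os << '\r' << '[' << std::string(fFilled, '#')
           << std::string(fWidth - fFilled, '.') << "] "
           << fPercent << '%' << std::flush;
    }

private:
    int fTotal;
    int fWidth;
    int fFilled  = -1;  // -1 forces the first call to draw
    int fPercent = -1;
};
```

With this kind of cache, calling Draw once per event costs a few integer operations in the common case, and terminal output happens only when the percentage or the bar length changes.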

By reordering the loops we are much more cache friendly. Instead of looping over all trees nClasses times and taking only every nClasses-th tree, we now run over all trees once and index into the small local temp vector. Also initialize the vector at the beginning, thus avoiding possible reallocation. In addition, do the "replace exp(a-b) by exp(a)/exp(b)" trick (does not help much here). For cases with 5000 trees and 18 classes, we get approximately 30% runtime performance gain on an Intel i7-4790K.
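The loop reordering might look roughly like this. It is a simplified sketch: it assumes the forest is stored interleaved by class (tree t belongs to class t % nClasses) and uses a plain vector of precomputed tree responses in place of evaluating each tree; both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: nClasses passes over the whole forest, using every nClasses-th tree.
std::vector<double> SumPerClassStrided(const std::vector<double>& responses,
                                       std::size_t nClasses) {
    assert(nClasses > 0);
    std::vector<double> sums;
    for (std::size_t c = 0; c < nClasses; ++c) {
        double s = 0.0;
        for (std::size_t t = 0; t < responses.size(); ++t)
            if (t % nClasses == c) s += responses[t];
        sums.push_back(s);  // may reallocate while growing
    }
    return sums;
}

// After: one pass over the forest, accumulating into a small preallocated vector.
std::vector<double> SumPerClassSinglePass(const std::vector<double>& responses,
                                          std::size_t nClasses) {
    assert(nClasses > 0);
    std::vector<double> sums(nClasses, 0.0);  // sized once up front
    for (std::size_t t = 0; t < responses.size(); ++t)
        sums[t % nClasses] += responses[t];
    return sums;
}
```

Both versions add the responses of each class in the same order, so they should produce the same sums; only the memory access pattern differs.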

Warning: this is experimental, works for me but should be better tested!

A map of Node* -> vector<double> stores a sum and sort of a squared sum in the vector, i.e. it has either zero elements (on creation) or 2 elements when used. There is no need to use a vector here. We can simply use two named double variables instead. In addition to better runtime performance, we get better naming and less code for free (remove empty() branch).
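As a sketch of the replacement, with hypothetical names (Node, LeafSums, Accumulate) and an assumed form for the two accumulated quantities, since the exact formulas are not given here:

```cpp
#include <cassert>
#include <unordered_map>

struct Node {};  // hypothetical stand-in for the tree node type

// Two named doubles instead of a vector<double> holding zero or two elements.
struct LeafSums {
    double sum  = 0.0;  // assumed: weighted sum
    double sum2 = 0.0;  // assumed: weighted sum of squares
};

using LeafMap = std::unordered_map<const Node*, LeafSums>;

void Accumulate(LeafMap& leaves, const Node* leaf, double value, double weight) {
    assert(leaf != nullptr);
    // A new entry starts at zero, so no empty() check is needed.
    LeafSums& s = leaves[leaf];
    s.sum  += weight * value;
    s.sum2 += weight * value * value;
}
```

Besides removing the branch, this avoids one heap allocation per map entry, since the two doubles live directly in the map node.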

It seems the default std::hash functor for these types is not optimal. This patch suggests right-shifting the pointer value before passing it on to std::hash<size_t>. This improves speed on my system because the rightmost bits of the pointers are always zero and somehow the default hash implementation doesn't like that. There might be better ideas to improve the maps of these pointers... Warning: This is compiler/architecture dependent! Tested on 64 bit linux using gcc 4.8.4
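A hash functor along these lines could look like the sketch below. The name ShiftedPtrHash and the shift of 3 bits (assuming at least 8-byte alignment) are assumptions for illustration; the shift used in the patch is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical hash functor: drops the low pointer bits before hashing.
struct ShiftedPtrHash {
    static constexpr unsigned kShift = 3;  // assumption: objects are 8-byte aligned

    template <class T>
    std::size_t operator()(const T* p) const noexcept {
        const auto bits = reinterpret_cast<std::uintptr_t>(p);
        // Debug-only check: a properly aligned pointer has no information in its low bits.
        assert(bits % alignof(T) == 0);
        return std::hash<std::size_t>()(static_cast<std::size_t>(bits >> kShift));
    }
};

// Convenience alias: an unordered_map keyed by pointer, using the shifted hash.
template <class T, class V>
using PtrMap = std::unordered_map<const T*, V, ShiftedPtrHash>;
```

The functor is passed as the third template argument of std::unordered_map. Whether it helps depends on the standard library's std::hash and bucket selection, so it would need to be measured per platform.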

Following a suggestion by Axel Naumann, I have moved the two variables into the TMVA::Event. This removes the need for map lookups when boosting. See: https://sft.its.cern.ch/jira/browse/ROOT-8006?focusedCommentId=73554&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-73554 Currently using public members, but should be good enough for testing the correctness and the performance gain.

dpiparo · 2017-03-06T15:36:03Z

Hi, I know it's quite late but would you agree to close this PR given the recent merging which happened in the past week?

phsft-bot · 2017-03-06T15:36:06Z

Can one of the admins verify this patch?

behrenhoff · 2017-03-06T16:01:12Z

closed, sure. Even though I still want to port the threading code. It needs to be rebased / reworked anyway.

map -> unordered_map for f[Weighted]Residuals

fecffc0

Improve performance: many lookups in the residuals map, but order is irrelevant.

Axel-Naumann assigned lmoneta Oct 15, 2015

behrenhoff force-pushed the bdt-performance branch from eb26dbe to 20989f1 Compare October 28, 2015 18:49

behrenhoff added 6 commits November 3, 2015 16:54

Remove trailing whitespace

4e9819a

pass double's as double's, not as references

ffdc57a

Remove some vector.at() calls and replace with [], modernize loops, r…

b83eb25

…educe map lookups.

behrenhoff force-pushed the bdt-performance branch from 20989f1 to aa5c81c Compare November 4, 2015 16:08

behrenhoff added 3 commits November 4, 2015 18:10

Add inline Float_t GetValueFast(UInt_t).

ee8aeca

Avoids 2 checks when variables are known to be nullptr beforehand.

Modernize code in GradBoost[Regression]

f11eed6

GradBoostRegression:
- replace two maps with one unordered_map
- use C++11 loops
- remove unused variable i
- calculate event weight once instead of twice

GradBoost:
- modernize code
- map->unordered_map, reduce map lookups by a factor of 2

behrenhoff force-pushed the bdt-performance branch from aa5c81c to 8f8f477 Compare November 4, 2015 17:10

behrenhoff added 2 commits January 21, 2016 17:28

Experimental: Run in parallel

84b0896

Warning: this is experimental, works for me but should be better tested!

behrenhoff force-pushed the bdt-performance branch from 8f8f477 to 84b0896 Compare January 21, 2016 16:59

behrenhoff added 3 commits February 18, 2016 09:39

behrenhoff force-pushed the bdt-performance branch from 5bb1a0c to c6fb614 Compare February 24, 2016 17:21

peremato unassigned lmoneta Mar 1, 2017

behrenhoff closed this Mar 6, 2017

ethereal-space-cadet16 mentioned this pull request May 31, 2022

Accessing pyROOT #10676

Closed

behrenhoff wants to merge 15 commits into root-project:master from behrenhoff:bdt-performance

4 participants
