[ML] Gain upper bound estimation for classification and regression#1537
valeriy42 merged 39 commits into elastic:master from
Conversation
tveasey
left a comment
I know this is still work in progress and needs a little tidying up. I made a couple of comments on the implementation to make the update of the bound related information as efficient as possible.
@tveasey I finished up this PR. It would be great if you could take another look at it.
@tveasey Also, do you think we need to update the memory estimation?
I don't think so: the extra memory is accounted for in the sizeof of the derivatives object, so it is already covered.
tveasey
left a comment
Overall looks great. There are a couple of potential issues I think. One is just in the temporary stats you compute, but the other looks like a real issue caused by unsigned integer underflow. I also made some minor suggestions for readability and consistency.
class MATHS_EXPORT CSplitsDerivatives {
public:
    using TDerivativesVec = std::vector<CDerivatives>;
    using TDerivativesMappedVec = boosted_tree_detail::TAlignedMemoryMappedFloatVector;
We reserve Vec for std::vector in all other code so I find this name misleading. However, we actually already have this type declared so I would just use TMemoryMappedFloatVector wherever you currently use TDerivativesMappedVec and delete this typedef.
I removed the typedef as you suggested.
private:
    using TDerivativesVecVec = std::vector<TDerivativesVec>;
    using TAlignedDoubleVec = std::vector<double, core::CAlignedAllocator<double>>;
    using TDerivatives2D = Eigen::Matrix<double, 2, 1>;
This type name breaks our standard naming conventions, which say that the type name should reflect the type, not its use. In other cases in the code where we have fixed-size stack-based vectors we would use something like TDoubleVector2x1.
    bool assignMissingToLeft)
    bool assignMissingToLeft,
    std::size_t leftChildRowCount,  // TODO remove after stats measurement
    std::size_t rightChildRowCount, // TODO remove after stats measurement
Do you plan to remove these in a follow up PR or can they be removed now?
I will remove them in a follow-up PR.
    static_cast<std::int64_t>(this->memoryUsage()) - lastMemoryUsage);

    if (m_Instrumentation != nullptr) {
        LOG_INFO(<< "Statistics computed: " << m_Instrumentation->statisticsComputed()
I'm not sure this should make it into the logs. It is interesting for us to know, but I can't see a user wanting to know this. If you want it committed for QA, I would leave a TODO to downgrade this logging to trace later on.
I added a TODO comment and will remove this output completely together with the instrumentation code.
    if (cl[ASSIGN_MISSING_TO_LEFT] == 0 || cl[ASSIGN_MISSING_TO_LEFT] == c) {
        gain[ASSIGN_MISSING_TO_LEFT] = -INF;
    } else {
        minLossLeft[ASSIGN_MISSING_TO_LEFT] = minimumLoss(
Does it matter that we only set minLossLeft in this branch? I don't think so, because these values are only used if gain > -INF, but this is not immediately obvious from the code. I think it warrants a comment.
You are right, I added a comment.
double CBoostedTreeLeafNodeStatistics::childMaxGain(double gChild,
                                                    double minLossChild,
                                                    double lambda) const {
I think it might be worthwhile writing a brief outline of what this is doing, i.e. something along the lines of: "This computes the maximum possible gain we can expect splitting a child node given we know the sum of the positive (g^+) and negative gradients (g^-) at its parent, the minimum curvature on the positive and negative gradient set (hmin^+ and hmin^-) and largest and smallest gradient (gmax and gmin, respectively). The highest possible gain consistent with these constraints can be shown to be: (g^+)^2 / (hmin^+ g^+ / gmax + lambda) + (g^-)^2 / (hmin^- g^- / gmin + lambda)".
I added the explanation comment as you suggested.
[ML] Gain upper bound estimation for classification and regression (elastic#1537)

In this PR we start computing an upper bound on the potential gain from splitting a node. If the upper bound of the gain is lower than the currently smallest gain among all candidates, we ignore the node and in this way avoid computations that are especially expensive on large datasets. Since we only skip splits that would not have been added to the tree anyway, this PR does not change the qualitative results. At the moment, we can only compute the upper bound for regression and binary classification. For multiclass classification we proceed as before. Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.
[ML] Gain upper bound estimation for classification and regression (#1568)

Backport of #1537. Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.

* windows compilation error fixed
* fix compiler issues
* fixing compiling issues
Debugging an intermittent SIGSEGV triggered by another change I'm working on showed up this potential out-of-bounds read. It would happen very infrequently. It was introduced by elastic#1537.
In this PR we start computing an upper bound on the potential gain from splitting a node. If the upper bound of the gain is lower than the currently smallest gain among all candidates, we ignore the node and in this way avoid computations that are especially expensive on large datasets.
Since we only skip splits that would not have been added to the tree anyway, this PR does not change the qualitative results.
At the moment, we can only compute the upper bound for regression and binary classification. For multiclass classification we proceed as before.
Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.