[ML] Gain upper bound estimation for classification and regression#1537

Merged
valeriy42 merged 39 commits into elastic:master from valeriy42:branch-and-bounds
Nov 11, 2020

Conversation

@valeriy42
Contributor

@valeriy42 valeriy42 commented Oct 15, 2020

In this PR we start computing an upper bound on the potential gain from splitting a node. If the upper bound of the gain is lower than the currently smallest gain among all candidates, we ignore the node, thereby avoiding computations that are especially expensive on large datasets.

Since we only avoid computing splits that wouldn't have been added to the tree anyway, this PR does not change the qualitative results.

At the moment, we can only compute the upper bound for regression and binary classification. For multiclass classification we proceed as before.

Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after testing.
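The pruning idea described above can be sketched in a few lines. This is a minimal, simplified illustration (it tracks a single running best gain rather than the smallest gain among a set of retained candidates, and all names such as `Candidate` and `bestGainWithPruning` are hypothetical, not the actual ml-cpp implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// A split candidate: a cheap analytic upper bound on the gain, plus an
// expensive callable that computes the exact gain (standing in for the
// full scan over candidate splits of a node).
struct Candidate {
    double gainUpperBound;
    std::function<double()> computeGain;
};

// Return the best exact gain. Whenever a candidate's cheap upper bound
// cannot beat the best exact gain found so far, skip the expensive
// computation entirely; `skipped` counts how often pruning fired.
double bestGainWithPruning(const std::vector<Candidate>& candidates,
                           std::size_t& skipped) {
    double best = -std::numeric_limits<double>::infinity();
    skipped = 0;
    for (const auto& c : candidates) {
        if (c.gainUpperBound <= best) {
            ++skipped; // the bound proves this candidate cannot win
            continue;
        }
        best = std::max(best, c.computeGain());
    }
    return best;
}
```

Because the upper bound is sound (never below the true gain), skipping a candidate whose bound is already beaten cannot change which split is ultimately chosen, which is why the qualitative results are unaffected.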

Contributor

@tveasey tveasey left a comment


I know this is still work in progress and needs a little tidying up. I made a couple of comments on the implementation to make updating the bound-related information as efficient as possible.

@valeriy42 valeriy42 changed the title from [WIP][ML] Branches and Bounds to [ML] Gain upper bound estimation for classification and regression Nov 10, 2020
@valeriy42 valeriy42 marked this pull request as ready for review November 10, 2020 10:24
@valeriy42
Contributor Author

@tveasey I finished up this PR. It would be great if you could take another look at it.

@valeriy42
Contributor Author

@tveasey Also, do you think we need to update the memory estimation?

@tveasey
Contributor

tveasey commented Nov 10, 2020

Also, do you think we need to update the memory estimation?

I don't think so: the extra memory is accounted for in the sizeof of the derivatives object, so it should already be covered.

Contributor

@tveasey tveasey left a comment


Overall this looks great. There are a couple of potential issues, I think. One is just in the temporary stats you compute, but the other looks like a real issue caused by unsigned integer underflow. I also made some minor suggestions for readability and consistency.

class MATHS_EXPORT CSplitsDerivatives {
public:
using TDerivativesVec = std::vector<CDerivatives>;
using TDerivativesMappedVec = boosted_tree_detail::TAlignedMemoryMappedFloatVector;
Contributor


We reserve Vec for std::vector in all other code, so I find this name misleading. However, we actually already have this type declared, so I would just use TMemoryMappedFloatVector wherever you currently use TDerivativesMappedVec and delete this typedef.

Contributor Author


I removed the typedef as you suggested.

private:
using TDerivativesVecVec = std::vector<TDerivativesVec>;
using TAlignedDoubleVec = std::vector<double, core::CAlignedAllocator<double>>;
using TDerivatives2D = Eigen::Matrix<double, 2, 1>;
Contributor

@tveasey tveasey Nov 10, 2020


This type name breaks our standard naming conventions, which say that the type name should reflect the type, not its use. In other cases in the code where we have fixed-size stack-based vectors we use something like TDoubleVector2x1.

Contributor Author


Renamed

bool assignMissingToLeft)
bool assignMissingToLeft,
std::size_t leftChildRowCount, // TODO remove after stats measurement
std::size_t rightChildRowCount, // TODO remove after stats measurement
Contributor


Do you plan to remove these in a follow-up PR, or can they be removed now?

Contributor Author


I will remove them in a follow-up PR.

static_cast<std::int64_t>(this->memoryUsage()) - lastMemoryUsage);

if (m_Instrumentation != nullptr) {
LOG_INFO(<< "Statistics computed: " << m_Instrumentation->statisticsComputed()
Contributor


I'm not sure this should make it into the logs. It is interesting for us to know, but I can't see a user wanting to know this. If you want it committed for QA, I would leave a TODO to downgrade this logging to trace later on.

Contributor Author


I added a TODO comment and will remove this output completely, together with the instrumentation code.

if (cl[ASSIGN_MISSING_TO_LEFT] == 0 || cl[ASSIGN_MISSING_TO_LEFT] == c) {
gain[ASSIGN_MISSING_TO_LEFT] = -INF;
} else {
minLossLeft[ASSIGN_MISSING_TO_LEFT] = minimumLoss(
Contributor


Does it matter that we only set minLossLeft in this branch? I don't think so, because these values are only used if gain > -INF, but this is not immediately obvious from the code. I think it warrants a comment.

Contributor Author


You are right, I added a comment.
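The pattern under discussion can be sketched in isolation. This is a hypothetical reduction (names like `SplitCandidate` and `evaluateSplit` are invented; the real code operates on arrays indexed by ASSIGN_MISSING_TO_LEFT/RIGHT): leaving minLossLeft unset on the degenerate branch is safe precisely because callers only read it when gain is finite.

```cpp
#include <limits>

// Sketch of the guard discussed above: minLossLeft is only assigned on the
// branch where gain stays finite, which is safe because minLossLeft is only
// ever read when gain > -infinity.
struct SplitCandidate {
    double gain = -std::numeric_limits<double>::infinity();
    double minLossLeft = 0.0; // only meaningful when gain > -infinity
};

SplitCandidate evaluateSplit(double countLeft, double countTotal, double loss) {
    SplitCandidate result;
    if (countLeft == 0.0 || countLeft == countTotal) {
        // Degenerate split: every row lands on one side. Leave gain at
        // -infinity; minLossLeft is intentionally not set since it is
        // never read for a candidate with infinite negative gain.
        return result;
    }
    result.minLossLeft = loss;
    result.gain = -loss; // placeholder gain for illustration
    return result;
}
```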


double CBoostedTreeLeafNodeStatistics::childMaxGain(double gChild,
double minLossChild,
double lambda) const {
Contributor


I think it might be worthwhile writing a brief outline of what this is doing, i.e. something along the lines of: "This computes the maximum possible gain we can expect splitting a child node given we know the sum of the positive (g^+) and negative gradients (g^-) at its parent, the minimum curvature on the positive and negative gradient set (hmin^+ and hmin^-) and largest and smallest gradient (gmax and gmin, respectively). The highest possible gain consistent with these constraints can be shown to be: (g^+)^2 / (hmin^+ g^+ / gmax + lambda) + (g^-)^2 / (hmin^- g^- / gmin + lambda)".

Contributor Author


I added the explanation comment as you suggested.

Contributor

@tveasey tveasey left a comment


LGTM

@valeriy42 valeriy42 merged commit f4c986a into elastic:master Nov 11, 2020
@valeriy42 valeriy42 deleted the branch-and-bounds branch November 11, 2020 10:17
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Nov 11, 2020
…lastic#1537)

valeriy42 added a commit that referenced this pull request Nov 12, 2020
…on (#1568)


Backport of #1537


* windows compilation error fixed

* fix compiler issues

* fixing compiling issues
pull bot pushed a commit to abitmore/ml-cpp that referenced this pull request Aug 10, 2022
Debugging an intermittent SIGSEGV triggered by another change I'm working on showed up this potential out-of-bounds read. It would happen very infrequently. It was introduced by elastic#1537.
droberts195 pushed a commit to droberts195/ml-cpp that referenced this pull request Sep 5, 2022
Debugging an intermittent SIGSEGV triggered by another change I'm working on showed up this potential out-of-bounds read. It would happen very infrequently. It was introduced by elastic#1537.
droberts195 pushed a commit to droberts195/ml-cpp that referenced this pull request Sep 5, 2022
Debugging an intermittent SIGSEGV triggered by another change I'm working on showed up this potential out-of-bounds read. It would happen very infrequently. It was introduced by elastic#1537.