[ML] Gain upper bound estimation for classification and regression#1537
valeriy42 merged 39 commits into elastic:master from
Conversation
tveasey
left a comment
I know this is still work in progress and needs a little tidying up. I made a couple of comments on the implementation to make the update of the bound related information as efficient as possible.
@tveasey I finished up this PR. It would be great if you could take another look at it.
@tveasey Also, do you think we need to update the memory estimation?
I don't think so: the extra memory is accounted for in the sizeof of the derivatives object, so it is already covered.
tveasey
left a comment
Overall looks great. There are a couple of potential issues I think. One is just in the temporary stats you compute, but the other looks like a real issue caused by unsigned integer underflow. I also made some minor suggestions for readability and consistency.
class MATHS_EXPORT CSplitsDerivatives {
public:
    using TDerivativesVec = std::vector<CDerivatives>;
    using TDerivativesMappedVec = boosted_tree_detail::TAlignedMemoryMappedFloatVector;
We reserve Vec for std::vector in all other code so I find this name misleading. However, we actually already have this type declared so I would just use TMemoryMappedFloatVector wherever you currently use TDerivativesMappedVec and delete this typedef.
I removed the typedef as you suggested.
private:
    using TDerivativesVecVec = std::vector<TDerivativesVec>;
    using TAlignedDoubleVec = std::vector<double, core::CAlignedAllocator<double>>;
    using TDerivatives2D = Eigen::Matrix<double, 2, 1>;
This type name breaks our standard naming conventions, which say that the type name should reflect the type, not its use. In other cases in the code where we have fixed-size stack-based vectors we would use something like TDoubleVector2x1.
    bool assignMissingToLeft)
    bool assignMissingToLeft,
    std::size_t leftChildRowCount,  // TODO remove after stats measurement
    std::size_t rightChildRowCount, // TODO remove after stats measurement
Do you plan to remove these in a follow up PR or can they be removed now?
I will remove them in a follow-up PR.
    static_cast<std::int64_t>(this->memoryUsage()) - lastMemoryUsage);

    if (m_Instrumentation != nullptr) {
        LOG_INFO(<< "Statistics computed: " << m_Instrumentation->statisticsComputed()
I'm not sure this should make it into the logs. It is interesting for us to know, but I can't see a user wanting to know this. If you want it committed for QA, I would leave a TODO to downgrade this logging to trace later on.
I added a TODO comment and will remove this output completely together with the instrumentation code.
    if (cl[ASSIGN_MISSING_TO_LEFT] == 0 || cl[ASSIGN_MISSING_TO_LEFT] == c) {
        gain[ASSIGN_MISSING_TO_LEFT] = -INF;
    } else {
        minLossLeft[ASSIGN_MISSING_TO_LEFT] = minimumLoss(
Does it matter that we only set minLossLeft in this branch? I don't think so, because these values are only used if gain > -INF, but this is not immediately obvious from the code. I think it warrants a comment.
You are right, I added a comment.
double CBoostedTreeLeafNodeStatistics::childMaxGain(double gChild,
                                                    double minLossChild,
                                                    double lambda) const {
I think it might be worthwhile writing a brief outline of what this is doing, i.e. something along the lines of: "This computes the maximum possible gain we can expect splitting a child node given we know the sum of the positive (g^+) and negative gradients (g^-) at its parent, the minimum curvature on the positive and negative gradient set (hmin^+ and hmin^-) and largest and smallest gradient (gmax and gmin, respectively). The highest possible gain consistent with these constraints can be shown to be: (g^+)^2 / (hmin^+ g^+ / gmax + lambda) + (g^-)^2 / (hmin^- g^- / gmin + lambda)".
I added the explanation comment as you suggested.
[ML] Gain upper bound estimation for classification and regression (elastic#1537)

In this PR we start computing an upper bound on the potential gain from splitting a node. If the upper bound of the gain is lower than the currently smallest gain among all candidates, we ignore the node and in this way avoid computations that are especially expensive on large datasets. Since we only skip splits that would not have been added to the tree anyway, this PR does not change the qualitative results. At the moment, we can only compute the upper bound for regression and binary classification. For multiclass classification we proceed as before. Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.
[ML] Gain upper bound estimation for classification and regression (#1568)

Backport of #1537. Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.

* windows compilation error fixed
* fix compiler issues
* fixing compiling issues
Debugging an intermittent SIGSEGV triggered by another change I'm working on showed up this potential out-of-bounds read. It would happen very infrequently. It was introduced by elastic#1537.
In this PR we start computing an upper bound on the potential gain from splitting a node. If the upper bound of the gain is lower than the currently smallest gain among all candidates, we ignore the node and in this way avoid computations that are especially expensive on large datasets.
Since we only skip splits that would not have been added to the tree anyway, this PR does not change the qualitative results.
At the moment, we can only compute the upper bound for regression and binary classification. For multiclass classification we proceed as before.
Note: this PR contains additional instrumentation to assess the performance improvement. I will remove this instrumentation in a follow-up PR after tests.