In bagging, we train each model independently on a random subset of the data. In boosting, we train the models sequentially on the same data, with each new model focusing on correcting the errors of the combined previous models, for example by increasing the weights of misclassified data points.
Unlike bagging, each model depends on the previous ones, and its contribution to the final prediction is weighted differently.
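One concrete instance of this reweighting scheme is AdaBoost. Below is a minimal sketch, assuming binary labels in $\{-1, +1\}$ and decision stumps as weak learners (both assumptions, since the text leaves the weak learner unspecified):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=50):
    # Sequentially fit T decision stumps, upweighting misclassified points after each round.
    n = len(y)
    w = np.full(n, 1.0 / n)                              # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # weight of this learner
        w *= np.exp(-alpha * y * pred)                   # misclassified points get larger weights
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```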
Let's define the ensemble model:

$$F_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$

where:

- $h_t(x)$ is the $t$-th weak learner
- $\alpha_t \in [0,1]$ is the weight of the $t$-th weak learner
- $T$ is the total number of weak learners
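Evaluating the ensemble is just the weighted sum of the weak learners' outputs. A minimal sketch, reusing the hypothetical `learners` and `alphas` returned by the `boost` sketch above:

```python
import numpy as np

def ensemble_predict(X, learners, alphas):
    # F_T(x) = sum_t alpha_t * h_t(x); for classification, the sign of the score gives the label
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)
```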
The risk (or error) of the ensemble model is:

$$R(F_T) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, F_T(x_i)\big)$$

where:

- $L$ is the loss function
- $(x_i, y_i)$ are the input-output pairs in the dataset
- $n$ is the number of samples
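As a quick illustration, the empirical risk is just the average loss over the training set. A minimal sketch, with squared error as an assumed example of $L$:

```python
import numpy as np

def empirical_risk(y, f_pred, loss=lambda y, f: (y - f) ** 2):
    # R(F_T) = (1/n) * sum_i L(y_i, F_T(x_i))
    return np.mean(loss(y, f_pred))
```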
Our goal is to minimize this risk. Since the first $T-1$ learners are already fixed, at each step we add a single new weak learner and minimize

$$R(F_T) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, F_{T-1}(x_i) + \alpha_T h_T(x_i)\big)$$

with respect to $\alpha_T$ and $h_T$.
Using a first-order Taylor expansion of the loss around the current prediction $F_{T-1}(x_i)$, this can be approximated as:

$$R(F_T) \approx \frac{1}{n} \sum_{i=1}^{n} \left[ L\big(y_i, F_{T-1}(x_i)\big) + \alpha_T h_T(x_i)\, \frac{\partial L\big(y_i, F_{T-1}(x_i)\big)}{\partial F_{T-1}(x_i)} \right]$$

where the partial derivative is evaluated at the current prediction $F_{T-1}(x_i)$.

Because $L\big(y_i, F_{T-1}(x_i)\big)$ does not depend on the new learner $h_T$, it is a constant in this minimization and can be dropped.
Expanding this further, the only term left to minimize is:

$$\frac{1}{n} \sum_{i=1}^{n} \alpha_T\, h_T(x_i)\, \frac{\partial L\big(y_i, F_{T-1}(x_i)\big)}{\partial F_{T-1}(x_i)}$$

We call

$$g_i = \frac{\partial L\big(y_i, F_{T-1}(x_i)\big)}{\partial F_{T-1}(x_i)}$$

the gradient of the loss at the $i$-th sample; its negative, $-g_i$, is often called the pseudo-residual.
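To make $g_i$ concrete, here is a small sketch of the gradient for two common loss choices (assumed examples, since the text leaves $L$ generic):

```python
import numpy as np

def grad_squared_error(y, f):
    # L(y, f) = 0.5 * (y - f)^2  ->  dL/df = f - y
    return f - y

def grad_log_loss(y, f):
    # y in {0, 1}, f is a raw score; L(y, f) = log(1 + exp(f)) - y * f  ->  dL/df = sigmoid(f) - y
    return 1.0 / (1.0 + np.exp(-f)) - y
```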
Thus, the problem reduces to:

$$h_T = \arg\min_{h} \sum_{i=1}^{n} g_i\, h(x_i)$$

This formulation shows that each new weak learner $h_T$ is chosen to align with the negative gradient of the loss: fitting $h_T$ to the pseudo-residuals $-g_i$ performs an approximate gradient-descent step in function space, so that adding $\alpha_T h_T$ to the ensemble decreases the risk.
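A minimal sketch of this idea (gradient boosting), assuming squared-error loss and small regression trees as weak learners, neither of which is fixed by the text:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, alpha=0.1):
    f = np.full(len(y), y.mean())        # initial constant prediction F_0
    learners = []
    for _ in range(T):
        residuals = y - f                # pseudo-residuals -g_i for L(y, f) = 0.5 * (y - f)^2
        h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        f += alpha * h.predict(X)        # add the new weak learner with a fixed weight alpha
        learners.append(h)
    return learners, y.mean(), alpha

def predict(X, learners, f0, alpha):
    # F_T(x) = F_0 + sum_t alpha * h_t(x)
    return f0 + alpha * sum(h.predict(X) for h in learners)
```

Each tree is fit to the current pseudo-residuals, so every boosting round takes one approximate gradient step in function space.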