A kernel function is a function $k(x, y) = \phi(x)^T \phi(y)$, where $\phi$ is a feature mapping that takes data points to a higher-dimensional space.
In other words, $k(x, y)$ computes the inner product in the mapped space without ever evaluating the feature mapping $\phi(x)$ itself.
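As a concrete check, the degree-2 polynomial kernel $k(x, y) = (x^T y)^2$ on $\mathbb{R}^2$ corresponds to the explicit map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. A minimal sketch (assuming NumPy) verifying that both routes give the same value:

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Kernel route: never leave the original 2-D space.
k_xy = np.dot(x, y) ** 2

# Explicit route: map to the 3-D feature space, then take an inner product.
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

assert np.isclose(k_xy, np.dot(phi(x), phi(y)))  # both equal 121.0
```

For kernels such as the RBF kernel, the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.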
If the dataset $D = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, is not linearly separable in the original $d$-dimensional space, we can use a feature mapping $\phi(x)$ to map the data points into a higher-dimensional space where they become linearly separable, and then solve the SVM optimization problem there to find the optimal hyperplane.
Hence we obtain the new dataset $D' = \{(x_i', y_i)\}_{i=1}^n$, where $x_i' = \phi(x_i)$ and $y_i \in \{-1, 1\}$.
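For intuition, one dimension suffices: the points $x = -1, 0, 1$ with labels $+1, -1, +1$ cannot be separated by any single threshold on the real line, but under $\phi(x) = (x, x^2)$ they become linearly separable. A minimal sketch (the separating line $x_2 = 0.5$ is chosen by eye, just for illustration):

```python
import numpy as np

X = np.array([-1.0, 0.0, 1.0])           # not separable by any single threshold
y = np.array([1, -1, 1])

X_mapped = np.column_stack([X, X ** 2])  # phi(x) = (x, x^2)

# In the mapped space, the horizontal line x_2 = 0.5 separates the classes.
pred = np.where(X_mapped[:, 1] >= 0.5, 1, -1)
print(pred)  # [ 1 -1  1] -- matches y
```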
Optimization problem
The optimization problem for SVM with kernel can be formulated as:
$$\min_{w,b,\xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i$$
subject to:
$$y_i(w^T \phi(x_i) + b) \geq 1 - \xi_i, \quad i = 1, 2, \dots, n$$

$$\xi_i \geq 0, \quad i = 1, 2, \dots, n$$
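In practice this problem is rarely solved by hand. A minimal sketch handing it to scikit-learn's `SVC`, which solves exactly this soft-margin formulation; the RBF kernel, the value of $C$, and the synthetic data are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # circular boundary

clf = SVC(C=1.0, kernel="rbf")  # soft-margin SVM with an RBF kernel
clf.fit(X, y)
print(clf.score(X, y))          # training accuracy
```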
Solution to the optimization problem
The Lagrangian for the SVM optimization problem with kernel can be formulated as follows:

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i(w^T \phi(x_i) + b) - 1 + \xi_i \right] - \sum_{i=1}^n \beta_i \xi_i$$

where $\alpha_i \geq 0$ and $\beta_i \geq 0$ are the Lagrange multipliers.
To find the optimal solution, we take the partial derivatives of the Lagrangian with respect to the primal variables $w$, $b$, and $\xi_i$, and set them to zero:

$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$$

$$\frac{\partial L}{\partial b} = 0 \implies \sum_{i=1}^n \alpha_i y_i = 0$$

$$\frac{\partial L}{\partial \xi_i} = 0 \implies \alpha_i = C - \beta_i, \quad \text{so } 0 \leq \alpha_i \leq C$$

Substituting these conditions back into the Lagrangian eliminates $w$, $b$, and $\xi$ and yields the dual problem:

$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j)$$

subject to $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$. Since $\phi$ appears only through inner products, we can replace $\phi(x_i)^T \phi(x_j)$ with $k(x_i, x_j)$. This is often referred to as the "kernel trick", which allows us to work in high-dimensional feature spaces without explicitly computing the feature vectors.
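The dual is a quadratic program in $\alpha$. A minimal sketch solving it numerically; a general-purpose SLSQP solver from SciPy stands in for a dedicated QP/SMO solver, and the RBF kernel, $C$, and toy data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, gamma=1.0):
    # Pairwise k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)  # XOR-like labels

C = 1.0
K = rbf_kernel(X)
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j k(x_i, x_j)

def neg_dual(alpha):
    # Negated dual objective, because `minimize` minimizes.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, C)] * len(y),                            # 0 <= alpha_i <= C
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x
print(np.sum(alpha > 1e-6), "support vectors")
```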
Finding $b$
To find $b$, we can use any margin support vector (a point where $0 < \alpha_i < C$): for these points $\xi_i = 0$, so $y_i(w^T \phi(x_i) + b) = 1$, which gives

$$b = y_i - \sum_{j=1}^n \alpha_j y_j k(x_j, x_i)$$

In practice, $b$ is usually averaged over all margin support vectors for numerical stability.
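A short sketch of that averaging; the arrays below are hypothetical stand-ins for a dual solver's output, not values from a real solve:

```python
import numpy as np

# Hypothetical toy values standing in for a dual solver's output.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
alpha = np.array([0.4, 0.7, 0.0, 0.3])  # made-up; note sum_i alpha_i y_i = 0
C = 1.0

K = np.exp(-(X - X.T) ** 2)  # RBF kernel matrix with gamma = 1

# Margin support vectors lie strictly inside the box: 0 < alpha_i < C.
margin = (alpha > 1e-8) & (alpha < C - 1e-8)

# b = y_i - sum_j alpha_j y_j k(x_j, x_i), averaged over the margin SVs.
b = np.mean(y[margin] - (alpha * y) @ K[:, margin])
print(b)
```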
Decision Function
The decision function for classifying new points becomes:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^n \alpha_i y_i k(x_i, x) + b \right)$$
The main difference from the non-kernel SVM is the use of the kernel function $k(x_i, x)$ instead of the dot product $x_i^T x$. This allows the SVM to find non-linear decision boundaries in the original input space by implicitly working in a higher-dimensional feature space.
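A minimal NumPy sketch of this prediction rule; the training arrays, dual coefficients $\alpha_i$, and intercept $b$ below are hypothetical stand-ins for a fitted model, and the RBF kernel is assumed:

```python
import numpy as np

def rbf(X, x, gamma=1.0):
    # k(x_i, x) for every training point x_i (the rows of X)
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def predict(x, X_train, y_train, alpha, b, gamma=1.0):
    # f(x) = sign( sum_i alpha_i y_i k(x_i, x) + b )
    return np.sign(np.sum(alpha * y_train * rbf(X_train, x, gamma)) + b)

# Hypothetical fitted values:
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.8, 0.8, 0.8, 0.8])  # satisfies sum_i alpha_i y_i = 0
b = 0.0

print(predict(np.array([0.1, 0.1]), X_train, y_train, alpha, b))  # 1.0
```

Since $\alpha_i = 0$ for all non-support vectors, the sum only needs to run over the support vectors in practice, which keeps prediction cheap even for large training sets.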