Support Vector Machines: Building Intuition
Support vector machines (SVMs) are powerful classification methods with a rich yet fairly intuitive mathematical undercarriage. In trivial examples, where one class of data points lies at one end of the feature range and the other class at the opposite end, an SVM essentially boils down to 'split the distance between them': if a new data point appears more than halfway towards class 'A', it is simply classified as belonging to 'A' as well.
Once you get intricate overlaps or irregularly shaped boundaries between the classes, SVMs still generalize well, but they need a bit of tweaking and tuning to achieve good results. In other words: you need to know what you are doing.
Let's walk through some simple examples first to get a feeling for how SVMs behave. We will go light on the theory, which I will explore in a future post.
First we need to import some basics:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import svm
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore') # matplotlib is a very talkative library
The first example is loosely based on the sklearn documentation for support vector machines. We start by creating a few 2-dimensional data points $X \in \mathbb{R}^{40 \times 2}$ and assigning each of them one of two classes, $y \in \{0,1\}^{40}$.
X, y = make_blobs(n_samples=40, centers=2, random_state=6)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
# saving the x and y axis max/min values
# we will need those later for more plots
xlim = plt.gca().get_xlim()
ylim = plt.gca().get_ylim()
It is very easy to separate these two classes by intuition: a simple straight line, right through the middle of the gap between the two classes. You would then - intuitively - classify any new data south of that line as belonging to the red class and any data point north of it as belonging to the blue class.
In order to see whether the SVM arrives at a result similar to our intuition, let us first train the model.
# create and fit the model
clf = svm.SVC(kernel='linear', C=1)
clf.fit(X, y)
What follows next is basically only for visualization purposes. We draw an equidistant grid across the $\mathbb{R}^2$ domain where our features $X$ live and let the SVM decide, for each pair of coordinates $x = (x^{(1)}, x^{(2)})$, what score it assigns to that point.
# create grid of x_1 and x_2 coordinates
x_1 = np.linspace(xlim[0], xlim[1], 30)
x_2 = np.linspace(ylim[0], ylim[1], 30)
X_1, X_2 = np.meshgrid(x_1, x_2)
grid = np.vstack([X_1.ravel(), X_2.ravel()]).T
With this grid we can calculate the classification for "all" possible points in the feature space - or at least a close enough approximation of them.
# calculate the decision function value for every possible x_1-x_2-combination
# this is almost the same as using the predict function, but without
# applying the threshold that decides which class a point would belong to
Z = clf.decision_function(grid).reshape(X_1.shape)
Now we can plot the data points and overlay them with the values of the decision function, for example as a colored contour plot:
# plot the data points
plt.scatter(X[:, 0], X[:, 1], c='k', s=30)
# plot decision function and values
cntr1 = plt.contourf(X_1, X_2, Z, levels=20, alpha=0.5, cmap='viridis')
plt.colorbar(cntr1).set_label('decision function')
plt.xlabel('x_1')
plt.ylabel('x_2')
plt.show()
Or we can pick out only the values that are most interesting to us: the decision boundary, where the decision function becomes $0$, and the margins, where the decision function becomes $1$ or $-1$ respectively and on which the support vectors lie.
# plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
# plot decision boundary and margins
plt.contour(X_1,
X_2,
Z,
colors='k',
levels=[-1, 0, 1],
alpha=0.5,
linestyles=['--', '-', '--'])
# plot support vectors
plt.scatter(clf.support_vectors_[:, 0],
clf.support_vectors_[:, 1],
s=100,
linewidth=1,
facecolors='none',
edgecolors='k')
plt.show()
As we can see, the SVM has generated a solution that is in line with our intuitive guess about how the feature space should be divided up between the classes. The dashed lines represent the levels of $1$ and $-1$, while the solid line defines the decision boundary at the zero-level of the decision function.
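We can also double-check this numerically by evaluating the decision function at the support vectors themselves; for a cleanly separable dataset like this one, they should sit on (or at least within) the margins, i.e. at values of roughly $1$ and $-1$:
# decision function values at the support vectors
# for cleanly separable data these should lie on or within the margins, i.e. |value| <= 1
print(clf.decision_function(clf.support_vectors_))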
To do right by the SVM algorithm though - and to mentally prepare ourselves for the generalization to non-trivial examples that lies ahead - we have to visualize the data and the decision function in 3D.
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X_1, X_2, Z, cmap='viridis', alpha=0.5)
ax.contour(X_1,
X_2,
Z,
colors='k',
levels=[-1, 0, 1],
alpha=1,
linestyles=['--', '-', '--'])
xy_data = np.vstack([X[:, 0].ravel(), X[:, 1].ravel()]).T
ax.scatter(X[:, 0], X[:, 1], 0, c=y, s=30, cmap=plt.cm.Paired)
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('decision function')
ax.view_init(15, -15)
As we can see here, the decision function defines a hyperplane separating our datapoints into areas which map to $Z<0$ and $Z>0$ respectively. Note that the hyper"plane" is actually the solid black line that you see where $Z=0$.
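A quick sanity check of this point (a minimal sketch reusing the grid and Z from above): the sign of the decision function is exactly what predict uses to assign class labels, so the $Z<0$ and $Z>0$ regions should correspond one-to-one to the two predicted classes.
# positive decision function values correspond to clf.classes_[1],
# negative values to clf.classes_[0]
predictions = clf.predict(grid)
print(np.all((Z.ravel() > 0) == (predictions == clf.classes_[1])))  # expect: True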
Of course there is no need to limit ourselves to two features, but since visualizing the decision function always adds one more dimension on top of the feature space, it is hard to create sensible visualizations for higher-dimensional data. Let us at least try it for 3 feature dimensions.
# we create 40 separable points
# note that X has three dimensions here and y is the class they belong to
X, y = make_blobs(n_samples=40, n_features=3, centers=2, random_state=6)
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=30, cmap=plt.cm.Paired)
# saving the x and y axis max/min values
# we will need those later for more plots
xlim = plt.gca().get_xlim()
ylim = plt.gca().get_ylim()
zlim = plt.gca().get_zlim()
We then fit the model again, but now simply with a 3D input.
clf_3d = svm.SVC(kernel='linear', C=1)
clf_3d.fit(X, y)
And then we repeat our grid creation procedure, just with one more dimension.
# create grid of x_1-x_2-x_3 coordinates
x_1 = np.linspace(xlim[0], xlim[1], 30)
x_2 = np.linspace(ylim[0], ylim[1], 30)
x_3 = np.linspace(zlim[0], zlim[1], 30)
X_1, X_2, X_3 = np.meshgrid(x_1, x_2, x_3)
grid_3d = np.vstack([X_1.ravel(), X_2.ravel(), X_3.ravel()]).T
# additionally create a grid only consisting of the data_points
# this is essentially just reformatting them
# so we can work with the grid and the data
# without changing our functions
grid_3d_data = np.vstack([X[:,0].ravel(), X[:,1].ravel(), X[:,2].ravel()]).T
# Calculate the value for every possible x_1-x_2-x_3-combination
fourth_dimension_grid = clf_3d.decision_function(grid_3d)
fourth_dimension_data = clf_3d.decision_function(grid_3d_data)
Now we face a little challenge: so far we have been working with point clouds. When drawing in 2D, there is an interpolation going on under the hood, so we could easily draw the decision function using "contour" and matplotlib would fill in the gaps in our grid. I know of no such function in 3D, so we need a little hack.
As an approximation of the zero-level of the decision function, I will use the points in space where it is very close to $0$.
decision_boundary_3d = grid_3d[np.abs(fourth_dimension_grid)<0.01]
We then use a triangulated surface plot to fill the gaps in the point cloud. It looks a bit wonky, but it does the job.
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
# create triangulated surface with the
# points of the decision boundary
ax.plot_trisurf(decision_boundary_3d[:, 0],
decision_boundary_3d[:, 1],
decision_boundary_3d[:, 2],
color='teal',
alpha=0.4)
# add in the 3D datapoints and color them
# according to their decision function values
scatter = ax.scatter(X[:, 0],
X[:, 1],
X[:, 2],
c=fourth_dimension_data,
s=30,
cmap='viridis')
plt.colorbar(scatter, shrink = .7).set_label('decision function')
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('x_3')
ax.view_init(15, 135)
Of course, this is another very trivial example: the dataset is clearly linearly separable, just in three dimensions instead of two. But we have been able to build intuition about how the separation between classes works in higher dimensions, i.e. by defining a function whose zero-level separates the classes.
A few trivial examples have shown the basic principles of SVM:
- embed the data into a higher dimensional space
- do so by using a function that maps the data into that higher dimensional space
- choose the function so that the values assigned to the two classes are far apart, while data points within the same class receive similar values
- split the distance between those values and classify new data according to which side of the split they land on (a small sketch of this function for the linear case follows below)
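For the linear kernel this function is nothing more exotic than an affine map $f(x) = w^\top x + b$. As a quick check (reusing grid_3d and fourth_dimension_grid from above), we can pull $w$ and $b$ out of the fitted model via coef_ and intercept_ and confirm that the decision function is exactly this map:
# for a linear kernel the decision function is just w.x + b
w = clf_3d.coef_[0]
b = clf_3d.intercept_[0]
print(np.allclose(grid_3d @ w + b, fourth_dimension_grid))  # expect: True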
The more interesting case is of course the one where we cannot simply arrive at an almost perfect solution by eyeballing and jamming a straight line or plane between the clusters. So what happens when we don't have a linearly separable dataset?
Let us create a nested, 2-dimensional dataset using the make_gaussian_quantiles function.
from sklearn.datasets import make_gaussian_quantiles
X, y = make_gaussian_quantiles(mean=None,
cov=1.0,
n_samples=100,
n_features=2,
n_classes=2,
shuffle=True,
random_state=42)
# draw the result
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
# saving the x and y axis max/min values
# we will need those later for more plots
xlim = plt.gca().get_xlim()
ylim = plt.gca().get_ylim()
Then we try to fit our classifier again, without adapting it to the new dataset structure.
clf = svm.SVC(kernel='linear')
clf.fit(X, y)
And visualize the results just like before.
x_1 = np.linspace(xlim[0], xlim[1], 30)
x_2 = np.linspace(ylim[0], ylim[1], 30)
X_1, X_2 = np.meshgrid(x_1, x_2)
grid = np.vstack([X_1.ravel(), X_2.ravel()]).T
Z = clf.decision_function(grid).reshape(X_1.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
plt.contour(X_1,
X_2,
Z,
colors='k',
levels=[-1, 0, 1],
alpha=0.5,
linestyles=['--', '-', '--'])
plt.show()
The SVM is clearly out of its depth with this concentric dataset. The decision boundary has almost nothing to do with the data and is practically useless. This is where we have to introduce the concept of 'kernels'.
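To put a number on that impression (the exact value depends on the random seed), the training accuracy of this linear model should hover close to chance level:
# training accuracy of the linear kernel on the concentric data
# expect a value not far from 0.5, i.e. barely better than guessing
print(clf.score(X, y))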
In the trivial case above we introduced a linear function that had values larger than $1$ around the one class of dots and smaller than $-1$ around the other class. In the trench between the two sets, our decision function went from $1$ through $0$ to $-1$.
Now we simply need to do the same thing, but instead of a linear function we use another function that is larger than $0$ for data from one class, smaller than $0$ for the other class, and passes through $0$ on its way from one class to the other.
And, ideally, it is differentiable to do fancy math stuff with it later.
It turns out that we can construct such a function as the sum of a lot of smaller functions, and one good option for these building blocks is the Gaussian kernel function. You can simply think 'normal distribution' here: we place a little normal distribution around each data point and multiply it by $1$ or $-1$ depending on the class.
def gaussian_kernel(X, Y, sigma=1):
    # a Gaussian bump over the squared euclidean distance between X and Y
    # (squaring the norm is what makes this an actual Gaussian)
    similarity = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(
        -np.linalg.norm(np.subtract(X, Y), ord=2)**2 / (2 * (sigma**2)))
    return np.array(similarity).reshape(1, -1)
This function simply takes two vectors $X$ and $Y$, calculates the euclidean distance between them and then scales that distance on a Gaussian curve: the value is close to the peak if the two points are very near each other, but declines rapidly as they move apart, given the right $\sigma$. If we fix one of the points at $(x^{(1)},x^{(2)}) = [0,0]$ as the center, this is what the function looks like:
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
Z = np.array([gaussian_kernel(point, [0, 0])
for point in grid]).reshape(X_1.shape)
ax.plot_surface(X_1, X_2, Z)
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('kernel function')
ax.view_init(0, 105)
This function fulfills all the criteria above, but in order to use it in the SVM we still have to comply with a technicality.
I was actually a bit stumped by this, as the sklearn documentation does not go into detail about how to properly implement a custom kernel function. Luckily, a Stack Overflow comment came to my rescue, which you will find linked in my sources. It said:
For efficiency reasons, SVM assumes that the kernel is a function accepting two matrices of samples.
So, instead of single vectors, we need a function that takes and returns matrices such as:
from functools import partial
def proxy_kernel(X, Y, K):
    # evaluate the elementwise kernel K for every pair of samples,
    # producing the Gram matrix between the two sample matrices
    gram_matrix = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            gram_matrix[i, j] = K(x, y)
    return gram_matrix
correct_gaussian_kernel = partial(proxy_kernel, K=gaussian_kernel)
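A quick way to convince ourselves that the wrapper behaves as expected: passing two sample matrices should yield a Gram matrix with one row per sample in the first argument and one column per sample in the second.
# the kernel now maps two sample matrices to their Gram matrix
print(correct_gaussian_kernel(X[:5], X[:3]).shape)  # expect: (5, 3)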
We can then try to fit the SVM again, using the new kernel:
clf = svm.SVC(kernel=correct_gaussian_kernel, C=1)
clf.fit(X, y)
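This is also a good place to see the 'sum of many small bumps' idea from above in action. If I read sklearn's conventions correctly, the fitted decision function is a weighted sum of kernel evaluations against the support vectors plus an offset, with the (already signed) weights stored in dual_coef_; since support_vectors_ may not be populated for a callable kernel, we look the support vectors up via their training indices in support_ (a minimal sketch, reusing the grid from before):
# f(x) = sum_i dual_coef_[i] * K(sv_i, x) + intercept_
sv = X[clf.support_]  # the support vectors, via their indices in the training set
K_sv = correct_gaussian_kernel(sv, grid)
manual = clf.dual_coef_ @ K_sv + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(grid)))  # expect: True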
From here on out, it's smooth sailing: we reuse the grid from before, calculate the decision function for each point on the grid and visualize the decision boundary.
Z = clf.decision_function(grid).reshape(X_1.shape)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
plt.contour(X_1,
X_2,
Z,
colors='k',
levels=[-1, 0, 1],
alpha=0.5,
linestyles=['--', '-', '--'])
plt.show()
As we can see, the decision boundary (solid line) neatly encircles the cluster of blue dots, except for a few escapees in the lower right and top left of the boundary.
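The improvement also shows up in the training accuracy, which should now be well above the chance-level result of the linear kernel:
# training accuracy with the custom gaussian kernel
print(clf.score(X, y))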
When we plot the decision function in 3D and overlay it with the data, it becomes a sort of well: every data point that crosses the solid boundary falls into the well and gets classified as a blue dot. In other words, the solid line indicates where the zero-level of the decision function lies, and that zero-level defines the hyperplane separating the data.
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X_1, X_2, Z, cmap='viridis', alpha=0.2)
ax.contour(X_1,
X_2,
Z,
colors='k',
levels=[-1, 0, 1],
alpha=1,
linestyles=['--', '-', '--'])
xy_data = np.vstack([X[:, 0].ravel(), X[:, 1].ravel()]).T
ax.scatter(X[:, 0],
X[:, 1],
clf.decision_function(xy_data),
c=y,
s=30,
cmap=plt.cm.Paired)
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('decision function')
ax.view_init(15, 15)
When we elevate this experiment to three-dimensional input data, we obviously lose the ability to visualize the decision function in its entirety, but as before we can still visualize the hyperplane of its zero-level set. Let us start with concentric 3D data, i.e. a ball of "blue" data in the middle of a "red" data cloud.
# note that X has three dimensions here and y is the class they belong to
X, y = make_gaussian_quantiles(mean=None,
cov=1.0,
n_samples=200,
n_features=3,
n_classes=2,
shuffle=True,
random_state=42)
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=30, cmap=plt.cm.Paired)
# saving the x and y axis max/min values
# we will need those later for more plots
xlim = plt.gca().get_xlim()
ylim = plt.gca().get_ylim()
zlim = plt.gca().get_zlim()
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('x_3');
ax.view_init(15, 215)
And we also classify that with our custom Gaussian kernel. So what does the Gaussian kernel look like in 3D? It is a sphere that is most densely packed in the middle, with the density then (depending on $\sigma$) quickly decreasing as you move outwards.
clf_3d = svm.SVC(kernel=correct_gaussian_kernel)
clf_3d.fit(X, y)
Again, we create a grid covering the whole data domain and calculate the decision function for all the grid points.
# create grid of x_1-x_2-x_3-coordinates
x_1 = np.linspace(xlim[0], xlim[1], 30)
x_2 = np.linspace(ylim[0], ylim[1], 30)
x_3 = np.linspace(zlim[0], zlim[1], 30)
X_1, X_2, X_3 = np.meshgrid(x_1, x_2, x_3)
grid_3d = np.vstack([X_1.ravel(), X_2.ravel(), X_3.ravel()]).T
# Calculate decision function value
fourth_dimension_grid = clf_3d.decision_function(grid_3d)
Properly displaying the 3-dimensional zero-level of a 4-dimensional function was a bit challenging, as there is no function in matplotlib that gracefully handles converting a point cloud to a surface.
So I sliced up the $x_3$-axis and then, for each fixed $x_3$ value, calculated the decision function across the whole $x_1$-$x_2$-domain. Finally, I filtered for where the result is zero, thereby identifying a ring that is part of the whole 3D surface.
As matplotlib can apparently interpolate the zero-level nicely in 2 dimensions, we get the contour of a slightly deformed ball that encompasses all the blue data points.
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111, projection='3d')
for x_3 in np.unique(grid_3d[:, 2]):
    # create grid in the x_1-x_2-domain
    X1, X2 = np.meshgrid(x_1, x_2)
    # calculate the value of the decision function for this x_3 slice
    X3 = clf_3d.decision_function(grid_3d[grid_3d[:, 2] == x_3])
    # offset the slice to the right position:
    # as we are looking for the zero-level of Z (Z=0) we actually need to
    # display the level x_3 as Z+x_3 = 0+x_3 = x_3
    cset = ax.contour(X1, X2, X3.reshape(X1.shape) + x_3, levels=[x_3], zdir='z')
# plot the data
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=30, cmap=plt.cm.Paired)
ax.set_xlabel('x_1')
ax.set_ylabel('x_2')
ax.set_zlabel('x_3')
ax.view_init(15, 215)
So, we have seen in this post how to generalize the method to higher dimensions: using a similarity kernel, we create a function that tends towards positive values for one class and negative values for the other.
But this whole ordeal also exposes one of the biggest drawbacks of the method: you need a good idea of the topology of your dataset before you can choose an appropriate kernel. Alternatively, you can brute-force your way through all the available kernel and parameter options.
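For completeness, this is roughly what the brute-force route could look like with scikit-learn's built-in kernels and a cross-validated grid search; the parameter values below are arbitrary examples, not recommendations:
from sklearn.model_selection import GridSearchCV

# brute-force search over built-in kernels and their main parameters
param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.1, 1],
}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)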