Binary SVM

Support Vector Machines

Before deep learning became popular on a large scale, the support vector machine was the most popular machine learning method.

The support vector machine requires relatively little computation, so it is still favored on devices that cannot afford large memory or heavy computation.

To illustrate SVM, we use the iris dataset as an example. (For convenience of explanation, I extract only two of its categories.)

For visualization, I compressed the iris data set to two-dimensional data using principal component analysis (PCA).

####################
# read data
####################

f = open('iris.csv', 'r')
lines = f.readlines()
f.close()

data = []
label = []

# Skip the header line, then parse each row
for line in lines[ 1: ]:
    line = line.strip().split(',')

    # Convert the feature columns from string to float
    data.append( list(map(float, line[ :-1 ])) )
    # The last column is the class name
    label.append( line[-1] )


# Convert to numpy arrays so the data can be processed like in MATLAB.
import numpy as np

data = np.array(data)
label = np.array(label)



####################
# setosa & versicolor
####################

# Extract only two categories from it
data = data[ label != 'virginica' ]
label = label[ label != 'virginica' ]

# Replace the label with a number
for i in range(len(label)):
    if label[i] == 'setosa':
        label[i] = 0
    else:
        label[i] = 1

# The labels are still strings ('0'/'1'), so convert them to int
label = label.astype(int)



####################
# pca
####################

from sklearn.decomposition import PCA

pca = PCA( n_components = 2 ) # reduce to two dimensions

# fit the model
pca.fit(data)
print(pca.explained_variance_ratio_)

# project the data onto the two principal components
data = pca.transform(data)



####################
# show data
####################

print('data:')
# View the dimensional information of the data
print(data.shape)
print(data)
print()
print('label:')
print(label.shape)
print(label)

import matplotlib.pyplot as plt

# visualization: color each point by its class label
plt.scatter(data[:, 0], data[:, 1], c = label)

plt.show()
2D-data visualization

[0.90505321 0.07466183] is the explained variance ratio reported by PCA: the first principal component carries 0.90505321 of the total variance and the second carries 0.07466183. In other words, a single dimension already retains about 90% of the information in the iris data set, which is nearly all of it.

With the data prepared, we can now explain what an SVM is.

Reference program

First, use sklearn to build an SVM on this data set:
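The linked reference program is not reproduced here, but a minimal sketch of the idea, reusing the 2-D data and label arrays produced by the PCA step above, could look like this:

####################
# linear svm (sketch)
####################

from sklearn.svm import SVC

# linear kernel: the decision boundary is a straight line in 2-D
clf = SVC(kernel = 'linear', C = 1.0)
clf.fit(data, label)

# training accuracy (setosa and versicolor are linearly separable,
# so this should be 1.0)
print(clf.score(data, label))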

Linear SVM

SVM is a classifier, and one of the best classifiers in supervised learning.

The basic idea of SVM is to draw a line that separates two kinds of data. For example, in the picture, the brown line separates the blue and green dots.

The SVM algorithm separates the points on the two sides with this line to the greatest possible extent: that is, it keeps the points on both sides as far away from the line as possible.

So SVM adjusts the slope and position of the line so that the points on each side closest to it (marked with dashed lines) are as far from it as possible, as shown in the picture.

We call the distance between the dashed line and the SVM dividing line the margin. The goal of the SVM trainer is to maximize the margin on the training data.

The SVM training algorithm itself is complicated, but fortunately sklearn wraps libsvm, so we can use linear and kernel SVMs very conveniently.
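As a small sketch (assuming the linear clf fitted in the snippet above), sklearn exposes the support vectors directly, and for a linear kernel the margin width can be computed from the weight vector:

# the points lying on the dashed lines are the support vectors
print(clf.support_vectors_)

# for a linear kernel, the margin width is 2 / ||w||
w = clf.coef_[0]
print(2 / np.linalg.norm(w))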

XOR problem

Problems like the following, which cannot be divided by a straight line, are called XOR problems.

XOR problems

The accuracy of dividing this problem with a straight line is essentially no better than chance (about 50%).

An SVM is a classifier based on a hyperplane; with a kernel, that hyperplane lives in a higher-dimensional feature space, so in the original space the dividing line can be much more complicated than a straight line.

Resolution = 10
Resolution = 100
Resolution = 1000
Resolution = 2000

We can see that the linear SVM is racking its brains to solve this classification problem.

But this is basically futile: you will find that the SVM hallucinates a pattern that does not exist in the first place. When C (the penalty coefficient) is set too high, the SVM overfits badly.
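To make this concrete, here is a minimal sketch (the XOR-style data generation below is my own assumption, not the data used for the figures) showing that a linear SVM stays close to chance accuracy:

####################
# xor data + linear svm (sketch)
####################

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size = (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-like labels by quadrant

lin = SVC(kernel = 'linear', C = 1.0)
lin.fit(X, y)
print(lin.score(X, y))  # stays near 0.5: a straight line cannot help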

To solve the XOR problem, what if we let the SVM's dividing line be a curve?

In the following example, I replaced the linear kernel with an RBF kernel.
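Switching the kernel is a one-argument change in sklearn; a sketch reusing the XOR data X, y from the snippet above:

rbf = SVC(kernel = 'rbf', C = 1.0, gamma = 'scale')
rbf.fit(X, y)
print(rbf.score(X, y))  # close to 1.0: the curved boundary handles XOR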

RBF kernel SVM

The problem has been solved very well!

The area where the purple dot is located is divided into purple, and the area where the yellow dot is located is divided into yellow.

You will also notice that the gap between the yellow and purple regions is very wide, which means the SVM generalizes well: even if new data varies within a certain range, it still lands on the correct side.

The RBF kernel is one of the most commonly used kernels because its principle is simple and it works well.

If we draw the RBF kernel function, it looks like this:

RBF function
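A minimal sketch of how such a picture can be drawn. In sklearn the RBF kernel is exp(-gamma * ||x - y||^2); here I only plot its one-dimensional profile, and gamma = 1.0 is an assumption for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 300)
gamma = 1.0

# the Gaussian bump that gives the RBF kernel its shape
plt.plot(x, np.exp(-gamma * x ** 2))
plt.title('RBF function')
plt.show()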

The RBF function turns a linear result into a nonlinear one, which is very useful in machine learning, because the world we live in is nonlinear too.

For example, in SVM we cannot always use straight lines to divide regions, because the classes are not necessarily on different sides of the plane; they may interleave (the XOR problem), so these nonlinear functions are very useful.

Other kernels

Let's use other kernels to show the effect of a nonlinear SVM:

sigmoid

What is drawn here is a two-dimensional sigmoid function. The sigmoid function is also very commonly used to turn a linear result into a nonlinear one.
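A minimal sketch of how a two-dimensional sigmoid surface could be drawn (the exact expression plotted on the original page is not shown here, so 1 / (1 + exp(-(u + v))) is an assumption):

import numpy as np
import matplotlib.pyplot as plt

u = np.linspace(-5, 5, 200)
v = np.linspace(-5, 5, 200)
uu, vv = np.meshgrid(u, v)

# sigmoid applied to a linear combination of the two inputs
zz = 1 / (1 + np.exp(-(uu + vv)))

plt.contourf(uu, vv, zz, levels = 50)
plt.colorbar()
plt.title('2-D sigmoid')
plt.show()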

sigmoid kernel SVM

You will find that the sigmoid kernel does not work well on the XOR problem.
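As a quick sketch, reusing the XOR data X, y from before, the sigmoid kernel in sklearn scores noticeably lower than the RBF kernel on this problem:

sig = SVC(kernel = 'sigmoid', C = 1.0, gamma = 'scale')
sig.fit(X, y)
print(sig.score(X, y))  # noticeably worse than the RBF kernel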

If you know what a series, a Taylor expansion, or a Fourier transform is, then the polynomial expansion will be much easier to understand.

Taylor expansion of some formulas

If you don't know the above concepts, then please check out the following example.
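The original page shows a small reference program at this point; a minimal sketch of what it might look like (the function name plot_cos_taylor and the plotting details are my assumptions):

import math
import numpy as np
import matplotlib.pyplot as plt

def plot_cos_taylor(N):
    # cos(x) = sum over n of (-1)^n * x^(2n) / (2n)!,
    # truncated here to the first N terms
    x = np.linspace(-2 * np.pi, 2 * np.pi, 500)

    approx = np.zeros_like(x)
    for n in range(N):
        approx += (-1) ** n * x ** (2 * n) / math.factorial(2 * n)

    plt.plot(x, np.cos(x), label = 'cos')
    plt.plot(x, approx, label = 'Taylor, N = %d' % N)
    plt.ylim(-2, 2)
    plt.legend()
    plt.show()

plot_cos_taylor(4)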

The function above plots the Taylor expansion of the cosine. Adjusting N changes the order of the expansion that is drawn.

For example, when N=2, the first-order Taylor expansion is drawn:

N = 2

For example, when N=3, the second-order Taylor expansion is drawn:

N = 3

Then comes the higher-order Taylor expansion:

N = 4
N = 5
N = 6
N = 7
N = 8
N = 9

When N=10, numpy reports an error: the Taylor expansion uses powers and factorials that grow extremely quickly, and they soon exceed the range that can be represented. In the mathematics part we will discuss other algorithms that reduce this burden on Python.

You will find that as N increases, the Taylor expansion looks more and more like the cosine function. And this expansion is exactly a polynomial.

merged

We can plot the results for several orders together so you can see the differences between them.

When the order is too low, underfitting occurs: the Taylor expansion cannot keep up with the changes of the original function. When the order is right, it fits the original function well. When the order is too high, overfitting occurs: the expansion tolerates no noise at all, and because the real world is noisy, such polynomials end up with large deviations.

For example, in the previous example, if I use too few polynomial terms, the following result appears:

Underfitting

If the polynomial order is right, then we can get results as good as the RBF kernel:

Correct fit
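As a sketch on the XOR data X, y from before (degree = 2 is my choice; the quadrant structure of XOR is exactly quadratic), the polynomial kernel can match the RBF kernel:

poly = SVC(kernel = 'poly', degree = 2, C = 1.0, gamma = 'scale')
poly.fit(X, y)
print(poly.score(X, y))  # comparable to the RBF kernel on this data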

In this example there is no overfitting problem; demonstrating overfitting would need a larger and noisier data set.

Comparison of different kernels

Reference program

sklearn provides a very good example program for visualizing the effect of different SVM kernels.

SVM different kernel visualization

The basic idea of this visualization: train an SVM model on the iris data set, cover the visualization area with a fine grid of points, send each grid point to the SVM to predict its class, color the background according to the predictions, and finally obtain the colored region map shown in the figure above.
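A minimal sketch of that idea, assuming an already-trained 2-D classifier clf and 2-D arrays X, y (these names are placeholders):

import numpy as np
import matplotlib.pyplot as plt

# cover the visualization area with a fine grid
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))

# send every grid point to the SVM and reshape the predictions back
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# color the background by the predicted class, then overlay the data
plt.contourf(xx, yy, zz, alpha = 0.3)
plt.scatter(X[:, 0], X[:, 1], c = y)
plt.show()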

Statistics

Start time of this page: December 21, 2021

Completion time of this page: December 22, 2021
