Before deep learning became popular on a large scale, support vector machines were the most popular machine learning method.
The support vector machine is computationally cheap, so it is still favored on devices that cannot afford large memory or heavy computation.
To illustrate SVM, we use the iris dataset as an example. (For ease of explaining the principle, I extract only two of its classes.)
For visualization, I compressed the iris data set to two dimensions using principal component analysis (PCA).
####################
# read data
####################
f = open('iris.csv', 'r')
lines = f.readlines()
data = []
label = []
for line in lines[1:]:
    line = line.replace('\n', '').split(',')
    # Convert string data into float form
    data.append( list(map(float, line[:-1])) )
    label.append( line[-1] )
# Converted into a numpy array, you can process the data like MATLAB.
import numpy as np
data = np.array(data)
label = np.array(label)

####################
# setosa & versicolor
####################
# Extract only two categories from it
data = data[ label != 'virginica' ]
label = label[ label != 'virginica' ]
# Replace the label with a number
for i in range(len(label)):
    if label[i] == 'setosa':
        label[i] = 0
    else:
        label[i] = 1
# convert string into int
label = label.astype(int)

####################
# pca
####################
from sklearn.decomposition import PCA
pca = PCA( n_components = 2 )  # 2D data
# fit the model
pca.fit(data)
print(pca.explained_variance_ratio_)
# transformed data
data = pca.fit_transform(data)

####################
# show data
####################
print('data:')
# View the dimensional information of the data
print(data.shape)
print(data)
print()
print('label:')
print(label.shape)
print(label)

import matplotlib.pyplot as plt
# visualization
plt.scatter(data[:, 0], data[:, 1], c = label)
plt.show()
[0.90505321 0.07466183] is the explained variance ratio printed by PCA: the first principal component carries about 0.905 of the total variance and the second about 0.075. So a single dimension of the iris data already retains about 90% of the information, which is nearly all of it.
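As a quick check (a minimal sketch reusing the pca object fitted in the script above), you can print how much of the variance one component and two components keep:

ratio = pca.explained_variance_ratio_
print(ratio[0])      # variance kept by the first dimension alone, about 0.905
print(ratio.sum())   # variance kept by the two dimensions we plot, about 0.98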
First, let's use sklearn to build an SVM on this data set:
####################
# read data
####################
f = open('iris.csv', 'r')
lines = f.readlines()
data = []
label = []
for line in lines[1:]:
    line = line.replace('\n', '').split(',')
    # Convert string data into float form
    data.append( list(map(float, line[:-1])) )
    label.append( line[-1] )
# Converted into a numpy array, you can process the data like MATLAB.
import numpy as np
data = np.array(data)
label = np.array(label)

####################
# setosa & versicolor
####################
# Extract only two categories from it
data = data[ label != 'virginica' ]
label = label[ label != 'virginica' ]
# Replace the label with a number
for i in range(len(label)):
    if label[i] == 'setosa':
        label[i] = 0
    else:
        label[i] = 1
# convert string into int
label = label.astype(int)

####################
# pca
####################
from sklearn.decomposition import PCA
pca = PCA( n_components = 2 )  # 2D data
# fit the model
pca.fit(data)
print(pca.explained_variance_ratio_)
# transformed data
data = pca.fit_transform(data)

####################
# show data
####################
print('data:')
# View the dimensional information of the data
print(data.shape)
print(data)
print()
print('label:')
print(label.shape)
print(label)

import matplotlib.pyplot as plt
# visualization
plt.scatter( data[:, 0], data[:, 1], c = label, cmap = 'tab10' )
# plt.show()

####################
# svm
####################
from sklearn import svm
clf = svm.SVC( kernel = 'linear' )
clf.fit( data, label )

xlim = [ min(data[:, 0]), max(data[:, 0]) ]
ylim = [ min(data[:, 1]), max(data[:, 1]) ]

xx = np.linspace(xlim[0], xlim[1], 10)
yy = np.linspace(ylim[0], ylim[1], 10)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

ax = plt.gca()
# plot decision boundary and margins
ax.contour( XX, YY, Z, cmap = 'tab10', levels = [-1, 0, 1],
            alpha = 0.5, linestyles = ['--', '-', '--'] )
plt.show()
SVM is one of the best classifiers for supervised learning.
The basic idea of SVM is to draw a line that separates two different classes of data. For example, in the picture, the brown line separates the blue and green dots.
The SVM algorithm separates the points on the two sides as cleanly as possible: the points on both sides should be as far away from this line as possible.
So SVM adjusts the slope and position of the line until the points closest to it on each side (indicated by the dashed lines) are as far from it as possible, as shown in the picture.
We call the distance between a dashed line and the SVM dividing line the margin. The goal of SVM training is to maximize the margin on the training data.
The SVM training algorithm itself is complicated, but fortunately sklearn wraps libsvm, so we can use both linear SVM and kernel SVM very conveniently.
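To make the margin concrete, here is a minimal sketch (assuming the linear clf, data and label trained in the script above): for a linear SVC, sklearn exposes the weight vector w of the dividing line w·x + b = 0 in clf.coef_ and the support vectors in clf.support_vectors_, and the distance from the dividing line to either dashed line is 1 / ||w||:

import numpy as np

# weight vector w and bias b of the dividing line  w·x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# distance from the dividing line to either dashed line, i.e. the margin
margin = 1 / np.linalg.norm(w)
print('w =', w, ' b =', b)
print('margin =', margin)

# the support vectors are the points that lie on the dashed lines
print('support vectors:')
print(clf.support_vectors_)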
XOR problem
Problems like the following that cannot be divided by a straight line are called XOR problems.
x = [
    [1, 2],
    [2, -1],
    [-1, -2],
    [-2, 1],
    [2, 1],
    [1, -2],
    [-2, -1],
    [-1, 2],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

import numpy as np
x = np.array(x)
y = np.array(y)

import matplotlib.pyplot as plt
plt.figure( figsize = (6, 6) )
plt.scatter( x[:, 0], x[:, 1], c = y )
plt.grid()
plt.show()
The accuracy of dividing this problem with a straight line can hardly be better than chance (around 50%).
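As a quick check (a minimal sketch reusing the x and y arrays defined above), you can fit a linear SVM on these eight points and print its training accuracy:

from sklearn import svm

clf = svm.SVC( kernel = 'linear', C = 1.0 )
clf.fit(x, y)

# accuracy on the training points themselves; for this XOR-like layout
# a straight line cannot do much better than guessing
print('training accuracy:', clf.score(x, y))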
SVM is a hyperplane-based classifier, but with a kernel the data is implicitly mapped into a higher-dimensional space, and the hyperplane there corresponds to a dividing boundary in the original space that can be far more complicated than a straight line.
We can see that the linear SVM is racking its brains to solve this classification problem.
But this is basically futile: the SVM conjures up a pattern that does not exist in the first place. When we set C (the penalty coefficient) too high, the SVM overfits badly.
To solve the XOR problem, what if we turn the SVM's dividing line into a curve?
In the following example, I replaced the linear kernel with an RBF kernel.
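Here is a minimal sketch of how such a picture can be produced (it reuses x and y from the XOR example; the gamma value is just an illustrative choice):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

clf = svm.SVC( kernel = 'rbf', gamma = 0.5, C = 1.0 )
clf.fit(x, y)
print('training accuracy:', clf.score(x, y))

# color the plane according to the predicted class of each grid point
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure( figsize = (6, 6) )
plt.contourf(xx, yy, Z, alpha = 0.3)
plt.scatter(x[:, 0], x[:, 1], c = y)
plt.grid()
plt.show()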
The region containing the purple dots is colored purple, and the region containing the yellow dots is colored yellow.
You will also find that the gap between the yellow and purple points is large, which means the SVM adapts well: even if new data varies within a certain range, it will still land on the correct side.
RBF kernel
The RBF kernel is one of the most commonly used kernels because its principle is simple and it works well.
The RBF kernel turns a linear decision function into a nonlinear one, which is very useful in machine learning, because the world we live in is nonlinear too.
For example, in SVM we cannot always divide the classes with a straight line, because the points are not neatly on different sides of the picture; the classes may be interleaved (the XOR problem). That is where these nonlinear functions become very useful.
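The RBF (radial basis function) kernel is essentially a Gaussian bump: k(x, z) = exp(-gamma * ||x - z||^2). As a minimal sketch (gamma = 1 is just an illustrative choice), you can plot its value between a point (x, y) and the origin, similar to the sigmoid surface plotted in the next section:

import numpy as np
import matplotlib.pyplot as plt

gamma = 1.0
xx = np.arange(-2, 2, 0.02)
yy = np.arange(-2, 2, 0.02)
X, Y = np.meshgrid(xx, yy)

# kernel value between the point (x, y) and the origin:
# k((x, y), (0, 0)) = exp(-gamma * (x**2 + y**2))
Z = np.exp(-gamma * (X**2 + Y**2))

fig = plt.figure()
ax = plt.axes( projection = '3d' )
ax.plot_surface(X, Y, Z, cmap = 'rainbow')
plt.title('RBF kernel value against the origin')
plt.show()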
Other kernels
Let's use some other kernels to show the effect of a nonlinear SVM:
sigmoid kernel
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

fig = plt.figure()
ax = plt.axes( projection = '3d' )

minmax = 2
xx = np.arange( -minmax, minmax, minmax / 100 )
yy = np.arange( -minmax, minmax, minmax / 100 )
X, Y = np.meshgrid( xx, yy )
Z = 1 / ( 1 + np.e ** -( X - Y ) )

ax.plot_surface( X, Y, Z, cmap = 'rainbow' )
plt.title('sigmoid function')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.show()
What is drawn here is the sigmoid function over two inputs. The sigmoid function is also very commonly used to turn a linear result into a nonlinear one.
You will find that sigmoid does not work well when dealing with XOR problems.
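If you want to try it yourself, only the kernel name needs to change (a minimal sketch reusing x and y from the XOR example; the result also depends on gamma and coef0):

from sklearn import svm

clf = svm.SVC( kernel = 'sigmoid', gamma = 'auto', C = 1.0 )
clf.fit(x, y)
print('training accuracy:', clf.score(x, y))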
polynomial kernel
If you know what a series, a Taylor expansion, or a Fourier transform is, then polynomial expansion will be much easier to understand.
If you don't know the above concepts, then please check out the following example.
import numpy as np

x = np.arange(-np.pi*2, np.pi*2, 0.1)
y = np.zeros(x.shape)
N = 2
for i in range(1, N):
    y += (-1)**(i-1) * x**(2*i-1) / np.math.factorial(2*i-1)

import matplotlib.pyplot as plt
plt.plot(x, y)
plt.title('sin(x) Taylor Expansion')
plt.xlabel('x'); plt.ylabel('y')
plt.ylim(-3, 3)
plt.grid(); plt.show()
The code above draws the Taylor expansion of the sine function. Adjusting N changes how many terms of the expansion are drawn.
For example, when N=2, only the first term (the degree-1 Taylor polynomial) is drawn:
When N=3, the first two terms (the degree-3 Taylor polynomial) are drawn:
Then come the higher-order Taylor expansions:
When N becomes large enough, numpy reports an error: the terms of the Taylor expansion involve powers and factorials that grow extremely fast, and the factorial in the denominator soon exceeds the integer range numpy can handle, falling back to a Python big integer that cannot be mixed into a float array. In the mathematics part we will discuss other calculation methods that reduce this burden on Python.
Traceback (most recent call last):
  File "xxx.py", line 8, in <module>
    y += (-1)**(i-1) * x**(2*i-1) / np.math.factorial(2*i-1)
TypeError: ufunc 'add' output (typecode 'O') could not be coerced to provided output parameter (typecode 'd') according to the casting rule ''same_kind''
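One way around this overflow (just a sketch, not the method discussed in the mathematics part) is to build each term of the series from the previous one, so that no huge power or factorial is ever computed explicitly:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-np.pi*2, np.pi*2, 0.1)
y = np.zeros(x.shape)
N = 20

# term_1 = x; each following term equals the previous one times -x^2 / ((2i)*(2i+1)),
# so no single factorial or power has to be stored as a huge integer
term = x.copy()
for i in range(1, N):
    y += term
    term = term * (-x**2) / ((2*i) * (2*i + 1))

plt.plot(x, y)
plt.title('sin(x) Taylor Expansion (term-by-term recurrence)')
plt.ylim(-3, 3)
plt.grid(); plt.show()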
You will find that as N increases, the Taylor expansion looks more and more like the sine function. And the expansion is simply a polynomial.
import numpy as np

x = np.arange(-np.pi*2, np.pi*2, 0.1)
y = np.zeros(x.shape)
N = 10
curves = np.zeros((N-1, x.shape[0]))
for i in range(1, N):
    y += (-1)**(i-1) * x**(2*i-1) / np.math.factorial(2*i-1)
    curves[i-1] = y
# print(curves)

import matplotlib.pyplot as plt
for i in range(len(curves)):
    plt.plot(x, curves[i], label = i+1)
plt.title('sin(x) Taylor Expansion')
plt.xlabel('x'); plt.ylabel('y')
plt.ylim(-3, 3)
plt.grid(); plt.legend(); plt.show()
We can plot the results of several orders together so you can see the differences between them.
When the order is insufficient, underfitting occurs: the Taylor expansion cannot keep up with the changes of the original function. When the order is right, it fits the original function closely. When the order is too high, overfitting occurs: the polynomial tolerates no noise at all, and because the real world is noisy, such a polynomial ends up with large deviations.
For example, in the previous example, if I use too low a polynomial order, the following result appears:
If the polynomial order is right, then we can get results as good as with the RBF kernel:
In this example, there is no overfitting problem; overfitting would require a larger and noisier data set.
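Here is a minimal sketch of how pictures like these could be produced (reusing x and y from the XOR example; coef0 = 1 and the particular degrees are just illustrative choices), so you can compare how the training accuracy changes with the polynomial degree:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))

plt.figure( figsize = (9, 3) )
for k, degree in enumerate([1, 2, 4]):
    clf = svm.SVC( kernel = 'poly', degree = degree, coef0 = 1, gamma = 'auto', C = 1.0 )
    clf.fit(x, y)

    # color the plane by the predicted class, as in the rbf example above
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.subplot(1, 3, k + 1)
    plt.contourf(xx, yy, Z, alpha = 0.3)
    plt.scatter(x[:, 0], x[:, 1], c = y)
    plt.title('degree=%d, acc=%.2f' % (degree, clf.score(x, y)))
plt.show()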
Comparison of different kernels
sklearn provides a very good example program for visualizing the effect of different SVM kernels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets


def make_meshgrid(x, y, h=0.02):
    """
    Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, **params):
    """
    Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out


# import some data to play with
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
models = (
    svm.SVC(kernel="linear", C=C),
    svm.LinearSVC(C=C, max_iter=10000),
    svm.SVC(kernel="rbf", gamma=2, C=C),
    svm.SVC(kernel="poly", degree=3, gamma="auto", C=C),
)
models = (clf.fit(X, y) for clf in models)

# title for the plots
titles = (
    "SVC with linear kernel",
    "LinearSVC (linear kernel)",
    "SVC with RBF kernel",
    "SVC with polynomial (degree 3) kernel",
)

# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2)
plt.subplots_adjust(wspace=0.4, hspace=0.4)

X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title, ax in zip(models, titles, sub.flatten()):
    plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors="k")
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel("Sepal length")
    ax.set_ylabel("Sepal width")
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)

plt.show()
The basic idea of this visualization: train an SVM model on the iris data set, divide the plotting area into a fine grid of points, send every grid point to the SVM for prediction, color the background according to the predicted class, and finally obtain the colored regions shown in the figure above.
Tips on tuning
Among SVM kernels, rbf and linear are the most commonly used. Linear is suitable for linearly separable data, while rbf works on almost any data; when the noise is heavy or the data has an XOR-like structure, rbf is a good choice. SVM also has some significant shortcomings, such as over-fitting, and in practice this kind of problem can often only be solved by moving to deep learning.
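As a hedged sketch of how such tuning can be done in practice (the parameter grid below is only an example, and X, y stand for whatever training data you have, e.g. the iris features and labels from earlier): sklearn's GridSearchCV tries every combination of C and gamma with cross-validation, which also helps to detect over-fitting:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

# example data; replace with your own X, y
iris = datasets.load_iris()
X, y = iris.data, iris.target

param_grid = {
    'C': [0.1, 1, 10, 100],           # penalty coefficient
    'gamma': [0.01, 0.1, 1, 'scale'], # width of the rbf kernel
}

search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print('best parameters:', search.best_params_)
print('best cross-validation accuracy:', search.best_score_)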