In the last chapter we explained what projection is; in this chapter we will explain what PCA (principal component analysis) is.
In the previous chapter, I explained how to generate a two-dimensional Gaussian distribution:
import numpy as np
import matplotlib.pyplot as plt

# Set a random seed so the result of each run is the same
np.random.seed(42)

mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 1000)

plt.scatter(point[:, 0], point[:, 1], alpha=0.5)
plt.show()
We now draw a straight line and project each point onto it:
import numpy as np
import matplotlib.pyplot as plt

# Set a random seed so the result of each run is the same
np.random.seed(42)

mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 1000)

plt.scatter(point[:, 0], point[:, 1], alpha=0.5)

x = np.arange(-12, 12, 0.1)
y = 2 * x
plt.plot(x, y, lw=3, c='#ff7f0e')
plt.xlim(-12, 12)
plt.ylim(-12, 12)
plt.show()
Because 1000 points are too dense, I reduced the number of points to 200 and tidied up the plot a little:
import numpy as np
import matplotlib.pyplot as plt

# Set a random seed so the result of each run is the same
np.random.seed(42)

mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

plt.scatter(point[:, 0], point[:, 1], alpha=0.3)

x = np.arange(-12, 12, 0.1)
y = 2 * x
plt.plot(x, y, lw=3, c='#ff7f0e')

# Direction vector of the line (its first plotted point, since the line passes through the origin)
a = np.array([x[0], y[0]])
for i in range(point.shape[0]):
    b = point[i, :]
    # Orthogonal projection of point b onto the line spanned by a
    projection = a * np.dot(b, a) / np.linalg.norm(a)**2
    plt.plot([b[0], projection[0]], [b[1], projection[1]], c='#2ca02c', alpha=0.5)

plt.axis('square')
plt.xlim(-12, 12)
plt.ylim(-12, 12)
plt.show()
We can change the slope of this line, and the result of the projection will change accordingly, for example:
y = -x
y = x
We will call the projection onto y = -x result 1, and the projection onto y = x result 2.
I now want to compress this two-dimensional Gaussian distribution into one-dimensional data. Which projection do you think better reflects the real situation: keeping result 1 or keeping result 2?
Our data is positively correlated (as x increases, y also increases). If we kept result 1, we would be projecting onto a line with negative slope, which runs against that relationship, so result 2 is more representative.
So does the projection of result 2 best reflect the characteristics of the data?
To decide, we can keep rotating the line and watch how spread out the projected points are along it. The position where the projections are most dispersed (have the largest variance) is the best position.
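As a rough illustration of that idea (this sketch is mine, not part of the original program), we can sweep the angle of the line and measure how dispersed the projections are in each direction; for our data the variance should peak near 45 degrees, which matches the line y = x:

import numpy as np

# Same data as before
np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

# Try many line directions and measure how dispersed the projections are on each
angles = np.linspace(0, np.pi, 180, endpoint=False)
variances = []
for theta in angles:
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    projections = point @ direction                        # signed position of each point on the line
    variances.append(projections.var())

best = angles[np.argmax(variances)]
print(np.degrees(best))  # should be close to 45 degrees, i.e. the line y = x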
There is some involved mathematics behind this which I will not go into here; if you are interested, you can read the description of PCA in a machine learning textbook.
Let's look at how to use Python's sklearn library to perform PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Set a random seed so the result of each run is the same
np.random.seed(42)

mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

plt.scatter(point[:, 0], point[:, 1])

# Fit a PCA model that keeps both components
pca_clf = PCA(n_components=2)
pca_clf.fit(point)
print(pca_clf.explained_variance_ratio_)

# Transform the points into the principal-component coordinates
# (fit_transform refits the model, but on the same data the result is unchanged)
pca_data = pca_clf.fit_transform(point)
print(pca_data)

plt.scatter(pca_data[:, 0], pca_data[:, 1])
plt.show()
In this program, I used sklearn's PCA class to perform principal component analysis on the data.
The input data is two-dimensional, so the PCA model here also analyzes two dimensions. You can keep only one dimension by changing n_components=2 to n_components=1; reducing n_components only reduces the dimensionality of the output, it does not change the values of the components that remain.
Then this model is used to fit_transform the two-dimensional Gaussian data, and the result is plotted.
The blue points in the figure above are the original data, and the orange points are the data after PCA.
For the orange data, the x-axis is the principal component and the y-axis is the second component.
You will find that what PCA does is rotate the coordinate axes to a certain angle, so that the data is spread out as much as possible along the first new axis.
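To see that this really is just centering followed by a rotation, here is a small check (my own sketch, reusing the same pca_clf as above): with whiten left at its default, sklearn's transform is equivalent to subtracting the mean and multiplying by the component matrix.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

pca_clf = PCA(n_components=2).fit(point)

# transform = center the data, then rotate it with the (orthonormal) component matrix
rotated = (point - pca_clf.mean_) @ pca_clf.components_.T
print(np.allclose(rotated, pca_clf.transform(point)))  # True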
Let's look at the output on the command line:
The first line of the output is [0.89874751 0.10125249]. This comes from the line print(pca_clf.explained_variance_ratio_), which reports how much of the overall information the principal component and the second component each account for.
Here the principal component accounts for 0.89874751 of the overall information, which is close to 90% and very high. So if we discard the remaining roughly 10% of the information, it will not have much impact on the integrity of our data.
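In other words, explained_variance_ratio_ is just each component's variance divided by the total variance. A quick check (my own sketch, reusing the fitted model):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

pca_clf = PCA(n_components=2).fit(point)

# Variance captured by each component, divided by the total variance
ratio = pca_clf.explained_variance_ / pca_clf.explained_variance_.sum()
print(ratio)                               # matches explained_variance_ratio_
print(pca_clf.explained_variance_ratio_)   # e.g. [0.89874751 0.10125249]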
After that comes a large block of output: the data after PCA. If we take only the first number of each row, that is our principal component.
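If we only wanted that single column, we could also ask PCA for one component directly; the values that remain are the same (a minimal sketch, not part of the original program):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

pca_2d = PCA(n_components=2).fit_transform(point)   # keep both components
pca_1d = PCA(n_components=1).fit_transform(point)   # keep only the principal component

# The single remaining column should equal the first column of the 2-D result
print(np.allclose(pca_2d[:, 0], pca_1d[:, 0]))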
PCA is not only used to extract principal components. In practical machine learning, the data dimension is sometimes so high that training takes too long; using PCA to reduce the dimensionality and discard the low-information dimensions can solve this problem.
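As a sketch of that use case (the data below is made up purely for illustration), sklearn also lets you pass a fraction to n_components, keeping just enough components to explain that share of the variance:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)

# Hypothetical high-dimensional data: 50 observed features driven by only 3 hidden factors
latent = np.random.randn(500, 3)
mixing = np.random.randn(3, 50)
high_dim = latent @ mixing + 0.1 * np.random.randn(500, 50)

# Keep just enough components to explain 90% of the variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(high_dim)
print(high_dim.shape, "->", reduced.shape)  # most of the information fits in a few columns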
View the most important dimensions
In PCA, we often want to know which original dimension matters most. For example, suppose we have information about wages, consumption, housing loans, car loans, and so on, and we want to know which of these has the greatest impact (positive or negative) on raising children. To answer that, we need to look at how much each original dimension contributes to the PCA components.
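These contributions live in the fitted model's components_ attribute: one row per component, one weight per original axis. A minimal look at it for our two-dimensional example (reusing the pca_clf fitted earlier):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

pca_clf = PCA(n_components=2).fit(point)

# Row 0: weights of the principal component on the original x- and y-axes
# Row 1: weights of the second component
print(pca_clf.components_)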
The first row shows how much each original axis contributes to the principal component.
Here PCA draws a weight of 0.69951524 from the x-axis and 0.71461768 from the y-axis; from the figure, we can see that the dispersion of the data along the x-axis and the y-axis is indeed similar.
The second row shows the contributions of the original axes to the second component.
PCA inverse_transform
In machine learning, PCA's inverse_transform is rarely used, so I won't introduce it in detail here. If you are interested, please check the sklearn documentation.
Principal component analysis extracts the key information in the data. If you use it to reduce dimensionality, you retain most of that information, and the data can be reconstructed from it. If you retain enough information, the reconstructed data will be very close to the original data.
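As a small sketch of that reconstruction (it uses inverse_transform, which was only mentioned in passing above), we can compress our two-dimensional points to one dimension and then rebuild an approximation of them:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
mu = [0, 0]
sigma = [[10, 8], [8, 10]]
point = np.random.multivariate_normal(mu, sigma, 200)

# Keep only the principal component (roughly 90% of the information)
pca = PCA(n_components=1)
reduced = pca.fit_transform(point)            # shape (200, 1) instead of (200, 2)

# Rebuild an approximation of the original points from the single kept component
reconstructed = pca.inverse_transform(reduced)
print(np.abs(point - reconstructed).mean())   # average error is small compared to the data's spread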
If a Mars exploration satellite transmitted the complete data it collected back to Earth, the amount of data would be huge and the transmission would be very difficult. But if we use PCA to reduce the dimensionality and keep 80%, 90%, or 95% of the information before transmission, a considerable part of the data is cut away, while the data reconstructed after transmission is still similar to the data before it was reduced.