One Hot Encoding
Multi-class classification problem
tf.keras.datasets provides some data sets:
boston_housing module: Public API for tf.keras.datasets.boston_housing namespace.
cifar10 module: Public API for tf.keras.datasets.cifar10 namespace.
cifar100 module: Public API for tf.keras.datasets.cifar100 namespace.
fashion_mnist module: Public API for tf.keras.datasets.fashion_mnist namespace.
imdb module: Public API for tf.keras.datasets.imdb namespace.
mnist module: Public API for tf.keras.datasets.mnist namespace.
reuters module: Public API for tf.keras.datasets.reuters namespace.
Today we use the cifar10 data set to explain:
Basic data analysis process
When we prepare to use a new data set, the first step is to download it.
The cifar10 data set is provided through tensorflow, so the first thing to do is install tensorflow.
Enter the install command (for a standard Python setup this is normally pip install tensorflow) in the CMD command prompt (search for CMD in Windows 10).
tensorflow will then be installed on your computer. Note that tensorflow is a deep learning library that comes in a CPU version and a GPU version. Because the GPU version is extremely troublesome to install, I recommend installing the CPU version here.
I will cover the GPU version installation in the anaconda and deep learning tutorial.
Deep learning requires a huge amount of calculation. A typical CPU has only 4-16 cores, and even the most powerful CPUs today generally do not exceed 100 cores. Roughly speaking, a 16-core CPU can work on only about 16 pieces of data at the same time, while a deep learning task generally involves tens of thousands of samples and an extremely complex network. So training a deep learning model on a CPU often takes a very long time.
GPUs, however, often have hundreds or even thousands of cores, so GPU calculations can be dozens of times faster than CPU calculations, compressing work that would otherwise take several days into dozens of minutes.
New GPUs often have Tensor Cores (computing cores specialized for low-precision matrix arithmetic such as FP16 or INT8; deep learning often does not require higher precision), and these Tensor Cores compute even faster.
Designing GPU computing programs is very complicated. Fortunately, libraries like tensorflow do this work for us: the same program can run on both the CPU and the GPU.
Then we can load the cifar data set and visualize it:
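A minimal sketch of loading and inspecting the data might look like the following. It assumes tensorflow and matplotlib are installed; the helper names load_cifar10, describe and show_samples are hypothetical and chosen only for illustration. The first call to cifar10.load_data() downloads the data set (roughly 170 MB), so it needs an internet connection.

```python
import numpy as np


def load_cifar10():
    """Return ((x_train, y_train), (x_test, y_test)); downloads the data on first use."""
    # Imported inside the function so the other helpers work without tensorflow.
    from tensorflow.keras.datasets import cifar10
    return cifar10.load_data()


def describe(images, labels):
    """Summarize one split: array shapes, pixel dtype and the distinct labels present."""
    return {
        "images": images.shape,
        "labels": labels.shape,
        "dtype": str(images.dtype),
        "classes": np.unique(labels).tolist(),
    }


def show_samples(images, labels, count=8):
    """Plot the first `count` images in one row, titled with their numeric labels."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(1, count, figsize=(1.5 * count, 2), squeeze=False)
    for ax, image, label in zip(axes[0], images, labels):
        ax.imshow(image)
        ax.set_title(str(int(np.ravel(label)[0])))
        ax.axis("off")
    plt.show()


# Usage:
#   (x_train, y_train), (x_test, y_test) = load_cifar10()
#   print(describe(x_train, y_train))
#   show_samples(x_train, y_train)
```

With the real data, x_train has shape (50000, 32, 32, 3) and y_train has shape (50000, 1), so each label is wrapped in its own one-element row.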
The output is as follows:
cifar10 contains ten different classes.
Each class is represented by an integer label from 0 to 9.
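To make the numbers readable, the labels can be translated into class names. The sketch below lists the ten CIFAR-10 class names in label order; the helper label_to_name is a hypothetical name used only for this illustration.

```python
# CIFAR-10 class names in label order (list index = numeric label).
CLASS_NAMES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def label_to_name(label):
    """Translate a numeric CIFAR-10 label (0-9) into its class name."""
    return CLASS_NAMES[int(label)]


for label in (0, 1, 2):
    print(label, label_to_name(label))
```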
Maximum likelihood analysis
Maximum likelihood analysis is the analysis method we use most often.
For example, consider a baby who has just come into this world. He doesn't know anything.
You tell him what a panda is, and he learns what a panda is.
You tell him what a horse is, and he learns what a horse is.
Then you show him a picture of a zebra. He has never seen a zebra, but he has seen horses and pandas. He thinks the zebra looks more like a panda, so he decides the zebra is a panda.
This is maximum likelihood analysis. For another example, in the morning there are dense clouds and the weather forecast says it might rain in the afternoon, so you bring an umbrella when you go out. This is also maximum likelihood analysis: you believe rain is likely in the afternoon, so you bring an umbrella to avoid getting wet. If you did not think it would rain, you would not bring one, because carrying it is a nuisance.
Maximum likelihood analysis finds the probability of each event, compares these probabilities, and assumes that the event with the highest probability will occur.
For example, suppose I throw a die that has been tampered with: the probability of 5 landing face up is 80%, and the probability of each of the other faces is only 4%. Maximum likelihood analysis then says that 5 will be face up.
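The die example can be written out in a few lines. This is only a sketch of the idea as described here (pick the outcome with the highest probability); the function name most_likely is hypothetical.

```python
# Probability of each face of the manipulated die described above.
face_probability = {1: 0.04, 2: 0.04, 3: 0.04, 4: 0.04, 5: 0.80, 6: 0.04}


def most_likely(probabilities):
    """Return the outcome whose probability is highest."""
    return max(probabilities, key=probabilities.get)


prediction = most_likely(face_probability)
print(prediction)
```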
One Hot Encoding
We return to the cifar10 classification problem.
The label of the airplane is 0, the label of the car is 1, and the label of the bird is 2.
Suppose there is a thing that looks like both a bird and an airplane. The label of airplane is 0 and the label of bird is 2, so if we weight the labels by probability, the result should be around 1, right? Then, by maximum likelihood analysis, this thing is most likely a car.
This is obviously wrong. If we weight the numeric labels directly, different classes get confused with each other.
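The confusion is easy to reproduce with a little arithmetic. The sketch below assumes, for illustration, that the object is judged to be half airplane and half bird; the 0.5 values are made up for the example.

```python
CLASS_NAMES = {0: "airplane", 1: "car", 2: "bird"}

# Assumed belief: the object is half airplane, half bird, and not a car at all.
probability = {0: 0.5, 1: 0.0, 2: 0.5}

# Treat the labels as ordinary numbers and average them by probability.
weighted_label = sum(label * p for label, p in probability.items())

# Rounding the average to the nearest label picks the class in the middle.
naive_guess = CLASS_NAMES[round(weighted_label)]
print(weighted_label, naive_guess)
```

The average lands on the label of the one class that had zero probability.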
So we use one-hot encoding to solve this problem.
One-hot encoding turns the original labels into a sparse matrix. Each class gets its own column: a sample's row has a 1 in the column of its class and 0 in all remaining columns.
Then, when the probability distribution of an object over (airplane, car, bird) is [0.6, 0, 0.4], maximum likelihood analysis says this object is an airplane, instead of defining it as a car.
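As a small sketch of this idea with NumPy (assuming the three classes are ordered airplane, car, bird), the one-hot rows and the resulting decision look like this:

```python
import numpy as np

CLASS_NAMES = ["airplane", "car", "bird"]

# One row per class: a 1 in the class's own column, 0 everywhere else.
one_hot = np.eye(len(CLASS_NAMES), dtype=int)
print(one_hot)

# Probability distribution over (airplane, car, bird) from the text above.
probability = np.array([0.6, 0.0, 0.4])

# Pick the column with the highest probability.
predicted = CLASS_NAMES[int(np.argmax(probability))]
print(predicted)
```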
Code and implementation
There are many ways to implement one-hot encoding, and to_categorical is one of the easiest ways.
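A minimal sketch of using to_categorical, assuming tensorflow is installed. The label values are made up for illustration, and num_classes=10 is passed so that the width matches the ten cifar10 classes.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# A few example labels in the cifar10 range 0-9 (values chosen for illustration).
labels = np.array([0, 2, 1, 9])

# Each label becomes a row of length 10 with a single 1 at the label's position.
one_hot = to_categorical(labels, num_classes=10)

print(labels)
print(one_hot)
```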
Check whether the position of the 1 in each row of the figure below matches the label above. (Hint: positions start from 0.)
That is how to produce a one-hot encoding. Now let's look at how to convert a one-hot encoding back to labels:
argmax is the function used both for maximum likelihood analysis and for reversing a one-hot encoding. It returns the position at which the maximum value appears in a set of data.
Testing with the same data, you will find that the labels recovered from the one-hot encoding are exactly the same as the labels before encoding.
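The round trip can be sketched with NumPy's argmax. The one-hot matrix here is written out by hand for three assumed labels, so the example does not depend on how the encoding was produced.

```python
import numpy as np

labels = np.array([0, 2, 1])

# One-hot rows for the labels above, written out by hand.
one_hot = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# axis=1 searches for the maximum along each row, that is, once per sample.
restored = np.argmax(one_hot, axis=1)

print(restored)
print(np.array_equal(restored, labels))
```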
One-hot encoding is very commonly used in deep learning and machine learning.
This section introduced the cifar10 data set and some commonly used keras data sets, and explained what maximum likelihood analysis and one-hot encoding are.
Things to pay attention to
The result of the above code is:
Our input has three classes, but to_categorical returns four columns, which shows that to_categorical starts counting from 0.
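This behavior can be reproduced with a short sketch, assuming tensorflow is installed and assuming input labels 1, 2 and 3 (the original input values are not shown here). When num_classes is not given, to_categorical uses the largest label plus one as the number of columns.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([1, 2, 3])  # three classes, but none of them is 0

# Without num_classes, the width is inferred as max(labels) + 1.
one_hot = to_categorical(labels)

print(one_hot.shape)
print(one_hot[:, 0])  # column 0 belongs to the unused label 0
```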
The above code is as follows:
to_categorical faithfully encodes the numbers into one-hot codes in the order 0, 1, 2, 3 and so on, which means that even if I only have three categories, 1, 9 and 14, the result is still a large matrix with 15 columns. When your categories are not contiguous, you need to make them contiguous first or use another method to produce the one-hot encoding.
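One possible way to make the categories contiguous is np.unique with return_inverse=True. This is a sketch of one option, not the only method, and it assumes tensorflow is installed; the variable names are chosen for illustration.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([1, 9, 14, 9])

# Encoding directly gives max(labels) + 1 = 15 columns, most of them unused.
wide = to_categorical(labels)

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the position of each label among them: a contiguous 0, 1, 2 numbering.
classes, contiguous = np.unique(labels, return_inverse=True)
compact = to_categorical(contiguous)

print(wide.shape, compact.shape)

# To recover the original labels, index `classes` with the argmax of each row.
restored = classes[np.argmax(compact, axis=1)]
print(restored)
```

Keeping the classes array around matters, because it is the only record of which column corresponds to which original category.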
The output of the above program is as follows:
You will find that to_categorical can only encode by number, and the number has to be an integer.
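When the labels are names rather than integers, they need to be mapped to integer codes first. The sketch below shows one simple way to do that in plain Python; the example names and the variable code_of are assumptions for illustration. The resulting integer codes can then be one-hot encoded.

```python
import numpy as np

names = ["bird", "airplane", "car", "bird"]

# Give every distinct name an integer code, assigned in sorted order.
classes = sorted(set(names))
code_of = {name: code for code, name in enumerate(classes)}

# Replace each name by its integer code.
codes = np.array([code_of[name] for name in names])

print(code_of)
print(codes)
```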
Statistics
Start time of this page: December 25, 2021
Completion time of this page: December 28, 2021