AI foundation: sequence model based on natural language processing

This article is based mainly on the notes from Andrew Ng's deep learning course [1].

0. Introduction

Sequence models are the foundation of natural language processing. This installment covers recurrent sequence models.

I am writing an AI basics series; the installments released so far are:

AI Basics: Introduction to Simple Mathematics

AI basics: Python development environment settings and tips

AI Basics: Getting Started with Python

AI Basics: Getting Started with Numpy

AI Basics: Simple Introduction to Pandas

AI basics: Scipy (scientific computing library) easy entry

AI basics: an easy introduction to data visualization (matplotlib and seaborn)

AI foundation: the use of machine learning library Scikit-learn

AI basics: an easy introduction to machine learning

AI Basics: The loss function of machine learning

AI Basics: Practice data for machine learning and deep learning

AI Foundation: Feature Engineering-Category Features

AI Foundation: Feature Engineering-Digital Feature Processing

AI Foundation: Feature Engineering-Text Feature Processing

AI foundation: word embedding foundation and Word2Vec

AI Basics: The Transformer, illustrated

AI basics: understand BERT in one article

AI Foundation: A must-see paper for introductory artificial intelligence

Basics of AI: Into Deep Learning

AI foundation: optimization algorithm

AI Foundation: Convolutional Neural Network

AI foundation: classic convolutional neural network

AI foundation: deep learning paper reading route (127 classic papers download)

AI Foundation: Overview of Data Augmentation Methods

AI Foundation: Essay Writing Tool

More installments are on the way.

Main text

Sequence Models

Recurrent Neural Networks

1.1 Why choose a sequence model? (Why Sequence Models?)

In this course you will learn about sequence models, some of the most exciting material in deep learning. Models such as recurrent neural networks (RNNs) have transformed speech recognition, natural language processing, and other fields, and in this lesson you will learn how to build these models yourself. Let's first look at some examples where sequence models are used effectively.

In speech recognition, given an input audio clip x, the task is to output the corresponding text transcript y. Both the input and the output here are sequences: x is an audio clip that plays over time, and y is a sequence of words. So sequence models such as recurrent neural networks, which we will study shortly, are very useful in speech recognition.

Music generation is another example of a sequence-data problem. In this case only the output y is a sequence; the input can be the empty set or a single integer, perhaps indicating the style of music you want to generate, or the first few notes of the desired piece. So x can be empty or just a single number, while the output is a sequence.

In sentiment classification the input x is a sequence. Given an input such as "There is nothing to like in this movie.", how many stars do you think this review corresponds to?

Sequence models are also very useful in DNA sequence analysis. DNA is represented with the four letters A, C, G, and T, so given a DNA sequence, can you label which part of it matches, say, a protein?

In machine translation, you are given an input sentence such as "Voulez-vous chanter avec moi?" (French for "Do you want to sing with me?") and asked to output the translation in another language.

In video activity recognition, you may be given a sequence of video frames and asked to recognize the activity.

In named entity recognition, you may be given a sentence and asked to identify the people's names in it.

All of these problems can be framed as supervised learning with labeled data as the training set. But as this series of examples shows, there are many different kinds of sequence problems. In some, both the input x and the output y are sequences, and even then x and y may or may not have the same length; in cases like numbers 1 and 2 in the figure above, the input and output do have the same length. In other problems, only x or only y is a sequence.

So in this section we have seen that sequence models apply to a range of quite different situations.

In the next section, we will define the notation used to describe sequence problems.

1.2 Notation

In this section, we build up the notation for sequence models step by step.

Say you want to build a sequence model whose input sentence is "Harry Potter and Hermione Granger invented a new spell." (the names are from J.K. Rowling's Harry Potter novels). Suppose you want a sequence model that automatically identifies the positions of people's names in the sentence; this is a named entity recognition problem, used for example by search engines to index all the people's names mentioned in news reports from the past 24 hours. Named entity recognition systems can find the names of people, companies, times, places, countries, currencies, and so on in many types of text.

Given such an input, suppose you want the sequence model to produce one output value per input word, each indicating whether that word is part of a person's name. Technically this may not be the best output format; there are more sophisticated formats that not only indicate whether a word is part of a name but also tell you where each name starts and ends in the sentence, e.g. Harry Potter (number 1 above) and Hermione Granger (number 2 above).

The simpler output form:

The input is a sequence of 9 words, so we will have 9 feature vectors, x^<1>, x^<2>, x^<3>, and so on up to x^<9>, to represent the 9 words, indexed by their position in the sequence. More generally, x^<t> with index t refers to positions in the middle of the sequence. The letter t suggests a time series, but whether or not the data is temporal, we will use t to index position in the sequence.

The same goes for the outputs: we write y^<1>, y^<2>, y^<3>, and so on. We use T_x to denote the length of the input sequence; the input here is 9 words, so T_x = 9. We use T_y for the length of the output sequence. As you saw in the last video, T_x and T_y can differ.

You should remember the notation x^(i), which we used earlier to denote the i-th training example; to refer to the t-th element of the sequence in training example i, we write x^(i)<t>. Since different training examples can have different sequence lengths, T_x^(i) denotes the input-sequence length of the i-th example. Likewise y^(i)<t> is the t-th element of the output sequence of the i-th example, and T_y^(i) is that example's output-sequence length.

So in this example T_x = 9, but if another training example were a sentence of 15 words, then for that example T_x^(i) = 15.
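As a concrete illustration of this notation (a minimal sketch; the label values are simply the ones implied by the example sentence, with 1 marking words that are part of a name):

```python
# The example sentence, tokenized into x<1> ... x<9>
sentence = "Harry Potter and Hermione Granger invented a new spell".split()
Tx = len(sentence)          # input sequence length, Tx = 9

# Target labels y<1> ... y<9>: 1 if the word is part of a person's name
y = [1, 1, 0, 1, 1, 0, 0, 0, 0]
Ty = len(y)                 # output sequence length, Ty = 9

# x<t> for t = 4 (1-indexed, matching the course notation)
t = 4
x_t = sentence[t - 1]       # "Hermione"
```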

Since our example is an NLP (natural language processing) problem, and this is our first foray into NLP, one thing we need to decide up front is how to represent individual words in the sequence. How would you represent a word like Harry? What should x^<1> actually be?

To represent the words in a sentence, the first step is to make a vocabulary, sometimes called a dictionary: a list of the words your representation will use. The first word in this vocabulary (shown in the figure below) is a, the second is Aaron, further down you find the word and, then Harry, then Potter, and so on; the last word in the dictionary might be Zulu.

So a is the first word and Aaron is the second. In this dictionary the word and appears at position 367, Harry at position 4075, Potter at 6830, and Zulu, the last word, might be the 10,000th. So in this example I am using a dictionary of 10,000 words, which is small by the standards of modern natural language processing applications. For commercial applications of ordinary scale, dictionaries of 30,000 to 50,000 words are common, 100,000 words is not unusual, and some large internet companies use dictionaries of a million words or more. But I will use a 10,000-word dictionary for illustration, since it is a nice round number.

If you choose a 10,000-word vocabulary, one way to build it is to go through your training set and keep the 10,000 most common words. You can also consult online dictionaries that list, say, the 10,000 most common English words. You can then use one-hot notation to represent each word in the dictionary.
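Building the vocabulary by word frequency can be sketched as follows (a toy corpus and a vocabulary size of 3 stand in for the real training set and the 10,000-word dictionary):

```python
from collections import Counter

# Toy training corpus standing in for a real one
corpus = "the cat sat on the mat and the cat slept".split()

vocab_size = 3
counts = Counter(corpus)
# Keep the `vocab_size` most common words as the dictionary
vocab = [word for word, _ in counts.most_common(vocab_size)]
```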

For example, x^<1>, representing the word Harry, is a vector with a 1 in position 4075 and 0s everywhere else (number 1 in the figure above), because 4075 is Harry's position in this dictionary.

Similarly, x^<2> is a vector with a 1 in position 6830 and 0s everywhere else (number 2 in the figure above).

The word and is at position 367 in the dictionary, so x^<3> is a vector with a 1 in position 367 and 0s elsewhere (number 3 in the figure above). If your dictionary has 10,000 words, each of these vectors is 10,000-dimensional.

Because a is the first word in the dictionary, the vector representing a has a 1 in the first position and 0s everywhere else (number 4 in the figure above).

In this representation, x^<t> for any word in the sentence is a one-hot vector: it has a single 1 and is 0 everywhere else. So you have 9 one-hot vectors representing the 9 words of this sentence. The goal is, using this representation of the input x, to learn a sequence model that maps it to the target output y. We treat this as a supervised learning problem, assuming labeled data is given.
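The one-hot representation can be sketched like this (a 10-word toy vocabulary stands in for the 10,000-word dictionary; the word positions here are illustrative, not the 4075/6830/367 positions quoted above):

```python
import numpy as np

vocab = ["a", "aaron", "and", "harry", "invented", "new",
         "potter", "spell", "the", "zulu"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot column vector: 1 at the word's dictionary position, 0 elsewhere."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_index[word], 0] = 1.0
    return vec

x1 = one_hot("harry")   # stands in for x<1>
```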

One last detail, which we will return to in a later video: if you encounter a word that is not in your vocabulary, the answer is to create a new token, a fake word called the unknown word, written <UNK>, used to mark any word not in the vocabulary. We will discuss this in more detail later.
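The <UNK> fallback can be sketched as a dictionary lookup with a default (the vocabulary here is a hypothetical fragment):

```python
# Reserve index 0 for the <UNK> token
vocab = ["<UNK>", "a", "and", "harry", "potter"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def index_of(word):
    """Map a word to its dictionary index; out-of-vocabulary words map to <UNK>."""
    return word_to_index.get(word, word_to_index["<UNK>"])
```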

To summarize this lesson: we have described a set of notation for the sequence data in your training set. In the next lesson, we describe how to build a recurrent neural network that maps x to y.

1.3 Recurrent Neural Network Model

In the previous section's video, you learned the notation used to define sequence learning problems. Now we discuss how to build a neural network that learns the mapping from x to y.

One approach you could try is a standard neural network. In our earlier example we had 9 input words, so imagine feeding these 9 one-hot vectors into a standard neural network; after some hidden layers it would eventually output 9 values, each 0 or 1, indicating whether the corresponding input word is part of a person's name.

But it turns out this approach does not work well, for two main reasons:

1. The inputs and outputs can have different lengths in different examples; not every example has the same input length or the same output length. Even if you assume a maximum sentence length and pad or zero-pad every input up to that length, this still does not seem like a good representation.

2. A simple neural network like this does not share features learned at different positions in the text. Specifically, if the network has learned that Harry appearing at position 1 may be part of a person's name, it would be great if it automatically recognized Harry as potentially part of a name when it appears at some other position. This is similar to what you saw with convolutional neural networks, where features learned in one part of an image generalize quickly to other parts; we would like a similar effect for sequence data. And as with convolutional networks, a better representation also lets you reduce the number of parameters in the model.

Recall also that each input word (number 1 in the figure above) is a 10,000-dimensional one-hot vector, so this would be a very large input layer: with a total input size of the maximum number of words times 10,000, the weight matrix of the first layer would have an enormous number of parameters. A recurrent neural network has neither of these problems.

So what is a recurrent neural network? Let's build one (number 1 in the figure below). Reading the sentence from left to right, the first word is x^<1>. We feed x^<1> into a neural network layer, a hidden layer, and have the network try to predict an output: is this word part of a person's name? What the recurrent network does differently is that when it reads the second word of the sentence, x^<2>, rather than predicting y^<2> from x^<2> alone, it also receives some information computed at time step 1. Specifically, the activation value of time step 1 is passed to time step 2. At the next time step the network inputs x^<3> and predicts y^<3>, and so on, up to the last time step, where it inputs x^<Tx> and outputs y^<Ty>. In this example T_x = T_y; if they differed, the structure would need to change. So at every time step, the recurrent neural network passes an activation value forward to the next time step for its computation.

To start the whole process, we also need an activation a^<0> at time zero, which is usually the zero vector. Some researchers initialize it randomly or in other ways, but a zero vector as the dummy activation at time zero is the most common choice; we feed it into the network.

In some research papers and books you will see this kind of network drawn as at number 2 in the figure above: a single cell that takes the input at each time step and emits the output, with a loop drawn back into the layer to indicate the recurrent connection, sometimes with a black square marking a one-time-step delay. I personally find these looped diagrams harder to understand, so in this course I prefer the unrolled form on the left (number 1 above). But when you see the compact form on the right (number 2 above) in textbooks or research papers, you can mentally unroll it into the diagram on the left.

The recurrent network scans the data from left to right, and the parameters are shared across time steps. In the following slides we describe this parameter set in detail: we use W_ax for the parameters governing the connection from the input to the hidden layer, and every time step uses the same W_ax. The activations, i.e. the horizontal connections, are governed by parameters W_aa, again shared across every time step, and the outputs are likewise governed by shared parameters W_ya. The figure below shows in detail how these parameters work.

In this recurrent network, the prediction y^<3> uses not only the information in x^<3> but also information from x^<1> and x^<2>, because information from x^<1> can reach the prediction through the path shown at number 1 above. One weakness of this recurrent network is that it only uses the earlier part of the sequence to make a prediction; when predicting y^<3>, it makes no use of x^<4>, x^<5>, x^<6>, and so on. This is a problem: given the sentence "Teddy Roosevelt was a great President.", knowing only the first two words is not enough to decide whether Teddy is part of a person's name; information from later in the sentence is also very useful, because the sentence might instead have been "Teddy bears are on sale!". Given only the first three words, you cannot tell for sure whether Teddy is part of a person's name: in the first example it is, in the second it is not, and the first three words alone cannot distinguish the two.

Therefore, one limitation of this particular architecture is that a prediction at a given time uses only inputs from earlier in the sequence, not information from later in the sequence. We will address this with bidirectional recurrent neural networks (BRNNs) in a later video. For now, this simpler unidirectional architecture is enough to explain the key concepts; a modification of it will later let us use both earlier and later information in the sequence for prediction. Next, let's write down explicitly what this network computes.

Here is a cleaned-up schematic of the network. As mentioned, you typically feed in a^<0> first, a zero vector. Then comes forward propagation: compute the activation a^<1> first, then the prediction y^<1>.

I will use the following convention for the matrix subscripts: in W_ax, the second subscript x means that this matrix multiplies an x-type quantity, and the first subscript a means it is used to compute an a-type quantity. Similarly, W_ya multiplies an a-type quantity to compute a y-type quantity.

The activation function used inside recurrent neural networks is most often tanh; ReLU is sometimes used, but tanh is the more common choice, and we have other ways of dealing with the vanishing gradient problem that we will discuss later. The activation function on the output depends on the type of output y: for a binary classification problem you would use the sigmoid function, and for a k-way classification problem you could use softmax. For named entity recognition, where y can only be 0 or 1, the second activation function here can be the sigmoid.

More generally, at time t:

a^<t> = g(W_aa a^<t-1> + W_ax x^<t> + b_a)

y^<t> = g(W_ya a^<t> + b_y)

These equations define forward propagation in the network: you start with a^<0>, the zero vector, then use a^<0> and x^<1> to compute a^<1> and y^<1>, then use a^<1> and x^<2> to compute a^<2> and y^<2>, and so on. As the figure shows, forward propagation proceeds from left to right.
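The forward propagation just described can be sketched in NumPy (the parameter names follow the text; the dimensions and the random initialization are illustrative only):

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One RNN time step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba),
    y_hat<t> = sigmoid(Wya a<t> + by) for a binary output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = 1.0 / (1.0 + np.exp(-(Wya @ a_t + by)))
    return a_t, y_hat_t

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """Forward propagation left to right, starting from a<0> = 0."""
    a = np.zeros((Waa.shape[0], 1))
    y_hats = []
    for x_t in xs:
        a, y_hat_t = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)
        y_hats.append(y_hat_t)
    return y_hats

# Tiny example: 3 time steps, 4 hidden units, 5-dimensional inputs
rng = np.random.default_rng(0)
n_a, n_x = 4, 5
params = (rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)),
          rng.standard_normal((1, n_a)), np.zeros((n_a, 1)), np.zeros((1, 1)))
xs = [rng.standard_normal((n_x, 1)) for _ in range(3)]
preds = rnn_forward(xs, *params)
```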

Now, to help us build more complex networks, I want to simplify this notation a bit. I have copied the two equations above onto the next slide (the two equations at number 1 in the figure above).

To simplify the notation, I will rewrite the first equation, a^<t> = g(W_aa a^<t-1> + W_ax x^<t> + b_a) (number 1 in the figure above), in the simpler form a^<t> = g(W_a [a^<t-1>, x^<t>] + b_a) (number 2 above); the underlined left and right sides should be equivalent. We define W_a by placing the matrices W_aa and W_ax horizontally side by side (number 3 above). For example, if a is 100-dimensional and, continuing the earlier example, x is 10,000-dimensional, then W_aa is a (100, 100) matrix and W_ax is a (100, 10,000) matrix, so stacking them gives a (100, 10,100) matrix W_a.

The notation [a^<t-1>, x^<t>] means stacking the two vectors on top of each other, as indicated at number 4 in the figure above, giving in this case a 10,100-dimensional vector. You can verify for yourself that multiplying this vector by the matrix W_a recovers exactly the original quantity: W_a times [a^<t-1>, x^<t>] equals W_aa a^<t-1> + W_ax x^<t>, matching the previous expression (number 5 in the figure above). The advantage of this notation is that instead of carrying two parameter matrices W_aa and W_ax, we compress them into a single matrix W_a, which simplifies the notation as we build more complex models.
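The equivalence claimed above is easy to verify numerically (small dimensions, 5 and 7, stand in for 100 and 10,000):

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 5, 7
Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

Wa = np.hstack([Waa, Wax])            # matrices placed side by side: (n_a, n_a + n_x)
stacked = np.vstack([a_prev, x_t])    # vectors stacked on top of each other

# Wa [a<t-1>, x<t>] equals Waa a<t-1> + Wax x<t>
same = np.allclose(Wa @ stacked, Waa @ a_prev + Wax @ x_t)
```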

Similarly, for the output equation y^<t> = g(W_ya a^<t> + b_y), I will rewrite it more simply as y^<t> = g(W_y a^<t> + b_y) (number 6 in the figure above). Now W_y and b_y carry only a single subscript, denoting the type of quantity they compute: W_y is the weight matrix for computing y-type quantities, and W_a and b_a above are the parameters for computing a-type quantities, i.e. activations.

Schematic diagram of RNN forward propagation:

Fortunately, that's all. You now know a basic recurrent neural network. In the next lesson, we discuss backpropagation and how to train an RNN.

1.4 Backpropagation through time

We have already seen the basic structure of recurrent neural networks; in this video we will look at how backpropagation works in a recurrent neural network. As usual, when you implement an RNN in a programming framework, the framework typically handles backpropagation automatically, but it is still very useful to have a rough idea of how backpropagation operates in an RNN. Let's take a look.

You have already seen how forward propagation computes these activations from left to right in the network (in the direction of the blue arrows above) until all the predictions have been output. For backpropagation, as you may have guessed, the computation runs in essentially the opposite direction (the direction of the red arrows above).

Let's analyze the forward-propagation computation. You have an input sequence x^<1>, x^<2>, x^<3>, up to x^<Tx>. Using x^<1> and a^<0>, you compute the activation a^<1> of time step 1; then using a^<1> and x^<2> you compute a^<2>, and so on, up to a^<Tx>.

To actually compute a^<1>, you also need the parameters W_a and b_a, and these same parameters are used at every subsequent time step to compute a^<2>, a^<3>, and so on; all the activations depend on W_a and b_a. Given a^<1>, the network can compute the first prediction y^<1>, then at the next time step y^<2>, and so on up to y^<Ty>. Computing y requires the parameters W_y and b_y, which are used at all of these nodes.

Then, to compute backpropagation, you need a loss function. We first define the loss at a single element of the sequence (number 1 in the figure above):

L^<t>(y_hat^<t>, y^<t>) = -y^<t> log y_hat^<t> - (1 - y^<t>) log(1 - y_hat^<t>)

This is the loss for a single word in the sequence: if the word is part of a person's name, then y^<t> = 1, and the network outputs y_hat^<t>, the probability that the word is part of a name, e.g. 0.1. This is the standard logistic-regression loss, also called the cross-entropy loss, very similar to the formula we saw for binary classification. So this is the loss for the prediction of a single word, at a single position or time step.

Now we define the loss for the entire sequence, L, as (number 2 in the figure above):

L(y_hat, y) = sum over t of L^<t>(y_hat^<t>, y^<t>)

In this computation graph, the loss can be computed at each time step: first the loss of the first time step (number 3 in the figure above), then the loss of the second time step, then the third, up to the last. Finally, to obtain the overall loss, we add them all up, computing the final L (number 4 above) with the equation above (number 2 above): the sum of the losses of the individual time steps.
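The per-time-step loss and the summed sequence loss can be sketched as:

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    """Cross-entropy loss at one time step:
    L<t> = -y<t> log(y_hat<t>) - (1 - y<t>) log(1 - y_hat<t>)."""
    return -(y_t * np.log(y_hat_t) + (1 - y_t) * np.log(1 - y_hat_t))

def sequence_loss(y_hats, ys):
    """Overall loss: the sum of the per-time-step losses."""
    return sum(step_loss(yh, y) for yh, y in zip(y_hats, ys))

# Two time steps: a name word predicted at 0.9, a non-name word at 0.2
total = sequence_loss([0.9, 0.2], [1, 0])
```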

This is the complete computation graph. From earlier examples of backpropagation, you can imagine that the backpropagation algorithm computes and passes information in the opposite direction: essentially, you reverse all of the forward-propagation arrows, which lets you compute all the appropriate derivative quantities, and then update the parameters by gradient descent.

In this backward pass, the most important information transfer, the key recursive computation, is the right-to-left pass over the time steps, which is why the algorithm has the distinctive name "backpropagation through time". The reason for the name is that in forward propagation you compute from left to right, with the time index t increasing, while in backpropagation you compute from right to left, with t decreasing, as if going backwards in time. "Backpropagation through time" sounds as though you would need a time machine to implement the algorithm.
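A minimal scalar illustration of backpropagation through time: the forward pass runs left to right, caching the activations, and the gradient is then passed back right to left through them. This toy cell (a^<t> = tanh(w * a^<t-1> + x^<t>) with the loss taken to be the final activation) is a simplification for illustration only; the check against a numerical gradient confirms the recursion.

```python
import math

def forward(w, xs):
    """Forward pass of a scalar RNN: a<t> = tanh(w * a<t-1> + x<t>); loss = a<T>."""
    a = 0.0
    for x in xs:
        a = math.tanh(w * a + x)
    return a

def bptt_grad(w, xs):
    """dLoss/dw computed by backpropagation through time."""
    # Forward pass, caching every activation
    a = [0.0]
    for x in xs:
        a.append(math.tanh(w * a[-1] + x))
    # Backward pass, right to left ("backwards in time")
    grad, da = 0.0, 1.0               # da holds dLoss/da<t>
    for t in range(len(xs), 0, -1):
        dz = da * (1.0 - a[t] ** 2)   # back through the tanh
        grad += dz * a[t - 1]         # this time step's contribution to dLoss/dw
        da = dz * w                   # pass the gradient back to a<t-1>
    return grad

xs = [1.0, -0.5, 0.3]
w = 0.7
eps = 1e-6
numeric = (forward(w + eps, xs) - forward(w - eps, xs)) / (2 * eps)
analytic = bptt_grad(w, xs)
```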

Schematic diagram of RNN back propagation:

I hope you now have a general idea of how forward and backward propagation work in RNNs. So far you have seen only one main RNN setting, in which the input-sequence length and the output-sequence length are equal. In the next lesson, we will look at a wider range of RNN architectures that let you handle many more applications.

1.5 Different Types of RNNs

You have now seen one RNN architecture, in which the number of inputs T_x equals the number of outputs T_y. In other applications, T_x and T_y are not necessarily equal, and in this video you will see a richer set of RNN architectures.

You may remember the slide from the first video of this week, with many examples of inputs and outputs: there are many types, and not all of them satisfy T_x = T_y.

For example, in music generation, T_x can be 1 or even the empty set. In movie sentiment classification, the output y is an integer from 1 to 5, while the input x is a sequence. In named entity recognition, as in the example we used, the input length and output length are the same.

In some cases the input length and output length differ even though both are sequences: in machine translation, for instance, a French sentence and the English sentence expressing the same meaning can have different numbers of words.

So we modify the basic RNN structure to handle these cases. The content of this video draws on Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". Let's look at some examples.

You have already seen the example at number 1 in the figure below: we input a sequence x^<1>, x^<2>, up to x^<Tx>, and the recurrent network computes y^<1>, y^<2>, and so on. In earlier diagrams I drew a series of circles for the neurons; here, to keep the notation simple, I mostly use plain small circles. This is called a "many-to-many" architecture, because the input sequence has many inputs and the output sequence has many outputs.

Now let's look at another example. Suppose you want to do sentiment classification (number 2 in the figure below). Here x may be a piece of text, such as a movie review like "There is nothing to like in this movie.", so x is a sequence, and y may be a number from 1 to 5, or 0 or 1, representing a positive or negative review, with 1 through 5 meaning a one-, two-, three-, four-, or five-star rating. In this case we can simplify the architecture: input the words x^<1>, x^<2>, one word at a time (for the review above, the word-to-input correspondence is shown at number 2 below). Rather than producing an output at every time step, we let the RNN read the entire sentence and produce one output y at the last time step. Since it takes the whole sentence as input and then outputs a single number, this network is called a "many-to-one" architecture: many inputs, many words, and then one output.
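A "many-to-one" network can be sketched as: read the whole sequence, then produce a single output at the last time step (parameter names follow the earlier notation; the final linear score standing in for a star rating is an illustrative choice):

```python
import numpy as np

def many_to_one(xs, Waa, Wax, Wya, ba, by):
    """Read every input x<t>, then output once after the last word."""
    a = np.zeros((Waa.shape[0], 1))
    for x_t in xs:
        a = np.tanh(Waa @ a + Wax @ x_t + ba)
    return Wya @ a + by          # single output, e.g. a sentiment score

# Tiny example: an 8-word "review" of 6-dimensional word vectors
rng = np.random.default_rng(2)
n_a, n_x = 4, 6
xs = [rng.standard_normal((n_x, 1)) for _ in range(8)]
score = many_to_one(xs, rng.standard_normal((n_a, n_a)),
                    rng.standard_normal((n_a, n_x)),
                    rng.standard_normal((1, n_a)),
                    np.zeros((n_a, 1)), np.zeros((1, 1)))
```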

For completeness, there is also the "one-to-one" architecture (number 3 in the figure): a small standard neural network with a single input x and a single output y, the kind of network we covered in the first two courses of this series.

Besides "many-to-one", there is also a "one-to-many" architecture. An example of a one-to-many network is music generation (number 1 above); in fact, you will implement such a model in this week's programming exercise. The goal is to have a neural network output a series of notes corresponding to a piece of music. The input x can be a single integer indicating the genre of music you want, or the first note of the desired piece; and if you don't want to input anything at all, x can be the empty input, which we can set to the zero vector.

The structure of this network is: first the input x, then the RNN's first output value; then, with no further input, the second output, then the third, and so on, up to the last note of the musical piece (x can also be fed in here, as shown at number 3 in the figure above). One technical detail, discussed later: when you generate a sequence, you usually feed the first synthesized output back in as the input to the next time step (number 4 above), so the actual network structure ends up looking like this.
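A "one-to-many" generator with each output fed back as the next input can be sketched as follows (this requires the output and input dimensions to match; the tanh output is illustrative only, since a real music model would sample a note from a distribution):

```python
import numpy as np

def one_to_many(x0, steps, Waa, Wax, Wya, ba, by):
    """Generate `steps` outputs; each output is fed back in as the next input."""
    a = np.zeros((Waa.shape[0], 1))
    x, outputs = x0, []
    for _ in range(steps):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y = np.tanh(Wya @ a + by)  # illustrative output nonlinearity
        outputs.append(y)
        x = y                      # feed the synthesized output back in
    return outputs

rng = np.random.default_rng(3)
n_a = n_x = 4                      # output must match input dimension to feed back
notes = one_to_many(np.zeros((n_x, 1)), 5,
                    rng.standard_normal((n_a, n_a)),
                    rng.standard_normal((n_a, n_x)),
                    rng.standard_normal((n_x, n_a)),
                    np.zeros((n_a, 1)), np.zeros((n_x, 1)))
```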

We have now discussed "many-to-many", "many-to-one", "one-to-one", and "one-to-many" architectures. One more interesting variant of the many-to-many architecture deserves a closer look: the case where the input and output lengths differ. In the many-to-many example you just saw, the input and output lengths were exactly equal. For an application like machine translation, however, the number of words in the input sentence, say a French sentence, and the number of words in the output sentence, say its English translation, can differ, so we need a new architecture, a different neural network (number 2 in the figure above). The network first reads the whole input sentence, for example the French sentence to be translated into English; after reading it all, it then outputs the translation. With this structure, T_x and T_y can differ. You can also view this network as having two distinct parts: an encoder (number 5 in the figure above), which takes the input, such as the French sentence, and a decoder (number 6 in the figure above), which, having read in the whole sentence, outputs the translation into the other language.

This is an example of a " many-to-many " structure. By the end of this week you will have a good understanding of the basic components of all these structures. Strictly speaking, there is one more structure, which we will cover in week four: the " attention-based " structure; it is not easy to convey with the diagrams we have drawn so far.

To summarize these various RNN structures: this (shown as number 1 in the figure above) is the " one-to-one " structure, which, once you strip away the recurrence, is just a standard neural network. There is the " one-to-many " structure (shown as number 2 in the figure above), for example music generation or sequence generation. There is " many-to-one ": this (shown as number 3 in the figure above) is the sentiment classification example, which first reads the input, the text of a movie review, and then judges whether the reviewer liked the movie. There is the " many-to-many " structure (shown as number 4 in the figure above); named entity recognition is a " many-to-many " example where Tx equals Ty. Finally, there is the other version of " many-to-many " (shown as number 5 in the figure above), for applications such as machine translation, where Tx and Ty can differ.

Now you know most of the basic building blocks; with these you can assemble almost all of the networks you will need. The exception is sequence generation, whose details will be explained in the next lesson.

I hope you took away from this video that by combining these basic RNN building blocks you can construct a wide variety of models. But as I mentioned, sequence generation involves some extra subtleties. In this week's exercise you will implement it too: you will build a language model, and if it trains well you will get some interesting sequences, or interesting text, out of it. The next lesson will delve into sequence generation.

1.6 Language model and sequence generation

In natural language processing, building a language model is one of the most basic and important tasks, and it can be implemented well with an RNN . In this video you will learn to use an RNN to build a language model. At the end of this week there is a very interesting programming exercise in which you can build a language model and use it to generate Shakespearean text, or other kinds of text.

So what is a language model? Suppose you are building a speech recognition system and you hear the sentence " the apple and pear (pair) salad was delicious. " So what did I say? Did I say " the apple and pair salad " or " the apple and pear salad "? ( pear and pair are homophones: they sound identical.) You probably think what I said sounds more like the second one. In fact, that is what a good speech recognition system should output, even though the two sentences sound exactly the same. The way a speech recognition system chooses the second sentence is by using a language model, which can compute the probability of each of the two sentences.

For example, a language model might compute that the probability of the first sentence is P(The apple and pair salad) = 3.2 × 10⁻¹³ and the probability of the second is P(The apple and pear salad) = 5.7 × 10⁻¹⁰. Comparing these two values, what I said is clearly much more likely to be the second sentence, because its probability is more than 1,000 times higher than the first; this is why the speech recognition system is able to choose between the two sentences.

So what a language model does is tell you the probability of a particular sentence appearing. Think of probability this way: suppose you pick up a newspaper at random, open any email or any web page, or listen to the next sentence someone — say, a friend — is about to say; what is the probability that the sentence you are about to encounter, from somewhere in the world, is a particular one, such as " the apple and pear salad "? The language model is a basic component of two kinds of systems: the speech recognition system just mentioned, and machine translation systems, both of which must output the most likely sentence. The most basic job of a language model is this: given a sentence — to be precise, a sequence of tokens y<1>, y<2>, and so on up to y<Ty> — estimate its probability. For a language model it is more natural to denote the sequence elements y<t> rather than x<t>; the model then estimates P(y<1>, y<2>, ..., y<Ty>), the probability of that particular word sequence.

So how do you build a language model? To build one with an RNN , you first need a training set containing a large corpus of English text — or text in whatever language you want the model to use. " Corpus " is an NLP term that simply means a very long text, or a large collection of sentences.

Suppose one sentence in your training set is " Cats average 15 hours of sleep a day. " The first thing to do is tokenize the sentence: as in the earlier videos, build a dictionary and then map each word to the corresponding one-hot vector, that is, to its index in the dictionary. You may also want to mark where sentences end. A common approach is to append an extra token called EOS (shown as number 1 in the figure above) denoting the end of a sentence, which helps the model figure out when a sentence finishes; we will discuss this in detail later. If you want your model to identify sentence endings accurately, you can attach the EOS token to the end of every sentence in the training set. We won't need the EOS token in this week's exercise, but you will see its usefulness in some applications later. So in this example, with the EOS token added, the sentence has 9 inputs: y<1>, y<2>, and so on up to y<9>. During tokenization you can also decide for yourself whether to treat punctuation marks as tokens. In this example we ignore punctuation, so " day " is a token but the period after it is not; if you want periods or other symbols to be tokens too, you can add them to your dictionary.

Now there is a question: what if some words in your training set are not in your dictionary? Say your dictionary has 10,000 words — the 10,000 most common English words — and you encounter the sentence " The Egyptian Mau is a breed of cat. " The word Mau is probably not among those 10,000 most common words. In this case, you replace Mau with a token called UNK , which stands for " unknown word "; you then model the probability of UNK rather than of the specific word Mau .
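The tokenization steps just described — lower-casing, splitting into words, appending EOS, and falling back to UNK for out-of-vocabulary words — can be sketched as follows. The tiny vocabulary and its indices are hypothetical (a real model would use on the order of 10,000 common words):

```python
# Hypothetical toy vocabulary; a real model would use ~10,000 common words.
vocab = ["a", "average", "cats", "day", "hours", "of", "sleep", "15", "<EOS>", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def tokenize(sentence):
    """Map each word to its dictionary index, appending an <EOS> end marker
    and falling back to <UNK> for out-of-vocabulary words."""
    words = sentence.lower().replace(".", "").split()
    words.append("<EOS>")
    return [word_to_index.get(w, word_to_index["<UNK>"]) for w in words]

print(tokenize("Cats average 15 hours of sleep a day."))
# → [2, 1, 7, 4, 5, 6, 0, 3, 8]  (9 tokens, matching the 9 inputs above)
# The unknown word "Mau" falls back to the <UNK> index:
print(tokenize("The Egyptian Mau is a breed of cat."))
```

Note that punctuation is simply stripped here, matching the convention above of not treating the period as a token.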

Having completed the tokenization step, which maps the input sentence to individual tokens — words in the dictionary — the next step is to build an RNN that models the probability of these sequences. One thing you will see on the next slide is that we end up setting x<t> to y<t-1>.

Now let's build the RNN model, continuing with " Cats average 15 hours of sleep a day. " as our running example; I will draw the RNN structure. At the first time step you compute the activation a<1> as a function of the input x<1>, which is set to the all-zero vector; the previous activation a<0> is, as usual, also set to the zero vector. So all this step has to do is use a softmax to predict what the first word might be, producing ŷ<1> (shown as number 1 in the figure above). This step uses a softmax layer to predict the probability of every word in the dictionary being the first word: the probability that the first word is " a ", that it is " Aaron ", that it is " cats ", ... that it is " Zulu ", that it is UNK (an unknown word), or that it is EOS (the end of a sentence — which you would not normally expect as the first output). So the output ŷ<1> is the softmax's result, giving only the probabilities for the first word regardless of what comes later; in our example the first word turns out to be " Cats ". The softmax layer therefore outputs 10,000 values, since your dictionary has 10,000 words — or 10,002 values if you add the unknown-word token UNK and the end-of-sentence token EOS as two extra tokens.

Then the RNN moves to the next time step, which uses the activation a<1>. At this step the job is to figure out what the second word is. We now feed it the correct first word: we set x<2> = y<1>, telling it that the first word was " Cats " (shown as number 2 in the figure above). In this second time step the output is again predicted by the softmax layer: the RNN's job is to predict the probability of each possible word (shown as number 3 in the figure above), regardless of what that word turns out to be — " a " or " Aaron ", " Cats " or " Zulu " or UNK (unknown word) or EOS or anything else — conditioned only on the words it has seen so far. So in this case, the correct answer will be " average ", because the sentence really does begin " Cats average ".

Then the RNN proceeds to the next time step to compute a<3>. To predict the third word, which is " 15 ", we now feed it the two preceding words, telling it that " Cats average " are the first two words of the sentence; so the next input is x<3> = y<2> = " average ". Given " average ", we now compute the probability of each word in the dictionary being the next word of the sequence (shown as number 4 in the figure above), conditioned on the " Cats average " obtained before; in this case the correct result is " 15 ", and so on.

This continues until the end: at the 9th time step the input passed in (shown as number 5 in the figure above) is x<9> = y<8>, the word " day ", and the output is ŷ<9>. No matter what the earlier words were, we hope the model here predicts a high probability for the EOS end-of-sentence token (shown as number 6 in the figure above).

So each step of the RNN conditions on all the words obtained so far — for example, it is given the first 3 words (shown as number 7 in the figure above) and asked for the distribution over the next word. In this way the RNN learns to predict one word at a time, moving from left to right.

Next, to train this network we must define a cost function. At a given time step t, if the true word is y<t> and the softmax prediction of the network is ŷ<t>, then the per-step softmax loss (shown as number 8 in the figure above) is L(ŷ<t>, y<t>) = −Σ_i y_i<t> log ŷ_i<t>. The overall loss (shown as number 9 in the figure above) is simply the sum of the loss functions of all the individual predictions: L = Σ_t L(ŷ<t>, y<t>).
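In numpy, this loss can be sketched in a few lines. Since y<t> is one-hot, the inner sum reduces to −log of the probability assigned to the true word; the vocabulary size and probability values below are hypothetical:

```python
import numpy as np

def step_loss(y_hat, y_index):
    """Softmax cross-entropy at one time step: since y<t> is one-hot,
    the loss is -log of the probability assigned to the true word."""
    return -np.log(y_hat[y_index])

def total_loss(y_hats, y_indices):
    """Overall cost: the sum of the per-step losses across the sequence."""
    return sum(step_loss(y_hat, i) for y_hat, i in zip(y_hats, y_indices))

# Toy example: a 4-word vocabulary over 2 time steps, with made-up
# softmax outputs; the true words are at indices 1 and 2.
y_hats = [np.array([0.1, 0.7, 0.1, 0.1]), np.array([0.25, 0.25, 0.4, 0.1])]
print(total_loss(y_hats, [1, 2]))  # -log(0.7) - log(0.4)
```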

If you train this RNN on a large training set, then, given an initial sequence of words such as " Cats average 15 " or " Cats average 15 hours of ", it can predict the probability of the subsequent words. Moreover, given a new sentence, say y<1>, y<2>, y<3> — just 3 words, for simplicity (as shown in the figure above) — you can compute the probability of the entire sentence like this: the first softmax layer tells you P(y<1>) (shown as number 1 in the figure above), which is also the first output; the second softmax layer tells you P(y<2> | y<1>) (shown as number 2 in the figure above); and the third softmax layer tells you P(y<3> | y<1>, y<2>) (shown as number 3 in the figure above). Multiply these three probabilities together and you get P(y<1>, y<2>, y<3>), the probability of the whole 3-word sentence.
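This chain-rule product is simple arithmetic. As a minimal sketch, suppose the three softmax layers assigned these (entirely hypothetical) probabilities to the words actually observed at each position:

```python
p_y1 = 0.03              # P(y<1>)
p_y2_given_y1 = 0.2      # P(y<2> | y<1>)
p_y3_given_y1_y2 = 0.4   # P(y<3> | y<1>, y<2>)

# The probability of the whole 3-word sentence is their product:
p_sentence = p_y1 * p_y2_given_y1 * p_y3_given_y1_y2
print(p_sentence)  # 0.0024
```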

That is the basic structure of training a language model with an RNN . These ideas may sound a bit abstract, but don't worry: you will implement them yourself in the programming exercise. One of the most interesting things you can do with a language model — the topic of the next lesson — is to sample from it.

1.7 Sampling novel sequences

After you have trained a sequence model, one informal way to get a sense of what it has learned is to sample novel sequences from it. Let's see how that is done.

Remember that a sequence model models the probability of any particular sequence of words; what we do now is sample from that probability distribution to generate a new sequence of words. The network shown as number 1 in the figure below was trained with the structure shown above, but to sample (the network shown as number 2 in the figure below) you do something somewhat different.

The first step is to sample the first word you want the model to generate. You feed in the usual x<1> = 0 and a<0> = 0, and your first time step outputs the softmax probabilities of all possible words. You then sample at random according to this softmax distribution. The softmax distribution tells you the probability that the first word is " a ", the probability that it is " aaron ", ..., the probability that it is " zulu ", the probability that it is UNK (the unknown-word token), and perhaps the probability that it is the end-of-sentence token. You then apply, for example, the numpy command np.random.choice (shown as number 3 in the figure above) to this vector, sampling according to the probabilities it contains, and that gives you your sample of the first word.

Then continue to the next time step. Remember that the second time step expects y<1> as input — but what you do now is take the sample ŷ<1> you just obtained (shown as number 4 in the figure above) and pass it along as the input to the next time step. So whatever word you sampled at the first time step gets passed to the next position as input, and the softmax layer then predicts ŷ<2>. For example, suppose sampling the first word gave you " The " — a very common first word — then you pass " The " in as the next input, and the model now computes what the second word should be given that the first word is " The " (shown as number 5 in the figure above). You get ŷ<2>, and you run the same sampling function on it again.

Then at the next time step, whatever one-hot code you obtained, you pass it forward and sample the third word. You keep passing along whatever you get, and continue like this until the last time step.

So how do you know when a sentence is finished? One method: if the token that represents the end of a sentence is in your dictionary, you can keep sampling until you draw the EOS token (shown as number 6 in the figure above), which means the end has been reached and you can stop sampling. Alternatively, if EOS is not in your dictionary, you can simply decide to sample 20 or 100 or some other number of words, and keep sampling until that many time steps have elapsed. This process sometimes generates unknown tokens (shown as number 7 in the figure above). If you want to guarantee that your algorithm never outputs them, one thing you can do is reject any unknown token during sampling: whenever one appears, resample from the remaining words until you get a word that is not the unknown token. Or, if you don't mind the occasional unknown token, you can simply ignore them.
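The whole sampling loop can be sketched as below. The RNN step here is a stand-in that returns random softmax probabilities — a real implementation would use the trained weights — and the vocabulary size and token indices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

def fake_rnn_step(a_prev, x):
    """Stand-in for a trained RNN step: returns a softmax distribution
    over the vocabulary plus the next hidden state. (A real model would
    compute this from its learned weights and the inputs.)"""
    logits = rng.normal(size=vocab_size)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, a_prev

EOS, UNK = 8, 9  # hypothetical indices of the <EOS> and <UNK> tokens

def sample_sequence(max_len=20):
    a, x = np.zeros(5), 0
    out = []
    for _ in range(max_len):
        probs, a = fake_rnn_step(a, x)
        idx = rng.choice(vocab_size, p=probs)  # sample from the softmax
        while idx == UNK:                      # optionally reject <UNK>
            idx = rng.choice(vocab_size, p=probs)
        if idx == EOS:                         # stop at end-of-sentence
            break
        out.append(int(idx))
        x = idx                                # feed the sample forward
    return out

print(sample_sequence())
```

The key design point is the feedback at the bottom of the loop: the sampled word becomes the next input, exactly as described above.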

This is how you generate a randomly chosen sentence from your RNN language model. So far, what we have built is a word-level RNN model, meaning the entries in the dictionary are English words (shown as number 1 in the figure below).

Depending on your application, you can also build a character-level RNN . In that case your dictionary contains just the letters a to z , perhaps a space character, and, if you need them, the digits 0 to 9 ; if you want to distinguish uppercase from lowercase, you can add uppercase letters too. You can also simply look at which characters actually appear in your training set and build your dictionary out of those (shown as number 2 in the figure above).

If you build a character-level language model, then, compared with a word-level model, your sequence y<1>, y<2>, y<3>, ... consists of the individual characters of your training data rather than individual words. So for the earlier example, the sentence " Cats average 15 hours of sleep a day. " (shown as number 3 in the figure above) would give y<1> = " C ", y<2> = " a ", y<3> = " t ", and so on, with spaces themselves counting as tokens.
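A character-level dictionary is easy to sketch; this minimal version uses only lowercase letters and the space (a real one might add digits, punctuation, and uppercase), and shows how many more time steps the same sentence produces:

```python
# A character-level model's dictionary holds characters, not words.
chars = list("abcdefghijklmnopqrstuvwxyz") + [" "]
char_to_index = {c: i for i, c in enumerate(chars)}

sentence = "cats average 15 hours of sleep a day"
# Characters outside this toy dictionary (here the digits) are skipped.
tokens = [char_to_index[c] for c in sentence if c in char_to_index]
print(len(tokens))  # 34 time steps, versus 9 word-level tokens
```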

A character-level language model has advantages and disadvantages. The advantage is that you never have to worry about unknown tokens: a character-level model can assign a non-zero probability to a sequence like " Mau ", whereas a word-level model whose dictionary lacks " Mau " must treat it as the unknown token UNK . The main disadvantage of character-level models is that you end up with much, much longer sequences: most English sentences have only 10 to 20 words but may contain a great many characters. As a result, character-level models are not as good as word-level models at capturing long-range dependencies — the way the early part of a sentence influences the later part — and they are also more expensive to train. So the trend I see in natural language processing is that word-level models remain the default, but as computers become more powerful, character-level models are starting to be used in more applications — special cases that must handle a lot of unknown text or unknown vocabulary, or many specialized proprietary words — though the greater computational cost means they are not yet widely used.

With the methods you now have, you can build an RNN , train it on a corpus of English text as a word-level or character-level language model, and then sample from the trained model.

Here are some samples drawn from a language model — character-level models, to be precise; you can implement such models yourself in the programming exercises. If the model is trained on news articles, it generates the text on the left: somewhat ungrammatical pseudo-news, though a phrase like " Concussion epidemic ", to be examined , does sound a bit like a news report. Trained on Shakespeare's writing, it generates the text on the right, which sounds rather like something Shakespeare might have written:

" The mortal moon hath her eclipse in love.

And subject of this thou art another this fold.

When besser be my love to me see sabl's.

For whose are ruse of mine eyes heaves.

Those are the basic RNN structure, how to build a language model with it, and how to sample from the trained model. In the following videos I want to discuss some of the deeper challenges of training RNNs and how to adapt to them — in particular the vanishing gradient problem — in order to build more powerful RNN models. In the next lesson we will talk about vanishing gradients and begin covering the GRU , the gated recurrent unit, and the LSTM , the long short-term memory model.

1.8 Vanishing gradients with RNNs

You now understand how RNNs work and how to apply them to specific problems such as named entity recognition and language modeling, and you have also seen how backpropagation is used with RNNs . In fact, the basic RNN algorithm has a serious problem: vanishing gradients. We discuss it in this lesson, and in the next few lessons we discuss some methods for solving it.

You already know what an RNN looks like; now consider a language-modeling example. Take the sentence (shown as number 1 in the figure above) " The cat, which already ate ..., was full. " — it has to be consistent front to back: because " cat " is singular, we use " was ". " The cats, which ate ..., were full. " (shown as number 2 in the figure above) — " cats " is plural, so we use " were ". This example has a very long-term dependency: an early word affects words much later in the sentence. But the basic RNN model we have seen so far (the network shown as number 3 in the figure above) is not good at capturing such long-term dependencies; let's see why.

You may remember our earlier discussion of vanishing gradients when training deep networks. For a very deep network (shown as number 4 in the figure above), 100 layers or even deeper, the network propagates forward from left to right and then backpropagates. If the network is that deep, the gradient computed from the output has great difficulty propagating back: it can hardly affect the weights, or the computations, of the earliest layers (the layers shown as number 5).

An RNN has the same problem: forward propagation runs from left to right, then backpropagation runs from right to left — and backpropagation is very difficult because of the same vanishing-gradient issue. The output error at a late time step (shown as number 6 in the figure above) hardly affects the computations at early time steps (shown as number 7 in the figure above). This means it is genuinely hard for the network to learn to remember whether it saw a singular or a plural noun, so that much later in the sequence it can generate " was " or " were " accordingly. And in English the material in the middle (shown as number 8 in the figure above) can be arbitrarily long, right? So the network needs to hold the singular/plural information for a long time before it can be used later in the sentence. For this reason, the basic RNN exhibits mostly local influences: an output (shown as number 9 in the figure above) is mainly influenced by nearby values (shown as number 10 in the figure above), and the value shown as number 11 is mainly related to nearby inputs (shown as number 12 in the figure above). The late output shown as number 6 is essentially unaffected by inputs at the very start of the sequence: however right or wrong that output is, its error can hardly propagate back to the front of the sequence, and so the network cannot adjust the early computations. This is a shortcoming of the basic RNN algorithm, which we address in the next few videos; left unaddressed, RNNs are poor at handling long-term dependencies.

Although we have been focusing on vanishing gradients, you may remember from our discussion of deep neural networks that exploding gradients can also occur: during backpropagation, as the number of layers grows, gradients may not only decrease exponentially but also increase exponentially. In fact, vanishing gradients are the primary problem when training RNNs . Exploding gradients do happen too, but they are obvious when they do: exponentially large gradients make your parameters blow up until the network's computations break down, and you will see a lot of NaN s (" not a number " values), meaning your network computations have numerically overflowed. If you run into exploding gradients, one solution is gradient clipping: look at your gradient vectors, and if one exceeds some threshold, rescale it so that it is not too large — clipping by some maximum value. So if you encounter exploding gradients — huge derivative values, or NaN s appearing — apply gradient clipping; it is a relatively robust fix. That is the solution for exploding gradients. Vanishing gradients, however, are much harder to solve, and they are the subject of the next few videos.
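Gradient clipping is only a couple of lines in numpy. This sketch clips each gradient element-wise at a threshold (clipping by the overall norm is another common variant); the gradient names and values are hypothetical:

```python
import numpy as np

def clip_gradients(gradients, max_value=10.0):
    """Gradient clipping: clamp every component of every gradient to
    [-max_value, max_value], guarding against exploding gradients."""
    return {name: np.clip(g, -max_value, max_value)
            for name, g in gradients.items()}

# Hypothetical gradients, two of which have exploded:
grads = {"dWa": np.array([1e6, -3.0]), "dba": np.array([0.5, -1e7])}
print(clip_gradients(grads))
# dWa becomes [10., -3.], dba becomes [0.5, -10.]
```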

To sum up: in the previous course we learned that when training a deep neural network, as the number of layers increases, derivatives may decrease or increase exponentially, so we may encounter vanishing or exploding gradients. An RNN processing a sequence of 1,000 time steps, or 10,000 time steps, is effectively a 1,000-layer or 10,000-layer neural network, and it runs into exactly these problems. Exploding gradients can basically be handled with gradient clipping, but vanishing gradients are trickier. In the next section we introduce the GRU , the gated recurrent unit, a network that effectively mitigates the vanishing gradient problem and lets your neural network capture much longer-range dependencies. Let's go on to the next video to find out how.

1.9 Gated Recurrent Unit ( GRU )

You now understand how the basic RNN model operates. In this video you will learn about the gated recurrent unit, which modifies the hidden layer of the RNN so that it captures long-range connections much better and greatly improves the vanishing gradient problem. Let's take a look.

You have already seen the formula for computing the RNN's activation at time t: a<t> = g(Wa[a<t-1>, x<t>] + ba). Let me draw a picture of the RNN unit: a box that takes as input a<t-1> (shown as number 1 in the figure above), the activation of the previous time step, and also x<t> (shown as number 2 in the figure above). These two are combined and multiplied by the weights; after this linear computation (shown as number 3 in the figure above), if g is a tanh activation function, the tanh produces the activation a<t>. The activation may then be passed to a softmax unit (shown as number 4 in the figure above), or whatever else is used to produce the output ŷ<t>. This picture is a visualization of one hidden-layer unit of the RNN ; I show it because we will use a similar picture to explain the gated recurrent unit.
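The boxed computation just described can be sketched directly in numpy. The dimensions and random weights below are arbitrary placeholders, not trained values:

```python
import numpy as np

def rnn_cell(a_prev, x, Wa, ba, Wy, by):
    """One basic RNN step: a<t> = tanh(Wa @ [a<t-1>, x<t>] + ba),
    then y_hat<t> = softmax(Wy @ a<t> + by)."""
    a = np.tanh(Wa @ np.concatenate([a_prev, x]) + ba)
    z = Wy @ a + by
    y_hat = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
    return a, y_hat

# Tiny example: hidden size 4, input size 3, vocabulary size 5.
rng = np.random.default_rng(1)
a0, x1 = np.zeros(4), rng.normal(size=3)
Wa, ba = rng.normal(size=(4, 7)), np.zeros(4)  # 7 = 4 (hidden) + 3 (input)
Wy, by = rng.normal(size=(5, 4)), np.zeros(5)
a1, y_hat1 = rnn_cell(a0, x1, Wa, ba, Wy, by)
print(y_hat1.sum())  # the softmax output sums to 1, a valid distribution
```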

Many of the GRU's ideas come from two papers, by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho and Yoshua Bengio. Let me reuse the sentence from the last video, " The cat, which already ate ..., was full. " You need to remember that the cat is singular in order to understand why it is " was " and not " were ": " The cat was full. " or " The cats were full. " As we read this sentence from left to right, the GRU unit has a new variable, c, standing for cell — the memory cell (shown as number 1 in the figure below). The memory cell's role is to provide a bit of memory — remembering, say, whether the cat was singular or plural — so that when the model reaches the later part of the sentence it can still judge whether the subject was singular or plural. So at time t the memory cell has value c<t>, and what we will see is that the GRU actually outputs an activation a<t> equal to c<t> (shown as number 2 in the figure below). We use two different symbols, c and a, for the memory cell value and the output activation even though they are the same here; I keep this notation because when we get to LSTMs these will be two different values. For the GRU , though, c<t> equals the activation a<t>.

These equations describe the computation of the GRU unit. At each time step we compute a candidate value for overwriting the memory cell — call it c̃<t>, a candidate to replace c<t> — using a tanh activation: c̃<t> = tanh(Wc[c<t-1>, x<t>] + bc). So c̃<t> is a candidate value, a potential replacement for c<t> (shown as number 3 in the figure below).

Here comes the key point. The really important idea in the GRU is that we have a gate. I will call this gate Γu (shown as number 4 in the figure above) — an uppercase Greek gamma with subscript u for " update " — and it is a value between 0 and 1. To build intuition about how the GRU works, think of Γu as always being either nearly 0 or nearly 1. In practice, its value is obtained by passing a linear combination through a sigmoid: Γu = σ(Wu[c<t-1>, x<t>] + bu). Remember the sigmoid function, shown as number 5 in the figure above: its output is always between 0 and 1, and for most possible inputs it is very close to 0 or very close to 1. So with this intuition you can imagine Γu being almost exactly 0 or 1 most of the time. The letter u stands for " update ". I chose the letter Γ because a capital gamma looks a bit like a gate, and G is the first letter of " gate ", so Γ represents the gate.

The key part of the GRU is then the candidate equation numbered 3 above together with the update equation we just wrote: the gate Γu decides whether we actually update the memory cell. Look at it this way: the memory cell c<t> is set to 0 or 1 depending on whether the word under consideration is singular or plural — since " cat " here is singular, suppose we set it to 1, and if it were plural we would set it to 0. The GRU unit then keeps remembering the value of c<t> all the way to the position shown as number 7 in the figure above, where c<t> is still 1, telling the model: ah, this was singular, so use " was ". The gate's job is to decide when to update this value. In particular, when you see the phrase " the cat " — the subject of the sentence — that is a good time to update the memory; and once you have finished using it, " The cat, which already ate ..., was full. ", you know you no longer need to remember it and can forget it.

So the formula the GRU actually uses is c<t> = Γu * c̃<t> + (1 − Γu) * c<t-1> (shown as number 1 in the figure above). Notice that if the gate Γu = 1, the update really happens: the new value c<t> is set to the candidate c̃<t> (set the gate to 1, as shown by number 2 in the figure above, and this value is updated and carried forward). For all the positions in the middle, the gate value should be 0, meaning: do not update, keep the old value — because if Γu = 0, then c<t> = c<t-1>, the old value. So even as the model scans the sentence from left to right, while the gate is 0 (shown as number 3 in the figure above: the gate stays 0 through the middle, indicating no update) it simply keeps the old value without forgetting it, and when you finally reach the position shown as number 4 in the figure above, the cell still remembers that the cat was singular.
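The three GRU equations translate directly into numpy. This sketch also demonstrates the memory-keeping behavior: with zero weights and a large negative gate bias (hypothetical values chosen to force Γu toward 0), the cell carries its old value forward unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(c_prev, x, Wc, bc, Wu, bu):
    """Simplified GRU step, following the equations above:
       c_tilde = tanh(Wc @ [c<t-1>, x<t>] + bc)    candidate value
       gamma_u = sigmoid(Wu @ [c<t-1>, x<t>] + bu) update gate in (0, 1)
       c<t>    = gamma_u * c_tilde + (1 - gamma_u) * c<t-1>"""
    concat = np.concatenate([c_prev, x])
    c_tilde = np.tanh(Wc @ concat + bc)
    gamma_u = sigmoid(Wu @ concat + bu)
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

# When the gate's pre-activation is a large negative number, gamma_u ≈ 0
# and the cell just keeps its old value — this is what lets the GRU
# remember "cat is singular" across many time steps.
c_prev, x = np.array([1.0, -0.5]), np.array([0.3])
Wc, Wu = np.zeros((2, 3)), np.zeros((2, 3))
bu = np.array([-20.0, -20.0])  # forces gamma_u ≈ 0
c_next = gru_cell(c_prev, x, Wc, np.zeros(2), Wu, bu)
print(c_next)  # ≈ [1.0, -0.5]: the memory is preserved
```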

Let me draw another picture (shown below) to explain the GRU unit. By the way, when you read blogs, textbooks or tutorials online, pictures like these are a popular way of explaining the GRU and the LSTM we will discuss later. I personally find the equations easier to understand than the pictures, so if the picture doesn't click for you, don't worry; I draw it only in case it helps.

The GRU unit takes c<t-1> as input (shown as number 1 in the figure below) from the previous time step — for now, assume it is exactly equal to a<t-1>. It also takes x<t> as input (shown as number 2 in the figure below). These two are combined with the appropriate weights and run through a tanh to compute c̃<t>, the candidate replacement value.

A different parameter set is then used to compute Γu, the update gate, through the sigmoid activation. Finally, all these values are combined by another operation; rather than write out a formula, I have marked this box with purple shading (shown as number 5 in the figure below; the computation it represents is the equation shown as number 13 in the figure below). So what the purple operator does is take as input the gate value Γu (shown as number 6 in the figure below), the new candidate value c̃<t> (shown as number 7 in the figure below), the complementary gate value 1 − Γu (shown as number 8 in the figure below), and the old value c<t-1> (shown as number 9 in the figure below). Taking these inputs (shown as numbers 1, 3 and 4 in the figure below) together, it produces the new value of the memory cell, so that c<t> = a<t>. If you want, you can then feed this into a softmax or whatever else produces the prediction ŷ<t>.

This is the GRU unit, or rather a simplified GRU unit. Its advantage is that as you scan a sentence from left (shown as number 10 above) to right, the gate decides when to update a memory cell and when not to (shown as number 11 above, where the gate stays 0 in the middle, meaning no update), until you really need to use the memory cell (shown as number 12 above), which may have been set much earlier in the sentence. Because the gate Γu is computed with a sigmoid, it is very easy for it to take values near 0: as long as the sigmoid's input is a large negative number, the gate is roughly 0, or very, very close to 0, say 0.000001 or smaller. In that case the update formula (the equation shown as number 13 above) becomes c⟨t⟩ ≈ c⟨t−1⟩, which is very good at maintaining the cell's value, and there is no vanishing gradient problem for that value. Because the gate is so close to 0, c⟨t⟩ is almost exactly c⟨t−1⟩, and the value is well maintained even across many, many time steps (shown as number 14 above). This is the key to how the GRU alleviates the vanishing gradient problem, and thus allows the neural network to learn very long-range dependencies, such as between "cat" and "was" even when they are separated by many words in the middle.

Now some implementation details. In the formulas I wrote, c⟨t⟩ can be a vector (shown as number 1 in the figure above): if your hidden activation value is 100-dimensional, then c⟨t⟩ is 100-dimensional too, c̃⟨t⟩ has the same dimension, Γu has the same dimension, and so do the other values drawn in the box. In that case the "*" in the update formula is actually an element-wise product: if the gate Γu is a 100-dimensional vector whose entries are mostly values close to 0 or 1, it tells you which of the 100 dimensions of the memory cell (shown as number 1 in the figure above) are the bits you want to update.

Of course, in practice Γu will not be exactly 0 or 1; sometimes it takes an intermediate value between 0 and 1 (shown as number 5 in the figure above), but for intuition it is convenient to treat each entry as exactly 0 or exactly 1. What the element-wise product does is tell the GRU unit which dimensions of the memory cell vector to update at each time step, so you can choose to keep some bits unchanged while updating others. For example, you might use one bit to remember whether "cat" is singular or plural, and some other bits to track that you are talking about food, because the sentence mentions eating, and later says " The cat was full. "; you can change only a few bits at each time step.
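The element-wise gated update above can be sketched in a few lines of NumPy. This is a toy illustration, not the full GRU: the gate vector is set by hand (all zeros except one "singular/plural" bit) to show how only the gated dimensions of the memory cell change.

```python
import numpy as np

np.random.seed(0)
c_prev = np.random.randn(100)   # c<t-1>: the old memory cell, 100-dimensional
c_tilde = np.random.randn(100)  # candidate value c̃<t>, same dimension
gamma_u = np.zeros(100)         # update gate: treat each entry as ~0 or ~1
gamma_u[0] = 1.0                # update only the hypothetical "cat is singular" bit

# element-wise gated update: c<t> = Γu * c̃<t> + (1 - Γu) * c<t-1>
c = gamma_u * c_tilde + (1 - gamma_u) * c_prev

assert np.allclose(c[1:], c_prev[1:])  # ungated bits keep the old value
assert np.isclose(c[0], c_tilde[0])    # the gated bit takes the candidate
```

The two assertions make the point of the element-wise product concrete: 99 of the 100 dimensions pass the old value through unchanged, and only the dimension whose gate is 1 takes the new candidate.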

You have now understood the most important idea of GRU. What is shown in the slide is actually a simplified GRU unit. Now let's describe the complete GRU unit.

For the full GRU unit, one change is to the first formula we computed, the candidate value of the memory cell: we add a new gate Γr (shown as number 1 in the figure below). You can think of r as standing for relevance: this gate tells you how relevant c⟨t−1⟩ is to computing the next candidate c̃⟨t⟩, so that c̃⟨t⟩ = tanh(Wc [Γr * c⟨t−1⟩, x⟨t⟩] + bc). Computing this gate requires its own parameters: Γr = σ(Wr [c⟨t−1⟩, x⟨t⟩] + br), with a new parameter matrix Wr.

As you can see, there are many ways to design these units, so why have Γr at all? Why not just use the simpler version from the previous slide? Because over the years researchers have experimented with many possible designs for these units, trying to give the network longer-range connections and larger-range effects, and to address the vanishing gradient problem, and the GRU shown here is one of the versions researchers have converged on as most commonly used; it has also proven very robust and practical across many different problems. You can try to invent new variants if you like, but this GRU is the standard, most commonly used version. Researchers have of course tried many other variants, similar to this one but not identical. Then the other commonly used version is called the LSTM , which stands for long short-term memory network, and which we will cover in the next video; GRU and LSTM are the two most commonly used concrete instances of gated units in neural network architectures.
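Putting the formulas together, one forward step of the full GRU (with the relevance gate Γr) can be sketched as follows. This is an illustrative NumPy implementation with made-up dimensions and randomly initialized parameters, not trained code; the function name `gru_cell` and the parameter packing are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_cell(c_prev, x, params):
    """One forward step of a full GRU cell (with relevance gate Γr)."""
    Wu, bu, Wr, br, Wc, bc = params
    concat = np.concatenate([c_prev, x])            # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)             # update gate Γu
    gamma_r = sigmoid(Wr @ concat + br)             # relevance gate Γr
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)
    c = gamma_u * c_tilde + (1 - gamma_u) * c_prev  # gated memory update
    return c                                        # a<t> = c<t> in a GRU

# toy dimensions: 4-dim memory cell, 3-dim input
n_c, n_x = 4, 3
rng = np.random.default_rng(0)
params = (rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c),
          rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c),
          rng.standard_normal((n_c, n_c + n_x)), np.zeros(n_c))
c = gru_cell(np.zeros(n_c), rng.standard_normal(n_x), params)
print(c.shape)  # (4,)
```

To process a sentence you would call `gru_cell` once per time step, feeding each returned `c` back in as `c_prev`.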

A note on notation. I have tried to use consistent symbols to make these concepts easier to understand. If you read academic papers, you will sometimes see other symbols, such as r and u, used for these quantities. But I have tried to keep the notation consistent between the GRU and the LSTM , for example using Γ to denote the gates, which I hope makes these concepts easier to follow.

So that is the GRU , the gated recurrent unit, one of the standard building blocks of RNNs . This structure lets the network capture much longer-range dependencies and makes the RNN more effective. Next I will briefly mention the other commonly used unit, the classic LSTM , or long short-term memory network, which we explain in the next video.

( Chung J, Gulcehre C, Cho KH, et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling[J]. Eprint Arxiv, 2014.

Cho K, Merrienboer BV, Bahdanau D, et al. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches[J]. Computer Science, 2014. )

1.10 Long short-term memory ( LSTM ) unit

In the previous video you learned about the GRU (gated recurrent unit), which lets a network learn very long-range connections in a sequence. Other types of units also allow this, such as the LSTM , the long short-term memory network, which in some ways is even more effective than the GRU . Let's take a look.

Here are the GRU equations from the last video: the candidate value is c̃⟨t⟩ = tanh(Wc [Γr * c⟨t−1⟩, x⟨t⟩] + bc), with a⟨t⟩ = c⟨t⟩.

There are also two gates:

the update gate Γu = σ(Wu [c⟨t−1⟩, x⟨t⟩] + bu),

the relevance gate Γr = σ(Wr [c⟨t−1⟩, x⟨t⟩] + br).

c̃⟨t⟩ is the candidate value for replacing the memory cell, and the update gate decides whether to use it: c⟨t⟩ = Γu * c̃⟨t⟩ + (1 − Γu) * c⟨t−1⟩.

The LSTM is an even more powerful and general version than the GRU , and we owe it to Sepp Hochreiter and Jürgen Schmidhuber, whose groundbreaking paper had a huge impact on sequence models. The paper itself is quite difficult to read: although it was very influential in the deep learning community, it goes deep into the theory of vanishing gradients, and I think most people learn the details of the LSTM elsewhere rather than from this paper.

Here are the main LSTM equations (shown as number 2 in the figure above). We again have a memory cell c⟨t⟩ and a candidate value c̃⟨t⟩ for updating it (shown as number 3 in the figure above). Note that in the LSTM we no longer have a⟨t⟩ = c⟨t⟩. We use an equation similar to the one on the left (shown as number 4 in the figure above), but with some changes: the candidate is now c̃⟨t⟩ = tanh(Wc [a⟨t−1⟩, x⟨t⟩] + bc), using a⟨t−1⟩ rather than c⟨t−1⟩, and we do not use the relevance gate Γr. You could use a variant of the LSTM that puts all these elements (the GRU formulas shown on the left) back in, but in the more typical LSTM we don't do that.

As before, we have an update gate with its own parameters: Γu = σ(Wu [a⟨t−1⟩, x⟨t⟩] + bu) (shown as number 5 in the figure above). A new feature of the LSTM is that the update is no longer controlled by a single gate: the two terms Γu and 1 − Γu (shown as numbers 6 and 7 in the figure above) are replaced by separate gates. Here (number 6 in the figure above) we keep Γu;

and here (shown as number 7 in the figure above) we use the forget gate, Γf = σ(Wf [a⟨t−1⟩, x⟨t⟩] + bf) (shown as number 8 in the figure above);

Then we have a new output gate, Γo = σ(Wo [a⟨t−1⟩, x⟨t⟩] + bo) (shown as number 9 in the figure above);

So the update of the memory cell becomes c⟨t⟩ = Γu * c̃⟨t⟩ + Γf * c⟨t−1⟩ (shown as number 10 in the figure above);

This gives the memory cell the option of keeping the old value c⟨t−1⟩ and then adding a new value to it, so here we use separate update and forget gates.

So Γu represents the update gate (shown as number 5 in the figure above),

Γf the forget gate (shown as number 8 above), and Γo the output gate (shown as number 9 above).

The final equation is a⟨t⟩ = Γo * tanh(c⟨t⟩). These are the main LSTM equations, with three gates here instead of two (shown as number 11 in the figure above). That is a bit more complicated, and it also places the gates in slightly different positions than before.
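The LSTM equations above can be collected into a single forward-step function. As with the GRU sketch earlier, this is an illustrative NumPy version with toy dimensions and random untrained parameters; the name `lstm_cell` and the parameter packing are my own conventions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell(a_prev, c_prev, x, params):
    """One forward step of an LSTM cell with update, forget and output gates."""
    Wu, bu, Wf, bf, Wo, bo, Wc, bc = params
    concat = np.concatenate([a_prev, x])       # [a<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)        # update gate Γu
    gamma_f = sigmoid(Wf @ concat + bf)        # forget gate Γf
    gamma_o = sigmoid(Wo @ concat + bo)        # output gate Γo
    c_tilde = np.tanh(Wc @ concat + bc)        # candidate value c̃<t>
    c = gamma_u * c_tilde + gamma_f * c_prev   # keep old value and/or add new
    a = gamma_o * np.tanh(c)                   # hidden state a<t>
    return a, c

# toy dimensions: 5-dim hidden state, 3-dim input
n_a, n_x = 5, 3
rng = np.random.default_rng(1)
params = tuple(p for _ in range(4)
               for p in (rng.standard_normal((n_a, n_a + n_x)), np.zeros(n_a)))
a, c = lstm_cell(np.zeros(n_a), np.zeros(n_a), rng.standard_normal(n_x), params)
print(a.shape, c.shape)  # (5,) (5,)
```

Note the structural difference from the GRU: the cell carries two separate states, `a` and `c`, and both are threaded through time from one call to the next.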

Again, these equations are the ones that govern the LSTM 's behavior (shown as number 1 in the figure above). As before, let me also explain with a picture (shown as number 2 above). If the picture looks too complicated, don't worry; I personally find the equations easier to understand, and I draw the picture only because it is more intuitive. The picture in the upper right corner is inspired by a blog post by Chris Olah titled " Understanding LSTM Networks ", and the picture here is very similar to the one on his blog, though the key details differ. In this picture, a⟨t−1⟩ and x⟨t⟩ are used together to compute all the gate values (shown as numbers 3 and 4 above): they jointly compute the forget gate Γf, as well as the update gate Γu and the output gate Γo (shown as number 4 above). They also pass through a tanh to compute c̃⟨t⟩ (shown as number 5 above), and all of these values are then combined, through element-wise products and other operations, to obtain c⟨t⟩ (shown as number 7 above) from the previous c⟨t−1⟩ (shown as number 6 above).

One element here is very interesting. Take several of these units and connect them in temporal order (the sequence of pictures shown as number 8 above): x⟨1⟩ is input here (shown as number 9 above), then x⟨2⟩, x⟨3⟩, and so on, and the output of each unit at one time step becomes the input of the unit at the next time step; the same holds in the picture below, which I have simplified a little (compared with the picture shown as number 2 above). Then notice the interesting thing: there is a line running along the top (the line shown as number 10 above). This line shows that as long as you set the forget and update gates appropriately, it is relatively easy for the LSTM to take some value c⟨0⟩ (shown as number 11 above) and pass it unchanged all the way to the right, so that, for example, c⟨3⟩ = c⟨0⟩ (shown as number 12 above). This is why the LSTM , like the GRU , is very good at remembering a value for a long time, holding a value in the memory cell even across many, many time steps.

So that is the LSTM . You might notice this differs a bit from the version most commonly used: in that version, the gate values depend not only on a⟨t−1⟩ and x⟨t⟩ but can also take a peek at the previous memory cell value c⟨t−1⟩ (shown as number 13 in the figure above). This is called a peephole connection. Not a pretty name, but the idea is exactly that: the gate value depends not only on a⟨t−1⟩ and x⟨t⟩ but also on the previous memory cell value c⟨t−1⟩, and with peephole connections all three gates (Γu, Γf, Γo) can include c⟨t−1⟩ in their computation.

One technical detail of the peephole connection: if c⟨t−1⟩ is, say, a 100-dimensional vector (shown as number 13 in the figure above), and you have a 100-dimensional hidden memory cell, then the 50th element of c⟨t−1⟩ affects only the 50th element of the corresponding gate, so the relationship is one-to-one; it is not the case that any of the 100 dimensions can affect all elements of the gate. Rather, the first element of c⟨t−1⟩ affects only the first element of the gate, the second element the second, and so on. So when you read papers or hear people discussing peephole connections, this is what they mean: the previous memory cell value also affects the gate values.

LSTM forward propagation diagram:


LSTM back propagation calculation:

Gate partial derivatives (here da_next and dc_next denote the gradients flowing back into a⟨t⟩ and c⟨t⟩, and c_prev = c⟨t−1⟩, a_prev = a⟨t−1⟩):

dΓo⟨t⟩ = da_next * tanh(c_next) * Γo⟨t⟩ * (1 − Γo⟨t⟩)

dc̃⟨t⟩ = (dc_next * Γu⟨t⟩ + Γo⟨t⟩ * (1 − tanh(c_next)²) * Γu⟨t⟩ * da_next) * (1 − (c̃⟨t⟩)²)

dΓu⟨t⟩ = (dc_next * c̃⟨t⟩ + Γo⟨t⟩ * (1 − tanh(c_next)²) * c̃⟨t⟩ * da_next) * Γu⟨t⟩ * (1 − Γu⟨t⟩)

dΓf⟨t⟩ = (dc_next * c_prev + Γo⟨t⟩ * (1 − tanh(c_next)²) * c_prev * da_next) * Γf⟨t⟩ * (1 − Γf⟨t⟩)

Partial derivatives of the parameters:

dWf = dΓf⟨t⟩ · [a_prev, x⟨t⟩]ᵀ, and similarly dWu, dWc, dWo using dΓu⟨t⟩, dc̃⟨t⟩, dΓo⟨t⟩.

To compute db_f, db_u, db_c, db_o, sum the corresponding gate derivative over the batch dimension (axis = 1).

Finally, compute the partial derivatives with respect to the previous hidden state, the previous memory state, and the input:

da_prev = Wfᵀ · dΓf⟨t⟩ + Wuᵀ · dΓu⟨t⟩ + Wcᵀ · dc̃⟨t⟩ + Woᵀ · dΓo⟨t⟩

dc_prev = dc_next * Γf⟨t⟩ + Γo⟨t⟩ * (1 − tanh(c_next)²) * Γf⟨t⟩ * da_next

dx⟨t⟩ is given by the same four terms as da_prev, using the slices of the weight matrices that multiply x⟨t⟩.


So that is the LSTM . When should you use a GRU and when an LSTM ? There is no universal guideline. And even though I explained the GRU first, the LSTM actually came earlier in the history of deep learning; the GRU was invented more recently, as a simplification of the more complex LSTM model. Researchers have tried both on many different problems, and on different problems and with different algorithms one or the other comes out ahead, so there is no academically agreed answer; I just wanted to show you both models.

The advantage of the GRU is that it is a simpler model, so it is easier to build into a larger network; with only two gates it also runs faster computationally, which makes it easier to scale the model up.

But the LSTM is more powerful and more flexible, since it has three gates instead of two. If you have to choose one, the LSTM has been the historically more proven choice, so most people today would still try the LSTM as the default. That said, the GRU has gained a lot of support in recent years, and I feel more and more teams are using it, because it is simpler, works well, and adapts more easily to larger problems.
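The "simpler and cheaper" claim about the GRU can be made concrete by counting parameters. The sketch below assumes the formulation used in this article, where every gate and the candidate are computed from the concatenation [h, x] with one weight matrix and one bias each; peephole connections and other variants would change the counts.

```python
def gate_params(n_h, n_x):
    """Parameters of one gate/candidate computed from [h, x]: W plus bias."""
    return n_h * (n_h + n_x) + n_h

def gru_params(n_h, n_x):
    return 3 * gate_params(n_h, n_x)   # Γu, Γr, and the candidate c̃

def lstm_params(n_h, n_x):
    return 4 * gate_params(n_h, n_x)   # Γu, Γf, Γo, and the candidate c̃

# example: 100-dim hidden state, 50-dim input embedding
print(gru_params(100, 50))   # 45300
print(lstm_params(100, 50))  # 60400
```

With these assumptions the LSTM costs 4/3 the parameters (and roughly the compute) of the GRU per cell, which is one reason the GRU scales to larger models more cheaply.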

So this is LSTM , whether it is GRU or LSTM , you can use them to build neural networks that capture deeper connections.

( Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8):1735-1780. )

1.11 Bidirectional recurrent neural network ( Bidirectional RNN )

Now that you have seen most of the key building blocks of the RNN model, there are two more ideas that let you build much better models. One is the bidirectional RNN , which at a given time step lets you use not only information from earlier in the sequence but also information from later; we explain it in this video. The other is the deep RNN , which we will see in the next video. Let's start with the bidirectional RNN .

To motivate bidirectional RNNs , look again at the network we have seen several times for named entity recognition. There is a problem with this network: when judging whether the third word Teddy (shown as number 1 in the figure above) is part of a person's name, it is not enough to look only at the first part of the sentence. To decide whether ŷ⟨3⟩ (shown as number 2 in the figure above) is 0 or 1, you need more information than just the first three words, because from those three words alone you cannot tell whether the sentence is about a Teddy bear or about former U.S. President Teddy Roosevelt. So this is a unidirectional, forward-only RNN , and what I just said holds no matter whether these units (shown as number 3 in the figure above) are standard RNN blocks, GRU units, or LSTM units, as long as the connections are forward-only.

So how does a bidirectional RNN fix this? Here is how it works. For simplicity, take a sentence with only 4 words, so there are only 4 inputs, x⟨1⟩ to x⟨4⟩. The network has a layer of forward recurrent units, a→⟨1⟩, a→⟨2⟩, a→⟨3⟩, a→⟨4⟩; I put a right arrow over each to indicate a forward unit, and they are connected as shown (number 1 in the figure below). Each of these four recurrent units receives the current input x⟨t⟩ and contributes to the predictions ŷ⟨1⟩, ŷ⟨2⟩, ŷ⟨3⟩, ŷ⟨4⟩.

So far nothing new: I have just redrawn the RNN from the previous slide with explicit arrows. The reason for the arrows is that we now add a backward recurrent layer: a←⟨1⟩, a←⟨2⟩, a←⟨3⟩, a←⟨4⟩, where the left arrow denotes a backward connection.

When we connect the network upward in this way, the backward units are connected to each other going backward in time (shown as number 2 in the figure above), and together this forms an acyclic graph. Given an input sequence x⟨1⟩ to x⟨4⟩, the forward sequence first computes a→⟨1⟩, then a→⟨2⟩, then a→⟨3⟩ and a→⟨4⟩, while the backward sequence starts by computing a←⟨4⟩ and proceeds in reverse. Note that we are computing network activations here; this is all forward propagation, not backpropagation, but part of the computation goes left to right and part goes right to left. After computing a←⟨4⟩ you can use it to compute a←⟨3⟩, then a←⟨2⟩ and a←⟨1⟩, and once all the activations are computed, the predictions can be computed.

For example, to make a prediction the network computes ŷ⟨t⟩ = g(Wy [a→⟨t⟩, a←⟨t⟩] + by) (shown as number 1 in the figure above). Say you want the prediction at time step 3: information from x⟨1⟩ flows through a→⟨1⟩ to a→⟨2⟩, on to a→⟨3⟩, and into ŷ⟨3⟩ (the path shown as number 2 above), so information from x⟨1⟩, x⟨2⟩, x⟨3⟩ is all taken into account, while information from x⟨4⟩ flows backward through a←⟨4⟩ to a←⟨3⟩ and into ŷ⟨3⟩ (the path shown as number 3 above). This makes the prediction at time 3 depend on past, present, and future inputs. So given the sentence " He said Teddy Roosevelt... ", to predict whether Teddy is part of a person's name, you take both past and future information into account.
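The combination step at a single time step can be sketched directly: concatenate the forward and backward activations and apply the output layer. This toy NumPy snippet uses random, untrained activations and weights; only the formula ŷ⟨t⟩ = g(Wy [a→⟨t⟩, a←⟨t⟩] + by), here with a softmax for g, is the point.

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_y = 4, 2                      # toy sizes: 4-dim activations, 2 classes
a_fwd = rng.standard_normal(n_a)     # a-><t>: forward activation at time t
a_bwd = rng.standard_normal(n_a)     # a<-<t>: backward activation at time t
Wy = rng.standard_normal((n_y, 2 * n_a))
by = np.zeros(n_y)

# prediction uses both directions: y<t> = softmax(Wy [a_fwd; a_bwd] + by)
z = Wy @ np.concatenate([a_fwd, a_bwd]) + by
y_hat = np.exp(z) / np.exp(z).sum()
print(y_hat.shape)  # (2,)
```

Because `a_bwd` summarizes everything to the right of time t, the prediction at t can react to words like "Roosevelt" that appear later in the sentence.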

This is the bidirectional recurrent neural network, and the basic units can be not only standard RNN units but also GRU or LSTM units. In fact, for many NLP problems, across a large range of natural language processing tasks with text input, the bidirectional RNN with LSTM units is the most commonly used. So if you have an NLP problem where the complete sentences are available and you need to label their words, a bidirectional RNN with LSTM units for both the forward and backward passes is a good first choice.

That is the bidirectional RNN . This improvement applies not only to the basic RNN structure but also to the GRU and LSTM . With these changes you can build a model out of RNN , GRU , or LSTM units that can make predictions at any position, even in the middle of a sentence, because the model can take the information of the entire sentence into account. The disadvantage of the bidirectional RNN is that you need the complete sequence of data before you can predict anywhere. For example, to build a speech recognition system, a bidirectional RNN would require the entire utterance: used naively, you would have to wait for the person to stop speaking, obtain the whole utterance, and only then process it and run recognition. Real speech recognition applications therefore usually use more complex modules rather than just the standard bidirectional RNN we have seen. But for many natural language processing applications, where the whole sentence is available up front, this standard bidirectional RNN algorithm is actually very effective.

Okay, that is the bidirectional RNN . In the next video, the last of this week, we discuss how to take these building blocks, the standard RNN , the LSTM and GRU units, and the bidirectional versions, and build deeper networks.

1.12 Deep recurrent neural networks ( Deep RNNs )

Each of the RNN variants you have learned so far can already work well on its own. But to learn very complex functions, it is sometimes useful to stack multiple layers of RNNs to build a deeper model. In this video we will see how to build these deeper RNNs .

Recall a standard neural network: an input x, then a stacked hidden layer with activations, say a[1] for the first layer, then the next layer with activations a[2], perhaps another layer a[3], and finally the prediction ŷ. The deep RNN is somewhat similar: take this stack (shown as number 1 in the figure below) and unroll it in time. Let's take a look.

This is the standard RNN we have seen (the part inside the box shown as number 3 in the figure above), but I have changed the notation a bit: instead of writing a⟨0⟩ for the activation at time 0, I add a superscript [1] to denote the first layer (shown as number 4 in the figure above). So we now use a[l]⟨t⟩ to denote the activation of layer l at time step t: a[1]⟨1⟩ is the activation of the first layer at the first time step, a[1]⟨2⟩ the first layer at the second time step, and so on. Then we stack these layers on top of each other (the part shown in the box numbered 4 in the figure above), and this gives a new network with three hidden layers.

Let's take a concrete example and see how one value, say a[2]⟨3⟩ (shown as number 5 in the figure above), is computed. This activation has two inputs: one coming from below, a[1]⟨3⟩ (shown as number 6 in the figure above), and one coming from the left, a[2]⟨2⟩ (shown as number 7 in the figure above), so a[2]⟨3⟩ = g(Wa[2] [a[2]⟨2⟩, a[1]⟨3⟩] + ba[2]). The parameters Wa[2] and ba[2] are shared across every computation in layer 2, while layer 1 has its own parameters Wa[1] and ba[1].
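The single activation computed above can be sketched as one function: it takes the same-layer activation from the previous time step (the arrow from the left) and the lower-layer activation at the current time step (the arrow from below). The NumPy snippet below is a toy illustration with random untrained weights; the function name `deep_rnn_step` and the tanh nonlinearity for g are my own choices.

```python
import numpy as np

def deep_rnn_step(a_left, a_below, W, b):
    """a[l]<t> = g(W[l] [a[l]<t-1>, a[l-1]<t>] + b[l]), with g = tanh."""
    return np.tanh(W @ np.concatenate([a_left, a_below]) + b)

n_h = 4                                   # toy hidden size, same for both layers
rng = np.random.default_rng(3)
W2, b2 = rng.standard_normal((n_h, 2 * n_h)), np.zeros(n_h)  # layer-2 params
a2_t2 = rng.standard_normal(n_h)          # a[2]<2>, input from the left
a1_t3 = rng.standard_normal(n_h)          # a[1]<3>, input from below
a2_t3 = deep_rnn_step(a2_t2, a1_t3, W2, b2)
print(a2_t3.shape)  # (4,)
```

Note that `W2` and `b2` would be reused for every time step of layer 2, which is exactly the parameter sharing described in the paragraph above.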

For a standard neural network like the one on the left, you may have seen very deep networks, even 100 layers deep. For an RNN , three layers is already quite a lot: because of the temporal dimension, the network gets quite large even with only a few layers, so you rarely see recurrent networks stacked 100 layers deep. One variant you do see, though, is to keep the stacked recurrent layers, remove the outputs here (shown as number 1 in the figure above), and replace them with a deep feed-forward network; these layers have no horizontal connections, they are just a deep network that then predicts ŷ⟨t⟩. The same deep network is also added here (shown as number 2 in the figure above) to predict the next output. This kind of structure is used somewhat more often: three recurrent layers connected in time, each position followed by its own deep network with no horizontal connections. As usual, these units (shown as number 3 in the figure above) need not be the simplest standard RNN cells; they can also be GRU or LSTM units, and you can build deep bidirectional RNNs as well. Since deep RNNs are computationally expensive to train and take a long time, you don't usually see many recurrent layers; even what looks like only three deep recurrent layers is already a lot once they are unrolled in time, unlike convolutional neural networks, which can have a large number of hidden layers.

That is the deep RNN : from the basic RNN and its recurrent unit, to the GRU and LSTM , to the bidirectional RNN , and finally the deep versions of these models. After this lesson you can already build very good sequence learning models.

The source of this article is the deep learning course [1].

Author of the notes: Huang Haiguang[2]

Main writers: Huang Haiguang, Lin Xingmu, Zhu Yansen, He Zhiyao, Wang Xiang, Hu Hanwen, Yu Xiao, Zheng Hao, Li Huaisong, Zhu Yuepeng, Chen Weihe, Cao Yue, Lu Haoxiang, Qiu Muchen, Tang Tianze, Zhang Hao, Chen Zhihao, You Ren, Ze Lin, Shen Weichen, Jia Hongshun, Shi Chao, Chen Zhe, Zhao Yifan, Hu Xiaoyang, Duan Xi, Yu Chong, Zhang Xinqian

Participating editors: Huang Haiguang, Chen Kangkai, Shi Qinglu, Zhong Boyan, Xiang Wei, Yan Fenglong, Liu Cheng, He Zhiyao, Duan Xi, Chen Yao, Lin Jiayong, Wang Xiang, Xie Shichen, Jiang Peng

Remarks: The notes, assignments (including data, original assignment files), and videos of this article are all downloaded in github[3].


[1] Deep learning course:

[2] Huang Haiguang :

[3]github:

