A 10,000-word long read: an in-depth text-mining analysis of 540,000 classical Chinese poems

Prologue

Many years later, facing two surviving lines of verse on his desk, Su Dongpo, by then demoted to Huangzhou, would recall a distant afternoon when he met an old nun surnamed Zhu in his hometown of Meishan. At the time, Dongpo was not yet "Dongpo", just a seven-year-old child. One day, near his home, he ran into the old nun, who was about ninety. Seeing how bright the young Su Shi was, she told him of her youth, when she had once followed her master into the palace of Meng Chang, ruler of the Later Shu. One sweltering night, Meng Chang and his consort Huarui were cooling off beside the Maha Pool, and, moved by the scene, the Lord of Shu improvised a poem... The old nun recited to Su Shi the only two opening lines she could still remember.

Forty years later, demoted to Huangzhou and thinking back on this episode, Su Shi regretted that only two lines of Meng Chang's poem survived, and suddenly resolved to complete them. He first guessed that the tune was "Dongxian Ge Ling"; but to restore the whole poem he had to immerse himself in the writer's state of mind and the artistic conception of the moment. Starting from the two surviving lines and the old nun's description, Su Shi did his best to reconstruct in his mind the Shu lord's creative scene and mood, and finally continued the poem into the masterpiece "Dongxian Ge":

Flesh of ice, bones of jade, naturally cool and free of sweat. A breeze through the water pavilion fills it with hidden fragrance. The embroidered curtain parts: a sliver of bright moon peeks in at one not yet asleep, leaning on a pillow, hairpin askew, hair in disarray.

Rising, they stroll hand in hand through the silent courtyard, now and then watching sparse stars cross the River of Heaven. How goes the night? The night is already late; the moon's golden waves grow faint, the Jade Rope stars wheel low. One counts on one's fingers when the west wind will come, never noticing the flowing years stealing away in the dark.

The above is the famous "Dongpo continuation" of literary history. A charming anecdote, certainly, but in it the author also glimpses the shadow of mathematical thinking:

The creation of a poem is like solving an optimization problem:

Under a set of constraints the poem must satisfy, such as tonal patterns (level and oblique tones), rhyme, parallelism, the five- or seven-character variants, the prescribed word counts of a tune, and the context, the poet uses words to express genuine inner feeling. "Dancing in shackles", the poet strives for the ultimate realm of beauty: beauty of sound, of concision, of diction, of subtlety, of emotion, of imagery, and of form...

Here the delicacy of poetry and the rigor of mathematics combine perfectly.

Since the creation of poetry follows rules, we should be able to extract insights from it with suitable data-mining methods.

Following this idea, in this article the author applies several text-mining methods to dig deeply into the poetry corpus at hand (the original corpus is available at https://github.com/Werneror/Poetry). Its basic statistics are as follows:

As the table above shows, the corpus contains nearly 850,000 poems by 29,377 authors, with the fields "title", "dynasty", "author", and "content (the poem itself)".

To simplify the subsequent analysis, the author keeps only regulated verse (lushi) and quatrains (jueju) in five- or seven-character lines; poems written to tunes (such as "Chun Jiang Hua Yue Ye" or "Chang Hen Ge") and verse with irregular line lengths (such as Li Bai's "Jiang Jin Jiu") are outside the scope of this article.

After data cleaning, 504,443 poems remain, 59.1% of the original corpus. The statistics and some examples of the cleaned data are shown below:

With this data in hand, the author has two main goals in this article:

  • Construct a poetry corpus with popular theme tags, to support subsequent theme-classification and poem-generation tasks;
  • On the basis of that corpus, carry out various kinds of text mining and semantic analysis in search of interesting findings.

The roadmap of this article, which is also its outline, is shown below:

It is worth noting that this pipeline involves the two major halves of natural language processing: natural language understanding (word segmentation, semantic modeling, semantic similarity, clustering, and classification) and natural language generation (poem generation and poem translation). By the end, you should have a reasonable feel for both. There is a lot of material, so please enjoy it patiently~

1 Poetry word segmentation and hot word discovery

Given a poem and a fragment randomly selected from it, how do we judge whether that fragment is a meaningful word?

If the characters appearing to the left and right of the fragment are varied and rich, while the internal composition of the fragment itself is highly fixed, we can treat the fragment as a word. The example shown in the figure below meets this definition, so it counts as a word.

In the actual algorithm, the measure of how rich the fragment's external neighbors are is called its "degree of freedom", quantified by the (left and right) information entropy; the measure of how fixed its internal composition is is called its "degree of solidification" (cohesion), quantified by the mutual information of its sub-sequences.
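As a minimal, self-contained sketch of these two measures (the article itself relies on a ready-made tool for this, so the implementation below is illustrative only):

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy of a neighbor-frequency distribution (degree of freedom)."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_segment(text, seg):
    """Return (left entropy, right entropy, cohesion) for a candidate segment.

    Cohesion is the minimum pointwise mutual information over all binary
    splits of the segment: high when its parts co-occur far beyond chance.
    """
    left, right = Counter(), Counter()
    for i in range(len(text) - len(seg) + 1):
        if text[i:i + len(seg)] == seg:
            if i > 0:
                left[text[i - 1]] += 1          # character just before the segment
            if i + len(seg) < len(text):
                right[text[i + len(seg)]] += 1  # character just after the segment
    n = len(text)
    p_seg = text.count(seg) / n
    pmi = min(
        math.log(p_seg / ((text.count(seg[:k]) / n) * (text.count(seg[k:]) / n)))
        for k in range(1, len(seg))
    )
    return entropy(left), entropy(right), pmi
```

A segment with high left/right entropy (varied neighbors) and high cohesion is promoted to the vocabulary.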

Here the author uses Jiayan to segment the 540,000-plus poems automatically, sorts the resulting vocabulary by frequency from high to low, and finally extracts a number of meaningful high-frequency words from the corpus, ranging in length from one to four characters.

The extraction results are as follows:

Inspecting some of the results, the author finds that only the one- and two-character items are words in the usual sense, such as "no", "shuo", "suihan", and "stay"; the three- and four-character items are generally combinations of words of several parts of speech and, strictly speaking, should count as phrases or fixed expressions, such as "Follow the Flowing Water", "Deep in the Clouds", "Everything in the World", and "Jianghu Wanli". For convenience of presentation, this article refers to them all as words.

Below, the author shows word clouds of the TOP 100 high-frequency words for each word length from one to four characters.

Among the one-character high-frequency words, setting aside function words such as "no", "not", and "you", consider the eleven content words "man, mountain, wind, sun, sky, cloud, spring, flower, year, moon, water". They fit the Chinese philosophical tradition of the unity of heaven and man: a poem is like a painting in which the poet places human beings, with all their emotions and desires, within nature and within the time and space of the world; feeling gives rise to both poem and picture.

"Poetry and painting follow one rule": the ancients truly did not deceive us!

Among the two-character high-frequency words, the most conspicuous are "Wanli" (ten thousand li) and "Qianli" (a thousand li), which convey a vast sense of space; in poetry they are often bound to themes such as grand ambition, banishment, homesickness, and boudoir longing.

In addition, words such as "Mingyue" (bright moon), "old friend", "Baiyun" (white clouds), "fame", "the human world", "a lifetime", and "meeting" are perennial favorites across the ages.

Among the three-character high-frequency words, numbers are very common: "two or three sons", "twenty-four", "one jug of wine", "two thousand dan", and so on. Most noteworthy is how the poets use numerals to depict time and space. For spans of time: "twenty years", "forty years", "five hundred years", "ten years ago", "a thousand years hence". For spatial distance: "a thousand li away", "three hundred li", "a hundred-foot tower"... The ancients liked to set themselves within a vast, hazy space-time in order to contemplate their own hurried lives, just as Dongpo sighed in the "Ode to the Red Cliff": "We lodge like mayflies between heaven and earth, a single grain in the boundless sea. I mourn the brevity of my life and envy the endlessness of the Yangtze!"

Among the four-character high-frequency words there is more vocabulary of spatial orientation, such as "south, north, east, west", "Jiangnan, Jiangbei", and "east, west, south, north". Because four-character words are longer, expressions like "everything in the world", "a thousand crags and ten thousand ravines", "bright moon and clear breeze", "deep in the clouds", and "meeting with a smile" carry comparatively more information and can restore part of a poem's mood.

2 Training a word embedding model that captures semantic relations among poetry vocabulary

A word embedding model automatically learns associations between words from large volumes of poetry text, enabling tasks such as word-association analysis, word-similarity analysis, and cluster analysis.

However, a computer program cannot directly process text as strings, so the first step is to segment the poetry text and then "translate" it into a format the computer can process; this operation is called text vectorization.

First, segmentation. It builds on the high-frequency-word mining above and is the starting point for all subsequent analysis.

Combining the vocabulary accumulated earlier, the author segments the 540,000-plus poems using a directed acyclic graph of candidate words, the maximum-probability path through each sentence, and dynamic programming. An example:

Before segmentation:

"All things grow in their myriads, and are of one breath with me. Dimly they respond to feeling; their forms differ only by accident. How lofty the autumn moon, how fine the floating dust. Forgetting things and forgetting self, one roams at ease wherever one will."

After segmentation:

['All things', 'grow', 'in their myriads', ',', 'with', 'me', 'originally', 'one', 'breath', '.', 'Dimly', 'respond', 'to', 'feeling', ',', 'form', 'accidentally', 'differs', '.', 'Autumn moon', 'how', 'so', 'lofty', 'fine', '.', 'Forget', 'things', 'and', 'forget self', ',', 'roam', 'at ease', 'as', 'one wills', '.']
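To make the "DAG + maximum-probability path + dynamic programming" recipe concrete, here is a minimal sketch; the lexicon and its frequencies are made-up illustrative values, not the article's data:

```python
import math

# toy lexicon with hypothetical frequencies
FREQ = {"万物": 50, "万": 200, "物": 180, "生": 300, "芸芸": 20, "芸": 5}
TOTAL = sum(FREQ.values())

def segment(sentence):
    """Maximum-probability segmentation over a DAG of in-lexicon words."""
    n = len(sentence)
    # dag[i] = end positions j such that sentence[i:j] is a known word
    # (fall back to a single character so every position is reachable)
    dag = {i: [j for j in range(i + 1, n + 1)
               if sentence[i:j] in FREQ] or [i + 1]
           for i in range(n)}
    # dynamic programming over log-probabilities, right to left
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in dag[i]
        )
    # follow the best path to read off the words
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words
```

With the toy frequencies above, `segment("万物生芸芸")` keeps "万物" and "芸芸" whole because their paths have higher total probability than character-by-character splits.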

After segmentation and some light preprocessing, the text can be "fed" to a word embedding model (here, Word2vec) for training.

A Word2vec embedding model can "learn" word vectors from large amounts of unlabeled text, and those vectors encode the relations between words (which may be semantic or syntactic); as in the real world, birds of a feather flock together. A word can be defined by the words around it (its context), and the Word2vec embedding model learns exactly this correlation between a word and its context.
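The core of skip-gram Word2vec is to slide a window over each segmented poem and emit (center word, context word) training pairs; the model then learns vectors that make a word predictive of its context. A minimal sketch of the pair extraction (in practice one would train with a library such as gensim, e.g. `Word2Vec(sentences, sg=1)`; the article does not name its tooling, so that is an assumption):

```python
def skipgram_pairs(tokens, window=2):
    """Emit (center, context) pairs as consumed by Word2vec's skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # every other token within `window` positions is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# each segmented poem becomes a stream of such pairs
print(skipgram_pairs(["明月", "出", "天山"], window=1))
```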

The basic principle is illustrated below:

After training the model, projecting its learned vectors into three-dimensional space looks like this:

During training, Word2vec learns two kinds of relations between words from the poetry text: paradigmatic (aggregation) relations and syntagmatic (combination) relations.

Paradigmatic relation: words A and B are paradigmatically related if they can substitute for each other; that is, within the same semantic or syntactic category, one can replace the other without harming the comprehension of the sentence. For example, the two onomatopoeic words both romanized "xiaoxiao", each mostly used for the sound of rain, are paradigmatically related: in "Below the mountain, orchid shoots soak short in the stream; the sandy path between the pines is clean of mud; xiaoxiao the evening rain, the cuckoo cries", one "xiaoxiao" can be swapped for the other.

Syntagmatic relation: words A and B are syntagmatically related if they can combine with each other syntactically. For example, in "Rain beats the pear blossoms, the door shut fast; I forgot my youth, I wasted my youth", "forget" and "waste" each combine with "youth" in the verb-object construction "verb + youth", so each is syntagmatically related to it.

Now let us look for words semantically related to "bing" (soldier/war):

The results are mostly vocabulary related to war and its wounds, showing a strong ability to capture semantic relations. This property of the embedding model will be used again in the later mining of popular poetry themes.

3 Measuring semantic relations between poetry words

3.1 Using cosine similarity to measure the relevance of poetry vocabulary

To measure similarity or relatedness between two words, we generally use the cosine of the angle between their word vectors: the smaller the angle, the larger the cosine, and the closer it is to 1, the more semantically related the words; conversely, lower values mean lower relatedness. The figure below visualizes the cosine similarities among "Jiabing" (armor and weapons), "Bingge" (weapons of war), and "Fenghuo" (beacon fires):

Under the embedding model above, similarity("Jiabing", "Bingge") = 0.75, similarity("Jiabing", "Fenghuo") = 0.37, and similarity("Bingge", "Fenghuo") = 0.48. Of the three pairs, "Jiabing" and "Bingge" are the most semantically related, followed by "Bingge" and "Fenghuo", and then "Jiabing" and "Fenghuo".
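The computation itself is elementary; a sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity of two word vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

With a trained gensim model the same number would come from `model.wv.similarity(w1, w2)` (assuming gensim is the tooling, which the article does not state).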

Summarizing relatedness in a single number has the advantages of simplicity and efficient computation, which the next step, discovering and clustering popular poetry themes, will rely on. However, such a score does not directly show the "path" by which two words come to be related.

So, is there an intuitive way to display the semantic relation between two words and see why they are related, that is, to find a relatedness path, or path of semantic evolution, between them?

Of course there is.

We can cast the task of finding this path as a shortest-path search problem. (The author frames it by analogy with the traveling salesman problem, though strictly speaking finding the shortest route between two nodes is a simpler problem than the TSP.)

3.2 Using the A* algorithm to find paths of semantic evolution between words

The Traveling Salesman Problem (TSP) is one of the famous problems of mathematics: a traveling merchant must visit n cities, visiting each exactly once and returning to the starting city, choosing the route whose total distance is the smallest of all possible routes.

Returning to our measurement of word relatedness: if, in the embedding space trained above, we can find the shortest "semantic evolution" route between two words, we can intuitively display the "cause and effect" behind their semantic association.

There is an excellent algorithm for this: the A* search algorithm.

The A* (A-Star) algorithm is the best-known direct search method for shortest paths in a static network, and an effective algorithm for many other search problems. The closer the algorithm's estimated remaining distance (its heuristic) is to the true value, the faster the search. In the figure below, the network is the word2vec embedding space built earlier: the nodes are the words distributed in it, and the edges are weighted by the cosine relatedness between words.
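A minimal A* sketch over such a word graph. The edge cost is taken here as 1 - cosine similarity (an assumption; the article does not give its exact weighting), and the zero heuristic in the usage example makes the search degrade gracefully to Dijkstra; any estimate that never overshoots the true remaining cost is admissible and will speed it up:

```python
import heapq
import math

def a_star(graph, heuristic, start, goal):
    """Shortest path in graph[u] = {v: cost}; heuristic(u, goal) must not overestimate."""
    # frontier entries: (g + heuristic, g so far, node, path taken)
    frontier = [(heuristic(start, goal), 0.0, start, [start])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue  # already expanded more cheaply
        best_g[node] = g
        for nxt, cost in graph[node].items():
            heapq.heappush(frontier,
                           (g + cost + heuristic(nxt, goal), g + cost, nxt, path + [nxt]))
    return None, math.inf

# toy word graph: going via the intermediate word is semantically "cheaper"
graph = {"A": {"B": 0.2, "C": 0.9},
         "B": {"A": 0.2, "C": 0.3},
         "C": {"A": 0.9, "B": 0.3}}
path, cost = a_star(graph, lambda u, goal: 0.0, "A", "C")
```

The returned `path` is exactly the "semantic evolution path" between the two endpoint words.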

Combining the embedding model with the A* algorithm, the author computes the shortest semantic path between pairs of words. Part of the result is shown below:

Among the five word pairs in the figure, "Yuqiao" (fisherman and woodcutter) and "Gonggeng" (tilling one's own fields) have the shortest semantic distance, i.e. the highest semantic relatedness, and also the shortest evolution path, with only 2 words between them. "Yanshi" and "Baowu" have the largest semantic distance and the lowest relatedness, with 12 words along the path between them.

Evidently, the weaker the semantic relatedness (the larger the distance), the longer the path of semantic evolution between two words, and vice versa: semantic distance is positively correlated with path length, and semantic relatedness is negatively correlated with it.

With the embedding model and the relatedness measure as paving stones, the discovery of popular poetry themes that follows becomes a matter of course~

4 Discovering popular poetry themes with text clustering

First, the author's working definition of "theme" ("subject matter") in this article:

Certain aspects of social life taken as material for poetic creation; more specifically, the material the poet uses to express the work's theme, usually the life events or phenomena that enter the work after concentration, selection, and refinement. In a word, describing scenery, chanting objects, expressing feeling, recording events, and reasoning are all "subject matter".

Because we do not know in advance how many themes exist in these 540,000-plus poems, the clustering algorithm chosen must not require a preset number of clusters; it should also be efficient, economical with computing resources, and able to reuse the trained word2vec embeddings and the semantic-relatedness computation.

There is a good choice that satisfies all of this: Infomap, from the family of community-detection algorithms.

4.1 Discovery of Popular Poetry Themes Based on Community Discovery

Words are the smallest semantic units carrying a poem's subject matter. For example, in "Five clouds fly above Wuyun Mountain, joined to ranked peaks beside the embankment; if you ask what is fine in Hangzhou, here you may listen to the wild orioles", the words "Yunshan" (cloudy mountains) and "Qunfeng" (massed peaks) justify giving the poem a "mountains and rivers" theme tag. The author therefore discovers popular themes with a community-detection algorithm, following the pipeline "word clusters -> semantic features of the clusters -> theme tags".

First, the general idea behind community detection.

In a social network, each user is a node, and by following one another the users form a web of interpersonal relationships.

In such a network, some users are densely connected while others are only sparsely connected. A densely connected region can be regarded as a community: its internal nodes are tightly linked, while the links between two communities are relatively sparse.

The problem of carving the network into such communities is called community detection.

Theme clustering based on community detection amounts to mining the big "circles" at the head of the semantic network of words.

Personifying the words, the similarity between two words can be read as their degree of intimacy. Discovering poetry themes then becomes finding the "circles" of members, and a theme name can be drawn from the shared connotation of the words gathered in each circle. For example, a word cluster containing "Fenghuo" (beacon fire), "Jiabing" (armor and weapons), and "Zhengzhan" (campaigns) can be named "war". A schematic is shown below:
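Infomap itself is available off the shelf (e.g. the `infomap` package). As a simplified stand-in that illustrates the "circle-finding" idea, here is a toy label-propagation pass over a word graph; the words and edges below are invented for illustration:

```python
import random
from collections import Counter

def label_propagation(graph, n_iter=20, seed=42):
    """Toy community detection: each word repeatedly adopts the label that is
    most common among its neighbors (ties broken by count, then by label).
    graph[u] is a list of the neighbors of u. This is a didactic substitute
    for Infomap, not the article's algorithm."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # start: every word is its own community
    nodes = list(graph)
    for _ in range(n_iter):
        rng.shuffle(nodes)
        for node in nodes:
            if graph[node]:
                counts = Counter(labels[nb] for nb in graph[node])
                labels[node] = max(counts, key=lambda l: (counts[l], l))
    return labels

# two tightly knit word "circles" with no edges between them
graph = {
    "fenghuo": ["jiabing", "zhengzhan"],
    "jiabing": ["fenghuo", "zhengzhan"],
    "zhengzhan": ["fenghuo", "jiabing"],
    "mingyue": ["qingfeng", "baiyun"],
    "qingfeng": ["mingyue", "baiyun"],
    "baiyun": ["mingyue", "qingfeng"],
}
communities = label_propagation(graph)
```

Each dense cluster collapses onto a single shared label, and the cluster's member words then suggest its theme name, e.g. "war" for the first circle.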

After running the community-detection algorithm, the clusters of popular theme words at the head of the distribution are visualized below:

Different colors mark different themes, font size encodes frequency of appearance, and the distance between words encodes their relatedness.

Clustering yields 634 themes. The final results, in descending order of popularity (the number of words under each theme), are shown below:

4.2 Identifying popular poetry themes

In this step, the author draws on domain knowledge of poetry to pick out, from the results above, the hot themes and the theme-specific vocabulary under each. "Theme-specific vocabulary" has two defining properties:

  • The word cannot be cut further without changing its meaning. For example, "zhangfu" in classical Chinese is an independent word meaning "true man"; cut into "zhang" and "fu", the original meaning is lost entirely;
  • The word appears in only one theme, exclusively. For example, "Zhangli" (a goosefoot walking staff) appears only under the theme of "wandering the four quarters", never under "golden spears and iron horses", "singing over wine", "mourning the dead", or other themes.

Per the earlier definition, descriptions of scenery, chanted objects, lyric feeling, recorded events, and reasoning all count as "subject matter"; in selecting popular themes here the author follows the principle of "grasp the large, let go of the small".

In addition, although the clustering results are good, some noise remains: a few words only loosely tied to their theme, words with low discriminative power across themes, and clusters with too few words (say, fewer than 10). All of these need to be weeded out.

After careful screening, the author identified 23 popular poetry themes, including: towering mountains and rivers; tilling the countryside; travel and homesickness; golden spears and iron horses; chanting history and nostalgia; chanting objects to express aspiration; farewell to friends; love and its grievances; mourning the dead; towered boats and painted barges; flowers in bloom; banter and teasing; singing over wine; horses; cultivating immortality; the changes of worldly affairs; quiet insight into Zen; heroic and impassioned feeling; wandering the four quarters; dejection; dazzling stars; repaying the sovereign's grace; bird-song imagery; and so on. These are of course not all the themes; limited by the author's knowledge, a large number remain undiscovered. The enumerated results are shown below:

In this step, the author selected some popular themes based on background knowledge of poetry and formed a keyword rule system for each theme, which can be used for keyword-based classification of the 540,000 poems.

It is worth noting that the keyword selection in this step was rather strict, leaving the rules few in number and the rule system imperfect. Before formally classifying the corpus by theme, therefore, the author needs a few "small tricks" to expand the keyword rules of the popular themes.

5 Expanding the keywords with linear-classifier features

Here the author first uses the popular-theme classification system and its keyword rules to attach theme tags to the 540,000 poems, allowing one poem to hit multiple tags. After dropping rows with no theme tag, 443,589 rows remain, and most poems carry 2 or more theme tags.

Some of the results are shown below:

With labeled data in hand, the author converts the multi-label problem into single-label problems, "feeds" the poem texts and their tags to a linear classifier, and uses the classifier's weights to find the most representative words under each category, i.e. the theme-specific vocabulary. The reason for choosing a linear classifier over a fashionable deep-learning classifier is interpretability: it lets us see clearly which salient features (here, words) push a poem into a given theme category. The general principle is illustrated below:

Among the classifiers the author tested (RandomForestClassifier, Perceptron, PassiveAggressiveClassifier, MultinomialNB, RidgeClassifier, SGDClassifier), RidgeClassifier discriminated best, with an F1 score of 0.519. Given that this is a bag-of-words model with a fairly crude semantic representation, applied to what was originally a multi-label task, the result is acceptable. Sorting RidgeClassifier's feature weights in descending order yields a batch of theme-specific words for each of the 23 popular theme categories. Some of the results are shown below:

Taking the TOP 500 words per category, then screening and combing them, the author expands the keyword rules of every theme to varying degrees, so that the classification tag system can better support the multi-label theme-classification task and can keep being extended with future classification results.

Running this more complete theme-classification system yields 580,000+ rows of data, 140,000+ more than before: a substantial increase in scale!

At this point the author's first goal, building a corpus tagged with popular poetry themes, is complete, and subsequent text-mining tasks can build on it.

Deriving the most representative feature words from the classification labels and the model is an induction from data to rules, a good example of data-driven thinking; applying the "experience" the model has summarized to predict labels for new samples is the deduction from rules back to data.

6 Various statistical analysis based on classification labels

For the theme-tagged corpus of 580,000+ rows above, cross-tabulating the theme tags against various metadata (style, dynasty, author, and so on) yields many interesting results.

6.1 Poetry theme & style analysis

Cross-tabulating the style tags and theme tags of the poetry dataset by composition ratio gives the following results:

Some clear descriptive patterns emerge:

  • "Farewell to friends" and "bird-song imagery" account for a relatively high share across all poetry styles, making them two of the more "popular" themes;
  • "Mourning the dead" and "heroic and impassioned feeling" account for a relatively low share across all styles, making them two comparatively "unpopular" themes.

6.2 Co-occurrence analysis of theme tags

The theme classification above is multi-label, meaning one poem may carry several theme tags. We can therefore run a co-occurrence analysis on the tags: tags that frequently appear together are likely to be correlated.

Modeling the tag co-occurrences and visualizing the result gives the figure below:

In the figure, line thickness encodes co-occurrence frequency: the thicker the line, the more often the pair co-occurs, and vice versa. Several tag pairs stand out with high co-occurrence frequency:

  • Changes of worldly affairs - dejection
  • Travel and homesickness - changes of worldly affairs
  • Chanting history and nostalgia - Pinglun Gan
  • Changes of worldly affairs - golden spears and iron horses
  • Singing over wine - changes of worldly affairs
  • Mourning the dead - changes of worldly affairs

Among these, "dejection" and "changes of worldly affairs" are the most strongly correlated, which is easy to understand: "things remain, but people are no more, and all is over; before I can speak, the tears flow". Similar lines lamenting change and mourning the dead include "how many times has this human world grieved over the past; the mountains keep their shape, pillowed as ever on the cold current" and "a lifetime's endeavors come to nothing; a lifetime's fame ends in a dream". The correlation between "travel and homesickness" and "changes of worldly affairs" is the next highest, with lines such as "I left home young and return old; my accent is unchanged, but my temple hair has thinned" and "the children of those days have all grown up; the friends and kin of former years are half withered away".

We can also see, in poems carrying two or more theme tags, how often "changes of worldly affairs" co-occurs with the other themes: worldly change may plunge the poet into sorrow; wars and calamities may prompt the sigh "in prosperity, the people suffer; in ruin, the people suffer"; or one remembers "peach and plum in the spring breeze, one cup of wine; rivers and lakes in the night rain, a lamp for ten years".
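Counting the tag co-occurrences behind such a figure is straightforward; a sketch (tag names shortened for illustration):

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(tagged_poems):
    """Count how often each unordered pair of theme tags labels the same poem."""
    pairs = Counter()
    for tags in tagged_poems:
        # sort so ("a", "b") and ("b", "a") count as the same pair
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

tagged = [["changes", "dejection"],
          ["changes", "homesickness"],
          ["changes", "dejection", "homesickness"]]
cooc = tag_cooccurrence(tagged)
```

The resulting counts map directly onto the edge thicknesses of the co-occurrence graph.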

6.3 Analysis of the Trend of Poetry Themes

The author arranges the dynasties in the poetry data set in chronological order from far to nearest, and merges the dynasties with similar ages, and cross-analyses them with 23 popular poetry subjects (proportion), and obtains the following picture (click on the picture to enlarge it ):

In the figure above, you can look at it from the horizontal (dynasty) and vertical (poem subject) dimensions.

From a horizontal perspective, there are two themes that endure for a long time, namely "giving a friend to send farewell" and " bird language".

In ancient times, due to inconvenient transportation and poor communication, it was often difficult for relatives and friends to see each other for several years, so the ancients paid special attention to parting. At the time of parting, people often set farewells with wine, fold willows, and sometimes chant poems to say goodbye. Therefore, "giving friends farewell" has become an eternal theme chanted by ancient literati. In addition to this deep sentimentality, there are often other accommodations: or used to encourage and persuade, such as "Momhou has no confidant, no one in the world knows the emperor"; or used to express friendship, such as "the peach blossom pond is deep in the water." A ruler is not as good as Wang Lun s sentiment to me"; or used to entrust the poet s own ideals and ambitions, such as "Luoyang relatives and friends are like asking each other, a piece of ice in the jade pot"; even full of positive youthful breath, full of hopes and dreams, such as " There is a confidant in the sea, and the end of the world is close to each other."

Poems under the "bixing" theme generally voice emotion through metaphor and evocation. The author reads them in two types. The first uses birdsong to convey the poet's calm detachment on returning to the mountains and to nature; Wang Wei (Mojie) wrote the most of these, e.g. "The rising moon startles the mountain birds, which cry now and then in the spring ravine", "Over the misty paddy fields a white egret flies; amid the shade of summer trees an oriole sings", "Pheasants call among the ripening wheat; silkworms sleep as mulberry leaves thin". The second expresses a faint sorrow through images such as the cuckoo (zigui) and the wild goose, e.g. "The willow catkins have all fallen and the cuckoo cries; I hear you have passed the Five Streams on the way to Longbiao", "The hill trees close over both banks, and the cuckoo cries the whole day long", and the homesick lament that even the wild goose can hardly carry one's letter home, the way back far even in dreams.

Read vertically, besides the two popular themes above, poems on "repaying the sovereign's favour" made up a relatively high share in the late Sui and early Tang. That period coincided with China's third great unification and the twin glories of the "Zhenguan" and "Kaiyuan" eras of good governance; in such an age, great numbers of ardent young men longed to ride to the battlefield, win merit, and serve the country.

The author also noticed that from the Jin Dynasty through to the present, themes such as "Hua Kai Tu Feng", "travel and homesickness", "golden spears and iron horses", and "quiet comprehension of Chan" have stayed consistently popular. Taken together with the two enduring themes above, this indicates a certain continuity in the direction of poetic creation over this period.

The table above already yields some findings, but to reach the deeper information hidden beneath the surface we need higher-level data-mining methods. Here the author applies multiple correspondence analysis to map the high-dimensional representation (the 21×23 table above) into two dimensions (decomposed into two matrices: subjects 23×2, dynasties 21×2), revealing more intuitively the relationships among poetry subjects, and between subjects and dynasties, as shown in the figure below (click the image to enlarge):
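The core of correspondence analysis is an SVD of standardised residuals. A simple-CA sketch with NumPy shows how a dynasty-by-theme count table is projected into the two-dimensional coordinates plotted above; the counts here are invented, and the author's actual tool may differ:

```python
# Minimal correspondence-analysis sketch: project a contingency table to 2-D.
import numpy as np

counts = np.array([[30, 5, 10],
                   [8, 25, 12],
                   [6, 9, 40]], dtype=float)  # rows: dynasties, cols: themes

P = counts / counts.sum()                  # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)        # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardised residuals
U, sv, Vt = np.linalg.svd(S)               # singular values, descending

# 2-D principal coordinates for rows (dynasties) and columns (themes):
row_coords = (U[:, :2] * sv[:2]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :2] * sv[:2]) / np.sqrt(c)[:, None]
print(row_coords.shape, col_coords.shape)  # (3, 2) (3, 2)
```

Plotting `row_coords` as dots and `col_coords` as "x" marks reproduces the kind of joint map described below.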

In the figure there are two kinds of markers: the red dots ringed by circles are dynasties, and the "x" marks are the coordinates of poetry subjects.

The Han Dynasty's coordinate sits alone, far from the rest, because its data volume is too small for clear statistical features, so the author does not analyse it here.

In the upper left of the figure, the circles for the Wei-Jin, Northern and Southern Dynasties, Sui, and late Sui/early Tang overlap considerably, indicating that the distribution of their poetic themes is quite similar. Given that these dynasties succeeded one another, this again reflects the continuity of the times in poetic creation.

Similarly, the circles for the Tang Dynasty and later are clustered together, indicating that their theme distributions are similar: the dynasties after the Tang differed little in their choice of poetic themes, and the degree of thematic innovation was not high. The reason is that poetry had already evolved to its "ultimate state" in the Tang:

The subjects and artistic conceptions of Tang poetry are all but all-encompassing, and its rhetoric reached the level of perfection. It inherited the traditions of Han and Wei folk song and yuefu while greatly developing the gexing style; it carried forward the five- and seven-character ancient verse of earlier generations while developing long narrative and lyrical forms; it extended the use of five- and seven-character verse and created modern-style (jinti) poetry of particularly beautiful and regular form. Modern-style poetry was the new verse form of the day, and its creation and maturation were a major event in the history of Tang poetry: it pushed the harmonious syllables and refined artistry of Chinese classical poetry to unprecedented heights and found the most typical form for ancient lyric verse, one still beloved among the people today.

Tang poetry represents the highest achievement of Chinese verse, an undeniably rich and colourful stroke in the literature of China and the world! For Song poets hoping to open new horizons, this was enormous pressure. As Wang Anshi and Lu Xun put it:

"The fine language of the world has all been spoken by Old Du (Du Fu); the common language of the world has all been spoken by Letian (Bai Juyi)."

"I believe all good poems had been written by the end of the Tang; after that, unless one were a 'Great Sage Equal to Heaven' able to leap out of the Tathagata's palm, there is no need to set one's hand to it again."

7 Generate fluent poems through GPT-2

To some extent, poetry generation is a deep analysis of poetry from another dimension.

What kind of poetry a generation model produces is closely tied to what it "eats". Its output is not "water without a source" or "a tree without roots": only by fully learning and absorbing a large body of earlier poems and acquiring a certain "compositional technique" can it generate passable poetry.

At the same time, we can also discover some rules of poetry creation from the generated results, and do some in-depth exploration.

7.1 Example analysis of poem generation

In this part, the corpus the author uses to train the poetry generation model is built on the popular-theme label system (currently 23 theme tags) and consists of regulated verse (lüshi) and quatrains (jueju), each in both five- and seven-character forms, all of which satisfy the structural, tonal, and semantic requirements of poetry.

The author uses GPT-2 (Generative Pre-Training 2), an unsupervised language model that can generate coherent paragraphs of text and achieved leading performance on many language-modelling benchmarks (its data and parameter scale, of course, are no match for its successor GPT-3...). Without task-specific training, it can do preliminary reading comprehension, machine translation, question answering, and automatic summarisation. Its core idea can be summarised as: given more parameters and larger, more diverse text, unsupervised training of a language model may endow it with stronger natural-language understanding and let it begin to solve different kinds of NLP tasks without any supervision.

For the poetry-generation task, the author trains a GPT-2 model from scratch, striving to let it learn both the explicit features of the poetry data set (the relations between subject and poem, between form and poem, between acrostic head characters and poem, etc.) and the implicit features (mainly the metre of the verse), as sketched in the figure below:

Compared with the LSTM poetry-generation model the author used three years ago in "Analysing Nearly 50,000 Tang Poems with Text Mining", the GPT-2 model is a great step forward:

  • The generated poems are more fluent, and the opening and closing lines of each couplet join more naturally
  • It attends to the whole poem: with good long-range memory and attention to context, the generated lines hang together, with no "jumping between subjects"
  • It can learn more hidden features of the poetry data, such as rhyme, tonal pattern (ping-ze), parallelism, and interrogative tone
  • Thanks to the three advantages above, the "scrap rate" of generated poems drops sharply

Below, the author presents the poetry generation ability of GPT2 in a "fancy" style:

1) A generated poem may correlate with lines written by predecessors, yet the GPT-2 model "remixes" them so thoroughly that a direct "plagiarism source" is hard to spot. For the seven-character poem below generated by GPT-2, one semantically closest line can be found in the corpus for each couplet:

The legend of the war resounded throughout China, and the Central Plains became white.
The heroes in front of the soldiers who do not die will have no worries.
The universe is absorbed into the world, and the years are long and old.
Ande's boat is lonely, and the smoke and waves of the five lakes are flowing east.

2) Many generated poems can learn rhyme well, for example, conform to the rhyming rules of "Ping Shui Yun":

The example above follows the level-tone-start (ping qi), first-line-rhyming pattern of seven-character regulated verse (qilü) in Ping Shui Yun:

Ping ping ze ze ze ping ping (rhyme)

Ze ze ping ping ze ze ping (rhyme)

Ze ze ping ping ping ze ze

Ping ping ze ze ze ping ping (rhyme)

Ping ping ze ze ping ping ze

Ze ze ping ping ze ze ping (rhyme)

Ze ze ping ping ping ze ze

Ping ping ze ze ze ping ping (rhyme)

The legend of the war resounds throughout China [rhyme character "Zhou": group 11 You of Ping Shui Yun] rhymes

Ten thousand li of the Central Plains, a single white head ["head": 11 You] rhymes

The hero before the undying soldiers (non-rhyming line)

...shall have no worries ["worry": 11 You] rhymes

The universe absorbed into the breast (non-rhyming line)

The long years grow old over the Ox and Dipper ["ox": 11 You] rhymes

Ande's lone boat (non-rhyming line)

The smoke and waves of the Five Lakes flow east ["flow": 11 You] rhymes

Look at two more cases:

Even where a poem fails to keep strictly to one rhyme (strictly, a poem should use a single rhyme group throughout), the model automatically falls back on a neighbouring rhyme group, easing the rhyming problem.

3) For lines from Zhang Ruoxu's old yuefu piece "Chunjiang Huayueye" ("A Moonlit Night of Flowers on the Spring River"; being a yuefu poem of considerable length, it did not take part in training the generation model), the model was asked to produce a matching second line for each given first line, couplet-style, and the results do not jar against the original's harmony:

[First line] The tide of the spring river joins the level sea; the bright moon on the sea rises with the tide. [Second line] The ancient tower is surrounded by green mountains, and the fair peaks stand misty.

[First line] The shimmering waves follow for thousands of miles; where on the spring river is there no bright moon? [Second line] Fifty years pass, dim as illusion; here there is a romantic air.

[First line] The river winds around the fragrant fields; the moon shines on the flowering woods like sleet. [Second line] Mists of floating emerald fill the mountains; the wind combs the willows' smoke.

[First line] The slanting moon sinks low, hidden in sea fog; from Jieshi to Xiaoxiang the road is boundless.

[First line] I know not for whom the river moon waits; I only see the Yangtze sending its waters on. [Second line] I seek old traces only from the fishermen; where should I ask for a reply?

[First line] A patch of white cloud drifts away; on Qingfeng Pu the sorrow is more than one can bear. [Second line] Clear water and empty desert; the red geese fly south.

[First line] The wild geese fly long but carry no light across; fish and dragons leap and dive, writing patterns on the water. [Second line] Fishing lodges under the moon go hand in hand; gulls and egrets trade boat-songs.

[First line] At this hour we gaze but cannot hear each other; I would follow the moonlight to shine upon you. [Second line] I want to invite the immortal to come home with me.

[First line] Who by the river first saw the moon? In what year did the river moon first shine on men? [Second line] In the world there are places where dragons may hide; where does the sea wind blow so strong?

[First line] The sky and the river share one hue without a speck of dust; bright in the sky hangs the lone wheel of the moon.

4) Embedding the author's company name, "Daguan Data", as the acrostic head characters, and using "mountains and rivers", "cultivating immortality", and "pastoral retirement" as themes, three seven-character poems were generated:

"Part One"

Of Mount high 72 peak, Tao qi hero in the world.

Watching the sea and three mountains, the clouds are ethereal, and the towers are thousands of miles away.

With a few stems and white hair, the body is long and healthy.

According to the risk, he could not win the survey, lean on a railing and lean on thinking.

"Second"

The origin of the master is natural, and there is no fairy in the world.

View of empty Wu has three thousand community, see the world we know two decades.

Count the green hills with sticks, and a line of white birds descended in the clouds.

According to Saddle, I want to go to Ling Feng, looking back at the lonely city.

"Part Three"

Of Mount Vientiane pay assessment, one from the creator.

Sometimes the viewer laughs, and there is nowhere to be busy watching people.

Several huts face running water, and several fishing boats face the sunset.

According to the true face of this, the universe is not vicissitudes of life.

5) Introduce more poetry labels into training and generation, such as form, theme, and acrostic head characters, to steer GPT-2 in the desired direction. After training the model, the author feeds in the form (seven-character quatrain or seven-character regulated verse), the acrostic head "Canghai Yueming" ("the bright moon over the vast sea"), and a theme ("golden spears and iron horses", "changes of the world", "quiet comprehension of Chan", etc.), generating the following poems:

Seven-character quatrain <sep> Ji Wu Ge is <sep> golden spears and iron horses <sep> Ji Bing one hundred thousand Habayashi Lang, Ge Bi aspect one hundred battlefield. As a general newspaper can defeat the enemy, military power was originally a good border province.

Seven-character regulated verse <sep> Canghai Yueming <sep> golden spears and iron horses <sep> Cang Hai arms a battlefield, west wind tears clothes. Haimen Wanli beacon smokes, and the world will be cool with rain and dew for thousands of years. When the moon goes downstairs, the boat is noisy, and the flags are stirred in the middle of the night. Out of this evening boarding at Du Yi railings of the sunset.

Seven-character regulated verse <sep> Canghai Yueming <sep> golden spears and iron horses <sep> Cang wave transitory man river access, and more ahead in the vast emotion. Sea upstairs ship three hundred years, the clouds over five thousand pavilion. May Ming Ferries Fishermen from autumn and into the remote hamlet Gujiao. Next day we meet with the other, painful tears torrential.

Seven-character regulated verse <sep> Canghai Yueming <sep> golden spears and iron horses <sep> Cang wave transitory man eismowe group, ahead in the vast skies cloud. Sea upstairs boat three miles, the flames of a solitary graves horizon. On the moonlight, the fishing song of the ancient ferry began, and the frost fell into the sky and the geese shadows. Next day they meet with other, tears have back home.

Seven-character regulated verse <sep> Canghai Yueming <sep> changes of the world <sep> Cang Sang Jiehuo few rise and fall, look back on the vicissitudes of feeling hurt. Sea onto the balcony Jiehuo air, earth Wo millet from sadness. May Ming Ferries Fishermen, the wind calm cold Yan Ying long. Next day we meet with another, a desolate back home.

Seven-character regulated verse <sep> Canghai Yueming <sep> quiet comprehension of Chan <sep> Tsang waves ten thousand ares eismowe group, this place has been here points. Sea tower on tonight month, the mountains pavilion a few autumn clouds. The monks of the ancient monastery of Yueming were first decided, and Jiang Yan was hearing the tide. Next day I meet with do not, I do not know where is the outlier.

The generated results above are consistent in tonal pattern, acceptable in rhyme, and not bad in poetic feeling. I dare not call them perfect, but at least many people could not write poems that look this good.
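The `<sep>`-joined control-tag layout used in the samples above (form, acrostic head, theme, then the poem) can be assembled with a small helper; the function name and example strings below are illustrative assumptions, not the author's actual code:

```python
# Hypothetical helper that assembles a training/prompt string in the
# "form <sep> acrostic head <sep> theme <sep> poem" layout shown above.
def make_sample(style, head, theme, poem=""):
    """Join the control tags (and optionally the poem body) with <sep>."""
    fields = [style, head, theme]
    if poem:
        fields.append(poem)
    return " <sep> ".join(fields)

# At generation time only the control tags are given; the model continues
# the string with the poem body.
prompt = make_sample("七律", "沧海月明", "金戈铁马")
print(prompt)  # 七律 <sep> 沧海月明 <sep> 金戈铁马
```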

In addition, the author ran a large number of theme-conditioned generation tests. The results show a fairly high correlation between the given theme and the generated poems, which in turn supports the reasonableness of the theme-labelled corpus the author constructed.

The author also found, through the generated lines, some differences in expression between ancient and modern poetry. Using "golden spears and iron horses" as the theme, the author seeded generation with the opening couplets of Chairman Mao's "The People's Liberation Army Occupies Nanjing" and Mr. Chen Yi's "Three Stanzas at Meiling", producing nine poems from each. The results are below (click to view a larger image):

In each of the two figures, the poem occupying the centre position is the original; the rest were "guided" by the opening couplets of Chairman Mao's and Mr. Chen's poems. They largely contain imagery tied to "golden spears and iron horses": fighting, guarding the frontier, killing the enemy, defending the country, for example:

Hearing that the Han family fought and fought, the general re-emerged in Lampang today.

The flags and shadows moved the three armies to silence, and the sound of Diao fighting spread for five nights.

The blood of the Central Plains was three thousand li, and the heroic soul of the southern kingdom was heartbroken.

The west wind blows the horns and cold geese, and the southern flags cross the river at night.

...

However, perhaps because the model studied a great many poems from the feudal era, these generated poems tend toward a sorrowful, slightly negative close, as in the following lines:

Since then, the border city has been fought more and more, and it is more sad and desolate without the need for a drum.

The sound of coldness all the way back to the geese, but in the deep autumn, I don't see the worrisome window of the guest.

I want to find the old Yin from the king, and Bian Zhou revisit the Caotangtang.

The only heroic spirit knew this, and he couldn't bear to look back and tear his clothes in tears.

Looking back on the homeland, the westerly wind is bleak and sad.

Looking back on the embarrassing situation, the sunset is full of grass.

...

The lines above lack revolutionary optimism, a quality poetry of the feudal era did not possess, and this is precisely where the poems of Chairman Mao and Mr. Chen differ. Consider these two lines:

"If heaven had feelings, heaven too would grow old; the right path in the human world is seas turning to mulberry fields."

"Having thrown myself into revolution, the revolution is my home; the rain of blood and wind of gore must have their end."

" Articles are written together with time, and songs and poems are written together for things. " The above results also reflect from the side that poetry creation has a sense of time and reality. Although they are written on the same subject, they are due to the life track of the poet and the era facing him. The background is different, and the weather contained in the chest is also very different.

The GPT-2 poems above all look rather good, and many could pass for the real thing. That being so, how do we tell which were written by humans and which by the machine?

In the final analysis, machine-written poetry is a statistical matter, and "the one who tied the bell must untie it": statistics must solve it too.

7.2 Comparison of differences in the creation of man-machine poetry

The general principle of a poetry generation model is this: from a large poetry corpus, the model learns the dependency between adjacent words in a line. Given, say, "desert", GPT-2 uses its learned experience to guess which word comes next; the candidates are "stored" in the model's "memory" as probabilities, for example:

"Desert": 0.1205,

"North": 0.0914

"Ran": 0.0121,

"Sight": 0.00124,

...

In the simplest case, when the machine "composes a poem" it selects the word with the highest probability at each step, and so on until it meets the "terminator", gradually producing the whole poem.
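The greedy procedure just described can be sketched in a few lines; the hand-made transition table below stands in for the real model's probabilities, and all characters and numbers are invented for illustration:

```python
# Greedy decoding in miniature: at each step pick the highest-probability
# next character until the end-of-poem marker.
def greedy_generate(model, start, end_token="<eop>", max_len=20):
    text = [start]
    while len(text) < max_len:
        next_probs = model.get(text[-1], {})
        if not next_probs:
            break
        nxt = max(next_probs, key=next_probs.get)  # most probable next char
        if nxt == end_token:
            break
        text.append(nxt)
    return "".join(text)

toy_model = {
    "大": {"漠": 0.62, "江": 0.21},
    "漠": {"孤": 0.55, "风": 0.30},
    "孤": {"烟": 0.71},
    "烟": {"<eop>": 0.9},
}
print(greedy_generate(toy_model, "大"))  # 大漠孤烟
```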

This is the simplest strategy; the results are mediocre and often make little sense, artistically or logically.

To ensure generation quality, several more elaborate strategies are generally used (often in combination): beam search, top-k sampling, top-p (nucleus) sampling, a repetition penalty, a length penalty (discouraging over-long lines), and so on. These take other factors of poetry generation into account, such as fluency, richness, and consistency, and the results improve greatly.
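A minimal sketch of how top-k and top-p filtering plus a repetition penalty might combine, in the spirit of (not identical to) common library implementations; all vocabulary indices and numbers are toy values:

```python
# Filter a next-token logit vector before sampling.
import numpy as np

def filter_logits(logits, generated_ids, top_k=3, top_p=0.9, rep_penalty=1.3):
    logits = logits.copy()
    logits[list(generated_ids)] /= rep_penalty        # penalise repeats
    # Top-k: keep only the k largest logits.
    kth = np.sort(logits)[-top_k]
    logits[logits < kth] = -np.inf
    # Top-p (nucleus): keep the head whose cumulative probability
    # stays within top_p (always keeping the single best token).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = np.cumsum(probs[order]) <= top_p
    keep[0] = True
    logits[order[~keep]] = -np.inf
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                        # sample from this

probs = filter_logits(np.array([2.0, 1.5, 1.0, 0.2, -1.0]), generated_ids={1})
print(probs.round(3))
```

Sampling from the filtered distribution (rather than taking the argmax) is what gives generation its variety while the filters keep it coherent.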

Building on Harvard's GLTR ("Statistical Detection and Visualization of Generated Text"), the author explores some differences between machines and humans when writing poems. The tool takes a poem as input and outputs statistics on the probability distribution of its words, for poems written by machines and by humans, from which we can uncover some of the secrets of "refining characters". An example:

In the above figure, the color of the color block represents the probability interval of the word, red represents the word with the probability of TOP10, yellow is TOP100, green is TOP1000, and purple is TOP10000.

From the results we can see that when the machine composes, red and yellow characters dominate: generating word by word, it usually draws from the head of the word-probability distribution, which leads to rather commonplace expressions. When humans compose, the characters are spread far more evenly across the colour bands, which is what yields poetry's ever-varied, unconventional expression.

In ancient times poets stressed "refining characters" (lian zi), the tempering of words: after repeated weighing, the poet selects from the treasury of vocabulary the most apt, precise, and vivid word to describe a thing or convey a meaning. From this standpoint, a purely statistical "word-picking" strategy is basically undesirable: the words either fail to carry the meaning or fall easily into cliché.
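The colouring scheme described above can be mimicked simply: for each position, rank the observed character inside the model's predicted distribution and map that rank to a colour band. The toy prediction table below is a stand-in; GLTR itself uses a real language model's distributions:

```python
# GLTR-style rank bucketing: red = top-10, yellow = top-100,
# green = top-1000, purple = beyond.
def rank_bucket(rank):
    for limit, colour in [(10, "red"), (100, "yellow"), (1000, "green")]:
        if rank <= limit:
            return colour
    return "purple"

def colour_text(predictions, observed):
    """predictions: one prob dict per position; observed: chars actually used."""
    colours = []
    for probs, char in zip(predictions, observed):
        ranked = sorted(probs, key=probs.get, reverse=True)
        colours.append(rank_bucket(ranked.index(char) + 1))
    return colours

preds = [{"月": 0.4, "日": 0.3, "风": 0.2, "鹤": 0.1}] * 2
print(colour_text(preds, ["月", "鹤"]))  # ['red', 'red']
```

A human poet's text tends to produce more green and purple positions than a greedy machine's, which is exactly the signal discussed above.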

Take Tao Yuanming's line "Picking chrysanthemums by the eastern fence, at ease I see the southern mountain": replacing "see" (jian) with "gaze" (wang) would not do. By the probabilities learned from the poetry data set, "wang" is far more likely than "jian" here, but "jian" carries the sense of catching sight of something unintentionally: the author raises his head and the southern mountain simply appears, expressing the poem's leisurely contentment, as if the mountain scenery were glimpsed by accident, in keeping with the utterly natural, candid mood of "the mountain air is fine at sunset, and the birds return together", whereas "wang" feels deliberate and stiff.

8 Translate poems into easy-to-understand vernacular

Poetry translation here means rendering verse steeped in classical Chinese, hard for ordinary readers today, into vernacular that modern people can easily understand.

The model the author uses is an encoder-decoder built from two BERTs; the goal is to take a line, or a whole poem, and generate its vernacular translation. Much vocabulary is semantically continuous between ancient and modern Chinese, so, unlike Chinese-English translation, where source and target differ greatly in semantics and grammar, the two sides share a great deal of vocabulary, and less training data is needed. The model learns the semantic correspondences from many translation sentence pairs and outputs them in (ideally) fluent form.

The following is a brief schematic diagram:

The author wrote a crawler to collect poems and their translations from a poetry-sharing website; after cleaning, 30,000 training sentence pairs remained.

Part of the training data is shown in the following table:

After training the model, the author selected some sentences that did not appear in the training set as examples of text translation. Some examples are as follows:

[Verse] Ask you why you take parting so lightly: how many rounds of reunion can one year hold? The willows have just turned silken, as spring ends in the old garden. ("Pusa Man: Ask You Why You Take Parting So Lightly", Qing, Nalan Xingde)

[Translation 1] When will you be able to go home? Only the beautiful peach blossom has passed.

[Translation 2] Could you ask me why I disappear so easily? It is the Qingming night under the three moons in a year. The willows are just as soft as silk, as if spring is about to pass in my hometown .

[Verse] The mountains brush the light clouds, the sky joins the withered grass, the painted horn breaks off at the watchtower. The water flows, the moon is bright, the wind carries off the last flowers, and a poet's heart stirs the river tower. ("Man Ting Fang: The Mountains Brush the Light Clouds"; the first half is by Qin Guan of the Song, the second half by GPT-2)
[Translation] On the distant mountain peak is a thin cloud floating in the sky; crows dotted In the withered grass, the trumpet resounded all around. With the bright moon shining on the bright Milky Way mapping to the west, bursts of music of winter plum blossoms are heard, which makes people feel heartbroken at this time.

[Couplet] The emperor's way, once held true, is now a thing of the past; the people's aspirations have been the current since ancient times. (The first line is by Yang Du of the Republic of China; the second line is by GPT-2)
[Translation] Jingcheng Avenue really understands my knowledge, and I have experienced it just like the place back then. The people still protect the people's living world in their hearts. Since ancient times, it has been the world where people used to come and go!

[Verse] Ten thousand households and a thousand gates enter the palace; golden cords mark the roads where the willow floss is yellow. The gentle breeze slants the music, and on clear days spring follows the palace. ("Palace Lyrics: Ten Thousand Households and a Thousand Gates", Song, Wang Zhongxiu)
[Translation] The emperor's residence has been closed to establish fame and fortune, and the willows on both sides of the royal road are still like young shoots. Coming here in spring, the palace towers high into the sky, and there is a burst of clear and gentle water flowing from the palace.

[Verse] The two waters of Yuan and Xiang are clear and shallow, and the forest and flowers on the beach are very loud. The vastness of Dongting leads to the Yangtze River, and the water rises in the spring. ("Song of Young Hunan", Republic of China, Yang Du)
[Translation 1] How far is the flow on both sides of the Yuanjiang River here? Growing weeds and small continents surround the surface of the river, and the waves on the riverside seem to be so wide; when spring comes, the water surface rises and rises with a blue color.
[Translation 2] A piece of clear water flows on both sides of the Yuan River, and dense woods surround the river bank. The vastness of Dongting Lake, spring water surging continuously to the distance.
[Translation 3] There is a piece of clear water flowing on both sides of Xiaoxiang, and the petals of the woods are floating with the wind. Dongting Lake is vast, turbulent, and sparkling, as if the sky meets each other.
[Translation 4] Yuanshui Xiangjiang River is crystal clear, the water is rippling, and the trees on the shore are as shallow as flowing. Looking far away from the vastness of Dongting Lake, the water and the sky are connected together.

Judging from the results, with only 30,000 sentence pairs the effect is passable. Many translations are not literal; the model prefers "free translation" and will "imagine" extra scenery. In the translation of "The mountains brush the light clouds... a poet's heart stirs the river tower", for instance, the machine conjures up "which makes people feel heartbroken at this time": it is starting to get the flavour.

If the corpus were expanded, say by splitting whole poems and their translations into aligned single lines, augmenting the vernacular side (synonym substitution, random insertion, random swapping, etc.), and favouring literal over free translation, the trained model could be considerably stronger and the translations much better.
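The augmentation ideas just mentioned (synonym substitution, random swapping) can be sketched in a few lines; the one-entry synonym table below is a placeholder for a real thesaurus, and the sample sentence is invented:

```python
# Simple text augmentation for the vernacular side of the corpus.
import random

SYNONYMS = {"美丽": ["秀丽", "漂亮"]}  # toy thesaurus

def synonym_substitute(tokens, rng):
    """Replace each token that has synonyms with a random synonym."""
    out = []
    for tok in tokens:
        choices = SYNONYMS.get(tok)
        out.append(rng.choice(choices) if choices else tok)
    return out

def random_swap(tokens, rng):
    """Swap two random positions, producing a lightly perturbed sentence."""
    i, j = rng.sample(range(len(tokens)), 2)
    tokens = list(tokens)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

rng = random.Random(42)
sent = ["远山", "景色", "美丽"]
print(synonym_substitute(sent, rng))
print(random_swap(sent, rng))
```

Each original pair can thus yield several perturbed copies, multiplying the effective size of the 30,000-pair corpus.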

Concluding remarks

Through the above-mentioned poetry corpus analysis process, the author would like to talk about some views on (text) data mining:\

So-called mining carries connotations of discovery, search, induction, and refinement. Since things must be discovered and refined, what we seek is usually not on the surface but hidden within the text, or beyond what people could find and summarise directly at scale. To tease the silk from the cocoon, one must combine domain knowledge (such as the common sense about poetry used in this article) with a variety of analytical methods (the various NLU and NLG methods above), and sometimes even reverse thinking (such as poetry generation here); ideally the analyses form an ordered, complementary, organic whole, so that the text-exploration task is completed with the greatest efficiency.

Reference materials:

  • "Resonance of Mathematics and Literature", Qiu Chengtong 
  • "Jaling Talking about Poems. Jiaying Talking about Poems", Ye Jiaying
  • "Text Data Management and Analysis", Zhai Chengxiang
  • "Text Data Mining", Zong Chengqing
  • "Basics of Ancient Chinese", Wu Hongqing
  • The Metrical Pattern of Poetry, Wang Li
  • "The Science of Language", Noam Chomsky
  • "Modern Chinese Lexicology Course", Zhou Jian
  • "Language Cognitive Research and Computational Analysis", Yuan Shulin
  • "The Cognitive Method of Natural Language Processing", Bernadette Sharp
  • "Introduction to Natural Language Processing", He Han
  • github.com/Werneror/Po...
  • github.com/kpu/kenlm
  • github.com/jiaeyan/Jia...
  • "Catching a Unicorn with GLTR: A tool to detect automatically generated text", gltr.io

  • "Better Language Models and Their Implications", openai.com/blog/better...

  • "Degree of Freedom + Degree of Solidification + New Word Discovery in Statistics", blog.csdn.net/qq_39006282...