
How search engines understand us. Semantic analysis of text

SAPE.RU - Blog 02.10.2019 at 12:26

The blog of Sape, a comprehensive website promotion company. Expert articles on SEO, usability, contextual advertising and advertising with bloggers.

Semantic (meaning) analysis of text is one of the key problems in building artificial intelligence systems, related to natural language processing (NLP) and computational linguistics. The results of semantic analysis can be applied to solve problems in areas such as psychiatry, science, trade, literature, search engines, automatic translation, etc.

Despite its relevance to virtually all areas of human life, semantic analysis remains one of the most difficult mathematical problems. The whole difficulty lies in "teaching" the computer to correctly interpret the images that the author of a text is trying to convey.

In this article we will examine how search engines extract semantic meaning from a query, the TF-IDF method and Zipf's law. In the first part of the article you can read about bag-of-words, the basic method of language processing: how a search engine understands individual words and sentences and finds the matching document. Read on and become a real guru of search engine optimization...

TF-IDF and Zipf's law

Zipf's law describes the distribution of word frequencies in natural language: if all the words of a language (or simply of a long enough text) are sorted in descending order of frequency of use, the frequency of the n-th word in the list is approximately inversely proportional to its position n (called the rank of the word). For example, the second most frequent word occurs about half as often as the first, the third about a third as often, and so on. The most frequently used words (roughly 18% of them) account for more than 80% of the volume of the whole text.
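This rank-frequency relationship is easy to check on any corpus. Here is a minimal sketch (the function name is mine, not from the article): sort words by frequency and compare each observed frequency with the Zipf prediction f(1)/n:

```python
from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency and compute the Zipf prediction f(1)/rank for each."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    f1 = ranked[0][1]  # frequency of the most common word
    return [(rank, word, freq, f1 / rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]
```

On a long enough natural-language text, the observed frequency in the third column stays close to the Zipf prediction in the fourth.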

The most popular words appear in most documents. As a result, such words complicate the retrieval of texts represented with the bag-of-words model. In addition, the most popular words are often function words without semantic meaning: they carry no information about the content of the text.

Among the 10 most popular words in the Russian language are «и» (and), «не» (not), «я» (I), «быть» (to be), «он» (he) and «что» (that).

We can apply the statistical measure TF-IDF (term frequency — inverse document frequency) to reduce the weight of words that are used often in texts but carry no semantic load. The TF-IDF score is calculated according to the following formula:

tf-idf(t, d) = tf(t, d) × log10(N / df(t)), where:

tf(t, d) is the frequency of the word t in document d
df(t) is the number of documents that contain the word t
N is the total number of documents

The table below shows IDF values for some words in Shakespeare's plays, ranging from informative words that occur in only one play (for example, "Romeo") to words so common that they are not discriminative at all, since they occur in all 37 plays, such as "good" or "sweet".

The IDF of the most common words is 0, so their TF-IDF weight in the bag-of-words model will also be 0. The weight of rare words will be higher.

Word       DF   IDF
Romeo       1   1.57
salad       2   1.27
Falstaff    4   0.967
forest     12   0.489
battle     21   0.246
fool       36   0.012
good       37   0
sweet      37   0
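The IDF values in the table can be reproduced with base-10 logarithms and N = 37 plays (a quick check of the formula above, not code from the article):

```python
import math

def idf(df, n_docs=37):
    """Inverse document frequency: log10 of total documents over document frequency."""
    return math.log10(n_docs / df)

print(round(idf(1), 2))   # 1.57  -> "Romeo", found in only one play
print(round(idf(12), 3))  # 0.489 -> a word found in 12 plays
print(idf(37))            # 0.0   -> a word found in all 37 plays
```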

What an SEO specialist needs to know

It is unlikely that the bag-of-words model is used in commercial search engines in its pure form today. There are models that better reflect the structure of the text and take more linguistic features into account, but the basic idea remains the same: documents and search queries are converted into vectors, and the similarity or distance between the vectors is used as a measure of relevance.

This model shows how lexical search works, as opposed to semantic search. For lexical search it is important that the document contain the words mentioned in the search query; for semantic search this is optional.

Zipf's law shows that text written in natural language has predictable proportions. Deviations from the typical proportions are easy to detect, so it is not difficult to identify over-optimized text that looks "unnatural".

Thanks to TF-IDF, documents containing the query keywords acquire greater weight in vector search. It is tempting to interpret this as something related to "semantics".

Word semantics

Semantic search became a buzzword in the SEO community in 2013. Semantic search is search by meaning, in contrast to lexical search, where the search engine looks for literal matches of the query words or their variants without understanding the overall meaning of the query.

Let us give a simple example. Enter the query "drunk wrong flat movie" in Yandex or Google. The results can be seen in the screenshot.

Have you already guessed which movie this is? As we can see, the search engine coped with the task: although our query contains none of the words irony / fate / enjoy your bath, we see "The Irony of Fate" in the results.

But how can a search engine understand the meaning of a word or of a search query? Or: how should we specify meaning so that a computer program can understand it and practically use it to rank documents?

The key concept that helps answer these questions is distributional analysis, first formulated in the 1950s. Linguists noticed that words with similar meanings tend to occur in similar environments (i.e. next to the same words), and the amount of difference in meaning between two words roughly corresponds to the difference in their environments.

Here is a simple example. Suppose you come across the following sentences without knowing what a langoustine is:

Langoustines are considered a delicacy. Langoustine meat is white, found in the tail and body; it is juicy, slightly sweet and lean. When choosing langoustines, look for a translucent orange color.

Suppose also that you, like most readers, know what a shrimp is:

Shrimp is a delicacy that goes well with white wine and sauce. The tender meat of shrimp can be added to pasta. When cooked, shrimp change color to red.

The fact that "langoustine" occurs next to words such as delicacy, meat and pasta may indicate that it is a kind of edible crustacean, somewhat shrimp-like. Thus, a word can be defined by the environments in which it occurs across many contexts.

How can we transform these observations into something meaningful for a computer program? We can build a model similar to bag-of-words, but instead of documents the columns will be labeled with words. It is common to use a small context window around the target word, usually no more than four words on each side. Each cell of the model then records how many times the column word occurs in the context (e.g. plus or minus four words) of the target word. Let us look at some context phrases. The table below is an example from the book "Speech and Language Processing" by Daniel Jurafsky and James Martin.


Target word   Context
apricot       sugar, a sliced lemon, a tablespoonful of apricot jam, a pinch each of
pineapple     their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
digital       well suited to programming on the digital computer. In finding the optimal R-stage policy from
information   for the purpose of gathering data and information necessary for the study authorized in the

For each target word, the neighboring columns record the words from the contexts where it is used. The result is a word co-occurrence matrix. Note that the context words of "digital" and "information" are more similar to each other than to those of "apricot". The raw counts can also be replaced by other measures, such as pointwise mutual information.

             aardvark  ...  computer  data  pinch  result  sugar
apricot          0     ...      0       0     1      0       1
pineapple        0     ...      0       0     1      0       1
digital          0     ...      2       1     0      1       0
information      0     ...      1       6     0      4       0
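The counting procedure behind such a matrix can be sketched in a few lines of Python (the function name and the toy window size are mine, not from the article):

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=4):
    """For every token, count the words occurring within +/- `window` positions."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                counts[word][tokens[j]] += 1
    return counts
```

Running this over a large corpus and keeping the rows for "apricot", "pineapple", "digital" and "information" would produce counts like those in the matrix above.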

Every word and its semantic value are represented by a vector. The semantic properties of each word are determined by its neighbors, i.e. by the typical contexts in which it occurs. Such a model easily captures synonymy and relatedness of words: the vectors of two synonymous words will lie close to each other, and the vectors of words that appear in the same thematic field will form clusters.
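To make "vectors lie close to each other" concrete, word vectors are usually compared with cosine similarity. A minimal sketch using the count vectors from the Jurafsky and Martin example, restricted to the columns computer, data, pinch, result, sugar (the variable names are mine):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Count vectors over the context words (computer, data, pinch, result, sugar)
apricot     = [0, 0, 1, 0, 1]
digital     = [2, 1, 0, 1, 0]
information = [1, 6, 0, 4, 0]

# "digital" and "information" share contexts; "apricot" and "information" share none
print(cosine(digital, information))  # high similarity
print(cosine(apricot, information))  # 0.0
```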

There is no magic in semantic search. The conceptual difference is that words are represented by vector embeddings rather than by lexical items.

What an SEO specialist needs to know

Semantic models are well suited for capturing synonyms, related words and semantic frames.

A system of linked frames can form a semantic network. A semantic network is a set of words that represent the objects of a domain and the relationships between them. For example, a semantic network for "Golden Bowl" tea may include tradition, tea, teacup, teapot, spoon, sugar, beverage, etc.

When creating new content, it is useful to think in terms of semantic frames, i.e. to take into account the semantic structure for which you want to promote your page to the top, rather than a specific keyword. Playing games with keywords is likely to have little effect: synonymous words, such as "flat" and "apartment", will have very similar vectors, so replacing words in a text with their synonyms produces a text that is close to the original from the search engines' point of view. Search engines have become much better at finding information, but it will not hurt to give them hints using structured data markup.

Computational linguistics is a fascinating and fast-developing science. The concepts presented in this article are not new or revolutionary, but they are quite simple and useful for getting a general idea of the problem field.

Questions, suggestions and criticism are welcome in the comments.

You can read more interesting facts about SEO on our VK or FB pages.

The post "How search engines understand us. Semantic text analysis" appeared first on the Sape blog.