
How search engines understand us. The basics of text analysis

SAPE.RU - Blog 27.09.2019 at 13:04

Blog of Sape, a company offering comprehensive promotion services. Expert articles on SEO, usability, contextual advertising, and advertising with bloggers.

Our great and mighty Russian language is not only beautiful but also very complicated. Often even the intuition of native speakers is at odds with the formal rules, and the results of machine analysis still differ strikingly from our intuitive view.

In this article we examine how search engines interpret user queries to find relevant documents, and how the meaning of a query is extracted.

Contents: The word for search engines; The bag-of-words model; Searching for a relevant document

It seems that before Google introduced RankBrain and Yandex introduced Korolev, SEO promotion and the life of SEO specialists were much simpler. Now we face a stream of contradictory information from influential people in the industry. The situation is made worse by the fact that representatives of Yandex and Google give vague information about quality signals and repeat the same thing in unison: "Make sites for people."

How do you tell useless advice and guesses about how the algorithms work from methods that really exist? Below you will find answers that will help you understand how search engines work and what SEO is really about. Read on and become a real guru of search engine optimization.

The word for search engines

A word is the smallest meaningful unit of speech that serves to express an individual concept. Let us start by figuring out how words are represented in computer programs, and identify the strengths and weaknesses of these approaches.

In the simplest case, a computer program sees text as a sequence of alphanumeric characters and punctuation marks. This is the so-called raw representation of text.

"The programmer's programs had been programmed."

The text can be split into words at spaces and punctuation. The result is a list of tokens. Punctuation marks are also treated as separate tokens.
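The splitting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name tokenize and the regular expression are choices made here, not something the article or any search engine prescribes, and real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split raw text into tokens.

    Assumed rules for this sketch: a clitic such as "'s" is one token,
    a run of letters or digits is one token, and every other
    non-space character (punctuation) is a token of its own.
    """
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

tokens = tokenize("The programmer's programs had been programmed.")
print(tokens)
```

Note that the tokens still carry their original capitalization and punctuation; nothing has been normalized yet.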

Capitalization is a feature of any text that is worth noting. It seems reasonable to convert all characters to lower case. After all, "What" and "what" are one and the same word, a pronoun. But what about the word "faith" and the name "Faith", which can be a common noun or a proper noun depending on the context?

Unprocessed tokens retain all the linguistic information, but they also make the input harder to work with. Further processing is carried out to get rid of the excess information.

the programmer 's programs had been programmed

Words can take different forms. For example, "programs" is the plural form of the noun "program", and "programmed" is the past participle of the verb "to program". The unmodified, base form of a word is called a lemma. For nouns it is the nominative singular; for verbs it is the infinitive, the form that answers the question "what to do?". The first logical step in query processing is to convert words to their lemmas.

the programmer 's program have be program

Search engines use stop words to preprocess input queries. A stop list is a set of tokens that are removed from the text. Stop words may include function words and punctuation. Function words are words that carry no independent meaning, for example auxiliary verbs or pronouns.

For example, let us drop the function words from the sentence. The resulting statement contains only content words (words with a meaning of their own). However, it is now hard to tell how the program in the query relates to the programmer.

programmer program program
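The lowercasing, lemma and stop-word steps described above can be sketched as follows. The lemma table and the stop list here are tiny hand-written assumptions that cover only the example sentence; a real system would rely on a morphological dictionary and a much larger stop list.

```python
# Hypothetical lookup tables covering only the example sentence.
LEMMAS = {
    "programs": "program",
    "programmed": "program",
    "had": "have",
    "been": "be",
}
STOP_WORDS = {"the", "'s", "have", "be", "."}

def normalize(tokens):
    """Lowercase the tokens, map them to lemmas, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    lemmas = [LEMMAS.get(token, token) for token in lowered]
    return [token for token in lemmas if token not in STOP_WORDS]

tokens = ["The", "programmer", "'s", "programs", "had", "been", "programmed", "."]
print(normalize(tokens))
```

The order of the steps matters in this sketch: the stop list is written in terms of lemmas, so it is applied last.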

Search engines can also interpret words by their stems, that is, their roots. The root is the main meaningful part of a word, which carries the meaning shared by all words built on that root. For example, we can add the suffix "-er" to the root "program" and get the person who performs the action.

Now look at the query after replacing all the words with their roots.

program program program

After this reduction the original query looks like a rather uninformative sequence.
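To make the reduction to roots concrete, here is a deliberately naive suffix stripper. It is a toy written for this example only, not the Porter stemmer or the algorithm of any search engine; the suffix list and the rule that collapses a doubled final consonant are assumptions chosen so that the example words work.

```python
# Toy suffix list, checked in this order; an assumption for illustration only.
SUFFIXES = ("ers", "er", "ed", "ing", "s")

def toy_stem(word):
    """Strip one known suffix, then undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "programm" -> "program": collapse a doubled final consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print([toy_stem(w) for w in ["programmer", "programs", "programmed"]])
```

A stemmer this crude will mangle many ordinary words, which is one reason production systems use carefully tuned rule sets or dictionaries.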

There are three ways to represent words:

token; lemma; root.

In addition, we can remove all function words and convert the rest to lower case. These transformations and their combinations are chosen according to the language task at hand. For example, it would be inappropriate to remove function words if we need to tell English texts from French ones. And when we need to recognize proper nouns, it is wise to keep the original case of the characters.

These linguistic components are the building blocks for larger structures such as documents.

What an SEO specialist needs to know

It is important to understand why sentences need to be broken into linguistic components. These units are part of a metric that optimizers know and use: keyword density. Many SEO specialists oppose this metric and argue that keyword density has no effect. As an alternative they propose TF-IDF, since it is related to semantic search. Later we will see that both raw and weighted word counts can be used for lexical as well as semantic search. Keyword density is a convenient and simple metric that has a right to exist. However, there is no need to fixate on it. Also keep in mind that search engines treat the grammatical forms of a word as one word, so it makes no sense to optimize a web page separately for, say, the singular and the plural of the same keyword.
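Keyword density is commonly computed as the share of all words on a page that are the keyword. The article does not give a formula, so the definition below is an assumption, and the function name keyword_density is introduced here only for illustration.

```python
def keyword_density(tokens, keyword):
    """Fraction of tokens equal to the keyword (0.0 for an empty text)."""
    if not tokens:
        return 0.0
    return tokens.count(keyword) / len(tokens)

# Tokens are assumed to be already normalized to lemmas or roots,
# so that grammatical forms of one word are counted together.
page = ["programmer", "program", "program"]
print(keyword_density(page, "program"))
```

Because the count is taken over normalized tokens, the singular and plural of a keyword contribute to the same figure.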

The bag-of-words model

Bag-of-words is a model used in natural language processing to represent text (anything from a search query to a full-length book). Although the concept dates back to the 1950s, it is still used for text categorization and information retrieval.

If we want to represent a text as a bag of words, we simply count how many times each distinct word appears in the text and list these values. In mathematics such a list is called a vector. Before counting, the preprocessing methods described above can be applied.

As a result, all information about the structure, syntax and grammar of the text is lost.

the programmer 's programs had been programmed

{the: 1, programmer: 1, 's: 1, programs: 1, had: 1, been: 1, programmed: 1} or

[1, 1, 1, 1, 1, 1, 1]

programmer program program

{programmer: 1, program: 2}

[1, 2]
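The counting step above maps directly onto Python's collections.Counter. The helper names bag_of_words and to_vector are introduced here for illustration; the vocabulary order (order of first appearance) is an assumption chosen so that the output matches the example.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how many times each distinct token occurs."""
    return Counter(tokens)

def to_vector(bag, vocabulary):
    """List the counts in a fixed vocabulary order (0 for absent words)."""
    return [bag[word] for word in vocabulary]

bag = bag_of_words(["programmer", "program", "program"])
vocabulary = list(bag)          # order of first appearance
vector = to_vector(bag, vocabulary)
print(dict(bag), vector)
```

The vector is only meaningful together with its vocabulary: the same counts in a different word order would describe a different text.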

Representing a single text as a list of numbers makes almost no sense. However, if we have a collection of documents (for example, all the web pages indexed by a particular search engine), we can build the so-called vector space model of the available texts.

It sounds daunting, but it is actually simple. Imagine a spreadsheet where each column is the bag of words of one text (a text vector) and each row corresponds to one word (a word vector). The number of columns equals the number of documents in the collection. The number of rows equals the number of unique words that occur in the entire collection.

The value at the intersection of a row and a column is the number of times the corresponding word appears in the corresponding text. The table below shows a vector model for Shakespeare's plays. For simplicity we use only four words.

Word         As You Like It   Twelfth Night   Julius Caesar   Henry V
battle       1                0               7               13
excellent    114              80              62              89
fool         36               58              1               4
wit          20               15              2               3
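A table of this kind can be assembled from individual bags of words. The sketch below uses two made-up miniature documents, not the plays themselves, and the function name term_document_matrix is an assumption for illustration.

```python
from collections import Counter

def term_document_matrix(documents):
    """Build word rows and document columns from tokenized documents.

    documents: dict mapping a title to its list of tokens.
    Returns the column titles and a dict {word: [count per document]}.
    """
    bags = {title: Counter(tokens) for title, tokens in documents.items()}
    vocabulary = sorted(set().union(*bags.values()))
    titles = list(documents)
    matrix = {word: [bags[title][word] for title in titles] for word in vocabulary}
    return titles, matrix

# Hypothetical toy collection, used only to show the shape of the result.
docs = {
    "doc_a": ["battle", "fool", "fool"],
    "doc_b": ["wit", "battle"],
}
titles, matrix = term_document_matrix(docs)
for word, row in matrix.items():
    print(word, row)
```

Each row of the result is a word vector and each column, read top to bottom, is a text vector.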

As noted earlier, a bag of words is in fact a vector. The advantage of vectors is that we can measure the distance or the angle between them. The smaller the distance or the angle, the more "similar" the vectors and the documents they represent. This is done with the cosine similarity measure. For vectors of non-negative word counts the result ranges from 0 to 1. The higher the value, the more similar the documents.
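For reference, the cosine similarity of two vectors is usually written as shown below. The symbols a, b and n (the number of words in the vocabulary) are notation introduced here, not taken from the article.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}
         {\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

As a quick check, for a = (1, 2) and b = (2, 4) the numerator is 1·2 + 2·4 = 10 and the denominator is √5 · √20 = 10, so the similarity is 1: the vectors point in the same direction even though their lengths differ.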

Searching for a relevant document

Suppose a user enters the query "battle of Agincourt". This is a short document that can be embedded in the same vector space as in the example above. The corresponding vector is [1, 0, 0, 0]: the counts for "excellent", "fool" and "wit" are zero. We can then compute the similarity between the search query and each document in the collection.

The results are shown in the table below. Henry V fits the query best. This is not surprising, since the word "battle" occurs in this text more often than in the others, so the document can be considered more relevant to the query. Note also that not every word of the search query has to be present in the text.

Play                               Similarity
As You Like It                     0.008249825
Twelfth Night, or What You Will    0
Julius Caesar                      0.11211846
Henry V                            0.144311
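The similarity values can be reproduced from the word counts of the four plays given earlier. The sketch below is one straightforward way to do it; the names COUNTS, query_vector and rank are assumptions introduced here, and query words outside the four-word vocabulary ("of", "Agincourt") are simply ignored.

```python
import math

PLAYS = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
# Word counts per play, in the order of PLAYS (taken from the table above).
COUNTS = {
    "battle": [1, 0, 7, 13],
    "excellent": [114, 80, 62, 89],
    "fool": [36, 58, 1, 4],
    "wit": [20, 15, 2, 3],
}
VOCAB = list(COUNTS)

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_vector(query):
    """Count vocabulary words in the query; other words are ignored."""
    words = query.lower().split()
    return [words.count(word) for word in VOCAB]

def rank(query):
    """Return (play, similarity) pairs, most similar first."""
    q = query_vector(query)
    scores = []
    for i, play in enumerate(PLAYS):
        column = [COUNTS[word][i] for word in VOCAB]
        scores.append((play, cosine(q, column)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

for play, score in rank("battle of Agincourt"):
    print(f"{play}: {score:.4f}")
```

Because cosine similarity divides by the vector lengths, a play is rewarded for the share of "battle" among its counted words, not only for the absolute count.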


This approach has several obvious drawbacks:

It is vulnerable to keyword density. The relevance of a document to a search query can be raised significantly simply by repeating a word as many times as needed to surpass the competing documents in the collection. This is exactly what worked in search engines at the start, in the late 1990s: it was enough to oversaturate a text with keywords, and first place in the results was guaranteed.

The bags of words for documents such as

I was impressed, it was not bad! and

I was not impressed, it was bad! will be exactly the same, although the sentences have different meanings. Remember that the bag-of-words model discards the entire structure of the underlying document.

A bag-of-words model based on raw word frequency is not the best measure. The search results are skewed toward documents with a high density of the query keywords, even though these documents may not actually contain the desired information.
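The loss of word order is easy to demonstrate: two sentences that use the same words in a different order, such as "I was impressed, it was not bad!" and "I was not impressed, it was bad!", produce identical bags. The tokenization below (lowercase, keep only runs of word characters) is a simplifying assumption made for this sketch.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep only word characters, and count the words."""
    return Counter(re.findall(r"\w+", text.lower()))

positive = bag_of_words("I was impressed, it was not bad!")
negative = bag_of_words("I was not impressed, it was bad!")

# Both sentences contain exactly the same words the same number of times.
same_bag = positive == negative
print(same_bag)
```

Any similarity measure computed from these two bags alone, including cosine similarity, will therefore treat the two sentences as the same document.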

In the next part of the article we will cover:

Zipf's law and the TF-IDF method. How semantic search works.

