An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets
D.V. Mikhaylov, A.P. Kozlov, G.M. Emelyanov


Yaroslav-the-Wise Novgorod State University, Velikii Novgorod, Russia

Full text of the article: in Russian.

This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus), and assessing how completely the initial phrases reflect the actual knowledge revealed. The main practical goal is to find the most rational way to express a knowledge fragment in a given natural language so that it can then be recorded in the thesaurus and ontology of a subject area. These problems matter when building systems for processing, analyzing, evaluating, and understanding information. The relevance of a text to the initial phrase, in terms of the fragment of actual knowledge described (including the forms of its expression in a given natural language), is measured by estimating the coupling strength of words from the initial phrase that co-occur in phrases of the analyzed text, and by classifying these words according to their TF-IDF values relative to the text corpus. To reveal the constituents of the image of the initial phrase as combinations of related words, the paper extends word links from traditional bigrams to n-grams of three or more elements. Variants of link detection with and without a database of known syntactic relations are considered. To describe the fragment of expert knowledge revealed in the corpus texts more completely, sets of initial phrases that are mutually equivalent or complementary in sense and related to the same image are taken into consideration. Compared with searching for components of the analyzed image in a syntactically annotated text corpus, the proposed text selection method reduces, on average by a factor of 17, the output of phrases that are irrelevant to the initial ones in terms of either the knowledge fragment described or the forms of its expression in a given natural language.
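
The two measurements named in the abstract, TF-IDF values of words relative to the corpus and the coupling strength of words co-occurring in phrases, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the coupling measure, so a Tanimoto-style ratio (phrases containing all words of a group divided by phrases containing any of them) is assumed here, and the function names `tf_idf` and `coupling_strength`, the tokenized input layout, and the parameter `n` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(corpus):
    """Return one dict per document mapping word -> TF-IDF.

    corpus: list of documents; each document is a list of phrases;
    each phrase is a list of lowercase tokens (assumed layout).
    """
    n_docs = len(corpus)
    doc_words = [[w for phrase in doc for w in phrase] for doc in corpus]
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in doc_words:
        df.update(set(words))
    scores = []
    for words in doc_words:
        counts = Counter(words)
        total = len(words)
        scores.append({w: (c / total) * math.log(n_docs / df[w])
                       for w, c in counts.items()})
    return scores


def coupling_strength(doc, query_words, n=2):
    """Assumed Tanimoto-style coupling for every group of n query words.

    For each group: (phrases containing all words of the group)
    divided by (phrases containing at least one of them).
    n=2 gives bigram links; n>=3 gives links of three or more words.
    """
    occ = {w: {i for i, phrase in enumerate(doc) if w in phrase}
           for w in query_words}
    result = {}
    for group in combinations(sorted(query_words), n):
        sets = [occ[w] for w in group]
        union = set.union(*sets)
        inter = set.intersection(*sets)
        result[group] = len(inter) / len(union) if union else 0.0
    return result


# Toy corpus: two documents, phrases already split into tokens.
corpus = [
    [["entropy", "of", "printed", "text"], ["entropy", "of", "source"]],
    [["printed", "page"]],
]
scores = tf_idf(corpus)
links = coupling_strength(corpus[0], {"entropy", "printed", "source"})
triples = coupling_strength(corpus[0], {"entropy", "printed", "source"}, n=3)
```

In this toy corpus "printed" occurs in both documents and therefore gets a TF-IDF of zero, while "entropy" and "printed" share one of the two phrases that contain either word. Raising `n` above 2 corresponds to the extension from bigrams to longer word combinations. Ranking texts by these values, or filtering them with a database of syntactic relations, is outside the scope of the sketch.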

Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval.

Mikhaylov DV, Kozlov AP, Emelyanov GM. An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets. Computer Optics 2017; 41(3): 461-471. DOI: 10.18287/2412-6179-2017-41-3-461-471.


