Extraction of knowledge and relevant linguistic means with efficiency estimation for the formation of subject-oriented text sets
D.V. Mikhaylov, A.P. Kozlov, G.M. Emelyanov


Yaroslav-the-Wise Novgorod State University, Velikii Novgorod, Russia

The full text of the article is in Russian.

This paper addresses two interrelated problems: extracting knowledge units from a set of subject-oriented texts (a corpus) and selecting texts for the corpus by analyzing their relevance to an initial phrase. The main practical goal is to find the most rational way of expressing a knowledge fragment in a given natural language, for subsequent representation in the thesaurus and ontology of a subject area. Both problems are important in the construction of systems for processing, analyzing, evaluating and understanding information. The relevance of a text to the initial phrase, in terms of the described fragment of actual knowledge (including the forms of its expression in a given natural language), is defined as the total numerical estimate of the coupling strength of those words of the initial phrase that occur together in phrases of the text under analysis. The paper considers known variants of such estimates and their application to the search for distinct components, namely words and word combinations, that reflect the initial phrase in the texts selected for the topical corpus. Compared with searching for such components in a syntactically annotated text corpus, the proposed text selection method yields, on average, a 15-fold reduction in the output of phrases that are irrelevant to the initial one in terms of either the described knowledge fragment or the forms of its expression in a given natural language.
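The relevance estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which coupling-strength estimate is used, so the Tanimoto coefficient over phrase-level co-occurrence counts is assumed here as one well-known option. The function names (`split_phrases`, `tanimoto`, `relevance`), the regex-based sentence splitting, and the absence of lemmatization and stop-word filtering are all simplifications introduced for illustration.

```python
from itertools import combinations
import re


def split_phrases(text):
    """Split a text into phrases (naively, on . ! ?) and turn each
    phrase into a set of lowercase word tokens."""
    phrases = re.split(r"[.!?]+", text)
    return [set(re.findall(r"\w+", p.lower())) for p in phrases if p.strip()]


def tanimoto(word_a, word_b, phrases):
    """Assumed coupling strength of two words: k / (a + b - k), where a and b
    are the numbers of phrases containing each word and k is the number of
    phrases containing both."""
    a = sum(1 for p in phrases if word_a in p)
    b = sum(1 for p in phrases if word_b in p)
    k = sum(1 for p in phrases if word_a in p and word_b in p)
    denom = a + b - k
    return k / denom if denom else 0.0


def relevance(initial_phrase, text):
    """Total coupling strength over all pairs of distinct words
    of the initial phrase, measured on the phrases of the text."""
    phrases = split_phrases(text)
    words = sorted(set(re.findall(r"\w+", initial_phrase.lower())))
    return sum(tanimoto(u, v, phrases) for u, v in combinations(words, 2))
```

Under these assumptions, candidate texts could be ranked for inclusion in the corpus by sorting them in descending order of `relevance(initial_phrase, text)`; a text in which the words of the initial phrase never share a phrase scores zero.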

Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval.

Mikhaylov DV, Kozlov AP, Emelyanov GM. Extraction of knowledge and relevant linguistic means with efficiency estimation for the formation of subject-oriented text sets. Computer Optics 2016; 40(4): 572-582. DOI: 10.18287/2412-6179-2016-40-4-572-582.


