An approach based on TF-IDF metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets
D.V. Mikhaylov, A.P. Kozlov, G.M. Emelyanov


Yaroslav-the-Wise Novgorod State University, Novgorod, Russia

Full text of the article: in Russian.


In this paper we consider the problem of extracting knowledge units from sets of subject-oriented texts, where each such set is treated as a corpus. The main practical goal is to find the most rational way of expressing a knowledge fragment in a given natural language, so that it can then be reflected in the thesaurus and ontology of a subject area. The problem is important for building systems that process, analyze, estimate, and understand information represented, in particular, by images. By applying TF-IDF metrics to classify the words of an initial phrase in relation to the given text corpora, we address the task of selecting the phrases closest to the initial one in terms of either the fragment of actual knowledge they describe or the forms of its expression in a given natural language.

pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge.

Mikhaylov DV, Kozlov AP, Emelyanov GM. An approach based on TF-IDF metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets. Computer Optics 2015; 39(3): 429-38. DOI: 10.18287/0134-2452-2015-39-3-429-438.


