ruTS

Library for statistics extraction from texts in Russian

Overview

ruTS is a library for extracting statistics from Russian-language texts. It provides the following functionality:

  • Object extraction - tools for extracting sentences and words from a text, which can then be used to compute statistics
  • Basic statistics - basic linguistic statistics for a text (the number of complex words, syllables, letters, etc.)
  • Readability metrics - readability metrics for a text (SMOG Index, Flesch-Kincaid Grade Level, etc.)
  • Lexical diversity metrics - lexical diversity metrics for a text (Hapax Legomena Index, Type-Token Ratio, etc.)
  • Morphological statistics - morphological features of a text (part of speech, gender, transitivity, etc.)
  • Datasets - access to several preprocessed datasets (Soviet reading books for literature classes, the collected works of Stalin)
  • Visualization - graphical visualization of a text (Zipf's law, Literature Fingerprinting, Word Tree)
  • Components - the library's classes as components of spaCy pipelines
  • API - access to the library's functions through a RESTful interface
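
To give a feel for what the lexical diversity metrics listed above measure, here is a minimal standard-library sketch of the Type-Token Ratio and a hapax legomena list. It is not ruTS code: the helper names (`tokenize`, `type_token_ratio`, `hapax_legomena`) and the simple regex tokenizer are assumptions made for illustration. The library's own tokenization and its Hapax Legomena Index formula may differ.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase the text and keep runs of letters."""
    # [^\W\d_] matches any Unicode letter, so Cyrillic words are covered.
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(tokens: List[str]) -> float:
    """Distinct words (types) divided by the total number of words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_legomena(tokens: List[str]) -> List[str]:
    """Words that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, n in counts.items() if n == 1]


text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
tokens = tokenize(text)
print(len(tokens))                          # 15 tokens
print(round(type_token_ratio(tokens), 2))   # 0.73 (11 types / 15 tokens)
print(len(hapax_legomena(tokens)))          # 8 words occur only once
```

A high Type-Token Ratio means few repeated words. The value depends heavily on text length, which is one reason libraries usually offer several diversity metrics and not just this one.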

Installation

ruTS requires Python 3.7 or higher. To install the latest stable version from PyPI:

$ pip install ruts

Usage

from ruts import ReadabilityStats

# Sample text: a Russian riddle
text = "Ног нет, а хожу, рта нет, а скажу: когда спать, когда вставать, когда работу начинать"
rs = ReadabilityStats(text)
# Return the readability metrics as a dictionary
rs.get_stats()

    {'automated_readability_index': 0.2941666666666656,
    'coleman_liau_index': 0.2941666666666656,
    'flesch_kincaid_grade': 3.4133333333333304,
    'flesch_reading_easy': 83.16166666666666,
    'lix': 48.333333333333336,
    'smog_index': 0.05}

rs.print_stats()

                    Метрика                 | Значение 
    --------------------------------------------------
    Тест Флеша-Кинкайда                     |   3.41   
    Индекс удобочитаемости Флеша            |  83.16   
    Индекс Колман-Лиау                      |   0.29   
    Индекс SMOG                             |   0.05   
    Автоматический индекс удобочитаемости   |   0.29   
    Индекс удобочитаемости LIX              |  48.33  
