Text Dataset Cleaner
Founded 4 years ago

Python pipeline for cleaning text datasets

A customizable tool for cleaning text data (profanity, wrong language, HTML tags, etc.). You choose which handlers to run and in what order.

This tool is useful for anyone who processes large amounts of text data to train neural networks. It cleans and normalizes the text so that training produces better results.

GitHub repo: https://github.com/TextDatasetCleaner/TextDatasetCleaner

To get started, all you need to do is specify a configuration file and point to the input file to be processed.

An example configuration (in YAML format):

PRE_PROCESSING:
  - unique
PROCESSING:
  - detect_language:
      language_code: ru
  - filter_stop_words:
      language_code: ru
      mode: replace
  - filter_url
POST_PROCESSING:
  - shuffle

First, all duplicate lines are removed from the whole file (pre-processing). Then, for each line, the language is detected, stop words are removed, and lines containing URLs are dropped (processing). Finally, all lines are shuffled (post-processing).
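The three stages can be illustrated with a minimal Python sketch. This is not TextDatasetCleaner's actual API: every function name here (`unique`, `filter_url`, `shuffle`, `run_pipeline`) is hypothetical, the URL pattern is a rough approximation, and `filter_url` is assumed to drop any line that contains a URL. Language detection and stop-word filtering are left out because they depend on external resources.

```python
import random
import re
from typing import Callable, Iterable, List, Optional

# Hypothetical sketch only; these names are not TextDatasetCleaner's real API.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def unique(lines: List[str]) -> List[str]:
    """Pre-processing: drop duplicate lines, keeping the first occurrence."""
    return list(dict.fromkeys(lines))


def filter_url(line: str) -> Optional[str]:
    """Processing: return None to drop a line that contains a URL (assumed behavior)."""
    return None if URL_RE.search(line) else line


def shuffle(lines: List[str]) -> List[str]:
    """Post-processing: return the lines in random order."""
    shuffled = list(lines)
    random.shuffle(shuffled)
    return shuffled


def run_pipeline(
    lines: Iterable[str],
    pre: List[Callable[[List[str]], List[str]]],
    processing: List[Callable[[str], Optional[str]]],
    post: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run whole-file steps, then per-line handlers, then whole-file steps."""
    data = list(lines)
    for step in pre:
        data = step(data)

    kept = []
    for line in data:
        current: Optional[str] = line
        for handler in processing:
            current = handler(current)
            if current is None:  # a handler rejected the line
                break
        if current is not None:
            kept.append(current)

    for step in post:
        kept = step(kept)
    return kept


sample = ["hello world", "hello world", "see https://example.com", "another line"]
result = run_pipeline(sample, pre=[unique], processing=[filter_url], post=[shuffle])
```

Because each stage is just a list of callables, reordering or swapping handlers only means changing the lists passed to `run_pipeline`, which mirrors how the configuration file lists handlers per stage.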

More than 20 processors have been implemented so far; the full list can be found here.

Contribute

At the moment, the project has only one developer (@saippuakauppias in ODS slack), but the list of ideas waiting to be implemented is very long (you can find it here).

It would be great if someone who also cares about this task and wants to make the world a better place joined the project team. After all, there is no universal tool for cleaning text data yet, so we can be among the first :)

