Text Dataset Cleaner — Open Data Science

A customizable tool for cleaning text data (profanity, wrong-language lines, HTML tags, etc.). Handlers can be selected and arranged in any order.

This tool is useful for anyone who processes large amounts of text data to train neural networks. It helps clean and normalize the text to improve training results.

GitHub repo: https://github.com/TextDatasetCleaner/TextDatasetCleaner

All you need to do to start is to specify a configuration file and point to the input file to be processed.

An example configuration (in YAML format):

PRE_PROCESSING:
  - unique
PROCESSING:
  - detect_language:
      language_code: ru
  - filter_stop_words:
      language_code: ru
      mode: replace
  - filter_url
POST_PROCESSING:
  - shuffle

First, all duplicate lines are removed from the whole file (pre-processing). Then, for each line, the language is detected, stop words are removed, and lines containing URLs are dropped (processing). At the end, all lines are shuffled (post-processing).
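The three stages described above can be illustrated with a minimal Python sketch. This is not the project's code: the function names simply mirror the configuration keys, the language check is a crude Cyrillic-versus-Latin letter count standing in for real language detection, the stop-word list is a tiny illustrative sample, and `mode: replace` is assumed to mean "delete stop words from the line".

```python
import random
import re

# The YAML configuration above, written as the Python structure a YAML parser would produce.
CONFIG = {
    "PRE_PROCESSING": ["unique"],
    "PROCESSING": [
        {"detect_language": {"language_code": "ru"}},
        {"filter_stop_words": {"language_code": "ru", "mode": "replace"}},
        "filter_url",
    ],
    "POST_PROCESSING": ["shuffle"],
}

STOP_WORDS = {"ru": {"и", "в", "на", "не"}}  # illustrative sample, not a real list
URL_RE = re.compile(r"https?://\S+|www\.\S+")
CYRILLIC_RE = re.compile(r"[а-яё]", re.IGNORECASE)
LATIN_RE = re.compile(r"[a-z]", re.IGNORECASE)


def unique(lines):
    """Whole-file step: drop duplicate lines, keeping first occurrences in order."""
    return list(dict.fromkeys(lines))


def shuffle(lines, seed=0):
    """Whole-file step: return a shuffled copy (seeded here for reproducibility)."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    return lines


def detect_language(line, language_code):
    """Per-line step. Stand-in heuristic: 'ru' means more Cyrillic than Latin letters."""
    if language_code != "ru":
        raise ValueError("this sketch only handles 'ru'")
    cyrillic = len(CYRILLIC_RE.findall(line))
    latin = len(LATIN_RE.findall(line))
    return line if cyrillic > latin else None  # None drops the line


def filter_stop_words(line, language_code, mode="replace"):
    """Per-line step. Assumption: 'replace' deletes stop words from the line."""
    if mode != "replace":
        raise ValueError("this sketch only handles mode='replace'")
    kept = [w for w in line.split() if w.lower() not in STOP_WORDS[language_code]]
    return " ".join(kept) or None


def filter_url(line):
    """Per-line step: drop any line that contains a URL."""
    return None if URL_RE.search(line) else line


FILE_STEPS = {"unique": unique, "shuffle": shuffle}
LINE_STEPS = {
    "detect_language": detect_language,
    "filter_stop_words": filter_stop_words,
    "filter_url": filter_url,
}


def parse_entry(entry):
    """Turn 'name' or {'name': {options}} into a (name, options) pair."""
    if isinstance(entry, str):
        return entry, {}
    ((name, options),) = entry.items()
    return name, options or {}


def run(lines, config):
    """Apply pre-processing, per-line processing, then post-processing."""
    for entry in config.get("PRE_PROCESSING", []):
        name, options = parse_entry(entry)
        lines = FILE_STEPS[name](lines, **options)

    kept = []
    for line in lines:
        for entry in config.get("PROCESSING", []):
            name, options = parse_entry(entry)
            line = LINE_STEPS[name](line, **options)
            if line is None:  # a step rejected the line, skip the remaining steps
                break
        if line is not None:
            kept.append(line)

    for entry in config.get("POST_PROCESSING", []):
        name, options = parse_entry(entry)
        kept = FILE_STEPS[name](kept, **options)
    return kept


if __name__ == "__main__":
    sample = ["Привет и пока", "Привет и пока", "Hello world", "Мама мыла раму"]
    print(run(sample, CONFIG))
```

Because each stage is just a list of names looked up in a table, reordering or removing handlers in the configuration changes the pipeline without touching the code, which is the property the tool advertises.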

More than 20 processors have been implemented so far; the full list can be found here.

Contribute

At the moment, the project has only one developer (@saippuakauppias in ODS Slack), but the list of ideas waiting to be implemented is very long (you can find it here).

It would be great if someone who also cares about this task and wants to make the world a better place joined the project team. After all, there is no universal tool for text data cleaning yet, so we can be among the first :)

Denis VLeader


Text Dataset Cleaner
Founded 5 years ago