This project proposes a solution to the problem of detecting sarcastic comments in Russian. A corpus of Russian-language comments was built with binary labels: sarcastic or not sarcastic. The problem was approached with three natural language processing methods: TF-IDF with Logistic Regression, Recurrent Neural Networks, and Transformers. The results of these approaches were then compared.
Sarcasm recognition is a popular topic. Many articles on it have been published since 2010, but none of them addresses the recognition of sarcastic comments in Russian.
The link to the GitHub repository is here.
To build the corpus of Russian sarcastic and non-sarcastic comments, we started from a text-based corpus of English news headlines (here is a link to the website where the English news headlines corpus can be downloaded). We translated the headlines into Russian and then cleaned all the samples. As a result, we obtained 9554 sarcastic and 10479 non-sarcastic translated samples.
However, we wanted the corpus to be more "Russian", that is, to contain distinctly Russian sarcastic statements that arise from the Russian mentality. For that reason we also used a second corpus, the corpus of Russian jokes (here is a link to the website where the corpus of Russian jokes can be downloaded). This corpus is large: 130204 samples. We treat these jokes as Russian sarcastic statements, on the assumption that every Russian joke carries a Russian-sarcastic emotional background.
The final corpus consists of 5000 translated sarcastic samples and 5000 Russian jokes as sarcastic comments (labelled 1), and 10000 translated non-sarcastic samples as non-sarcastic comments (labelled 0). This gives a balanced corpus of 20000 comments.
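The assembly of the corpus described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_corpus`, the use of uniform random sampling, and the seed are assumptions, since the text does not say how the 5000 samples were selected from each source.

```python
import random


def build_corpus(translated_sarcastic, translated_plain, jokes,
                 n_per_sarcastic_source=5000, n_plain=10000, seed=42):
    """Return a shuffled list of (text, label) pairs, where 1 = sarcastic.

    Sampling uniformly at random is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    # Sarcastic class: equal parts translated headlines and Russian jokes.
    sarcastic = (rng.sample(translated_sarcastic, n_per_sarcastic_source)
                 + rng.sample(jokes, n_per_sarcastic_source))
    # Non-sarcastic class: translated headlines only.
    plain = rng.sample(translated_plain, n_plain)
    corpus = [(text, 1) for text in sarcastic] + [(text, 0) for text in plain]
    rng.shuffle(corpus)
    return corpus
```

With the default arguments and the three cleaned source lists, this yields 20000 pairs, half of them labelled 1.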
The split of the dataset into train, validation and test sets is shown in the following table.
We decided to train three models: TF-IDF with Logistic Regression, a Recurrent Neural Network, and a Transformer.
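The first of these, the TF-IDF with Logistic Regression baseline, could look roughly like the scikit-learn pipeline below. The hyperparameters (`ngram_range`, `max_iter`) and the toy sentences are assumptions chosen for illustration; the text does not state the settings that were actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def make_baseline():
    """TF-IDF features followed by a logistic regression classifier."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),  # assumed settings
        LogisticRegression(max_iter=1000),                    # assumed settings
    )


# Toy data for illustration only (1 = sarcastic, 0 = not sarcastic).
toy_texts = [
    "Ну да, конечно, отличная идея",
    "Совет директоров утвердил бюджет",
    "Какой сюрприз, опять понедельник",
    "В городе открыли новую школу",
]
toy_labels = [1, 0, 1, 0]

model = make_baseline()
model.fit(toy_texts, toy_labels)
predictions = model.predict(["Конечно, это лучшая новость дня"])
```

In practice the pipeline would be fitted on the training split of the corpus and evaluated on the held-out test split.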
For the RNN, we used "navec", a lightweight GloVe model of relatively good quality (here you can find the navec GloVe model). Using the GloVe model, we build a dictionary and tokenize every comment, that is, convert text strings into sequences of numbers. The vocabulary size was limited to 30,000 words.
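The tokenization step, converting each comment into a fixed-length sequence of numbers with a capped vocabulary, can be sketched in plain Python. Note the assumptions: this sketch builds the dictionary from word frequencies in the corpus rather than from the navec vocabulary, and the names `build_vocab` and `encode`, the regular expression, and `max_len` are all hypothetical.

```python
import re
from collections import Counter

PAD, UNK = 0, 1                    # reserved ids for padding and unknown words
TOKEN_RE = re.compile(r"\w+")      # simple word tokenizer (an assumption)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocab(texts, max_size=30000):
    """Map the most frequent words to integer ids, capped at max_size entries."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for word, _ in counts.most_common(max_size - len(vocab)):
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, max_len=32):
    """Convert a comment into a list of ids, truncated or padded to max_len."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))
```

Words outside the vocabulary map to the `<unk>` id, so the cap of 30,000 words bounds the size of the embedding matrix the RNN needs.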
For the transformer, we used a pre-trained BERT-type model. The rubert-tiny2 sentence encoder was chosen for classification. The pre-trained tokenizer and model were loaded from Hugging Face (here you can find the rubert-tiny2 model). For classification, a fully connected layer was added on top of the encoder.
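A possible shape for the encoder with an added fully connected layer is sketched below with PyTorch and the `transformers` library. Several details are assumptions rather than facts from this project: the Hugging Face id `cointegrated/rubert-tiny2`, the use of the first-token ([CLS]) representation as the sentence vector, the dropout rate, and the class and function names.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cointegrated/rubert-tiny2"  # assumed Hugging Face model id


class SarcasmClassifier(nn.Module):
    """Pre-trained encoder followed by one fully connected layer."""

    def __init__(self, encoder, num_labels=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        # The added layer maps the sentence vector to class logits.
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # first-token embedding
        return self.head(self.dropout(cls_vector))


def load_classifier(model_name=MODEL_NAME):
    """Download the pre-trained tokenizer and encoder, then wrap the encoder."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    return tokenizer, SarcasmClassifier(encoder)
```

A typical call would be `tokenizer, model = load_classifier()`, followed by `model(**tokenizer(texts, padding=True, truncation=True, return_tensors="pt"))` after dropping any keys the `forward` method does not accept, such as `token_type_ids`.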
The F1-scores of the presented methods on the test set are shown in the following table.
The third approach, Transformers, achieves a higher detection score than Recurrent Neural Networks, whose result is close to that of the baseline Logistic Regression.
Given that the presented model is the first capable of detecting Russian sarcasm, the results are quite encouraging for further research in this field.
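For reference, the F1-score used in this comparison is the harmonic mean of precision and recall. The sketch below computes it for the sarcastic class; treating label 1 as the positive class is an assumption, since the text does not say whether a binary or an averaged F1 was reported.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero/undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with true labels `[1, 1, 0, 0]` and predictions `[1, 0, 1, 0]`, precision and recall are both 0.5, so the F1-score is 0.5.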
Dmitry Davidov, Oren Tsur, Ari Rappoport Semi-supervised recognition of sarcastic sentences in twitter and amazon // Proceedings of the Fourteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, 2010.
Roberto González-Ibánez, Smaranda Muresan, Nina Wacholder Identifying sarcasm in Twitter: a closer look // Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, Association for Computational Linguistics, 2011.
Stephanie Lukin, Marilyn Walker Really? well. apparently bootstrapping improves the performance of sarcasm and nastiness classifiers for online dialogue // Proceedings of the Workshop on Language Analysis in Social Media, 2013.
Peng Liu, et al. Sarcasm detection in social media based on imbalanced classification // International Conference on Web-Age Information Management, Springer International Publishing, 2014.
Francesco Barbieri, Horacio Saggion, Francesco Ronzano Modelling sarcasm in twitter, a novel approach // ACL 2014, 2014.
Elisabetta Fersini, Federico Alberto Pozzi, Enza Messina Detecting irony and sarcasm in microblogs: The role of expressive signals and ensemble classifiers // Data Science and Advanced Analytics (DSAA), 2015.
Ashwin Rajadesingan, Reza Zafarani, Huan Liu Sarcasm detection on twitter: A behavioral modeling approach // Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015.
Santosh Kumar Bharti, Korra Sathya Babu, Sanjay Kumar Jena Parsing-based sarcasm sentiment recognition in Twitter data // 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2015.
David Bamman, Noah A. Smith Contextualized sarcasm detection on twitter // Ninth International AAAI Conference on Web and Social Media, 2015.
Nabeela Altrabsheh, Mihaela Cocea, Sanaz Fallahkhair Detecting sarcasm from students' feedback in Twitter // Design for teaching and learning in a networked world, Springer International Publishing, 2015.
Debanjan Ghosh, Weiwei Guo, Smaranda Muresan Sarcastic or not: word embeddings to predict the literal or sarcastic meaning of words // Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2015.
Mondher Bouazizi, Tomoaki Otsuki A pattern-based approach for sarcasm detection on twitter // IEEE Transl., 2016.
S.K. Bharti, et al. Sarcastic sentiment detection in tweets streamed in real time: a big data approach // Digital Communications and Networks, 2016.