This project proposes a SOTA solution to the problem of paraphrase identification on the PAWS-Wiki test set. We use a Concatenate Pooler on top of a DeBERTa backbone, trained on the PAWS-Wiki and PAWS-QQP train sets, to reach F1 = 0.950, an improvement over the previous SOTA of 0.943. We also investigate the effect of the unlabeled part of PAWS-Wiki.
Code and full paper can be found here:
https://github.com/Sergey-Tkachenko/nlp_project_2023
Paraphrasing is a form of plagiarism in which another person's ideas, words, or work are presented in a different way: by switching words, changing sentence construction, or altering grammatical style. It may also include replacing some words with synonyms [Chowdhury and Bhattacharyya, 2018]. In short, a sentence can be defined as a paraphrase of another sentence if the two are not identical but share the same semantic meaning [Liu et al., 2022]. Large language models (LLMs) have shown high efficiency in paraphrasing tasks [Becker et al., 2023], and their use may lead to an increase in paraphrasing, which can compromise the integrity of legal writing. Using Transformer-based models for classification is an intuitive way to counteract this new form of plagiarism [Wahle et al., 2021]. Therefore, in this study we provide a solution to the paraphrase detection task using Transformer-based neural models. We present a SOTA architecture (as far as we know, on the PAWS-Wiki dataset) based on DeBERTaV3 [He et al., 2021].
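As a minimal sketch of the approach, the snippet below scores a sentence pair with a DeBERTaV3 classifier. The checkpoint name is the public backbone, not our fine-tuned model, so its classification head is randomly initialized until trained on PAWS; the example pair is illustrative:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Sentence-pair classification with a DeBERTaV3 backbone. After
# fine-tuning on PAWS, label 1 reads as "paraphrase".
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)
model.eval()

enc = tokenizer(
    "The flights leave from New York and arrive in Florida.",
    "The flights leave from Florida and arrive in New York.",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    probs = model(**enc).logits.softmax(dim=-1)
print(f"P(paraphrase) = {probs[0, 1]:.3f}")
```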
In our work, we use the PAWS dataset. PAWS training data dramatically improves performance on challenging examples and makes models more robust to real-world examples [Zhang et al., 2019]. The dataset consists of two parts: Wiki and QQP (Quora Question Pairs).
Examples are generated with controlled language models and back translation, and each example receives five human ratings in both phases. The main idea of PAWS is to generate adversarial examples that break NLP systems. Some examples from the PAWS dataset [Zhang et al., 2019]:
non-paraphrase: "Although interchangeable, the body pieces on the 2 cars are not similar." / "Although similar, the body parts are not interchangeable on the 2 cars."
paraphrase: "Katz was born in Sweden in 1947 and moved to New York City at the age of 1." / "Katz was born in 1947 in Sweden and moved to New York at the age of one."
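For reference, the Wiki portion of PAWS can be pulled from the Hugging Face hub (dataset id `paws`; config names as published by Google Research). A small sketch:

```python
from datasets import load_dataset

labeled = load_dataset("paws", "labeled_final")      # human-labeled Wiki pairs
swap_only = load_dataset("paws", "labeled_swap")     # extra swap-based train pairs
unlabeled = load_dataset("paws", "unlabeled_final")  # noisy, silver-labeled pairs

pair = labeled["train"][0]
print(pair["sentence1"], pair["sentence2"], pair["label"], sep="\n")
```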
In this work, we conducted two major experiments. The first was designed to measure the effect of adding PAWS-QQP and the unlabeled Wiki data to the training set. The second was designed to determine the optimal architecture.
We used standard binary classification metrics to evaluate our models:
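With TP, FP, and FN counted on the positive (paraphrase) class, the headline metrics are defined as:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```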
| Architecture | F1 | Recall | Precision | Accuracy | ROC AUC |
|---|---|---|---|---|---|
| Baseline (previous SOTA) | 0.943 | 0.956 | 0.930 | – | – |
| Perceptron Pooler | 0.943 | 0.940 | 0.946 | 0.950 | 0.984 |
| Mean Max Pooler | 0.946 | 0.959 | 0.934 | 0.952 | 0.986 |
| Convolutional Pooler | 0.948 | 0.955 | 0.942 | 0.954 | 0.988 |
| Concatenate Pooler | 0.950 | 0.965 | 0.935 | 0.955 | 0.985 |
| LSTM Pooler | 0.946 | 0.962 | 0.930 | 0.951 | 0.985 |
| Concatenate + LSTM Pooler | 0.948 | 0.959 | 0.937 | 0.954 | 0.985 |
| Concatenate + PAWS-QQP | 0.950 | 0.970 | 0.930 | 0.954 | 0.986 |
The results show that adding pooler layers improves the performance of the model, and the best architecture is the Concatenate Pooler. This may be because the resulting hidden vector is large enough to capture all the necessary information about the input sequence. It also seems that capturing information from the whole sequence does not improve performance, as the Concatenate + LSTM Pooler performed slightly worse than the plain Concatenate Pooler; this may be caused by overfitting as well.
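The exact pooler internals live in the linked repository; as a rough illustration only, here is one plausible reading of a Concatenate Pooler, assuming it concatenates the [CLS] hidden states of the last four encoder layers before a linear head (the layer count and head shape are our assumptions, not confirmed details):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ConcatenatePooler(nn.Module):
    """Concatenate the [CLS] states of the last `n_layers` encoder layers."""

    def __init__(self, backbone="microsoft/deberta-v3-base", n_layers=4, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone, output_hidden_states=True)
        self.n_layers = n_layers
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden * n_layers, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states = embedding output plus one tensor per encoder layer
        cls_states = [h[:, 0] for h in out.hidden_states[-self.n_layers:]]
        pooled = torch.cat(cls_states, dim=-1)  # (batch, hidden * n_layers)
        return self.classifier(pooled)          # logits, (batch, n_classes)
```

The intuition matches the observation above: concatenation multiplies the width of the pooled vector by the number of layers used, so the classifier sees a much larger representation than with a single [CLS] vector.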
The unlabeled part of the dataset did not boost the performance of our models. We suppose the main reason is that the unlabeled part is too noisy: it may be useful for pre-training, but not for fine-tuning. PAWS-QQP did increase the overall performance for both the baseline and our solution. Therefore, the best combination of training sets is PAWS-Wiki plus PAWS-QQP.
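A sketch of merging the two labeled train sets follows. The QQP file path is hypothetical: PAWS-QQP is not redistributable and must be regenerated locally from the Quora Question Pairs data with Google's scripts, and we assume its TSV uses the same column names as PAWS-Wiki:

```python
from datasets import concatenate_datasets, load_dataset

wiki_train = load_dataset("paws", "labeled_final", split="train")
qqp_train = load_dataset(
    "csv", data_files="paws_qqp/output/train.tsv", split="train", delimiter="\t"
)
# Align column types with the Wiki split before concatenation.
qqp_train = qqp_train.cast(wiki_train.features)
train = concatenate_datasets([wiki_train, qqp_train]).shuffle(seed=42)
```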