Final project for Natural Language Processing course (stream 3, autumn 2022)
In this project, we implemented a model that classifies medical specialties based on the given medical transcriptions. The main goal is to correctly predict the medical specialty from the transcription text and to present our approach to solving this problem.
Anastasia: created the text corpus, developed the program code, and prepared the report.
Arina: developed the program code and prepared the report.
Evgeniy: developed the program code and prepared the report.
Medical data is extremely hard to find due to HIPAA privacy regulations. However, https://www.mtsamples.com/ contains sample transcription reports for many specialties and different work types. This website is designed to give access to a large collection of transcribed medical reports, which can be used by learning as well as working medical transcriptionists for their daily transcription needs. We collected the data manually by copying it from the site and obtained 4314 medical transcriptions from various fields of medicine. The dataset contains six columns, including 'description', 'medical specialty', 'sample name', 'transcription', and 'keywords'.
As part of pre-processing, we kept only the categories that have more than 50 samples, which reduced the number of categories from 33 to 17. The class 'Consult - History and Phy.' has a very large number of records, almost three times as many as some of the other classes in the dataset. Since we classify medical specialties based on medical transcriptions, we need only the 'transcription' and 'medical specialty' columns of the dataset.
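The category filtering and column selection can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function names, the dictionary-based record layout, and the toy data are assumptions, and the threshold keeps categories with more than `min_samples` records, which is consistent with the reduction from 33 to 17 categories.

```python
from collections import Counter


def keep_frequent_categories(records, label_key="medical specialty", min_samples=50):
    """Keep only records whose category occurs more than `min_samples` times."""
    counts = Counter(record[label_key] for record in records)
    return [record for record in records if counts[record[label_key]] > min_samples]


def select_columns(records, columns=("transcription", "medical specialty")):
    """Drop every column except the ones needed for classification."""
    return [{column: record[column] for column in columns} for record in records]


# Toy data (hypothetical): three 'Surgery' records and one 'Allergy' record.
records = (
    [{"sample name": "a", "medical specialty": "Surgery", "transcription": "text"}] * 3
    + [{"sample name": "b", "medical specialty": "Allergy", "transcription": "text"}]
)

# With a threshold of 2, only the 'Surgery' records survive.
filtered = select_columns(keep_frequent_categories(records, min_samples=2))
```

On the real dataset the default `min_samples=50` would be used; the small threshold here only keeps the toy example short.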
Then we converted all the texts to lower case, removed punctuation and stop words, and performed lemmatization and tokenization.
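A rough sketch of such a text-cleaning step is shown below. It is an assumption-laden illustration, not the project's implementation: the stop-word list is a tiny hypothetical sample, and lemmatization is passed in as a callable (identity by default) so that a real lemmatizer, for example the `lemmatize` method of NLTK's `WordNetLemmatizer`, could be plugged in. Tokenization happens before stop-word removal here because stop words are matched token by token.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "is", "was", "with", "in"})

# Translation table that maps every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, stop_words=STOP_WORDS, lemmatize=lambda token: token):
    """Lowercase, strip punctuation, tokenize, drop stop words, lemmatize."""
    text = text.lower().translate(PUNCTUATION_TO_SPACE)
    tokens = text.split()  # whitespace tokenization
    return [lemmatize(token) for token in tokens if token not in stop_words]


tokens = preprocess("The patient was seen, with a HISTORY of asthma.")
```

For this input, `tokens` contains the four remaining words `patient`, `seen`, `history`, and `asthma`.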
After applying SVM and multiclass logistic regression, we obtained the following results. As the table shows, the best classification result is achieved by improving the dataset.
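For illustration, the two classifiers could be trained along the following lines with scikit-learn. This is only a sketch under stated assumptions: the text does not say how the transcriptions were vectorized, so the TF-IDF features, the choice of `LinearSVC` as the SVM variant, the hyperparameters, and the toy texts and labels are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, made-up training data standing in for the cleaned transcriptions.
texts = [
    "chest pain abnormal ecg cardiac catheterization",
    "coronary artery stent chest pain ecg",
    "knee arthroscopy meniscus tear repair",
    "fracture of the femur knee replacement",
    "seizure mri brain headache",
    "brain mri headache numbness seizure",
]
labels = [
    "Cardiovascular", "Cardiovascular",
    "Orthopedic", "Orthopedic",
    "Neurology", "Neurology",
]

# Assumption: TF-IDF features feed both classifiers.
models = {
    "svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    predictions[name] = model.predict(["chest pain and abnormal ecg"])[0]
```

On real data the models would be evaluated on a held-out test split rather than on a single hand-written query.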
There is a study that used data similar to ours; most likely, its dataset was collected from the same site. In any case, the result of that study is much worse than ours (its best result is 0.65). In addition, its authors used completely different classification methods. Thus, we can claim that our model shows better results on medical transcription data than that study.