Medical transcriptions classification

Final project for Natural Language Processing course (stream 3, autumn 2022)

Tags: medical, transcriptions, classification, doctor, diagnosis, vectorization, tf-idf, nlp, hospital

About

In this project, we implemented a model that classifies medical transcriptions by medical specialty. The main goal is to correctly predict the specialty from the transcription text and to present our approach to solving this problem.

Researchers

Anastasia: created the text corpus, developed the program code, and prepared the report.
Arina: developed the program code and prepared the report.
Evgeniy: developed the program code and prepared the report.

Dataset

Medical data is extremely hard to find due to HIPAA privacy regulations. However, https://www.mtsamples.com/ contains sample transcription reports for many specialties and different work types. This website is designed to give access to a large collection of transcribed medical reports, which can be used by both learning and working medical transcriptionists for their daily transcription needs. We collected the data manually by copying it from the site. In total, 4314 medical transcriptions from various fields of medicine were obtained. The dataset contains six columns, including 'description', 'medical specialty', 'sample name', 'transcription', and 'keywords'.

As part of pre-processing, we kept only the categories that have more than 50 samples, which reduced the number of categories from 33 to 17. The classes remain imbalanced: the class 'Consult - History and Phy.' has a very large number of records, almost three times as many as some of the other classes in the dataset. Since we classify medical specialties based on medical transcriptions, we need only the 'transcription' and 'medical specialty' columns of the dataset.
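The category filter can be sketched in a few lines of Python. This is a minimal illustration that keeps only categories with more than 50 samples, not the project's actual code: the function name `filter_rare_categories` and the row layout (a list of dicts keyed by column name) are assumptions made for the example.

```python
from collections import Counter


def filter_rare_categories(rows, min_samples=50,
                           label_key="medical specialty",
                           text_key="transcription"):
    """Keep rows whose category has more than `min_samples` records.

    Only the text and label fields are retained. The function name and
    the dict-per-row layout are hypothetical, chosen for illustration.
    """
    # Count how many records each category has.
    counts = Counter(row[label_key] for row in rows)
    # Keep rows from sufficiently large categories, dropping unused columns.
    return [
        {text_key: row[text_key], label_key: row[label_key]}
        for row in rows
        if counts[row[label_key]] > min_samples
    ]


# Toy usage with a threshold of 2 instead of 50.
toy_rows = [
    {"medical specialty": "Surgery", "transcription": "t1", "keywords": "k"},
    {"medical specialty": "Surgery", "transcription": "t2", "keywords": "k"},
    {"medical specialty": "Surgery", "transcription": "t3", "keywords": "k"},
    {"medical specialty": "Urology", "transcription": "t4", "keywords": "k"},
]
kept = filter_rare_categories(toy_rows, min_samples=2)
```

The threshold is strict (`>`), matching the "more than 50 samples" wording. A filter like this only removes rare classes, so a large gap between the remaining classes is still there afterwards.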

We then converted all texts to lower case, removed punctuation and stop words, and performed tokenization and lemmatization.
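The text-cleaning steps can be sketched as follows. This is an illustrative, standard-library-only version, not the code used in the project: the stop-word set is a tiny sample, and `naive_lemma` is a deliberately crude stand-in for a real lemmatizer from an NLP library. Both names, and `preprocess` itself, are hypothetical.

```python
import string

# Tiny illustrative stop-word set; a real run would use a full list
# from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was",
              "with", "for", "on"}

# Translation table that maps every ASCII punctuation mark to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_lemma(token):
    """Stand-in for a real lemmatizer: only strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text, stop_words=STOP_WORDS, lemmatize=naive_lemma):
    """Lower-case, strip punctuation, tokenize, drop stop words, lemmatize."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = text.split()  # whitespace tokenization
    return [lemmatize(tok) for tok in tokens if tok not in stop_words]


tokens = preprocess("The patient was admitted with chest pains.")
```

Tokenization happens before stop-word removal here because both filtering and lemmatization operate on individual tokens. Passing the lemmatizer in as a parameter makes it easy to swap the placeholder for a proper one.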

Results

After applying SVM and multiclass logistic regression (LR), we obtained the following results. As the table shows, the best classification results are achieved after improving the dataset.

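| Model | Precision | Recall | F1 score | Accuracy |
|---|---|---|---|---|
| SVC+SMOTE | 0.56 | 0.45 | 0.47 | 0.63 |
| LR+SMOTE | 0.54 | 0.42 | 0.45 | 0.62 |
| model-impr.+SVC | 0.79 | 0.75 | 0.76 | 0.81 |
| model-impr.+LR | 0.78 | 0.67 | 0.71 | 0.78 |
| model-impr.+SMOTE+SVC | 0.82 | 0.76 | 0.79 | 0.85 |
| model-impr.+SMOTE+LR | 0.81 | 0.72 | 0.75 | 0.84 |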
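For reference, the four reported metrics can be computed from true and predicted labels as sketched below. The project does not state how precision, recall, and F1 were averaged across classes; this example assumes macro averaging (an unweighted mean over classes), and the function name `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, F1, plus overall accuracy.

    Macro averaging is an assumption made for this sketch; the source
    does not say which averaging scheme was used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
    }


scores = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

With macro averaging, every class contributes equally regardless of its size, so a few poorly predicted small classes can pull precision, recall, and F1 well below accuracy.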

There is a study that used data similar to ours; most likely, its dataset was collected from the same site. Its results are considerably worse than ours (its best reported result is 0.65), and it used completely different classification methods. We therefore conclude that, of the two approaches, our model achieves the better results on medical transcription data.
