Tweets for Stocks
Active, founded 15 months ago

Finding correlations between tweets and future stock prices using ML/NLP approaches (a graduation project for the "Natural Language Processing" course, stream 3, autumn 2022)

Tags: nlp, finance, ml, ai

Stocks Tweets Project

This project studies the relationship between posts on social networks and the prices of the corresponding stocks.

All code is open-source, here is the link to GitHub: https://github.com/fin-algo-lake-ai/stocks-tweets-project

Dataset

At the first stage of the project, we started from the original dataset from the article [Jaggi M. et al., 2021]. It contains 6.4 million tweets for 25 stocks.

After examining the dataset, a number of issues were found and fixed:

  • added stop-word preprocessing (initially, stop words were simply removed)
  • replaced the algorithm for filling in missing data with a more stable one
  • recalculated the data labels in a different way (isolating in time the message release date from the price change date)
     

Experiment setup

After loading the dataset, it is divided into train and test sets in a ratio of 0.85 to 0.15. To check the stability and variability of each model, the split is carried out several times with a different seed value. Each partition is tested independently of the others.
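The repeated split can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual code: the `split_train_test` helper, the toy `messages` list, and the seed values are all assumptions made for the example.

```python
import random


def split_train_test(items, test_ratio=0.15, seed=0):
    """Shuffle a copy of `items` with the given seed, then split it."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    # The first n_test shuffled items form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the preprocessed message texts.
messages = [f"message {i}" for i in range(1000)]

# One independent partition per seed (seed values are illustrative).
partitions = {seed: split_train_test(messages, seed=seed) for seed in (0, 1, 2)}

for seed, (train, test) in partitions.items():
    print(seed, len(train), len(test))
```

Training and scoring a model on each partition separately gives a spread of accuracy values, which is one way to arrive at figures reported as a mean with a "+-" deviation.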

Of all the dataset fields, only one is used: the preprocessed message text. The remaining features (date, time, author, etc.) are intentionally excluded from the pipeline so that they do not distort the search for a correlation between the message text and the price.

To estimate model quality, we use the accuracy metric. Because the dataset is perfectly balanced, this metric is easy to interpret: a value of 0.5 corresponds to random guessing.
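To see why 0.5 is the natural reference point on a balanced dataset, here is a small sketch. The `accuracy` helper and the synthetic labels are assumptions for illustration only; they are not taken from the project code.

```python
import random


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic, perfectly balanced labels (half 0, half 1).
y_true = [0, 1] * 5000

# A uniform random guesser lands close to 0.5 ...
rng = random.Random(42)
y_random = [rng.randint(0, 1) for _ in y_true]
print(accuracy(y_true, y_random))

# ... and a constant prediction gets exactly 0.5, so neither trivial
# strategy can look better than chance on a balanced dataset.
y_constant = [1] * len(y_true)
print(accuracy(y_true, y_constant))
```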

After training and saving all the models, an ensemble of models is created to evaluate the cross-correlation of the models and obtain the final maximum achievable accuracy value.
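The page does not state how the ensembles combine predictions, so the sketch below uses plain majority voting as a stand-in technique. The toy labels and predictions are invented for illustration, and the `majority_vote` and `agreement` helpers are hypothetical names.

```python
from itertools import combinations


def majority_vote(*model_preds):
    """Combine binary predictions from an odd number of models."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*model_preds)]


def agreement(a, b):
    """Share of samples on which two models give the same prediction."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: each model is wrong on two different samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "NB": [1, 0, 1, 0, 0, 0, 1, 1],
    "CB": [1, 0, 0, 1, 0, 1, 1, 0],
    "TR": [0, 0, 1, 1, 1, 0, 1, 0],
}

# Low pairwise agreement means the models make different mistakes,
# which is when combining them helps the most.
for a, b in combinations(preds, 2):
    print(a, b, agreement(preds[a], preds[b]))

ensemble = majority_vote(*preds.values())
print("ensemble accuracy:", accuracy(y_true, ensemble))
```

Majority voting needs an odd number of models to avoid ties, so a pairwise ensemble would need a different rule, for example averaging predicted probabilities.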
 

Results

Below are the final table and graph with the results of the individual models and their ensembles.

Model Type              | Accuracy       | Model File Size | Comments
------------------------|----------------|-----------------|---------
Baseline (Random Guess) | 0.500 +- 0.003 | -               | It's not an actual model (DummyClassifier with strategy="uniform")
Transformer (Roberta)   | 0.545 +- 0.002 | 350 MB          | Roberta and DistilRoberta had the same accuracy in our experiments
GBDT (Catboost)         | 0.560 +- 0.003 | 5 MB            | Model details: CatBoostClassifier, n_estimators=300, max_depth=8
Naive Bayes             | 0.568 +- 0.002 | 58 MB + 49 MB   | Model details: MultinomialNB on TF-IDF features for word n-grams with length from 1 to 3, alpha=0.1
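The Naive Bayes and baseline configurations listed in the table map onto scikit-learn roughly as follows. This is a sketch under assumptions: the four toy messages and their labels are invented, and any vectorizer setting not named in the table (tokenization, vocabulary limits, and so on) is left at the library default, which may differ from what the project used.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy messages; 1 = price went up, 0 = price went down.
texts = [
    "strong earnings beat, shares rally",
    "bullish breakout, buying more shares",
    "weak guidance, shares plunge",
    "bearish outlook, selling my shares",
]
labels = [1, 1, 0, 0]

# TF-IDF over word n-grams of length 1 to 3, then MultinomialNB with alpha=0.1.
nb_model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3)),
    MultinomialNB(alpha=0.1),
)
nb_model.fit(texts, labels)

# Random-guess baseline: ignores the text and predicts classes uniformly.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(texts, labels)

print(nb_model.predict(["bullish rally", "bearish plunge"]))
print(baseline.predict(texts))
```

The pipeline consists of two fitted parts, a vectorizer and a classifier, which is one possible reading of the two-part file size reported for Naive Bayes, although the page does not say so.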

Figure: Boxplots for pairwise and triple ensembles (BL - baseline model (random uniform), CB - CatBoost model, NB - Naive Bayes model, TR - Transformer model).

Conclusion and future work

The results are promising: despite the complexity of the subject area (predicting future price behavior), we obtained a statistically significant accuracy of 0.582 against a random baseline of 0.500 +- 0.003. A separate interesting conclusion is that the size and algorithmic complexity of a machine learning model do not always determine how well it performs in a specific case (here, the simple Naive Bayes model outperformed the much larger Transformer), so researchers need to run more varied experiments at the modeling stage.
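One common way to check whether an accuracy differs from chance is a one-sample z-test using the normal approximation to the binomial distribution. The sketch below is illustrative only: the page does not describe the test the authors used, `n_test` is a hypothetical test-set size, and the test assumes independent samples, which tweets about the same stock on the same day may not be.

```python
import math


def z_score_vs_chance(accuracy, n_test, chance=0.5):
    """Z-score of an observed accuracy against a chance-level null hypothesis."""
    # Standard error of the accuracy if every prediction were a coin flip.
    standard_error = math.sqrt(chance * (1 - chance) / n_test)
    return (accuracy - chance) / standard_error


n_test = 100_000  # hypothetical test-set size, not taken from the project
z = z_score_vs_chance(0.582, n_test)
print(f"z = {z:.1f}")
```

A z-score above roughly 2 is usually treated as significant at the 5% level, so for test sets of this order of magnitude an accuracy of 0.582 is far from chance.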

Possible next steps:

  • Calculate business-oriented metrics (closer to the real potential usage of the idea)
  • Labels: Take not just the next calendar day, but the next N days (up to a week). Also it's possible to experiment with the 0.5% threshold that is used for class label calculation.
  • Models: Complete results with RNN approaches (LSTM / GRU) - the main runs have already been carried out, they need to be added to the final ensemble.
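The label experiments mentioned above (a longer horizon and a different threshold) could be parameterized as in the following sketch. It is an assumption-heavy illustration: the `price_label` function, the use of closing prices, and the choice to leave small moves unlabeled are not taken from the project and only show the idea.

```python
from datetime import date, timedelta


def price_label(closes, message_day, horizon_days=1, threshold=0.005):
    """Return 1 (up), 0 (down), or None if the move is small or data is missing.

    `closes` maps a calendar date to a closing price. The label compares the
    price `horizon_days` after the message with the price on the message day,
    so the message always precedes the price change it is labeled with.
    """
    target_day = message_day + timedelta(days=horizon_days)
    if message_day not in closes or target_day not in closes:
        return None
    change = closes[target_day] / closes[message_day] - 1.0
    if change > threshold:
        return 1
    if change < -threshold:
        return 0
    return None  # move within +-threshold: treated as unlabeled here


# Invented prices for illustration.
closes = {
    date(2022, 11, 1): 100.0,
    date(2022, 11, 2): 101.0,
    date(2022, 11, 3): 100.8,
    date(2022, 11, 4): 99.0,
}

print(price_label(closes, date(2022, 11, 1)))                  # next-day label
print(price_label(closes, date(2022, 11, 1), horizon_days=3))  # 3-day horizon
```

Sweeping `horizon_days` from 1 to 7 and varying `threshold` around 0.005 would produce the alternative label sets that the second bullet proposes to compare.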

References

  1. Yilmaz, E. S., Ozpolat, A., & Destek, M. A. (2022). Do Twitter sentiments really effective on energy stocks? Evidence from the intercompany dependency. Environmental Science and Pollution Research, 29(52), 78757-78767.
    https://www.academia.edu/download/89681397/s11356-022-21269-9.pdf
  2. Jaggi, M., Mandal, P., Narang, S., Naseem, U., & Khushi, M. (2021). Text Mining of Stocktwits Data for Predicting Stock Prices. arXiv preprint arXiv:2103.16388.
    https://arxiv.org/abs/2103.16388
  3. Pandey, R. (2021). Common Pitfalls to Avoid in Forecasting Models for Stock Price Prediction.
    https://medium.com/geekculture/common-pitfalls-to-avoid-in-forecasting-models-for-stock-price-prediction-3a7c3ff8b80
  4. Adusumilli, R. (2020). NLP in the Stock Market. Leveraging sentiment analysis on 10-k fillings as an edge
    https://towardsdatascience.com/nlp-in-the-stock-market-8760d062eb92
  5. Sun, Y., Liu, X., Chen, G., Hao, Y., & Zhang, Z. J. (2020). How mood affects the stock market: Empirical evidence from microblogs. Information & Management, 57(5), 103181.
    https://www.sciencedirect.com/science/article/pii/S0378720618307183
  6. Yuz, T. (2018). A Sentiment Analysis Approach to Predicting Stock Returns
    https://medium.com/@tomyuz/a-sentiment-analysis-approach-to-predicting-stock-returns-d5ca8b75a42


Contacts and feedback

Feel free to contact us in this Telegram group: https://t.me/tweets_for_stocks_feed
