#Masks: a Twitter Sentiment Analysis Throughout COVID-19

Collecting the Data

Using twint (a Python package for scraping Twitter), Josh and I ran a loop 24/7 for three weeks, scraping all tweets containing the word ‘mask’ or ‘masks’ posted between January 1st and June 30th, 2020: over 150 million tweets in total.

Note: there was a disruption in Twitter service on March 28th.

Preparing Tweets for Analysis

In order to run Machine Learning and NLP models on the tweets, we needed to clean up the text. We made all text lowercase, removed links and usernames, converted emojis into their associated words or phrases, and removed punctuation and stop words (common English connecting words such as ‘the’ and words redundant given the subject matter such as ‘virus’).
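The cleaning steps described above could look roughly like the sketch below. It is not our original code: the stop-word list is a tiny hypothetical stand-in, the function names are invented for illustration, and emojis are converted using the Unicode character names from Python's standard library rather than a dedicated emoji package.

```python
import re
import string
import unicodedata

# Hypothetical, heavily shortened stop-word list: common English words
# plus subject-specific terms such as "virus".
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on", "virus"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def emoji_to_words(text):
    """Replace symbol characters (including most emojis) with their Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":  # "Symbol, other" covers most emojis
            pieces.append(" " + unicodedata.name(ch, "").lower() + " ")
        else:
            pieces.append(ch)
    return "".join(pieces)


def clean_tweet(text):
    """Lowercase, strip links and usernames, spell out emojis, drop punctuation and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = emoji_to_words(text)
    text = text.translate(PUNCT_TABLE)
    return [token for token in text.split() if token not in STOP_WORDS]


if __name__ == "__main__":
    print(clean_tweet("Wear a MASK! 😷 @bob https://t.co/x the virus"))
```

Links and usernames are removed before punctuation so that the ‘@’ and ‘://’ markers are still there for the patterns to match.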

Sentiment Towards Masks Over Time

Using the Python package vader, we computed a sentiment score for each tweet on a scale from -1 (negative sentiment) to +1 (positive sentiment). Tweets with a sentiment score between -0.5 and +0.5 are considered to be neutral.
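As a rough illustration of this scoring step, here is a minimal sketch that assumes the vaderSentiment distribution of VADER and uses its compound score as the -1 to +1 value. The helper names are hypothetical, and treating scores exactly equal to ±0.5 as neutral is an assumption, since the cutoffs above do not say which side the boundary falls on.

```python
def label_sentiment(compound, threshold=0.5):
    """Bucket a compound score in [-1, 1] into negative / neutral / positive."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"  # assumption: boundary values count as neutral


def score_tweets(texts):
    """Return VADER's compound score for each tweet."""
    # Imported here so label_sentiment can be used without the dependency installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]


if __name__ == "__main__":
    tweets = ["I love my new mask!", "Masks are the worst."]
    for tweet, score in zip(tweets, score_tweets(tweets)):
        print(tweet, score, label_sentiment(score))
```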

LDA Topic Modeling

To gain further insight into the subject matter of each sentiment category, we employed Latent Dirichlet Allocation (LDA) topic modeling with gensim, another Python NLP library. Essentially, LDA derives a given number N of ‘topics’ from the words in all the tweets combined, then assigns each tweet a score for each topic; a tweet's scores add up to 1, or 100%. Due to computational resource constraints, we had to limit topic modeling to words that appeared at least 250 unique times across the entire dataset.
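A minimal sketch of this kind of pipeline with gensim is shown below. It is an illustration under assumptions, not our exact setup: the function names, the number of passes and the random seed are hypothetical, and the frequency cutoff here counts total occurrences of a word, which is only one possible reading of the 250 threshold.

```python
from collections import Counter


def frequent_vocabulary(token_lists, min_count):
    """Return the words that occur at least min_count times across all tweets."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}


def fit_lda(token_lists, num_topics, min_count):
    """Fit an LDA model on cleaned, tokenized tweets; return the model and per-tweet topic scores."""
    # Imported here so the vocabulary filter can be used without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    vocab = frequent_vocabulary(token_lists, min_count)
    docs = [[t for t in tokens if t in vocab] for tokens in token_lists]

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=5,        # hypothetical setting
        random_state=0,  # fixed seed for repeatable topics
    )

    # minimum_probability=0.0 keeps every topic, so each tweet's scores sum to about 1.
    scores = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
    return lda, scores


if __name__ == "__main__":
    tweets = [
        ["wear", "mask", "store"],
        ["mask", "mandate", "store"],
        ["wear", "mask", "mandate"],
    ]
    model, topic_scores = fit_lda(tweets, num_topics=2, min_count=2)
    for row in topic_scores:
        print(row)
```

On the full dataset, min_count would be set to 250 and num_topics to the chosen N.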

Conclusions

In general, we were surprised to see how consistent sentiment towards masks remained over the course of the pandemic among the general public (who use Twitter), which was not the case with major health officials. Upon further reflection, our results may have highlighted the so-called ‘wisdom of the crowd’ vs. the opinion of a few experts.
