Data and Details: We selected this data set because it provides a large, longitudinal sample of the Twitter community, which purportedly has over 152 million daily active users.
Data Set Contents: The data set consists of 258,800 tweets that were scraped on the condition that each contains the hashtag “#covid19”. Because every tweet is a reaction to COVID-19, the data is targeted to our question. It provides 34 features, including location, tweet content, description, and others. To draw meaningful conclusions from these features, we apply several preprocessing steps. These include categorizing location strings by country and collecting the full text of each tweet rather than the shortened tweet that ends in a hyperlink.
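The two preprocessing steps named above can be sketched as follows. This is a minimal illustration, not our exact pipeline: the keyword table `COUNTRY_KEYWORDS` and the field names `text` and `full_text` are hypothetical, and a real mapping would need far more entries or a geocoding service.

```python
import re
from typing import Optional

# Hypothetical lookup table, for illustration only.
COUNTRY_KEYWORDS = {
    "United States": ["usa", "united states", "new york", "california"],
    "United Kingdom": ["uk", "united kingdom", "london", "england"],
    "India": ["india", "mumbai", "new delhi"],
}


def country_from_location(location: Optional[str]) -> str:
    """Map a free-text location string to a country, or 'Unknown'."""
    if not location:
        return "Unknown"
    text = location.lower()
    for country, keywords in COUNTRY_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries stop "uk" from matching inside "ukraine".
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                return country
    return "Unknown"


def full_text(record: dict) -> str:
    """Prefer the untruncated tweet text when the record carries one.

    The keys "full_text" and "text" are assumed field names.
    """
    return record.get("full_text") or record.get("text", "")
```

Locations that match no keyword are kept as "Unknown" rather than dropped, so the share of uncategorized tweets stays visible in later analysis.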
Validation: To check the accuracy of the VADER model, we will manually label 100 tweets with the sentiment scores our group finds accurate. We will use these tweets as the validation set and test whether the model's scores match our expectations.
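One way to compare the manual scores with the model's scores is sketched below. The two metrics (mean absolute error and agreement on polarity) and the 0.05 cutoff are assumptions chosen for illustration; the cutoff follows the convention suggested in VADER's documentation, and the function names are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(manual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between manual and model scores."""
    if not manual or len(manual) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(m - p) for m, p in zip(manual, predicted)) / len(manual)


def polarity(score: float, threshold: float = 0.05) -> str:
    """Bucket a score in [-1, 1] into negative, neutral, or positive."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def polarity_agreement(manual: Sequence[float], predicted: Sequence[float],
                       threshold: float = 0.05) -> float:
    """Fraction of tweets where both scores fall in the same bucket."""
    if not manual or len(manual) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(
        polarity(m, threshold) == polarity(p, threshold)
        for m, p in zip(manual, predicted)
    )
    return matches / len(manual)
```

Reporting both numbers is useful because a model can agree on polarity while still misjudging intensity, or the reverse.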
Timeline: The data was collected between March 20, 2020, and November 29, 2020, so it is longitudinal. This window captures a significant portion of the lockdown periods, during which COVID-19 cases fluctuated in different countries, and therefore provides a varied sample of data.
Sourcing Location: The data set was pulled from the Twitter platform, which is used internationally. The data is geographically worldwide; however, it clusters in locations where Twitter is more prevalent or where the coronavirus is of greater concern. Anyone using these results to make decisions for a region should therefore consider how widely Twitter is used there. The data also oversamples users who tweeted their opinions about COVID-19 frequently and undersamples those who did not, but the size of the sample, more than 258,000 tweets, should mostly mitigate this issue.
Reason: The data was collected specifically to gain insight into public sentiment regarding the coronavirus. Its large sample size also allows for a more comprehensive study.
Model Details: We use the VADER model for our sentiment analysis. VADER is built specifically for social media text and requires no extra training to provide accurate results. It scores each text on a continuous scale from -1 (most negative) to 1 (most positive), so it will give us further insight into how negative or positive each tweet is. This will allow us to draw more precise conclusions.
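A minimal sketch of how the scores could be obtained is shown below, assuming the `vaderSentiment` Python package is installed. The helper name `compound_scores` is hypothetical; `SentimentIntensityAnalyzer` and its `polarity_scores` method come from that package, and the `compound` entry of the returned dictionary is the value on the -1 to 1 scale.

```python
from typing import Iterable, List


def compound_scores(texts: Iterable[str], analyzer=None) -> List[float]:
    """Return VADER's compound score in [-1, 1] for each text.

    `analyzer` may be any object with a `polarity_scores(text)` method
    that returns a dict containing a "compound" key.
    """
    if analyzer is None:
        # Imported lazily so this module loads without the package.
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text)["compound"] for text in texts]
```

Calling `compound_scores(tweets)` on a list of tweet strings would then return one score per tweet. Accepting the analyzer as an argument lets a single instance be reused across batches.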