Sentiment Analysis of Long-term Social Data during the COVID-19 Pandemic

Directed Research Project

Spring 2021

PROJECT DETAILS

Social Network Sentimental Analysis is a directed research project that focuses on the sentiment of long-term social media content during the COVID-19 pandemic. The project was supervised by Prof. Hailu Xu at CSU, Long Beach.


ACHIEVEMENT

This research paper has been accepted for publication in the Springer Nature - Research Book Series: Transactions on Computational Science & Computational Intelligence. It is scheduled to be published soon after the 22nd International Conference on Internet Computing & IoT (ICOMP'21), which will take place on July 26-29, 2021, in the USA.

PAPER ID #: ICM4216

TITLE OF PAPER/ARTICLE: Sentiment Analysis of Long-term Social Data during the COVID-19 Pandemic

AUTHORS: Sophanna Ek, Marco Curci, Xiaokun Yang, Beiyu Lin, Hailu Xu


Overview

The global outbreak of the COVID-19 pandemic has changed and disrupted people's lives over the past year. By March 2021, the virus had spread to 223 countries, with 119 million cases and more than 2 million deaths. While the virus itself spread worldwide, social media played a key role as a means of receiving and posting daily news updates.

Social networks have shown strong emotional reactions to the pandemic over the past year. During the pandemic, large numbers of posts involving rumors, hate speech, racist conspiracy theories, and negative sentiment proliferated quickly on social networks. In this research, we study sentiment specifically in Twitter data. For the full research work, please see the paper listed under ACHIEVEMENT above.

IMPLEMENTATION

FEATURES

Research Work Pipeline

Data Collection

Data was collected with the Tweepy library, authenticated with an access token from Twitter. Tweets were selected by filtering on relevant keywords, including "covid-19" and "vaccine", or on hashtags such as "#covid-19" or "#coronavirus". Data was collected weekly, in the periods 1st-7th, 8th-14th, 15th-21st, and 22nd-28th/30th of each month, for 12 months. Each weekly collection contained between 25,000 and 35,000 tweets. In total, approximately 1,300,000 tweets were collected from February 2020 to February 2021.
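The collection schedule above can be sketched in a few lines of Python. This is only an illustration: the function names, the OR-joined query format, and the choice to end the fourth window on the last day of the month are assumptions, and the actual Tweepy search call is left out because the exact endpoint used is not stated here.

```python
import calendar
from datetime import date

# Keywords and hashtags named in the text above.
KEYWORDS = ["covid-19", "vaccine", "#covid-19", "#coronavirus"]


def build_query(terms):
    """Join the terms into one OR query string (format is an assumption)."""
    return " OR ".join(f'"{term}"' for term in terms)


def weekly_windows(year, month):
    """Return the four (start, end) collection windows for one month.

    The first three windows are 1-7, 8-14 and 15-21. The fourth starts on
    the 22nd and is assumed here to run to the last day of the month.
    """
    last_day = calendar.monthrange(year, month)[1]
    bounds = [(1, 7), (8, 14), (15, 21), (22, last_day)]
    return [(date(year, month, a), date(year, month, b)) for a, b in bounds]


if __name__ == "__main__":
    print(build_query(KEYWORDS))
    for start, end in weekly_windows(2020, 2):
        print(start.isoformat(), "to", end.isoformat())
```

Each window and the query string would then be passed to whichever Tweepy search call is available for the account's API access level.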


Data Processing

We preprocessed each tweet before performing the data analysis. All collected tweets were preprocessed in the following order:

  1. The tweet content is extracted by removing URLs, user references, punctuation, and hashtag symbols.
  2. The extracted text is then tokenized, and common English stop words are removed.
  3. The cleaned text is stemmed with the Porter Stemmer, which removes morphological affixes and leaves only the word stem for our analysis.
  4. The text is then lemmatized to reduce each word to its meaningful base form.

The stop word list, Porter Stemmer, and lemmatizer all come from the NLTK (Natural Language Toolkit) library in Python.
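The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the regular expressions, the lowercasing, the whitespace tokenization, and all function names are assumptions. The NLTK calls require the "stopwords" and "wordnet" corpora, which are fetched separately with nltk.download().

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Step 1: drop URLs, user references, punctuation and hashtag symbols."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # string.punctuation includes '#', so hashtag symbols are removed here too.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercasing and whitespace collapsing are assumptions, not stated steps.
    return " ".join(text.lower().split())


def normalize_tokens(tokens):
    """Steps 2-4: stop word removal, Porter stemming, then lemmatization."""
    # Imported lazily so that clean_tweet works without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    kept = [t for t in tokens if t not in stop_words]
    stemmed = [stemmer.stem(t) for t in kept]
    return [lemmatizer.lemmatize(t) for t in stemmed]


def preprocess(text):
    """Run all four steps on one tweet and return the processed tokens."""
    tokens = clean_tweet(text).split()  # whitespace tokenization (assumption)
    return normalize_tokens(tokens)
```

The order mirrors the list above: cleaning first, then stop word removal, stemming, and lemmatization.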

Analysis methods

  1. TF-IDF: The TF-IDF method was used to find the more important topics in the collected tweets based on their TF-IDF scores. Instead of giving every word equal importance, TF-IDF gives more weight to words that occur frequently in one document and less frequently in other documents. A word's TF-IDF score is the product of its term frequency and its inverse document frequency. We used TfidfVectorizer from the sklearn Python library to fit the pre-processed tweet data and score each term that appears in the tweet corpus.

  2. Sentiment Analysis: Sentiment analysis is performed on the same pre-processed tweet dataset. TextBlob [5] is a Python library that performs common natural language processing tasks such as sentiment analysis, classification, noun phrase extraction, and more. It assigns individual scores to the words, then averages them to compute the final sentiment. It takes the pre-processed tweet content and returns a polarity score, which is used to determine the sentiment of the tweet. The score ranges from -1 to 1. The content is considered negative if the polarity score is less than 0, positive if it is greater than 0, and neutral if it is equal to 0.
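Both analysis methods can be illustrated with a short sketch. The TF-IDF function below uses the textbook form, tf * log(N / df), written in plain Python. It is not TfidfVectorizer, which by default applies a smoothed idf and L2 normalization, so its scores will differ. The polarity thresholds follow the rule described above. All function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term in every tokenized document with tf * idf.

    tf is the term's relative frequency within the document and idf is
    log(N / df), where df counts the documents containing the term.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def classify_polarity(score):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score < 0:
        return "negative"
    if score > 0:
        return "positive"
    return "neutral"


def tweet_sentiment(text):
    """Label one pre-processed tweet using TextBlob's polarity score."""
    # Imported lazily so the rest of the sketch runs without TextBlob.
    from textblob import TextBlob

    return classify_polarity(TextBlob(text).sentiment.polarity)
```

In the TF-IDF sketch, a term that appears in every document gets a score of 0, which is why common words contribute little to topic ranking.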

Analysis Result

We perform sentiment analysis on the collected data and investigate sentiment patterns across content-level, political, and social cognitive attributes. We provide perspectives on the social data in three categories: vaccine-related, politics-related, and economics-related. We also explore how views differ by gender.

Conclusion

In this work, we conduct a long-term analysis of social posts related to COVID-19 over 12 months. We analyze the collected social data in three categories (vaccines, politics, and economics) and examine the sentiment of tweets from various perspectives. Our analysis shows that negative tweets hold a prominent position during the pandemic and decline as vaccines are developed. In addition, the evolution of the various topics closely follows the most active discussions on economics, politics, and vaccines.

Future Work

We will analyze the tweets from deeper and more comprehensive perspectives. We will continue to explore the variation among different topics and examine how responses differ by gender at different stages of the "infodemic". In addition, we will characterize sentiment patterns to extend the understanding of the impacts of the online social "infodemic".