COVID-19 Health Related Data Classification

  • Mahathir Mohammad Bishal (Creator)
  • Md Rakibul Hassan Chowdory (Creator)
  • Anik Das (Creator)
  • Ashad Kabir (Creator)

Dataset

Description of Data

We used a publicly available dataset, the COVID-19 Tweets Dataset, an extensive and continuously expanding collection of 1,091,515,074 tweet IDs. The dataset was compiled by tracking more than 90 distinct keywords and hashtags commonly associated with discussions of the COVID-19 pandemic. From this collection, we focused on a specific time frame, August 5, 2020, to August 26, 2020, to meet our research objectives. Because the dataset contains only tweet IDs, we used the Twitter developer API to retrieve the corresponding tweets. This retrieval process, implemented with the Twython library, looked up each tweet ID and extracted the associated tweet text. In total, we collected 21,890 tweets during this data extraction phase.
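
The retrieval step can be illustrated with a minimal sketch. It is not the authors' code: the batch size of 100, the `hydrate` helper, and the shape of the `lookup` callable are assumptions made for illustration. The lookup function is passed in as a parameter so the batching logic can be run without network access.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Assumption: the lookup endpoint accepts at most 100 tweet IDs per request.
BATCH_SIZE = 100


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` tweet IDs."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def hydrate(
    ids: Iterable[str],
    lookup: Callable[[List[str]], List[dict]],
    size: int = BATCH_SIZE,
) -> Dict[str, str]:
    """Map tweet ID to tweet text.

    `lookup` is a hypothetical callable that takes a batch of IDs and returns
    a list of status dictionaries. IDs for which nothing is returned (for
    example deleted or protected tweets) are simply absent from the result.
    """
    texts: Dict[str, str] = {}
    for batch in batched(ids, size):
        for status in lookup(batch):
            # Prefer the untruncated text when the API provides it.
            texts[status["id_str"]] = status.get("full_text") or status.get("text", "")
    return texts
```

With Twython, `lookup` might be something like `lambda batch: client.lookup_status(id=",".join(batch))`, where `client` is an authenticated `Twython` instance. That call is an assumption about how the library could be used here, not a detail stated in the dataset description.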

Following guidelines set by the CDC and WHO, we categorized tweets into five distinct classes for classification: health risks, prevention, symptoms, transmission, and treatment. According to these guidelines, individuals aged over sixty, or those with pre-existing health conditions such as heart disease, lung problems, weakened immune systems, or diabetes, are at higher risk of severe COVID-19 complications. Tweets categorized as ‘health risks’ therefore pertain to the elevated risks associated with COVID-19 due to age or specific health conditions. ‘Prevention’ tweets encompass discussions of preventive and precautionary measures against COVID-19. Tweets discussing common COVID-19 symptoms, including cough, congestion, breathing issues, fever, body aches, and more, are classified as ‘symptoms’ tweets. Conversations about the spread of COVID-19 between individuals, between animals and humans, and through contact with virus-contaminated objects or surfaces are categorized as ‘transmission’ tweets. Lastly, tweets discussing vaccine development and drugs used for COVID-19 treatment fall under the ‘treatment’ category.

We determined specific keywords for each of the five classes (health risks, prevention, symptoms, transmission, and treatment) based on the definitions provided by the CDC and WHO on their official websites. These definitions, along with their associated keywords, are detailed in Table 1. For instance, the CDC and WHO indicate that individuals over the age of sixty with conditions like heart disease, lung problems, weak immune systems, or diabetes face a higher risk of severe COVID-19 complications. In accordance with this definition, we selected relevant keywords such as “lung disease”, “heart disease”, “diabetes”, “weak immunity”, and others to identify tweets related to health risks within the larger tweet dataset. This approach was consistently applied to define keywords for the remaining four classes. Subsequently, we filtered the initial dataset of 21,890 tweets to extract tweets relevant to our predefined classes, resulting in a total of 6,667 tweets based on the selected keywords.
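
The keyword-based filtering described above can be sketched as follows. Only the four ‘health risks’ keywords are taken from the description; the keywords listed for the other four classes are placeholders chosen for illustration, and the whole-phrase, case-insensitive matching rule is likewise an assumption rather than the authors' documented method.

```python
import re
from typing import Dict, List, Pattern

CLASS_KEYWORDS: Dict[str, List[str]] = {
    # Quoted in the dataset description.
    "health risks": ["lung disease", "heart disease", "diabetes", "weak immunity"],
    # Hypothetical placeholders; the actual lists are given in Table 1.
    "prevention": ["mask", "social distancing"],
    "symptoms": ["cough", "fever"],
    "transmission": ["contaminated surface"],
    "treatment": ["vaccine"],
}


def compile_patterns(keywords: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Build one case-insensitive, whole-phrase regex per class."""
    return {
        label: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for label, terms in keywords.items()
    }


def matching_classes(text: str, patterns: Dict[str, Pattern[str]]) -> List[str]:
    """Return every class whose keyword list matches the tweet text."""
    return [label for label, pattern in patterns.items() if pattern.search(text)]


def filter_tweets(tweets: List[str], patterns: Dict[str, Pattern[str]]) -> List[str]:
    """Keep only tweets that match at least one class keyword."""
    return [tweet for tweet in tweets if matching_classes(tweet, patterns)]
```

A tweet can match keywords from more than one class, so this sketch only selects candidates; assigning the final label is left to the manual annotation step.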

To ensure the accuracy of our dataset, two separate annotators individually assigned the 6,667 tweets to the five classes. A third annotator, a natural language expert, meticulously cross-checked the dataset and provided necessary corrections. Subsequently, the two annotators resolved any discrepancies through mutual agreement, resulting in the final annotated dataset. Our dataset comprises a total of 6,667 data points categorized into five classes: 978, 2,046, 1,402, 802, and 1,439 tweets annotated as ‘health risks’, ‘prevention’, ‘symptoms’, ‘transmission’, and ‘treatment’, respectively.
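
As a quick consistency check on the reported class sizes, the short sketch below sums the per-class counts and computes each class's share of the dataset. The counts come from the description; the helper name `class_shares` is introduced here for illustration only.

```python
from typing import Dict

# Per-class tweet counts as reported in the dataset description.
CLASS_COUNTS: Dict[str, int] = {
    "health risks": 978,
    "prevention": 2046,
    "symptoms": 1402,
    "transmission": 802,
    "treatment": 1439,
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's fraction of the total number of tweets."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


total = sum(CLASS_COUNTS.values())  # 6667, matching the reported dataset size
shares = class_shares(CLASS_COUNTS)
largest = max(shares, key=shares.get)   # "prevention"
smallest = min(shares, key=shares.get)  # "transmission"
```

The classes are moderately imbalanced: ‘prevention’ is the largest class and holds roughly two and a half times as many tweets as ‘transmission’, the smallest.
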
Date made available: 01 Sept 2021
Publisher: Cell Press
Date of data production: 2021 -
