Effective text data preprocessing technique for sentiment analysis in social media data

Saurav Pradha, Malka N. Halgamuge, Nguyen Tran Quoc Vinh

Research output: Book chapter/Published conference paper > Conference paper

Abstract

In the big data era, data is generated in real time or near real time. Businesses can therefore use this ever-growing volume of data for data-driven or information-driven decision-making to improve their operations. Social media platforms such as Twitter generate an enormous amount of such data. However, social media data are often unstructured and difficult to manage. Hence, this study proposes an effective text data preprocessing technique and develops an algorithm to train Support Vector Machine (SVM), Deep Learning (DL) and Naïve Bayes (NB) classifiers to process Twitter data. We develop an algorithm that weights the sentiment score according to the weights of the hashtag and the cleaned text. In this study, we (i) compare different preprocessing techniques (stemming, lemmatization and spelling correction) on data collected from Twitter to identify the most efficient method, and (ii) develop an algorithm that weights the scores of the hashtag and the cleaned text to obtain the sentiment. We retrieved N = 1,314,000 tweets and compared the popularity of two products, Google Now and Amazon Alexa. Using our data preprocessing algorithm and sentiment weight score algorithm, we train SVM, DL and NB models. The results show that the stemming technique performed best in terms of computational speed. Additionally, the accuracy of the algorithm was tested against manually sorted sentiments and against sentiments produced before text data preprocessing. The results demonstrated that the sentiments produced by the algorithm were close to the manually annotated sentiments. In terms of model performance, the SVM performed better, with an accuracy of 90.3%, perhaps due to the unstructured nature of Twitter data. Previous studies used conventional techniques and did not apply precise methods for cleaning the text. Our approach therefore confirms that a proper text data preprocessing technique plays a significant role in the prediction accuracy and computational time of the classifier when using unstructured Twitter data.
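
As a rough illustration of the two ideas named in the abstract (cleaning tweet text, then combining a hashtag score with a cleaned-text score), the following Python sketch uses only the standard library. The toy lexicon, the crude suffix-stripping stemmer, and the weights `w_hashtag` and `w_text` are all illustrative assumptions; the abstract does not specify the paper's actual algorithm, resources, or weight values.

```python
import re

# Hypothetical toy lexicon; the paper's sentiment resource is not given in the abstract.
LEXICON = {"love": 1.0, "great": 1.0, "good": 0.5,
           "bad": -0.5, "hate": -1.0, "awful": -1.0}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"[a-z]+")


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer, not the paper's method."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token


# Stem the lexicon keys so that lookups match stemmed tokens.
STEM_LEXICON = {naive_stem(word): score for word, score in LEXICON.items()}


def preprocess(tweet):
    """Return (stemmed tokens of the cleaned text, stemmed lowercase hashtags)."""
    text = URL_RE.sub(" ", tweet)       # drop links
    text = MENTION_RE.sub(" ", text)    # drop @mentions
    hashtags = [naive_stem(tag.lower()) for tag in HASHTAG_RE.findall(text)]
    text = HASHTAG_RE.sub(" ", text)    # hashtags are scored separately
    tokens = [naive_stem(tok) for tok in TOKEN_RE.findall(text.lower())]
    return tokens, hashtags


def lexicon_score(stems):
    """Mean polarity of the stems found in the lexicon; 0.0 if none are found."""
    hits = [STEM_LEXICON[s] for s in stems if s in STEM_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def weighted_sentiment(tweet, w_hashtag=0.4, w_text=0.6):
    """Combine hashtag and cleaned-text scores; the weights are assumed values."""
    tokens, hashtags = preprocess(tweet)
    return w_hashtag * lexicon_score(hashtags) + w_text * lexicon_score(tokens)


if __name__ == "__main__":
    print(weighted_sentiment("I loved the new update! #great @user http://t.co/x"))
```

In this sketch, a positive result suggests positive sentiment and a negative result suggests negative sentiment. A real pipeline would likely replace `naive_stem` with an established stemmer or lemmatizer and feed the cleaned tokens to a trained classifier rather than a fixed lexicon.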

Original language: English
Title of host publication: Proceedings of 2019 11th International Conference on Knowledge and Systems Engineering, KSE 2019
Editors: Josiane Mothe, Le Hoang Son, Nguyen Tran Quoc Vinh
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 8
ISBN (Electronic): 9781728130033
DOIs: https://doi.org/10.1109/KSE.2019.8919368
Publication status: Published - 05 Dec 2019
Event: 11th IEEE International Conference on Knowledge and Systems Engineering: KSE 2019 - University of Education and Science, The University of Danang, Da Nang, Viet Nam
Duration: 24 Oct 2019 to 26 Oct 2019
http://kse2019.ued.udn.vn/welcome
https://ieeexplore.ieee.org/xpl/conhome/8909968/proceeding (proceedings)
http://kse2019.ued.udn.vn/sites/default/files/KSE%202019%20Program%20Ver13-23-10-2019%20-%20print.pdf (program)

Publication series

Name: Proceedings of 2019 11th International Conference on Knowledge and Systems Engineering, KSE 2019

Conference

Conference: 11th IEEE International Conference on Knowledge and Systems Engineering
Country: Viet Nam
City: Da Nang
Period: 24/10/19 to 26/10/19
Other: The 11th KSE is an international forum for presentation, discussion, and exchange of the state-of-the-art research, development, and applications in the field of knowledge and systems engineering.

The main objective of the KSE is to bring together researchers, practitioners and students not only to share research results and practical applications but also to foster collaboration in the field. The KSE this year will be held in Da Nang, a coastal and tourism city with beautiful beaches and a lot of exciting experiences. The conference will be co-organized by The University of Da Nang - University of Science and Education (UD-USE) and VNU University of Engineering and Technology (VNU-UET).

Cite this

Pradha, S., Halgamuge, M. N., & Tran Quoc Vinh, N. (2019). Effective text data preprocessing technique for sentiment analysis in social media data. In J. Mothe, L. H. Son, & N. T. Q. Vinh (Eds.), Proceedings of 2019 11th International Conference on Knowledge and Systems Engineering, KSE 2019 [8919368] (Proceedings of 2019 11th International Conference on Knowledge and Systems Engineering, KSE 2019). IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/KSE.2019.8919368