Abstract
This research establishes an optimal classification model for online SMS spam detection by utilizing topological sentence transformer methodologies. The study is a response to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible spam SMS solution. We leverage large-text data models from HuggingFace (roberta-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of models and iteratively retest models using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS-spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
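For readers unfamiliar with the metric used to compare models, the F1-score is the harmonic mean of precision and recall on the positive (spam) class. The following is a minimal illustrative sketch in plain Python, not the study's evaluation code: the function name `f1_for_class`, the label strings, and the toy predictions are all assumptions made for the example and do not come from the paper's datasets.

```python
def f1_for_class(y_true, y_pred, positive="spam"):
    """Compute the F1-score for one class from two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: legitimate messages flagged as spam.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that was missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, purely for illustration.
y_true = ["spam", "ham", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
score = f1_for_class(y_true, y_pred)
print(round(score, 3))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1-score equals 2/3 as well.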
Original language | English |
---|---|
Pages (from-to) | 173-181 |
Number of pages | 9 |
Journal | Journal of Data Science and Intelligence Systems |
Volume | 2 |
Issue number | 1 |
Early online date | 12 Jul 2023 |
DOIs | |
Publication status | Published - 2024 |