Establishing an optimal online phishing detection method: Evaluating topological NLP transformers on text message data

Helen Milner, Michael Baron

Research output: Contribution to journal › Article › peer-review


Abstract

This research establishes an optimal classification model for online SMS spam detection by utilizing topological sentence transformer methodologies. The study is a response to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible spam SMS solution. We leverage large-text data models from HuggingFace (roberta-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of models and iteratively retest models using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/'black box' predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
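
The models in the abstract are compared by F1-score, the harmonic mean of precision and recall on the positive (spam) class. The following is a minimal plain-Python sketch of that metric, not the study's own evaluation code, which the abstract describes as running inside an sklearn pipeline. The function name `f1_score_binary` and the toy labels are illustrative assumptions.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for a single positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (not data from the study): 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score_binary(y_true, y_pred)
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is the F1-score.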
Original language: English
Pages (from-to): 173-181
Number of pages: 9
Journal: Journal of Data Science and Intelligence Systems
Volume: 2
Issue number: 1
Early online date: 12 Jul 2023
Publication status: Published - 2024
