TY - JOUR
T1 - DATM: A novel data agnostic topic modelling technique with improved effectiveness for both short and long text
T2 - IEEE Access
AU - Bewong, Michael
AU - Wondoh, John
AU - Kwashie, Selasi
AU - Liu, Jixue
AU - Liu, Lin
AU - Li, Jiuyong
AU - Islam, Md Zahidul
AU - Kernot, David
N1 - Funding Information:
This work was supported in part by a Charles Sturt University publication grant and the Charles Sturt University Early Career Researcher Award 2021 (Michael Bewong).
Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
AB - Topic modelling is important for tackling several data mining tasks in information retrieval. While seminal topic modelling techniques such as Latent Dirichlet Allocation (LDA) have been proposed, the ubiquity of social media and the brevity of its texts pose unique challenges for such traditional topic modelling techniques. Several extensions, including auxiliary aggregation, self-aggregation and direct learning, have been proposed to mitigate these challenges; however, some still remain, including a lack of consistency in the topics generated and a decline in model performance in applications involving disparate document lengths. There has also been a recent paradigm shift towards neural topic models, which are not suited for resource-constrained environments. This paper revisits LDA-style techniques, taking a theoretical approach to analyse the relationship between word co-occurrence and topic models. Our analysis shows that topic discovery can be enhanced by altering the word co-occurrences within the corpus. Thus, we propose a novel data transformation approach, dubbed DATM, to improve topic discovery within a corpus. A rigorous empirical evaluation shows that DATM is not only powerful on its own, but can also be used in conjunction with existing benchmark techniques to significantly improve their effectiveness and consistency, by up to two-fold.
KW - Australia
KW - Benchmark testing
KW - Context modeling
KW - Data models
KW - Document Transformation
KW - Greedy algorithm
KW - Information retrieval
KW - Latent Dirichlet allocation
KW - Multi-set multi-cover problem
KW - Probabilistic generative topic modelling
KW - Reliability
KW - Social networking (online)
KW - Task analysis
UR - http://www.scopus.com/inward/record.url?scp=85151570629&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85151570629&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3262653
DO - 10.1109/ACCESS.2023.3262653
M3 - Article
AN - SCOPUS:85151570629
SN - 2169-3536
VL - 11
SP - 32826
EP - 32841
JO - IEEE Access
JF - IEEE Access
ER -