Adaptive Phishing Detection System using Machine Learning

Download files
Access & Terms of Use
open access
Copyright: Purwanto, Rizka
Altmetric
Abstract
Despite the availability of toolbars and studies in phishing, the number of phishing attacks has been increasing in the past years. It remains a challenge to develop robust phishing detection systems due to the continuous change of attack models. We attempt to address this by designing an adaptive phishing detection system with the ability to continually learn and detect phishing robustly. In the first work, we demonstrate a systematic way to develop a novel phishing detection approach using compression algorithm. We also propose the use of compression ratio as a novel machine learning feature, which significantly improves machine learning based phishing detection over previous studies. Our proposed method outperforms the use of best-performing HTML-based features in past studies, with a true positive rate of 80.04%. In the following work, we propose a feature-free method using Normalised Compression Distance (NCD), a metric which computes the similarity of two websites by compressing them, eliminating the need to perform any feature extraction. This method examines the HTML of webpages and computes their similarity with known phishing websites. Our approach is feasible to deploy in real systems with a processing time of roughly 0.3 seconds, and significantly outperforms previous methods in detecting phishing websites, with an AUC score of 98.68%, a G-mean score of 94.47%, a high true positive rate (TPR) of around 90%, while maintaining a low false positive rate (FPR) of 0.58%. We also discuss the implication of automation offered by AutoML frameworks towards the role of human experts and data scientists in the domain of phishing detection. Our work investigates whether models that are built using AutoML frameworks can outperform the results achieved by human data scientists in phishing datasets and analyses the relationship between the performances and various data complexity measures. There remain many challenges for building a real-world phishing detection system using AutoML frameworks due to the current support only for supervised classification problems, leading to the need for labelled data, and the inability to update the AutoML-based models incrementally. This indicates that experts with knowledge in the domain of phishing and cybersecurity are still essential in phishing detection.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2022
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download public version.pdf 29.63 MB Adobe Portable Document Format
Related dataset(s)