K-Gen PhishGuard: an Ensemble Approach for Phishing Detection with K-Means and Genetic Algorithm

Ali Al-Hafiz; Adnan Jabir; Shamala Subramaniam

doi:10.22153/kej.2025.04.011

Authors

Ali Al-Hafiz, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq
Adnan Jabir, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq
Shamala Subramaniam, Department of Communication Technology and Networking, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, 43400, Selangor Darul Ehsan, Malaysia


Keywords:

AdaBoost; ensemble learning; feature selection; genetic algorithm; K-means clustering; machine learning; phishing detection

Abstract

Phishing detection is a critical problem in cybersecurity, and the main challenge is how to combine machine learning with an effective feature selection method to identify malicious websites accurately. This paper presents a phishing detection system with two main stages that use unsupervised feature selection and supervised classification. In the first stage, a genetic algorithm (GA) identifies the best set of features for the K-means clustering algorithm, which partitions the dataset into clusters with similar characteristics. In the second stage, the GA is applied again to identify the best feature subset within each cluster, which strengthens classification. Finally, a voting ensemble is applied: Support Vector Machine (SVM), Random Forest (RF), XGBoost, and AdaBoost models are combined through a soft voting mechanism that aggregates their predictions. The study uses a web page phishing detection dataset containing 11,430 URLs and 87 features. The results show that the voting ensemble reaches an accuracy of up to 99% with feature selection, compared with 77.3% without it. GA-optimized feature selection substantially improves model performance by reducing computational complexity and raising key metrics such as accuracy, precision, and F1-score. In addition, results across four datasets show that K-means contributes positively to classification accuracy within certain datasets. These results indicate that combining feature selection with ensemble learning is an effective solution for phishing detection, and they demonstrate the applicability and efficiency of this solution in real-world use.
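The soft voting step named in the abstract can be illustrated with a minimal sketch. It assumes each base model has already produced class probabilities per URL; the function name `soft_vote`, the equal model weights, and all probability values below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def soft_vote(probas: Dict[str, Sequence[Sequence[float]]]) -> List[int]:
    """Average per-class probabilities across models, then pick the argmax class.

    Assumption: every model contributes with equal weight.
    """
    models = list(probas.values())
    n_samples = len(models[0])
    n_classes = len(models[0][0])
    labels = []
    for i in range(n_samples):
        # Mean probability of each class for sample i over all models.
        avg = [sum(m[i][c] for m in models) / len(models) for c in range(n_classes)]
        labels.append(max(range(n_classes), key=lambda c: avg[c]))
    return labels


# Hypothetical outputs for three URLs; columns are [P(legitimate), P(phishing)].
example = {
    "svm": [[0.60, 0.40], [0.20, 0.80], [0.55, 0.45]],
    "rf": [[0.70, 0.30], [0.40, 0.60], [0.30, 0.70]],
    "xgboost": [[0.80, 0.20], [0.10, 0.90], [0.40, 0.60]],
    "adaboost": [[0.45, 0.55], [0.45, 0.55], [0.52, 0.48]],
}
predictions = soft_vote(example)
print(predictions)  # 0 = legitimate, 1 = phishing
```

For the third URL the four models split two against two on the hard label, yet the averaged probabilities favor the phishing class. This is the property that distinguishes soft voting from majority voting: confident models carry more influence than hesitant ones.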

