An Optimized LightGBM Framework for Static Malware Classification Based on Windows Portable Executable File Attributes
DOI:
https://doi.org/10.63503/j.ijcma.2025.208
Keywords:
machine learning, LGBM, security, malware
Abstract
Malware poses a significant threat to the digital world, as weak detection systems can enable attackers to steal data or corrupt critical files. Numerous researchers have contributed to this domain by developing highly accurate malware classification systems using various machine learning and deep learning techniques. According to the literature, a wide range of machine learning models has been explored for malware classification, with LightGBM demonstrating superior performance in most cases. However, there remains scope for developing more efficient, lightweight, and robust malware classification models through machine-learning-based ensemble techniques. In this paper, we conduct experiments on the balanced and extensive EMBER 2018 dataset, comprising 799,876 samples with 2,382 features, to ensure robust training and obtain an unbiased model. We fine-tune the LightGBM model and additionally implement several other machine learning models: Random Forest (RF), ExtraTrees Classifier (ET), XGBoost Classifier (XGB), and a soft-voting ensemble that combines the RF, ET, and LightGBM classifiers. Our results show that the proposed fine-tuned LightGBM model outperforms the other approaches, achieving an accuracy of 96%.
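The soft-voting ensemble mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `soft_vote` and `predict_labels`, the label encoding (0 = benign, 1 = malicious), and all probability values are assumptions chosen for illustration. Soft voting averages the class probabilities predicted by each base model and selects the class with the highest mean probability.

```python
from typing import Dict, List, Optional


def soft_vote(
    probas: Dict[str, List[List[float]]],
    weights: Optional[Dict[str, float]] = None,
) -> List[List[float]]:
    """Return the (optionally weighted) mean class probabilities per sample."""
    names = list(probas)
    if weights is None:
        # Equal weights for every base model unless told otherwise.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_samples = len(probas[names[0]])
    n_classes = len(probas[names[0]][0])
    averaged = []
    for i in range(n_samples):
        row = [
            sum(weights[name] * probas[name][i][c] for name in names) / total
            for c in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def predict_labels(averaged: List[List[float]]) -> List[int]:
    """Pick the class index with the highest averaged probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in averaged]


# Hypothetical [P(benign), P(malicious)] outputs for three files.
# The numbers are made up and do not come from the paper.
example = {
    "rf":   [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]],
    "et":   [[0.80, 0.20], [0.30, 0.70], [0.60, 0.40]],
    "lgbm": [[0.95, 0.05], [0.20, 0.80], [0.20, 0.80]],
}

avg = soft_vote(example)
labels = predict_labels(avg)
print(labels)  # [0, 1, 1]
```

In the third hypothetical file, two of the three models lean benign, so a majority (hard) vote would label it benign, while the averaged probabilities label it malicious because one model is far more confident. This is the practical difference between soft and hard voting.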