Just-in-time (JIT) software defect prediction aims to predict whether code commits made during project development and maintenance will introduce defects. In JIT software defect prediction research, model training relies on high-quality datasets. However, the impact of dataset augmentation methods on JIT software defect prediction has not been thoroughly investigated in existing studies. To improve the performance of JIT software defect prediction, a method based on dataset augmentation, named prediction based on data augmentation (PDA), is proposed. PDA consists of four parts: feature stitching, sample generation, sample filtering, and sampling processing. The augmented dataset contains an ample number of high-quality samples and eliminates the class imbalance problem. Compared with the latest JIT software defect prediction method (JIT-Fine), PDA improves the F1 score by 18.33% on the JIT-Defects4J dataset and by 3.67% on the LLTC4J dataset, demonstrating PDA's generalization ability. Ablation studies confirm that the performance improvement of PDA mainly comes from the dataset augmentation and filtering mechanisms.
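The abstract reports results as F1 scores and names class imbalance as a problem the augmented dataset removes. The sketch below illustrates both ideas on toy data. It is not the PDA implementation: the abstract does not describe how the sampling step works, so random oversampling is used here purely as an assumed stand-in, and the names `f1_score`, `random_oversample`, and the toy commit labels are hypothetical.

```python
import random
from collections import Counter


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2PR / (P + R) for the positive (defect-inducing) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def random_oversample(samples, labels, seed=0):
    """Assumed stand-in for a sampling step: duplicate minority-class
    samples at random until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: 8 clean commits (label 0) and 2 defect-inducing commits (label 1).
commits = [f"commit_{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_commits, balanced_labels = random_oversample(commits, labels)
balanced_counts = Counter(balanced_labels)

# A model that always predicts "clean" is 80% accurate on the toy data,
# yet its F1 score for the defect class is 0.
always_clean_f1 = f1_score(labels, [0] * len(labels))
```

The toy example shows why F1 rather than accuracy is the natural metric for imbalanced defect data: a predictor that never flags a commit looks accurate but finds no defects.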
Basic Information:
DOI:10.12194/j.ntu.20231206001
China Classification Code:TP311.5
Citation Information:
[1] YANG Fan, XIA Hongling. A just-in-time software defect prediction method based on data augmentation[J]. Journal of Nantong University (Natural Science Edition), 2024, 23(01): 58-65. DOI: 10.12194/j.ntu.20231206001.
Fund Information:
Nantong Science and Technology Program, General Project (JC2023070)
2024-01-09