Uncertainty-aware multi-objective refactoring for code duplication
Abstract
Code clones are recurring code fragments that may hinder software maintainability if not properly managed. While many clone detection tools exist, they often stop at identification and provide no clear guidance on whether a detected clone group should be refactored, how to do so, or in what order. This paper presents a machine learning–based method for recommending clone refactorings with prioritization and confidence estimation. The proposed approach represents code fragments using abstract syntax trees, program dependency graphs, and semantic embeddings from a pre-trained CodeBERT model. In addition, version control data is used to extract evolutionary features such as churn, age, and co-change patterns. A multi-class classifier predicts refactoring types, while open-set recognition techniques identify uncertain cases and flag them as unknown. Effort and benefit estimation models help prioritize suggestions based on a cost-effectiveness ratio. We evaluated the method on four open-source Java projects using a manually labeled dataset of 600 clone groups. The system achieves a macro-F1 score of 0.76 on known refactoring types and an AUROC of 0.91 for unknown detection. Prioritized recommendation quality reaches NDCG@3 of 0.89, showing strong alignment with expert assessments. The results indicate that clone refactoring can be effectively supported through integrated code representation, uncertainty modeling, and prioritization. The approach transforms clone analysis from a passive task into an actionable process.
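To make the prioritization step concrete, the following is a minimal sketch of how uncertain predictions might be flagged as unknown and the remaining clone groups ranked by a benefit-to-effort ratio, together with a standard NDCG@k computation. It is an illustration under stated assumptions, not the paper's implementation: the `CloneGroup` fields, the refactoring type labels, the 0.6 confidence threshold, the simple max-probability rejection rule (standing in for open-set recognition), and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CloneGroup:
    name: str
    class_probs: dict  # hypothetical: refactoring type -> predicted probability
    benefit: float     # hypothetical estimated benefit (arbitrary units)
    effort: float      # hypothetical estimated effort (same units, must be > 0)


def predict_or_unknown(group, threshold=0.6):
    """Return the most probable refactoring type, or 'unknown' when the top
    probability falls below the threshold. This max-probability rule is an
    assumed stand-in for an open-set recognition technique."""
    label, prob = max(group.class_probs.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else "unknown"


def prioritize(groups, threshold=0.6):
    """Rank confidently classified groups by benefit/effort, highest first."""
    known = [g for g in groups if predict_or_unknown(g, threshold) != "unknown"]
    return sorted(known, key=lambda g: g.benefit / g.effort, reverse=True)


def ndcg_at_k(relevances, k):
    """NDCG@k for graded relevance scores given in ranked order."""
    def dcg(scores):
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical clone groups; labels and numbers are invented for illustration.
groups = [
    CloneGroup("A", {"extract_method": 0.85, "pull_up_method": 0.15}, 8.0, 2.0),
    CloneGroup("B", {"extract_method": 0.45, "pull_up_method": 0.55}, 9.0, 1.0),
    CloneGroup("C", {"extract_method": 0.10, "pull_up_method": 0.90}, 6.0, 3.0),
]

# Group B is flagged as unknown (top probability 0.55 < 0.6) and excluded.
ranking = [g.name for g in prioritize(groups)]
print(ranking)
```

In this sketch, group B has the highest benefit-to-effort ratio but is withheld from the ranking because its prediction is not confident enough; whether such cases should be dropped or routed to a human reviewer is a design choice the abstract does not specify.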