Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Robert K. Chun
Second Advisor
Sayma Akther
Third Advisor
Nava Prashanth Kakarla
Keywords
Code Smell Identification, CodeBERT, Machine Learning, Deep Learning, Software Quality, Software Engineering
Abstract
This project explores automatic code smell detection in Java code using transformer-based embeddings and machine learning classifiers. I use CodeBERT, a pre-trained transformer model, to extract semantic features from code snippets and evaluate the efficacy of Random Forest and Neural Network classifiers. The study uses the SmellyCode++ dataset, which contains 107,554 Java code examples covering four types of code smells: Long Method, God Class, Feature Envy, and Data Class. My methodology comprises extracting 768-dimensional embeddings with CodeBERT, training two separate classifiers, and assessing their performance on a balanced subset of 5,000 samples. I report accuracy, precision, recall, F1-score, and confusion matrices, and verify generalization by applying the models to real-world Java examples. Class imbalance is addressed through stratified sampling and a balanced training split. Both models performed similarly, with an accuracy of around 78%. The results show that CodeBERT embeddings capture semantic patterns in code structure well. This makes code smell detection feasible and provides a baseline for future multi-label extensions and large-scale deployment. Beyond metrics, I record preprocessing decisions, hyperparameter configurations, and inference procedures to ensure complete reproducibility, and I address potential threats to validity. I also outline integration pathways into IDEs and CI systems, explore opportunities for explainability through token-level saliency, and describe remediation workflows, positioning this work as a scalable foundation for multi-language, repository-level analysis.
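The balanced-subset and stratified-split steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the project's actual code: the function names `balanced_subset` and `stratified_split`, the toy class counts, the 20% test fraction, and the placeholder strings standing in for code snippets are all assumptions made for illustration. Only the four smell labels come from the abstract.

```python
import random
from collections import defaultdict


def group_by_label(samples):
    """Group (snippet, label) pairs by their label."""
    groups = defaultdict(list)
    for item in samples:
        groups[item[1]].append(item)
    return groups


def balanced_subset(samples, per_class, seed=0):
    """Draw the same number of samples from every label (hypothetical helper)."""
    rng = random.Random(seed)
    subset = []
    for label, group in sorted(group_by_label(samples).items()):
        if len(group) < per_class:
            raise ValueError(f"not enough samples for label {label!r}")
        subset.extend(rng.sample(group, per_class))
    rng.shuffle(subset)
    return subset


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split so that each label keeps the same train/test proportion."""
    rng = random.Random(seed)
    train, test = [], []
    for _, group in sorted(group_by_label(samples).items()):
        group = list(group)
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy, deliberately imbalanced data; these counts are invented for the demo.
toy_counts = {"Long Method": 500, "God Class": 200,
              "Feature Envy": 120, "Data Class": 100}
toy_samples = [(f"snippet_{label}_{i}", label)
               for label, n in toy_counts.items() for i in range(n)]

subset = balanced_subset(toy_samples, per_class=100)
train, test = stratified_split(subset, test_fraction=0.2)
print(len(subset), len(train), len(test))
```

In a full pipeline, each snippet in `train` and `test` would then be mapped to its 768-dimensional CodeBERT embedding before being passed to a classifier; that step is omitted here because it depends on model weights and tooling not described in the abstract.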
Recommended Citation
Shaik, Sohail, "Semantic and Structural Fusion for Code Smell Detection Using CodeBERT and Random Forest" (2025). Master's Projects. 1610.
DOI: https://doi.org/10.31979/etd.a93q-gq7d
https://scholarworks.sjsu.edu/etd_projects/1610