Publication Date
Fall 2021
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Katerina Potika
Second Advisor
Mark Stamp
Third Advisor
Thomas Austin
Keywords
code2vec, code analysis, programming language labeling
Abstract
Software development is an expensive and difficult process. Mistakes are easily made, and without an extensive review process, those mistakes can make it into production code and may have unintended, disastrous consequences.
This is why various automated code review services have arisen in recent years, from AWS’s CodeGuru and Microsoft’s Code Analysis to more integrated code assistants, like IntelliCode and autocompletion tools. All of these are designed to assist developers with their work and help catch overlooked bugs.
Thanks to recent advances in machine learning, these services have grown tremendously in sophistication to a point where they can catch bugs that often go unnoticed even with traditional code reviews.
This project investigates the use of code2vec [1], which is a probabilistic machine learning model on source code, in correctly labeling methods from different programming language families. We extend this model to work with more languages, train the created models, and compare the performance of static and dynamic languages.
As a by-product, we create new datasets from the top-starred open source GitHub projects in various languages. Different approaches for static and dynamic languages are applied, as well as some improvement techniques, like transfer learning. Finally, different parsers are used to see their effect on the model’s performance.
Recommended Citation
Elsaid, Sherif, "The Impact of Programming Language’s Type on Probabilistic Machine Learning Models" (2021). Master's Projects. 1050.
DOI: https://doi.org/10.31979/etd.ferw-3a7j
https://scholarworks.sjsu.edu/etd_projects/1050
Included in
Artificial Intelligence and Robotics Commons, Programming Languages and Compilers Commons