Publication Date

Fall 2021

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Katerina Potika

Second Advisor

Mark Stamp

Third Advisor

Thomas Austin

Keywords

code2vec, code analysis, programming language labeling

Abstract

Software development is an expensive and difficult process. Mistakes are easily made, and without an extensive review process, they can reach production code, where they may have unintended and disastrous consequences.

This is why various automated code review services have arisen in recent years, from AWS’s CodeGuru and Microsoft’s Code Analysis to more integrated code assistants such as IntelliCode and autocompletion tools, all designed to assist developers with their work and catch overlooked bugs.

Thanks to recent advances in machine learning, these services have grown tremendously in sophistication, to the point where they can catch bugs that often go unnoticed even in traditional code reviews.

This project investigates the use of code2vec [1], a probabilistic machine learning model of source code, for correctly labeling methods from different programming language families. We extend this model to work with more languages, train the resulting models, and compare the performance of static and dynamic languages.

As a by-product, we create new datasets from the top-starred open-source GitHub projects in various languages. Different approaches for static and dynamic languages are applied, along with improvement techniques such as transfer learning. Finally, different parsers are used to examine their effect on the model’s performance.
