Code authorship attribution using content-based and non-content-based features

Publisher

Lethbridge, Alta. : University of Lethbridge, Dept. of Mathematics and Computer Science

Abstract

Machine learning approaches are widely used in natural language analysis. Previous research has shown that similar techniques can be applied to the analysis of computer programming (artificial) languages. In this thesis, we focus on identifying the authors of computer programs using machine learning techniques. We extend these techniques to determine which features capture an author's writing style when a computer program is classified by its author's identity. We then propose a novel approach to computer program author identification, in which features extracted from the program text are combined with the authors' sociological features (gender and region) to build the classification model. Several experiments were conducted on two datasets of computer programs written in C++, and the results are encouraging: according to the experiments, the author's identity can be predicted with 75% accuracy.
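
As a rough illustration of how text-derived program features might be combined with sociological features in a single classifier input, the sketch below concatenates a few stylistic measurements of a C++ source string with one-hot encodings of gender and region, then assigns an author with a nearest-centroid rule. Everything in it is an assumption made for illustration: the specific features, the category lists, the function names, and the nearest-centroid classifier are hypothetical and are not taken from the thesis, which does not state its feature set or model in this abstract. It also assumes the sociological attributes are available for every sample.

```python
import math
from collections import Counter


def style_features(source):
    """Toy stylistic features of a C++ source string (hypothetical choices)."""
    lines = source.splitlines() or [""]
    n = len(lines)
    chars = max(len(source), 1)
    return [
        sum(len(line) for line in lines) / n,                  # mean line length
        sum(1 for line in lines if not line.strip()) / n,      # blank-line ratio
        sum(1 for line in lines if line.startswith("\t")) / n,  # tab-indented ratio
        source.count("//") / n,                                # "//" markers per line
        source.count(" ") / chars,                             # space density
    ]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def combined_vector(source, gender, region, genders, regions):
    """Concatenate text-derived features with sociological features."""
    return style_features(source) + one_hot(gender, genders) + one_hot(region, regions)


def fit_centroids(vectors, labels):
    """Average the feature vectors of each author (nearest-centroid training)."""
    counts = Counter(labels)
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the author whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# Hypothetical category lists and two made-up training samples.
GENDERS = ["female", "male"]
REGIONS = ["north", "south"]
sample_a = "int main() {\n\treturn 0; // done\n}\n"
sample_b = "int main()\n{\n    return 0;\n}\n"

train = [
    combined_vector(sample_a, "female", "north", GENDERS, REGIONS),
    combined_vector(sample_b, "male", "south", GENDERS, REGIONS),
]
centroids = fit_centroids(train, ["alice", "bob"])
guess = predict(centroids, train[0])
```

In practice the raw features would need scaling so that no single measurement, such as mean line length, dominates the distance, and a real evaluation would hold out programs that the model has not seen during training.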
