Classification of computer programming contest programs based on gender, region and software metrics

Zinnat, Sara Binte
University of Lethbridge. Faculty of Arts and Science
Lethbridge, Alta. : University of Lethbridge, Dept. of Mathematics and Computer Science
This research investigates the effect of sociolinguistic characteristics (in particular, gender and region) on computer programs. Previous studies have used machine learning techniques to analyze the relationship between sociolinguistic features and programming language. We collected C++ programs from an open source programming contest website. Features were calculated from three software metrics: lines of code, cyclomatic complexity and Halstead metrics. Using five machine learning algorithms, we trained several models and ran experiments to compare their performance. To investigate the significance of the features, we also carried out statistical and correlation analysis. In the experiments, our models predicted the gender of the programmers with 91.7% accuracy when the programmers solved the same problems, and with 86.4% accuracy when they solved different problems. Our models also classified the region of the programmer with 75.2% accuracy.
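As a rough illustration of the three software metrics named above, the following Python sketch computes lines of code and an approximate cyclomatic complexity for a C++ source string, and evaluates the standard Halstead volume, difficulty and effort formulas from operator and operand counts. It is not the tooling used in the thesis, which the abstract does not describe. All function names here are hypothetical, comment stripping is naive (it ignores string literals), and the decision-point count is a keyword-based approximation rather than a full parse.

```python
import math
import re

# Assumption: decision points are approximated by keywords and short-circuit
# operators; a real tool would parse the C++ source instead.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def strip_comments(source: str) -> str:
    """Remove /* */ and // comments (naive: does not respect string literals)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def lines_of_code(source: str) -> int:
    """Count non-blank lines after comments are removed."""
    return sum(1 for line in strip_comments(source).splitlines() if line.strip())


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + number of decision points."""
    return 1 + len(DECISION_PATTERN.findall(strip_comments(source)))


def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead measures from distinct (n1, n2) and total (N1, N2)
    operator and operand counts. Counting the tokens is left to the caller."""
    volume = (N1 + N2) * math.log2(n1 + n2)
    difficulty = (n1 / 2) * (N2 / n2)
    return {"volume": volume, "difficulty": difficulty, "effort": difficulty * volume}
```

The resulting numbers could serve as one feature vector per program for a classifier; which Halstead measures and which classifiers were actually used is not stated in the abstract.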
artificial intelligence, machine learning, computer programming, programming languages, sociolinguistics (gender, region), web scraping, data mining, classification, statistical analysis, software metrics, Computer programming--Sex differences--Research, Programming languages (Electronic computers)--Syntax--Sex differences--Research, Sociolinguistics--Network analysis, Software measurement