Do sociolinguistic variations exist in programming?

Naz, Fariha
University of Lethbridge. Faculty of Arts and Science
Lethbridge, Alta. : University of Lethbridge, Dept. of Mathematics and Computer Science
Machine learning techniques are widely used in the analysis of natural language. This thesis extends these techniques to the analysis of programming languages. In particular, we investigate whether there are differences in the use of programming languages that might be associated with the authors’ gender. Few studies currently address possible relationships between linguistics and programming. Our dataset consists of computer programs written in the C++ programming language. We also acquired sociolinguistic information about the programmers, focusing especially on gender. We use machine learning and statistical techniques to identify patterns in language use that are consistent for male and female programmers. The results of numerous experiments are encouraging: we demonstrate that we can predict the gender of programmers with 71% accuracy and detect similarities or dissimilarities in their programming style.
computer science, machine learning, sociolinguistics, gender, text mining, computer programs, programming