Abstract

GitHub, the largest platform for open-source software, allows contributors to collaboratively develop software in a variety of programming languages and has motivated extensive research on social coding. However, previous studies have not explored how GitHub communities differ across programming languages. Inspired by the linguistic relativity hypothesis, I hypothesize that different programming languages give rise to distinct patterns of social coding. This research identifies such differences between the Python and Java communities using repository representations (embeddings) learned from a newly constructed dataset. Describing representation learning as a pipeline, this thesis first demonstrates how to generate high-quality repository embeddings from content and contextual data, including source code (imports), readme text, and co-contributor networks. The evaluation results suggest that models derived from Word2Vec, including Doc2Vec, Import2Vec, and Node2Vec, are the most competitive for representing the GitHub data. I then use the best-performing embeddings to explore language-specific patterns through questions about GitHub activity and success. By investigating the consistency between different embedding spaces, I find that the social (contributor) and functional (import, readme) embedding spaces diverge in the Python community but align in the Java community, implying a difference in their socio-functional mapping. Furthermore, the results indicate that functionally similar Python repositories experience more competition and that Python programmers contribute more to functionally diverse repositories than their Java counterparts do.
Afterward, by analyzing the correlations between functional diversity and average popularity among contributed repositories, I find evidence that Python programmers who commit to dissimilar repositories are more likely to be contributors to popular repositories, while Java programmers do not exhibit this pattern. Finally, by comparing embeddings with baseline features, I verify their power to predict repository popularity and find that functional embeddings are beneficial for predicting the popularity of Python repositories, while social embeddings contribute more for Java repositories. These language-related heterogeneities can be attributed to the inherent difference in philosophy between Python and Java: as a language that emphasizes flexibility and reusability, Python values contributors' ability to produce code for other coders with various functional needs, whereas Java emphasizes independence and thoroughness in programming, prioritizing specialized coding for passive end users who value only the performance of the final product.
