Embedding GitHub Repositories: A Comparative Study of the Python and Java Communities

Chen Yutao

doi:10.6082/uchicago.3520

2021

Abstract

GitHub, the largest platform for open-source software, lets contributors collaboratively develop software in a variety of programming languages and has motivated extensive research on social coding. However, previous studies have not explored how GitHub communities differ across programming languages. Inspired by the linguistic relativity hypothesis, I hypothesize that different programming languages give rise to distinct patterns of social coding. This research identifies such differences between the Python and Java communities using repository representations (embeddings) learned from a unique, newly constructed dataset. Describing the representation learning process as a pipeline, this thesis first demonstrates how to generate high-quality repository embeddings from content and contextual data, including source code (imports), readme text, and co-contributor networks. The evaluation results suggest that models derived from Word2Vec, including Doc2Vec, Import2Vec, and Node2Vec, are the most competitive for representing the GitHub data. I then used the best-performing embeddings to explore language-specific patterns through questions about GitHub activity and success. By investigating the consistency between embedding spaces, I identified that the social (contributor) and functional (import, readme) embedding spaces diverge in the Python community but align in the Java community, implying a difference in their socio-functional mapping. Furthermore, the results indicate that functionally similar Python repositories experience more competition and that Python programmers contribute more to functionally diverse repositories than their Java counterparts do.
Afterward, by analyzing the correlation between functional diversity and average popularity among the repositories each programmer contributed to, I found evidence that Python programmers who commit to dissimilar repositories are more likely to be contributors to popular repositories, while Java programmers do not exhibit this pattern. Finally, by comparing embeddings with baseline features, I verified their ability to predict repository popularity and discovered that functional embeddings are beneficial for predicting the popularity of Python repositories, while social embeddings contribute more for Java repositories. These language-related heterogeneities can be attributed to the inherent difference in philosophy between Python and Java: as a language that emphasizes flexibility and reusability, Python values contributors’ ability to produce code for other coders with various functional needs, whereas Java emphasizes independence and thoroughness in programming, prioritizing specialized coding for passive end users who value only the performance of final products.
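
The abstract does not state how a contributor's functional diversity is computed. As a minimal sketch only, one plausible reading is the mean pairwise cosine distance between the functional embeddings of the repositories a programmer has contributed to. The function names and the choice of cosine distance below are assumptions for illustration, not the thesis's method.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def functional_diversity(repo_vectors):
    """Hypothetical diversity score: mean pairwise cosine distance.

    repo_vectors holds one embedding per repository a contributor
    committed to. A contributor with fewer than two repositories has
    no pairs, so the score is defined as 0.0 here (an assumption).
    """
    pairs = list(combinations(repo_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

Under this assumed definition, a contributor whose repositories point in the same direction of the embedding space scores near 0, and one whose repositories are orthogonal scores near 1. Such a per-contributor score could then be correlated with the average popularity of that contributor's repositories.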

Details

Title Embedding GitHub Repositories: A Comparative Study of the Python and Java Communities

Author Chen Yutao : University of Chicago

Degree Type M.A.

Content Type Thesis

Academic Advisor

James Evans

Keywords

GitHub; Social Coding; Representation Learning; Heterogeneity in Programming Languages

Digital Object Identifier https://doi.org/10.6082/uchicago.3520

Publication Date 2021-12-03

Language English

Record Appears in Social Sciences Division > Computational Social Sciences (MACSS)
Social Sciences Division > MA Thesis Archive
All

Record Created 2021-12-03
