GitHub discussion pages

First glance at text-data

Oskar Jarczyk
Powered by R, Slidify, FTW!

Presentation agenda

  1. How I gathered discussions from GitHub,
  2. Getting more insights into what's there,
    • repositories with global buzz as well as small teams
    • all of them coordinate their work through written communication
    • calculated simple yet meaningful statistics
  3. The discussion network,
    • various definitions and multiple subnetworks, but the same data
    • more than one scope of such a network
    • a true social network: people who spoke and/or coded together and/or follow(ed) each other
  4. What I found out, some results,
  5. Summary and Q&A time.

Note: You can hit [P] key on any slide to reveal the source code.

Handouts

Quiz

How would you describe the discussion dimension of the GitHub portal?

  1. A feedback mechanism: a place for discussion, bug tracking and task delegation, accessible through the GitHub web pages. It is a dialogue between developers and/or users held under a code change, code integration or issue submission.
  2. An exchange postbox, available via a newsgroup protocol and created automatically for every GitHub code repository, which you can add to e.g. Microsoft Outlook to automatically receive new messages.
  3. Exchanging messages between programmers strictly through git commit messages and other git features such as git blame.
  4. None of the above.

Programmers on GitHub discuss their work and submit pull requests (for code they created) through the GitHub web interface. In very rare cases, someone may use an external tool to submit comments through the GitHub API.

Answer 1 is correct. Answer 2 is incorrect: although you can receive e-mail notifications about new comments, there is no such thing as a separate newsgroup. Answer 3 is incorrect: while a proper commit title is of course an important part of communication inside a team, the main action happens on the GitHub pages, under the particular commit / issue / pull request pages.

Where does the discussion take place?

What you need to know

  • Discussions, like wiki pages and README files, use Markdown syntax
  • They can also contain emoji placed between colons, e.g. :+1:
  • You can cite somebody or mention them by login (with the '@' character)
  • Keep in mind that a discussion always occurs under a code change, an integration request (pull request) or an issue submission
  • You can paste code blocks and other objects supported by Markdown syntax
  • Detecting quoting is tricky, but in most cases it is recognized by this pattern:
    • starts with: "On [date] user <e-mail> wrote:"
    • a '>' character starting every line in the quote
  • A discussion can be in any language used worldwide and is always (in technical terms) properly formatted UTF-8 text
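The quoting heuristic above can be sketched as a small splitter; the regexes below are an assumption based only on the two slide rules (header line, '>' prefix), not a full parser for real GitHub e-mail quotes:

```python
import re

# Heuristics from the slide: a quote is introduced by a line like
# "On [date] user <e-mail> wrote:" and every quoted line starts with '>'.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")
QUOTE_LINE = re.compile(r"^\s*>")

def split_quotes(body):
    """Split a raw comment body into (own_lines, quoted_lines)."""
    own, quoted = [], []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or QUOTE_LINE.match(line):
            quoted.append(line)
        else:
            own.append(line)
    return own, quoted
```

On a body such as "Thanks!\nOn Mon, alice <a@x.com> wrote:\n> fix this\nDone." this keeps "Thanks!" and "Done." as the author's own text and marks the other two lines as quoted.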

Possible data-sources

Hence 3 types of discussions: a dialogue under an issue/feature page, a dialogue under a code commit, and a dialogue under a pull request page. You can get them from:

GitHub Archive

  • document-oriented JSON data, document structure (MongoDB), vulnerable to constant change
  • unlimited time periods and easy refill from the GitHub Archive (wget -> mongoimport)
  • includes the full body of pull request comments only; the remaining utterances must be downloaded directly from GitHub through the web
  • includes much more meta-information!

GitHub Torrent

  • table-structured relational data, well organized: stored in defined MySQL database tables
  • time period limited to the date of the last SQL dump
  • includes the body (message) of pull request comments and commit comments only; the body is truncated to 256 characters
  • includes additional information about the timestamp, author, and commit id / pull request id, but nothing more in a particular table

Possible data-sources (2)

For rapid data analysis you can try Google BigQuery: the GitHub Archive data (called timeline) is already loaded into its fast engine and queried with the proprietary BigQuery language.

Alternatively, you can use the GitHub Torrent MySQL web interface, but don't expect good reliability or performance.

Web scraping always requires executing JavaScript.

The best option, of course, is a local copy of the GHA / GHT data.

When using MySQL, set the encoding collation to utf8_unicode_ci.

Creating discussion network

  1. I made a union of Commit Comments and Pull Request Comments in the MySQL instance of GitHub Torrent
  2. Loaded it into R and joined with GitHub users

    dialogues_n_users <- sqldf("select d.*, u.login from dialogues d 
                              join users u on d.user_id = u.id");
    > nrow(dialogues_n_users)
    [1] 2923703
    
  3. Counted contributions per discussion page

    aggregates <- sqldf("select commit_id, login, count(login) as n 
                       from dialogues_n_users d group by commit_id, login");
    > nrow(aggregates)
    [1] 1273025
    

Creating discussion network (2)

> summary(aggregates[aggregates$n < 50, c('n')])
   Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
   1.000    1.000    1.000    2.244    2.000    49.000

aggregates-typical-number-of-utt-under-dialogue.png

Creating discussion network (3)

4. Count how many times users spoke with each other (under any discussion page, across the whole of GitHub)

activity_network <- sqldf("SELECT a1.login as login1, a2.login as login2,
                              a1.n as c1, a2.n as c2
                              FROM aggregates a1
                              LEFT JOIN aggregates a2
                              ON a1.commit_id = a2.commit_id
                              WHERE a1.login < a2.login;")
network <- sqldf("SELECT login1, login2, count(*) as weight 
                  from activity_network group by login1, login2")
network_matrix <- as.matrix(network)

Creating discussion network (4)

4. Count how many times users spoke with each other (under any discussion page, across the whole of GitHub)

> summary(network[network$weight < 500, c('weight')])
   Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
   1.000    1.000    1.000    1.243    1.000    431.000
> nrow(network)
[1] 1340327

5. The result is a matrix of unique pairs (Vx, Vy)
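The self-join logic behind the weighted pair list can also be expressed in plain Python: for every discussion page, each unordered pair of distinct participants contributes one co-occurrence. A minimal sketch, assuming rows of (commit_id, login) as in the aggregates table:

```python
from itertools import combinations
from collections import defaultdict

def build_network(rows):
    """rows: iterable of (commit_id, login), one per speaker per page.
    Returns {(login1, login2): weight} with login1 < login2, mirroring
    the WHERE a1.login < a2.login clause of the SQL self-join."""
    participants = defaultdict(set)
    for commit_id, login in rows:
        participants[commit_id].add(login)
    weights = defaultdict(int)
    for logins in participants.values():
        # every unordered pair of participants co-occurred once here
        for a, b in combinations(sorted(logins), 2):
            weights[(a, b)] += 1
    return dict(weights)
```

With rows [("c1","bob"), ("c1","ann"), ("c1","cid"), ("c2","ann"), ("c2","bob")] the pair ("ann", "bob") gets weight 2 and the other pairs weight 1.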

Creating discussion network (5)

g = graph.edgelist(network_matrix[,1:2],directed=FALSE)
E(g)$weight=as.numeric(network_matrix[,3])
# Don't call plot with more than 1 million nodes; reduce the graph first
# plot(g,layout=layout.fruchterman.reingold,edge.width=E(g)$weight)

Network schema

  • Discussion network
Graph[{B \[UndirectedEdge] A}]

  • Citing network
  • Mentioning network
  • (both can be combined to one network)
Graph[{B \[DirectedEdge] A}]

Network properties

> summary(dialogues_n_users)

> summary(network_matrix)

Network properties (2)

Calculating any properties at all in a reasonable amount of time proved unrealistic. Even when using the cutoff parameter (where possible), I got nothing.

library(igraph)
mycutoff <- 3

betweenness(g, directed = FALSE, weights = E(g)$weight, normalized = FALSE)
edge.betweenness(g, directed = FALSE, weights = E(g)$weight)
betweenness.estimate(g, directed = FALSE, cutoff = mycutoff,
                     weights = E(g)$weight, nobigint = TRUE)
walktrap.community(g, weights = E(g)$weight, steps = 4, merges =
                      TRUE, modularity = TRUE, membership = TRUE)
fastgreedy.community(g, merges=TRUE, modularity=TRUE,
                     membership=TRUE, weights=E(g)$weight)

(Smaller) sample discussion network

create table github_discussions_selected_users as select d.* from github_discussions d 
join (select distinct user_id from project_members_with_owners pm join 
selected_repos sr on pm.repo_id = sr.repo_id) as s
on d.user_id = s.user_id;
  • selected_repos is a sample joined with the list of users (project members), restricted to repositories:
    • having at least 5 project members
    • existing for at least 2 years
    • with a minimum of 100 commits
  • The number of utterances dropped from 2.9 million to 1.5 million

Network properties

  • One of the most important measures in graph theory is the clustering coefficient of a network, sometimes called transitivity (especially in R),
  • It measures the degree to which nodes in a graph tend to cluster together.

coefficient.png
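The global clustering coefficient (what R's igraph calls transitivity) is the ratio of closed connected triplets to all connected triplets. A stdlib-only sketch over a hypothetical adjacency-list input, equivalent to 3 * triangles / triplets:

```python
from itertools import combinations

def transitivity(adj):
    """Global clustering coefficient of an undirected graph given as
    {node: set_of_neighbours}."""
    closed = total = 0
    for v, nbrs in adj.items():
        # each pair of v's neighbours is one connected triplet centred at v
        for a, b in combinations(nbrs, 2):
            total += 1
            if b in adj[a]:   # the third edge exists: triplet is closed
                closed += 1
    return closed / total if total else 0.0
```

A triangle gives 1.0; a simple path 1-2-3 gives 0.0, since its only triplet (centred at node 2) is open.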

Swear words

library(sqldf)
library(plyr)
library(rCharts)

swear_words <- read.csv("C:/big data/swearing.csv")
names(swear_words)[1] <- "entry"
swear_words$entry <- as.character(swear_words$entry)
# famous_word holds one entry from swear_words$entry; R needs paste0, not '+'
swear_matched <- sqldf(paste0("select * from dialogues_n_users where body like '%",
                              famous_word, "%';"))
short.date = strftime(swear_matched$created_at, "%Y/%m")
count_swear_dates <- count(short.date)

n1 <- rPlot(freq ~ x, data = count_swear_dates, type = "point")

Swear words (2)

load(file="swear_plot_obj.RData")
# n1$print("chart_swear_words")

NLP analysis

from textblob import TextBlob
text = gracefully_degrade_to_ascii(remove_control_characters(line[2]))
blob = TextBlob(text)                          # wrap the cleaned text
sentiment = blob.sentiment.polarity            # in [-1, 1]
sentiment_subj = blob.sentiment.subjectivity   # in [0, 1]
  • Utterances with positive sentiment: 1018370
  • Utterances with negative sentiment: 1905329
  • Mean sentiment for positive: 0.297915856077
  • Mean sentiment for negative: -0.0507566802789
  • Utterances with subjectivity: 1708393
  • Utterances with zero (no) subjectivity: 1215306
  • Mean subjectivity (for > 0): 0.502911222599

NLP analysis (2)

Sentiment through time

Multilayer analysis

  1. muxViz - visualization of interconnected multilayer networks
    • additionally allows e.g. multilayer centrality analysis
  2. Different algorithms for connecting layers
    • the simplest is (n-x/n)
    • it means an edge present in at least 2 of 3 layers is required
  3. It is possible to detect communities in such multilayer networks
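The layer-combination rule above (keep an edge only if it appears in at least 2 of the 3 layers) can be sketched as follows; the layer names are hypothetical examples matching the networks discussed earlier:

```python
from collections import Counter

def combine_layers(layers, min_layers=2):
    """layers: list of edge sets, each edge a sorted (u, v) tuple.
    Keep edges present in at least min_layers of the layers."""
    counts = Counter(edge for layer in layers for edge in layer)
    return {edge for edge, n in counts.items() if n >= min_layers}

# e.g. discussion, citing and mentioning layers over the same user set
discussion = {("ann", "bob"), ("bob", "cid")}
citing     = {("ann", "bob")}
mentioning = {("ann", "cid")}
```

Here only ("ann", "bob") survives: it is the only edge present in two or more layers.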

Recognizing dialogue acts

  1. In short: every utterance has 1 or more dialogue acts (a type of speech classification)
  2. It's a microtask for a human, but we can tag a representative dataset and apply supervised machine learning to classify the rest of the dialogues
  3. Apply the tf-idf algorithm to select a maximally diverse set of utterances, e.g. with sklearn: from sklearn.feature_extraction.text import TfidfVectorizer
  4. Tag them with a team of at least 2 people and ask a 3rd person (a judge) to resolve doubts.
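The tf-idf scoring in step 3 can be sketched without sklearn; this shows only the idea (raw counts and a log idf), not the actual TfidfVectorizer API or its smoothing defaults:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenised doc."""
    tokenised = [doc.lower().split() for doc in docs]
    # document frequency: in how many docs does each term occur
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

A simple diversity heuristic is then to greedily pick utterances whose vectors have low cosine similarity to those already selected; terms that occur in every document get weight 0 and stop influencing the choice.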

Text annotation tools

My classification of utterances

  • Based mostly on Ferschke et al. (2012),
  • All of the acts have a Wikipedia equivalent
  • List of dialogue acts divided by category: contribution criticism, explicit informative, information content, interpersonal

Some more or less important findings

  • there are short utterances, consisting of 1 or 2 words, like: 'meh', 'nvm', 'remove *', 'merge', 'blah', 'test', 'no description', '....'
  • the distributions differ between issues and commits with pull requests, i.e. issues are less likely to contain contribution criticism, while pull requests have a lot of it

First results from tagging

  • most of the acts are explicit informative
  • some of the acts are rarely used; reconsider them to avoid overfitting

End

Q & A time