GitHub discussion pages

First glance at text-data

Oskar Jarczyk
Powered by R, Slidify, FTW!

Presentation agenda

  1. How I gathered discussions from GitHub,
  2. Getting more insights into what's there,
    • repositories with global buzz as well as small teams
    • all of them coordinate their work through written communication
    • calculated simple yet meaningful statistics
  3. The discussion network,
    • various definitions and multiple subnetworks, but the same data
    • more than one scope of such a network
    • a true social network: people who spoke and/or coded together and/or follow(ed) each other
  4. What I found out, some results,
  5. Summary and Q&A time.

Note: You can hit [P] key on any slide to reveal the source code.

Handouts

Quiz

How would you describe the discussion dimension of the GitHub portal?

  1. A feedback mechanism: a place for discussion, bug tracking and task delegation, accessible through the GitHub web pages. It is a dialogue between developers and/or users held under a code change, code integration or issue submission.
  2. An exchange postbox, available via a newsgroup protocol and created automatically for every GitHub code repository, which you can add to e.g. Microsoft Outlook to automatically receive new messages.
  3. Exchanging messages between programmers strictly through git commit messages and other git features such as git blame.
  4. None of the above.

Programmers on GitHub discuss their work and submit pull requests (for code they created) through the GitHub web interface. In very rare cases, someone may use an external tool to submit comments through the GitHub API.

Answer 1 is correct. Answer 2 is incorrect: although you can receive e-mail notifications about new comments, there is no such thing as a separate newsgroup. Answer 3 is incorrect: while a proper commit title is of course an important part of communication inside a team, the main action happens on the GitHub pages, under the particular commit / issue / pull request pages.

Where does the discussion take place?

What you need to know

  • Discussions, like wiki pages and README files, use Markdown syntax
  • They can also contain emoji placed between colons, e.g. :+1:
  • You can cite somebody or mention them by login (with the '@' character)
  • Keep in mind that a discussion always occurs under a code change, an integration request (pull request) or an issue submission
  • You can paste code blocks and other objects supported by Markdown syntax
  • Detecting quoting is tricky, but in most cases it is recognized by this pattern:
    • starts with: "On [date] user <e-mail> wrote:"
    • a '>' character starting every line in the quote
  • A discussion can be in any language used worldwide and is always (in technical terms) properly formatted UTF-8 text
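The quoting heuristic above can be sketched as a small splitter; the regexes below are an assumption based only on the two slide rules (header line, '>' prefix), not a full parser for real GitHub e-mail quotes:

```python
import re

# Heuristics from the slide: a quote is introduced by a line like
# "On [date] user <e-mail> wrote:" and every quoted line starts with '>'.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")
QUOTE_LINE = re.compile(r"^\s*>")

def split_quotes(body):
    """Split a raw comment body into (own_lines, quoted_lines)."""
    own, quoted = [], []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or QUOTE_LINE.match(line):
            quoted.append(line)
        else:
            own.append(line)
    return own, quoted
```

On a body such as "Thanks!\nOn Mon, alice <a@x.com> wrote:\n> fix this\nDone." this keeps "Thanks!" and "Done." as the author's own text and marks the other two lines as quoted.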

Possible data-sources

Hence 3 types of discussions: a dialogue under an issue/feature page, a dialogue under a code commit, and a dialogue under a pull request page. You can get them from:

GitHub Archive

  • document-oriented JSON data, document structure (MongoDB), vulnerable to constant change
  • unlimited time periods and easy refill from the GitHub Archive (wget -> mongoimport)
  • includes the full body of pull request comments only; the remaining utterances must be downloaded directly from GitHub through the web
  • includes much more meta-information!

GitHub Torrent

  • table-structured relational data, well organized: stored in defined MySQL database tables
  • time period limited to the date of the last SQL dump
  • includes the body (message) of pull request comments and commit comments only; the body is truncated to 256 characters
  • includes additional information about the timestamp, author, and commit id / pull request id, but nothing more in a particular table

Possible data-sources (2)

For rapid data analysis you can try Google BigQuery: the GitHub Archive data (called timeline) is already loaded into its fast engine and queried with the proprietary BigQuery language.

Alternatively, you can use the GitHub Torrent MySQL web interface, but don't expect good reliability or performance.

Web scraping always requires executing JavaScript.

The best option, of course, is a local copy of the GHA / GHT data.

When using MySQL, set the encoding collation to utf8_unicode_ci.

Creating discussion network

  1. I made a union of Commit Comments and Pull Request Comments in the MySQL instance of GitHub Torrent
  2. Loaded it into R and joined with GitHub users

    dialogues_n_users <- sqldf("select d.*, u.login from dialogues d 
                              join users u on d.user_id = u.id");
    > nrow(dialogues_n_users)
    [1] 2923703
    
  3. Counted contributions per discussion page

    aggregates <- sqldf("select commit_id, login, count(login) as n 
                       from dialogues_n_users d group by commit_id, login");
    > nrow(aggregates)
    [1] 1273025
    

Creating discussion network (2)

> summary(aggregates[aggregates$n < 50, c('n')])
   Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
   1.000    1.000    1.000    2.244    2.000    49.000

aggregates-typical-number-of-utt-under-dialogue.png

Creating discussion network (3)

4. Count how many times users spoke with each other (under any discussion page, across the whole of GitHub)

activity_network <- sqldf("SELECT a1.login as login1, a2.login as login2,
                              a1.n as c1, a2.n as c2
                              FROM aggregates a1
                              LEFT JOIN aggregates a2
                              ON a1.commit_id = a2.commit_id
                              WHERE a1.login < a2.login;")
network <- sqldf("SELECT login1, login2, count(*) as weight 
                  from activity_network group by login1, login2")
network_matrix <- as.matrix(network)

Creating discussion network (4)

4. Count how many times users spoke with each other (under any discussion page, across the whole of GitHub)

> summary(network[network$weight < 500, c('weight')])
   Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
   1.000    1.000    1.000    1.243    1.000    431.000
> nrow(network)
[1] 1340327

5. The result is a matrix of unique pairs (Vx, Vy)
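The self-join logic behind the weighted pair list can also be expressed in plain Python: for every discussion page, each unordered pair of distinct participants contributes one co-occurrence. A minimal sketch, assuming rows of (commit_id, login) as in the aggregates table:

```python
from itertools import combinations
from collections import defaultdict

def build_network(rows):
    """rows: iterable of (commit_id, login), one per speaker per page.
    Returns {(login1, login2): weight} with login1 < login2, mirroring
    the WHERE a1.login < a2.login clause of the SQL self-join."""
    participants = defaultdict(set)
    for commit_id, login in rows:
        participants[commit_id].add(login)
    weights = defaultdict(int)
    for logins in participants.values():
        # every unordered pair of participants co-occurred once here
        for a, b in combinations(sorted(logins), 2):
            weights[(a, b)] += 1
    return dict(weights)
```

With rows [("c1","bob"), ("c1","ann"), ("c1","cid"), ("c2","ann"), ("c2","bob")] the pair ("ann", "bob") gets weight 2 and the other pairs weight 1.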

Creating discussion network (5)

g = graph.edgelist(network_matrix[,1:2],directed=FALSE)
E(g)$weight=as.numeric(network_matrix[,3])
# Don't call plot with more than 1 million nodes; reduce the graph first
# plot(g,layout=layout.fruchterman.reingold,edge.width=E(g)$weight)

Network schema

  • Discussion network
Graph[{B \[UndirectedEdge] A}]

  • Citing network
  • Mentioning network
  • (both can be combined to one network)
Graph[{B \[DirectedEdge] A}]

Network properties

> summary(dialogues_n_users)

> summary(network_matrix)

Network properties (2)

Calculating any properties at all in a reasonable amount of time proved unrealistic. Even when using the cutoff parameter (where possible), I got nothing.

library(igraph)
mycutoff <- 3

betweenness(g, directed = FALSE, weights = E(g)$weight, normalized = FALSE)
edge.betweenness(g, directed = FALSE, weights = E(g)$weight)
betweenness.estimate(g, directed = FALSE, cutoff = mycutoff,
                     weights = E(g)$weight, nobigint = TRUE)
walktrap.community(g, weights = E(g)$weight, steps = 4, merges =
                      TRUE, modularity = TRUE, membership = TRUE)
fastgreedy.community(g, merges=TRUE, modularity=TRUE,
                     membership=TRUE, weights=E(g)$weight)

(Smaller) sample discussion network

create table github_discussions_selected_users as select d.* from github_discussions d 
join (select distinct user_id from project_members_with_owners pm join 
selected_repos sr on pm.repo_id = sr.repo_id) as s
on d.user_id = s.user_id;
  • selected_repos is a sample joined with the list of users (project members), restricted to repositories:
    • having at least 5 project members
    • existing for at least 2 years
    • with a minimum of 100 commits
  • The number of utterances dropped from 2.9 million to 1.5 million

Network properties

  • One of the most important measures in graph theory is the clustering coefficient of a network, sometimes called transitivity (especially in R),
  • It measures the degree to which nodes in a graph tend to cluster together.

coefficient.png
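The global clustering coefficient (what R's igraph calls transitivity) is the ratio of closed connected triplets to all connected triplets. A stdlib-only sketch over a hypothetical adjacency-list input, equivalent to 3 * triangles / triplets:

```python
from itertools import combinations

def transitivity(adj):
    """Global clustering coefficient of an undirected graph given as
    {node: set_of_neighbours}."""
    closed = total = 0
    for v, nbrs in adj.items():
        # each pair of v's neighbours is one connected triplet centred at v
        for a, b in combinations(nbrs, 2):
            total += 1
            if b in adj[a]:   # the third edge exists: triplet is closed
                closed += 1
    return closed / total if total else 0.0
```

A triangle gives 1.0; a simple path 1-2-3 gives 0.0, since its only triplet (centred at node 2) is open.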

Swear words

library(sqldf)
library(plyr)
library(rCharts)

swear_words <- read.csv("C:/big data/swearing.csv")
names(swear_words)[1] <- "entry"
swear_words$entry <- as.character(swear_words$entry)
# famous_word holds one entry from swear_words$entry; R needs paste0, not '+'
swear_matched <- sqldf(paste0("select * from dialogues_n_users where body like '%",
                              famous_word, "%';"))
short.date = strftime(swear_matched$created_at, "%Y/%m")
count_swear_dates <- count(short.date)

n1 <- rPlot(freq ~ x, data = count_swear_dates, type = "point")

Swear words (2)

load(file="swear_plot_obj.RData")
# n1$print("chart_swear_words")

NLP analysis

from textblob import TextBlob
text = gracefully_degrade_to_ascii(remove_control_characters(line[2]))
blob = TextBlob(text)                          # wrap the cleaned text
sentiment = blob.sentiment.polarity            # in [-1, 1]
sentiment_subj = blob.sentiment.subjectivity   # in [0, 1]
  • Utterances with positive sentiment: 1018370
  • Utterances with negative sentiment: 1905329
  • Mean sentiment for positive: 0.297915856077
  • Mean sentiment for negative: -0.0507566802789
  • Utterances with subjectivity: 1708393
  • Utterances with zero (no) subjectivity: 1215306
  • Mean subjectivity (for > 0): 0.502911222599

NLP analysis (2)

Sentiment through time

Multilayer analysis

  1. muxViz - visualization of interconnected multilayer networks
    • additionally allows e.g. multilayer centrality analysis
  2. Different algorithms for connecting layers
    • the simplest is (n-x/n)
    • it means an edge present in at least 2 of 3 layers is required
  3. It is possible to detect communities in such multilayer networks
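The layer-combination rule above (keep an edge only if it appears in at least 2 of the 3 layers) can be sketched as follows; the layer names are hypothetical examples matching the networks discussed earlier:

```python
from collections import Counter

def combine_layers(layers, min_layers=2):
    """layers: list of edge sets, each edge a sorted (u, v) tuple.
    Keep edges present in at least min_layers of the layers."""
    counts = Counter(edge for layer in layers for edge in layer)
    return {edge for edge, n in counts.items() if n >= min_layers}

# e.g. discussion, citing and mentioning layers over the same user set
discussion = {("ann", "bob"), ("bob", "cid")}
citing     = {("ann", "bob")}
mentioning = {("ann", "cid")}
```

Here only ("ann", "bob") survives: it is the only edge present in two or more layers.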

Recognizing dialogue acts

  1. In short: every utterance has 1 or more dialogue acts (a type of speech classification)
  2. It's a microtask for a human, but we can tag a representative dataset and apply supervised machine learning to classify the rest of the dialogues
  3. Apply the tf-idf algorithm to select a maximally diverse set of utterances, e.g. with sklearn: from sklearn.feature_extraction.text import TfidfVectorizer
  4. Tag them with a team of at least 2 people and ask a 3rd person (a judge) to resolve doubts.
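The tf-idf scoring in step 3 can be sketched without sklearn; this shows only the idea (raw counts and a log idf), not the actual TfidfVectorizer API or its smoothing defaults:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenised doc."""
    tokenised = [doc.lower().split() for doc in docs]
    # document frequency: in how many docs does each term occur
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

A simple diversity heuristic is then to greedily pick utterances whose vectors have low cosine similarity to those already selected; terms that occur in every document get weight 0 and stop influencing the choice.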

Text annotation tools

My classification of utterances

  • Based mostly on Ferschke et al. (2012),
  • All of the acts have a Wikipedia equivalent
  • List of dialogue acts divided by category: contribution criticism, explicit informative, information content, interpersonal

Some more or less important findings

  • there are short utterances, consisting of 1 or 2 words, like: 'meh', 'nvm', 'remove *', 'merge', 'blah', 'test', 'no description', '....'
  • the distributions differ between issues and commits with pull requests, i.e. issues are less likely to contain contribution criticism, while pull requests have a lot of it

First results from tagging

  • most of the acts are explicit informative
  • some of the acts are rarely used; reconsider them to avoid overfitting

End

Q & A time