
Identifying contributors in FLOSS projects

As part of my graduate work, I need to analyze FLOSS repositories to identify the number of external contributors. By an external contributor I mean any individual who contributed a patch without having commit access to the source code repository and who is also a first-time contributor.

To identify contributors in general, I usually parse the commit logs for any attribution to individuals who are not committers. Take, for example, the following log message:

Fixed #9859 – Added another missing force_unicode needed in admin when running on Python 2.3. Many thanks for report & patch to nfg. - (Django Revision 9656)

I wrote some regex-based scripts to identify names or pseudonyms such as “nfg” in the previous example.
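
To give a rough idea of what such a script looks like, here is a minimal sketch. It is not the actual script: the pattern and the helper name find_attributions are illustrative assumptions, and the pattern only covers "thanks ... to NAME" and "patch from/by NAME" phrasing, while real logs word attributions in many more ways.

```python
import re

# Hypothetical pattern (an assumption, not the original script):
# matches "thanks ... to NAME" within one sentence, or "patch from/by NAME".
ATTRIBUTION = re.compile(
    r"(?:thanks\b[^.]*?\bto|patch\s+(?:from|by))\s+([\w.@-]+)",
    re.IGNORECASE,
)


def find_attributions(message, committers=()):
    """Return names credited in a log message, excluding known committers."""
    # Strip the sentence-ending period that the name pattern may swallow.
    names = [name.rstrip(".") for name in ATTRIBUTION.findall(message)]
    return [name for name in names if name not in committers]


log = ("Fixed #9859 -- Added another missing force_unicode needed in admin "
       "when running on Python 2.3. Many thanks for report & patch to nfg.")
print(find_attributions(log))
```

Passing the list of known committers as the second argument drops credits that go to people who already have commit access.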

Things, however, are not always clear-cut for FLOSS projects, as not all projects attribute contributors in the log message. For example, I noticed that the MapServer project, which seems to be actively developed, had no attributions in its log messages. After I inquired on IRC, it turned out that the attributions are available in the project tracker (thanks danmo!). What the commit log message does include is a reference to the ticket number.

So I rolled up my sleeves and wrote a quick parser to identify all ticket numbers in the log messages. I then used httplib2 and BeautifulSoup to connect to the project tracker and parse the patch name and contributor. The following is the code I used to perform that task:

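import httplib2
from BeautifulSoup import BeautifulSoup as BS


def get_mapserver_author(ticket):
    """Return (contributor, patch name) pairs for a ticket's attachments."""
    url = 'http://trac.osgeo.org/mapserver/ticket/%s' % ticket

    # Cache responses on disk so repeated runs do not hit the tracker again.
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")

    bs = BS(content)
    div = bs.find('div', id='attachments')
    patches = []
    if div is None:
        # The ticket page has no attachments section.
        return patches
    for y in div.findAll('dt'):
        try:
            # <em> holds the uploader's name, <a> the attachment file name.
            patches.append((y.em.string, y.a.string))
        except AttributeError:
            print('Problem parsing ticket %s' % ticket)
    return patches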
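
The ticket-number parser that feeds this function is not shown above. A minimal sketch of that step could look like the following; it assumes tickets are referenced as "#1234" or "ticket 1234", the helper name find_ticket_numbers is hypothetical, and the sample messages are made up for illustration.

```python
import re

# Assumption: tickets appear in log messages as "#1234" or "ticket 1234".
TICKET_REF = re.compile(r"(?:#|\bticket\s+)(\d+)", re.IGNORECASE)


def find_ticket_numbers(log_messages):
    """Return the sorted, de-duplicated ticket numbers found in the messages."""
    tickets = set()
    for message in log_messages:
        tickets.update(int(number) for number in TICKET_REF.findall(message))
    return sorted(tickets)


# Made-up log messages, for illustration only.
sample = [
    "Fixed crash in WMS layer (#2570)",
    "See ticket 2891 and #2570",
    "Updated the history file",
]
print(find_ticket_numbers(sample))
```

Each number returned can then be passed to get_mapserver_author to look up the attachments on that ticket.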

I managed to identify 68 unique names for the period between Jan. 1st, 2007 and June 1st, 2009. These are, of course, the names of contributors who are not necessarily first-time contributors. Further analysis is needed before one can determine which of these contributors are “external.”

Of course, it goes without saying that the number of contributors is just an estimate. There might be other contributions made through the mailing list (thanks FrankW for pointing this out), not to mention the likelihood that an individual might have two different pseudonyms. As jmckenna (IRC: #mapserver) simply put it, “It’s difficult to identify FLOSS contributors.”

Just in case you are wondering, here are the names:

warmerdam
aalbarello
unicoletti
tamas
jimk
rouault
dmorissette
tomkralidis
aboudreault
brage
armin
bartvde
diletant
pramsey
nmandery
nharding
eshabtai@gmail.com
assefa
bartvde@osgis.nl
dfuhry
hschoenhammer
project10
hopfgartner
ujunge@pmcentral.com
sdlime
richf
dionw
nnikolov
abajolet
laurent
tbonfort
BobBruce
nsavard
woodbri
flavio
scott.e@goisc.com
dstrevinas
ivanopicco
jlacroix
cplist
kfaschoway
szigeti
zjames
elzouavo
mcoladas@telefonica.net
nfarrell@bom.gov.au
jparapar
vulukut@tescilturk.com
novorado
russellmcormond
msmitherdc
crschmidt
hjaekel
peter.hopfgartner@r3-gis.com
hulst
mturk@apache.org
thomas.bonfort@gmail.com
ivano.picco@aqupi.tk
jmckenna
drewsimpson
bartw
djay
sholl
dirk@advtechme.com
cph
jratike80
hobu
hpbrantley

Django Code Base Modularity

Let me start by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e. imports) within each group and minimum dependency between groups.

Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, all of its source files share a high level of dependency on one another, which makes the code base look like a single monolithic unit.

This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation at DjangoCon. He had some rather excellent lessons learned for the community, but one part of his presentation stood out for me: he compared Django’s dependency graph with that of TurboGears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or to decide how well they are arranged. I think you now see where this is going.

I went ahead and generated the dependency graphs for both Django trunk and TurboGears trunk. For the fun of it, I also included other Python-based projects: CherryPy, SQLAlchemy and Genshi. Let me be clear on what I mean by the dependency graph of trunk: I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.
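
As a rough illustration of what extracting such a graph involves, here is a minimal sketch using Python's standard ast module. It is not the tooling used for this analysis: the function names and the sources mapping are hypothetical, relative imports are not resolved, and each file has to be valid syntax for the interpreter running the sketch.

```python
import ast


def imported_modules(source):
    """Return the set of module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Note: the relative-import level is ignored in this sketch.
            modules.add(node.module)
    return modules


def dependency_edges(sources):
    """Build (importer, imported) edges between modules of one code base.

    `sources` maps a dotted module name to its source text. Imports of
    modules outside the mapping (stdlib, third party) are ignored.
    """
    edges = set()
    for name, source in sources.items():
        for target in imported_modules(source):
            if target in sources and target != name:
                edges.add((name, target))
    return edges


# A made-up three-module code base, for illustration only.
sources = {
    "pkg.views": "import os\nimport pkg.models\nfrom pkg.forms import Form\n",
    "pkg.models": "import sqlite3\n",
    "pkg.forms": "from pkg.models import Model\n",
}
print(sorted(dependency_edges(sources)))
```

Running something like this on the checked-out tree at each commit would give one edge set per revision.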

I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory, and it matches the definition I just gave. I used a method by Newman that identifies groups in graphs using a clustering method that attempts to maximize modularity. Modularity in graph theory is basically a characteristic of how a graph is partitioned.
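
To make the graph-theory notion concrete, here is a minimal sketch that scores one given partition with the standard modularity formula, treating edges as undirected. It only evaluates a partition; it does not search for the best one the way the clustering method does. The function name and the sample graph are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Modularity Q of an undirected graph under a given partition.

    Q = sum over groups of (edges inside the group / m)
        - (summed degree of the group / 2m) ** 2,
    where m is the total number of edges.
    """
    m = float(len(edges))
    if m == 0:
        return 0.0
    inside = defaultdict(int)  # edges with both ends in the same group
    degree = defaultdict(int)  # summed node degrees per group
    for a, b in edges:
        degree[group_of[a]] += 1
        degree[group_of[b]] += 1
        if group_of[a] == group_of[b]:
            inside[group_of[a]] += 1
    return sum(inside[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)


# Two triangles joined by a single edge (made-up example graph).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in range(1, 7)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the graph along the two triangles scores higher than lumping every node into one group, which scores zero. That difference is what the clustering method tries to maximize.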

When applied to a source dependency graph, the method groups files that share dependencies into groups (i.e. modules), and the identified groups maximize the modularity of the graph. The modularity value this method finds is an upper limit on how modular the code base is. So without further ado, I give you the result of the analysis, where I calculated the modularity of the dependency graph after each commit and averaged the values per month:

Modularity graphs

Some highlights

  • Django seems to have a good increasing trend (Django community, keep up the good work!).
  • TurboGears, what happened? This is TurboGears trunk, by the way, so it’s v2.0. I think they should have listened to Mark Ramm’s presentation. Seems like something went wrong; maybe backwards compatibility?
  • I marked the two highest jumps in Django’s modularity. I attributed the first to the Boulder sprint, since I couldn’t find any other significant news from April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
  • If you are wondering where queryset-refactor is, look 3 points before the merging of newforms-admin. I don’t think it had an effect on modularity. Any ideas why?
  • SQLAlchemy, well done guys! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments confirming that there is some level of reliability in the method I am using (I need to graduate, people).

I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!