All articles, tagged with “floss”

The Mythical Django Pony, a blessing?

On my way to lunch on the first day of DjangoCon in Portland, I met Eric Holscher in one of the corridors of the hotel, holding a pink unicorn. It seemed odd, so I approached him and asked what he had in his hands. He explained that it was the unofficial Django mascot, which is a pony. I didn't make the obvious observation that what he was holding was a unicorn, but asked how this came to be. He explained that one of the core developers on the Django mailing list had responded to a feature request by saying "no, you can't have a pony!" as a way of politely refusing it.

I was surprised to find myself in a DjangoCon session two days later where Russell Keith-Magee talked about declined feature requests, how they are referred to as ponies, and ways to make your feature requests more likely to be accepted. He also explained the whole story behind the mythical Django pony (which is really a unicorn!).

Why am I bringing this up now? Well, as I write the concluding chapters of my dissertation, I notice an odd relationship between the number of modules in a code base and the number of new contributors. The statistical model suggests that adding modules to a code base is associated with an increase in the number of new contributors. However, this relationship reverses for projects that have an above-average number of modules, so adding modules when there is already a high number of them is associated with fewer contributors joining the development effort over time. The same effect can be observed for average module size (measured in SLOC), where an increase in the average size of a module is associated with an increase in new contributors up to a certain point, after which the relationship starts to reverse (a quadratic effect, for those of you who are statistically inclined).
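For the statistically inclined, here is a minimal sketch of the kind of model I am describing; the variable names and data are made up for illustration, not taken from my dataset:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per project, with the count of new
# contributors, the number of modules, and the average module size (SLOC).
data = pd.DataFrame({
    'new_contributors': [3, 5, 8, 9, 8, 6, 4, 2],
    'n_modules':        [10, 40, 80, 120, 160, 200, 240, 280],
    'avg_module_sloc':  [150, 220, 300, 380, 450, 520, 600, 680],
})

# The squared terms let the relationship rise and then reverse: a positive
# linear coefficient combined with a negative squared coefficient produces
# the inverted-U pattern described above.
model = smf.ols(
    'new_contributors ~ n_modules + I(n_modules ** 2)'
    ' + avg_module_sloc + I(avg_module_sloc ** 2)',
    data=data,
).fit()
print(model.summary())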

It dawned on me that one explanation for this observation is that an increase in the number of modules, or in the average size of a module, is the result of the increased complexity that comes from adding or implementing features. Assuming that the projects in my sample are not so mature that there is no need for new contributors to join, we can attribute the increase or decrease in the number of new contributors to a balancing act of complexity: the community correctly decides to include just enough features so as not to make the code base overly complex for new participants, while keeping it valuable enough for new members to start using and contributing to it. (Please don't bite my head off for implying causation here; I am just forwarding a hypothesis that seems to be supported by the data. It's up to you whether to accept it or not.)

So how does this relate to ponies? Well, I just might have put my finger on one of the things that makes Django unique, which is that the core developers know how to play this balancing act by knowing when to include or refuse a new feature. This, I believe, is possible because there is what we can refer to as a Django philosophy for deciding which feature requests are considered ponies. It seems to be paying off, as participation in Django is way off the charts. Don't ask me what the Django philosophy is, as I have no idea; I am just observing its results. If someone out there thinks they know what it is, or has a link, please do share.

The takeaway from this, at least for other FLOSS projects that want to learn from the Django community: be clear on the goals you want to achieve with your project, and don't be afraid of saying "No! You can't have a pony!" As for the Django community, you are already on the right track in trying to explain why features are refused. Keep at it!

Update: Here is the Django philosophy summed up quite eloquently by Dougal Matthews:

I think the philosophy is quite simple generally. “Does this need to be in the core?”

You can read his comment for more explanation.

Django … an outlier

While analyzing the development activity and code metrics for over 240 of the most actively developed FLOSS projects, guess which project popped out?

Yes, Django! It's an outlier in terms of its activity. It is influencing the results of my statistical analysis more than any other project, as per the Cook's distance diagnostic. Let me bring your attention to the lonely dot close to the value 1 in the top right corner of the graph below. I missed it at first, but noticed it when I looked at the sorted values.
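If you want to run the same diagnostic on your own data, here is a rough sketch of how it can be done with statsmodels; the design matrix and response are randomly generated placeholders, not my dissertation data:

import numpy as np
import statsmodels.api as sm

# Placeholder data: one row per project, a few arbitrary predictors.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(240, 3)))
y = X.dot(np.array([1.0, 0.5, -0.3, 0.2])) + rng.normal(size=240)

results = sm.OLS(y, X).fit()

# Cook's distance measures how much the fitted model changes when a single
# observation is dropped; values approaching 1 flag highly influential
# points (the "lonely dot" in the graph below).
cooks_d = results.get_influence().cooks_distance[0]
for idx in np.argsort(cooks_d)[::-1][:5]:
    print(idx, cooks_d[idx])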

This is telling us that, at least among my sample of actively developed Python-, C-, and C++-based FLOSS projects, Django (including its community) is quite unique.

I leave you with the graph of sorted Cook’s D values from my analysis:

[Graph: sorted Cook's D values]

Contributor analysis

Some of you might find it interesting to know that over 900 unique contributors have participated in Django's development since Jan 1st, 2007, as attributed in the svn log messages. I would say the community is very healthy, especially if you compare it to other well-known projects:

Django: 906
Pylons: 80
PyPy: 240
Linux Kernel: 4043
PostgreSQL: 150
Apache HTTP Server: 118
SQLAlchemy: 36 (Contributors identified from trac tickets mentioned in svn log)
Python: 428

Note that I count anonymous or guest contributors as a single contributor, so these numbers can be considered conservative if the project allows anonymous contributions.

Identifying contributors in FLOSS projects

As part of my graduate work, I need to analyze FLOSS repositories to identify the number of external contributors. What I mean by an external contributor is any individual who made a patch contribution without having commit access to the source code repository, and who is also a first-time contributor.

What I usually do to identify contributors in general is to parse the commit logs for any attribution to individuals who are not committers. Take for example the following log message:

Fixed #9859 – Added another missing force_unicode needed in admin when running on Python 2.3. Many thanks for report & patch to nfg. - (Django Revision 9656)

I wrote some regex-based scripts to identify names or pseudonyms such as "nfg" from the previous example.
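A simplified version of that idea looks something like the following; the patterns here are illustrative, as the real scripts handle many more phrasings:

import re

# A few attribution phrasings seen in Django-style log messages, e.g.
# "thanks for report & patch to nfg" or "patch from jsmith".
PATTERNS = [
    re.compile(r'thanks[^.;]*?\bto\s+([\w.@-]+)', re.IGNORECASE),
    re.compile(r'patch\s+(?:from|by)\s+([\w.@-]+)', re.IGNORECASE),
]

def extract_contributors(log_message):
    """Return names or pseudonyms credited in a commit log message."""
    names = []
    for pattern in PATTERNS:
        names.extend(name.rstrip('.') for name in pattern.findall(log_message))
    return names

msg = ("Fixed #9859 - Added another missing force_unicode needed in admin "
       "when running on Python 2.3. Many thanks for report & patch to nfg.")
print(extract_contributors(msg))   # ['nfg']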

Things, however, are not always clear cut, as not all FLOSS projects attribute contributors in the log message. For example, I noticed in the MapServer project, which seems to be actively developed, that there were no attributions in the log messages. After inquiring on IRC, it turns out that the attributions are available in the project tracker (thanks danmo!); what is included in the commit log message is a reference to the ticket number.

So I rolled up my sleeves and wrote a quick parser to identify all ticket numbers in the log messages. I then used httplib2 and BeautifulSoup to connect to the project tracker and parse the patch name and contributor. The following is the code I used to perform that task:

import httplib2
from BeautifulSoup import BeautifulSoup as BS

def get_mapserver_author(ticket):
    """Return (contributor, patch file name) pairs attached to a Trac ticket."""
    url = 'http://trac.osgeo.org/mapserver/ticket/%s' % ticket

    # Cache responses locally so re-runs don't hammer the tracker.
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")

    # Trac lists attachments in a div; each <dt> holds the file name (<a>)
    # and the contributor who added it (<em>).
    bs = BS(content)
    div = bs.find('div', id='attachments')
    patches = []
    if div is None:
        return patches  # ticket has no attachments
    for y in div.findAll('dt'):
        try:
            patches.append((y.em.string, y.a.string))
        except AttributeError:
            print 'Problem parsing ticket', ticket
    return patches
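To give an idea of how this function is used: pull ticket numbers out of the svn log messages with a regex, then look each one up on the tracker. The log messages and ticket numbers below are made up for illustration:

import re

# MapServer-style commit messages reference tickets like "#2871";
# the exact phrasing varies, so this pattern is a simplification.
TICKET_RE = re.compile(r'#(\d+)')

log_messages = [
    "Fixed label clipping (#2871)",
    "Applied patch for #2905, thanks!",
]

contributors = set()
for message in log_messages:
    for ticket in TICKET_RE.findall(message):
        # get_mapserver_author returns (contributor, patch file name) pairs
        for author, patch_name in get_mapserver_author(ticket):
            contributors.add(author)

print 'unique contributors:', len(contributors)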

I managed to identify 68 unique names for the period between Jan. 1st, 2007 and June 1st, 2009. These are, of course, the names of contributors who are not necessarily first-time contributors; further analysis is needed before one can determine which of these contributors are "external".
Of course, it goes without saying that the number of contributors is just an estimate. There might be some other contributions made through the mailing list (thanks FrankW for pointing this out), not to mention the likelihood that an individual might use two different pseudonyms. As jmkenna (IRC: #mapserver) simply put it, "It's difficult to identify FLOSS contributors."

Just in case you are wondering, here are the names:

warmerdam
aalbarello
unicoletti
tamas
jimk
rouault
dmorissette
tomkralidis
aboudreault
brage
armin
bartvde
diletant
pramsey
nmandery
nharding
eshabtai@gmail.com
assefa
bartvde@osgis.nl
dfuhry
hschoenhammer
project10
hopfgartner
ujunge@pmcentral.com
sdlime
richf
dionw
nnikolov
abajolet
laurent
tbonfort
BobBruce
nsavard
woodbri
flavio
scott.e@goisc.com
dstrevinas
ivanopicco
jlacroix
cplist
kfaschoway
szigeti
zjames
elzouavo
mcoladas@telefonica.net
nfarrell@bom.gov.au
jparapar
vulukut@tescilturk.com
novorado
russellmcormond
msmitherdc
crschmidt
hjaekel
peter.hopfgartner@r3-gis.com
hulst
mturk@apache.org
thomas.bonfort@gmail.com
ivano.picco@aqupi.tk
jmckenna
drewsimpson
bartw
djay
sholl
dirk@advtechme.com
cph
jratike80
hobu
hpbrantley

Modularity of FreeBSD, Haiku, and OpenOffice.org

While analyzing data for my dissertation, I gave up on analyzing the modularity of FreeBSD, Haiku, and OpenOffice.org. I was trying to obtain eight modularity readings at different points in time, but decided to stop analyzing these projects.

Why? Well, after having my MBP run non-stop for a week without producing a single data point, it became clear that I would not be graduating if I continued to pursue that route. Has anyone out there looked at the source code of these projects? Is it surprising that the graph analysis I was performing took ages? How easy is it to find your way around these projects?

It took a whole day to obtain all the modularity readings for the Linux Kernel using the same measure. It averaged around 0.92, which is excellent. I take it the projects I stopped analyzing aren't as modular as the Linux Kernel.

Richness Vs. Generalizability

It was an interesting couple of days at the OSS2009 conference. I presented the FLOSS marketplace paper as part of the PhD consortium and found the feedback to be very constructive. My goal was to get feedback on the validity of measures I am using to test my theories, and was able to get some valuable insights.

Being trained in quantitative methods and a positivist philosophy, my inclination was to build generalizable theories about FLOSS communities and attempt to falsify them, which explains why I used theories like TCT and Organizational Information Processing Theory to build my research models with the project as the unit of analysis. This proved to be the biggest discussion point in many of my conversations. I managed to gather some useful insights that I couldn't have easily gathered on my own, which made me appreciate the value of the diversity of research philosophies and methods at the conference.

What was made clear through the discourse was that each FLOSS community has unique processes, members, and software. This made me reconsider some of the limitations of my methods and improve on them whenever possible. One particular approach suggested to me at the conference was a mixed-methods approach, in which I would use qualitative methods on a limited sample of the projects I am observing and show that the nuances in these projects follow the predictions and logic of my theory. I could then use quantitative methods to generalize my findings.

To give an example of some of the methodological issues that caught me by surprise, consider estimating productivity as the number of lines of code added or removed in a commit. According to Hoffman and Reihle, the unified patch format does not track changed lines; rather, every changed line is counted as one line removed and one line added. A seemingly good estimate of productivity that takes this issue into account was presented at the conference, which I thought was valuable.
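To make the issue concrete, here is a toy illustration of the double counting: in a unified diff, a modified line appears as one removed line plus one added line, so a naïve count treats every edit as two lines of work. This is just a sketch, not the estimator presented at the conference:

def count_diff_lines(unified_diff):
    """Count added and removed lines in a unified diff, ignoring file headers."""
    added = removed = 0
    for line in unified_diff.splitlines():
        if line.startswith('+++') or line.startswith('---'):
            continue  # "+++"/"---" are file name headers, not content
        if line.startswith('+'):
            added += 1
        elif line.startswith('-'):
            removed += 1
    return added, removed

# A single edited line shows up as one removal plus one addition:
diff = """--- a/hello.py
+++ b/hello.py
@@ -1,2 +1,2 @@
-print 'hello world'
+print 'hello, world!'
 import sys"""
print(count_diff_lines(diff))   # (1, 1)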

Getting to meet some FLOSS developers at the conference, and talking with them about research ideas and our work, was also enlightening. It was interesting to learn which issues and challenges developers consider important, which would allow us to make our research more relevant. Furthermore, I found it valuable to get an account from practitioners of how some of the assumptions and theoretical explanations in my work reflect reality.

Based on the discussions, I got the feeling that the focus of most researchers I met was on building rich and specialized theories. I felt that generalizable theories were somewhat underrepresented. I could not tell if this was due to the philosophical background of many of the European researchers I met, or because the FLOSS phenomenon is not yet well understood and richness is needed before we can generalize. Personally, I think there is value in generalization, and we know enough about at least the software development practices to start building such theories.

I would love to hear the thoughts of whoever reads this post on whether it's possible or valuable to build generalizable theories about the FLOSS phenomenon.

Brooks' Law at work

Here is a snapshot of how different FLOSS projects compare in terms of source lines of code (log scale) plotted against modularity, from March 2008, for a total of 200+ observations (as described in a previous post).

[Scatter plot: SLOC (log scale) vs. modularity]

Given that I sampled the projects from the top 1000 listed on ohloh.net, which are mostly actively developed FLOSS projects, one can see that a good organization of source code dependencies needs to be maintained for development to continue, as suggested by Brooks' Law. This might explain why the bottom right quadrant, which would represent projects with a large code base and a poor organization of dependencies, is almost empty.

The significance of this graph is that it adds to the validity of the graph modularity measure we used in comparing different Python-based projects. This is hopefully but one step of many toward better understanding FLOSS project management.

Presentation material for OSS2009 PC

Click here to get the latest copy of my dissertation essays and my OSS2009 presentation.

Dependency graph analysis

I tried analyzing Haiku's and FreeBSD's dependency graphs to see how modular their code bases are. I was running the analysis on a Dell XPS gaming rig, and it kept running for 7 straight days at 100% CPU utilization without finishing.

For my dissertation, I need to perform this kind of analysis for both operating systems at 5 different points in time, so I think it would be a good idea to drop these two projects from my analysis if I can't obtain a single data point after a full week of computation.

Django Code Base Modularity

Let me start by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e., imports) within a group and minimum dependency between groups.

Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, the whole set of source files shares a high level of dependency, which makes the code base look like a single monolithic unit.

This obsession with modularity and dependency graphs was actually sparked by Mark Ramm's presentation at DjangoCon. He had some excellent lessons learned for the community, but one part of his presentation stuck out for me, where he compared Django's dependency graph with that of TurboGears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or to decide how well they are arranged. I think you now see where this is going.

I went ahead and generated the dependency graphs for both Django trunk and TurboGears trunk. For the fun of it, I also included other Python-based projects: CherryPy, SQLAlchemy, and Genshi. Let me be clear on what I mean by the dependency graph of trunk: I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.

I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory, and it matches the definition I just gave. I used a method by Newman which identifies groups in a graph using a clustering method that attempts to maximize modularity. Modularity in graph theory is essentially a characteristic of how a graph is partitioned; a sketch of the idea follows below.
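If you want to experiment with the idea yourself, it can be sketched with networkx, which ships a greedy modularity-maximizing clustering in the Newman line of work; my actual pipeline is different, so treat this purely as an illustration, with made-up file names:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Toy dependency graph: nodes are source files, edges are import relations.
G = nx.Graph()
G.add_edges_from([
    ('db/models.py', 'db/fields.py'),
    ('db/models.py', 'db/query.py'),
    ('db/fields.py', 'db/query.py'),
    ('forms/forms.py', 'forms/widgets.py'),
    ('forms/forms.py', 'forms/fields.py'),
    ('db/models.py', 'forms/fields.py'),   # a single cross-group dependency
])

# Greedily merge groups of files so that the modularity of the partition
# is (approximately) maximized, then report the resulting value.
communities = greedy_modularity_communities(G)
print('identified modules:', [sorted(c) for c in communities])
print('modularity:', modularity(G, communities))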

When applied to a source dependency graph, the method groups files that share dependencies into groups (i.e., modules), and the identified groups maximize the modularity of the graph. The modularity value obtained this way is therefore an upper limit on how modular the code base is. So, without further ado, I give you the result of the analysis, where I calculated the modularity of the dependency graph after each commit and averaged the values per month:

Modularity graphs

Some highlights

  • Django seems to have a good increasing trend (Django community, keep up the good work!).
  • TurboGears, what happened? This is TurboGears trunk, by the way, so it's V2.0; I think they should have listened to Mark Ramm's presentation. It seems like something went wrong, maybe backwards compatibility?
  • I marked the two highest jumps in Django's modularity. I attributed the first to the Boulder sprint, since I couldn't find any other significant news during April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
  • If you are wondering where queryset-refactor is, look 3 points prior to the merging of newforms-admin. I don't think it had an effect on modularity; any ideas why?
  • SQLAlchemy, well done guys! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments confirming that there is some level of reliability in the method I am using (I need to graduate, people).

I hope you find this all interesting. I'll be sharing more analysis of other FLOSS projects; I'm currently working on Pylons, Twisted, and Trac. I thought about doing Zope, but my computer begged for mercy. Stay tuned!