Archive for June, 2009

Modularity of FreeBSD, Haiku, and OpenOffice.org

While analyzing data for my dissertation, I gave up on measuring the modularity of FreeBSD, Haiku, and OpenOffice.org. I was trying to get 8 modularity readings over time for each project, but decided to stop.

Why? Well, after having my MBP run non-stop for a week without getting a single data point, it became clear that I would not be graduating if I continued down that route. Has anyone out there looked at the source code of these projects? Is it surprising that the graph analysis I was performing took ages? How easy is it to find your way around these projects?

By comparison, it took a whole day to obtain all the modularity readings for the Linux kernel using the same measure. They averaged around 0.92, which is excellent. I take it the projects I stopped analyzing aren’t as modular as the Linux kernel.

How Great is Django’s Documentation?

One aspect of Django that never ceases to amaze me is how well it is documented. I believe this is what got many of us to use it, me included. While doing some boring graduate donkey work, Django’s name popped up, not surprisingly, as one of the most heavily documented code bases out there (only projects focusing on documentation came close!). So let me take this opportunity to thank everyone reading this who had a hand in making Django what it is. (Thank you!)

The ratio of lines of documentation to SLOC is plotted against SLOC as of 1 January 2009. Django has one of the highest ratios among open source projects of similar size (see the top part of the graph). Of course, I could have plotted documentation lines against SLOC directly, but then I would have had to show Django’s deviation from the regression line. This graph just makes the point more obvious and is easier to plot:

[Figure: ratio of documentation lines to SLOC, plotted against SLOC, as of 1 January 2009]
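The ratio on the vertical axis is plain arithmetic, and a minimal sketch may make it concrete. The `ProjectSnapshot` type, the project names, and the numbers below are all hypothetical placeholders for illustration. How documentation lines were actually counted is not described here, so the counts are simply taken as given.

```python
from dataclasses import dataclass


@dataclass
class ProjectSnapshot:
    """Hypothetical container for one project's counts on a given date."""
    name: str
    sloc: int        # source lines of code
    doc_lines: int   # lines of documentation, however they were counted


def doc_to_sloc_ratio(snapshot: ProjectSnapshot) -> float:
    """Return documentation lines per source line of code."""
    if snapshot.sloc <= 0:
        raise ValueError("SLOC must be positive to compute a ratio")
    return snapshot.doc_lines / snapshot.sloc


# Made-up numbers, not measurements of any real project.
snapshots = [
    ProjectSnapshot("project_a", sloc=40_000, doc_lines=30_000),
    ProjectSnapshot("project_b", sloc=50_000, doc_lines=5_000),
]

# Sort so the most heavily documented project comes first.
for snap in sorted(snapshots, key=doc_to_sloc_ratio, reverse=True):
    print(f"{snap.name}: {doc_to_sloc_ratio(snap):.2f} doc lines per SLOC")
```

Plotting this ratio against SLOC puts heavily documented projects near the top of the graph regardless of their size, which is why no regression line is needed to spot them.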

I also have a few questions, should any Django contributors pass by. Do any of you think that you might have overdone it? Is it difficult to maintain the quality of the documentation and keep it up to date, and did anything change after moving to a regular release schedule? Is it only contributors experienced with the Django code base who are able to write documentation of sufficient quality?

Since I also live in a world where ponies roam and correlation is always seen as causation, let me end this post by saying:

Good documentation is the cause of success for Django, and high ice cream consumption is the cause of increased deaths in neighborhoods with swimming pools during summer!

Richness Vs. Generalizability

It was an interesting couple of days at the OSS2009 conference. I presented the FLOSS marketplace paper as part of the PhD consortium and found the feedback to be very constructive. My goal was to get feedback on the validity of the measures I am using to test my theories, and I was able to get some valuable insights.

Being trained in quantitative methods and a positivist philosophy, my inclination was to build generalizable theories about FLOSS communities and attempt to falsify them. This explains why I used theories like TCT and Organizational Information Processing Theory to build my research models, with the project as the unit of analysis. This choice proved to be the biggest discussion point in many of my conversations. I managed to gather some useful insights that I couldn’t easily have gathered on my own, which made me appreciate the diversity of research philosophies and methods at the conference.

What was made clear through the discourse was that each FLOSS community has unique processes, members, and software. This made me reconsider some of the limitations of my methods and improve on them wherever possible. One particular approach that was suggested to me at the conference was a mixed-methods approach, in which I use qualitative methods on a limited sample of the projects I am observing and show that the nuances in these projects follow the predictions and logic of my theory. I could then use quantitative methods to generalize my findings.

To give an example of the methodological issues that caught me by surprise, consider the estimate of productivity when measured as lines of code added or removed in a commit. According to Hofmann and Riehle, the unified patch format does not track changed lines; instead, it treats any changed line as one line removed and one line added. A seemingly good estimate of productivity that takes this issue into account was presented at the conference, and I found it valuable.
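The line-counting issue is easy to reproduce. The sketch below uses Python's standard `difflib` module to generate a unified diff and count its `+` and `-` lines. The helper name `count_added_removed` and the sample lines are made up for illustration, and this is not the estimate presented at the conference.

```python
import difflib


def count_added_removed(old_lines, new_lines):
    """Count '+' and '-' content lines in a unified diff of two line lists."""
    added = removed = 0
    in_hunk = False
    for line in difflib.unified_diff(old_lines, new_lines, lineterm=""):
        if line.startswith("@@"):
            in_hunk = True  # file headers ('---', '+++') come before the first hunk
            continue
        if not in_hunk:
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


old = ["a = 1", "b = 2", "c = 3"]
new = ["a = 1", "b = 20", "c = 3"]  # only the middle line was edited

added, removed = count_added_removed(old, new)
print(f"added={added}, removed={removed}")  # prints "added=1, removed=1"
```

A single edited line is reported as one addition plus one removal, so a naive "lines added plus lines removed" metric counts it twice.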

Meeting some FLOSS developers at the conference and talking with them about research ideas and our work was also enlightening. It was interesting to learn which issues and challenges matter most to developers, as this allows us to make our research more relevant. Furthermore, I found it valuable to hear from practitioners how some of the assumptions and theoretical explanations in my work are reflected in reality.

Based on the discussions, I got the feeling that most of the researchers I met were focused on building rich and specialized theories. I felt that generalizable theories were somewhat underrepresented. I could not tell whether this was due to the philosophical background of many of the European researchers I met, or because the FLOSS phenomenon is not yet well understood and richness is needed before we can generalize. Personally, I think there is value in generalization, and we know enough about at least the software development practices to start building such theories.

I would love to hear the thoughts of whoever reads this post on whether it’s possible or valuable to build generalizable theories related to the FLOSS phenomenon.

Brooks’s law at work

Here is a snapshot of how different FLOSS projects compare, with source lines of code (log scale) plotted against modularity, using data from March 2008 and a total of 200+ observations (as described in a previous post).

[Figure: modularity plotted against source lines of code (log scale) for the sampled FLOSS projects]

Given that I sampled the projects from the top 1000 listed on ohloh.net, which are mostly actively developed FLOSS projects, one can see that source code dependencies need to stay well organized for development to continue, as suggested by Brooks’s Law. This might explain why the bottom-right quadrant, which represents projects with a large code base and poorly organized dependencies, is almost empty.

The significance of this graph is that it adds to the validity of the graph modularity measure we used to compare different Python-based projects. This is hopefully one step of many toward a better understanding of FLOSS project management.
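For readers unfamiliar with graph modularity, here is a rough sketch of one common definition. It assumes Newman's modularity Q computed on an undirected dependency graph with a known assignment of files to modules. This is an assumption for illustration only: the exact measure and partitioning used in this analysis are not spelled out in these posts, and the file names and function name below are hypothetical.

```python
from collections import defaultdict


def newman_modularity(edges, module_of):
    """Newman's Q for an undirected graph, given each node's module.

    edges: list of (u, v) dependency pairs, each counted once.
    module_of: dict mapping every node to its module label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")

    internal = defaultdict(int)    # edges with both ends in the same module
    degree_sum = defaultdict(int)  # total degree of the nodes in each module
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        degree_sum[mu] += 1
        degree_sum[mv] += 1
        if mu == mv:
            internal[mu] += 1

    # Q = sum over modules of (share of internal edges - expected share)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical example: two tightly knit modules joined by one dependency.
edges = [
    ("a.py", "b.py"), ("b.py", "c.py"), ("a.py", "c.py"),  # module "core"
    ("d.py", "e.py"), ("e.py", "f.py"), ("d.py", "f.py"),  # module "ui"
    ("c.py", "d.py"),                                      # cross-module link
]
module_of = {"a.py": "core", "b.py": "core", "c.py": "core",
             "d.py": "ui", "e.py": "ui", "f.py": "ui"}

print(round(newman_modularity(edges, module_of), 3))
```

Under this definition, Q approaches 1 when almost all dependencies stay inside modules, and drops toward 0 when dependencies cross module boundaries about as often as chance would predict.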