While analyzing data for my dissertation, I gave up on measuring the modularity of FreeBSD, Haiku, and OpenOffice.org. I was trying to get 8 modularity readings at different points in time, but decided to drop these projects.
Why? Well, after having my MBP run non-stop for a week without getting a single data point, it became clear that I would not be graduating if I continued down that route. Has anyone out there looked at the source code of these projects? Is it surprising that the graph analysis I was performing took ages? How easy is it to find your way around these projects?
It took a whole day to obtain all the modularity readings for the Linux kernel using the same measure. It averaged around 0.92, which is excellent. I take it the projects I stopped analyzing aren’t as modular as the Linux kernel.
Here is a snapshot of how different FLOSS projects compare in terms of source lines of code (log scale) plotted against modularity, from March 2008, for a total of 200+ observations (as described in a previous post).
Given that I sampled the projects from the top 1000 listed on ohloh.net, which are mostly actively developed FLOSS projects, one can see that source code dependencies need to stay well organized for development to continue, as suggested by Brooks’s Law. This might explain why the bottom right quadrant, which represents projects with a large code base and poor organization of dependencies, is almost empty.
The significance of this graph is that it adds to the validity of the graph modularity measure we used to compare different Python-based projects. This is hopefully but one step of many toward a better understanding of FLOSS project management.
I tried analyzing the dependency graphs of Haiku and FreeBSD to see how modular their code bases are. I ran the analysis on a Dell XPS gaming rig; it has been running for 7 straight days at 100% CPU utilization and still hasn’t finished.
I need to run this kind of analysis for both operating systems at 5 different points in time. Since I couldn’t obtain a single data point after a full week, I think it is a good idea to drop these two projects from my analysis.
Let me start by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e. imports) within each group and minimum dependency between groups.
Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, all the source files share a high level of dependency on one another, which makes the code base look like a single monolithic unit.
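To make the definition a bit more concrete, here is a minimal sketch of how a grouping of files could be scored. It uses the standard graph modularity formula for an undirected graph, Q = Σ over groups of (edges inside the group / m) − (sum of degrees in the group / 2m)², where m is the total number of edges. The function name, the file names, and the toy graph are made up for illustration; treat it as a sketch of the idea, not as the exact code behind the numbers in this post.

```python
from collections import defaultdict

def modularity(edges, group_of):
    """Score a grouping of files: high when dependencies stay inside groups.

    edges    -- list of (file_a, file_b) dependency pairs, treated as undirected
    group_of -- dict mapping each file to a group label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends in the same group
    degree = defaultdict(int)    # sum of node degrees per group
    for a, b in edges:
        ga, gb = group_of[a], group_of[b]
        degree[ga] += 1
        degree[gb] += 1
        if ga == gb:
            internal[ga] += 1
    return sum(internal[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)

# Hypothetical toy code base: two tightly knit clusters joined by one import.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {name: 0 for name in "abcdef"}

print(modularity(toy_edges, two_groups))  # 5/14, about 0.357
print(modularity(toy_edges, one_group))   # 0.0: a single monolithic group
```

The cohesive, loosely coupled grouping scores higher than lumping everything into one group, which is the intuition described above.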
This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation at DjangoCon. He had some rather excellent lessons learned for the community, but one part of his presentation stuck out for me, where he compared Django’s dependency graph with that of TurboGears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or decide how well they are arranged. I think you now see where this is going.
I went ahead and generated the dependency graphs for both Django trunk and TurboGears trunk. For the fun of it, I also included other Python-based projects: CherryPy, SQLAlchemy and Genshi. Let me be clear on what I mean by the dependency graph of trunk. I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.
I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory and it matches the definition I just gave. I used a method by Newman which identifies groups in graphs using a clustering method that attempts to maximize modularity. Modularity in graph theory is basically a characteristic of how a graph is partitioned.
When applied to a source dependency graph, the method groups files that share dependencies into groups (i.e. modules), and the identified groups maximize the modularity of the graph. The modularity value identified by this method is an upper limit for how modular the code base is. So without further ado, I give you the result of the analysis, where I calculated the modularity of the dependency graph after each commit and averaged the values per month:
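For a rough idea of what building such a graph involves, here is a small sketch that pulls import dependencies out of Python source with the standard `ast` module. The function names and the toy sources are hypothetical, it only follows absolute imports between modules it is given, and it is not necessarily how the graphs in this post were produced.

```python
import ast

def imported_modules(source):
    """Return the set of module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 means a relative import; this sketch skips those
            found.add(node.module)
    return found

def dependency_edges(sources):
    """Build (importer, imported) edges between the modules in `sources`.

    sources -- dict mapping a dotted module name to its source text
    """
    edges = set()
    for name, text in sources.items():
        for target in imported_modules(text):
            if target in sources and target != name:  # ignore external libraries
                edges.add((name, target))
    return edges

# Hypothetical three-module project.
toy_sources = {
    "app.models": "import app.db\nfrom app.utils import helper\nimport os\n",
    "app.db": "import sqlite3\n",
    "app.utils": "",
}
print(sorted(dependency_edges(toy_sources)))
```

Running something like this against the checkout of every commit would give one graph per commit, which is where the computing time starts to add up.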
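To give a feel for how a modularity-maximizing clustering works, here is a deliberately naive sketch of a greedy agglomerative heuristic: every file starts in its own group, and the pair of groups whose merge raises modularity the most is merged until no merge helps. This is an assumption on my part about the general flavor of such methods, not necessarily the specific Newman method used for this analysis, and the function name and toy graph are made up. The gain of merging groups i and j is e_ij/m − d_i·d_j/(2m²), where e_ij is the number of edges between them, d is the total degree of a group, and m is the total number of edges.

```python
from collections import defaultdict

def greedy_groups(edges):
    """Greedily merge groups of files while a merge still raises modularity."""
    edges = list(edges)
    m = len(edges)
    group_of = {node: node for edge in edges for node in edge}  # one group per file
    while True:
        degree = defaultdict(int)   # total degree per group
        between = defaultdict(int)  # edge count between each pair of distinct groups
        for a, b in edges:
            ga, gb = group_of[a], group_of[b]
            degree[ga] += 1
            degree[gb] += 1
            if ga != gb:
                between[tuple(sorted((ga, gb)))] += 1
        best_gain, best_pair = 0.0, None
        for (keep, drop), e in sorted(between.items()):
            # Change in modularity if these two groups were merged.
            gain = e / m - degree[keep] * degree[drop] / (2 * m * m)
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (keep, drop)
        if best_pair is None:       # no merge improves modularity: done
            return group_of
        keep, drop = best_pair
        group_of = {n: (keep if g == drop else g) for n, g in group_of.items()}

# Hypothetical toy graph: two triangles of files joined by a single import.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
print(greedy_groups(toy_edges))
```

On the toy graph the two triangles end up as the two groups. A greedy pass like this only finds a good partition, not a guaranteed best one, and it rescans every edge on each step, so it is far too slow for anything the size of an operating system.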
- Django seems to have a good increasing trend (Django community, keep up the good work!).
- TurboGears, what happened? This is TurboGears trunk btw, so it’s V2.0. I think they should have listened to Mark Ramm’s presentation. Seems like something went wrong, maybe backwards compatibility?
- I marked out the two highest jumps in Django’s modularity. I attribute the first to the Boulder sprint, since I couldn’t find any other significant news during April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
- If you are wondering where queryset-refactor is, look 3 points prior to the merging of newforms-admin. I don’t think it had an effect on modularity. Any ideas why?
- SQLAlchemy, well done guys! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments confirming that there is some level of reliability in the method I am using (I need to graduate, people).
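For anyone who wants to reproduce the per-month points from per-commit values, the averaging step can be sketched as below. The function name and the readings are invented purely for illustration; they are not data from this analysis.

```python
from collections import defaultdict
from datetime import date

def monthly_average(readings):
    """Average per-commit modularity values by calendar month.

    readings -- iterable of (commit_date, modularity) pairs
    Returns a dict mapping (year, month) to the mean modularity.
    """
    buckets = defaultdict(list)
    for day, value in readings:
        buckets[(day.year, day.month)].append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}

# Made-up readings, one per commit.
toy_readings = [
    (date(2007, 4, 2), 0.5),
    (date(2007, 4, 20), 0.7),
    (date(2007, 5, 3), 0.8),
]
for month, mean in monthly_average(toy_readings).items():
    print(month, round(mean, 3))
```

Averaging per month smooths out commit-level noise, but it also means a month with a single commit counts as much as a month with hundreds.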
I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!