Let me start first by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e. imports) within group and minimum dependency between groups.
Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have little dependencies between them the code base is said to be loosely coupled. When a code base is non-modular, then the whole group of source files share a high level of dependency between one another which makes the code base seem as a single monolithic unit.
This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation in Djangocon. He had some rather excellent lessons learned for the community, but one part of his presentation stuck out for me where he compared django’s dependency graph with that of turbogears (around the 9th minute). I am no graph expert, but I am almost certain that eye balling graphs is not a good way to compare them or decide how well they are arranged. I think you now see where this is going.
I went ahead and generated the dependency graphs for both django trunk and turbogears trunk. For the fun of it, I also included other python based projects, CherryPy, SqlAlchemy and Genshi. Let me be clear on what I mean by dependency graph of trunk. I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.
I ended up with a lot of graphs and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory and it matches the definition I just gave. I used a method by newman which identifies groups in graphs using a clustering method that attempts to maximize modularity. Modularity in graph theory is basically a characteristic of how a graph is partitioned.
When applying the method on a source dependency graph, the method groups files that share dependencies into groups (i.e. modules) and the identified groups would maximize the modularity of the graph. The identified modularity value from this method would be an upper limit for how modular the code base is. So without further ado, I give you the the result of the analysis where I calculated the modularity of the dependency graph after each commit, and averaged the values per month:
- Django seems to have a good increasing trend (Django community, keep up the good work!).
- Turbogears, what happened? this is Turbogears trunk btw so it’s V2.0, I think they should have listened to Mark Ramm’s presentation. Seems like something went wrong, maybe backwards compatibility?
- I marked out the two highest jumps in Django’s modularity. I attributed the first to the Boulder sprint, since I couldn’t find any other significant news during April 2007. The second can be attributed to newform-admin branch merging into trunk.
- If you are wondering where queryset-refactor is, look 3 points prior to merging of newforms-admin. I dont think it had an effect on modularity, any ideas why?
- SQLAlchemy, well done guys! anyone worked on SQLAlchemy and can confirm that indeed their code is modular? I would appreciate any comments to confirm that there is some level of reliability in the method I am using (I need to graduate people).
I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!