Django Code Base Modularity

Let me start by defining what I mean by modularity: it is a measure of how well source code files are arranged into groups that share maximum dependency (i.e. imports) within each group and minimum dependency between groups.

Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, all of its source files share a high level of dependency with one another, which makes the code base look like a single monolithic unit.

This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation at DjangoCon. He had some excellent lessons learned for the community, but one part of his presentation stuck out for me: the part where he compared Django’s dependency graph with that of TurboGears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or to decide how well they are arranged. I think you can see where this is going.

I went ahead and generated the dependency graphs for both Django trunk and TurboGears trunk. For the fun of it, I also included other Python-based projects: CherryPy, SQLAlchemy, and Genshi. Let me be clear about what I mean by the dependency graph of trunk: I went through the whole trunk history of each project and generated the dependency graph for every commit.

I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory, and it matches the definition I just gave. I used a method by Newman that identifies groups in a graph using a clustering approach that attempts to maximize modularity. Modularity in graph theory is essentially a characteristic of how well a graph is partitioned.
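For readers who want a concrete picture of this step, here is a minimal sketch of the idea using networkx (its greedy algorithm is the Clauset-Newman-Moore variant of Newman's approach, which is not necessarily the exact implementation I used; dependency_graph is a stand-in for one commit's undirected dependency graph):

    # Minimal sketch: partition one commit's dependency graph and score it.
    # Assumes an undirected networkx Graph whose nodes are source files and
    # whose edges are import relationships.
    import networkx as nx
    from networkx.algorithms import community

    def modularity_score(dependency_graph):
        # Greedy modularity maximization (Clauset-Newman-Moore) returns the
        # groups ("modules") it found; we then score that partition.
        groups = community.greedy_modularity_communities(dependency_graph)
        return community.modularity(dependency_graph, groups)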

When applied to a source dependency graph, the method clusters files that share dependencies into groups (i.e. modules) in a way that maximizes the modularity of the graph. The resulting modularity value is therefore an upper limit on how modular the code base is. So, without further ado, I give you the result of the analysis, in which I calculated the modularity of the dependency graph after each commit and averaged the values per month:

Modularity graphs

Some highlights

  • Django seems to have a good increasing trend (Django community, keep up the good work!).
  • TurboGears, what happened? This is TurboGears trunk, by the way, so it’s v2.0. I think they should have listened to Mark Ramm’s presentation. It seems like something went wrong; maybe backwards compatibility?
  • I marked out the two highest jumps in Django’s modularity. I attributed the first to the Boulder sprint, since I couldn’t find any other significant news during April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
  • If you are wondering where queryset-refactor is, look three points prior to the merging of newforms-admin. I don’t think it had an effect on modularity; any ideas why?
  • SQLAlchemy, well done! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments confirming that there is some level of reliability in the method I am using (I need to graduate, people).

I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!


Comments

You don’t specify how you’re counting “linking” here, so it’s hard to see if you’re accounting for one of the systematic problems that were in Mark’s graphs: importing is a directional dependency, and it’s quite possible for some module “A” to be imported into a lot of other modules because it contains common code. That isn’t a problem at all. It’s simply a factoring of the code base to avoid repetition either in implementation or usage (for example, django.conf.settings being imported everywhere is a usage factoring decision). Mark’s graphs would have been better with a third dimension, for example, so that the commonality factoring was clearer (it would appear as a layer or two in the third dimension).

As for queryset-refactor, it was a mostly self-contained refactoring, so it wouldn’t be expected to change the modularity number (however that coefficient is being computed). It changed the number of files and lines of code, but its effects were predominantly self-contained in django.db.models.

Malcolm,

I’m glad that you clarified the queryset-refactor bit; indeed, if the changes are self-contained within a module, then the modularity number should not change. That lends support to the validity of the measure, at least for what I am using it for.

Your point regarding the settings.py file is well taken. I will exclude settings.py from the analysis and report the results later. But even with it included, Django fares favorably compared to other Python web frameworks, and most certainly against TG. What’s important is that the trend seems to be going up.

To extract dependencies I’m using the snakefood package. The dependencies are based on imports, and they either exist or they don’t; there is no notion of the strength of a dependency between two modules, such as might arise from multiple function calls into the other module. The analysis also includes all source files, starting from the root directory of the project, which means test files are included as well. I think we might want to exclude those in future analysis.
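For the curious, turning sfood’s output into a graph is roughly the sketch below. It assumes snakefood’s one-tuple-per-line output of ((root, file), (root, file)) pairs, with (None, None) as the target for files that depend on nothing; the helper name is mine, not part of either package.

    # Rough sketch: build a directed dependency graph from `sfood` output,
    # e.g. lines produced by `sfood <project root> > deps.txt`.
    import ast
    import networkx as nx

    def graph_from_sfood(lines):
        graph = nx.DiGraph()
        for line in lines:
            line = line.strip()
            if not line:
                continue
            (_, source), (_, target) = ast.literal_eval(line)
            graph.add_node(source)
            if target is not None:
                graph.add_edge(source, target)  # "source imports target"
        return graph

    # directed_graph = graph_from_sfood(open("deps.txt"))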

There is one rather important thing that I feel needs to be made clear: the modularity analysis I’m using works only with undirected graphs, so I convert the directed dependency graphs into undirected ones, which might raise a few eyebrows.
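In code terms, the conversion is the trivial part; continuing the sketches above (with the same caveat that this mirrors the idea rather than my exact scripts):

    # Collapse A -> B, A <- B and A <-> B into a single undirected edge A -- B,
    # then score the partition as before.
    dependency_graph = directed_graph.to_undirected()
    score = modularity_score(dependency_graph)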

Consider this, however: if we have two developers, a and b, working on modules A and B, it really doesn’t matter whether the directed dependency is A -> B, A <- B, or even A <-> B. If each developer is working on his own module, conflict is bound to arise between them, since the effects of any change will trickle down the dependency path. If we can agree that modularity is an enabler for collaborative software development, because it limits the need for coordination between developers by limiting the dependencies between them, then for the sake of understanding collaborative development (which is what I’m doing), directed and undirected dependency graphs are equivalent.

Just excluding the settings file doesn’t fix things. That was simply one example amongst dozens. You have to do the directed analysis.

The conflict thing you talk about isn’t really an argument either way, since people working on related pieces of code always have potential conflicts, but reuse is a good thing, provided the interfaces make sense. It’s why coding involves humans with brains, and not mere machines. Interfaces often remain stable, or are intentionally marked as internal (a common pattern inside most of the frameworks you are considering here, for example). Avoiding reuse simply to avoid extra users and potential changes is poor engineering practice; it limits the need for interaction at the cost of requiring duplicated work when the same functionality is needed. The real evaluation therefore requires understanding whether the interfaces make sense and are stable, and whether it is easy to notify downstream dependents of changes (again, direction is important).

It would be easy to change code to get a much higher score by your metric and have it be much, much worse to both maintain and use due to duplication, repeated and slightly different interfaces, larger modules and functions, etc.

Realise, too, that inter-project analysis is slightly different from intra-project analysis, particularly when the projects are as small as Django or TurboGears. Internal implementation is done in a particular way to improve the user experience (i.e. the projects do something that is difficult internally so that things are easier for the user). It’s quite pragmatic to do it that way, as these projects are all quite small.

If your analysis really requires that directions are unimportant, then the simplifying assumptions you’re making are throwing away pertinent information. So, no, I don’t agree that modularity, as measured by this metric, is an “enabler for collaborative software development” (I do agree that modularity is good, but this metric is making simplifying assumptions that make it inappropriate).

Directed and undirected are definitely not equivalent in almost all network analysis and certainly not here.

Thanks for the insight!

What you just outlined shows how challenging it is to find a good metric in software engineering, which is why we have so many of them spanning different levels of analysis.

What happens is that we end up having to make a trade-off between richness and generalizability, which is why no metric should be taken at face value, including this one (which most certainly suffers on the richness side).

This is not to say that these metrics are useless. We just need to be careful in interpreting them and to expose them to further scrutiny.

Would you agree, though, that at the very least the metric I describe is related to how the source code is organized (irrespective of design considerations)? That organization may or may not have implications for developer collaboration or conflict. I guess further work needs to be done to understand this relationship and to show whether the metric is valid.

Congratulations on the new job btw :)

Sorry, but I would argue these metrics are pretty much useless for what you’re trying to measure. I’ve already commented on which real-world aspects affect subjective things like code quality (even for collaboration), and I agree with you that that’s why all metrics are pretty suspect. Most of the pertinent factors are hard to measure algorithmically. For this specific case, though (and this is why I don’t think it’s even a partial measure), there’s probably a simpler way to look at it. From a purely mathematical perspective, is the metric good at making any discrimination between “good” and “bad” results? Does it separate out a set of results of the same quality from the rest of the results?

I think the answer to that is “no”.

Most damningly, one can increase the “value” of the measure by writing unambiguously bad code: never ever using imports and repeating yourself everywhere would give an arbitrarily high level of modularity. And that’s a special case of the general problem. High and low scores don’t correspond to good or bad code. Results in all four quadrants are fairly easy to achieve.

Therefore the metric isn’t a good separator for points on the quality axis. The Newman paper identifies clusters in a network, and that’s fine. However, whether those clusters mean anything is domain-specific, and using “modularity” in the Newman sense to imply something about code quality is where I think you haven’t proven anything.

I am sorry my comments sound negative, since you’ve written, for the most part, a nice rigorous post, written the code to measure things, and presented your results very clearly. Your post has been interesting to read and think about (and reading the Newman paper was interesting; I hadn’t seen it before). I’m definitely grateful you took the time to do that. I’m just unconvinced about the appropriateness of this attempted solution to the problem.

Best wishes.

Thanks Malcolm,

Your comments made the effort of sharing this worthwhile. I appreciate all the thought and effort you put into them.

Might I add that Newman suggests the approach could be used in many fields that analyze networks, including computing, so it is not domain-specific. The wording he used makes it seem that way, though, and the example he used was specific. In another paper he suggests using this method as a less complex alternative to the Laplacian method for distributing instructions in a parallel or distributed computing architecture: highly dependent groups of instructions would be sent to one node, and other groups to other nodes, to maximize utilization of computing power. This seems analogous to the notion of modularity in software design if you think of collaborating programmers as distributed computing nodes.

Furthermore, based on what I have read so far in FLOSS research, there is a notion that successful FLOSS projects have a well-organized code base. The sample I have gathered so far points in that direction: modularity values ranged from 0.45 to 0.9 for successful FLOSS projects, including Django. These values are high enough to consider a code base well organized; values close to 0 would indicate a monolithic one. There is no notion of good or bad associated with the variability within this 0.45 to 0.9 range. The variability might be due to the trade-offs these projects made, as you pointed out, but we can’t know without further inquiry into the matter.

Take, for example, Linux and SQLAlchemy, which have values of 0.9 and 0.7 respectively. They sit low in the software stack and therefore do not have to make as many concessions for ease of use as projects higher up the stack would, such as Django, CherryPy, and TurboGears at 0.5, 0.45, and 0.4 respectively. But you have to admit that the increasing value for SQLAlchemy is intriguing.

So there are indicators that the measure has validity and that it captures something close to modularity, which might be useful when comparing projects. However, it might not be useful at the moment, mainly because we don’t know much about it and it is not rich enough for a programmer’s day-to-day use.

Think of this measure as a compass: it gives you a general direction but no specific directions for reaching your destination. Following the value of this metric blindly is like blindly following the compass needle north; it will definitely get you lost! Yet a compass is very useful when combined with a map and when you know how to use it.

Wow, I think it’s true for SQLAlchemy; the code is very modular. I think the big jump there was when they released 0.5.

Thanks Harro,

I believe the biggest jump came around May to July 2007, which I assume is when v0.4 was starting to take shape in trunk. There certainly was a lot of refactoring going on in May 2007, according to the commit logs. v0.5 seems to maintain the same trend.

Mike Bayer 15.04.2009 18:14

Interesting stuff. I’m still trying to understand exactly how “modularity” is being measured here. But yeah, it’s on our site:

Different parts of SQLAlchemy can be used independently of the rest. Elements like connection pooling, SQL statement compilation and transactional services can be used independently of each other, and can also be extended through various plugin points. The Object Relational Mapper (ORM) is a separate package which builds on top of these, and itself has several extension systems for modifying behavior at various levels.

Thanks Mike,

In network analysis, modularity is a characteristic of how well a graph is partitioned. I simply applied it to the source dependency graph after I partitioned it.
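For reference, the quantity being maximized is (roughly) Newman's modularity, which for an undirected graph and a given grouping of nodes can be written as:

    Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A is the adjacency matrix, k_i is the degree of node i, m is the number of edges, and delta(c_i, c_j) is 1 when nodes i and j land in the same group. Intuitively, Q is high when far more edges fall within groups than a random wiring of the same degrees would produce.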

Partitioning is the tricky part, because it is very computationally intensive and there are many ways in which a graph can be partitioned. So I used methods that group files that share dependencies so as to maximize within-group dependency (cohesion) and minimize cross-group dependency (coupling). These methods essentially maximize modularity.

I used a method by Newman, which runs in a reasonable amount of time on large graphs and attempts to maximize the modularity measure when deciding how to group files. So the modularity measure I report here can be seen as an upper limit on the actual modularity, and it can be used to tell how well the source files and their dependencies are organized. Mind you, as Malcolm pointed out, there are design trade-offs that can affect the value and that the measure does not take into account.

You can find more information about the package and algorithm I’m using to measure modularity here. The link even includes a citation of the academic paper that developed the modularity measure.

Since I have your attention, would you mind if I crawl the SQLAlchemy Trac page? I plan to perform more statistical analysis to see whether there is a relationship between modularity and other measures, like participation or the time it takes to close tickets. I can also report some activity results for the project over time.

I’ll have more analysis and reports soon; hopefully I’ll include Pylons and Twisted as well.