All articles tagged with “phd”

The Mythical Django Pony, a blessing?

On my way to lunch on the first day of DjangoCon in Portland, I met Eric Holscher in one of the corridors of the hotel holding a pink unicorn. It seemed odd, so I approached him and asked what he had in his hands. He explained that it was the unofficial Django mascot, which is a pony. I didn’t make the obvious observation that what he was holding was a unicorn, but asked how this came to be. He explained that one of the core developers on the Django mailing list had responded to a feature request with “no, you can’t have a pony!” as a way of politely declining it.

I was surprised to find myself in a DjangoCon session two days later in which Russell Keith-Magee talked about declined feature requests, how they are referred to as ponies, and ways to make your feature requests more likely to be accepted. He also told the whole story behind the mythical Django pony (which is really a unicorn!).

Why am I bringing this up now? Well, as I write the concluding chapters of my dissertation, I notice an odd relationship between the number of modules in a code base and the number of new contributors. The statistical model suggests that adding modules to a code base is associated with an increase in the number of new contributors. However, this relationship reverses for projects with an above-average number of modules: adding modules when the count is already high is associated with fewer contributors joining the development effort over time. The same effect shows up for average module size (measured in SLOC), where an increase in average module size is associated with more new contributors up to a certain point, after which the relationship starts to reverse (a quadratic effect, for those of you who are statistically inclined).
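For the statistically inclined, here is a minimal sketch of the kind of quadratic specification behind that statement. The data file and column names are hypothetical, and this is an illustration of the general approach rather than the exact model from the dissertation:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per project, with columns such as
# new_contributors, n_modules, and avg_module_sloc. The file name and
# column names are illustrative, not the actual dissertation data set.
df = pd.read_csv("projects.csv")
df["n_modules_sq"] = df["n_modules"] ** 2
df["avg_module_sloc_sq"] = df["avg_module_sloc"] ** 2

X = sm.add_constant(
    df[["n_modules", "n_modules_sq", "avg_module_sloc", "avg_module_sloc_sq"]]
)
model = sm.OLS(df["new_contributors"], X).fit()
print(model.summary())

# A positive linear coefficient with a negative quadratic coefficient captures
# the pattern described above: the association is positive up to a turning
# point at -b_linear / (2 * b_quadratic), then reverses.
b1, b2 = model.params["n_modules"], model.params["n_modules_sq"]
print("turning point in number of modules:", -b1 / (2 * b2))
```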

It dawned on me that one explanation for this observation is that an increase in the number of modules, or in the average size of a module, results from the complexity added to the code base when features are implemented. Assuming the projects in my sample are not so mature that new contributors are no longer needed, we can attribute the rise and fall in the number of new contributors to a balancing act around complexity: the community includes just enough features to keep the code base valuable enough for new members to start using and contributing to it, without making it overly complex for new participants (please don’t bite my head off for implying causation here; I am just forwarding a hypothesis that seems to be supported by the data. It’s up to you whether to accept it or not).

So how does this relate to ponies? Well, I just might have put my finger on one of the things that makes Django unique: the core developers know how to play this balancing act, knowing when to include and when to refuse a new feature. I believe this is possible because there is what we can refer to as a Django philosophy for deciding which feature requests are ponies. It seems to be paying off, as participation in Django is off the charts. Don’t ask me what the Django philosophy is, as I have no idea; I am just observing its results. If anyone out there thinks they know what it is, or has a link, please do share.

The takeaway, at least for other FLOSS projects that want to learn from the Django community: be clear on the goals you want to achieve with your project, and don’t be afraid of saying “No! You can’t have a pony!” As for the Django community, you are already on the right track in trying to explain why features are refused. Keep at it!

Update: Here is the Django philosophy summed up quite eloquently by Dougal Matthews:

I think the philosophy is quite simple generally. “Does this need to be in the core?”

You can read his comment for more explanation.

Django … an outlier

While analyzing the development activity and code metrics for over 240 of the most actively developed FLOSS projects, guess which project popped out?

Yes, Django! It’s an outlier in terms of its activity, and it influences the results of my statistical analysis more than any other project according to the Cook’s distance diagnostic. Let me bring your attention to the lonely dot close to the value 1 in the top right corner of the graph below. I missed it at first, but noticed it when I looked at the sorted values.

This tells us that, at least among my sample of actively developed Python, C, and C++ based FLOSS projects, Django (including its community) is quite unique.
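For those curious how such a diagnostic is produced, here is a minimal sketch using statsmodels. The data below is synthetic (a stand-in for the real project-level metrics), with one artificially extreme observation playing the role of Django; it is not my actual analysis script:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic stand-in for the project-level data set (the real analysis uses
# activity and code metrics for over 240 FLOSS projects).
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(240, 3)))
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=240)
y[-1] += 25  # one artificially extreme project, playing the role of Django

# Cook's distance measures how much each observation pulls on the fitted model.
model = sm.OLS(y, X).fit()
cooks_d = model.get_influence().cooks_distance[0]

# Sorting the values makes a single dominant observation easy to spot.
plt.plot(np.sort(cooks_d), marker="o", linestyle="none")
plt.xlabel("projects (sorted)")
plt.ylabel("Cook's distance")
plt.show()
```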

I leave you with the graph of sorted Cook’s D values from my analysis:

[Graph: sorted Cook’s distance values across the sampled projects]

Richness Vs. Generalizability

It was an interesting couple of days at the OSS2009 conference. I presented the FLOSS marketplace paper as part of the PhD consortium and found the feedback very constructive. My goal was to get feedback on the validity of the measures I am using to test my theories, and I was able to gather some valuable insights.

Being trained in quantitative methods and a positivist philosophy, my inclination was to build generalizable theories about FLOSS communities and attempt to falsify them, which explains why I used theories like TCT and Organizational Information Processing Theory to build my research models with the project as the unit of analysis. This proved to be the biggest discussion point in many of my conversations. I gathered useful insights that I could not easily have reached on my own, which made me appreciate the value of the diversity of research philosophies and methods at the conference.

What became clear through the discourse was that each FLOSS community has unique processes, members, and software. This made me reconsider some of the limitations of my methods and improve on them where possible. One approach suggested to me at the conference was a mixed-method design: use qualitative methods on a small subset of the projects I am observing to show that the nuances of these projects follow the predictions and logic of my theory, and then use quantitative methods to generalize the findings.

To give an example of a methodological issue that caught me by surprise, consider estimating productivity as the number of lines added or removed in a commit. According to Hofmann and Riehle, the unified patch format does not track changed lines; instead, every changed line is counted as one line removed and one line added. A seemingly good estimate of productivity that takes this into account was presented at the conference, which I thought was valuable.
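To make the issue concrete, here is a small Python sketch of my own (not the estimator presented at the conference) showing how a naive count of a unified diff double-counts a changed line, and one crude way to correct for it by pairing up additions and removals:

```python
# Illustrative sketch only: count added and removed lines in a unified diff,
# then treat paired adds/removes as "changed" lines rather than one addition
# plus one removal.
def summarize_unified_diff(diff_text: str) -> dict:
    added = removed = 0
    for line in diff_text.splitlines():
        # Skip file headers ("+++", "---"); count body lines only.
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    # Naive view: a changed line shows up as one "-" and one "+".
    # Crude correction: pair them up across the diff and call the overlap "changed".
    changed = min(added, removed)
    return {
        "added_only": added - changed,
        "removed_only": removed - changed,
        "changed": changed,
    }

example = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,2 @@
 import os
-DEBUG = True
+DEBUG = False
"""
print(summarize_unified_diff(example))  # {'added_only': 0, 'removed_only': 0, 'changed': 1}
```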

Getting to meet some FLOSS developers at the conference and talking with them about research ideas and our work was also enlightening. It was interesting to learn which issues and challenges developers consider important, which should help us make our research more relevant. Furthermore, I found it valuable to get an account from practitioners of how some of the assumptions and theoretical explanations in my work reflect reality.

Based on the discussions, I got the feeling that most researchers I met were focused on building rich, specialized theories, and that generalizable theories were somewhat underrepresented. I could not tell whether this was due to the philosophical background of many of the European researchers I met, or because the FLOSS phenomenon is not yet well understood and richness is needed before we can generalize. Personally, I think there is value in generalization, and we know enough about at least the software development practices to start building such theories.

I would love to hear the thoughts of whoever reads this post on whether it’s possible or valuable to build generalizable theories about the FLOSS phenomenon.

Presentation material for OSS2009 PC

Click here to get the latest copy of my dissertation essays and my OSS2009 presentation.