Introductory Python Programming Sessions in Kuwait

KBSoft will be conducting a series of introductory sessions designed to introduce programmers to new programming tools and to help improve their programming skills. The sessions are:

Sunday, Nov 21st 2010 7pm — 9pm: An Introduction to Revision Control with Git.

The goal of this session is to show how revision control systems can be an indispensable tool for programmers. Git will be the tool of choice for this session and we will go through a number of exercises to show how useful it can be to both individual programmers and programming teams. We will also be introducing a number of best practices for using Git and helping the attendees get more familiar with the system.

Tuesday, Nov 23rd 2010 7pm — 9pm: An Introduction to the Python Programming Language.

The goal of this session is to introduce Python as a general-purpose programming language that can be used to solve most problems faced by programmers in Kuwait. There will be a number of exercises to introduce the language syntax and features, in addition to an overview of some of the useful packages in the standard library, language best practices, and how to set up a functional development environment.

Thursday, Nov 25th 2010 7pm — 9pm: An Introduction to Web Application Development with Django.

The goal of this session is to introduce Django as the tool of choice for web development. Our approach will be to contrast the Django development model with that of the common PHP model that most attendees might be familiar with as they explored and learned PHP. We will be introducing the main components of the Django framework and go through a simple exercise that would give the users an appreciation of how useful and time-saving this framework can be.

The sessions will be held at the Kuwait Information Technology Society (KITS, formerly KCS) in AlRawda. The building is at the very corner of AlRawda, directly in front of AlJabriya, at the intersection of the 4th ring road with King Fahad Highway.

Requirements:

  • Understanding of at least a single programming language (e.g., PHP, VB, C, Java)

  • Laptop with the following installed to go through the exercises (No love will be shown for Windows users, you’re on your own ;) ):

  • Strongly recommended: bring your own internet connection, as the connection there might not be reliable

The Mythical Django Pony, a blessing?

On my way to have lunch on the first day of DjangoCon in Portland, I met Eric Holscher in one of the corridors of the hotel holding a pink unicorn. It seemed odd to me, so I approached and asked him about what he had in his hands. He explained that this was the unofficial Django mascot, which was a pony. I didn’t make the obvious observation that what he had in his hands was a unicorn, but asked how this came to be. He explained that one of the core devs on the Django mailing list responded to one of the feature requests saying “no, you can’t have a Pony!” as a way of politely refusing the feature request.

I was surprised to find myself in a DjangoCon session two days later where Russell Keith-Magee talked about declined feature requests and how they are referred to as ponies, and suggested ways to make your features more likely to get accepted. He also explained the whole story behind the mythical Django pony (which is really a unicorn!).

Why am I bringing this up now? Well, as I write the concluding chapters of my dissertation, I notice an odd relationship between the number of modules present in a code base and the number of new contributors. The statistical model suggests that adding modules to a code base is associated with an increase in the number of new contributors. However, this relationship reverses itself for projects that have an above-average number of modules. So adding modules when there are already many modules is associated with fewer contributors joining the development effort over time. The same effect could be observed for average module size (measured in SLOC), where an increase in the average size of a module is associated with an increase in new contributors up to a certain point. Then the relationship starts to reverse (a quadratic effect, for those of you who are statistically inclined).
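
For readers who are less statistically inclined, here is a minimal sketch of what a quadratic effect looks like. The coefficients are made up purely for illustration and are not estimates from the dissertation data; `predicted_new_contributors` and `turning_point` are hypothetical helper names.

```python
def predicted_new_contributors(modules, b0, b1, b2):
    """Quadratic model: b0 + b1*modules + b2*modules**2."""
    return b0 + b1 * modules + b2 * modules ** 2


def turning_point(b1, b2):
    """Module count at which the effect reverses (an inverted U needs b2 < 0)."""
    return -b1 / (2.0 * b2)


# Illustrative coefficients only (assumed, not fitted to any real project data).
B0, B1, B2 = 2.0, 0.8, -0.01

peak = turning_point(B1, B2)
below = predicted_new_contributors(30, B0, B1, B2)
at_peak = predicted_new_contributors(40, B0, B1, B2)
above = predicted_new_contributors(50, B0, B1, B2)
```

With a positive linear term and a negative squared term, the prediction rises up to the turning point and falls after it, which is the reversal described above.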

It dawned on me that one of the explanations for such an observation is that an increase in the number of modules or an increase in the average size of a module is a result of the increased complexity in the code base from adding or implementing a feature. Assuming that the projects in my sample are not mature to a point where there is no need for new contributors to join, we can attribute the decrease or increase in the number of new contributors to a balancing act of complexity: the community correctly decides to include just enough features that the codebase is not overly complex for new participants, yet valuable enough for new members to start using and contributing to it. (Please don’t bite my head off for implying causation here; I am just putting forward a hypothesis that seems to be supported by the data. It’s up to you whether to accept it or not.)

So how does this relate to ponies? Well, I just might have put my finger on one of the things that makes Django unique: the core developers know how to play this balancing act by knowing when to include or refuse a new feature. This, I believe, is possible because there is what we can refer to as a Django philosophy in deciding which feature requests are considered ponies. This seems to be paying off, as participation in Django is way off the charts. Don’t ask me what the Django philosophy is, as I have no idea. I am just observing its results. If someone out there thinks he knows what it is, or has a link, please do share.

The takeaway from this, at least for other FLOSS projects that want to learn from the Django community: be clear on the goals you want to achieve with your project, and don’t be afraid of saying “No! You can’t have a pony!” As for the Django community, you are already on the right track in trying to explain why features are refused. Keep at it!

Update: Here is the Django philosophy summed up quite eloquently by Dougal Matthews:

I think the philosophy is quite simple generally. “Does this need to be in the core?”

You can read his comment for more explanation.

Django … an outlier

While analyzing the development activity and code metrics for over 240 of the most actively developed FLOSS projects, guess which project popped out?

Yes, Django! It’s an outlier in terms of its activity. It’s influencing the results of my statistical analysis more than any other project, as per the Cook’s distance diagnostic index. Let me bring your attention to the lonely dot that is close to the value 1 at the top right corner. I missed it at first, but noticed it when I looked at the sorted values.

This is telling us that, at least among the sample that I have (actively developed FLOSS projects based on Python, C and C++), Django (including its community) is quite unique.
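
For anyone unfamiliar with the diagnostic, here is a minimal sketch of how Cook's distance is computed. It assumes a simple regression with a single predictor, which is far simpler than the model in the analysis, and the data points are made up; `cooks_distance` is a hypothetical helper name.

```python
def cooks_distance(xs, ys):
    """Cook's D for each observation of a simple linear regression y = a + b*x."""
    n = len(xs)
    p = 2  # number of fitted parameters: intercept and slope
    x_mean = sum(xs) / float(n)
    y_mean = sum(ys) / float(n)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point: how far its x value sits from the mean of x.
    leverages = [1.0 / n + (x - x_mean) ** 2 / sxx for x in xs]

    return [
        (e ** 2 / (p * mse)) * (h / (1.0 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


# Made-up data: nine points near a line plus one point far from the rest.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.1, 7.9, 9.0, 5.0]
distances = cooks_distance(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
```

A point gets a large Cook's D when it combines a large residual with high leverage, so sorting the values makes a single dominant observation easy to spot.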

I leave you with the graph of sorted Cook’s D values from my analysis:


Contributor analysis

Some of you might find it interesting to know that over 900 unique contributors have participated in Django’s development since Jan 1st, 2007, as attributed by the svn log messages. I would say the community is very healthy, especially if you compare it to other well-known projects:

Django: 906
Pylons: 80
PyPy: 240
Linux Kernel: 4043
PostgreSQL: 150
Apache HTTP server: 118
SQLAlchemy: 36 (Contributors identified from trac tickets mentioned in svn log)
Python: 428

Note that I count all anonymous or guest contributors as a single contributor, so these numbers can be considered conservative if the project allows anonymous contributions.

Identifying contributors in FLOSS projects

As part of my graduate work, I need to analyze FLOSS repositories to identify the number of external contributors. By an external contributor I mean any individual who made a patch contribution without having commit access to the source code repositories and who is also a first-time contributor.
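
To make the definition concrete, here is a minimal sketch using set differences. The function name `external_contributors`, its parameters, and the sample names are all hypothetical.

```python
def external_contributors(period_contributors, committers, earlier_contributors):
    """Contributors in the period who lack commit access and have not contributed before."""
    return set(period_contributors) - set(committers) - set(earlier_contributors)


# Hypothetical names, for illustration only.
externals = external_contributors(
    period_contributors=["alice", "bob", "carol"],
    committers=["alice"],
    earlier_contributors=["bob"],
)
```

Here only "carol" qualifies: "alice" has commit access and "bob" has contributed before.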

What I usually do to identify contributors in general is to parse the commit logs for any attribution to individuals who are not committers. Take for example the following log message:

Fixed #9859 – Added another missing force_unicode needed in admin when running on Python 2.3. Many thanks for report & patch to nfg. - (Django Revision 9656)

I wrote some regex-based scripts to identify names or pseudonyms such as “nfg” from the previous example.
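
As a rough illustration of what such a script might look like, here is a minimal sketch. The pattern and the helper name `find_contributors` are assumptions made for this example, not the scripts actually used; real log messages phrase attributions in many more ways.

```python
import re

# Assumed phrasings only: "thanks ... to <name>" or "patch from/by <name>".
ATTRIBUTION = re.compile(
    r"(?:thanks\b[^.]*?\bto|patch\s+(?:from|by))\s+([\w.@-]+)",
    re.IGNORECASE,
)


def find_contributors(log_message):
    """Return the names credited in a commit log message, trailing dots removed."""
    return [name.rstrip(".") for name in ATTRIBUTION.findall(log_message)]


names = find_contributors("Many thanks for report & patch to nfg.")
```

The character class allows dots and `@` so that e-mail style names are captured whole; the trailing full stop of the sentence is stripped afterwards.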

Things, however, are not always clear-cut for FLOSS projects, as not all projects attribute contributors in the log message. For example, I noticed in the MapServer project, which seems to be actively developed, that there were no attributions in the log messages. After inquiring on IRC, it turns out that the attributions are available in the project tracker (thanks danmo!). What is included in the commit log message is a reference to the ticket number.

So I rolled up my sleeves and wrote a quick parser to identify all ticket numbers in the log messages. I then used httplib2 and BeautifulSoup to connect to the project tracker and parse the patch name and contributor. The following is the code I used to perform that task:

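import httplib2
from BeautifulSoup import BeautifulSoup as BS


def get_mapserver_author(ticket):
    """Return one (em text, link text) pair per attachment listed on a ticket."""
    url = 'http://trac.osgeo.org/mapserver/ticket/%s' % ticket

    # Cache responses on disk so repeated runs do not re-fetch the same ticket.
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")

    bs = BS(content)
    div = bs.find('div', id='attachments')
    patches = []
    if div is None:
        # The ticket page has no attachments section.
        return patches
    for y in div.findAll('dt'):
        try:
            patches.append((y.em.string, y.a.string))
        except AttributeError:
            # The <dt> lacked the expected <em> or <a> element.
            print('Problem parsing ticket %s' % ticket)
    return patches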
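
The ticket-number parser itself is not shown above. A minimal sketch of that step, assuming tickets are referenced as `#1234` in the log message (the pattern, the helper name `find_ticket_numbers`, and the sample message are all hypothetical):

```python
import re

# Assumption: tickets are referenced as "#1234" in the log message.
TICKET_REF = re.compile(r"#(\d+)")


def find_ticket_numbers(log_message):
    """Return the distinct ticket numbers in a log message, in order of appearance."""
    seen = []
    for number in TICKET_REF.findall(log_message):
        if number not in seen:
            seen.append(number)
    return seen


# Hypothetical log message, for illustration only.
tickets = find_ticket_numbers("Fix label rendering (#2857), follow-up to #2857 and #2901")
```

Each number returned could then be passed to get_mapserver_author to look up the attachments on that ticket.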

I managed to identify 68 unique names for the period between Jan. 1st, 2007 and June 1st, 2009. These are of course the names of contributors who are not necessarily first-time contributors. Further analysis is needed before one can determine which of these contributors are “external.”

Of course, it goes without saying that the number of contributors is just an estimate. There might be some other contributions made through the mailing list (thanks FrankW for pointing this out). Not to mention the likelihood that an individual might have two different pseudonyms. As jmckenna (IRC: #mapserver) simply put it: “It’s difficult to identify FLOSS contributors.”

Just in case you are wondering, here are the names:

warmerdam
aalbarello
unicoletti
tamas
jimk
rouault
dmorissette
tomkralidis
aboudreault
brage
armin
bartvde
diletant
pramsey
nmandery
nharding
eshabtai@gmail.com
assefa
bartvde@osgis.nl
dfuhry
hschoenhammer
project10
hopfgartner
ujunge@pmcentral.com
sdlime
richf
dionw
nnikolov
abajolet
laurent
tbonfort
BobBruce
nsavard
woodbri
flavio
scott.e@goisc.com
dstrevinas
ivanopicco
jlacroix
cplist
kfaschoway
szigeti
zjames
elzouavo
mcoladas@telefonica.net
nfarrell@bom.gov.au
jparapar
vulukut@tescilturk.com
novorado
russellmcormond
msmitherdc
crschmidt
hjaekel
peter.hopfgartner@r3-gis.com
hulst
mturk@apache.org
thomas.bonfort@gmail.com
ivano.picco@aqupi.tk
jmckenna
drewsimpson
bartw
djay
sholl
dirk@advtechme.com
cph
jratike80
hobu
hpbrantley

Modularity of FreeBSD, Haiku, and OpenOffice.org

While analyzing data for my dissertation, I gave up on analyzing the modularity of FreeBSD, Haiku, and OpenOffice.org. I was trying to get 8 different modularity readings over time, but decided to stop analyzing these projects.

Why? Well, after having my MBP run non-stop for a week without getting a single data point, it became clear that I would not be graduating if I continued to pursue that route. Has anyone out there looked at the source code of these projects? Was it surprising that the graph analysis I was performing took ages? How easy is it to find your way around these projects?

It took a whole day to obtain all readings of modularity of the Linux Kernel using the same measure. It averaged around .92, which is excellent. I take it the projects I stopped analyzing aren’t as modular as the Linux Kernel.

How Great is Django’s Documentation?

One aspect of Django that never ceases to amaze me is how well it is documented. I believe this aspect of the Django project got many of us to use it, me included. While doing some boring graduate donkey work, Django’s name popped out, not surprisingly, as one of the most highly documented code bases out there (only projects focusing on documentation came close!). So let me take this opportunity to thank all those reading this who had a hand in making Django what it is. (Thank you!)

The ratio of lines of documentation to SLOC is plotted against SLOC as of Jan 1st, 2009. Django has one of the highest ratios compared to other open source projects of similar size (see the top part of the graph). Of course I could have plotted doc lines against SLOC, but then I would have had to show the deviation of Django from the regression line. This graph just makes it more obvious and is easier to plot:
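
How the documentation lines were counted is not described here. As one possible reading, the following minimal sketch computes a comment-to-code ratio for a single Python source string; the helper name `doc_to_sloc_ratio` and the counting rules are assumptions, and a real measurement would likely also cover separate documentation files.

```python
def doc_to_sloc_ratio(source):
    """Ratio of comment lines to source lines of code in a Python source string.

    Simplifications: only full-line '#' comments count as documentation,
    blank lines are ignored, and docstrings are counted as code.
    """
    doc_lines = 0
    sloc = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            doc_lines += 1
        else:
            sloc += 1
    return doc_lines / float(sloc) if sloc else 0.0


sample = """
# Add two numbers.
def add(a, b):
    # Plain addition.
    return a + b
"""
ratio = doc_to_sloc_ratio(sample)
```

Plotting this ratio against SLOC, as in the graph, keeps large and small projects comparable because the measure is already normalized by code size.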

Click to enlarge, and don’t forget to zoom in.

I also have a few questions should any Django contributors pass by. Do any of you think that you might have overdone it? Is it difficult to maintain the quality of the documentation and keep it up to date, and did anything change after moving to a regular release schedule? Is it only contributors experienced with the Django code base who are able to write documentation, to ensure quality?

Since I also live in a world where ponies roam and correlation is always seen as causation, let me end this post by saying:

Good documentation is the cause of success for Django, and high ice cream consumption is the cause of increased deaths in neighborhoods with swimming pools during summer!

Richness Vs. Generalizability

It was an interesting couple of days at the OSS2009 conference. I presented the FLOSS marketplace paper as part of the PhD consortium and found the feedback to be very constructive. My goal was to get feedback on the validity of measures I am using to test my theories, and was able to get some valuable insights.

Being trained in quantitative methods and a positivist philosophy, my inclination was to build generalizable theories about FLOSS communities and attempt to falsify them. This explains why I used theories like TCT and Organizational Information Processing Theory to build my research models with the project as the unit of analysis. This proved the biggest discussion point in many of my conversations. I managed to gather some useful insights, which I couldn’t have easily gathered on my own. This made me appreciate the value of diversity in research philosophies and methods at the conference.

What was made clear through the discourse was that each FLOSS community has unique processes, members and software. This made me reconsider some of the limitations of my methods, and improve on some of them wherever possible. One particular approach that was suggested to me at the conference was to use a mixed-method approach, in which I use qualitative methods on a limited sample of the projects I am observing and show that the nuances in these projects follow the predictions and logic of my theory. I could then use quantitative methods to generalize my findings.

To give an example of some of the methodological issues that caught me by surprise, consider the estimate of productivity when measured as lines of code added or removed in a commit. According to Hofmann and Riehle, the unified patch format does not track changed lines; rather, it treats any changed line as a line removed and a line added. A seemingly good estimate of productivity that takes this issue into account was presented at the conference, which I thought to be valuable.
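
The issue is easy to reproduce with Python's standard difflib module. This is only a minimal sketch of the counting problem, not the estimator presented at the conference; `count_added_removed` is a hypothetical helper name.

```python
import difflib


def count_added_removed(old_lines, new_lines):
    """Count '+' and '-' lines in a unified diff, skipping the file headers."""
    added = removed = 0
    for line in difflib.unified_diff(old_lines, new_lines, lineterm=""):
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


# One line is edited in place; nothing is really added or deleted.
old = ["x = 1", "y = 2", "z = 3"]
new = ["x = 1", "y = 20", "z = 3"]
added, removed = count_added_removed(old, new)
```

A single edited line shows up as one line removed and one line added, so naive added-plus-removed counts overstate the size of the change. Note that the header check is deliberately crude and would misclassify a content line that itself begins with two dashes or two plus signs.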

Getting to meet some FLOSS developers at the conference, and talking with them about research ideas and our work, was also enlightening. It was interesting to learn which issues and challenges matter most to developers, which would allow us to make our research more relevant. Furthermore, I found it valuable to get an account from practitioners of how some of the assumptions and theoretical explanations in my work are reflected in reality.

Based on the discussions, I got the feeling that most researchers I met were focused on building rich and specialized theories. I felt that generalizable theories were somewhat underrepresented. I could not tell if this was due to the philosophical background of many of the European researchers I met, or because the FLOSS phenomenon is not yet well understood and richness is needed before we can generalize. Personally, I think there is value in generalization, and we know enough about at least the software development practices to start building such theories.

I would love to hear the thoughts of whoever reads this post on whether it’s possible or valuable to build generalizable theories related to the FLOSS phenomenon.

Brooks’ law at work

Here is a snapshot of how different FLOSS projects compare in terms of source lines of code (log scale) plotted against modularity, from March 2008 and for a total of 200+ observations (as described in a previous post).


Given that I sampled the projects from the top 1000 listed on ohloh.net, which are mostly actively developed FLOSS projects, one can see that a good organization of source code dependencies needs to be maintained for development to continue, as suggested by Brooks’ Law. This might explain why we have an almost empty bottom right quadrant, which represents projects with a large code base and poor organization of dependencies.

The significance of this graph is that it adds to the validity of the graph modularity measure we used in comparing different Python-based projects. This is hopefully but one step of many toward a better understanding of FLOSS project management.

Presentation material for OSS2009 PC

Click here to get the latest copy of my dissertation essays and my OSS2009 presentation.