Dependency Graph Analysis

I tried analyzing the dependency graphs of Haiku and FreeBSD to see how modular their code bases are. I ran the analysis on a Dell XPS gaming rig; it ran for 7 straight days at 100% CPU utilization and it still hasn’t finished.

For my study, I need to perform this kind of analysis for both operating systems at 5 different points in time. If I can’t obtain a single data point after a full week of computation, I think it would be a good idea to drop these two projects.

Hard Disc Corruption

After recently upgrading to Jaunty Jackalope, I thought everything went well. I found the UI very responsive compared to Hardy Heron and all the applications I used were running perfectly. That is, until I tried to check the data I was storing in MySQL.

I knew there was a problem when I tried to retrieve data using the Django ORM: the queryset returned some data when I did qs.objects.all(), but qs.objects.count() returned 0. Turns out there was corruption on the hard disc. Luckily I had some backups, but much of the manual donkey work I had been doing was lost and I have to redo it all over again (that is, look at 80K log messages).

After restoring everything, I finally realized how much data I am dealing with. I downloaded the source code of over 200 Python and C based open source projects, which amounts to over 46GB! I have been parsing all these source files for data for the past couple of months. I hope this effort finally pays off and I graduate. I am kind of excited to look at the results of all this analysis. But I have to complete the theoretical development part of my dissertation.

Is It Piracy When ISPs Charge Per Usage?

Seems like many internet service providers in the US are attempting to push for pay per usage instead of capacity. Here is a side of it which I’ve yet to hear anyone talk about. Would ISPs be considered to be pirating copyrighted material if they charge based on the bits passing through their networks?

Think about it: mobile carriers transfer bits of voice data which phone users produced, so they are not infringing on anyone’s copyrights. ISPs on the other hand would be copying bits of a downloaded song from one router to the next and charging the end user for it. Now correct me if I’m wrong, but wouldn’t they be making money off of content which they have no rights over?

I would buy the argument that they are only providing a linking service if they were selling us capacity. Charging per bit is very much tied to content. I think both the RIAA and MPAA should be going after ISPs. The activity of BitTorrent trackers like The Pirate Bay seems more legitimate than ISPs charging per bit, since they do not host the content or make money off of it, yet The Pirate Bay is the target of lawsuits. Why should ISPs be treated any differently?

In my mind, a legitimate approach for ISPs to charge per bit is to make sure that all available content on the internet is free, which entails that they work out deals with all content providers. This would bring the internet model closer to the argument I hear that people pay per usage for electricity and water. Well, the electric and water companies produce these commodities for us, what do ISPs produce?

Django Code Base Modularity

Let me start by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e. imports) within each group and minimum dependency between groups.

Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, the whole set of source files shares a high level of dependency on one another, which makes the code base look like a single monolithic unit.

This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation at DjangoCon. He had some rather excellent lessons learned for the community, but one part of his presentation stuck out for me, where he compared Django’s dependency graph with that of Turbogears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or decide how well they are arranged. I think you now see where this is going.

I went ahead and generated the dependency graphs for both Django trunk and Turbogears trunk. For the fun of it, I also included other Python based projects: CherryPy, SQLAlchemy and Genshi. Let me be clear on what I mean by the dependency graph of trunk. I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.
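To give an idea of what one edge in such a graph is, here is a minimal sketch of pulling the imports out of a single Python file with the standard ast module. This is an assumption about how the extraction could be done, not necessarily how the graphs here were built, and imported_modules is a hypothetical helper name. Note that ast only parses source that is valid for the interpreter running it.

```python
import ast


def imported_modules(source):
    """Return the sorted module names imported by a piece of Python source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b, c" contributes a.b and c
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # "from a.b import c" contributes a.b; relative imports keep their dots
            modules.add("." * node.level + (node.module or ""))
    return sorted(modules)


example = "import os\nfrom django import forms\nfrom . import utils\n"
print(imported_modules(example))
```

Each (file, imported module) pair would then become one edge of the dependency graph for that commit.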

I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory and it matches the definition I just gave. I used a method by Newman which identifies groups in graphs using a clustering method that attempts to maximize modularity. Modularity in graph theory is basically a characteristic of how a graph is partitioned.

When applied to a source dependency graph, the method groups files that share dependencies into groups (i.e. modules), and the identified groups maximize the modularity of the graph. The modularity value found by this method is an upper limit for how modular the code base is. So without further ado, I give you the result of the analysis, where I calculated the modularity of the dependency graph after each commit and averaged the values per month:
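To make "a characteristic of how a graph is partitioned" concrete, here is a rough sketch of the modularity score Q for one given grouping. It only scores a partition; it does not do the clustering search that Newman's method performs. It also assumes the import edges are treated as undirected, which is a simplification on my part, and the function and variable names are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Modularity Q of an undirected graph for a given grouping.

    edges: iterable of (u, v) pairs; group_of: dict mapping node -> group label.
    Q = sum over groups of (edges inside / m) - (degree sum / 2m) ** 2
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    inside = defaultdict(int)  # edges with both ends in the same group
    degree = defaultdict(int)  # sum of node degrees per group
    for u, v in edges:
        degree[group_of[u]] += 1
        degree[group_of[v]] += 1
        if group_of[u] == group_of[v]:
            inside[group_of[u]] += 1
    return sum(inside[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)


# Two triangles joined by a single edge (toy data, not from any project).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the behaviour the definition of modularity above asks for.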

Modularity graphs
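The per-month averaging is simple enough to sketch. Assuming each commit yields a (date, modularity) pair, something like the following would do. monthly_average is a hypothetical helper and the sample values are made up; they are not results from the analysis.

```python
from collections import defaultdict
from datetime import date


def monthly_average(samples):
    """Average (commit_date, modularity) pairs per calendar month."""
    buckets = defaultdict(list)
    for day, q in samples:
        buckets[(day.year, day.month)].append(q)
    # Sort by (year, month) so the result reads chronologically.
    return {month: sum(qs) / len(qs) for month, qs in sorted(buckets.items())}


samples = [  # made-up values, for illustration only
    (date(2007, 4, 2), 0.30),
    (date(2007, 4, 20), 0.40),
    (date(2007, 5, 1), 0.50),
]
result = monthly_average(samples)
print(result)
```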

Some highlights

  • Django seems to have a good increasing trend (Django community, keep up the good work!).
  • Turbogears, what happened? This is Turbogears trunk btw, so it’s V2.0. I think they should have listened to Mark Ramm’s presentation. Seems like something went wrong, maybe backwards compatibility?
  • I marked out the two highest jumps in Django’s modularity. I attributed the first to the Boulder sprint, since I couldn’t find any other significant news during April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
  • If you are wondering where queryset-refactor is, look 3 points prior to the merging of newforms-admin. I don’t think it had an effect on modularity, any ideas why?
  • SQLAlchemy, well done guys! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments to confirm that there is some level of reliability in the method I am using (I need to graduate, people).

I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!

A Theory on IBM’s Sun Buyout

With regard to SUN’s buyout by IBM, it struck me as odd that IBM would offer $7 billion for a company with which it has a lot of overlap. Both IBM and SUN sell servers, both are also familiar with the open source scene, and they should have no problem benefiting from one another’s expertise in software. So why would IBM offer to buy out SUN and then withdraw?

Well here’s my theory

I think this is a move by IBM, who might currently feel threatened by Cisco’s recent entry into the server business. Initially they might have thought they could buy out SUN and prevent Cisco from acquiring them, but they weren’t willing to pay the full price because they know they will not benefit from Sun. So the plan is to make it difficult for Cisco to buy SUN out given the low and crazy valuation set by the marketplace.

After IBM’s move, Cisco will have to beat IBM’s offer before they can buy out SUN. IBM’s plan is to make a dent in their foe’s balance sheet, so they would be less effective in their competition. If the move deters likely suitors from the buyout, then this would be a blessing for IBM.

My prediction is that a buyout will come from either Cisco, for the server expertise, or Oracle, for the service and software talent (anyone for OracleDB + ZFS + Oracle Sparc Servers?). Both companies have good synergies with Sun and the resources to buy them out.

Is this different from the MS-Yahoo buyout?

Yes! Sun could afford to play hard to get because, unlike Yahoo, they have two suitors who are better than IBM. Yahoo only had MS, and to this day I don’t see how the two companies could function as a unit.

Disclaimer: The idea was born out of a discussion I had on /. I also don’t own JAVA, IBM, CSCO, or ORCL stocks, but I am seriously considering JAVA should it drop further.

Tinyurl Paranoia

Am I being paranoid, or can tinyurl be used to hide links that people would usually know will lead them to malicious content?

Dissertation Donkey Work Has Started!

It’s a boring part of my dissertation, but it must be done. After trying to automate the parsing of patch contributors from RCS logs and deciding the approach wasn’t very reliable, I decided to use django to display the log messages with suspected contributor names ten at a time. I sift through these messages and approve them ten at a time.

Thank god Linux kernel development is done using git, which stores the author’s name, so I don’t have to parse the log message. This saved me from sifting through 50K log messages. I will check out Wine too, to see if the author information is captured by git there as well. But even with this, I still have 90K log messages to go through.

To the Postgres committers out there, I appreciate what you’re doing, but let me say this: I hate you for making my life a living hell. Would it hurt to put “patch by” in front of author’s name in the log message?
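For what it's worth, here is a minimal sketch of how a "patch by" convention could be picked up automatically if it were followed. The regular expression, the helper name and the sample messages (including the names in them) are all made up for illustration; real log messages are far messier, which is exactly why the manual review is needed.

```python
import re

# Hypothetical convention: "patch by <Capitalized Name>" somewhere in the message.
# Only the "patch by" part is matched case-insensitively.
PATCH_BY = re.compile(
    r"(?i:patch\s+by)[:\s]+([A-Z][\w.'-]*(?:[ \t]+[A-Z][\w.'-]*)*)"
)


def suspected_contributors(message):
    """Return candidate contributor names found in one log message."""
    # Drop a trailing full stop picked up at the end of a sentence.
    return [name.rstrip(".") for name in PATCH_BY.findall(message)]


print(suspected_contributors("Fix typo in planner. Patch by Jane Q. Doe, reviewed later."))
print(suspected_contributors("Tweak docs; patch by John Smith."))
print(suspected_contributors("Refactor the patch handling code"))
```

The matches are only suspects; each one would still go through the approve step described above.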

Django Forms Gotcha

Well, I just wasted 4 good hours of my time thinking I was facing a bug. I set the prefix for a form while trying to set the initial value for a field. Only the first instance of the instantiated form would include the initial value. This code illustrates my problem:

    from django import forms

    class MyForm(forms.Form):
        names = forms.CharField(required=False)
        id = forms.CharField(widget=forms.HiddenInput, required=False)


    # Python 2: build one form per prefix, passing the value as bound data.
    print [MyForm({'id': y}, prefix=y).as_p() for y in range(2)]

    # Output:
    #
    # [u'<p><label for="id_names">Names:</label> <input type="text" name="names"
    #    id="id_names" /><input type="hidden" name="id" value="0" id="id_id" /></p>',
    #  u'<p><label for="id_1-names">Names:</label> <input type="text" name="1-names"
    #    id="id_1-names" /><input type="hidden" name="1-id" id="id_1-id" /></p>']
    #
    # Notice how the 2nd form instance doesn't have value="1" for the hidden input field.

Solution was simple. Thanks to the guys at #django on freenode, all I needed to do was instantiate my form as such:

    MyForm(initial={'id': y})

That fixed it for me. I was told that when using data=, the data keys need to match the field names including the prefix. With the initial keyword, the prefix is taken care of for you.
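If you do want to pass data= together with a prefix, the keys have to be spelled the way the rendered HTML spells the input names (for example name="1-id" in the output above). Here is a small plain-Python sketch of building such keys. bound_data is a hypothetical helper, not part of Django, and the handling of an empty prefix is an assumption based on the first form in the output, whose names carry no prefix.

```python
def bound_data(prefix, values):
    """Build a data dict whose keys follow the "<prefix>-<field>" naming.

    A falsy prefix (such as 0 or None) leaves the keys untouched, mirroring
    the unprefixed names of the first form in the output above.
    """
    if not prefix:
        return dict(values)
    return {"%s-%s" % (prefix, name): value for name, value in values.items()}


print(bound_data(1, {"id": 1}))
print(bound_data(0, {"id": 0}))
```

With this, the second form in the example would be given the key '1-id' instead of 'id', which is the matching the #django folks were describing.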

Making Sense of Unicode in Python

I finally managed to make sense of string encoding in Python. Seems like we have two types of strings: unicode strings, written as u'unicode string', and regular byte strings, which are the standard Python strings and are treated as ASCII by default.

The replace solution which I suggested in an earlier post kept failing with ascii decoder errors, which I couldn’t explain at that time given that I was encoding the strings using ‘utf-8’ as such: encodedstring = rawstring.encode('utf-8','replace')

I even tried out the excellent chardet module, which predicts what encoding is used for a string, thinking that I needed to identify the correct encoding and use it instead of utf-8. But as it turns out, rawstring was a regular (byte) Python string, and calling encode on it makes Python first decode it as ASCII, which is where the decoder errors came from. Before I could properly encode to utf-8 I needed to make sure the string was a unicode Python string.

So I tried the following, but to no avail: encodedstring = unicode(rawstring).encode('utf-8','replace')

The solution to my problem lay in the codecs module. As it turns out, you need to convert your strings to unicode as early as possible in the lifetime of your program to avoid surprises later. For me, that was the moment I read the byte streams from the log file I was trying to parse. What I did was:

    import codecs

    log_path = 'commits.log'  # hypothetical path; the original snippet omitted the file name
    log_file = codecs.open(log_path, 'r', 'utf-8', 'replace')
    for line in log_file:
        pass  # do some work on the decoded (unicode) line

line is now a unicode string, decoded from utf-8. I was finally able to parse log messages with weird international characters and store them using the Django ORM.
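Here is a self-contained sketch of the same "decode as early as possible" idea that can be run as is. It uses io.open, which takes the encoding and error handler as keyword arguments; the read_lines helper and the throwaway file contents are made up for the demo.

```python
import io
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Yield decoded text lines; undecodable bytes become U+FFFD."""
    with io.open(path, "r", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Demo with a throwaway file that contains one invalid byte (0xff).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fix bug\nna\xc3\xafve \xff commit\n")
lines = list(read_lines(tmp.name))
os.remove(tmp.name)

print(lines)
```

Everything downstream of read_lines only ever sees text, so no later step has to guess what the bytes meant.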

YAY!

Parsing XML using xml.etree was not as straightforward. I’ll keep that discussion for another time.

Solution to my Encoding Problem

When you analyse over 300K commit messages, changing problematic encodings gets a bit tiresome. This got me motivated to look for a solution and here it is:

    obj.text = parsed_log_message.encode('utf-8', 'replace')

Saving obj will stop DjangoUnicodeDecodeError from squealing and replace problematic characters with a ‘?’. I guess this is part of the zen of Python:

Errors should never pass silently. Unless explicitly silenced.
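To see what the 'replace' error handler does in isolation, here is a tiny sketch. The sample strings are made up, and ascii is used on the encoding side only because it makes the replacement easy to see.

```python
text = u"caf\u00e9 \u2603"  # an accented letter and a snowman

# Encoding: characters the target codec cannot represent become "?".
encoded = text.encode("ascii", "replace")

# Decoding: bytes that are not valid in the codec become U+FFFD
# instead of raising an error.
decoded = b"caf\xe9".decode("utf-8", "replace")

print(encoded)
print(repr(decoded))
```

Either way the error is explicitly silenced, and the price is that the original character is gone for good.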