Archive for April, 2009

Hard Disc Corruption

After recently upgrading to Jaunty Jackalope, I thought everything had gone well. I found the UI very responsive compared to Hardy Heron, and all the applications I used were running perfectly. That is, until I tried to check the data I was storing in MySQL.

I knew there was a problem when I tried to retrieve data using the Django ORM: the queryset returned some data when I called qs.objects.all(), but qs.objects.count() returned 0. It turns out the hard disc was corrupted. Luckily I had some backups, but much of the manual donkey work I had been doing was lost and I have to redo it all over again (that is, looking through 80K log messages).

After restoring everything, I finally realized how much data I am dealing with. I downloaded the source code of over 200 Python- and C-based open source projects, which amounts to over 46GB! I have been parsing all these source files for data for the past couple of months. I hope this effort finally pays off and I graduate. I am kind of excited to look at the results of all this analysis, but I still have to complete the theoretical development part of my dissertation.

Is It Piracy When ISPs Charge Per Usage?

It seems that many internet service providers in the US are attempting to push for pay-per-usage pricing instead of pricing by capacity. Here is a side of this which I’ve yet to hear anyone talk about: would ISPs be considered to be pirating copyrighted material if they charge based on the bits passing through their networks?

Think about it: mobile carriers transfer bits of voice data which phone users produced, so they are not infringing on anyone’s copyright. ISPs, on the other hand, would be copying the bits of a downloaded song from one router to the next and charging the end user for it. Now correct me if I’m wrong, but wouldn’t they be making money off of content which they have no rights over?

I would buy the argument that they are only providing a linking service if they were selling us capacity. Charging per bit is very much tied to content. I think both the RIAA and the MPAA should be going after ISPs. The activity of BitTorrent trackers like The Pirate Bay seems more legitimate to me than ISPs charging per bit, since the trackers do not host the content or make money off of it, yet The Pirate Bay is the target of lawsuits. Why should ISPs be treated any differently?

In my mind, a legitimate approach for ISPs to charge per bit is to make sure that all available content on the internet is free, which entails working out deals with all content providers. This would bring the internet model closer to the argument I keep hearing, that people pay per usage for electricity and water. Well, the electric and water companies produce these commodities for us; what do ISPs produce?

Django Code Base Modularity

Let me start by defining what I mean by modularity. Modularity is how well source code files are arranged into groups that share maximum dependency (i.e. imports) within each group and minimum dependency between groups.

Groups that share a high degree of dependency are said to be cohesive, and they usually serve a single function. When these cohesive groups have few dependencies between them, the code base is said to be loosely coupled. When a code base is non-modular, the source files as a whole share a high level of dependency on one another, which makes the code base look like a single monolithic unit.

This obsession with modularity and dependency graphs was actually sparked by Mark Ramm’s presentation at DjangoCon. He had some rather excellent lessons learned for the community, but one part of his presentation stuck out for me, where he compared Django’s dependency graph with that of TurboGears (around the 9th minute). I am no graph expert, but I am almost certain that eyeballing graphs is not a good way to compare them or to decide how well they are arranged. I think you now see where this is going.

I went ahead and generated the dependency graphs for both Django trunk and TurboGears trunk. For the fun of it, I also included other Python-based projects: CherryPy, SQLAlchemy and Genshi. Let me be clear on what I mean by the dependency graph of trunk. I actually went through the whole trunk history of these projects and generated the dependency graph for each commit.
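To make "dependency graph" concrete, here is a minimal sketch of how import edges could be pulled out of Python files with the standard `ast` module. It is not the tooling used for the analysis in this post; the function names and the three-file toy project are made up for illustration, and it only handles absolute imports (relative imports are skipped, and old Python 2 sources may not parse under Python 3).

```python
import ast


def imported_modules(source):
    """Return candidate module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from pkg import x" may refer to module pkg or to module pkg.x,
            # so record both as candidates.
            names.add(node.module)
            names.update(f"{node.module}.{alias.name}" for alias in node.names)
    return names


def dependency_edges(sources):
    """Map {module name: source text} to a set of (importer, imported) edges.

    Only imports that point at another module in `sources` are kept.
    """
    edges = set()
    for name, text in sources.items():
        for target in imported_modules(text):
            if target in sources and target != name:
                edges.add((name, target))
    return edges


# Hypothetical three-file project, used only to exercise the functions.
toy_sources = {
    "app.models": "import os\nfrom app import utils\nimport app.db\n",
    "app.db": "import app.utils\n",
    "app.utils": "import sys\n",
}
toy_edges = dependency_edges(toy_sources)
```

Running something like this against a checkout of each commit would give one edge set per commit, which is the kind of input the rest of the analysis needs.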

I ended up with a lot of graphs, and eyeballing is certainly not a good way to compare them. As it turns out, the concept of modularity exists in graph theory and it matches the definition I just gave. I used a method by Newman which identifies groups in graphs using a clustering method that attempts to maximize modularity. Modularity in graph theory is basically a characteristic of how a graph is partitioned.
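As a rough illustration of modularity as a property of a partition, the sketch below computes the usual Newman-Girvan modularity score Q for a given grouping of nodes. Two assumptions to flag: the graph is treated as undirected (import direction is ignored) with each edge listed once, and the grouping is supplied by hand rather than found by a clustering method. The function name and the toy graph are illustrative only.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Modularity Q of a partition of an undirected graph.

    Q = sum over groups of
        (edges inside the group / m) - (total degree of the group / 2m) ** 2
    where m is the number of edges. `edges` holds each undirected edge once,
    and `group_of` maps every node to its group label.
    """
    m = len(edges)
    inside = defaultdict(int)   # edges with both ends in the group
    degree = defaultdict(int)   # sum of node degrees in the group
    for a, b in edges:
        degree[group_of[a]] += 1
        degree[group_of[b]] += 1
        if group_of[a] == group_of[b]:
            inside[group_of[a]] += 1
    return sum(inside[g] / m - (degree[g] / (2 * m)) ** 2 for g in degree)


# Toy graph: two triangles (a, b, c) and (x, y, z) joined by one edge c-x.
toy_graph = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("c", "x"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1, "z": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_graph, two_groups)   # 5/14, about 0.357
q_single = modularity(toy_graph, one_group)   # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the sense in which a partition can be "more modular". A clustering method then searches for the partition with the highest score.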

When applied to a source dependency graph, the method groups files that share dependencies into groups (i.e. modules), and the identified groups maximize the modularity of the graph. The modularity value found by this method is therefore an upper limit on how modular the code base is. So without further ado, I give you the result of the analysis, where I calculated the modularity of the dependency graph after each commit and averaged the values per month:

Modularity graphs

Some highlights

  • Django seems to have a good increasing trend (Django community, keep up the good work!).
  • TurboGears, what happened? This is TurboGears trunk btw, so it’s V2.0. I think they should have listened to Mark Ramm’s presentation. Seems like something went wrong, maybe backwards compatibility?
  • I marked out the two highest jumps in Django’s modularity. I attributed the first to the Boulder sprint, since I couldn’t find any other significant news during April 2007. The second can be attributed to the newforms-admin branch merging into trunk.
  • If you are wondering where queryset-refactor is, look 3 points prior to the merging of newforms-admin. I don’t think it had an effect on modularity. Any ideas why?
  • SQLAlchemy, well done guys! Has anyone worked on SQLAlchemy who can confirm that their code is indeed modular? I would appreciate any comments to confirm that there is some level of reliability in the method I am using (I need to graduate, people).

I hope you find this all interesting. I’ll be sharing some more analysis about other FLOSS projects. I’m currently working on Pylons, Twisted, and Trac. I thought about doing Zope but my computer begged for mercy. Stay tuned!

A Theory on IBM’s Sun Buyout

With regard to Sun’s buyout by IBM, it struck me as odd that IBM would offer $7 billion for a company with which it has a lot of overlap. Both IBM and Sun sell servers, and both are familiar with the open source scene and should have no problem benefiting from one another’s expertise in software. So why would IBM offer to buy out Sun and then withdraw?

Well, here’s my theory.

I think this is a move by IBM, which might currently feel threatened by Cisco’s recent entry into the server business. Initially they might have thought they could buy out Sun and prevent Cisco from acquiring it, but they weren’t willing to pay the full price because they know they will not benefit from Sun. So the plan is to make it difficult for Cisco to buy Sun out given the low and crazy valuation set by the marketplace.

After IBM’s move, Cisco will have to beat IBM’s offer before they can buy out Sun. IBM’s plan is to make a dent in their foe’s balance sheet, so that they would be a less effective competitor. If the move deters likely suitors from the buyout altogether, then this would be a blessing for IBM.

My prediction is that a buyout will come from either Cisco, for the server expertise, or Oracle, for the service and software talent (anyone for OracleDB + ZFS + Oracle SPARC servers?). Both companies have good synergies with Sun and the resources to buy them out.

Is this different from the MS-Yahoo buyout?

Yes! Sun could afford to play hard to get because, unlike Yahoo, they have two suitors who are better than IBM. Yahoo only had MS, and to this day I don’t see how the two companies could function as a unit.

Disclaimer: The idea was born out of a discussion I had on /. I also don’t own JAVA, IBM, CSCO, or ORCL stock, but I am seriously considering JAVA should it drop further.

Tinyurl Paranoia

Am I being paranoid, or can tinyurl be used to hide links that people would otherwise recognize as leading to malicious content?