Archive for March, 2009

Dissertation Donkey Work Has Started!

It’s a boring part of my dissertation, but it must be done. After trying to automate the parsing of patch contributors from RCS logs and deciding the approach wasn’t very reliable, I decided to use Django to display the log messages along with the suspected contributor names. I sift through these messages and approve them ten at a time.

Thank god Linux Kernel development is done using git, which stores the author’s name, so I don’t have to parse the log messages. This saved me from sifting through 50K log messages. I will check Wine too, to see whether git captures the author information there as well. But even with this, I’ve got 90K log messages to go through.

To the Postgres committers out there, I appreciate what you’re doing, but let me say this: I hate you for making my life a living hell. Would it hurt to put “patch by” in front of the author’s name in the log message?
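To give an idea of what the automated first pass might look like, here is a minimal sketch. The regular expression, the helper names and the sample messages are all assumptions made up for illustration, not the script I actually used; real log messages credit people in far less regular ways, which is exactly why the manual review is needed.

```python
import re

# Hypothetical pattern: "Patch by NAME" or "patch from NAME", where NAME is
# one or more capitalized words. Real logs are far messier than this.
CREDIT_PATTERN = re.compile(
    r"[Pp]atch (?:by|from) ([A-Z][\w'-]*(?: [A-Z][\w'-]*)*)"
)


def suspected_contributors(message):
    """Return the names credited in one log message (possibly empty)."""
    return CREDIT_PATTERN.findall(message)


def batches(items, size=10):
    """Yield successive lists of at most `size` items for manual review."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up sample messages, not taken from any real repository.
sample_logs = [
    "Fix crash on empty input. Patch by Jane Doe, reviewed by me.",
    "Update documentation.",
]
for message in sample_logs:
    print(suspected_contributors(message))
```

The batches helper mirrors the ten-at-a-time review: each batch would be rendered on one page for approval.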

Django Forms Gotcha

Well, I just wasted 4 good hours of my time thinking I was facing a bug. I was setting a prefix on a form while also trying to set the initial value for a field. Only the first instance of the instantiated form would include the initial value. This code illustrates my problem:

    from django import forms

    class MyForm(forms.Form):
        names = forms.CharField(required=False)
        id = forms.CharField(widget=forms.HiddenInput,required=False)


    print([MyForm({'id': y}, prefix=y).as_p() for y in range(2)])

    # Output:
    #
    # [u'<p><label for="id_names">Names:</label> <input type="text" name="names" id="id_names" /><input type="hidden" name="id" value="0" id="id_id" /></p>',
    #  u'<p><label for="id_1-names">Names:</label> <input type="text" name="1-names" id="id_1-names" /><input type="hidden" name="1-id" id="id_1-id" /></p>']
    #
    # Notice how the 2nd form instance doesn't have value="1" for the hidden input field.

The solution was simple. Thanks to the guys at #django on freenode, all I needed to do was instantiate my form as such:

    MyForm(initial={'id':y})

That fixed it for me. I was told that when using data=, each data key has to match the field name including the prefix (for example 1-id rather than id). With the initial keyword, the keys are plain field names and the form applies the prefix for you.
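To make the key matching concrete, here is a small sketch in plain Python that needs no Django install. The two helpers are hypothetical; add_prefix only imitates, as far as I understand it, the naming rule a prefixed form uses for its fields: prefix, a dash, then the field name, or just the field name when the prefix is empty.

```python
def add_prefix(prefix, field_name):
    """Imitate how a prefixed form names a field in submitted data."""
    # A falsy prefix (None, '', 0) means no prefix is applied at all.
    return '%s-%s' % (prefix, field_name) if prefix else field_name


def prefixed_data(prefix, values):
    """Rewrite plain field names into the keys a prefixed form looks up."""
    return dict((add_prefix(prefix, name), value)
                for name, value in values.items())


print(prefixed_data(0, {'id': 0}))  # {'id': 0}
print(prefixed_data(1, {'id': 1}))  # {'1-id': 1}
```

If this rule is right, it would also explain why the first form in my example worked: its prefix was 0, which is falsy, so the plain id key happened to match.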

Making Sense of Unicode in Python

I finally managed to make sense of string encoding in Python. It seems we have two types of strings: unicode strings, written as u'unicode string', and regular byte strings (str), which are the standard Python strings and which Python treats as ascii whenever it has to convert them implicitly.

The replace solution which I suggested in an earlier post kept failing with ascii decoder errors, which I couldn’t explain at that time given that I was encoding the strings using ‘utf-8’ as such: encodedstring = rawstring.encode('utf-8','replace')

I even tried out the excellent chardet module, which predicts what encoding a string uses, thinking that I needed to identify the correct encoding and use it instead of utf-8. But as it turns out, rawstring is a plain Python byte string, and calling encode on it makes Python decode it as ascii first, which is where the decoder errors came from. Before I could properly encode to utf-8, I needed to ensure that the string was a unicode Python string.

So I tried the following, but to no avail, since unicode() also assumes ascii unless told otherwise: encodedstring = unicode(rawstring).encode('utf-8','replace')

The solution to my problem lay with the codecs module. As it turns out, you need to convert your strings to unicode as early as possible in the lifetime of your program to avoid unexpected behavior later on. For me, that was the moment I read the byte streams from the log file I was trying to parse. What I did was:

    import codecs

    # log_path stands in for the path of the log file being parsed
    log_file = codecs.open(log_path, 'r', 'utf-8', 'replace')
    for line in log_file:
        pass  # do some work on the unicode line, decoded from utf-8

line is now a unicode string decoded from utf-8. I was finally able to parse log messages with weird international characters and store them using the Django ORM.
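Here is a self-contained sketch of the decode-early idea that can be run without any real log file. The read_log_lines helper and the two sample lines are made up for illustration; the only library calls involved are codecs.open and the standard tempfile and os helpers.

```python
import codecs
import os
import tempfile


def read_log_lines(path, encoding='utf-8'):
    """Return the file's lines as text; undecodable bytes become U+FFFD."""
    with codecs.open(path, 'r', encoding=encoding, errors='replace') as handle:
        return [line.rstrip('\n') for line in handle]


# Throwaway demo file: one valid utf-8 line and one Latin-1 line.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as raw:
    raw.write(b'Author: J\xc3\xbcrgen\n')  # valid utf-8
    raw.write(b'Author: J\xfcrgen\n')      # Latin-1, invalid as utf-8
demo_lines = read_log_lines(demo_path)
os.remove(demo_path)

print(demo_lines)
```

Every line that comes back is already text, so the rest of the program never has to guess whether it is holding bytes or unicode.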

YAY!

Parsing xml using xml.etree was not as straightforward. I’ll keep that discussion for another time.

Solution to my Encoding Problem

When you analyse over 300K commit messages, changing problematic encodings gets a bit tiresome. This got me motivated to look for a solution and here it is:

obj.text = parsed_log_message.encode('utf-8','replace')

Saving obj will stop DjangoUnicodeDecodeError from squealing and replace problematic characters with a ‘?’. I guess this is part of the zen of Python:

Errors should never pass silently. Unless explicitly silenced.
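For the record, here is a small sketch of what the 'replace' error handler does in each direction. The sample name is the one from the kernel log; everything else is just the built-in encode and decode methods. Note that encoding a proper unicode string to utf-8 has nothing to replace, so the ‘?’ only shows up when the target encoding cannot represent a character.

```python
text = u'Bj\xf6rn'  # a unicode string containing o-umlaut

# Encoding: characters the target encoding lacks become '?'.
ascii_bytes = text.encode('ascii', 'replace')  # b'Bj?rn'

# utf-8 can represent every character here, so nothing is replaced.
utf8_bytes = text.encode('utf-8', 'replace')   # b'Bj\xc3\xb6rn'

# Decoding: bytes that are invalid in the source encoding become U+FFFD.
raw = b'Bj\xf6rn'  # Latin-1 bytes, not valid utf-8
decoded = raw.decode('utf-8', 'replace')       # u'Bj\ufffdrn'

print(repr(ascii_bytes))
print(repr(utf8_bytes))
print(repr(decoded))
```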

Django Evolution Gotcha

Django evolution is the closest thing to a steroid when it comes to enhancing the productivity of a Django developer working with RDBMSs. So close, it’s even got some nasty side effects if you rely on it too much.

I made the mistake of doing ./manage.py reset app after changing a model structure for an app that was being tracked by django-evolution. Almost all ./manage.py commands gave nasty errors whenever I tried to use them afterwards. So I removed the django-evolution app from my installed app list, and got things working again.

Two days passed, and I fell into withdrawal from this drug known as django evolution. I had to have it again. I installed it again after doing a ./manage.py flush, and the db related management commands refused to work. Then it hit me. All I had to do was:

./manage.py reset django_evolution

Things then got back to normal. I believe what happened is django evolution got out of sync with my db state after the reset that I did. This solution surely fixed it, but I lost all the evolution history of my database. If anyone out there knows a better way to fix this problem, and still maintain previous history, I would be thankful.

Encoding Blues

I’ve been working today on data collection for my dissertation and wrote some Python scripts to parse the logs of a number of FLOSS repositories and store the data in a Django model to make querying it easier. So I ran a script to collect the log messages for the year 2008, and everything seemed to be progressing fine. I could see the names of the projects my script was working on flying by the screen, until it hit the Linux Kernel 2.6.

The activity on the project is absolutely enormous compared to the other projects in my data sample (which includes Wine and Django). The names that were flying up my screen simply stopped as if we hit a brick wall.

So I waited for a minute, then I got a flat tire:

      File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 77, in force_unicode
        raise DjangoUnicodeDecodeError(s, *e.args)
    django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 705-708: invalid data. You passed in '...Signed-off-by: Bj\xf6rn Steinbrink <B.Steinbrink@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>\n\n' (<type 'str'>)

I believe everything is configured for UTF8 encoding on my end, but I suspect this part of the string is the problem: Bj\xf6rn. I normally would replace the character, but since I’m dealing with 200+ projects and well over 100K commit messages, I don’t think this would be a good option.

I hope this doesn’t take very long to fix.

Update on situation: It took me an hour of playing with encodings, only to concede in the end. I decided to simply modify the text. I don’t think I have the time for this.
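In hindsight, the byte \xf6 is how Latin-1 stores the ö in Björn, and it is not valid utf-8 on its own. A minimal sketch of one possible workaround, which is not what I did at the time, is to try utf-8 first and fall back to Latin-1. The decode_with_fallback helper is hypothetical, and the fallback rests on the assumption that stray non-utf-8 bytes in these logs are Latin-1.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Decode bytes with the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list of encodings all fail.
    return raw.decode('utf-8', 'replace')


latin1_name = decode_with_fallback(b'Bj\xf6rn Steinbrink')    # Latin-1 bytes
utf8_name = decode_with_fallback(b'Bj\xc3\xb6rn Steinbrink')  # utf-8 bytes

print(latin1_name == utf8_name)  # True
```

Latin-1 accepts any byte sequence, so the fallback never raises, but it can silently produce the wrong characters if the data was really in some other encoding.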

Dumping PHP in favor of Python and Django — Part 1

We have been using PHP at koutbo6.com since we started in 2000. In 2004 we started looking for a CMS that would make our lives easier. Our choice was Joomla!, which covered our needs pretty well at the time, and we deployed it in 2005. After a couple of years of using it, we had run into many strange behaviors in the CMS, as well as poor performance when it came to speed. The situation was annoying enough that we started looking for an alternative. By the end of 2006 we began researching and comparing CMSs and frameworks, which took a couple of months; the comparison included TYPO3, Drupal, Joomla 1.5, Zope and Plone. In 2007 we were about to choose between Drupal and Plone. I don’t remember how we discovered the Django framework (Thank God!), but we were very interested in its features and kept tracking the project until the end of 2007.

It is 4:12 am and I don’t feel like continuing my post now. I will probably continue it when I feel like doing so.

First Blogging Attempt

Hello world!

Hopefully this will speed up my dissertation writing process. I'll try to post news about my progress and what I am doing every now and then. Hopefully, someone will find it interesting. More importantly, if my adviser is reading this, I hope you find this a valuable tool for our communication.

Setting up ByteFlow was a breeze, and let me tell the world, Django and Python are amazing. No more Java crap for me.