I finally managed to make sense of string encoding in Python. It seems we have two types of strings: unicode strings, written as u'unicode string', and regular byte strings (str), which are the standard Python strings and which Python assumes to be ASCII whenever it has to convert them on its own.
The replace solution I suggested in an earlier post kept failing with ASCII decoder errors, which I couldn't explain at the time, given that I was encoding the strings as 'utf-8' like this:
encodedstring = rawstring.encode('utf-8','replace')
I even tried out the excellent chardet module, which guesses the encoding of a string, thinking that I needed to identify the correct encoding and use it instead of utf-8. But as it turns out, rawstring is a regular byte string, and calling encode on it makes Python first decode it to unicode with the default ASCII codec, which is where the decoder errors came from. Before I could properly encode to utf-8, I needed to make sure the string was a unicode Python string.
So I tried the following, but to no avail:
encodedstring = unicode(rawstring).encode('utf-8','replace')
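In hindsight, both attempts seem to fail for the same reason: before anything can be encoded, Python has to turn the byte string into unicode, and it does that hidden step with the ASCII codec, where the 'replace' argument never gets a say. Here is a minimal sketch of that idea, written in Python 3 syntax, where the split is explicit (bytes plays the role of the old str, and str the role of unicode). The sample bytes are made up for illustration.

```python
# Hypothetical sample: "Björn" as UTF-8 bytes, standing in for rawstring.
rawstring = b"Bj\xc3\xb6rn"

# Roughly what Python 2 did behind the scenes for rawstring.encode(...)
# and for unicode(rawstring): an implicit decode with the ASCII codec.
try:
    rawstring.decode("ascii")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Naming the real encoding on the decode step is what actually works.
text = rawstring.decode("utf-8", "replace")
encodedstring = text.encode("utf-8")
```

The error handler only applies to the step it is passed to, so a 'replace' on encode cannot rescue a failing decode.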
The solution to my problem lay in the codecs module. As it turns out, you need to convert your strings to unicode as early as possible in the lifetime of your program to avoid unexpected errors later on. For me, that was the moment I read the bytes from the log file I was trying to parse. What I did was:
import codecs
# log_path is a placeholder for the path to the log file
log_file = codecs.open(log_path, 'r', 'utf-8', 'replace')
for line in log_file:
    pass  # do some work on the unicode line, decoded from utf-8
line is now a unicode string, decoded from UTF-8. I was finally able to parse log messages with weird international characters and store them using the Django ORM.
Parsing XML using xml.etree was not as straightforward; I'll keep that discussion for another time.
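For anyone who wants to try the decode-at-the-boundary idea without a real repository log, here is a small self-contained sketch. It swaps in io.open (the same function as the built-in open in Python 3), which accepts the encoding and error handler as keyword arguments, instead of codecs.open. The file contents and names are invented for the example.

```python
import io
import os
import tempfile

# Hypothetical log content: one clean line and one line with a stray
# Latin-1 byte (\xf6) that is not valid UTF-8.
sample = b"Signed-off-by: Alice\nSigned-off-by: Bj\xf6rn\n"

# Write the raw bytes to a temporary file to stand in for the log file.
fd, log_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(sample)

lines = []
# Decode while reading; undecodable bytes become U+FFFD instead of raising.
with io.open(log_path, "r", encoding="utf-8", errors="replace") as log_file:
    for line in log_file:
        lines.append(line)

os.remove(log_path)
```

Every element of lines is already text, so nothing further down the pipeline has to guess at an encoding.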
When you analyse over 300K commit messages, changing problematic encodings gets a bit tiresome. This got me motivated to look for a solution and here it is:
obj.text = parsed_log_message.encode('utf-8','replace')
Saving obj will stop DjangoUnicodeDecodeError from squealing and replace problematic characters with a '?'. I guess this is part of the Zen of Python:
Errors should never pass silently.
Unless explicitly silenced.
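To see what the 'replace' handler does on the encode side, here is a small sketch with a made-up message that has already been decoded to text. As far as I can tell, the '?' substitution only kicks in when the target codec cannot represent a character, so the example uses ASCII to trigger it and UTF-8 for comparison.

```python
# Hypothetical commit-message fragment, already decoded to text.
parsed_log_message = u"Signed-off-by: Bj\xf6rn"

# Strict encoding to a codec that lacks the character raises an error...
try:
    parsed_log_message.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ...while 'replace' silently substitutes '?' for it.
replaced = parsed_log_message.encode("ascii", "replace")

# UTF-8 can represent the character, so nothing gets replaced there.
utf8_bytes = parsed_log_message.encode("utf-8", "replace")
```

The substitution is lossy: once the '?' is stored, the original character cannot be recovered from the database.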
I've been working today on data collection for my dissertation and wrote some Python scripts to parse the logs of a number of FLOSS repositories and store the data in a Django model to make querying the data easier. So I ran a script to collect the log messages for the year 2008, and everything seemed to be progressing fine. I could see the names of the projects my script was working on flying by on the screen, until it hit the Linux Kernel 2.6.
The activity on the project is absolutely enormous compared to the other projects in my data sample (which includes Wine and Django). The names that were flying up my screen simply stopped as if we hit a brick wall.
So I waited for a minute, and then I got a flat tire:
File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 77, in force_unicode
raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 705-708:
invalid data. You passed in '...Signed-off-by: Bj\xf6rn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: David S. Miller <email@example.com>\n\n' (<type 'str'>)
I believe everything is configured for UTF-8 encoding on my end, but I suspect this part of the string is problematic: Bj\xf6rn. Normally I would replace the character by hand, but since I'm dealing with 200+ projects and well over 100K commit messages, I don't think that would be a good option.
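A quick way to check a suspicious fragment like this one is to try decoding just those bytes. The sketch below is in Python 3 syntax; the Latin-1 guess is my assumption, based only on the fact that byte 0xF6 is 'ö' in that encoding and the name looks like Björn.

```python
# The name fragment quoted in the traceback, as raw bytes.
fragment = b"Bj\xf6rn"

# 0xF6 followed by plain ASCII letters is not a valid UTF-8 sequence.
try:
    fragment.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the commit was written in Latin-1, where 0xF6 is 'ö'.
as_latin1 = fragment.decode("latin-1")
```

If the guess holds, the commit message itself is fine and was simply written in a different encoding than the one it is being read with.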
I hope this doesn’t take very long to fix.
Update on the situation: it took me an hour of playing with encodings, only to concede in the end. I decided to simply modify the text. I don't think I have the time for this.