I finally managed to make sense of string encoding in Python. It turns out there are two string types: unicode strings, written as u'unicode string', and regular byte strings (the standard Python str type), which are only safe for ascii data.
The replace solution I suggested in an earlier post kept failing with ascii decode errors, which I couldn't explain at the time, since I was encoding the strings with 'utf-8' like this:
encodedstring = rawstring.encode('utf-8','replace')
I even tried out the excellent chardet module, which guesses the encoding of a string, thinking I needed to detect the correct encoding and use it instead of utf-8. But as it turns out, rawstring is an ascii-based Python byte string, and calling encode() on it makes Python first decode it with the default ascii codec, which is exactly where the error comes from. Before I could properly encode to utf-8 I needed to make sure the string was a unicode string.
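The implicit ascii decode that fails can be shown explicitly. A minimal sketch, with Python 3 bytes standing in for Python 2 byte strings, and a made-up sample value:

```python
# A byte string holding the UTF-8 bytes for "café"
raw = b'caf\xc3\xa9'

# Decoding it as ascii fails on the first non-ascii byte --
# this is the step Python 2 performs implicitly when you call
# .encode() on a byte string, hence the confusing decode error
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('decode failed:', exc)

# Decoding with the right codec yields a proper unicode string
text = raw.decode('utf-8')
print(text)  # café
```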
So I tried the following, but to no avail:
encodedstring = unicode(rawstring).encode('utf-8','replace')
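The reason this fails is that unicode() with no encoding argument also falls back to the ascii codec. Passing the codec explicitly does work; a minimal sketch, again using Python 3 bytes in place of Python 2 byte strings, with a made-up raw value:

```python
# UTF-8 bytes for "café" followed by one invalid byte
raw = b'caf\xc3\xa9 \xff'

# Decode explicitly with utf-8; 'replace' swaps undecodable
# bytes for U+FFFD instead of raising an error
# (in Python 2 this would be rawstring.decode('utf-8', 'replace'))
text = raw.decode('utf-8', 'replace')
print(text)
```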
The solution to my problem lay with the codecs module. As it turns out, you should convert your strings to unicode as early as possible in the lifetime of your program to avoid surprises. For me, that was the moment I read the byte stream from the log file I was trying to parse. What I did was:
file = codecs.open('my.log', 'r', encoding='utf-8', errors='replace')  # 'my.log' is the log file's path
for line in file:
    # do some work on the decoded unicode line
line is now a unicode string, decoded from the file's utf-8 bytes. I was finally able to parse log messages with weird international characters and store them using the Django ORM.
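Here is a self-contained sketch of the whole flow; the file name and log contents are made up for illustration:

```python
import codecs

# Write some UTF-8 log data, including one invalid byte, to a sample file
with open('sample.log', 'wb') as f:
    f.write(b'user caf\xc3\xa9 logged in\nbad byte: \xff\n')

# codecs.open decodes each line to unicode as it is read;
# 'replace' turns undecodable bytes into U+FFFD instead of raising
log = codecs.open('sample.log', 'r', encoding='utf-8', errors='replace')
for line in log:
    print(repr(line))
log.close()
```

Doing the decode at the point of reading means the rest of the program only ever sees unicode strings, which is what keeps the ascii decode errors from popping up later in unrelated code.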
Parsing XML using xml.etree was not as straightforward; I'll keep that discussion for another time.