I’ve been working today on data collection for my dissertation and wrote some python scripts to parse the logs of a number of FLOSS repositories and store the data into a Django model to make querying the data easier. So I run a script to collect the log messages for the year 2008, and everything seems to be progressing fine. You can see the names of projects my script is working on flying by the screen, until it hit the Linux Kernel 2.6.
The activity on the project is absolutely enormous compared to the other projects in my data sample (which includes Wine and Django). The names that were flying up my screen simply stopped as if we hit a brick wall.
So I wait for a minute, then I got a flat tire:
File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 77, in force_unicode
raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 705-708:
invalid data. You passed in '...Signed-off-by: Bj\xf6rn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: David S. Miller <firstname.lastname@example.org>\n\n' (<type 'str'>)
I beleive everything is configured for UTF8 encoding on my end, but I suspect this part of the string is problematic Bj\xf6rn. I normally would replace the character, but since im dealing with 200+ projects and well over 100K commit messages, I don’t think this would be a good option.
I hope this doesn’t take very long to fix.
Update on situation: Took me an hour of playing with encoding only to have conceded in the end. I decided to simply modify the text. I don’t think I have the time for this.