Unicode normalization in Python
I am working on a web crawler in Python 2.5/2.6. The crawler comes across many kinds of data.
Whatever encoding I use, it fails after some time. I tried "ASCII" and "Latin-1".
For example, this error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 89-91: ordinal not in range(256)
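An error of this kind can be reproduced in a few lines. This is only a minimal sketch: the sample string and the helper name `can_encode` are made up for illustration, and the only assumption is that the crawled text contains a character with no Latin-1 representation.

```python
# Hypothetical sample text: the euro sign (U+20AC) lies outside Latin-1,
# which only covers code points 0-255.
text = u"price: \u20ac5"

def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

ascii_ok = can_encode(text, "ascii")      # fails: ordinal not in range(128)
latin1_ok = can_encode(text, "latin-1")   # fails: ordinal not in range(256)
utf8_ok = can_encode(text, "utf-8")       # succeeds
```

ASCII and Latin-1 are fixed single-byte character sets, so either one will eventually fail on a crawl that sees arbitrary pages.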
Is there a universal solution for this in Python? Any suggestions are welcome.
Thanks & Regards,
Why are you writing your own?
Perhaps you should look into Nutch and Lucene, a FOSS crawler and the search library it builds on.
Beyond that, I have nothing.
This is a research project and the requirements are a bit different.
We are crawling the Internet Archive by specific year, so the crawl needs to be restricted to that particular year, say 1998 or 1999. I guess year-specific functionality is not available in Nutch and Lucene.
I am managing my task by giving a different character set each time I crawl, but a generic solution would be helpful.
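Trying a different character set on each crawl can be automated on a per-page basis. The following is a rough sketch, not a definitive solution: the function name `decode_bytes` and the candidate list are assumptions chosen for illustration, and a real crawler would normally try the charset declared in the HTTP headers or the page itself first.

```python
def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate charset that succeeds.

    Returns a (text, encoding_used) pair. The candidate list is an
    assumption for illustration, not a recommendation.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte value to a code point, so this last resort
    # never raises, although the result may not be the intended text.
    return raw.decode("latin-1"), "latin-1"

page_text, used = decode_bytes(b"caf\xe9")  # not valid UTF-8, so cp1252 is used
```

The point of this design is that decoding happens once, at the boundary where bytes enter the program, and everything after that works on Unicode strings.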
Perhaps you should be encoding to UTF-8 if you want to represent arbitrary unicode characters?
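As a small illustration of that suggestion, the sketch below normalizes a Unicode string to NFC with the standard `unicodedata` module and then encodes it as UTF-8. The helper name `to_utf8` and the sample string are assumptions for illustration only.

```python
import unicodedata

def to_utf8(text):
    """Normalize a Unicode string to NFC, then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

# Hypothetical sample: "e" + combining acute accent, a euro sign, a CJK character.
sample = u"e\u0301 \u20ac \u4e2d"
encoded = to_utf8(sample)
decoded = encoded.decode("utf-8")  # round-trips to the NFC form of sample
```

Normalizing first means that text which looks identical (a precomposed "é" versus "e" plus a combining accent) ends up as the same byte sequence, which helps when comparing or deduplicating crawled pages.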
I tried UTF-8 as well, but it also fails with the same kind of error, "ordinal not in range".
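One possible cause, which nothing in this thread confirms, is calling .encode('utf-8') on a byte string rather than a Unicode string. In Python 2 that triggers an implicit ASCII decode first, which fails with "ordinal not in range(128)" on any non-ASCII byte. A hedged sketch of a guard against this follows; the helper names `ensure_text` and `to_utf8_bytes` are hypothetical, and the code assumes Python 2.6 or later, where `bytes` and `b"..."` literals exist.

```python
def ensure_text(value, source_encoding="utf-8"):
    """Return value as a Unicode string, decoding it first if it is bytes."""
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2.6+
        # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
        return value.decode(source_encoding, "replace")
    return value

def to_utf8_bytes(value, source_encoding="utf-8"):
    """Encode to UTF-8, decoding explicitly first so no implicit ASCII step occurs."""
    return ensure_text(value, source_encoding).encode("utf-8")

out = to_utf8_bytes(b"caf\xe9", "latin-1")  # Latin-1 input re-encoded as UTF-8
```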