Unicode normalization in Python



  •  Hi,

     I am working on a web crawler in Python 2.5/2.6. The crawler comes across many kinds of data.

    Whatever encoding I use, it fails after some time. I tried "ASCII" and "Latin-1".

    e.g. error: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 89-91: ordinal not in range(256)

    Is there any universal solution for this in Python? Any suggestions are welcome.

    Thanks & Regards,

    Ganesh 



  • Why are you writing your own? 

    Perhaps you should look into Nutch, a FOSS crawler, and Lucene, the FOSS search library it builds on.

    Beyond that, I have nothing.



  • This is a research project and the requirements are a bit different.

    We are crawling the Internet Archive by specific year, so the crawl needs to be restricted to that particular year, say 1998 or 1999. I guess year-specific functionality is not available in Nutch and Lucene.

    I am managing my task by giving a different character set each time I crawl, but a generic solution would be helpful.

    Thanks!!!



  • Perhaps you should be encoding to UTF-8 if you want to represent arbitrary unicode characters?
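
To illustrate the suggestion above: Latin-1 can only represent code points below 256, while UTF-8 can encode any Unicode character. A minimal sketch, assuming the text is already a unicode string (the sample string and variable names are made up for illustration; the `u""` prefix works in Python 2.x and in 3.3+):

```python
# Hypothetical sample text containing a character outside Latin-1 (the euro sign).
text = u"price: 10 \u20ac"

# Latin-1 cannot represent U+20AC, so encoding raises UnicodeEncodeError.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode code point, so this succeeds
# and decodes back to the original text.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

This only covers the encoding step. It assumes the page bytes were decoded correctly before that point.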



  •  I tried UTF-8 as well, but it also fails with the same kind of error, "ordinal not in range".
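
One possible cause, assuming the crawler passes raw page bytes around: in Python 2, calling `.encode("utf-8")` on a byte `str` first decodes it implicitly with the ASCII codec, and that step fails on any non-ASCII byte with a similar "ordinal not in range" message. A sketch of decoding explicitly before encoding follows. The helper names `to_text` and `to_utf8` are hypothetical, the list of candidate encodings is an assumption rather than a recommendation for this crawl, and the `bytes` name needs Python 2.6 or later:

```python
def to_text(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that works.

    `candidates` is an assumed, illustrative list; a real crawler would
    normally prefer the charset declared by the page or HTTP headers.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this last resort never
    # raises, although the result may not be the intended characters.
    return raw.decode("latin-1")


def to_utf8(raw):
    """Return UTF-8 encoded bytes for either raw bytes or text input."""
    return to_text(raw).encode("utf-8")
```

For example, `to_utf8(b"caf\xe9")` is not valid UTF-8, so it falls through to cp1252 and is re-encoded as `b"caf\xc3\xa9"`.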

