Unicode normalization in Python



  •  Hi,

    I am working on a web crawler in Python 2.5/2.6. The crawler comes across numerous kinds of data.

    Whatever encoding I use, it fails after some time. I tried "ASCII" and "Latin-1".

    For example: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 89-91: ordinal not in range(256)

    Is there any universal solution for this in Python? Any suggestions are welcome.

    Thanks & Regards,

    Ganesh 



  • Why are you writing your own? 

    Perhaps you should look into Nutch and Lucene, which are FOSS crawlers.

    Beyond that, I have nothing.



  • This is a research project and the requirements are a bit different.

    We are crawling the Internet Archive by specific year, so the crawl needs to be restricted to that particular year, say 1998 or 1999. I guess year-specific functionality is not there in Nutch and Lucene.

    I am managing my task by giving a different character set each time I crawl, but a generic solution would be helpful.

    Thanks!!!



  • Perhaps you should be encoding to UTF-8 if you want to represent arbitrary Unicode characters?
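
A minimal sketch of that suggestion, assuming the crawler already holds the page as decoded Unicode text (the `unicode` type on Python 2.x). The sample string and variable names are made up for illustration:

```python
# Sample text with characters outside Latin-1 (hypothetical data).
text = u"caf\u00e9 \u4e2d\u6587"

# Latin-1 only covers code points 0-255, so the CJK characters fail.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent every Unicode code point in this string.
utf8_bytes = text.encode("utf-8")
```

This only helps when the value being encoded is Unicode text to begin with; raw bytes fetched from a page have to be decoded first.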



  • I tried UTF-8 as well, but it also fails with the same error, "ordinal out of range".
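
The thread does not show the failing code, so the cause here is unconfirmed. One common generic approach, sketched below under the assumption that the pages arrive as raw bytes in an unknown charset, is to decode to Unicode first by trying a few candidate encodings, and only then encode to UTF-8. The helper name `to_unicode` and the candidate list are hypothetical, and the `b""` literals need Python 2.6 or later:

```python
def to_unicode(raw, candidates=("utf-8", "cp1252")):
    """Decode raw bytes by trying each candidate encoding in order."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte value to a code point, so this cannot fail.
    return raw.decode("latin-1")


# Hypothetical page fragments in different encodings.
samples = [b"caf\xc3\xa9", b"caf\xe9", b"\x81"]
encoded = [to_unicode(s).encode("utf-8") for s in samples]
```

The last-resort Latin-1 decode never raises but may produce the wrong characters, so it trades correctness for robustness.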

