Unicode normalization in Python

gd_ganesh

Hi,

I am working on a web crawler in Python 2.5/2.6. The crawler comes across many kinds of data.

Whatever encoding I use, it fails after some time. I have tried "ASCII" and "Latin-1".

e.g. error :- UnicodeEncodeError: 'latin-1' codec can't encode characters in position 89-91: ordinal not in range(256)

Is there a universal solution for this in Python? Any suggestions are welcome.

Thanks & Regards,

Ganesh

belgariontheking

Why are you writing your own?

Perhaps you should look into Nutch and Lucene, a FOSS crawler and search library respectively.

Beyond that, I have nothing.

gd_ganesh

This is a research project and the requirements are a bit different.

We are crawling the Internet Archive by specific year, so the crawl needs to be restricted to that particular year, say 1998 or 1999. I guess year-specific functionality is not available in Nutch and Lucene.

I am managing my task by specifying a different character set each time I crawl, but a generic solution would be helpful.

Thanks!!!

arty

Perhaps you should be encoding to UTF-8 if you want to represent arbitrary Unicode characters?
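
A minimal sketch of the difference, assuming the text is already a unicode object; the sample string is made up and not taken from the crawl:

```python
# Hypothetical sample text; not taken from the crawl in question.
text = u"caf\u00e9 costs 3 \u20ac"   # U+20AC (euro sign) is outside Latin-1

try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    # Latin-1 only covers code points 0-255, so the euro sign cannot be encoded.
    latin1_ok = False

# UTF-8 covers all of Unicode, so the same text encodes without error.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Here latin1_ok ends up False, while the UTF-8 bytes decode back to the original text.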

gd_ganesh

I tried UTF-8 as well, but it also fails with the same error, "ordinal not in range".
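
One possible cause, which is only a guess since the failing code is not shown: in Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, and that step fails on any non-ASCII byte. A rough sketch that decodes the raw bytes explicitly before encoding is below. The helper name to_utf8, its declared_encoding parameter and the sample bytes are all hypothetical.

```python
def to_utf8(raw, declared_encoding="utf-8"):
    """Decode raw page bytes to text, then re-encode the text as UTF-8.

    `declared_encoding` is a hypothetical parameter, e.g. a charset taken
    from an HTTP header. Bytes that cannot be decoded are replaced with
    U+FFFD instead of raising an exception.
    """
    text = raw.decode(declared_encoding, "replace")
    return text.encode("utf-8")


# Made-up sample: "cafe" with an acute accent, stored as Latin-1 bytes.
raw_page = u"caf\u00e9".encode("latin-1")

converted = to_utf8(raw_page, "latin-1")   # correct charset given
fallback = to_utf8(raw_page)               # wrong charset: bad byte is replaced
```

With the "replace" error handler the crawl keeps going when a page declares the wrong charset, at the cost of losing the characters that could not be decoded.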