Character é encoding question
-
I'm connecting to an external webserver over which I have no control. Apparently it's hosted on Apache/Tomcat. When I send my request, I receive an error message in which the character é is represented as �.
I assume that they haven't properly encoded their message. Maybe it's a Unicode string that was translated into an ASCII code page that doesn't contain a glyph for é? Is there an easy way to transform this message back to the correct encoding so that � is translated back to é?
Etat HTTP 401 - Invalid software.
type Rapport d'�tat
message Invalid software.
description La requ�te n�cessite une authentification HTTP (Invalid software.).
Apache Tomcat/6.0.18
-
If you encode � as ISO 8859-1 and then decode as UTF-8, you will get the character U+FFFD, also known as REPLACEMENT CHARACTER, which is typically substituted for an invalid character. My psychic skills guess that the error message string was originally encoded as ISO 8859-1, then decoded as UTF-8 (unrecoverably b0rking it in the process), then encoded as UTF-8 and decoded as ISO 8859-1 again.
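Here is a minimal Python sketch of that guessed round trip. The sample sentence is taken from the error page. The exact sequence of steps on the server is an assumption, not something the response proves.

```python
original = "La requête nécessite une authentification"

# Step 1 (assumed): the text is stored as ISO 8859-1 bytes (é -> 0xE9, ê -> 0xEA).
latin1_bytes = original.encode("iso-8859-1")

# Step 2 (assumed): those bytes are misread as UTF-8. 0xE9 and 0xEA are not
# valid sequences here, so each one becomes U+FFFD. This is where the
# information is lost.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Step 3 (assumed): U+FFFD is written out as UTF-8 (bytes EF BF BD) and those
# three bytes are then displayed as ISO 8859-1.
garbled = misread.encode("utf-8").decode("iso-8859-1")
# garbled == "La requ�te n�cessite une authentification"

# Reversing step 3 only gets back to U+FFFD, never to é or ê.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
```

`recovered` equals `misread`: the last step is reversible, but the decode in step 2 is not.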
-
Looks to me like their error document is in one encoding but they actually send another in the content-type header. Did you take a look at what they send in the headers?
-
@Spectre said:
If you encode � as ISO 8859-1 and then decode as UTF-8, you will get the character U+FFFD, also known as REPLACEMENT CHARACTER, which is typically substituted for an invalid character. My psychic skills guess that the error message string was originally encoded as ISO 8859-1, then decoded as UTF-8 (unrecoverably b0rking it in the process), then encoded as UTF-8 and decoded as ISO 8859-1 again.
That makes sense. I noticed that both the é in "état" and the ê in "requête" have been replaced with the same �. This makes it indeed unrecoverable :-(
-
@PSWorx said:
Looks to me like their error document is in one encoding but they actually send another in the content-type header. Did you take a look at what they send in the headers?
The headers say UTF-8. Even the message of the HTTP Error 401 contains a wrong character:
RESPONSE: **************\n
HTTP/1.1 401 Non-Autoris�\r\n
Date: Thu, 19 Nov 2009 08:04:10 GMT\r\n
Server: Apache/2.2.8 (Ubuntu) mod_jk/1.2.25 mod_ssl/2.2.8 OpenSSL/0.9.8g\r\n
Content-Length: 1028\r\n
Connection: close\r\n
Content-Type: text/html;charset=utf-8\r\n
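For anyone who wants to check this on their own client, here is a rough Python sketch that reads the charset from the Content-Type header and decodes with it. The helper name `declared_charset` and the sample body bytes are assumptions for illustration. The bytes are what � would look like on the wire if the display is an ISO 8859-1 view of UTF-8 data.

```python
from email.message import Message


def declared_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


# Header value copied from the response above.
charset = declared_charset("text/html;charset=utf-8")

# Assumption: these are the bytes received for "Rapport d'�tat".
body = b"Rapport d'\xef\xbf\xbdtat"

# Decoding with the declared charset yields U+FFFD, not é.
as_declared = body.decode(charset)

# Decoding as ISO 8859-1 reproduces the three-character garbage from the log.
as_latin1 = body.decode("iso-8859-1")
```

If the bytes really are EF BF BD, they are valid UTF-8 and match the header, so the é was already replaced before the response left the server.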