Character é encoding question



  • I'm connecting to an external webserver over which I have no control. Apparently it's hosted on Apache/Tomcat. When sending my request, I receive an error message in which the character é is represented as �

    I assume that they haven't properly encoded their message. Maybe it's a Unicode string that was converted to an ASCII code page that doesn't contain a glyph for é? Is there an easy way to transform this message back to the correct encoding so that � is translated back to é?

     

    Etat HTTP 401 - Invalid software.




    type Rapport d'�tat


    message Invalid software.


    description La requ�te n�cessite une authentification HTTP (Invalid
    software.).




    Apache Tomcat/6.0.18



  • If you encode ï¿½ as ISO 8859-1 and then decode as UTF-8, you will get the character U+FFFD (�), also known as REPLACEMENT CHARACTER, which is typically substituted for an invalid character. My psychic skills guess that the error message string was originally encoded as ISO 8859-1, then decoded as UTF-8 (unrecoverably b0rking it in the process), then encoded as UTF-8 and decoded as ISO 8859-1 again.
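
    The chain guessed at above can be reproduced in a few lines of Python. This is only a sketch of one plausible sequence of mis-conversions: the sample string is taken from the error page, the variable names are made up for illustration, and nothing here is confirmed about what the server actually does.

```python
# Sketch of the suspected corruption chain (an assumption, not confirmed behaviour).
original = "Rapport d'état"

# Step 1: the text exists as ISO 8859-1 bytes; é is the single byte 0xE9.
latin1_bytes = original.encode("iso-8859-1")

# Step 2: those bytes are wrongly decoded as UTF-8. 0xE9 followed by "t" is not
# a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
damaged = latin1_bytes.decode("utf-8", errors="replace")

# Step 3: the damaged text is sent as UTF-8; U+FFFD becomes the bytes EF BF BD.
wire_bytes = damaged.encode("utf-8")

# Step 4: a client that reads those bytes as ISO 8859-1 sees three characters.
seen_as_latin1 = wire_bytes.decode("iso-8859-1")

print(repr(damaged))
print(wire_bytes)
print(seen_as_latin1)
```

    The original byte 0xE9 is gone after step 2, so nothing done in steps 3 and 4 (or by the client afterwards) can bring it back.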



  • Looks to me like their error document is in one encoding but they actually send another in the content-type header. Did you take a look at what they send in the headers?



  •  @Spectre said:

    If you encode ï¿½ as ISO 8859-1 and then decode as UTF-8, you will get the character U+FFFD (�), also known as REPLACEMENT CHARACTER, which is typically substituted for an invalid character. My psychic skills guess that the error message string was originally encoded as ISO 8859-1, then decoded as UTF-8 (unrecoverably b0rking it in the process), then encoded as UTF-8 and decoded as ISO 8859-1 again.
    That makes sense. I noticed that both the é in "état" and the ê in "requête" have been replaced with the same �, which indeed makes it unrecoverable :-(
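
    To see why no reverse mapping can exist, here is a minimal sketch, assuming the damage comes from ISO 8859-1 bytes being decoded as UTF-8 with replacement. The helper name `misdecode` is hypothetical.

```python
def misdecode(text: str) -> str:
    """Encode as ISO 8859-1, then wrongly decode as UTF-8 with replacement."""
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")


# Two different accented letters end up as the same replacement character,
# so the output no longer says which letter was there.
collapsed = {misdecode(ch) for ch in ("é", "ê", "è", "à")}

print(misdecode("état"))
print(misdecode("requête"))
print(len(collapsed))
```

    Because several inputs map to one output, the only way back is guessing from context (for example, knowing the French words involved), not re-encoding.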



  • @PSWorx said:

    Looks to me like their error document is in one encoding but they actually send another in the content-type header. Did you take a look at what they send in the headers?

    The headers say UTF-8. Even the reason phrase in the HTTP 401 status line contains a wrong character:

    RESPONSE: **************\n
    HTTP/1.1 401 Non-Autoris�\r\n
    Date: Thu, 19 Nov 2009 08:04:10 GMT\r\n
    Server: Apache/2.2.8 (Ubuntu) mod_jk/1.2.25 mod_ssl/2.2.8 OpenSSL/0.9.8g\r\n
    Content-Length: 1028\r\n
    Connection: close\r\n
    Content-Type: text/html;charset=utf-8\r\n
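
    One way to check whether the damage happened before the response left the server is to look at the raw bytes rather than the rendered text. The sketch below reads the declared charset from the headers shown above and tests whether the bytes already contain an encoded U+FFFD. The `status_line` bytes are hypothetical (the capture above only shows rendered text), and `already_damaged` is a made-up helper name.

```python
from email.parser import Parser

# Headers copied from the captured response; the status line is handled separately.
raw_headers = (
    "Date: Thu, 19 Nov 2009 08:04:10 GMT\r\n"
    "Server: Apache/2.2.8 (Ubuntu) mod_jk/1.2.25 mod_ssl/2.2.8 OpenSSL/0.9.8g\r\n"
    "Content-Length: 1028\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html;charset=utf-8\r\n"
    "\r\n"
)

# The stdlib email parser understands HTTP-style header blocks well enough
# to extract the charset parameter of Content-Type.
declared = Parser().parsestr(raw_headers, headersonly=True).get_content_charset()

REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")  # the bytes EF BF BD


def already_damaged(raw: bytes) -> bool:
    """Return True if the raw bytes already contain a UTF-8 encoded U+FFFD."""
    return REPLACEMENT_UTF8 in raw


# Hypothetical raw bytes, assuming the server sent an encoded U+FFFD.
status_line = b"HTTP/1.1 401 Non-Autoris\xef\xbf\xbd\r\n"

print(declared)
print(already_damaged(status_line))
```

    If the check is true for the real bytes on the wire, the declared charset is honest and the text was already broken on the server side, so no client-side re-decoding will restore the é.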

