MSIE, PHP and UTF-8



  • Representative lines from an internal PHP application inherited from another developer.

    After several screens assigning constant strings (encoded in latin-1) to variables:

    if (eregi("MSIE", strtoupper($_SERVER['HTTP_USER_AGENT']))) {
        /** Constants without UTF8 **/
        define("CONTROLLER_NO_ACTION_MSG", $controllerNoActionMsg);

        [...repeat for each of the variables...]

    } else {
        /** Constants w/ UTF8 **/
        define("CONTROLLER_NO_ACTION_MSG", utf8_encode($controllerNoActionMsg));

        [...repeat for each of the variables...]


  • So they do a case insensitive string compare on a variable they pass through to strtoupper. Joy.
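
    For reference, a minimal sketch of the same check without the redundancy. stripos() is already case-insensitive, so the strtoupper() call adds nothing, and eregi() was removed in PHP 7 anyway. The helper name userAgentMentionsMsie is made up for illustration and is not from the original application.

```php
<?php
// Hypothetical helper name; not part of the original application.
function userAgentMentionsMsie(string $userAgent): bool
{
    // stripos() ignores case on its own, so no strtoupper() is needed.
    // It returns int|false, hence the strict comparison against false.
    return stripos($userAgent, 'MSIE') !== false;
}

// The header may be absent (e.g. on the command line), so default to ''.
$isMsie = userAgentMentionsMsie($_SERVER['HTTP_USER_AGENT'] ?? '');
```

    Whether the browser should be sniffed at all is a separate question.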



  • @Lingerance said:

    So they do a case insensitive string compare on a variable they pass through to strtoupper. Joy.

    That, and they assume that Internet Explorer can't handle UTF-8 (which it most definitely can).



    More likely their web server is sending the header "Content-Type: text/html; charset=iso-8859-1" while the page has a META http-equiv "Content-Type" tag specifying "text/html; charset=utf-8" (or vice versa). I'm not sure which browsers behave which way, but some let the META tag win while others won't let it override the HTTP header (and MSIE and Firefox are divided over this behavior).
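
    If a header/META mismatch is the real culprit, one way to rule it out is to derive both declarations from a single constant and convert the latin-1 strings for every browser. This is only a sketch: PAGE_CHARSET, latin1ToUtf8, contentTypeHeader and contentTypeMeta are invented names, the message text is a placeholder, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical names throughout; a sketch, not the application's real code.
const PAGE_CHARSET = 'UTF-8';

// Convert a latin-1 (ISO-8859-1) string to UTF-8. Assumes ext/mbstring.
function latin1ToUtf8(string $text): string
{
    return mb_convert_encoding($text, PAGE_CHARSET, 'ISO-8859-1');
}

// The HTTP header line and the META tag are built from the same constant,
// so the two declarations cannot disagree.
function contentTypeHeader(): string
{
    return 'Content-Type: text/html; charset=' . PAGE_CHARSET;
}

function contentTypeMeta(): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . PAGE_CHARSET . '">';
}

// Placeholder message stored as latin-1 bytes (0xF3 is "o" with an acute accent).
$controllerNoActionMsg = "Acci\xF3n no encontrada";

// One conversion for every browser; no user-agent branch.
define('CONTROLLER_NO_ACTION_MSG', latin1ToUtf8($controllerNoActionMsg));

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

    The page template would then echo contentTypeMeta() inside <head>.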



  • According to the HTML spec, the HTTP header should win if it conflicts with <meta> - and I know IE does it like this (at least IE6SP2 does - can't remember what it did before then)

    To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
    3. The charset attribute set on an element that designates an external resource.

     http://www.w3.org/TR/html401/charset.html#idx-character_encoding-6
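
    The priority list above boils down to "first declared value wins". A rough sketch, assuming each level's charset has already been extracted and is passed in as a string or null; resolveCharset is a made-up name, and real browsers add sniffing heuristics that this ignores.

```php
<?php
// Hypothetical function; each argument is the charset found at that level,
// or null when that level declares nothing.
function resolveCharset(
    ?string $httpCharset,   // 1. charset parameter of the HTTP Content-Type field
    ?string $metaCharset,   // 2. META http-equiv="Content-Type" declaration
    ?string $linkCharset    // 3. charset attribute on the referencing element
): ?string {
    // The null coalescing operator returns the first non-null operand,
    // which matches the highest-to-lowest priority order.
    return $httpCharset ?? $metaCharset ?? $linkCharset;
}
```

    With a header of iso-8859-1 and a META tag of utf-8, resolveCharset('iso-8859-1', 'utf-8', null) returns 'iso-8859-1', which is the conflict described earlier in the thread.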


  • @Pidgeot said:

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
    3. The charset attribute set on an element that designates an external resource.

    In this case, the spec seems a little backwards. Isn't the HTTP-EQUIV method most often used when the author can't control the server's HTTP headers, e.g. in a static HTML file on a shared server? The spec's ordering would prevent her from overriding the server's default.

    (Wouldn't it be neat if the web server could parse out the meta tag and update its header accordingly? Yes, I know, the server would have to be sure the document is actually HTML and buffer the output to be sure. Still.)



  • @joe.edwards said:

    (Wouldn't it be neat if the web server could parse out the meta tag and update its header accordingly? Yes, I know, the server would have to be sure the document is actually HTML and buffer the output to be sure. Still.)

    Have you ever seen the "Simpsons already did it" South Park episode?  Perl is kind of like that. 



  • @joe.edwards said:

    In this case, the spec seems a little backwards. Isn't the HTTP-EQUIV method most often used when the author can't control the server's HTTP headers, e.g. in a static HTML file on a shared server? The spec's ordering would prevent her from overriding the server's default.

    The logic is probably that if you took steps to make the web server say something it wouldn't normally say, there's probably a good reason for it and this should be preferred. It might also be a case of believing sysadmins would know more about these things than people who make web pages (keep in mind, this was written 9 years ago, when Unicode was basically unknown to most people).

    As far as shared servers are concerned, they're pretty much always set up to not return a default charset, allowing <meta> to be used instead. Of course, even if they do send out a default charset, you can usually override it with server-side scripting languages.
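
    In PHP, for instance, the override can be a couple of lines at the top of the script. A sketch, assuming it runs before any output and that the host allows the setting to be changed at runtime; how a particular web server's own default interacts with this is not covered here.

```php
<?php
// Assumption: this runs before any output, e.g. at the top of the entry script.
// default_charset is the charset PHP appends to its default Content-Type header.
$previous = ini_set('default_charset', 'UTF-8');

// ini_set() returns false on failure; fall back to an explicit header then.
if ($previous === false && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
```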

    If I read the XHTML spec correctly (http://www.w3.org/TR/xhtml1/#C_9), the XML prolog is prioritized the highest. Of course, not only would you have to use XHTML on your site (and unless you need inline MathML, I've not found much reason to do so), you'd also need to be dealing with a user agent that treats it as XHTML.

