String conversion is fun!



  • Recently found this deep in the code that processes XML data from the backend. And they were wondering why their umlauts got garbled... Can you count the WTFs, and especially the number of completely unnecessary and potentially destructive byte[]/String conversions? Anyone care to explain how someone who's apparently aware of the concept of string encodings could produce this code?

        public boolean setContent(byte[] xml)
        {
            boolean retVal = false;
            if(xml != null)
            {
                this.keys.clear();
                SAXBuilder builder = new SAXBuilder();
                builder.setValidation(false);
                try
                {
                    String strEncoded = new String(xml,encoding);
                    retVal = this.setContent(strEncoded);;
                }
                catch (UnsupportedEncodingException e)
                {
                    SystemProperties.getLogger().logException(getClass(), "Wrong Encoding", e);
                    retVal = false;
                }            
            }
            return retVal;
        }
     

        public boolean setContent(String xml)
        {
            boolean retVal = false;
            if(xml != null)
            {
                this.keys.clear();
                SAXBuilder builder = new SAXBuilder();
                builder.setValidation(false);
                try
                {
                    document = builder.build(
                            new InputSource(
                                    new StringReader(
                                            new String(xml.getBytes(),encoding))));
                } catch (JDOMException e)
                {
                    retVal = false;
                } catch (IOException e)
                {
                    SystemProperties.getLogger().logException(getClass(), "Cannot create JDOM document", e);
                    retVal = false;
                }            
                initMap();
            }
            return retVal;
        }
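
    For comparison, a minimal sketch of what skipping the conversions entirely could look like: hand the raw bytes to the parser and let it pick the charset from the XML declaration. Note the assumptions: this uses the JDK's built-in DocumentBuilder rather than the JDOM SAXBuilder from the code above, and DirectParseDemo and rootText are made-up names for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DirectParseDemo {
    // Hypothetical helper: parses the bytes as-is and returns the root element's text.
    // No byte[] -> String step happens in our code, so the parser is free to honor
    // the encoding named in the XML declaration.
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // \u00FC is a lowercase u-umlaut; escapes keep this source file ASCII-only.
        String source = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00FC</a>";
        byte[] latin1 = source.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1).equals("\u00FC"));
    }
}
```

    The same idea should carry over to JDOM by giving the builder an InputStream instead of a StringReader, but I haven't checked that against the JDOM version used here.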
        



  • I think it'll work out ... as long as you don't use accents, umlauts... oh well, anything that you might encounter with non-English languages. Oops!

    On a related note, why does Javascript ignore the HTML escaping sequences? I was genuinely pissed when I found out that alert() would pass on the &oacute; string instead of ó. Now I just use a simple UTF-8 converter right before the string reaches the JSP frontend, that solved most of my problems.



  • @danixdefcon5 said:

    On a related note, why does Javascript ignore the HTML escaping sequences?

    Because Javascript isn't HTML?  HTML entity encoding is meant for HTML, not JS. 



  • @morbiuswilters said:

    @danixdefcon5 said:

    On a related note, why does Javascript ignore the HTML escaping sequences?

    Because Javascript isn't HTML?  HTML entity encoding is meant for HTML, not JS. 

     

    Although you can of course have Javascript in HTML... <span onclick="alert(&quot;&ouml;&quot;);">huh</span>




  • @RoBorg said:

    Although you can of course have Javascript in HTML... <span onclick="alert(&quot;&ouml;&quot;);">huh</span>

    Yes.  And the HTML is parsed before the JS is interpreted, so it is encoded according to the rules of HTML.  It's also probably gzip'd before being sent to the client. 



  • I believe the HTML is parsed from the mark-up first, including un-escaping things like &quot;, and then the javascript is parsed after that.

    EDIT: nevermind.  didn't see the previous reply!



  • @danixdefcon5 said:

    I think it'll work out ... as long as you don't use accents, umlauts... oh well, anything that you might encounter with non-English languages. Oops!

    Even then it can fail with the rare non-ASCII-compatible encoding like EBCDIC.

    OTOH it can (and did) work when the hardcoded encoding, Java's platform default encoding AND the one actually used by the XML data just happen to be identical. Quite a feat - most naive string conversion code just fails when the data's encoding doesn't match the platform default. 
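
    To make that concrete, here's a small sketch of the getBytes()/new String() round trip from setContent(String). The platform default is passed in as a parameter so the effect can be reproduced on any machine; RoundTripDemo and roundTrip are names I made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Mimics new String(xml.getBytes(), encoding): encode with one charset
    // (standing in for the platform default), decode with another (the hardcoded one).
    static String roundTrip(String s, Charset platformDefault, Charset hardcoded) {
        return new String(s.getBytes(platformDefault), hardcoded);
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // lowercase u-umlaut, escaped to keep the source ASCII-only

        // Same charset on both sides: the round trip is pointless but harmless.
        String same = roundTrip(umlaut, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(umlaut));

        // UTF-8 default, ISO-8859-1 hardcoded: the two UTF-8 bytes 0xC3 0xBC
        // are decoded as two separate Latin-1 characters.
        String garbled = roundTrip(umlaut, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("\u00C3\u00BC"));

        // Pure ASCII survives either way, which is why nobody notices in testing.
        String ascii = roundTrip("abc", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(ascii.equals("abc"));
    }
}
```

    As far as I can tell all three checks come out true, i.e. the umlaut only survives when both charsets agree.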



  • @brazzy said:

    @danixdefcon5 said:

    I think it'll work out ... as long as you don't use accents, umlauts... oh well, anything that you might encounter with non-English languages. Oops!

    Even then it can fail with the rare non-ASCII-compatible encoding like EBCDIC.

    OTOH it can (and did) work when the hardcoded encoding, Java's platform default encoding AND the one actually used by the XML data just happen to be identical. Quite a feat - most naive string conversion code just fails when the data's encoding doesn't match the platform default. 

    Most of our Java platforms have "ISO-8859-1" encoding by default. But UTF-8 is the big thing on websites; my initial solution was simply to substitute with HTML encoding (&oacute; and such) ... until I found that anything under <script> tags wouldn't be unescaped. So now I convert the strings to UTF-8 before displaying ... and get some interesting results when the page is actually encoded for ISO-8859-1 as I found out yesterday.
