How the fuck am I supposed to parse this?



  • It's HTML code, but...

    <p>
    content content content &ldquo;content&rdquo;</P>
    
    
    <!-- Everything below this point is necessary for Google Analytics and page closer.
    
    <!-- New HR tag -->
    <br><div class="hr2"></div><p>&nbsp;</p><br><center>End of Chapter 1</center>
    <!-- End HR tag code -->
    
    <p></p><div align="center">
    <script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
    
      ga('create', 'UA-52978780-1', 'auto');
      ga('send', 'pageview');
    
    </script>
    
    </div>
    
    
    
    </body></html>
    

    The syntax highlighter in Discourse seems to handle it. But did you notice that there's an unclosed comment?

    http://i.imgur.com/whDuiUz.png

    <!-- Everything below this point is necessary for Google Analytics and page closer.
    LOL THIS IS STILL IN A COMMENT
    <!-- New HR tag -->
    

    Yeeeahhhh.... That's making Sigil (ebook editor/creator) throw a fit.


  • FoxDev

    That's an issue in Sigil surely; HTML comments don't nest 😆



  • Also note that the <script> tag is still there. I'm passing the HTML through a script that's supposed to clean it up a bit - and one of the steps is to remove any script tags. >.>

    Nope, the script tag gets removed in the cleanedmirror/ folder. Never mind. Let's see what the parse tree looks like...



  • that is.... ugly.

    but, i'm with spiny norman, it's a parser bug
    the spec:

    A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".

    so yes, it's all comment until the >



  • Oh, it's probably BeautifulSoup introducing the "matched comment tags" rule. :facepalm:



  • @Jarry said:

    the spec:

    A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".

    so yes, it's all comment until the >

    According to that definition, the page is ill-formed and throwing a fit should be acceptable. The issue is here:

    <!-- Everything below this point is necessary for Google Analytics and page closer.
    
    <!-- New HR tag -->
    

    The fist comment in the <! ... > tag is

    -- Everything below this point is necessary for Google Analytics and page closer.
    
    <!--
    

    You then have text that is outside any comments in the comment tag (which, by that definition, can only have zero or more comments)

    New HR tag
    

    And then you have the next comment:

    -->
    
    <p></p><div align="center">
    <script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
    
      ga('create', 'UA-52978780-1', 'auto');
      ga('send', 'pageview');
    
    </script>
    
    </div>
    
    
    
    </body></html>
    

    Which doesn't end, and the comment tag is also not closed.
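The strict reading above can be sketched in Go. This is only an illustration of that one interpretation of the quoted rule, not a real SGML parser: the function name `parseCommentDecl` and the shortened sample input are made up for the example, and allowing whitespace between comments is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseCommentDecl (hypothetical helper) applies the strict reading of the
// quoted rule: after "<!", each "--" alternately opens and closes a comment,
// and nothing but whitespace may sit between comments before the final ">".
// It returns the comment bodies found so far, plus an error if the
// declaration is ill-formed under this reading.
func parseCommentDecl(s string) ([]string, error) {
	if !strings.HasPrefix(s, "<!") {
		return nil, errors.New("not a comment declaration")
	}
	rest := s[2:]
	var comments []string
	for {
		if rest == "" {
			return comments, errors.New("missing closing >")
		}
		if strings.HasPrefix(rest, ">") {
			return comments, nil
		}
		if !strings.HasPrefix(rest, "--") {
			return comments, errors.New("text outside any comment")
		}
		rest = rest[2:]
		end := strings.Index(rest, "--")
		if end < 0 {
			return comments, errors.New("unterminated comment")
		}
		comments = append(comments, rest[:end])
		// Assumption: whitespace is tolerated after a comment's closing "--".
		rest = strings.TrimLeft(rest[end+2:], " \t\r\n")
	}
}

func main() {
	// Shortened version of the markup quoted in the thread.
	page := "<!-- Everything below this point.\n\n<!-- New HR tag -->"
	comments, err := parseCommentDecl(page)
	fmt.Printf("comments=%q err=%v\n", comments, err)
}
```

Under this reading the first comment swallows the `<!` of the second opener, and the words "New HR tag" then sit outside any comment, so the declaration is rejected.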



  • AIUI nope, a comment could be multi line

    the first comment being:

     Everything below this point is necessary for Google Analytics and page closer.
    
    <!
    

    the second:

     New HR tag 
    

    and a third one, that happens to be empty



  • @Jarry said:

    that is.... ugly.

    but, i'm with spiny norman, it's a parser bug
    the spec:

    A comment declaration starts with <!, followed by zero or more comments, followed by >. A comment starts and ends with "--", and does not contain any occurrence of "--".

    so yes, it's all comment until the >

    Well, it probably gets parsed wrong, but, until the >, it's all comment. The bug is that what is between the first -- and the > should probably just be discarded and probably is being included in the comment, but that's hardly an important bug. I'm not sure it doesn't fall into undefined behavior though. Could it just be included as a second comment and be valid?

    Edit: :hanzo:


    Filed Under: Why isn't that :ninja:??



  • Great news, the Go html5 parser seems to give the most useful interpretation, by including the second comment start tag in the comment.

            // Delete stylesheets
            t.Apply(DeleteNode(), "link")
            // Delete comments
            t.ApplyWithCollector(func(n *html.Node) {
                    fmt.Println(n)
            }, &CommentCollector{})
    
    &{0xc2080c6ee0 <nil> <nil> 0xc2080c79d0 0xc2080c7ab0 4   Insert your story content below this point.   []}
    &{0xc2080c6ee0 <nil> <nil> 0xc208105340 0xc208105420 4   Everything below this point is necessary for Google Analytics and page closer.
    
    <!-- New HR tag   []}
    &{0xc2080c6ee0 <nil> <nil> 0xc2081057a0 0xc208105880 4   End HR tag code   []}
    

    hmm, looks like I forgot the rule to replace "div class=hr2" with <hr/>.



  • Also, here's another CodeSOD from the same project:

            // Sigil seems to respond to commented-out CSS, so fix that
            t.Apply(func(n *html.Node) {
                    // Skip non-elements and empty <style> tags (no text child).
                    if n.Type != html.ElementNode || n.FirstChild == nil {
                            return
                    }
                    // Strip the legacy HTML comment wrapper from the stylesheet text.
                    css := n.FirstChild.Data
                    css = strings.Replace(css, "<!--", "", 1)
                    css = strings.Replace(css, "-->", "", 1)
                    css = strings.Trim(css, " \n")
                    css = css + "\n "
                    n.FirstChild.Data = css
            }, "style")
    

    Sigil seems to respond to commented-out CSS

    ... by not using the styles. :wtf:

    And yes... that's HTML comments inside of a <style> tag.



  • @rad131304 said:

    Filed Under: Why isn't that :ninja:??

    Hanzo


  • FoxDev

    @riking said:

    And yes... that's HTML comments inside of a <style> tag.

    Welcome to 1998 😆



  • @Jarry said:

    @rad131304 said:
    Filed Under: Why isn't that :ninja:??

    Hanzo

    Fair enough; I still don't think I should have to hunt down a random TDWTF meme in order to type an emoji. That :ninja: should be an alias that, possibly, gets replaced by the MD parser - but only because we have the view-source.



  • @Jarry said:

    AIUI nope, a comment could be multi line: http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.4

    Check my post more closely. The problem is not the multiline comment. We both interpret the first comment the same way. Where we read it differently is where you call New HR tag the second comment. This is debatable, and the actual definitions don't clarify it.

    First, the two definitions of HTML comments (the one you first quoted and the one you now linked) don't match up. Your first definition claims that a comment tag can have zero or more comments. The link shows examples of comment tags with exactly 1 comment in them, which doesn't necessarily conflict, but suggests that <!> would not be a valid comment, while your first definition explicitly says it is.

    Now, your first definition says that A comment starts and ends with "--", and does not contain any occurrence of "--". This can have two interpretations: The one I followed:

    <!-- This is a comment -- -- This is another -->
    

    And the one you are suggesting:

    <!--This is a comment -- This is another -->
    

    However, your link would disagree with you: A common error is to include a string of hyphens ("---") within a comment. Authors should avoid putting two or more adjacent hyphens inside comments. If your interpretation were correct, a string of hyphens would simply start and end consecutive comments, with an odd number of hyphens simply meaning that the last comment starts with a hyphen itself, which as I understand it is not an illegal character in a comment. It would be impossible to include hyphens in a comment.

    This passage would actually mean that you can't have zero or more comments in a comment tag, however, because my interpretation of comments is also ruled out by that passage. Since the definitions contradict themselves, unless an authoritative source can describe how comments are actually supposed to be parsed, any interpretation is arbitrary.


  • FoxDev

    @Kian said:

    the fist comment

    …I'm happy not knowing 😆



  • Let's go directly to Hixie:

    Comments must start with the four character sequence U+003C LESS-THAN SIGN, U+0021 EXCLAMATION MARK, U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS (<!--). Following this sequence, the comment may have text, with the additional restriction that the text must not start with a single U+003E GREATER-THAN SIGN character (>), nor start with a U+002D HYPHEN-MINUS character (-) followed by a U+003E GREATER-THAN SIGN (>) character, nor contain two consecutive U+002D HYPHEN-MINUS characters (--), nor end with a U+002D HYPHEN-MINUS character (-). Finally, the comment must be ended by the three character sequence U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN (-->).

    That clears things up nicely, and doesn't require SGML pedantry. Double-hyphens are forbidden in comment tags, except for the two in the prologue and the two in the epilogue.

    Now, let's look at the parser implementation...

    * TwelveBaud reads...

    Ah. In the event you fuck it up in this way, a parse error is thrown, but if the parser elects to continue it includes the two hyphens and goes directly back to the comment state.
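For comparison, the recovery behaviour described above can be approximated with a few lines of Go. This is a simplified sketch, not the spec's tokenizer: `html5Comments` is a hypothetical helper, the sample input is shortened from the page in the thread, and it assumes that edge cases such as `<!-->` and `--!>`, as well as comment-like text inside `<script>` or `<style>`, can be ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// html5Comments (hypothetical helper) returns the body of every comment in
// src under a simplified version of the rule discussed above: a comment
// opens at "<!--" and runs to the first "-->" that follows. A stray "--" or
// a second "<!--" inside the body is simply kept as comment text.
func html5Comments(src string) []string {
	var out []string
	for {
		start := strings.Index(src, "<!--")
		if start < 0 {
			return out
		}
		src = src[start+4:]
		end := strings.Index(src, "-->")
		if end < 0 {
			// Unterminated comment: treat the rest of the input as its body.
			return append(out, src)
		}
		out = append(out, src[:end])
		src = src[end+3:]
	}
}

func main() {
	// Shortened version of the markup quoted in the thread.
	page := "<!-- Everything below this point.\n\n<!-- New HR tag -->\n<br>\n<!-- End HR tag code -->"
	for _, c := range html5Comments(page) {
		fmt.Printf("%q\n", c)
	}
}
```

With this reading the second `<!--` ends up inside the first comment's text, which lines up with the comment nodes the Go html5 parser printed earlier in the thread.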

