Event driven HTML tokenization



  • Why oh why would that ever make sense. Stream of events?!
    Shouldn't a tokenizer just return a list of tokens?



  • Standard SAX approach, move on, nothing to see.


  • Dupa

    Think this is what you actually wanted.



  • @swayde said:

    Shouldn't a tokenizer just return a list of tokens?

    Did you not attend the Basic Buzzword Compliance 101?



  • I fail to see the benefit, that's all. And using events like that seems to prevent parallelization, due to the order of events.



  • @brotherelf said:

    Standard SAX approach, move on, nothing to see.

    Yeah, seems like an overall sane approach; from the callbacks you can get your list of tokens, or directly build a tree-like representation (AST, DOM or whatever you're going to call it). The buzzwords seem a bit unnecessary, but whatever.

    It seems like a way more sane approach than the discomangler that's used here.

    Disclaimer: I don't know anything about javascript/node, and didn't look at the implementation; just saying that the callback-approach seems fairly sane. Also, I mainly posted this for the opportunity to use the word "discomangler".
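A minimal sketch of the callback approach described above, using Python's standard `html.parser.HTMLParser` as a stand-in for the Node tokenizer under discussion (whose API is not shown in this thread). The class name `TokenCollector` and the `(kind, value)` tuple shape are illustrative assumptions.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Turns tokenizer callbacks into a plain list of (kind, value) tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attributes are ignored in this sketch.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.tokens.append(("text", data))


def tokenize(html):
    """Feed the whole document and return the collected token list."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


tokens = tokenize("<p>Hello <b>world</b></p>")
print(tokens)
```

The same three callbacks could just as well push nodes onto a stack to build a tree; the list is simply the smallest consumer to write.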



  • @swayde said:

    And using events like that seems to prevent parallelization

    Oh, they are tokens all right. You can parallelize to your heart's content (after collecting from iterator to list). It is really just different naming.
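As a rough illustration of "collect to a list, then parallelize": once the events are in a list, they can be fanned out to workers and the results still come back in token order. The token list is hard-coded here, and `describe` is a hypothetical per-token function; the choice of a thread pool is an assumption made for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

# Tokens as they might have been collected from a tokenizer's events.
tokens = [
    ("start", "p"),
    ("text", "Hello "),
    ("start", "b"),
    ("text", "world"),
    ("end", "b"),
    ("end", "p"),
]


def describe(token):
    """Hypothetical per-token work that does not depend on neighbouring tokens."""
    kind, value = token
    return f"{kind}:{value.strip().lower()}"


with ThreadPoolExecutor(max_workers=4) as pool:
    # Executor.map yields results in input order, whichever worker finishes first.
    results = list(pool.map(describe, tokens))

print(results)
```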


  • FoxDev

    @cvi said:

    I don't know anything about javascript/node, and didn't look at the implementation; just saying that the callback-approach seems fairly sane.

    Callbacks are the most common way of operating in Node; their whole API is built around the continuation passing style. I'd prefer it to be built on Promises/A+ myself, but the Node API was built before Promises/A+ was added to the ECMAScript spec.



  • @Bulb said:

    @swayde said:
    And using events like that seems to prevent parallelization

    Oh, they are tokens all right. You can parallelize to your heart's content (after collecting from iterator to list). It is really just different naming.


    Does nodejs create different threads for "events" (which is just another name for setTimeout, AFAIK)? I don't think so. There's probably not a JS library in the world that is thread-safe.


  • FoxDev

    IIRC, Node is strictly single-threaded; the 'parallelisation' comes from using setTimeout, setImmediate, and setInterval functions to yield control of the thread to the next queued event.



  • So there's no parallelism, so no benefit at all from using events to parse HTML. There's only the 0.something millisecond overhead from the sloppy scheduling, so a large HTML document will probably cost a few seconds just having "events". Doesn't sound like a good deal.


  • FoxDev

    Parallelism doesn't require multithreading. In Node, if you have a function that could block e.g. on I/O, it can yield control of the thread so other tasks can be completed. Then, when the I/O operation completes, the original task can be resumed.

    There's a reason Node is being increasingly used server-side for websites.
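A small sketch of the "yield control while waiting on I/O" behaviour, using Python's `asyncio` as an analogue of Node's event loop (it is not Node's API). `fake_io` uses a sleep as a stand-in for a real non-blocking read, and all names are illustrative.

```python
import asyncio
import threading


async def fake_io(log):
    # Stand-in for a non-blocking read; awaiting hands the thread to other tasks.
    log.append(("io-start", threading.get_ident()))
    await asyncio.sleep(0.05)
    log.append(("io-done", threading.get_ident()))


async def compute(log):
    # Work split into steps; each await lets the event loop run other tasks.
    for step in range(3):
        log.append((f"compute-{step}", threading.get_ident()))
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(fake_io(log), compute(log))
    return log


log = asyncio.run(main())
names = [name for name, _ in log]
thread_ids = {tid for _, tid in log}
print(names)
print(len(thread_ids))
```

The compute steps are logged between the start and the end of the simulated read, and every entry carries the same thread id: the work is interleaved on one thread rather than run on two.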



  • @RaceProUK said:

    it can

    The entire point of callbacks is that the callback only runs when the read is completed.

    @RaceProUK said:

    Parallelism doesn't require multithreading

    Yes it does. Running in a weird order is not parallelism, it's just running sequentially in a random order.

    @RaceProUK said:

    Node

    I was thinking of languages that are able to parallelize threads - like Java (they have that kind of parser).


  • FoxDev

    @swayde said:

    The entire point of callbacks is that the callback only runs when the read is completed.

    And in order for it to happen that way, the blocking task must yield control of the thread.

    @swayde said:

    @RaceProUK said:
    Parallelism doesn't require multithreading

    Yes it does. Running in a weird order is not parallelism, it's just running sequentially in a random order.

    That's a very narrow definition of parallelism. What's so different about doing something while waiting for IO to complete, and having two threads? Besides, Node may be single-threaded, but that doesn't mean what Node sits on is.



  • It's true that it's the common approach, but it can be confusing when you find it for the first time. Especially considering it's called the "Simple API for XML". "I just wanted a list! Why do I have to build it myself?"

    On a related note, the Python "xml" module has no fewer than 9 submodules to parse it in various ways 😑.

    And with all the XML-related standards (XSLT, etc.) sometimes I feel like the W3C people are trying to build an entire software platform around a simple file format.



  • @RaceProUK said:

    That's a very narrow definition of parallelism. What's so different about doing something while waiting for IO to complete, and having two threads? Besides, Node may be single-threaded, but that doesn't mean what Node sits on is.

    But it's the definition of parallelism. What you refer to is called asynchronous, isn't it?


  • FoxDev

    Asynchrony is a way of achieving parallelism: e.g. I am doing calculations while waiting for I/O to complete.



  • @anonymous234 said:

    "Simple API for XML"

    Well, the API is simple. It just does not do much that is useful. After all, it rhymes with Sucks. What do you expect from it?

    @anonymous234 said:

    I just wanted a list!

    I always want a tree. You can get a list with the “Streaming API for XML” (StAX), which returns the “events” from a generator rather than calling callbacks (not sure if it exists for node though; it probably does not, because it can't be easily written in CPS). But that rarely helps much. The only sane way to deal with XML is a serializer.

    @anonymous234 said:

    And with all the XML-related standards (XSLT, etc.) sometimes I feel like the W3C people are trying to build an entire software platform around a simple file format.

    When all you have is a hammer, everything looks like a nail…
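For comparison with the callback style, here is a minimal sketch of the pull style mentioned above, where the caller iterates over events instead of registering handlers. It uses Python's `xml.etree.ElementTree.iterparse`, which is analogous to StAX rather than an implementation of it; the sample document is made up.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"<doc><item>a</item><item>b</item></doc>"

# iterparse returns an iterator of (event, element) pairs; the caller pulls
# events in a loop instead of having callbacks invoked on it.
events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
pulled = [(event, elem.tag) for event, elem in events]
print(pulled)
```

Collecting into a list is only one option; the loop could equally stop early or skip elements it does not care about.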



  • @Bulb said:

    I always want a tree

    For parsing. We (were) talking tokenization.



  • @Bulb said:

    When all you have is a hammer, everything looks like a nail…

    XML: the ultimate hammer.

    I like this metaphor.


  • 🚽 Regular

    @Hanzo said:

    So there's no parallelism, so no benefit at all from using events to parse HTML. There's only the 0.something millisecond overhead from the sloppy scheduling, so a large HTML document will probably cost a few seconds just having "events". Doesn't sound like a good deal.

    Assuming you don't have anything else to do, that's correct.

    @Hanzo said:

    What you refer to is called asynchronous, isn't it?

    Concurrent.

    Given single-threaded JavaScript, the advantage of using events is that you introduce more opportunities for any given sequence of instructions to yield the CPU to another concurrent sequence of instructions that doesn't need to wait.

    Is there an advantage in using events instead of "regular" asynchronous callbacks? Performance-wise, no. But it might give you a simpler abstraction over the handling of green threads.


  • FoxDev

    @Bulb said:

    When all you have is a hammer

    You smash robots with it to free the little birds, pigs, and rabbits trapped inside?



  • @Bulb said:

    The only sane way to deal with XML is a serializer.

    Most deserializers insist that their input stream consist of -- and return objects that are -- the tree, the whole tree, and nothing but the tree. This can cause severe issues when dealing with large or dynamic inputs, which a SAX-based or StAX-based parser can handle without any issues whatsoever. Plus, most XML libraries let you temporarily hand off to a SAX-to-DOM generator, so you get those subtrees you always wanted.
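One concrete instance of such a hand-off, as a sketch: Python's standard `xml.dom.pulldom` streams events and can expand a single element into a DOM subtree on request via `expandNode`. This shows one library only and says nothing about how common the feature is elsewhere; the sample document and element names are made up.

```python
from xml.dom import pulldom

xml_text = (
    "<log>"
    "<entry id='1'><msg>ok</msg></entry>"
    "<entry id='2'><msg>fail</msg></entry>"
    "</log>"
)

messages = []
stream = pulldom.parseString(xml_text)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == "entry":
        # Hand off: build a DOM subtree for just this element.
        stream.expandNode(node)
        msg = node.getElementsByTagName("msg")[0]
        messages.append((node.getAttribute("id"), msg.firstChild.data))

print(messages)
```

Only the `entry` currently being handled is materialized as a tree; the rest of the document is still consumed as a stream.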



  • @TwelveBaud said:

    Plus, most XML libraries let you temporarily hand off to a SAX-to-DOM generator[citation needed], so you get those subtrees you always wanted.

    I've never actually seen that.



  • @RaceProUK said:

    There's a reason Node is being increasingly used server-side for websites.

    Because PHP is more than a couple years old and learning a good language is beyond the capabilities of JS programmers? 🚎



  • Who are you trying to 🚎 there, exactly?


  • Discourse touched me in a no-no place

    @Bulb said:

    I always want a tree.

    Hooray for you, but it doesn't really scale too well and there are some colossal XML documents about. Sometimes that doesn't matter and you can use the tokens to build the tree you want. Sometimes that matters a heck of a lot. There's also XMPP, which is basically an infinite XML document delivered over a socket (actually a sequence of smaller documents, but even so…)

    Grumbling over there being a streaming XML tokenizer is just a demonstration of ignorance.
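A sketch of the "document delivered over a socket" case, assuming hard-coded chunks in place of real socket reads (this is not an XMPP implementation). It uses Python's `xml.etree.ElementTree.XMLPullParser`, which accepts partial input and reports whatever events are complete so far; the root element here is never closed.

```python
import xml.etree.ElementTree as ET

# Chunks as they might arrive from a socket; boundaries fall mid-element.
chunks = ["<stream><message>hel", "lo</message><mess", "age>world</message>"]

parser = ET.XMLPullParser(events=("end",))
received = []
for chunk in chunks:
    parser.feed(chunk)
    # read_events yields only the events completed by the data fed so far.
    for event, elem in parser.read_events():
        if elem.tag == "message":
            received.append(elem.text)
            elem.clear()  # release the element's contents once handled

print(received)
```

Each message is handled as soon as its closing tag arrives, so the consumer never has to wait for, or hold, the whole document.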


