Event driven HTML tokenization
-
Shouldn't a tokenizer just return a list of tokens?
-
Standard SAX approach, move on, nothing to see.
-
-
Shouldn't a tokenizer just return a list of tokens?
Did you not attend the Basic Buzzword Compliance 101?
-
I fail to see the benefit, that's all. And using events like that seems to prevent parallelization, due to the order of events.
-
Standard SAX approach, move on, nothing to see.
Yeah, seems like an overall sane approach; from the callbacks you can get your list of tokens, or directly build a tree-like representation (AST, DOM or whatever you're going to call it). The buzzwords seem a bit unnecessary, but whatever.
It seems like a way more sane approach than the discomangler that's used here.
Disclaimer: I don't know anything about javascript/node, and didn't look at the implementation; just saying that the callback-approach seems fairly sane. Also, I mainly posted this for the opportunity to use the word "discomangler".
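To make the "from the callbacks you can get your list of tokens" point concrete, here is a minimal sketch in Python rather than Node, using the standard library's html.parser. TokenCollector and tokenize are hypothetical names made up for illustration, not part of the tokenizer being discussed.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: turns parser callbacks into a plain token list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the list stays short.
        if data.strip():
            self.tokens.append(("text", data))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


tokens = tokenize('<p class="x">Hello <b>world</b></p>')
for token in tokens:
    print(token)
```

The same three callbacks could just as well push nodes onto a stack to build a tree instead of appending to a list.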
-
And using events like that seems to prevent parallelization
Oh, they are tokens all right. You can parallelize to your heart's content (after collecting from iterator to list). It is really just different naming.
-
I don't know anything about javascript/node, and didn't look at the implementation; just saying that the callback-approach seems fairly sane.
Callbacks are the most common way of operating in Node; their whole API is built around the continuation passing style. I'd prefer it to be built on promises myself, but the Node API was built before Promises/A+-style promises were added to the ECMAScript spec (in ES2015).
-
@swayde said:
And using events like that seems to prevent parallelization
Oh, they are tokens all right. You can parallelize to your heart's content (after collecting from iterator to list). It is really just different naming.
Does nodejs create different threads for "events" (which is just another name for setTimeout, AFAIK)? I don't think so. There's probably not a JS library in the world that is thread-safe.
-
IIRC, Node is strictly single-threaded; the 'parallelisation' comes from using the setTimeout, setImmediate, and setInterval functions to yield control of the thread to the next queued event.
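A rough analogue of that yielding, sketched in Python's asyncio rather than Node: await asyncio.sleep(0) hands the single thread back to the event loop, which is approximately the role setImmediate plays. The worker and main names are made up for illustration.

```python
import asyncio


async def worker(name, steps, log):
    for i in range(steps):
        log.append(f"{name}{i}")
        # sleep(0) gives the thread back to the event loop so the next
        # queued task can run; no second thread is involved.
        await asyncio.sleep(0)


async def main():
    log = []
    await asyncio.gather(worker("a", 3, log), worker("b", 3, log))
    return log


log = asyncio.run(main())
print(log)
```

The two workers take turns on one thread; nothing here runs at the same instant as anything else.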
-
So there's no parallelism, so no benefit at all from using events to parse HTML. There's only the 0.something millisecond overhead from the sloppy scheduling, so a large HTML document will probably cost a few seconds just having "events". Doesn't sound like a good deal.
-
Parallelism doesn't require multithreading. In Node, if you have a function that could block e.g. on I/O, it can yield control of the thread so other tasks can be completed. Then, when the I/O operation completes, the original task can be resumed.
There's a reason Node is being increasingly used server-side for websites.
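A small sketch of the "yield while blocked on I/O" idea, again in Python's asyncio rather than Node. fake_read is a made-up stand-in that only sleeps, so this shows nothing more than three waits overlapping on one thread; whether that deserves the name parallelism is a separate question.

```python
import asyncio
import time


async def fake_read(delay):
    # Stand-in for a network or disk read; the sleep simulates waiting.
    await asyncio.sleep(delay)
    return "payload"


async def main():
    started = time.perf_counter()
    # All three "reads" are pending at once on a single thread.
    results = await asyncio.gather(fake_read(0.2), fake_read(0.2), fake_read(0.2))
    elapsed = time.perf_counter() - started
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Run one after another, the three waits would take about 0.6 seconds; overlapped they take roughly as long as one.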
-
it can
The entire point of callbacks is that the callback only runs when the read is completed.
Parallelism doesn't require multithreading
Yes it does. Running in a weird order is not parallelism, it's just running sequentially in a random order.
Node
I was thinking of languages that are able to parallelize threads - like Java (they have that kind of parser)
-
The entire point of callbacks is that the callback only runs when the read is completed.
And in order for it to happen that way, the blocking task must yield control of the thread.
@RaceProUK said:
Parallelism doesn't require multithreading
Yes it does. Running in a weird order is not parallelism, it's just running sequentially in a random order.
That's a very narrow definition of parallelism. What's so different about doing something while waiting for IO to complete, and having two threads? Besides, Node may be single-threaded, but that doesn't mean what Node sits on is.
-
It's true that it's the common approach, but it can be confusing when you find it for the first time. Especially considering it's called the "Simple API for XML". "I just wanted a list! Why do I have to build it myself?"
On a related note, the Python "xml" module has no fewer than 9 submodules to parse it in various ways.
And with all the XML-related standards (XSLT, etc.) sometimes I feel like the W3C people are trying to build an entire software platform around a simple file format.
-
That's a very narrow definition of parallelism. What's so different about doing something while waiting for IO to complete, and having two threads? Besides, Node may be single-threaded, but that doesn't mean what Node sits on is.
But it's the definition of parallelism. What you refer to is called asynchronous, isn't it?
-
Asynchrony is a way of achieving parallelism: e.g. I am doing calculations while waiting for I/O to complete.
-
"Simple API for XML"
Well, the API is simple. It just does not do much that is useful. After all, it rhymes with Sucks. What do you expect from it?
I just wanted a list!
I always want a tree. You can get a list with the “Stream API for XML” (StAX), which returns the “events” from a generator rather than calling callbacks (not sure if it exists for node though; it probably does not, because it can't be easily written in the CPS). But that rarely helps much. The only sane way to deal with XML is a serializer.
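For what the pull style looks like in practice, here is a sketch using Python's xml.etree.ElementTree.iterparse as a stand-in for StAX (it is not StAX itself, just the same idea). pull_events is a hypothetical wrapper name.

```python
import io
import xml.etree.ElementTree as ET


def pull_events(xml_text):
    """Yield (kind, tag) pairs on demand instead of invoking callbacks."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        yield event, elem.tag


# The caller decides when to pull; list() here simply drains the generator.
events = list(pull_events("<a><b>hi</b><c/></a>"))
print(events)
```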
And with all the XML-related standards (XSLT, etc.) sometimes I feel like the W3C people are trying to build an entire software platform around a simple file format.
When all you have is a hammer, everything looks like a nail…
-
-
When all you have is a hammer, everything looks like a nail…
XML: the ultimate hammer.
I like this metaphor.
-
So there's no parallelism, so no benefit at all from using events to parse HTML. There's only the 0.something millisecond overhead from the sloppy scheduling, so a large HTML document will probably cost a few seconds just having "events". Doesn't sound like a good deal.
Assuming you don't have anything else to do, that's correct.
What you refer to is called asynchronous, isn't it?
Concurrent.
The advantage of using events given the context of single-threaded JavaScript is that you are introducing more opportunities for any given sequence of instructions to yield the CPU to another concurrent sequence of instructions not needing to wait.
Is there an advantage in using events instead of "regular" asynchronous callbacks? Performance-wise, no. But it might give you a simpler abstraction over the handling of green threads.
-
When all you have is a hammer
You smash robots with it to free the little birds, pigs, and rabbits trapped inside?
-
The only sane way to deal with XML is a serializer.
Most deserializers insist that their input stream consist of -- and that the objects they return are -- the tree, the whole tree, and nothing but the tree. This can cause severe issues when dealing with large or dynamic inputs, which a SAX-based or StAX-based parser can handle without any issues whatsoever. Plus, most XML libraries let you temporarily hand off to a SAX-to-DOM generator, so you get those subtrees you always wanted.
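One library where such a hand-off exists is Python's standard xml.dom.pulldom: you stream events and call expandNode only on the elements you want as DOM subtrees. A minimal sketch; the catalog document and the item_names helper are made up for illustration.

```python
from xml.dom import pulldom

DOC = (
    "<catalog>"
    "<item id='1'><name>bolt</name></item>"
    "<item id='2'><name>nut</name></item>"
    "</catalog>"
)


def item_names(xml_text):
    names = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Hand off to the DOM builder for just this subtree.
            events.expandNode(node)
            name = node.getElementsByTagName("name")[0]
            names.append((node.getAttribute("id"), name.firstChild.data))
    return names


names = item_names(DOC)
print(names)
```

Only one item subtree needs to be materialised at a time; the rest of the document is still streamed.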
-
Plus, most XML libraries let you temporarily hand off to a SAX-to-DOM generator[citation needed], so you get those subtrees you always wanted.
I've never actually seen that.
-
There's a reason Node is being increasingly used server-side for websites.
Because PHP is more than a couple years old and learning a good language is beyond the capabilities of JS programmers?
-
Who are you trying to there, exactly?
-
I always want a tree.
Hooray for you, but it doesn't really scale too well and there are some colossal XML documents about. Sometimes that doesn't matter and you can use the tokens to build the tree you want. Sometimes that matters a heck of a lot. There's also XMPP, which is basically an infinite XML document delivered over a socket (actually a sequence of smaller documents, but even so…)
Grumbling over there being a streaming XML tokenizer is just a demonstration of ignorance.
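A sketch of the "infinite document over a socket" case, using Python's xml.etree.ElementTree.XMLPullParser fed in pieces. The chunks list stands in for socket reads, the element names are invented and are not real XMPP, and stanzas is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def stanzas(chunks):
    """Yield each completed child of a root element that never closes."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the still-open root is complete.
                    yield elem.tag, elem.text
                    # Drop the element's contents once handled.
                    elem.clear()


# Simulated socket reads; the first message is split inside its text.
chunks = ["<stream><message>he", "llo</message>", "<message>bye</message>"]
received = list(stanzas(chunks))
print(received)
```

The root element is never closed, which is exactly the situation a build-the-whole-tree parser cannot handle.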