Optimizing parsers can be tricky



  • //    - Matching and slicing on a huge input is often cause of slowdowns.
    //      The solution is to chunkify the input into smaller strings.
    //      The chunks are stored in the `chunks` var...
    
    
    // Split the input into chunks.
    chunks = (function (chunks) {
        var j = 0,
            skip = /(?:@\{[\w-]+\}|[^"'`\{\}\/\(\)\\])+/g,
            comment = /\/\*(?:[^*]|\*+[^\/*])*\*+\/|\/\/.*/g,
            string = /"((?:[^"\\\r\n]|\\.)*)"|'((?:[^'\\\r\n]|\\.)*)'|`((?:[^`]|\\.)*)`/g,
            level = 0,
            match,
            chunk = chunks[0],
            inParam;
    
        for (var i = 0, c, cc; i < input.length;) {
            skip.lastIndex = i;
            if (match = skip.exec(input)) {
                if (match.index === i) {
                    i += match[0].length;
                    chunk.push(match[0]);
                }
            }
            c = input.charAt(i);
            comment.lastIndex = string.lastIndex = i;
    
            if (match = string.exec(input)) {
                if (match.index === i) {
                    i += match[0].length;
                    chunk.push(match[0]);
                    continue;
                }
            }
    
            if (!inParam && c === '/') {
                cc = input.charAt(i + 1);
                if (cc === '/' || cc === '*') {
                    if (match = comment.exec(input)) {
                        if (match.index === i) {
                            i += match[0].length;
                            chunk.push(match[0]);
                            continue;
                        }
                    }
                }
            }
            
            switch (c) {
                case '{': if (! inParam) { level ++; chunk.push(c); break }
                case '}': if (! inParam) { level --; chunk.push(c); chunks[++j] = chunk = []; break }
                case '(': if (! inParam) { inParam = true; chunk.push(c); break }
                case ')': if ( inParam) { inParam = false; chunk.push(c); break }
                default: chunk.push(c);
            }
            
            i++;
        }
        if (level != 0) {
            error = new(LessError)({
                index: i-1,
                type: 'Parse',
                message: (level > 0) ? "missing closing `}`" : "missing opening `{`",
                filename: env.filename
            }, env);
        }
    
        return chunks.map(function (c) { return c.join('') });
    })([[]]);
    

    A fork later...

    Speeds up compilation about 10x...
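    One reason the code above keeps re-assigning `lastIndex` and then checking `match.index === i` is that a global regex's `exec` scans *forward* from `lastIndex` rather than anchoring there, so on a huge input every probe can match (or fail) far past the cursor. A minimal sketch of the difference, using the sticky `y` flag (which did not exist in engines when this code was written):

    ```javascript
    var input = "abc   def";

    // A global regex scans forward from lastIndex:
    var global = /\w+/g;
    global.lastIndex = 3;
    console.log(global.exec(input).index); // 6 -- skipped past the spaces

    // A sticky regex anchors exec() exactly at lastIndex:
    var sticky = /\w+/y;
    sticky.lastIndex = 3;
    console.log(sticky.exec(input)); // null -- no word chars at index 3
    ```

    Hence the `match.index === i` guard in the chunker: without it, a forward-scanning match would silently swallow the text in between.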




  • Hah, this is from the LESS source. I just happened to read it the other day and noticed the remark about chunking. I didn't realize how much effort went into that single line.

    Do you have a link to that exact commit? I'd love to bookmark it for posterity.



  • TRWTF is JavaScript.



  • @MiffTheFox said:


    TRWTF is JavaScript.

    Considering we have JavaScript libraries that can parse JavaScript itself, a language a good deal more complex than LESS, in acceptable time: no, TRWTF are the LESS guys.

    Don't kid yourself either; the new implementation, while faster, is still a horrible piece of failure. For instance, it will fail to chunk correctly when the exact sequence "\n}\n" never happens to appear in your source code. (Poor Windows people stuck with "\r\n"...) That's not even the biggest flaw. The biggest flaw is using string.replace with a string parameter as the pattern: that will replace only the first occurrence instead of all occurrences. Nice; a chunker that will produce at most two chunks!
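    Both flaws are easy to reproduce in a console. The first snippet below shows the first-occurrence-only behaviour of `String.prototype.replace` with a string pattern; the second is a hypothetical reconstruction (not the fork's actual code) of why keying on the literal "\n}\n" falls apart under CRLF line endings:

    ```javascript
    // 1. replace() with a *string* pattern replaces only the first match;
    //    a global regex is needed to hit them all:
    console.log("x.y.z".replace(".", "-"));   // "x-y.z"
    console.log("x.y.z".replace(/\./g, "-")); // "x-y-z"

    // 2. Hypothetical sketch of the "\n}\n" dependency: split on that exact
    //    literal, and CRLF input never chunks at all.
    var unix = "a {\n}\nb {\n}\n";
    var crlf = unix.replace(/\n/g, "\r\n");
    console.log(unix.split("\n}\n").length); // 3 chunks
    console.log(crlf.split("\n}\n").length); // 1 chunk: "\r\n}\r\n" never matches
    ```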


  • Considered Harmful

    @Ragnax said:

    the new implementation, while faster, is still a horrible piece of failure. For instance, it will fail to chunk correctly when the exact sequence "\n}\n" never happens to appear in your source code. (Poor Windows people stuck with "\r\n"...) That's not even the biggest flaw. The biggest flaw is using string.replace with a string parameter as the pattern: that will replace only the first occurrence instead of all occurrences. Nice; a chunker that will produce at most two chunks!

    OK, I agree the fix is horribly broken, but if it actually succeeds in speeding up compilation 10x as it claims, then the original assertion that chunking the input is an optimization seems invalid, and maybe no chunking needs to be performed in the first place.

    Disclaimer: I'm going entirely based on code comments here, as I was too lazy to thoroughly read the chunking code.


  • Trolleybus Mechanic

     My only critique: why is this thread not titled "Blowing Chunks"?



  • @joe.edwards said:

    @Ragnax said:
    the new implementation, while faster, is still a horrible piece of failure. For instance, it will fail to chunk correctly when the exact sequence "\n}\n" never happens to appear in your source code. (Poor Windows people stuck with "\r\n"...) That's not even the biggest flaw. The biggest flaw is using string.replace with a string parameter as the pattern: that will replace only the first occurrence instead of all occurrences. Nice; a chunker that will produce at most two chunks!

    OK, I agree the fix is horribly broken, but if it actually succeeds in speeding up compilation 10x as it claims, then the original assertion that chunking the input is an optimization seems invalid, and maybe no chunking needs to be performed in the first place.

    I'm going to go out on a limb here and wager that 'the fix' was only tested on a relatively simple & small input file where the cost of chunking somehow far outweighs the cost of everything else.

    @Lorne Kates said:

    My only critique: why is this thread not titled "Blowing Chunks"?
    A sadly missed opportunity, I agree.



  •  Yes, I have the link. This duet of fail comes to you from https://github.com/julfers/less.js/commit/6f0a19e65a41cee89b272f0e405e369b6e31dd60

    @Lorne Kates said:

    My only critique: why is this thread not titled "Blowing Chunks"?


    Yep. Meta-fail. It's like an encore.
