Regex confessions/regex challenge

Bulb

I just put together this sed command:

sed ':0
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\*\+/!\1!
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)//.*!\1!
     \!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\**$!{
        N
        b0
        }'

Try to guess what it does
Try to spot errors or unhandled corner-cases

Hint: The output is going to be fed to jq. Because Microsoft.

dcon

@Bulb I think you misplaced the Evil thread...

Gribnit

@Bulb said in Regex confessions/regex challenge:

I just put together this sed command:

sed ':0
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\*\+/!\1!
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)//.*!\1!
     \!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\**$!{
        N
        b0
        }'

Try to guess what it does

Turns something called JSON that isn't, into JSON.

Try to spot errors or unhandled corner-cases

Obviously, there are none.

Hint: The output is going to be fed to jq. Because Microsoft.

Applied Mediocrity

@Bulb said in Regex confessions/regex challenge:

Try to guess what it does

dkf

@Bulb said in Regex confessions/regex challenge:

Try to spot errors or unhandled corner-cases

TwelveBaud

@Bulb Convert .config files with shortcutted property names to proper JSON without said shortcuts.

Bulb

@TwelveBaud No. That would require some name replacing in the patterns, but they are all just special characters.

Gurth

@Bulb said in Regex confessions/regex challenge:

Try to guess what it does

Drive people crazy trying to figure it out.

HardwareGeek

@Gurth said in Regex confessions/regex challenge:

Drive people crazy

That's a shorter drive for some of us than for others.

cvi

@Bulb said in Regex confessions/regex challenge:

Try to guess what it does

Demonstrate that you forgot to enable extended regexp?

Is it trying to strip comments or something like that?

LaoC

@Bulb said in Regex confessions/regex challenge:

I just put together this sed command:

sed ':0
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\*\+/!\1!
     s!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)//.*!\1!
     \!^\(\([^\\"/]\|\\.\|"\([^\\"]\|\\.\)*"\|/[^/*]\)*\)/\*\([^*]\|\*\+[^/]\)*\**$!{
        N
        b0
        }'

Try to guess what it does

Try to spot errors or unhandled corner-cases

Hint: The output is going to be fed to jq. Because Microsoft.

You're joining lines until … something. Maybe quoted strings shouldn't be multiline?
At this backslash density, are you sure you don't want -E? (ed)
And why stuff like [^\\"/]\|\\.\? "Either a character that's neither backslash nor doublequote nor slash, or a period". Am I missing something or does the second branch of the disjunction never match?

Bulb

@LaoC said in Regex confessions/regex challenge:

You're joining lines until … something. Maybe quoted strings shouldn't be multiline?

While (the pattern matches), not until.

… though the third pattern is the first one without the ending, so until we get that ending (\*+/)

At this backslash density, are you sure you don't want -E?

I keep forgetting sed can do extended regex.

And why stuff like [^\\"/]\|\\.\? "Either a character that's neither backslash nor doublequote nor slash, or a period". Am I missing something or does the second branch of the disjunction never match?

You are missing a backslash. The left side matches one character, while the right side matches two characters, a backslash and anything.

… and I've noticed an error myself. \ is only a valid character inside a double-quoted string. Thank you, rubber duck!

topspin

@Bulb said in Regex confessions/regex challenge:

Try to guess what it does

Exactly what it says it does.

Try to spot errors or unhandled corner-cases

"Works as coded."

Bulb

@LaoC said in Regex confessions/regex challenge:

At this backslash density, are you sure you don't want -E?

OK, here is a slightly simplified version (will test tomorrow). With the mistake fixed, that saves a case too.

sed -E ':0
        s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\*+/!\1!
        s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)//.*!\1!
        \!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\**$!{
           N
           b0
           }'

cvi

@Bulb Still thinking this based on the first line - the version with fewer \ makes it easier to spot, though.

@cvi said in Regex confessions/regex challenge:

Is it trying to strip comments or something like that?

Somewhere in the first line it matches a quoted string and spends some time dealing with escaped elements. Second part seems to match a C-style comment. The substitute only keeps the first group = string, though, hence thinking that it strips comments.

The second line probably deals with the case when things extend across lines. It looks similar enough to the first one, and the grouped commands pull in more input.

Bulb

@cvi Yes, it is.

The first part that is kept is everything that is not a comment, not just a string. But it needs to handle strings because a string can contain something that would otherwise look like a comment.

The first pattern deletes complete C-style comments. The second pattern deletes complete C++-style comments, and the loop appends the next line for as long as the pattern space ends in an unterminated C-style comment.

I needed this because Microsoft loves using JSONC for configuration, but jq does not support comments and I want to stick to things available in Ubuntu for the build agent (the toolchain is in a container installed by whatever procedure upstream provides, but this is used to configure the container).
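A minimal sketch of those three steps in action, wrapping the -E version posted earlier in the thread. The function name and the sample input are made up for illustration, and it assumes GNU sed. The string holds both // and /*...*/, which should survive, while the real comments (one of them spanning two lines) should go.

```shell
# Sketch only: the -E script from this thread wrapped in a hypothetical helper.
strip_jsonc_comments() {
  sed -E ':0
    # 1. delete one complete /* ... */ comment that sits outside any string
    s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\*+/!\1!
    # 2. delete a // comment running to the end of the line
    s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)//.*!\1!
    # 3. a /* is still open at the end: append the next line and start over
    \!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\**$!{
      N
      b0
    }'
}

# Made-up sample input, quoted heredoc so the shell leaves it untouched.
strip_jsonc_comments <<'EOF'
{
  // line comment
  "url": "http://example.com/*not*/", /* block
  spanning lines */ "n": 1
}
EOF
```

In the scenario described above, the output would then go to jq, along the lines of `strip_jsonc_comments < settings.jsonc | jq .` (the file name is hypothetical).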

cvi

@Bulb Not sure I would have been brave enough to attempt that in sed myself, but I guess now I know that it's doable. I shall use this new-found power irresponsibly. :-)

Specifically because of the strings -and escapes in the strings-, I'd probably have gone for a very pared-down lexer. More lines of code for sure. Unclear in terms of development and testing time for myself. Would have required an extra tool, so more painful to deploy.

Bulb

@cvi said in Regex confessions/regex challenge:

I'd probably have gone for a very pared-down lexer.

There are plenty of parsers already written. But the point was that this is part of the start-up, so I wanted to do it with just standard Unix tools + jq.

Bulb

@cvi said in Regex confessions/regex challenge:

Specifically because of the strings -and escapes in the strings-

Strings are easy with regex. A string is a quote, then a sequence of either non-special characters or a backslash followed by anything, then a closing quote. Which is just this simple regex: "([^"\\]|\\.)*"

… also the thing would have been simpler if sed had a non-greedy Kleene star, but it does not.
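A quick way to try the string pattern on its own, written with the repetition star and run through grep -oE. The helper name and the sample line are assumptions for illustration, and it relies on a grep that supports -o (GNU grep does). Escaped quotes and escaped backslashes stay inside the match, and a // inside a string is treated as ordinary string content.

```shell
# Hypothetical helper: print every double-quoted string literal found on stdin,
# one per line. The body is either a non-special character or a backslash
# followed by any single character.
json_strings() {
  grep -oE '"([^"\\]|\\.)*"'
}

# Made-up sample: an escaped quote, a // inside a string, an escaped backslash.
printf '%s\n' '{"q": "say \"hi\" // not a comment", "p": "C:\\dir"}' | json_strings
```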

dkf

@Bulb said in Regex confessions/regex challenge:

@cvi said in Regex confessions/regex challenge:

I'd probably have gone for a very pared-down lexer.

There are plenty of parsers already written. But the point was that this is part of the start-up, so I wanted to do it with just standard Unix tools + jq.

If I was in your position, I'd probably make some assumptions about what sorts of comments and strings are in use so that simple REs could be used. Generality is often much more difficult to achieve.

More specifically, if you can assume that all comments are on lines by themselves and have at most only whitespace before the start of the // starter and always proceed to the end of the line, they're very easy to filter. It's getting block comments right or allowing starting at arbitrary locations when you get into the nasty cases. JSONC's support for multiple comment styles looks like something slapped in by someone who is used to commenting in C and/or C++ and who hasn't thought about the consequences for parsers at all.
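Under exactly that assumption (every comment is a // comment alone on its line, preceded by whitespace at most), the filter shrinks to a single delete. This is only a sketch of the restricted case with a made-up function name. It does nothing for block comments or for comments that trail other content.

```shell
# Sketch of the restricted case: drop lines that contain only a // comment.
# Assumes no block comments and no comments after other content on a line.
strip_whole_line_comments() {
  sed -E '/^[[:space:]]*\/\//d'
}

# Made-up sample: the // inside the string is not at the start of the line,
# so that line is kept as it is.
printf '%s\n' '{' '  // whole-line comment' '  "a": "x // y"' '}' | strip_whole_line_comments
```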

HardwareGeek

@dkf said in Regex confessions/regex challenge:

hasn't thought ... at all.

Gribnit

@Bulb said in Regex confessions/regex challenge:

Kleene star

Don't we have a "no correct terms" policy? For those who are offended to their very core, like myself, he's talking about a Nathan Hale.

cvi

@dkf said in Regex confessions/regex challenge:

It's getting block comments right or allowing starting at arbitrary locations when you get into the nasty cases.

Eh. I mean, it's not exactly rocket science either. Making comments easily strippable by regex would be quite far down my list of priorities, definitely after general usability concerns.

Bulb

@dkf said in Regex confessions/regex challenge:

If I was in your position, I'd probably make some assumptions about what sorts of comments and strings are in use so that simple REs could be used. Generality is often much more difficult to achieve.

It would come back to waste a couple of my hours a year down the road, once I'd forgotten which limitations it has. And generality turned out not to be that much more difficult in this case anyway: only strings can hide a comment start sequence, and there is only one style of strings here.

More specifically, if you can assume that all comments are on lines by themselves and have at most only whitespace before the start of the // starter and always proceed to the end of the line, they're very easy to filter.

That was the first version. Then I decided the risk of someone (either me when I forget, or someone else who starts to use the feature) adding a different kind of comment was too high.

It's getting block comments right or allowing starting at arbitrary locations when you get into the nasty cases. JSONC's support for multiple comment styles looks like something slapped in by someone who is used to commenting in C and/or C++ and who hasn't thought about the consequences for parsers at all.

For parsers, newlines are not significant except for terminating C++-style comments.

… hm, I realized one more case I missed: multiple C-style comments on one line. A t0 after the first s command should fix that.
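A sketch of that fix applied to the -E version from earlier in the thread. The function name and the sample line are made up, and it assumes GNU sed. The first substitution removes only one comment per pass because the pattern is anchored at the start of the line; t branches back to the label whenever that substitution succeeded, so it runs again until no complete block comment is left.

```shell
# Sketch only: the -E script from this thread plus the proposed t0.
strip_jsonc_comments_t0() {
  sed -E ':0
    s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\*+/!\1!
    # retry from the top while the substitution above keeps removing comments
    t0
    s!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)//.*!\1!
    \!^(([^"/]|"([^\\"]|\\.)*"|/[^/*])*)/\*([^*]|\*+[^/])*\**$!{
      N
      b0
    }'
}

# Made-up sample: two block comments on the same line.
printf '%s\n' '{"a": 1, /* x */ "b": 2 /* y */}' | strip_jsonc_comments_t0
```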

dkf

@Bulb Yes; you don't have to deal with either multiple string styles or multi-line strings. This isn't YAML after all...

Bulb

@dkf If it was yaml, all parsers would already support comm … but there wouldn't be any easy to install, standardish parser usable from shell.