Regex parsing...fun times

Benjamin Hall

I have a string that I need to parse via regex.

Lightly anonymized format:

Part1: strings Part2: alphanum-with-spaces Part3: string\r\n

except that only one of the parts is guaranteed to be present, and there may be many.

I need to match it into a bunch of capture groups (this regex will be applied globally to the string) of the form

$1 = PartX
$2 = Stuff after the :

So in the test string
Type: Foo (bar) City: Some City Address: This is a complicated address which might be things like 1/4 mile past some-street\r\n
the output should be

[
{ $1: 'Type', $2: 'Foo (bar)'}, {$1: 'City', $2: 'Some City'}, {$1: 'Address', $2: 'This is a complicated address which might be things like 1/4 mile past some-street'}
]

The current regex is /(\w+):(.*?)(?=((\w+:)|(\r\n)))/g and it's running on Perl (so PCRE, I believe). That one fails--it only parses the first two parts. Specifically, it always ignores the last $1:$2 group. If I did something like Type: Foo (bar) City: Some City Address: This is a complicated address which might be things like 1/4 mile past some-street Apartment: any characters\r\n it would match everything up to and including the address, but not the apartment.

Help? And don't ask me to use something other than regex...I can't at this stage.

Gribnit

@Benjamin-Hall looks like you need to either repeat or allow repetition on your last two capture groups, which may mean changing how you deal with the newline.

PleegWat

@Benjamin-Hall Are you absolutely sure the line ending is making it into your match buffer, and that it isn't being converted into \n only somewhere?

ChaosTheEternal

@Benjamin-Hall Testing on the two regex sites I know of, I came up with this, which looks to work for the test lines provided:

(\w+):(.*?)(?=((\w+:)|(\r\n)|$))

I don't actually have a real PCRE regex tester beyond those two sites so I can't be sure if it's actually working.

Benjamin Hall

@PleegWat said in Regex parsing...fun times:

@Benjamin-Hall Are you absolutely sure the line ending is making it into your match buffer, and that it isn't being converted into \n only somewhere?

Working on it a bit more--it seems that the pre-processing (which happens before the parser gets it) is actually converting the two-character \r\n form into the 4-character escaped form (effectively \\r\\n).

Confusing the issue is that our tooling for writing the parser scripts doesn't use the same pre-processing steps--it grabs the raw message from the database itself. So it sees the appropriate pair of characters.

Which explains why this particular parser works just fine when run in the tooling...but doesn't actually parse things correctly in the real pipeline: it's getting subtly different literal characters.
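To make the symptom concrete, here is a minimal sketch that runs the pattern from the original question against both forms of the terminator. It uses Python's `re` module as a stand-in for Perl's engine (an assumption; the pattern only uses syntax common to both), and the names `BODY`, `raw_message`, `escaped_message`, and `keys` are made up for illustration.

```python
import re

# The pattern from the original question, unchanged.
PATTERN = re.compile(r'(\w+):(.*?)(?=((\w+:)|(\r\n)))')

BODY = ('Type: Foo (bar) City: Some City Address: This is a complicated '
        'address which might be things like 1/4 mile past some-street')

raw_message = BODY + '\r\n'        # real CR LF, as the tooling sees it
escaped_message = BODY + '\\r\\n'  # backslash, r, backslash, n: four characters


def keys(message):
    """Return the $1 capture of every match, in order."""
    return [m.group(1) for m in PATTERN.finditer(message)]


keys_raw = keys(raw_message)
keys_escaped = keys(escaped_message)

print(keys_raw)      # ['Type', 'City', 'Address']
print(keys_escaped)  # ['Type', 'City']
```

With the escaped terminator, nothing after the last part satisfies the lookahead, so the lazy `(.*?)` has nowhere to stop and the final part is dropped.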

Dragoon

@Benjamin-Hall

I don't think it can be done in a single regex like you have it. But this should work:

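use strict;
use warnings;

my $base_string = 'Type: Foo (bar) City: Some City Address: This is a complicated address which might be things like 1/4 mile past some-street';

my $loop_string = $base_string;
while (1) {
	# Peel off the first "Key: value" pair; $3 is the rest, starting at the next key.
	if ($loop_string =~ /(\w+):(.*?)(\w+:.*)/) {
		print $1, " ", $2, "\n";
		$loop_string = $3;
	}
	else {
		# Last pair: allow a real CRLF, the escaped 4-character form, or no terminator.
		if ($loop_string =~ /(\w+):(.*?)(?:\r\n|\\r\\n)?$/) {
			print $1, " ", $2, "\n";
		}
		last;
	}
}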

PleegWat

@Dragoon He is probably using Perl's equivalent to PHP's preg_match_all, which returns multiple matches for the same string. One thing to be careful of is that it will not return two overlapping matches, which is probably why @Benjamin-Hall is using a lookahead pattern.

Dragoon

@PleegWat

Yeah, it wasn't working for me because I was ignoring the \r\n (I always chomp those off), and I couldn't get the final segment to match because it has a different pattern from all the other segments. The \r\n actually makes it easier, since you can 'or' it in as a separate final pattern to get around the patterns not being the same. Alternatively, even in the chomp case I think you can make the end-of-string anchor $ work, but I would need to spend more time with it.

Mason_Wheeler

@Benjamin-Hall said in Regex parsing...fun times:

Help? And don't ask me to use something other than regex...I can't at this stage.

Use something other than regex anyway. It is by definition incapable of doing what you want in the general case, and special-case attempts to work around that are known to be pathological cases in regular expression engines, at times bad enough that they create denial of service vulnerabilities in software that uses them.

Gribnit

@Zecc you're too kind - I only help people I know are ignoring me.

dkf

@Benjamin-Hall said in Regex parsing...fun times:

The current regex is /(\w+):(.*?)(?=((\w+:)|(\r\n)))/g and it's running on Perl (so PCRE, I believe).

A bit of experimenting (and bearing in mind that you have weird preprocessing, which was your real problem) leads to the RE: (\w+?):\s+(.*?)(?=\s+\w+:|\\r\\n|$), tested with a different RE engine (not Perl or PCRE) but it is in the common subset. I would personally have stripped the trailing \r\n before feeding it into the matcher, but apparently you aren't. Note that you don't need parens much in lookaheads.
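As a quick check of that RE, the sketch below runs it over the test string from the original question, once with the escaped four-character terminator and once with the terminator stripped. Python's `re` is used here as a stand-in engine (an assumption, in the same spirit as testing with a non-Perl engine), and `BODY` and `pairs` are illustrative names.

```python
import re

# The RE from the post above; \\r\\n matches the literal characters \ r \ n.
PATTERN = re.compile(r'(\w+?):\s+(.*?)(?=\s+\w+:|\\r\\n|$)')

BODY = ('Type: Foo (bar) City: Some City Address: This is a complicated '
        'address which might be things like 1/4 mile past some-street')


def pairs(message):
    """Return ($1, $2) for every match, in order."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(message)]


escaped_result = pairs(BODY + '\\r\\n')  # pre-processed form
stripped_result = pairs(BODY)            # terminator removed beforehand
crlf_result = pairs(BODY + '\r\n')       # real CR LF left in place

for key, value in escaped_result:
    print(f'{key!r}: {value!r}')
```

Both the escaped and the stripped input give the same three pairs. If a real CR LF is left attached, `$` matches just before the final `\n`, so the last value keeps a trailing `\r`, which is one more argument for stripping first.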

The big question about this is what happens when there is a colon in the data in the $2 part.
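A small illustration of that concern, using a made-up input (the sample string and the names below are hypothetical, and Python's `re` again stands in for Perl): any word followed by a colon inside a value is indistinguishable from the start of a new part.

```python
import re

# Same RE as in the post above.
PATTERN = re.compile(r'(\w+?):\s+(.*?)(?=\s+\w+:|\\r\\n|$)')

# Hypothetical message whose Address value contains "Main:".
message = 'City: Some City Address: Corner of Main: north side'

result = [(m.group(1), m.group(2)) for m in PATTERN.finditer(message)]
print(result)
# [('City', 'Some City'), ('Address', 'Corner of'), ('Main', 'north side')]
```

The address is cut short and a spurious `Main` part appears. Unless the set of part names is known in advance and can be listed in the lookahead, the pattern has no way to tell the two cases apart.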

Perl doesn't use PCRE. It has its RE engine much more deeply entwined with the rest of Perl. Arguably, Perl is a bunch of other stuff hacked into a RE engine (and PCRE is an attempt to get the sane part of the syntax in a separate library, done by someone not high on mixed meth and weed the whole time so not 100% into the Perl vibe).

dkf

@Mason_Wheeler said in Regex parsing...fun times:

@Benjamin-Hall said in Regex parsing...fun times:

Help? And don't ask me to use something other than regex...I can't at this stage.

Use something other than regex anyway. It is by definition incapable of doing what you want in the general case, and special-case attempts to work around that are known to be pathological cases in regular expression engines, at times bad enough that they create denial of service vulnerabilities in software that uses them.

This particular case is one that the RE engines support without that, requiring only a fixed depth of recursion in an automata-theoretic engine (exactly 1, for the lookahead, where that lookahead is expressible using a simple DFA). It doesn't need the capability to count or operate a stack of context.

Benjamin Hall

@dkf we ended up just stripping the new lines and using the end of line marker as an alternate. It seemed to work fine.
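The thread doesn't show the final pattern, so the following is only a sketch of that approach under assumptions: strip the terminator (in either its real or escaped form) first, then let `$` stand in for it in the lookahead. It is written with Python's `re` as a stand-in for Perl, and `strip_terminator`, `parse`, and the exact pattern are illustrative, not the code that was actually deployed.

```python
import re

# Assumed pattern: the original lookahead with \r\n replaced by end of string.
PATTERN = re.compile(r'(\w+):\s*(.*?)\s*(?=\w+:|$)')

# Matches a trailing real CR LF or the escaped 4-character form.
TERMINATOR = re.compile(r'(?:\r\n|\\r\\n)$')


def strip_terminator(message):
    """Remove one trailing terminator, whichever form it arrived in."""
    return TERMINATOR.sub('', message)


def parse(message):
    """Return ($1, $2) pairs from a message with its terminator removed."""
    cleaned = strip_terminator(message)
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(cleaned)]


BODY = ('Type: Foo (bar) City: Some City Address: This is a complicated '
        'address which might be things like 1/4 mile past some-street '
        'Apartment: any characters')

from_raw = parse(BODY + '\r\n')
from_escaped = parse(BODY + '\\r\\n')
print(from_raw == from_escaped)  # True
```

Stripping first means the pattern no longer cares which form of line ending the pre-processing produced, so the tooling and the real pipeline see the same input.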

Gribnit

@Benjamin-Hall said in Regex parsing...fun times:

@dkf we ended up just stripping the new lines and using the end of line marker as an alternate. It seemed to work fine.

Take that, arbitrary-domain concerns!

Bulb

@Benjamin-Hall said in Regex parsing...fun times:

running on Perl (so PCRE, I believe)

No, Perl has its own context-free-grammar-with-regular-expression-syntax parser engine. PCRE is a separate library that implements only the actually regular expression parts, with almost compatible syntax.

@Mason_Wheeler said in Regex parsing...fun times:

@Benjamin-Hall said in Regex parsing...fun times:

Help? And don't ask me to use something other than regex...I can't at this stage.

Use something other than regex anyway. It is by definition incapable of doing what you want

This is not true in context. Perl's regular expression engine does support recursive matching, including the ability to build parse trees if you pull out the ultimate cannon (embedded code) and are capable of twisting and bending your brain enough to make it work.

Gribnit

@Bulb ... soon.