Efficiently parsing a string that can vary in format (C#)

Benjamin Hall

So as part of a project, I have strings that are of one of these forms:

#d#+#, #d#-#, or #d#

What's the simplest (in code terms, performance isn't super critical) to parse out the three numbers (including the sign)?

For example,

3d10+5 => (3, 10, 5);
3d10-5 => (3, 10, -5);
3d10 => (3, 10, 0);

Where the output format really doesn't matter, except that I'd like to be able to assign them to separate variables to pass into a constructor.

Regex? String.split?

hungrier

A regex would probably be the way to go. Off the top of my head it looks like (\d+)d(\d+)(\+(\d+)|(-\d+))?

Benjamin Hall

@hungrier said in Efficiently parsing a string that can vary in format (C#):

(\d+)d(\d+)(+(\d+)|(-\d+))?

Thanks, that looks like it will work.

Magus

@benjamin-hall The main advantage of using regex for this is that you can enforce the entire format easily.

Dreikin

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

@hungrier said in Efficiently parsing a string that can vary in format (C#):

(\d+)d(\d+)(+(\d+)|(-\d+))?

Thanks, that looks like it will work.

Is there any chance there will be other symbols like a decimal separator or currency symbol in there? Or exponent-type number strings (16e8, for example)?

Benjamin Hall

@dreikin said in Efficiently parsing a string that can vary in format (C#):

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

@hungrier said in Efficiently parsing a string that can vary in format (C#):

(\d+)d(\d+)(+(\d+)|(-\d+))?

Thanks, that looks like it will work.

Is there any chance there will be other symbols like a decimal separator or currency symbol in there? Or exponent-type number strings (16e8, for example)?

Shouldn't be. These are dice expressions for D&D which have the fixed form NdM+Mod, where N is an integer > 0, M is the # of sides (conventionally 4, 6, 8, 10, 12, or 20), and Mod is an integer (of either sign).

Thus, 3d10+3 is Roll and sum 3 10-sided dice, and add 3. Anything else is an error.

pie_flavor

Maximum efficiency.

[DllImport("msvcrt.dll", CharSet = CharSet.Ansi, CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.Bool)]
public static extern bool sscanf(string data, string format, __arglist);

//...
// string s;
int i1 = 0, i2 = 0, i3 = 0;
if (sscanf(s, "%dd%d+%d", __arglist(out i1, out i2, out i3))) return (i1, i2, i3);
if (sscanf(s, "%dd%d-%d", __arglist(out i1, out i2, out i3))) return (i1, i2, i3);
if (sscanf(s, "%dd%d", __arglist(out i1, out i2))) return (i1, i2, 0);
throw new ArgumentException();

edit: ofc highlight.js doesn't think __arglist is a keyword

Unperverted Vixen

@pie_flavor Maximum runtime efficiency assuming that you're running full framework or .NET Core on Windows. If you're running on Mono, or .NET Core on another platform, that kinda won't work.

Also, as bad as regexes are, they're better than doing interop with unmanaged code.

pie_flavor

@unperverted-vixen said in Efficiently parsing a string that can vary in format (C#):

@pie_flavor Maximum runtime efficiency assuming that you're running full framework or .NET Core on Windows. If you're running on Mono, or .NET Core on another platform, that kinda won't work.

Also, as bad as regexes are, they're better than doing interop with unmanaged code.

I wasn't aware that I had to use the /s moniker to indicate sarcasm.

Unperverted Vixen

@pie_flavor Depends on context. In the Coding Help category, I don't think it's unfair to take "assistance" at face value.

pie_flavor

@unperverted-vixen Fair.

anotherusername

Maximum speed efficiency would probably be a simple loop... if C# was Javascript, I'd do this:

function parse(str) {
  var a = [0, 0, 0], n = 0, d = 0, m = 0;
  for (var i = 0; i < str.length; ++ i) {
    var c = str.charAt(i);
    if (c >= '0' && c <= '9') a[n] = a[n] * 10 + str.charCodeAt(i) - 48;
    else if (c == 'd' && !d) { ++ n; d = 1; }
    else if (n == 1 && (c == '+' || c == '-')) { ++ n; m = c == '-'; }
    else throw 'invalid format';
  }
  if (m) a[2] = -a[2];
  return a;
}

PleegWat

@anotherusername said in Efficiently parsing a string that can vary in format (C#):

Maximum speed efficiency would probably be a simple loop... if C# was Javascript, I'd do this:

function parse(str) {
  var a = [0, 0, 0], n = 0, d = 0, m = 0;
  for (var i = 0; i < str.length; ++ i) {
    var c = str.charAt(i);
    if (c >= '0' && c <= '9') a[n] = a[n] * 10 + str.charCodeAt(i) - 48;
    else if (c == 'd' && !d) { ++ n; d = 1; }
    else if (n == 1 && (c == '+' || c == '-')) { ++ n; m = c == '-'; }
    else throw 'invalid format';
  }
  if (m) a[2] = -a[2];
  return a;
}

This kind of fiddlyness, it's usually better to use standard functions when appropriate, but avoid functions like scanf which need to do their own parsing step.

int scan_dice( char * instr, int * ndice, int * nsides, int * mod )
{
    if( !isdigit(*instr) )
        return -1;
    *ndice = strtol( instr, &instr, 10 );
    if( *instr++ != 'd' )
        return -1;
    if( !isdigit(*instr) )
        return -1;
    *nsides = strtol( instr, &instr, 10 );
    *mod = strtol( instr, &instr, 10 );
    if( *instr )
        return -1;
    return 0;
}

cvi

@pie_flavor said in Efficiently parsing a string that can vary in format (C#):

Maximum efficiency.
</snip>

Why make it so complicated? sscanf() returns the number of successfully matched thingies (and leaves the unmatched arguments unchanged). So

...
int a, b, c = 0;
int ret = sscanf( s, "%dd%d%d", __arglist(out a, out b, out c));
if( 2 == ret || 3 == ret )
    return (a,b,c);
throw new ArgumentException();

should do it. (Caveat: I don't do C#, so I don't know what will happen to out parameters that aren't set by the function; the C-equivalent does work, though.)

dkf

@pie_flavor said in Efficiently parsing a string that can vary in format (C#):

Maximum efficiency.

Doesn't check for characters after the text that you're after. The result of sscanf() is int precisely because it is the number of format groups that have been parsed. The subtleties of this sort of thing are why parsing jobs are often actually easier with regexps; it's actually what they're designed to do. However, if you're going to insist on doing parsing with sscanf(), remember to put a %c group at the end of the pattern and check that it hasn't been assigned a value by looking at the return code. Almost nobody does that, and almost everyone's parsing code is broken.

Benjamin Hall

And this reminds me why I'm glad I don't work on any of

Performance critical code
C or C++ code
Anything that needs scanf or relatives.

That all looks like an electric cat barfed on the screen.

cvi

@dkf said in Efficiently parsing a string that can vary in format (C#):

remember to put a %c group at the end of the pattern and check that it hasn't been assigned a value by looking at the return code

Also, this.

I was aiming for bug-to-bug compatibility, but didn't even quite achieve that. The original code will return (5,0,0) for a string like "5bugs", mine will throw an exception. ;-)

cvi

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

That all looks like an electric cat barfed on the screen.

IMO, most parser code looks like electric-cat-barf. Maybe if you write a proper lexer-definition or even some BNF variant you could make it look organized, but that seems like overkill for this problem.

dkf

@cvi said in Efficiently parsing a string that can vary in format (C#):

IMO, most parser code looks like electric-cat-barf.

You can usually constrain the awful to a very small piece of code. With a RE of (\d+)d(\d+)([-+]\d+)? (and matching with anchoring switched on; I don't know if there's a default for that in C#) you can make the rest of the code really easy.

Benjamin Hall

@dkf said in Efficiently parsing a string that can vary in format (C#):

@cvi said in Efficiently parsing a string that can vary in format (C#):

IMO, most parser code looks like electric-cat-barf.

You can usually constrain the awful to a very small piece of code. With a RE of (\d+)d(\d+)([-+]\d+)? (and matching with anchoring switched on; I don't know if there's a default for that in C#) you can make the rest of the code really easy.

Yeah. I think I'm even going to move this chunk over to a part that only gets called when the user loads a stored file (and then the rest of the time changes happen in memory until saved) rather than persisting/loading on-the-fly.

The situation is that I'm interfacing with some (unchangeable) XSD-based classes that expect stringly-typed data in this particular format. But I'd like to be able to generate these items from the pieces, since usually part of them is calculated based on other numbers. I could just read it in and write it out as a string and only validate the format, but that feels icky and a pain to use.

Gurth

@pleegwat said in Efficiently parsing a string that can vary in format (C#):

int scan_dice( char * instr, int * ndice, int * nsides, int * mod )
{
    if( !isdigit(*instr) )
        return -1;
    *ndice = strtol( instr, &instr, 10 );
    if( *instr++ != 'd' )
        return -1;
    if( !isdigit(*instr) )
        return -1;
    *nsides = strtol( instr, &instr, 10 );
    *mod = strtol( instr, &instr, 10 );
    if( *instr )
        return -1;
    return 0;
}

Unless I’m mistaken, this would fail on something like “d8” rather than “1d8” as well as on “1D8”, which all mean the same thing to the average gamer. Not a problem if you control the input, but if it’s to take user input you may want to make sure it handles all of these.

PleegWat

@gurth Which is per the spec. Alternatively, replace the first three lines by:

if( !isdigit(*instr) )
    *ndice = 1;
else
    *ndice = strtol( instr, &instr, 10 );

blakeyrat

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

Performance critical code

I'll bet you $50 this is not performance critical code.

Benjamin Hall

@blakeyrat said in Efficiently parsing a string that can vary in format (C#):

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

Performance critical code

I'll bet you $50 this is not performance critical code.

Right. That's why I said I'm glad I don't work with such code. Nothing I do is performance sensitive--it's all dominated by input time (or sometimes by network latency).

blakeyrat

@benjamin-hall Oh sorry I guess I misread you.

Ben_Warre

I would probably look at it more in context.

Conceivably, (not a tt gamer) you could need to roll two types of dice so the string maybe 3d8/1d6+2 or something.

I would search the string for tolower('d') and then look left and right from there.

But a regexp works too.

Parody

If you'd like to see some awesome(-ly y?) dice roll parsing code in Javascript, a couple years ago I put together this:

ESG Dice Roller

It almost has instructions! It's got not-entirely-unlike-objects! It's got a workaround for old versions of IE because it was made to be embedded into a WinForms program!

Magus

C# regex apparently supports variable negative lookaround or whatever its called, that nothing else seems to support. So you can do weird things with it.

masonwheeler

@magus said in Efficiently parsing a string that can vary in format (C#):

variable negative lookaround

That sounds like a fun band name...

pie_flavor

@magus Yeah, .NET regex is pretty much the best. The only other regex engine that does it is the JGSoft one, and I don't think you can use that in your own applications.

masonwheeler

@pie_flavor said in Efficiently parsing a string that can vary in format (C#):

Yeah, .NET regex is pretty much the best regex system.

The best house in a bad neighborhood...

pie_flavor

@masonwheeler By all means, make a better system.

AlexMedia

Regexes rock.

anotherusername

@masonwheeler said in Efficiently parsing a string that can vary in format (C#):

@pie_flavor said in Efficiently parsing a string that can vary in format (C#):

Yeah, .NET regex is pretty much the best regex system.

The best house in a bad neighborhood...

Regexes are by far the best tool for those times when the only tool you have is a hammer. They're occasionally even the best tool for when the hammer isn't the only tool you have.

Benjamin Hall

@ben_warre said in Efficiently parsing a string that can vary in format (C#):

I would probably look at it more in context.

Conceivably, (not a tt gamer) you could need to roll two types of dice so the string maybe 3d8/1d6+2 or something.

I would search the string for tolower('d') and then look left and right from there.

But a regexp works too.

Fortunately, I don't need any such capabilities. They're all NdM(+|-)X. In fact, I could probably get away with just handling the NdM case for the actual interface code and do the rest in application code. But I tend to wear complicator's gloves.

If anyone wants to see how bad my code is, the repo is on github.

heterodox

@masonwheeler said in Efficiently parsing a string that can vary in format (C#):

That sounds like a fun band name...

I saw a comment like that on a blog post recently. :P

ScholRLEA

@anotherusername Actually... no. Regexs as a rule - and compiled regexs in particular - are much faster than an ad-hoc parser for anything that is more complicated than a simple split. The typical approach to implementing PCRE is to compile it to a fast deterministic finite-state automaton; this can in turn by instantiated as a combination of jump tables and branches, with each step only testing for the matches which have paths from the current state. It's basically a specialized sub-set of the technology used for lexical analyzer generators such as Lex and JavaCC.

In a line-compiled or block-compiled interpreter such as Perl (where each line or lexical block is compiled to an intermediate representation when the interpreter reaches it the first time in the program run, which is cached to be used on any subsequent calls or loop passes), it's compiled at runtime.

Most compiled languages either precompile any constant regex strings implicitly (and may or may not try to stub any constant sub-strings in a string built at runtime), or require the regex to be explicitly compiled by calling a regex parser. Some block-compiled languages which support regexs through a library rather than as a language feature (e.g., Python) follow the latter route as well.

And then there is CL-PPCRE, which is... interesting. If you are curious, Let Over Lambda has a whole section on the macrology used in the implementation. However, you would need to know the basics of Common Lisp read macros to understand it, and even then you would probably need to read the prior parts of the book as well. I have a suspicion not many here will bother, and those who would, have probably already read it. TL;DR: Read-macros make it absurdly easy to blur the line between 'library' and 'language feature', so CL-PPCRE basically does the Java/Python 'library call to a regex compiler' approach but makes it look more like Perl's built-in regexs.

Salamander

I generally find that as a rule, if the word 'backtracking' or 'backreference' appears anywhere in the documentation, then you need to pay very close attention to the pattern, or you can get into Fun that takes minutes or may not even return on some inputs.
Like in javascript (in chrome, at least), I have to kill the tab because this seems to go on forever: /^(a|aa)*b$/.test('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

Zecc

@benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

Nothing I do is performance sensitive

There's always developer performance. Look for the code that is easier to maintain.

dkf

@scholrlea said in Efficiently parsing a string that can vary in format (C#):

The typical approach to implementing PCRE is to compile it to a fast deterministic finite-state automaton

That in turn depends on the complexity of the RE. Mixed greediness REs (and some of the other capabilities of PCRE) are typically implemented using recursive calls, which has a whole bunch of tricky consequences. RE engines are pretty high up the scale of “scary code ”.

Most compiled languages either precompile any constant regex strings implicitly (and may or may not try to stub any constant sub-strings in a string built at runtime), or require the regex to be explicitly compiled by calling a regex parser.

There are other approaches that can also work well, such as LRU caches (with most applications being happy with a cache bound of around 20 or so, though that'd obviously be a tuneable factor). Also, the internal compilation of an RE varies quite a bit; simple bytecodes are common, but it can even be as complex as direct machine code.

dkf

@salamander said in Efficiently parsing a string that can vary in format (C#):

Like in javascript (in chrome, at least), I have to kill the tab because this seems to go on forever: /^(a|aa)*b$/.test('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

Recursive RE engines hate that sort of thing! Automata-theoretic ones chew it up and spit it out without any hesitation, but get unhappy with REs that have very large DFA expansions. It tends to be a trade-off between RE compilation time and RE execution time, with recursive engines having the advantage at compilation and automata-based engines having the advantage at runtime.

anotherusername

@scholrlea said in Efficiently parsing a string that can vary in format (C#):

@anotherusername Actually... no. Regexs as a rule - and compiled regexs in particular - are much faster than an ad-hoc parser for anything that is more complicated than a simple split.

https://jsperf.com/regex-performance-vs-ad-hoc-parser-performance

ScholRLEA

@anotherusername That's rather surprising, TBH. I would have expected, in the absence of the kind of complex multiple/recursive paths already mentioned, that the regex would be a lot faster.

I would be interested in seeing the difference with other regex implementations, though. However, for the JavaScript implementations I've tested, the ad-hoc parser does does seem consistently beat the regex one. The result I got for Firefox 56 (Mint Linux 18.2, 8GB RAM) was 37% slower. It was even worse on Opera 44 (same platform) was even worse, 50% slower. Chromium 62 (same platform) got 24% slower.

I'll try it with Perl, Java, and CL-PPCRE later, but so far it is not looking good for PCREs. Interesting.

anotherusername

@scholrlea I tried modifying my parser, basically unrolling the loop into 3 separate loops (one for each value being parsed out) but using the same loop counter for all 3, so it still only looped across the whole string once and it didn't have to go through that if-statement inside the loop. Surprisingly, then it was actually around 15% slower... if I was writing assembly code instead of Javascript, I'd expect the opposite result. Maybe the cost of JIT-compiling the extra Javascript code outweighed the speed increase that I expected.

I also tried using the shortcut of (a << 3) + (a << 1) instead of a * 10, but it made no noticeable difference. Probably whatever was gained by eliminating the multiplication was lost in the extra type casts it required the JS engine to perform.

Magus

When I was trying to write a language translator for my company's 30yo language to C#, the weird .NET negative lookaround stuff saved my life, because of the weirdness of the syntax of that crazy language.

anotherusername

@magus for those times when you absolutely need negative lookahead, and your regex engine doesn't have it, there's always

/b(?=a)/.test('ab'.split('').reverse().join(''))

Magus

Went looking for some of the ones I used, and I found these:

private static readonly Regex containsRegex = new Regex("(?!{[^{]*}){");
private static readonly Regex notContainsRegex = new Regex("}(?<!{[^}]*})");
private static readonly Regex variableRegex = new Regex("\\^\\(\\w+\\)>");

And also

var pattern = new Regex("(?!" + literal + "\\w)" + literal + "(?<!\\w" + literal + ")");

Which all look super weird.

The language I'm parsing is super weird itself, though.

Benjamin Hall

@magus said in Efficiently parsing a string that can vary in format (C#):

Went looking for some of the ones I used, and I found these:
private static readonly Regex containsRegex = new Regex("(?!{[^{]*}){");
private static readonly Regex notContainsRegex = new Regex("}(?<!{[^}]*})");
private static readonly Regex variableRegex = new Regex("\\^\$\\w+\$>");
And also
var pattern = new Regex("(?!" + literal + "\\w)" + literal + "(?<!\\w" + literal + ")");
Which all look super weird.

The language I'm parsing is super weird itself, though.

Yeah, that's brain hurt material there. Line noise regexes ftw.

ScholRLEA

@anotherusername I did note a small problem with your algorithm for producing a test string: it generates a random die size in (2, 32). This matches the original problem description, but the description itself is bad, as the usable die types for D&D are actually the set {4, 6, 8, 10, 12, 20, 100}.

This assumption also affects the regex. A more sensible (if more complicated) version might be

/^(\d*)d[4|6|8|10|12|20|100]([-+]\d+)?$/i

Since the middle substring had a fixed set of options rather than an open-ended one, this would probably have an impact on performance.

This assumes that the string input @Benjamin-Hall is parsing is meant to represent actual die rolls.

This brings up a more serious question, though: why the hell is the input coming in as a string in the first place? Seriously, it makes no damn sense to me. I get that this is the canonical way to write the die values in text, but this isn't text, it's a command line. Is it too much to ask that you write it as separate values or as named options?

It would be even worse if this were a GUI, though; not doing them as an integer entry box, followed by an option box, followed by a plus/minus select, followed by another integer entry box, would be Swampy level idiocy. You would be making more work for yourself for no real advantage to the users.

masonwheeler

@dkf said in Efficiently parsing a string that can vary in format (C#):

Recursive RE engines hate that sort of thing! Automata-theoretic ones chew it up and spit it out without any hesitation, but get unhappy with REs that have very large DFA expansions. It tends to be a trade-off between RE compilation time and RE execution time, with recursive engines having the advantage at compilation and automata-based engines having the advantage at runtime.

This can be generalized: as a general rule, optimizing something takes longer to compile but produces more efficient code. (Among other things, this is why good AOT beats out JIT every time, despite the advantages live code analysis can bring to a JIT. The JIT needs to compile quickly, which means it's required to produce crappy optimization.)