Efficiently parsing a string that can vary in format (C#)



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    This brings up a more serious question, though: why the hell is the input coming in as a string in the first place?

    Because the moronic XML format that I'm interoping with (I'm writing a desktop front-end that creates files to be imported into an iPad app written by someone else) requires them as strings. Most of the time, it's a one-way process--take user input from a GUI and write out an XML file. But I'd like to be able to load saved files, so I need a way to parse the XML into a more sane data structure.

    On my side of the code, it's represented as a pseudo-immutable struct

    public struct DiceValue
    {
      public Sides {get; private set;}
      public Number {get; private set;}
      public Modifier {get; private set;}
    
      //snip lots of equality code, ToString implementation, constructors
    }
    

  • Impossible Mission - B

    @magus said in Efficiently parsing a string that can vary in format (C#):

    When I was trying to write a language translator for my company's 30yo language to C#, the weird .NET negative lookaround stuff saved my life, because of the weirdness of the syntax of that crazy language.

    :wtf:

    If you're trying to parse a programming language complicated enough to be in use as a general-purpose language, you almost certainly don't want regexes anywhere near it. Build a formal grammar and use a recursive parser of some sort.



  • @scholrlea it shouldn't really matter for the purposes of just giving the parsing function something valid to nom on. I had to pick numbers; any numbers should do. I ensured that it at least will generate multi-digit numbers sometimes, to make sure that the parsers can handle them, but it really doesn't change the performance any. That part of the code is the same for both types of parser and it's not timed anyway.

    Anyway, I modified it to create die sizes in the range (2, 100) instead of (2, 32).



  • @anotherusername said in Efficiently parsing a string that can vary in format (C#):

    @scholrlea it shouldn't really matter for the purposes of just giving the parsing function something valid to nom on. I had to pick numbers; any numbers should do. I ensured that it at least will generate multi-digit numbers sometimes, to make sure that the parsers can handle them, but it really doesn't change the performance any. That part of the code is the same for both types of parser and it's not timed anyway.

    Anyway, I modified it to create die sizes in the range (2, 100) instead of (2, 32).

    And for this purpose, nothing in the code or XML checks to see if that's a valid dice. It's also very possible (and even happens sometimes) to use things like a d3 (roll a d6, cut in half) or a dX, where X is an arbitrary integer >= 2.



  • @anotherusername You are missing the point; it isn't a range at all, and the fact that it would be a small fixed number of options would definitely impact regex performance. It would go from a Kleene recursion to a simple match, and simple matches are almost certainly going to be much faster.

    I expect you missed this since I edited it while you were replying, but this is a more appropriate regex:

    /^(\d*)d[4|6|8|10|12|20|100]([-+]\d+)?$/i


  • @benjamin-hall sigh I should've expected something like that. Ugh, you have my sympathies.

    I retract my comments.


  • Java Dev

    @benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

    roll a d6, cut in half

    That reminds me of the scene in The Last Hero where Fate challenges Cohen the Barbarian to roll a 7 on a 6-sided die, and he succeeds by cutting the die in half in mid-air.



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall sigh I should've expected something like that. Ugh, you have my sympathies.

    I retract my comments.

    Yeah. Specifically, I needed to parse things like

    $string|$integer|$diceExpression (with literal | characters as delimiters). Splitting based on the pipe is easy, parsing the dice expression is not as much so.



  • @benjamin-hall I won't bother doing the tests, then, either, because frankly, you have bigger problems to worry about than performance at this point. Is the any way to apply Remote Strangulation Protocol to the original author?



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall I won't bother doing the tests, then, either, because frankly, you have bigger problems to worry about than performance at this point. Is the any way to apply Remote Strangulation Protocol to the original author?

    Nope. As I said, none of this is performance sensitive. If it takes 30 ms vs 300 ms, it's still waiting for input 99+% of the time. The app works well enough--but the import function seems to think that you'll be writing the XML by hand. And ain't nobody got the patience for that.



  • @masonwheeler I use them in specific scenarios, to recognize certain types of tokens. Regex is certainly not my general solution to language parsing: I do the recursive stuff separately for the most part.



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    @anotherusername You are missing the point; it isn't a range at all, and the fact that it would be a small fixed number of options would definitely impact regex performance. It would go from a Kleene recursion to a simple match, and simple matches are almost certainly going to be much faster.

    I expect you missed this since I edited it while you were replying, but this is a more appropriate regex:

    /^(\d*)d[4|6|8|10|12|20|100]([-+]\d+)?$/i
    

    Both parsers work exactly the same way: they match any sequence of decimal digits for the dice sides value. Tweaking the regex as you describe means you wouldn't be comparing apples to apples anymore.

    Also, those should be parentheses, not square brackets. It's matching a subexpression, not just a single character, and it needs to be captured.

    Nevertheless, I modified the tests to add that regex as a third test (with parentheses instead of braces), and it actually got slower. But then, Javascript isn't a compiled language; it's JIT-compiled. A more complicated regex probably costs you more in terms of performance than you might save by reducing the range of values that the regex is looking for. 🤷♂



  • @anotherusername just for the sake of comparison, I created global regex objects and had the functions reuse the same regex object instead of constructing a new one each time. That way, I simulate precompilation as best I can in Javascript (and the same regex can be reused over and over, so it really only needs to be constructed once).

    It sped them up somewhat, but the non-regex parser still outperforms them easily.

    It did result in both regexes running at close to the same speed inside the timed section, though. I'm assuming the latter regex took longer to parse.

    edit: done tweaking the tests, for now. Reload the jsperf link and run them again if you're interested in seeing what I did.



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    This matches the original problem description, but the description itself is bad, as the usable die types for D&D are actually the set {4, 6, 8, 10, 12, 20, 100}.

    Those are the physical sizes for dice supported by the rules, but there's also many times where you can use them to just make arbitrary decisions. With physical dice there's usually some kind of grouping or re-roll policy, such as a d2 is equivalent to rolling odds/evens on a d6.
    On a computer, I'd see no reason to prevent that.



  • @pleegwat said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

    roll a d6, cut in half

    That reminds me of the scene in The Last Hero where Fate challenges Cohen the Barbarian to roll a 7 on a 6-sided die, and he succeeds by cutting the die in half in mid-air.

    Damn, :hanzo:d. I was going to say, “What’ll you do if Cohen the Barbarian decides to play in your game?”



  • @gurth said in Efficiently parsing a string that can vary in format (C#):

    @pleegwat said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

    roll a d6, cut in half

    That reminds me of the scene in The Last Hero where Fate challenges Cohen the Barbarian to roll a 7 on a 6-sided die, and he succeeds by cutting the die in half in mid-air.

    Damn, :hanzo:d. I was going to say, “What’ll you do if Cohen the Barbarian decides to play in your game?”

    Let him do whatever he wants? I'm not stupid =)



  • @benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

    Yeah. Specifically, I needed to parse things like

    $string|$integer|$diceExpression (with literal | characters as delimiters). Splitting based on the pipe is easy, parsing the dice expression is not as much so.

    Why not just lowercase it split it on d? (You may need to add a 0 in front if d8 results in ['8'] rather than ['', '8'] — I have no idea) Then if the die type contains a +, -, or whatever other operators you want, split that as well on the operator? Since you already said performance isn’t an issue, this might take a few steps but sounds a lot simpler than trying to tinker with a regex that does everything.



  • @gurth said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall said in Efficiently parsing a string that can vary in format (C#):

    Yeah. Specifically, I needed to parse things like

    $string|$integer|$diceExpression (with literal | characters as delimiters). Splitting based on the pipe is easy, parsing the dice expression is not as much so.

    Why not just lowercase it split it on d? (You may need to add a 0 in front if d8 results in ['8'] rather than ['', '8'] — I have no idea) Then if the die type contains a +, -, or whatever other operators you want, split that as well on the operator? Since you already said performance isn’t an issue, this might take a few steps but sounds a lot simpler than trying to tinker with a regex that does everything.

    The regex posted up at the top does everything I want it to do, actually. I've run it through test cases and so far all green.



  • @benjamin-hall What if someone wants to roll 2D6 + 4 exactly like that? :)



  • @gurth said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall What if someone wants to roll 2D6 + 4 exactly like that? :)

    Not my problem--the user won't be entering the dice expressions directly. They'll be giving the components (number, dice size, and static modifier); the format produced is the one the app expects and the one output by the ToString() method on the DiceValue class.



  • @benjamin-hall In that case I’m kind of wondering why you need to tear apart a string at all. If it’s for internal use only, why not just do something like:

    function dice(sides, number=1, modifier=0) {
    // random stuff here
    }
    

  • Discourse touched me in a no-no place

    @masonwheeler said in Efficiently parsing a string that can vary in format (C#):

    The JIT needs to compile quickly, which means it's required to produce crappy optimization.

    That depends on whether the host environment allows live code substitution. If it does, the JITter can fire up the heavy analysis in the background and that'll produce its results eventually (while the quickly-done stuff is used in the meantime). Ultimately, the best results come from being able to do whole-program level optimisation (as that lets everything be made concrete), but most of the best algorithms are superlinear so there's also a danger that a whole program optimisation run will be practically intractable.



  • @gurth because I want to support loading (and editing) a saved XML document. Since I'd rather not have two persistence approaches or carry around a database, I have to be able to get it back out of the string form the ultimate consumer (the iPad app) expects.



  • @anotherusername Gaah, the brackets thing was... I dunno what, I got confuzzled I guess. Mii r teh dumazz.



  • @gurth said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall In that case I’m kind of wondering why you need to tear apart a string at all. If it’s for internal use only, why not just do something like:

    function dice(sides, number=1, modifier=0) {
    // random stuff here
    }
    

    That was basically my question to @Benjamin-Hall earlier. IIUC, the problem isn't user input (as I supposed it was), it's reading and formatting an XML document that is being sent to another author's iOS app, which happens to require the data be passed and saved in a badly-designed XML schema which he didn't design and can't change.

    So @Benjamin-Hall can get the user input and write it as he pleases when saving his generated XML doc, but then if he needs to read the data again in his program, he needs to read the string saved in this crappy XML format.



  • @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    @gurth said in Efficiently parsing a string that can vary in format (C#):

    @benjamin-hall In that case I’m kind of wondering why you need to tear apart a string at all. If it’s for internal use only, why not just do something like:

    function dice(sides, number=1, modifier=0) {
    // random stuff here
    }
    

    That was basically my question to @Benjamin-Hall earlier. IIUC, the problem isn't user input (as I supposed it was), it's reading and formatting an XML document that is being sent to another author's iOS app, which happens to require the data be passed and saved in a badly-designed XML schema which he didn't design and can't change.

    So @Benjamin-Hall can get the user input and write it as he pleases when saving his generated XML doc, but then if he needs to read the data again in his program, he needs to read the string saved in this crappy XML format.

    I need to write it in this crappy format as well, but that's simple compared to parsing. Formatted strings FTW.

    The XML is being imported by the app (that flow is one-way. No export). I'm pretty sure it's ad-hoc, because all they gave me was a page (in the app, not in any copy-pastable/machine accessible instructions anywhere) that described the expected format and had instructions on how to start an xml document (meaning I'm pretty sure they intended for it to be written by hand. Ugh).

    Edit: I reverse-engineered a schema based on the instructions to build the serialization classes, but they's uuuuuugly. So I'm trying to hide them from the rest of the application, and especially from the GUI.



  • @salamander said in Efficiently parsing a string that can vary in format (C#):

    @scholrlea said in Efficiently parsing a string that can vary in format (C#):

    This matches the original problem description, but the description itself is bad, as the usable die types for D&D are actually the set {4, 6, 8, 10, 12, 20, 100}.

    Those are the physical sizes for dice supported by the rules, but there's also many times where you can use them to just make arbitrary decisions. With physical dice there's usually some kind of grouping or re-roll policy, such as a d2 is equivalent to rolling odds/evens on a d6.
    On a computer, I'd see no reason to prevent that.

    D&D rules also use d2's and d3's. Without, afaik, detailing how you're supposed to convert other dice to them (possibly because anyone can find solutions after thinking a bit about it). There are 2: division (high/low for d2, 1-2/3-4/5-6 for d3) and modulo (odd/even for d2, 1,4/2,5/3,6 for d3). Both are equally valid, provided you round up in the division. If you need weird dice size because you're dealing with stuff like Mirror Images (and need arbitrary 1 in X probabilities), you need to make sure all outcomes are equally likely by rerolling some results .


Log in to reply