Regexp to split strings at spaces, unless in quotes

  • I've been reading documentation and searching for a couple of hours in an attempt to write a regular expression that splits a string wherever a space occurs, unless it's in quotes:

    "foo bar baz "qux quux"" should be split into "foo", "bar", "baz", "qux quux"

    The conditional (?(?=".*")"|\ ) seems like it should work, as splitting at " gives

    "foo bar baz", "qux quux"

    and splitting at space gives

    "foo", "bar", "baz", ""qux", "quux""

  • The forum software seems to have posted an old version of my message. WTF?

    These are Perl-style regular expressions.

    I'm sure I'm missing something obvious, but regular expressions look like line-noise to me.

  • The trick is not to split on what you don't want, but to select what you DO want.

    Since all I have is JS:

    var foo = 'foo bar "sdkgyu sdkjbh zkdjv" baz "qux quux" skduy "zsk"';
    var re = /[^" ]+|("[^"]*")/g;
    var parts = foo.match(re);  // each token; quoted ones keep their quotes


    I'm sure you're capable of adapting it to Perl.
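
    A minimal sketch of the match-instead-of-split idea, assuming the goal is the unquoted tokens from the original question. The name tokenize is made up for illustration, and the pattern is the one above with the capture group moved inside the quotes so they can be stripped.

```javascript
// Hypothetical helper: collect tokens by matching them rather than splitting.
function tokenize(input) {
  // Either a quoted run (contents captured in group 1) or a run of
  // characters that are neither quotes nor spaces.
  var re = /"([^"]*)"|[^" ]+/g;
  var tokens = [];
  var m;
  while ((m = re.exec(input)) !== null) {
    // Group 1 is defined only when the quoted alternative matched.
    tokens.push(m[1] !== undefined ? m[1] : m[0]);
  }
  return tokens;
}

console.log(tokenize('foo bar baz "qux quux"'));
// → [ 'foo', 'bar', 'baz', 'qux quux' ]
```

    An unpaired quote is simply skipped here, which may or may not be what you want.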

  • PS.

    I make no claims about being a Regex master or summat. Better solutions are severely welcome.

    drops "welcome" mat on doorstep

  • I'd try
    [I came up with that a little too quickly, so it might not be right.  It did work fine on a few test strings, though, and it's pretty close if not]
    This isn't a perfect solution.  First, it'll be bloody slow on any decent amount of text, because it has to rematch the remainder of the string repeatedly.  Second, it matches quotes from the END of the string, not the beginning, because Perl 5 doesn't support variable-length lookbehinds.  Doing something with (?? would probably be easier, but you didn't say you were using Perl, only that the regex style was Perl.
    I think you'll be better served by a custom split routine.

    Think about it like this: how do YOU tell if you're in the middle of a set of quotes or not?  You count the number of quote marks to one side; if it's an even number, you aren't, else you are.  So, with a modification of your example string,
    'foo bar "sdkgyu sdkjbh zkdjv" marie is hot foo baz "qux quux" bam "aasdf" asdf asdf', the space after foo matches, because there are 6 quote marks left, which means my assertion matches.  The one inside "qux quux", however, does not, as there are 3 quote marks left and the assertion doesn't match.  So long as your text has a balanced number of quotes, you can match from either side.

    m/\s                     #Match a space
      (?=                    #Followed by, but do not match:
        (?:[^"]*             #Possibly chars
           "[^"]*"           #2 Quotes, containing any chars that are not quotes
        )*                   #Any number of times
        [^"]*                #Possibly followed by more chars
      \z)                    #until end of line
    /x

    Now we get to watch the forum software completely mess up my post, I'm sure I used the backspace key at least once...
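
    The counting argument above can also be tried in JavaScript, which supports lookahead. This is only a sketch, assuming balanced quotes; $ stands in for Perl's \z, and the names outsideQuotes and splitOutsideQuotes are made up for illustration.

```javascript
// Split on whitespace only when an even number of double quotes follows,
// i.e. when the whitespace sits outside any quoted section.
// Assumes the quotes in the input are balanced.
var outsideQuotes = /\s(?=(?:[^"]*"[^"]*")*[^"]*$)/;

function splitOutsideQuotes(input) {
  return input.split(outsideQuotes);
}

console.log(splitOutsideQuotes('foo baz "qux quux" bam'));
// → [ 'foo', 'baz', '"qux quux"', 'bam' ]
```

    Unlike the matching approach, this keeps the quote marks on the quoted tokens, and the lookahead rescans the rest of the string at every candidate space.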

  • how do YOU tell if you're in the middle of a set of quotes or not?

    By selecting the quoted parts as a whole.

    I originally thought that there wasn't a regex to do that, so I started to loop through the string, setting a 'stringMode' and replacing spaces from there on with a special unique string, like XxX or summat. After the string was formatted, I'd split it on ' ' and then replace /XxX/g with ' ' once more.

    Would've been somewhat slower. 🙂
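
    For comparison, a rough sketch of a loop-based splitter. It is a variant of the idea above rather than the same thing: it builds the tokens directly instead of substituting a placeholder and splitting afterwards. The function name and the choice to drop the quote marks are assumptions.

```javascript
// Hypothetical single-pass splitter: toggle a flag at each double quote and
// cut at spaces only while the flag is off.
function splitByScanning(input) {
  var tokens = [];
  var current = '';
  var inQuotes = false;
  for (var i = 0; i < input.length; i++) {
    var ch = input.charAt(i);
    if (ch === '"') {
      inQuotes = !inQuotes;            // entering or leaving a quoted section
    } else if (ch === ' ' && !inQuotes) {
      if (current !== '') tokens.push(current);  // ignore runs of spaces
      current = '';
    } else {
      current += ch;                   // ordinary character, or a quoted space
    }
  }
  if (current !== '') tokens.push(current);
  return tokens;
}
```

    Each character is visited once, so the cost grows linearly with the input. Empty quoted strings ("") are dropped by this version.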

    My regex solution doesn't use any dodgy lookbehinds and doesn't count anything. I have a dislike for lookbehinds, primarily because I can never remember how to write them.

    It works like a fucking charm, though.

    Though if I had a Fucking Charm, I definitely wouldn't use it to inspire regular expressions. If you see what I mean.

    In the interest of honesty,
    It fails if there's a double quote that does not denote a string end/start (such as in [migh"tily]), and it doesn't support escaped quotes.

  • Yeah.  You could pretty easily twiddle it to only accept a space at the front or back, and otherwise not match, though.  I actually went back and added that AFTER the fact, when I realized that my first thought would only match if there were a space before and after, not with a comma or a tab or anything.  Escaped quotes should be just a matter of adding the backslash to the exclusion list.
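
    One way the escaped-quote tweak might look, shown on the matching style of pattern rather than the lookahead one. It assumes that a backslash inside quotes escapes the next character; the names tokenRe and matchTokens are made up for illustration.

```javascript
// Assumption: inside quotes, a backslash escapes the next character.
// The quoted alternative excludes both the quote and the backslash from
// the plain-character class and consumes backslash pairs separately.
var tokenRe = /"(?:[^"\\]|\\.)*"|[^\s"]+/g;

function matchTokens(input) {
  // match() returns null when nothing matches, so fall back to [].
  return input.match(tokenRe) || [];
}
```

    Called on the text  say "a \"quoted\" word" now  this should give three tokens, with the middle one still carrying its quotes and backslashes.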

    Selecting the quoted parts as a whole is counting implicitly, because you're selecting two, and then 'marking' them as selected.  Try reading a paragraph that has a lot of quote marks and see what is quoted and what is just text, and start in the middle.

    Lookaheads and lookbehinds are really useful for split, because you can make a single char match in only specific instances, so you don't lose all the other instances.

    I wouldn't disparage the brute-force solution, either.  Even though that's probably going to generate a large number of copies, it may not be slower than my solution for a fairly long text string.  If there are 100 spaces and 40 quotes left in the string, I have to recursively match the remainder of those 40 quotes for each space.  Saving the pattern state should help things along, but it's still a recursive (n log n or worse) algorithm against a linear one that has a high coefficient.

  • (Sorry, I missed the thread.)

    You could use The Regex Coach; it really helps with regular expressions.
    Here is the link where you can download it:
    I use it to make regular expressions.

  • I second the Regex Coach. ❤

    Though it never matches globally even though I have the G checkbox On. Weird.

  • I can't believe I didn't know about the regex coach; I'm using cl-ppcre and a few other Edi Weitz libraries for this project.

    Thanks for all the replies.

  • ([^" ]*("[^"]*")[^" ]*)|[^" ]+

  • wtf, i just cant get the hang of this editor...
