The story of two regexes



  • Two bash regexes.

    regex1="([\\b/_.\\-]|^)test([\\b/_.\\-]|$)"
    regex2="([\\b/_\\-.]|^)test([\\b/_\\-.]|$)"
    

    Regex 1 works:

    $ [[ '_test_' =~ $regex1 ]] && echo YES || echo NO
    YES
    

    Regex 2 doesn't work:

    $ [[ '_test_' =~ $regex2 ]] && echo YES || echo NO
    NO
    

    Why? What am I missing with that escaping code?


  • Java Dev

    @cartman82 I don't think backslash escaping works in character classes. The - character is special unless it is the first or last character in the class (and ] is special unless it is the first character in the class).
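
A minimal sketch of what seems to be happening, assuming bash on a system whose regex library rejects reversed ranges (glibc does) and with the locale pinned to C. In regex2 the sequence \-. sits in the middle of the class, so it is read as a range from \ to ., which runs backwards. Since _ is listed in the class anyway, a plain non-match cannot explain the NO; the pattern fails to compile, [[ ... =~ ... ]] returns 2 for that, and the && / || chain reports it as NO. The name fixed is hypothetical and not from the thread.

```shell
#!/usr/bin/env bash
# Pin the locale so range endpoints are compared by byte value (assumption).
export LC_ALL=C

# The two patterns from the question, single-quoted so every \ is literal.
regex1='([\b/_.\-]|^)test([\b/_.\-]|$)'
regex2='([\b/_\-.]|^)test([\b/_\-.]|$)'
# Hypothetical fixed variant: - goes first in the class, no backslashes.
fixed='([-/_.]|^)test([-/_.]|$)'

[[ '_test_' =~ $regex1 ]]; status1=$?   # 0: matched (the - is last, so it is literal)
[[ '_test_' =~ $regex2 ]]; status2=$?   # 2: pattern rejected (range \ to . is reversed)
[[ '_test_' =~ $fixed  ]]; status3=$?   # 0: matched

echo "regex1: $status1"
echo "regex2: $status2"
echo "fixed:  $status3"
```

Checking $? directly instead of using && / || is what separates "did not match" (1) from "pattern is invalid" (2).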



  • @PleegWat So you're saying both are wrong and the character class should be [-\\b/_.] or [\\b/_.-]?


  • Discourse touched me in a no-no place

    @Khudzlin Perhaps with \ included as a literal character in the character class as well. Or maybe not; we know what was written, not what was actually wanted, and REs are sensitive to people knowing what they want. :)



  • @PleegWat said in The story of two regexes:

    @cartman82 I don't think backslash escaping works in character classes. The - character is special unless it is the first or last character in the class (and ] is special unless it is the first character in the class).

    Escaping definitely works in sane languages.
    Maybe bash is "special" in that regard...

    @dkf said in The story of two regexes:

    @Khudzlin Perhaps with \ included as a literal character in the character class as well. Or maybe not; we know what was written, not what was actually wanted, and REs are sensitive to people knowing what they want.

    I want a capture group that matches /, -, ., _ and word boundary.


  • Discourse touched me in a no-no place

    @cartman82 said in The story of two regexes:

    I want a capture group that matches /, -, ., _ and word boundary.

    Pull the word boundary out of the set (it's a zero-width constraint) and put the - first in the set. So this: (\\b|[-/_.]|$)
    The backslash is only doubled because of how you're (not) quoting the RE.



  • @cartman82 said in The story of two regexes:

    I want a capture group that matches /, -, ., _ and word boundary.

    stephen@kitchen:~$ regex3='([-/._]|\b)test([-/._]|\b)'
    stephen@kitchen:~$ [[ test =~ $regex3 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    [test]
    []
    []
    stephen@kitchen:~$ [[ tests =~ $regex3 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    NO
    stephen@kitchen:~$ [[ _tests =~ $regex3 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    NO
    stephen@kitchen:~$ [[ '_test\' =~ $regex3 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    [_test]
    [_]
    []
    stephen@kitchen:~$ [[ '_test/' =~ $regex3 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    [_test/]
    [_]
    [/]
    

    Notes:

    When assigning quoted strings in bash, especially metacharacter-heavy strings like regexes and extra-especially regexes that contain \ or $, delimiting them with single quotes is generally helpful unless you actually do want to interpolate variables inside them or they actually need to contain embedded single quotes. In the latter case, '\'' is idiomatic (close currently open quote, add an explicitly escaped quote, open a new quote).

    This is mainly because \ is a literal character rather than an escape inside single quotes, so you avoid the whole "do I need one escape or two" anguish; every \ you put inside a single-quoted string will be seen as a \ by the regex parser.

    Word boundaries are not characters, so putting \b inside a character class won't make it match a word boundary. In fact it will make it match on the characters \ and b since \ is not special inside a bash character class. This is pretty common for regex engines - off the top of my head, JavaScript's (and maybe C#'s?) are the only ones that support escaped characters inside character classes.

    If you want newlines, tabs and so forth inside a character class, either put them there literally or use the $'escape-interpreting-quoted-string' bashism (in which case you're back to needing doubled backslashes when you want backslashes, but at least you don't also risk unintended interpolation from $ characters).

    stephen@kitchen:~$ regex4='([\n])'
    stephen@kitchen:~$ [[ '
    ' =~ $regex4 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    NO
    stephen@kitchen:~$ [[ '\' =~ $regex4 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    [\]
    [\]
    stephen@kitchen:~$ regex5='([
    ])'
    stephen@kitchen:~$ [[ '
    ' =~ $regex5 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    [
    ]
    [
    ]
    stephen@kitchen:~$ [[ '\' =~ $regex5 ]] && printf '[%s]\n' "${BASH_REMATCH[@]}" || echo NO
    NO
    

    Finally, there's no need to test for ^ or $ if you're already testing for a word boundary, since line extremes are word boundaries anyway.
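
The quoting notes above can be sketched as follows, assuming bash. The variable names are hypothetical. The first half checks that a single-quoted pattern and its double-quoted, doubled-backslash twin are the same string; the second half uses $'...' so that \n and \t become real characters before the regex engine ever sees the class.

```shell
#!/usr/bin/env bash
# Single quotes keep each \ as typed; inside double quotes \\ collapses to one \.
single='([-/._]|\b)test'
double="([-/._]|\\b)test"
if [[ $single == "$double" ]]; then same=yes; else same=no; fi
echo "same pattern: $same"

# $'...' interprets the escapes, so the class holds a real newline and a real tab.
ws_class=$'([\n\t])'
[[ $'\n' =~ $ws_class ]]; newline_status=$?     # 0: a newline is in the class
[[ '\'   =~ $ws_class ]]; backslash_status=$?   # 1: no literal backslash in the class
echo "newline: $newline_status, backslash: $backslash_status"
```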


  • Discourse touched me in a no-no place

    @flabdablet said in The story of two regexes:

    This is pretty common for regex engines - off the top of my head, JavaScript's (and maybe C#'s?) are the only ones that support escaped characters inside character classes.

    There are more, but exactly what is supported is a bit complicated to describe. Use a double backslash when you want a backslash in a character class unless you've read the appropriate bit of the manual very recently…



  • @dkf said in The story of two regexes:

    Use a double backslash when you want a backslash in a character class

    That's a good idea, though it might encourage some future maintainer to try to extend that character class with something like \t on the grounds that oh looky, this engine obviously does support escapes in character classes or they wouldn't have doubled that backslash there.

    unless you've read the appropriate bit of the manual very recently

    When dealing with regexes, this is always the best idea. Even when the last time you read it was just yesterday.
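
A small sketch of the maintenance trap described above, assuming bash with a POSIX-style regex library where \ has no special meaning inside a bracket expression. The class below is a hypothetical example: someone saw the doubled backslash and appended \t hoping for a tab.

```shell
#!/usr/bin/env bash
# Intended: backslash, slash and tab. Actual members: \ (three times), / and the letter t.
class='[\\/\t]'

[[ '\'   =~ $class ]]; backslash_status=$?   # 0: backslash is in the class
[[ 't'   =~ $class ]]; letter_status=$?      # 0: so is the letter t
[[ $'\t' =~ $class ]]; tab_status=$?         # 1: a real tab is not

echo "backslash: $backslash_status, letter t: $letter_status, tab: $tab_status"
```

The doubled backslash is harmless here, but the class quietly starts matching t, which is the kind of bug that only shows up on filenames like test.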


  • Java Dev

    One of the problems of regex is that every regex-supporting tool uses a slightly different syntax, mostly surrounding whether specific characters are magic when escaped and literal when not escaped, or the other way around.
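
To illustrate the "magic when escaped or magic when not" split, here is a sketch using grep's two POSIX dialects as a stand-in (the thread itself is about bash, which uses the extended flavour). The helpers bre and ere are hypothetical names; each prints how many input lines matched.

```shell
#!/usr/bin/env bash
# bre PATTERN INPUT: count matching lines using basic regular expressions.
bre() { printf '%s\n' "$2" | grep -c -- "$1"; }
# ere PATTERN INPUT: count matching lines using extended regular expressions.
ere() { printf '%s\n' "$2" | grep -Ec -- "$1"; }

bre_plus=$(bre 'a+' 'aa')            # 0: in BRE the + is a literal plus sign
ere_plus=$(ere 'a+' 'aa')            # 1: in ERE the + is a quantifier
bre_paren=$(bre '(ab)' 'ab')         # 0: in BRE bare parentheses are literal
ere_paren=$(ere '(ab)' 'ab')         # 1: in ERE bare parentheses group
bre_escaped=$(bre '\(ab\)' 'ab')     # 1: in BRE the escaped form groups

echo "a+     BRE=$bre_plus ERE=$ere_plus"
echo "(ab)   BRE=$bre_paren ERE=$ere_paren"
echo "\\(ab\\) BRE=$bre_escaped"
```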



  • Thanks, @flabdablet

    This was what I needed.

    I could have cobbled together something that worked, but I was curious why that specific thing didn't work as I expected.



  • @flabdablet said in The story of two regexes:

    In the latter case, '\'' is idiomatic (close currently open quote, add an explicitly escaped quote, open a new quote).

    I have '"'"' in my muscle memory. As a bonus, it can be typed with one finger without taking it off a key.
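
For comparison, a quick sketch showing that both idioms produce the same string in bash. The variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Close the single quote, add a backslash-escaped quote, reopen.
with_backslash='it'\''s'
# Close the single quote, add a double-quoted quote, reopen.
with_double='it'"'"'s'
# Plain double-quoted reference value.
reference="it's"

if [[ $with_backslash == "$reference" && $with_double == "$reference" ]]; then
    idioms_agree=yes
else
    idioms_agree=no
fi
echo "idioms agree: $idioms_agree"
```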


  • Java Dev

    @ben_lubar said in The story of two regexes:

    @flabdablet said in The story of two regexes:

    In the latter case, '\'' is idiomatic (close currently open quote, add an explicitly escaped quote, open a new quote).

    I have '"'"' in my muscle memory. As a bonus, it can be typed with one finger without taking it off a key.

    Hold ' and press/release shift with perfect timing?

