MySQL backslash escaping



  • I seem to have discovered a bug in MySQL's matching of backslashes in LIKE clauses. Bug aside, the rules for escaping are WTFy enough in their own right:

    • Two characters have special meaning: % and _
    • To find those specific characters you can precede them with an escape character, by default, backslash: \%, \_
    • "Because MySQL uses C escape syntax in strings (for example, \n to represent a newline character), you must double any \ that you use in LIKE strings. For example, to search for \n, specify it as \\n. To search for \, specify it as \\\\; this is because the backslashes are stripped once by the parser and again when the pattern match is made, leaving a single backslash to be matched against."
    • "Exception: At the end of the pattern string, backslash can be specified as \\. At the end of the string, backslash stands for itself because there is nothing following to escape."

    The last two are verbatim from the docs. Okay, so to search for a backslash in the middle of the string, according to the docs, I need to use LIKE "%\\\\%". Let's test that out...

    I've created a database with the following values:

    TESTITEM
    %TESTITEM
    TESTITEM%
    TEST\ITEM
    TEST\\ITEM
    TEST\ITEM\


    Here are the queries and results:

    SELECT item_code FROM items WHERE item_code LIKE '%\%';
    
    +-----------+
    | item_code |
    +-----------+
    | %TESTITEM |
    | TESTITEM% |
    +-----------+
    2 rows in set (0.00 sec)
    

    The backslash should escape the second % right? This should only find columns ending with a % character as far as I can see, not just those containing one.


    SELECT item_code FROM items WHERE item_code LIKE '%\\%';
    
    +-----------+
    | item_code |
    +-----------+
    | %TESTITEM |
    | TESTITEM% |
    +-----------+
    2 rows in set (0.00 sec)
    

    This one's not really covered by the spec, I put it in for completeness.


    SELECT item_code FROM items WHERE item_code LIKE '%\\\%';
    
    Empty set (0.00 sec)
    

    Again, not really covered by the spec. It starts to get interesting after this...


    SELECT item_code FROM items WHERE item_code LIKE '%\\\\%';
    
    Empty set (0.00 sec)
    

    According to the documentation, this is the correct syntax to find a single backslash anywhere in the string, and yet it returns nothing.


    SELECT item_code FROM items WHERE item_code LIKE '%\\\\\%';
    
    +------------+
    | item_code  |
    +------------+
    | TEST\\ITEM |
    | TEST\ITEM  |
    | TEST\ITEM\ |
    +------------+
    3 rows in set (0.00 sec)
    

    Five backslashes (FIVE!) returns all of the rows containing a backslash. Surely this should match values ending with \%.


    SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\%';
    
    +------------+
    | item_code  |
    +------------+
    | TEST\\ITEM |
    | TEST\ITEM  |
    | TEST\ITEM\ |
    +------------+
    3 rows in set (0.00 sec)
    

    Six backslashes does the same as five.


    SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\%';
    
    Empty set (0.00 sec)
    
    SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\%';
    
    Empty set (0.00 sec)
    

    Seven and eight backslashes return nothing.


    SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\\%';
    
    +------------+
    | item_code  |
    +------------+
    | TEST\\ITEM |
    +------------+
    1 row in set (0.00 sec)
    

    Nine (and also ten, for reference) backslashes finally matches the sequence \\ somewhere in the string.

    I'm pretty sure this was what the term "clusterfuck" was invented to describe.


    UPDATE!

    I thought I'd try one more thing before posting this and changed the collation of the column from utf8_unicode_ci to utf8_general_ci. Suddenly matching on backslashes works correctly*‽

    Does someone want to explain to me how I'm TRWTF for using that collation, or is it a bug which only affects certain collations?


    * Although both seven and eight backslashes matched TEST\\ITEM, where by rights, only eight should have.


  • sockdevs

    TRWTF is MySQL



  • @RaceProUK said:

    TRWTF is MySQL

    Not when you're using both MySQL and PHP.

    EDIT: Which I am, btw.



  • @RaceProUK said:

    TRWTF is MySQL

    This.

    @Keith said:

    MySQL and PHP.

    :open_mouth:



  • @loopback0 said:

    @Keith said:
    MySQL and PHP.

    :open_mouth:

    Envy is one of the seven deadly sins you know.



  • At my workplace, we use a very complicated makefile-based build system. You run tests through make, too. It invokes some bash scripts then. There is an option for specifying regex that will match the tests to be run. Except it's not typical regex - lone * matches anything (no need for .*). Because of bash, if you want to use pipe in regex, you have to escape it five times: \\\\\|. Same with parens and brackets. And it rarely ever works with more complicated patterns. Thankfully, 99% of time you want testsuite.*.

    @DiscourceBot: if you have space after asterisk, it doesn't trigger italics. And if you have some stray unescaped asterisk in post, and another that's escaped with backticks, they match each other. Except when backticks contain two asterisks next to each other.


  • sockdevs

    @Keith said:

    Not when you're using both MySQL and PHP.

    [size=6]See? I can post images that don't include cute critters! ☺[/size]


  • sockdevs

    @Gaska said:

    @DiscourceBot

    You mean @discoursebot, I assume?



  • @RaceProUK - Days Since Last Discourse Bug: 0

    <!-- Posted by SockBot 0.13.0 "Devious Daine" on Fri Dec 05 2014 10:37:18 GMT+0000 (UTC)-->


  • @discoursebot said:

    @RaceProUK - Days Since Last Discourse Bug: 0

    <!-- Posted by SockBot 0.13.0 "Devious Daine" on Fri Dec 05 2014 10:37:18 GMT+0000 (UTC)-->


    I'm amused by the fact that I've never seen a number in this other than zero.



  • @RaceProUK - Days Since Last Gąska Fail: 0



  • @Steve_The_Cynic said:

    I'm amused by the fact that I've never seen a number in this other than zero.

    I always thought it's hardcoded as 0.



  • @Gaska said:

    I always thought it's hardcoded as 0.

    I think it might once have been genuinely calculated, but has since been optimised as it turned out to be a constant.



  • Universal constants:

    c
    π
    DAYS_SINCE_LAST_DISCOURSE_BUG



  • Correct. Plus when the bots were causing Discodeath looking up the real value was adding extra calls.


  • sockdevs

    @Steve_The_Cynic said:

    I'm amused by the fact that I've never seen a number in this other than zero.

    huh... discoursebot also appears to be running an old version of my scripts.... current version is 0.15.1. @loopback0, this is one of yours, right?



  • Yeah, they're all on an old version, I'll update at some point probably.


  • sockdevs

    @loopback0 said:

    probably.

    :laughing: i know that feels.



  • @Keith said:

    Envy is one of the seven deadly sins you know.

    I work with Oracle SQL and Java, I'm pretty sorted for WTFs already.



  • @Gaska said:

    if you have space after asterisk, it doesn't trigger italics. And if you have some stray unescaped asterisk in post, and another that's escaped with backticks, they match each other. Except when backticks contain two asterisks next to each other.

    Thread about backslashes and character escaping, without a mention of the best backslash escape sequence in Discourse? *



  • @loopback0 said:

    I work with Oracle SQL and Java, I'm pretty sorted for WTFs already.

    One thing I like about using JDBC / Hibernate / Oracle is that statements with semi-colons generate an error. I've never understood why other drivers don't do that, too (or if it's a JDBC or Hibernate or Oracle thing, TBH).



  • I'm pretty sure it's a JDBC with Oracle thing.
    I've never really looked into it, but it's definitely not Hibernate causing it.

    Can't remember whether it happens when using JDBC with MSSQL.



  • @Keith said:

    I seem to have discovered a bug

    @Keith said:

    MySQL

    I found the bug.

    Sorry, it's a dirty job but someone had to do it.



  • Bruce Dickinson sure is looking cartoony in his old age.



  • @Keith said:

    MySQL

    It's developed by Oracle. What did you expect?



  • Run for your life? Be more concise, just say: "Escape!"



  • @Lawrence said:

    Run for your life? Be more concise, just say: "Escape!"

    Doesn't fit the music though, does it?

    Iron Maiden - Run To The Hills (Low Definition) – 03:50
    — Iron Maiden



  • Just file a bug report.



  • @Keith said:

    I think it might once have been genuinely calculated, but has since been optimised as it turned out to be a constant.

    There was one time that it reported 1 day; I think there had been 25 hours since the last report, or something. It was hardcoded when

    @loopback0 said:

    the bots were causing Discodeath[, and] looking up the real value was adding extra calls.



  • This might be the only instance.....

    It said null once, but I can't remember if that was a legit message, a bug, or I'd posted it manually.



  • @loopback0 said:

    This might be the only instance.

    That would be the one.

    @loopback0 said:

    It said null once, but I can't remember if that was a legit message
    It has said that more that once. I think it still randomly selects between "days since last" and "last day without," it just doesn't bother actually calculating either.



  • @HardwareGeek said:

    It has said that more that once.

    I did do a Discosearch before that, and only got a single result, but now using less words in the search I have indeed found more posts with null.
    I've slept since then, difficult to remember that far back.

    @HardwareGeek said:

    I think it still randomly selects between "days since last" and "last day without," it just doesn't bother actually calculating either.

    It's been configured with a single message for a while....
    [code] "messages": [
    "@%username% - Days Since Last Discourse Bug: 0"
    ]
    [/code]

    Have added null back in as I'd forgotten I'd even removed it. added it in the first place.


  • sockdevs

    @loopback0 said:

    Have added null back in as I'd forgotten I'd even removed it. added it in the first place.

    since you're adding null you might as well also add -1

    :-P



  • @Keith said:

    Does someone want to explain to me how I'm TRWTF for using that collation, or is it a bug which only affects certain collations?

    Found a couple bug reports that have never been fixed - looks like this has been a problem for at least 6 years:



  • @accalia said:

    since you're adding null you might as well also add -1

    I've logged out now - if I remember the next time, then sure.



    Filed under: Apparent Discostability is a barrier to needing to log into EC2 for bot maintenance



  • I figured I'd only just forget, so I've just added it.


  • Discourse touched me in a no-no place

    @Keith said:

    >RaceProUK:
    TRWTF is MySQL

    Not when you're using both MySQL and PHP.

    You realize you can have more than one TRWTF, right? I realize they both each become _A_RWTF or TRWTFs, but it's still possible.

    In fact, MySQL and PHP each taken alone are RTWs, and the combination may be a simultaneous third.


  • Discourse touched me in a no-no place

    @HardwareGeek said:

    There was one time that it reported 1 day

    I could swear I saw it return null a couple of times.



  • @FrostCat said:

    I could swear I saw it return null a couple of times.

    Keep reading.


  • Discourse touched me in a no-no place

    @HardwareGeek said:

    Keep reading.

    Dammit, now I can't edit my post to say I was Hanzo'd, because you Hanzo'd my edit.



  • It's a multi-Hanzo!


  • Discourse touched me in a no-no place

    @HardwareGeek said:

    It's a multi-Hanzo!

    It's Hanzos all the way down.



  • I did the test on Postgres (8.4). The rules for LIKE statements are the same, but the string quoting rules are a bit different. I think the results are a tad wonky as well (queries 2 and 4 don't look right to me)

    => SELECT * FROM items;
     item_code
    ------------
     TESTITEM
     %TESTITEM
     TESTITEM%
     TEST\ITEM
     TEST\\ITEM
     TEST\ITEM\
    (6 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\%';
     item_code 
    -----------
     TESTITEM%
    (1 row)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\%';
     item_code  
    ------------
     TEST\ITEM
     TEST\\ITEM
     TEST\ITEM\
    (3 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\%';
     item_code 
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\%';
     item_code  
    ------------
     TEST\\ITEM
    (1 row)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\%';
     item_code 
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\%';
     item_code 
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\%';
     item_code 
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\%';
     item_code 
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\\%';
     item_code 
    -----------
    (0 rows)
    

    Postgres has two syntaxes (syntices?) for strings: 'string' and E'string'. The latter syntax allows backslash escape sequences. The only escape sequence the former syntax allows is a doubled single quote. Here's what happens.

    => SELECT item_code FROM items WHERE item_code LIKE E'%\%';
     item_code  
    ------------
     TESTITEM
     %TESTITEM
     TESTITEM%
     TEST\ITEM
     TEST\\ITEM
     TEST\ITEM\
    (6 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\%';
     item_code 
    -----------
     TESTITEM%
    (1 row)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\%';
     item_code 
    -----------
     TESTITEM%
    (1 row)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\%';
     item_code  
    ------------
     TEST\ITEM
     TEST\\ITEM
     TEST\ITEM\
    (3 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\%';
     item_code  
    ------------
     TEST\ITEM
     TEST\\ITEM
     TEST\ITEM\
    (3 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\%';
     item_code                                  
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\\%';
     item_code
    -----------
    (0 rows)
    
    => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\\\%';
     item_code
    ------------
     TEST\\ITEM
    (1 row)
    

    ...It's too early in the morning for my brain to wrap around this thing.



  • While I'm at it, here's how postgres sees the strings after its SQL parser is done with them.

    => SELECT '%\%', '%\\%', '%\\\%', '%\\\\%', '%\\\\\%', '%\\\\\\%', '%\\\\\\\%', '%\\\\\\\\%', '%\\\\\\\\\%';
     ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column?  |  ?column?  |  ?column?
    ----------+----------+----------+----------+----------+----------+-----------+------------+-------------
     %\%      | %\\%     | %\\\%    | %\\\\%   | %\\\\\%  | %\\\\\\% | %\\\\\\\% | %\\\\\\\\% | %\\\\\\\\\%
    (1 row)
    
    => SELECT E'%\%', E'%\\%', E'%\\\%', E'%\\\\%', E'%\\\\\%', E'%\\\\\\%', E'%\\\\\\\%', E'%\\\\\\\\%', E'%\\\\\\\\\%';
     ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? 
    ----------+----------+----------+----------+----------+----------+----------+----------+----------
     %%       | %\%      | %\%      | %\\%     | %\\%     | %\\\%    | %\\\%    | %\\\\%   | %\\\\%
    (1 row)




  • @monkeyArms said:

    http://bugs.mysql.com/bug.php?id=39808

    I like how the reporter of that bug is listed as 'Jaka Jančar' -- and that the bug is filed under 'Charsets'

    Unicode is hard. Unicode with MySQL is twice as hard.

    Edit: His name is Jaka Jančar.


  • Discourse touched me in a no-no place

    Staple Ruby(-on-Rails) in front of an old MySQL DB (e.g., one that's been in production for a few years) and you can get some very interesting failures. It's entirely possible to have the DB, the user and the system on three separate encodings, and for Ruby to make this is all too evident in a “I'm going to blow up in your face now. Wheee!” way.


  • Winner of the 2016 Presidential Election

    @hhaamu said:

    I did the test on Postgres (8.4). The rules for LIKE statements are the same, but the string quoting rules are a bit different. I think the results are a tad wonky as well (queries 2 and 4 don't look right to me)

    It seems 'something' parses only once (the right and proper way to do that), and E'something' parses twice, then the matching is done. So:

    • '%\\\\\\\\%' => '%\\\\%' => match anything with four backslashes in it.
    • E'%\\\\\\\\%' => '%\\\\%' => '%\\%' => match anything with two backslashes in it.

    Is that not what is supposed to happen?


    Filed under: That has to be the most WTF-y markup I've ever done



  • Oh, yeah, certainly. Postgres's makes perfect sense. My brains just weren't yet at the office at that hour.

    I was trying to correlate it with MySQL's behaviour as delineated in the OP and could not make any sense out of it. If you assume the strings are stripslashes'd once before fed to LIKE (such as in pg's case), around six out of eight return resultsets that are completely illogical.


  • Discourse touched me in a no-no place

    You've got to stop thinking in terms of what you write in the SQL and start thinking in terms of what the characters in the values are. That's what LIKE works with. Once you've done that, then you think about how to encode those characters so that the abstract value you're dealing with is going to get to the SQL engine without damage. Or you use prepared statements so you can skip that noise entirely (though that might make the execution less efficient; some common patterns of LIKE usage with constants lead to great improvements through the use of indices to speed up searches).


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.