MySQL backslash escaping
-
I seem to have discovered a bug in MySQL's matching of backslashes in LIKE clauses. Bug aside, the rules for escaping are WTFy enough in their own right:
- Two characters have special meaning:
%
and_
- To find those specific characters you can precede them with an escape character, by default, backslash:
\%
,\_
- "Because MySQL uses C escape syntax in strings (for example,
\n
to represent a newline character), you must double any\
that you use in LIKE strings. For example, to search for\n
, specify it as\\n
. To search for\
, specify it as\\\\
; this is because the backslashes are stripped once by the parser and again when the pattern match is made, leaving a single backslash to be matched against." - "Exception: At the end of the pattern string, backslash can be specified as
\\
. At the end of the string, backslash stands for itself because there is nothing following to escape."
The last two are verbatim from the docs. Okay, so to search for a backslash in the middle of the string, according to the docs, I need to use
LIKE "%\\\\%"
. Let's test that out...I've created a database with the following values:
TESTITEM
%TESTITEM
TESTITEM%
TEST\ITEM
TEST\\ITEM
TEST\ITEM\
Here are the queries and results:
SELECT item_code FROM items WHERE item_code LIKE '%\%'; +-----------+ | item_code | +-----------+ | %TESTITEM | | TESTITEM% | +-----------+ 2 rows in set (0.00 sec)
The backslash should escape the second
%
right? This should only find columns ending with a%
character as far as I can see, not just those containing one.
SELECT item_code FROM items WHERE item_code LIKE '%\\%'; +-----------+ | item_code | +-----------+ | %TESTITEM | | TESTITEM% | +-----------+ 2 rows in set (0.00 sec)
This one's not really covered by the spec, I put it in for completeness.
SELECT item_code FROM items WHERE item_code LIKE '%\\\%'; Empty set (0.00 sec)
Again, not really covered by the spec. It starts to get interesting after this...
SELECT item_code FROM items WHERE item_code LIKE '%\\\\%'; Empty set (0.00 sec)
According to the documentation, this is the correct syntax to find a single backslash anywhere in the string, and yet it returns nothing.
SELECT item_code FROM items WHERE item_code LIKE '%\\\\\%'; +------------+ | item_code | +------------+ | TEST\\ITEM | | TEST\ITEM | | TEST\ITEM\ | +------------+ 3 rows in set (0.00 sec)
Five backslashes (FIVE!) returns all of the rows containing a backslash. Surely this should match values ending with
\%
.
SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\%'; +------------+ | item_code | +------------+ | TEST\\ITEM | | TEST\ITEM | | TEST\ITEM\ | +------------+ 3 rows in set (0.00 sec)
Six backslashes does the same as five.
SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\%'; Empty set (0.00 sec) SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\%'; Empty set (0.00 sec)
Seven and eight backslashes return nothing.
SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\\%'; +------------+ | item_code | +------------+ | TEST\\ITEM | +------------+ 1 row in set (0.00 sec)
Nine (and also ten, for reference) backslashes finally matches the sequence
\\
somewhere in the string.I'm pretty sure this was what the term "clusterfuck" was invented to describe.
UPDATE!
I thought I'd try one more thing before posting this and changed the collation of the column from
utf8_unicode_ci
toutf8_general_ci
. Suddenly matching on backslashes works correctly*‽Does someone want to explain to me how I'm TRWTF for using that collation, or is it a bug which only affects certain collations?
* Although both seven and eight backslashes matched
TEST\\ITEM
, where by rights, only eight should have.
- Two characters have special meaning:
-
TRWTF is MySQL
-
-
-
-
At my workplace, we use a very complicated makefile-based build system. You run tests through make, too. It invokes some bash scripts then. There is an option for specifying regex that will match the tests to be run. Except it's not typical regex - lone * matches anything (no need for .*). Because of bash, if you want to use pipe in regex, you have to escape it five times:
\\\\\|
. Same with parens and brackets. And it rarely ever works with more complicated patterns. Thankfully, 99% of time you wanttestsuite.*
.@DiscourceBot: if you have space after asterisk, it doesn't trigger italics. And if you have some stray unescaped asterisk in post, and another that's escaped with backticks, they match each other. Except when backticks contain two asterisks next to each other.
-
Not when you're using both MySQL and PHP.
[size=6]See? I can post images that don't include cute critters! ☺[/size]
-
-
@RaceProUK - Days Since Last Discourse Bug: 0
-
@RaceProUK - Days Since Last Discourse Bug: 0
<!-- Posted by SockBot 0.13.0 "Devious Daine" on Fri Dec 05 2014 10:37:18 GMT+0000 (UTC)-->
I'm amused by the fact that I've never seen a number in this other than zero.
-
@RaceProUK - Days Since Last Gąska Fail: 0
-
I'm amused by the fact that I've never seen a number in this other than zero.
I always thought it's hardcoded as 0.
-
I always thought it's hardcoded as 0.
I think it might once have been genuinely calculated, but has since been optimised as it turned out to be a constant.
-
Universal constants:
c
π
DAYS_SINCE_LAST_DISCOURSE_BUG
-
Correct. Plus when the bots were causing Discodeath looking up the real value was adding extra calls.
-
I'm amused by the fact that I've never seen a number in this other than zero.
huh... discoursebot also appears to be running an old version of my scripts.... current version is 0.15.1. @loopback0, this is one of yours, right?
-
Yeah, they're all on an old version, I'll update at some point probably.
-
-
Envy is one of the seven deadly sins you know.
I work with Oracle SQL and Java, I'm pretty sorted for WTFs already.
-
if you have space after asterisk, it doesn't trigger italics. And if you have some stray unescaped asterisk in post, and another that's escaped with backticks, they match each other. Except when backticks contain two asterisks next to each other.
Thread about backslashes and character escaping, without a mention of the best backslash escape sequence in Discourse? *
-
I work with Oracle SQL and Java, I'm pretty sorted for WTFs already.
One thing I like about using JDBC / Hibernate / Oracle is that statements with semi-colons generate an error. I've never understood why other drivers don't do that, too (or if it's a JDBC or Hibernate or Oracle thing, TBH).
-
I'm pretty sure it's a JDBC with Oracle thing.
I've never really looked into it, but it's definitely not Hibernate causing it.Can't remember whether it happens when using JDBC with MSSQL.
-
I seem to have discovered a bug
MySQL
I found the bug.
Sorry, it's a dirty job but someone had to do it.
-
Bruce Dickinson sure is looking cartoony in his old age.
-
-
Run for your life? Be more concise, just say: "Escape!"
-
Run for your life? Be more concise, just say: "Escape!"
Doesn't fit the music though, does it?
-
Just file a bug report.
-
I think it might once have been genuinely calculated, but has since been optimised as it turned out to be a constant.
There was one time that it reported 1 day; I think there had been 25 hours since the last report, or something. It was hardcoded when
the bots were causing Discodeath[, and] looking up the real value was adding extra calls.
-
This might be the only instance.....
http://what.thedailywtf.com/t/conversations-overheard/3268/210?u=loopback0
It said null once, but I can't remember if that was a legit message, a bug, or I'd posted it manually.
-
This might be the only instance.
That would be the one.
It said null once, but I can't remember if that was a legit message
It has said that more that once. I think it still randomly selects between "days since last" and "last day without," it just doesn't bother actually calculating either.
-
It has said that more that once.
I did do a Discosearch before that, and only got a single result, but now using less words in the search I have indeed found more posts with null.
I've slept since then, difficult to remember that far back.I think it still randomly selects between "days since last" and "last day without," it just doesn't bother actually calculating either.
It's been configured with a single message for a while....
[code] "messages": [
"@%username% - Days Since Last Discourse Bug: 0"
]
[/code]Have added null back in as I'd forgotten I'd even
removed it.added it in the first place.
-
Have added null back in as I'd forgotten I'd even removed it. added it in the first place.
since you're adding null you might as well also add
-1
:-P
-
Does someone want to explain to me how I'm TRWTF for using that collation, or is it a bug which only affects certain collations?
Found a couple bug reports that have never been fixed - looks like this has been a problem for at least 6 years:
-
since you're adding null you might as well also add -1
I've logged out now - if I remember the next time, then sure.
Filed under: Apparent Discostability is a barrier to needing to log into EC2 for bot maintenance
-
I figured I'd only just forget, so I've just added it.
-
>RaceProUK:
TRWTF is MySQLNot when you're using both MySQL and PHP.
You realize you can have more than one TRWTF, right? I realize they both each become _A_RWTF or TRWTFs, but it's still possible.
In fact, MySQL and PHP each taken alone are RTWs, and the combination may be a simultaneous third.
-
There was one time that it reported 1 day
I could swear I saw it return null a couple of times.
-
-
Keep reading.
Dammit, now I can't edit my post to say I was Hanzo'd, because you Hanzo'd my edit.
-
It's a multi-Hanzo!
-
-
I did the test on Postgres (8.4). The rules for LIKE statements are the same, but the string quoting rules are a bit different. I think the results are a tad wonky as well (queries 2 and 4 don't look right to me)
=> SELECT * FROM items; item_code ------------ TESTITEM %TESTITEM TESTITEM% TEST\ITEM TEST\\ITEM TEST\ITEM\ (6 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\%'; item_code ----------- TESTITEM% (1 row) => SELECT item_code FROM items WHERE item_code LIKE '%\\%'; item_code ------------ TEST\ITEM TEST\\ITEM TEST\ITEM\ (3 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\%'; item_code ------------ TEST\\ITEM (1 row) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE '%\\\\\\\\\%'; item_code ----------- (0 rows)
Postgres has two syntaxes (syntices?) for strings:
'string'
andE'string'
. The latter syntax allows backslash escape sequences. The only escape sequence the former syntax allows is a doubled single quote. Here's what happens.=> SELECT item_code FROM items WHERE item_code LIKE E'%\%'; item_code ------------ TESTITEM %TESTITEM TESTITEM% TEST\ITEM TEST\\ITEM TEST\ITEM\ (6 rows) => SELECT item_code FROM items WHERE item_code LIKE E'%\\%'; item_code ----------- TESTITEM% (1 row) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\%'; item_code ----------- TESTITEM% (1 row) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\%'; item_code ------------ TEST\ITEM TEST\\ITEM TEST\ITEM\ (3 rows) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\%'; item_code ------------ TEST\ITEM TEST\\ITEM TEST\ITEM\ (3 rows) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\\%'; item_code ----------- (0 rows) => SELECT item_code FROM items WHERE item_code LIKE E'%\\\\\\\\%'; item_code ------------ TEST\\ITEM (1 row)
...It's too early in the morning for my brain to wrap around this thing.
-
While I'm at it, here's how postgres sees the strings after its SQL parser is done with them.
=> SELECT '%\%', '%\\%', '%\\\%', '%\\\\%', '%\\\\\%', '%\\\\\\%', '%\\\\\\\%', '%\\\\\\\\%', '%\\\\\\\\\%'; ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? ----------+----------+----------+----------+----------+----------+-----------+------------+------------- %\% | %\\% | %\\\% | %\\\\% | %\\\\\% | %\\\\\\% | %\\\\\\\% | %\\\\\\\\% | %\\\\\\\\\% (1 row) => SELECT E'%\%', E'%\\%', E'%\\\%', E'%\\\\%', E'%\\\\\%', E'%\\\\\\%', E'%\\\\\\\%', E'%\\\\\\\\%', E'%\\\\\\\\\%'; ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? ----------+----------+----------+----------+----------+----------+----------+----------+---------- %% | %\% | %\% | %\\% | %\\% | %\\\% | %\\\% | %\\\\% | %\\\\% (1 row)
-
view in Firefox: http://aceattorney.sparklin.org/jeu.php?id_proces=57684
-
http://bugs.mysql.com/bug.php?id=39808
I like how the reporter of that bug is listed as 'Jaka JanÄar' -- and that the bug is filed under 'Charsets'
Unicode is hard. Unicode with MySQL is twice as hard.
Edit: His name is Jaka Jančar.
-
Staple Ruby(-on-Rails) in front of an old MySQL DB (e.g., one that's been in production for a few years) and you can get some very interesting failures. It's entirely possible to have the DB, the user and the system on three separate encodings, and for Ruby to make this is all too evident in a “I'm going to blow up in your face now. Wheee!” way.
-
I did the test on Postgres (8.4). The rules for LIKE statements are the same, but the string quoting rules are a bit different. I think the results are a tad wonky as well (queries 2 and 4 don't look right to me)
It seems
'something'
parses only once (the right and proper way to do that), andE'something'
parses twice, then the matching is done. So:- '%\\\\\\\\%' => '%\\\\%' => match anything with four backslashes in it.
- E'%\\\\\\\\%' => '%\\\\%' => '%\\%' => match anything with two backslashes in it.
Is that not what is supposed to happen?
Filed under: That has to be the most WTF-y markup I've ever done
-
Oh, yeah, certainly. Postgres's makes perfect sense. My brains just weren't yet at the office at that hour.
I was trying to correlate it with MySQL's behaviour as delineated in the OP and could not make any sense out of it. If you assume the strings are
stripslashes
'd once before fed to LIKE (such as in pg's case), around six out of eight return resultsets that are completely illogical.
-
You've got to stop thinking in terms of what you write in the SQL and start thinking in terms of what the characters in the values are. That's what
LIKE
works with. Once you've done that, then you think about how to encode those characters so that the abstract value you're dealing with is going to get to the SQL engine without damage. Or you use prepared statements so you can skip that noise entirely (though that might make the execution less efficient; some common patterns ofLIKE
usage with constants lead to great improvements through the use of indices to speed up searches).