Java regex help
-
An old bug has reared its ugly head again, and I'm seriously stumped why.
The code uses Java's Pattern library to do some regex parsing (from Coldfusion, but that's not important right now). The error thrown is that, inside a while loop that checks match.find(), there is no such group as group 0 (which denotes the whole match).
What the heck kind of edge condition can cause this? Why would find return true if there's no match? Why would group 0 be empty if there was a match?
-
Maybe stupid, but would the regex match an empty string as well? Or is it strictly non-empty strings that are accepted?
-
Empty string matches should be returning empty string though, shouldn't they? http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)
-
If the match was successful but the group specified failed to match any part of the input sequence, then null is returned. Note that some groups, for example (a*), match the empty string. This method will return the empty string when such a group successfully matches the empty string in the input.
It would seem so. Huh. The first part is a suspect, but it shouldn't ever happen on the whole pattern. I'm not very familiar with Java, but PHP uses the same principle for matching, so I'm familiar with the concept at least.
What happens if you feed it a NULL? Logically, .find() should either return false or throw an exception, but other than that all I can think of is a bug in the implementation itself.
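To make the quoted docs concrete, here is a minimal Java sketch of the null-versus-empty distinction. The class and method names are made up for illustration and the patterns are not the ones from the original code. It only shows that an optional group that takes no part in the match yields null, a group that matches nothing yields the empty string, and group 0 is the whole match after a successful find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the code discussed in the thread.
final class GroupNullVsEmptyDemo {

    /** Returns group `index` of the first match of `regex` in `input`. */
    static String groupOfFirstMatch(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match for " + regex);
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        // (a*) takes part in the match but consumes nothing: empty string.
        System.out.println("[" + groupOfFirstMatch("(a*)b", "b", 1) + "]"); // prints "[]"
        // (a)? does not take part in the match at all: null.
        System.out.println(groupOfFirstMatch("(a)?b", "b", 1)); // prints "null"
        // Group 0 is the whole match.
        System.out.println(groupOfFirstMatch("(a)?b", "b", 0)); // prints "b"
        // A pattern that can match the empty string still has a group 0.
        System.out.println("[" + groupOfFirstMatch("a*", "", 0) + "]"); // prints "[]"
    }
}
```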
-
Can you supply the regex and the offending text?
(from Coldfusion, but that's not important right now)
This sort of thing (but that's not important now) always throws up a red flag for me. Do you mean to say that you have coldfusion somehow calling Java code? Are you really sure something isn't breaking down in the interface between the two?
-
Do you mean to say that you have coldfusion somehow calling Java code?
Yes:
regexVars.PatternClass = createObject("java", "java.util.regex.Pattern");
regexVars.nonAsciiChars = regexVars.PatternClass.Compile("\P{InBasic_Latin}");
regexVars.theMatcher = regexVars.nonAsciiChars.matcher(editedString);
Good thing I wrote unit tests back when we didn't mandate them, because now we're using the tests I wrote to pin down exactly what characters are causing this issue.
-
What happens if you feed it a NULL?
It's not receiving a null in our simplest reproduction case.
-
"...Then I asked tdwtf for help with my regular expressions. Now I have three problems."
That sounds odd, but I can't picture what kind of code could cause the problem you described.
All the usual starting points apply, though. I'm sure you've thought of or tried most of these, but The Sign of the Four rule always applies, so here's what I would try:
Can you force a complete, guaranteed match with the right input?
If you pass it directly to the match and skip the while loop, do you get the answer you expect?
Is the result of match.find being lost somewhere, or replaced with the result of some other call before you look at it?
Have you tried turning it off and on again?
Are you really passing what you think you are passing to each function?
Isn't match usually spelled with only one "t"?
Can you step through the whole mess with a debugger (or even a redneck debugger made up entirely of printf statements and duck tape), and see where everything goes wrong?
Something has to be happening somewhere.
-
The CharSequence isn't annotated @NotNull, so it should just return false.
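For what it's worth, this is easy to check in plain Java. The sketch below uses a made-up class name and assumes an OpenJDK-style implementation, where the Matcher reads the input length as soon as it is created. On those builds a null input fails with a NullPointerException inside matcher(null), so find() never gets the chance to return false. The Javadoc does not promise this, so treat it as observed behaviour rather than a contract.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the code discussed in the thread.
final class NullInputDemo {

    /** Reports what happens when the pattern from the thread is applied to a null input. */
    static String matcherOnNull() {
        try {
            boolean found = Pattern.compile("\\P{InBasic_Latin}").matcher(null).find();
            return "find() returned " + found;
        } catch (NullPointerException e) {
            // Assumption: OpenJDK-style Matcher, which reads text.length() on construction.
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(matcherOnNull());
    }
}
```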
-
Something has to be happening somewhere.
We have an intern looking into it now. I can reproduce with use cases that call just the function containing the regex with a short string of only special characters, so he's going to narrow down what characters cause the issue, because our best guess is Unicode Weirdness.
-
Where's your capturing group? I just see a pattern that matches, not captures.
-
Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
From the docs: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)
-
What happens when you just call group()? I assume the same thing? That method literally calls group(0).
if (group < 0 || group > groupCount()) throw new IndexOutOfBoundsException("No group " + group);
...
Returns the number of capturing groups in this matcher's pattern. Group zero denotes the entire pattern by convention. It is not included in this count. Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher.
Returns: The number of capturing groups in this matcher's pattern
public int groupCount() {
    return parentPattern.capturingGroupCount - 1;
}
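A small Java sketch of what that range check means for the pattern in this thread. The class and method names are invented for the example, and "25°C" is only an assumed test input. With no capturing groups, groupCount() is 0, group(0) is still valid after a successful find(), and only indexes above groupCount() trip the IndexOutOfBoundsException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the code discussed in the thread.
final class GroupCountDemo {

    // Same pattern as in the thread, written as a Java string literal.
    static final Pattern NON_ASCII = Pattern.compile("\\P{InBasic_Latin}");

    /** Number of capturing groups in the pattern; group 0 is not counted. */
    static int capturingGroups() {
        return NON_ASCII.matcher("").groupCount();
    }

    /** Returns group(index) after a successful find(), or a short description of what went wrong. */
    static String groupAfterFind(String input, int index) {
        Matcher m = NON_ASCII.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        try {
            return m.group(index);
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        String input = "25\u00B0C"; // "25°C", assumed sample input
        System.out.println(capturingGroups());        // prints "0"
        System.out.println(groupAfterFind(input, 0)); // the degree sign
        System.out.println(groupAfterFind(input, 1)); // prints "IndexOutOfBoundsException"
    }
}
```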
-
-
Shouldn't that be \\P instead of \P?
Is there another call to find/lookingAt/matches/etc. after find() and before group()?
And why are some spaces around the backticks missing in the previous line?
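The reason for asking, as a hedged Java sketch with a made-up class name and an assumed sample input: any later find() that returns false resets the matcher's state, and group() then throws IllegalStateException even though an earlier find() succeeded. This is not a claim about what the ColdFusion code does, only an illustration of the failure mode.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the code discussed in the thread.
final class StaleMatchDemo {

    static final Pattern NON_ASCII = Pattern.compile("\\P{InBasic_Latin}");

    /** Calls group() only after the matcher has been exhausted by further find() calls. */
    static String groupAfterExhaustingMatcher(String input) {
        Matcher m = NON_ASCII.matcher(input);
        boolean firstFind = m.find();
        while (m.find()) {
            // Keep searching until find() returns false; that last call clears the match state.
        }
        try {
            return "first find() was " + firstFind + ", group() = " + m.group();
        } catch (IllegalStateException e) {
            return "IllegalStateException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(groupAfterExhaustingMatcher("25\u00B0C"));
    }
}
```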
-
Is there another call to find/lookingAt/matches/etc. after find() and before group()?
while (regexVars.theMatcher.find()) {
    regexVars.charToEscape = regexVars.theMatcher.group(0);
    editedString = Replace(editedString, regexVars.charToEscape, "&##" & asc(regexVars.charToEscape) & ';');
}
Shouldn't that be \\P instead of \P?
I dunno. It was working for most of the replacements, that's all I know. It didn't throw any kind of invalid regex error.
-
-
Weekly reminder: I don't develop anymore :) I've got my own shit to do.
This is like the fifth task my coworker has tried to assign his intern, and each time, the intern comes back with a major " " face, and it turns out, there's some platform architecture problem behind the seemingly trivial issue, and it actually needs his attention instead. He's in charge of platform maintenance, but to be fair, it was falling to bits before he got it, so he's still trying to find all the rough patches.
-
If that pattern holds true then my statement stands, albeit for different reasons :p
-
You're using Java? Now you have NullPointerException problems.
-
My problems are null? Sounds pretty sweet to me!
-
Is regexVars shared between requests?
Is a debugger attached which calls find()?
-
Is regexVars shared between requests?
It's a local variable.
Is a debugger attached which calls find()?
I doubt it. It fails when run by MXUnit.
-
Wait... Doesn't Replace search the string again? Why is the searched string modified in the loop? And are the double # really correct? What does this function actually do? I'm confused now.
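For comparison, here is a rough Java sketch of the same idea done in a single pass, so nothing searches the string a second time and the input is never modified while the matcher walks it. The class and method names are hypothetical, and this is not the original function. It assumes the goal is to turn every character outside Basic Latin into a decimal numeric character reference, and it uses the code point rather than a single char so that surrogate pairs come out as one reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch of a single-pass alternative, not the original code.
final class NumericEntityEscaper {

    private static final Pattern NON_ASCII = Pattern.compile("\\P{InBasic_Latin}");

    /** Replaces each character outside Basic Latin with a decimal reference such as &#176;. */
    static String escapeNonAscii(String input) {
        Matcher m = NON_ASCII.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The match is one code point, which may be two chars for supplementary characters.
            int codePoint = m.group().codePointAt(0);
            // quoteReplacement guards against '$' and '\' being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement("&#" + codePoint + ";"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("25\u00B0C"));     // prints "25&#176;C"
        System.out.println(escapeNonAscii("plain ascii"));   // prints "plain ascii"
        System.out.println(escapeNonAscii("\uD83D\uDE00!")); // prints "&#128512;!"
    }
}
```

StringBuffer is used here because the appendReplacement overload that takes a StringBuilder only exists from Java 9 onwards.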
-
@fatbull said:
Shouldn't that be \\P instead of \P?
I dunno. It was working for most of the replacements, that's all I know. It didn't throw any kind of invalid regex error.
It's right there in the Pattern documentation.
Backslashes within string literals in Java source code are interpreted as required by The Java™ Language Specification as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6) It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
(I bolded the relevant section)
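A short Java sketch of that rule, with a made-up class name. Note that the doubling is a property of Java source code: the regex engine itself only ever sees a single backslash. A string built outside Java source, for example in a language whose string literals do not treat the backslash as an escape (which I am assuming is the case for the ColdFusion call in this thread), would already contain the single backslash the engine expects.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the code discussed in the thread.
final class BackslashDemo {

    /** The regex text the engine receives: one backslash followed by P{InBasic_Latin}. */
    static String nonAsciiRegex() {
        return "\\P{InBasic_Latin}"; // two backslashes in source, one in the actual string
    }

    /** "\b" in Java source is a backspace character, so this regex matches a literal backspace. */
    static boolean singleBackslashBMatchesBackspace() {
        return Pattern.compile("\b").matcher("\u0008").matches();
    }

    /** "\\b" in Java source reaches the engine as \b, a word boundary. */
    static boolean doubleBackslashBIsWordBoundary() {
        return Pattern.compile("\\bcat\\b").matcher("a cat sat").find();
    }

    public static void main(String[] args) {
        System.out.println(nonAsciiRegex());                    // prints "\P{InBasic_Latin}"
        System.out.println(singleBackslashBMatchesBackspace()); // prints "true"
        System.out.println(doubleBackslashBIsWordBoundary());   // prints "true"
    }
}
```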
-
NULL_NOT_FOUND
-
And are the double # really correct?
Coldfusion.
to protect them from interpretation by the Java bytecode compiler.
Coldfusion.
leads to a compile-time error;
I get no errors from that string, so I figure it must be okay :)
This works for most of the unit tests, but fails on specific characters. You'll just have to trust me on that I guess >.>
-
-
Dunno yet. One of them for sure is the degree symbol.