Java regex help


  • mod

    An old bug has reared its ugly head again, and I'm seriously stumped as to why.

    The code uses Java's Pattern library to do some regex parsing (from Coldfusion, but that's not important right now). Inside a while loop that checks match.find(), it throws an error saying there is no such group as group 0 (which denotes the whole match).

    What the heck kind of edge condition can cause this? Why would find return true if there's no match? Why would group 0 be empty if there was a match?


  • Winner of the 2016 Presidential Election

    Maybe stupid, but would the regex match an empty string as well? Or is it strictly non-empty strings that are accepted?


  • mod

    Empty string matches should be returning empty string though, shouldn't they? http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)


  • Winner of the 2016 Presidential Election

    If the match was successful but the group specified failed to match any part of the input sequence, then null is returned. Note that some groups, for example (a*), match the empty string. This method will return the empty string when such a group successfully matches the empty string in the input.

    It would seem so. Huh. The first part is a suspect, but it shouldn't ever happen on the whole pattern. I'm not very familiar with Java, but PHP uses the same principle for matching, so I'm familiar with the concept at least.

    What happens if you feed it a NULL? Logically, .find() should either return false or throw an exception, but other than that all I can think of is a bug in the implementation itself.
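
    To make the null-versus-empty distinction from the docs concrete, here is a minimal Java sketch. The class and method names (GroupNullVsEmpty, groupOne, wholeMatch) and the sample patterns are made up for illustration; they are not from the code under discussion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNullVsEmpty {
    // Returns group(1) of the first match; hypothetical helper for illustration.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return m.group(1);
    }

    // Returns group(0), i.e. the whole match, of the first match.
    static String wholeMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return m.group(0);
    }

    public static void main(String[] args) {
        // (a*) takes part in the match and matches nothing: empty string.
        System.out.println("[" + groupOne("(a*)b", "b") + "]");
        // (a)? does not take part in the match at all: null.
        System.out.println(groupOne("(a)?b", "b"));
        // A whole-pattern match of zero length still gives an empty string for group 0.
        System.out.println("[" + wholeMatch("a*", "bbb") + "]");
    }
}
```

    So an empty match would explain an empty group 0, but not a "no such group" error.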



  • Can you supply the regex and the offending text?

    @Yamikuronue said:

    (from Coldfusion, but that's not important right now)

    This sort of thing (but that's not important now) always throws up a red flag for me. Do you mean to say that you have coldfusion somehow calling Java code? Are you really sure something isn't breaking down in the interface between the two?


  • mod

    @boomzilla said:

    Do you mean to say that you have coldfusion somehow calling Java code?

    Yes:

    regexVars.PatternClass = createObject("java", "java.util.regex.Pattern");
    regexVars.nonAsciiChars = regexVars.PatternClass.Compile("\P{InBasic_Latin}");
    regexVars.theMatcher = regexVars.nonAsciiChars.matcher(editedString);
    

    Good thing I wrote unit tests back when we didn't mandate them, because now we're using the tests I wrote to pin down exactly what characters are causing this issue.
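
    For anyone who wants to poke at this outside ColdFusion, a rough plain-Java equivalent of the three lines above might look like the sketch below. The class name, method name, and sample input are assumptions; only the regex comes from the post. Note the doubled backslash, which Java source needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonAsciiFinder {
    // Same regex as in the post, written as a Java string literal.
    static final Pattern NON_ASCII = Pattern.compile("\\P{InBasic_Latin}");

    // Collects every match of the pattern, using group(0) as in the original loop.
    static List<String> nonAsciiMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NON_ASCII.matcher(input);
        while (m.find()) {
            found.add(m.group(0));
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input with a degree sign (U+00B0) and an e-acute (U+00E9).
        System.out.println(nonAsciiMatches("25\u00B0C caf\u00E9"));
    }
}
```

    If this behaves in plain Java but the same input fails through createObject, that would point at the ColdFusion-to-Java boundary rather than the regex itself.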


  • mod

    @Onyx said:

    What happens if you feed it a NULL?

    It's not receiving a null in our simplest reproduction case



  • "...Then I asked tdwtf for help with my regular expressions. Now I have three problems."

    That sounds odd, but I can't picture what kind of code could cause the problem you described.

    All the usual starting points apply, though. I'm sure you've thought of or tried most of these, but The Sign of the Four rule always applies, so here's what I would try: Can you force a complete, guaranteed match with the right input? If you pass it directly to the match and skip the while loop, do you get the answer you expect? Is the result of match.find being lost somewhere, or replaced with the result of some other call before you look at it? Have you tried turning it off and on again? Are you really passing what you think you are passing to each function? Isn't match usually spelled with only one "t"? Can you step through the whole mess with a debugger (or even a redneck debugger made up entirely of printf statements and duck tape), and see where everything goes wrong?

    Something has to be happening somewhere.


  • Winner of the 2016 Presidential Election

    The CharSequence isn't annotated @NotNull so it should just return false.
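
    This is easy to check in plain Java. On the JDKs I have looked at, Pattern.matcher(null) throws a NullPointerException while constructing the Matcher, so find() is never reached. The sketch below (hypothetical class and method names) reports whichever of the two actually happens; how ColdFusion would pass a null across is a separate question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NullInputDemo {
    // Reports what happens when the matcher is created on a null input.
    static String matcherOnNull() {
        Pattern p = Pattern.compile("\\P{InBasic_Latin}");
        try {
            Matcher m = p.matcher(null);
            return "find() returned " + m.find();
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(matcherOnNull());
    }
}
```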


  • mod

    @DCRoss said:

    Something has to be happening somewhere.

    We have an intern looking into it now. I can reproduce with use cases that call just the function containing the regex with a short string of only special characters, so he's going to narrow down what characters cause the issue, because our best guess is Unicode Weirdness.



  • Where's your capturing group? I just see a pattern that matches, not captures.


  • mod

    Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().

    From the docs: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#group(int)



  • What happens when you just call group()? I assume the same thing? That method literally calls group(0)

        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
    ...

    /**
     * Returns the number of capturing groups in this matcher's pattern.
     * Group zero denotes the entire pattern by convention. It is not included in this count.
     * Any non-negative integer smaller than or equal to the value returned by this method
     * is guaranteed to be a valid group index for this matcher.
     *
     * @return The number of capturing groups in this matcher's pattern
     */
    public int groupCount() {
        return parentPattern.capturingGroupCount - 1;
    }
    

    :shrug:
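
    A quick sketch of what that range check means for a pattern with no capturing groups, like the one in this thread. The class and helper names are invented for the example: groupCount() is 0, group(0) is still valid, and only index 1 and up should trip the IndexOutOfBoundsException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCountDemo {
    // Number of capturing groups in the pattern (group 0 is not counted).
    static int captureGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Calls group(index) after a successful find() and reports the outcome.
    static String groupOrError(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        try {
            return "ok: " + m.group(index);
        } catch (IndexOutOfBoundsException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String regex = "\\P{InBasic_Latin}";
        String input = "25\u00B0C";
        System.out.println(captureGroups(regex));          // no capturing groups
        System.out.println(groupOrError(regex, input, 0)); // whole match
        System.out.println(groupOrError(regex, input, 1)); // out of range
    }
}
```

    Which is why a "no group 0" error after a successful find() looks so strange from the Java side.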


  • mod

    @JazzyJosh said:

    That method literally calls group(0)

    Was just returning to report.



  • Shouldn't that be \\P instead of \P?
    Is there another call to find/lookingAt/matches/etc. after find() and before group()?
    And why are some spaces around the backticks missing in the previous line?
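
    The reason for asking about intervening calls: the matcher only holds a match after the most recent find() succeeded. A small sketch (names are made up) of what group(0) does before any find(), after a successful one, and after a failed one. As far as I know the failure here is an IllegalStateException, which is a different error from the "no such group" one reported above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherStateDemo {
    // Calls group(0) in the matcher's current state and reports what happened.
    static String describeGroupCall(Matcher m) {
        try {
            return "ok: " + m.group(0);
        } catch (IllegalStateException e) {
            return "IllegalStateException";
        }
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("b").matcher("abc");
        System.out.println(describeGroupCall(m)); // no find() attempted yet
        m.find();                                 // succeeds on "b"
        System.out.println(describeGroupCall(m));
        m.find();                                 // fails, no second "b"
        System.out.println(describeGroupCall(m));
    }
}
```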


  • mod

    @fatbull said:

    ... after find() and before group()?

    while (regexVars.theMatcher.find()) {
        regexVars.charToEscape = regexVars.theMatcher.group(0);
        editedString = Replace(editedString, regexVars.charToEscape, "&##" & asc(regexVars.charToEscape) & ';');
    }
    

    @fatbull said:

    Shouldn't that be \\P instead of \P?

    I dunno. It was working for most of the replacements, that's all I know. It didn't throw any kind of invalid regex error.



  • @Yamikuronue said:

    We have an intern looking into it now.

    ... four problems now. :p


  • mod

    Weekly reminder: I don't develop anymore :) I've got my own shit to do.

    This is like the fifth task my coworker has tried to assign his intern, and each time, the intern comes back with a major " :wtf: " face, and it turns out, there's some platform architecture problem behind the seemingly trivial issue, and it actually needs his attention instead. He's in charge of platform maintenance, but to be fair, it was falling to bits before he got it, so he's still trying to find all the rough patches.



  • If that pattern holds true then my statement stands, albeit for different reasons :p



  • You're using Java? Now you have NullPointerException problems.


  • mod

    My problems are null? Sounds pretty sweet to me!



  • Is regexVars shared between requests?
    Is a debugger attached which calls find()?


  • mod

    @fatbull said:

    Is regexVars shared between requests?

    It's a local variable.

    @fatbull said:

    Is a debugger attached which calls find()?

    I doubt it. It fails when run by mxunit



  • Wait... Doesn't Replace search the string again? Why is the searched string modified in the loop? And are the double # really correct? What does this function actually do? I'm confused now.
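
    For comparison, here is how I would sketch the same job in plain Java without touching the searched string inside the loop. This is an assumption about what the function is meant to do (turn each non-ASCII character into a numeric character reference), not the actual ColdFusion code, and the names are invented. The matcher builds the output itself via appendReplacement and appendTail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonAsciiEscaper {
    static final Pattern NON_ASCII = Pattern.compile("\\P{InBasic_Latin}");

    // Replaces each non-ASCII character with a decimal numeric character reference.
    static String escapeNonAscii(String input) {
        Matcher m = NON_ASCII.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // codePointAt(0) gives the full code point even for surrogate pairs.
            int codePoint = m.group(0).codePointAt(0);
            // quoteReplacement guards against $ or backslash being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement("&#" + codePoint + ";"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("25\u00B0C")); // prints 25&#176;C
    }
}
```

    Doing it in one pass means the input the matcher is scanning never changes underneath it, and nothing gets searched twice.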



  • @Yamikuronue said:

    @fatbull said:
    Shouldn't that be \\P instead of \P?

    I dunno. It was working for most of the replacements, that's all I know. It didn't throw any kind of invalid regex error.

    It's right there in the Pattern documentation.

    Backslashes within string literals in Java source code are interpreted as required by The Java™ Language Specification as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6) It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.

    (I bolded the relevant section)
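
    A small Java sketch of the difference the docs describe, using a hypothetical helper named found. This applies to Java source files; as far as I know, ColdFusion string literals do not treat the backslash as an escape character, which would explain why the single-backslash version compiles there.

```java
import java.util.regex.Pattern;

final class BackslashDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Doubled backslash: the regex engine sees a word-boundary escape.
        System.out.println(found("\\bcat\\b", "a cat sat"));
        // Single backslash: the compiler turns it into a backspace character,
        // so the regex looks for a literal backspace followed by "cat".
        System.out.println(found("\bcat", "a cat sat"));
        // The backspace regex does match an input that contains a backspace.
        System.out.println(found("\b", "back\bspace"));
        // The doubled form holds one backslash at runtime: 17 characters in total.
        System.out.println("\\P{InBasic_Latin}".length());
    }
}
```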



  • NULL_NOT_FOUND


  • mod

    @fatbull said:

    And are the double # really correct?

    Coldfusion.

    @powerlord said:

    to protect them from interpretation by the Java bytecode compiler.

    Coldfusion.

    @powerlord said:

    leads to a compile-time error;

    I get no errors from that string, so I figure it must be okay :)

    This works for most of the unit tests, but fails on specific characters. You'll just have to trust me on that I guess >.>



  • @Yamikuronue said:

    specific characters

    What characters?


  • mod

    Dunno yet. One of them for sure is the degree symbol.


