The email address validation regexp



  • Ok, email address validation might look easy.
    But not when the RFC822 is thrust into your face.

    Take a look at this unholy creation.



  • I think this may reflect more poorly on regexps than it does on the RFC in question.



  • Yes, but one would normally use a regexp for email address validation.
    The RFC is the one that complicates matters here.



  • @byte_lancer said:

    Yes, but one would normally use a regexp for email address validation.
    The RFC is the one that complicates matters here.



    My point is that this is probably a case in which the use of a regex is inappropriate.

    "Some people, when confronted with a problem, think “I know, I’ll use regular
    expressions.” Now they have two problems."



  • Agreed.
    My point, though, is that RFCs sometimes contain legalese.



  • @merreborn said:

    My point is that this is probably a case in which the use of a regex is inappropriate.

    Since the RFC describes something that can be verified by a deterministic finite automaton (DFA), a regular expression is an appropriate way to express the pattern to verify against.  Wrapping such a beast inside a library so that it can be easily reused is appropriate.  With a sufficiently varied and large set of test addresses, you could be reasonably sure that it was correct.  Once completed, you'd never have to touch it again, unless the spec changed.

    That said, I wouldn't want to have to track down a typo in that as written, or extend it to do something it didn't before.

    I thought there were flags on Perl regular expressions that let you insert comments and non-significant whitespace to make them more readable and maintainable?
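
    Perl does have such a flag: the /x modifier ignores unescaped whitespace and allows # comments inside the pattern. Python's re.VERBOSE behaves the same way, so here is a minimal sketch in Python. The pattern is deliberately simplified and is NOT an RFC 822 implementation; the names SIMPLE_ADDRESS and looks_like_address are made up for illustration.

```python
import re

# Deliberately simplified address pattern, NOT an RFC 822 implementation.
# re.VERBOSE (like Perl's /x) ignores unescaped whitespace and allows comments.
SIMPLE_ADDRESS = re.compile(
    r"""
    [A-Za-z0-9._%+-]+        # local part: one or more permitted characters
    @                        # the separator
    [A-Za-z0-9-]+            # first domain label
    (?: \. [A-Za-z0-9-]+ )+  # one or more further dot-separated labels
    """,
    re.VERBOSE,
)


def looks_like_address(text: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return SIMPLE_ADDRESS.fullmatch(text) is not None
```

    Each piece carries its own comment, which is what makes a typo easier to track down than in the single-line form.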



  • The problem in this case is that the RFC for email addresses is just quite a bit more complicated than most people realize.

    I agree that following RFCs is good and I hate websites that reject perfectly valid addresses... but at some point you just have to say, "no, I will not allow comment fields in the middle of my email addresses."



  • @esd said:

    I agree that following RFCs is good and I hate websites that reject perfectly valid addresses... but at some point you just have to say, "no, I will not allow comment fields in the middle of my email addresses."
    I think you misunderstood.  The regular expression in question is for email addresses that have already had the comments ripped out!

    This regular expression will only validate addresses that have had any comments
    stripped and replaced with whitespace (this is done by the module).

    Looking again at the expression, much of it is repetitions of certain character classes.  Some regular expression libraries let you define your own character classes that you can reference by name in the regular expression.  That would shorten this one significantly.
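
    Even without library support for named classes, the same effect can be approximated by composing the pattern from named string fragments. A rough sketch in Python; ATOM_CHAR, ATOM, DOT_ATOM and ADDRESS are hypothetical, simplified building blocks and not the module's real definitions.

```python
import re

# Hypothetical, simplified building blocks (not the module's real definitions).
ATOM_CHAR = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
ATOM = rf"{ATOM_CHAR}+"
DOT_ATOM = rf"{ATOM}(?:\.{ATOM})*"

# The final pattern reads as "dot-atom @ dot-atom"; the character class is
# written once and reused by name instead of being repeated inline.
ADDRESS = re.compile(rf"{DOT_ATOM}@{DOT_ATOM}")


def matches(text: str) -> bool:
    """Return True if the whole string matches the composed pattern."""
    return ADDRESS.fullmatch(text) is not None
```

    The compiled pattern is still long, but the source that produces it stays short and each fragment can be tested on its own.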



  • They say you can see a schooner hidden in that regexp. All I see is a stupid boat.



  • @Thuktun said:

    ...That said, I wouldn't want to have to track down a typo in that as written, or extend it to do something it didn't before.
    There is also no opportunity to give a helpful error message, like, "Sorry, but more than one '@' is not allowed."  "Invalid e-mail address" is not much help.
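
    A hand-written checker can report each problem separately. The sketch below (Python) only illustrates the error-reporting idea: the checks are simplified assumptions, not RFC 822 rules (the RFC allows a quoted '@' in the local part, for instance), and check_address is a made-up name.

```python
def check_address(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found.

    Simplified checks for illustration only, not full RFC 822 validation.
    """
    problems = []
    at_count = text.count("@")
    if at_count == 0:
        problems.append("Missing '@'.")
    elif at_count > 1:
        problems.append("Sorry, but more than one '@' is not allowed.")
    else:
        # Exactly one '@', so the split yields exactly two parts.
        local, domain = text.split("@")
        if not local:
            problems.append("Nothing before the '@'.")
        if not domain:
            problems.append("Nothing after the '@'.")
    return problems
```

    A single monolithic regexp can only answer yes or no, whereas this returns a message per failed check.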



  • That is just the final form of a regexp built up in a sane way, with lots of comments.  Look at the source.  You can get the regex above with


    perl -M'Mail::RFC822::Address' -e 'print Mail::RFC822::Address->make_rfc822re();'

    The end result is a lot more impressive/brain damaging to look at though :)



  • I should have added that the source can be viewed here.



  • @Thuktun said:

    @merreborn said:
    My point is that this is probably a case in which the use of a regex is inappropriate.
    Since the RFC describes something that can be verified by a deterministic finite automaton (DFA), a regular expression is an appropriate way to express the pattern to verify against.


    Forgive me, my choice of words was poor.

    This is probably a case in which a non-regexp solution would be preferable, considering things mentioned above, such as readability, maintainability, and error-handling.

    You're right:  a regexp absolutely can do this job; it's in the class of problems that regexps were designed to solve.  However, that doesn't mean that alternative methods aren't preferable.  Similarly, you could do the same work in assembly; hell, you could almost certainly write such a program that would be faster than this regexp.  But you wouldn't, for the reasons mentioned above.

    Of course, in this case, the library's been written, and heavily tested.  However, were you to face this problem without this work already done, a non-regexp solution would probably be preferable.



  • Theoretically you could use regexps to recognize integers between 5 and 3976, but anyone with a little common sense wouldn't do such a thing.



  • @byte_lancer said:

    Yes, but one would normally use a regexp for email address validation.
    The RFC is the one that complicates matters here.

     

    And one would be wrong, since the domain side of an e-mail address can consist of an IP address, and how would one validate that the individual octets are correctly ranged?

    It's actually just easier to write this in code than to manage a hideous regular expression...of course, in saying "hideous regular expression," I repeat myself.
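
    For the octet ranges specifically, the in-code version is short. A minimal sketch in Python, assuming the domain arrives as a bracketed IPv4 literal such as [192.168.0.1]; is_ipv4_literal is a hypothetical helper name, and accepting leading zeros is a choice made here, not something taken from the RFC.

```python
def is_ipv4_literal(domain: str) -> bool:
    """Check a bracketed IPv4 domain literal such as "[192.168.0.1]"."""
    if not (domain.startswith("[") and domain.endswith("]")):
        return False
    octets = domain[1:-1].split(".")
    if len(octets) != 4:
        return False
    for octet in octets:
        # isascii() + isdigit() rejects empty strings, signs, spaces
        # and non-ASCII digits.
        if not (octet.isascii() and octet.isdigit()):
            return False
        # At most three digits, and the value must fit in one byte.
        if len(octet) > 3 or int(octet) > 255:
            return False
    return True
```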



  • @merreborn said:

    hell, you could almost certainly write such a program that would be faster than this regexp.

    Not necessarily.  If a regular expression is translated at compile-time into native matching code, I doubt you could hand-code anything that would run significantly faster.

    Regular expressions are almost always (if you know what you're doing) implemented by translating them to a non-deterministic finite automaton (NFA), transforming that into a deterministic finite automaton (DFA), and running that to validate the input.  The speed of this depends on how this transformation is done and how the DFA is run.

    • If the transformation is done at runtime, you pay the cost then, hopefully just once per instance of the program.  If you do this at compile time, it's done once, period.
    • If the regexp is translated directly into native code that efficiently implements the DFA, it will be very fast.  If the DFA is an in-memory graph of nodes that gets walked by (essentially) an interpreter, it will be much slower.

    I don't know enough about Perl to say what it does with its own code, much less with regular expression handling.  Someone with more details on the innards of Perl might know.
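
    To make the "in-memory graph walked by an interpreter" variant concrete, here is a tiny hand-built DFA in Python for the pattern [0-9]+(\.[0-9]+)? . It is only an illustration of the idea: the state names are made up, and it says nothing about how Perl's regexp engine works internally.

```python
# Hand-built DFA for the pattern [0-9]+(\.[0-9]+)? ; state names are made up.
def char_class(ch: str) -> str:
    """Map a character to the input symbol the DFA works with."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch == ".":
        return "dot"
    return "other"


# The transition table is the "graph of nodes"; missing entries are dead ends.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "need_frac",
    ("need_frac", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}


def dfa_accepts(text: str) -> bool:
    """Walk the table one character at a time; no backtracking is needed."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING
```

    Each input character costs one table lookup, so the run time is linear in the length of the input; the native-code variant would replace the lookups with compiled branches.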



  • @biziclop said:

    Theoretically you could use regexps to recognize integers between 5 and 3976, but anyone with a little common sense wouldn't do such a thing.

    If recognizing a string of arbitrary length, someone with a little common sense would try to convert the string into an integer before doing a couple of inequality comparisons.  Even if the language supported arbitrarily large integers, a regular expression should be faster at evaluating a string of several billion numerical characters, depending on the language.  An evil professor would assign the creation of that function as homework in a higher level class and fail students who missed that one.  I hope to be an evil professor someday - thanks for the idea!

     

    btw,

    [5-9]|([1-9][0-9]{1,2})|(3[1-8][0-9]{2})|(39[0-6][0-9])|(397[0-6])



  • @Oscar L said:


    ...

    btw,

    [5-9]|([1-9][0-9]{1,2})|(3[1-8][0-9]{2})|(39[0-6][0-9])|(397[0-6])



    Nitpicking:
    Did you leave 1000-2999 out on purpose?
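
    For reference, the quoted pattern also skips 3000-3099, and without anchors it can match part of a longer string. A possible corrected version, sketched in Python and checked by brute force against a plain integer comparison (IN_RANGE and in_range are names made up here):

```python
import re

# One alternative per block of the range 5..3976:
#   5-9 | 10-999 | 1000-2999 | 3000-3899 | 3900-3969 | 3970-3976
IN_RANGE = re.compile(
    r"[5-9]|[1-9][0-9]{1,2}|[12][0-9]{3}|3[0-8][0-9]{2}|39[0-6][0-9]|397[0-6]"
)


def in_range(text: str) -> bool:
    """True if text is a decimal integer from 5 to 3976 with no leading zeros."""
    # fullmatch anchors the pattern at both ends of the string.
    return IN_RANGE.fullmatch(text) is not None


# Brute-force comparison against ordinary integer bounds checking.
mismatches = [n for n in range(10000) if in_range(str(n)) != (5 <= n <= 3976)]
```

    The list of mismatches should come out empty; running the same check on the original pattern would flag the missing ranges.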


  • @Thuktun said:

    @merreborn said:
    hell, you could almost certainly write such a program that would be faster than this regexp.
    Not necessarily.  If a regular expression is translated at compile-time into native matching code, I doubt you could hand-code anything that would run significantly faster.

    Regular expressions are almost always (if you know what you're doing) implemented by translating them to a non-deterministic finite automaton (NFA), transforming that into a deterministic finite automaton (DFA), and running that to validate the input.  The speed of this depends on how this transformation is done and how the DFA is run.

    • If the transformation is done at runtime, you pay the cost then, hopefully just once per instance of the program.  If you do this at compile time, it's done once, period.
    • If the regexp is translated directly into native code that efficiently implements the DFA, it will be very fast.  If the DFA is an in-memory graph of nodes that gets walked by (essentially) an interpreter, it will be much slower.

    But it'll be a stone bitch to debug if it's like the one shown to validate an RFC822-compliant address.  Something that unreadable should be LISP by default.



  • You've just demonstrated why you should never use regexps if it's not absolutely necessary. You would make a great evil professor, because it's really evil to accept a buggy solution for a problem. :)

    Speed is almost never an issue in these cases, as they mostly involve parsing text input from forms, but readability and error-proneness are.

    Btw, I didn't suggest converting it into an integer. If I needed the number afterwards, I'd convert it but if I had several billion digits to check, I'd only check if all the characters are proper digits and check for bounds only if the length of the string equals the length of the upper or the lower bound. Anyway, I'd love to see you build a regexp like this for your billion-digit number.

    On second thought, I wouldn't.
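
    The length-based check described above could look roughly like this in Python. It is a sketch under stated assumptions: the bounds are canonical decimal strings with no leading zeros, the lower bound is not longer than the upper bound, and inputs with leading zeros are rejected. in_bounds is a made-up name.

```python
def in_bounds(text: str, low: str = "5", high: str = "3976") -> bool:
    """Check low <= value <= high without converting text to an integer.

    Assumes low and high are canonical decimal strings (no leading zeros)
    and that low is not longer than high.
    """
    # Only ASCII digits are accepted; this also rejects the empty string.
    if not (text.isascii() and text.isdigit()):
        return False
    # Reject leading zeros so that length reflects magnitude.
    if len(text) > 1 and text[0] == "0":
        return False
    if len(text) < len(low) or len(text) > len(high):
        return False
    # For digit strings of equal length, string order equals numeric order,
    # so the bounds only matter when the lengths coincide.
    if len(text) == len(low) and text < low:
        return False
    if len(text) == len(high) and text > high:
        return False
    return True
```

    The same function works unchanged for bounds with billions of digits, since it never builds an integer.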



  • @biziclop said:

    ...

    Speed is almost never an issue in these cases, as they mostly involve parsing text input from forms, but readability and error-proneness are.

    Btw, I didn't suggest converting it into an integer. If I needed the number afterwards, I'd convert it but if I had several billion digits to check, I'd only check if all the characters are proper digits and check for bounds only if the length of the string equals the length of the upper or the lower bound.

    ...



    That would be an awesome form. :)


  • @byte_lancer said:

    Ok, email address validation might look easy.
    But not when the RFC822 is thrust into your face.

    Take a look at this unholy creation.


    Hmm, something seems off about this.
    I seem to recall writing an RE that was much shorter than that, and which complied with the RFC... It was also a bit more readable.

    I'll look for it and post it next time I'm at work.

