How to validate an e-mail address the hard way?



  • From the same developer as my previous post, a really nice Java method to validate (well... sort of) an e-mail address.

        /**
         * Checks whether the e-mail address is correct
         *
         * @param email
         *            address to validate
         * @return true if it is valid, false otherwise
         */
        public static boolean isEmailValid(String email) {
            if (email == null) return false;
            email = email.toLowerCase().trim();
            if (!Tools.contains(email, "@")
                    || (Tools.contains(email, " ") || !Tools.contains(email, "."))) {
                return false;
            }
            Pattern p = Pattern.compile("^\\.|^\\@");
            Matcher m = p.matcher(email);
            if (m.find()) {
                // System.err.println("Email addresses don't start" + " with dots or @ signs.");
                return false;
            }
            // Checks for email addresses that start with
            // www. and prints a message if it does.
            p = Pattern.compile("^www\\.");
            m = p.matcher(email);
            if (m.find()) {
                // System.out.println("Email addresses don't start" + " with \"www.\", only web pages
                // do.");
                return false;
            }
            p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
            m = p.matcher(email);
            StringBuffer sb = new StringBuffer();
            boolean result = m.find();
            boolean deletedIllegalChars = false;
            while (result) {
                deletedIllegalChars = true;
                m.appendReplacement(sb, "");
                result = m.find();
            }
            // Add the last segment of input to the new String
            m.appendTail(sb);
            email = sb.toString();
            if (deletedIllegalChars) {
                // System.out.println("It contained incorrect characters" + " , such as spaces or
                // commas.");
                return false;
            }
            return true;
        }

    Regular expressions are difficult, I know ;-)
    Maybe it could become the reference implementation of the RFC



  • This makes me want to register www@gmail.com. Just 'cause.



  • @vax said:


    // Checks for email addresses that start with
    // www. and prints a message if it does.
    p = Pattern.compile("^www\\.");
    m = p.matcher(email);
    if (m.find()) {
    // System.out.println("Email addresses don't start" + " with \"www.\", only web pages
    // do.");
    return false;
    }

    AFAIK there is nothing stopping you from having an email address starting with www. Unusual maybe, but certainly not impossible.



  • @Welbog said:

    This makes me want to register www@gmail.com. Just 'cause.

    You can do this, but not www.Welbog@gmail.com. His pattern includes a dot... 
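
    For reference, a minimal sketch of what that check does on its own. The class and method names are made up for illustration; only the pattern is taken from the posted code.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the "^www\\." pattern from the posted method.
class WwwCheck {
    private static final Pattern STARTS_WITH_WWW = Pattern.compile("^www\\.");

    // True when the posted method would reject the address at this step.
    static boolean rejectedByWwwCheck(String email) {
        return STARTS_WITH_WWW.matcher(email).find();
    }

    public static void main(String[] args) {
        // No dot right after "www", so this one gets past the check.
        System.out.println(rejectedByWwwCheck("www@gmail.com"));
        // Starts with "www.", so the posted method returns false here.
        System.out.println(rejectedByWwwCheck("www.welbog@gmail.com"));
    }
}
```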



  • I stopped trying to understand the final part with the deletedIllegalChars variable.

    But I like how the error messages are broken in two strings:

      // System.err.println("Email addresses don't start" + " with dots or @ signs.");
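
    As far as I can tell, the final part builds a cleaned-up copy of the address, never uses it, and returns false if anything had to be removed. If that reading is right, the whole loop reduces to a single find() call. A sketch, with class and method names that are mine and not the original developer's:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class comparing the posted loop with a one-line equivalent.
class IllegalCharsCheck {
    // Same character class as in the posted method.
    private static final Pattern ILLEGAL = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");

    // The loop as posted, lightly condensed.
    static boolean originalLoop(String email) {
        Matcher m = ILLEGAL.matcher(email);
        StringBuffer sb = new StringBuffer();
        boolean deletedIllegalChars = false;
        while (m.find()) {
            deletedIllegalChars = true;
            m.appendReplacement(sb, "");
        }
        m.appendTail(sb);
        return !deletedIllegalChars;
    }

    // What it seems to boil down to: valid if no illegal character is found.
    static boolean simplified(String email) {
        return !ILLEGAL.matcher(email).find();
    }

    public static void main(String[] args) {
        String[] samples = {"john.doe@example.com", "john,doe@example.com", "o'neil@example.com"};
        for (String s : samples) {
            System.out.println(s + " -> " + originalLoop(s) + " / " + simplified(s));
        }
    }
}
```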



  • @vax said:

    @Welbog said:

    This makes me want to register www@gmail.com. Just 'cause.

    You can do this, but not www.Welbog@gmail.com. His pattern includes a dot...

    Oh, right. I guess I'm not totally awake yet. Monday morning, y'know.



  • You could have www.welbog@gmail.com.  Periods/dots are valid in email except they can't be the first or last character.  Technically this is a valid email address: www.blah.,!#$%&'*+-/=?^_`{|}~@gmail.com
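
    The dot rule can be sketched in a few lines. This is my reading of the unquoted ("dot-atom") local part in RFC 5322, not a full validator: it ignores quoted strings, comments and length limits, and the class and method names are made up for illustration. Note that, as I read the RFC, the comma is not in the allowed set, so the sample address above would probably need its local part quoted.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks an unquoted local part (the bit before the '@').
class LocalPartCheck {
    // Characters I believe RFC 5322 allows in an atom ("atext").
    private static final String ATEXT = "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]";

    // One or more atoms separated by single dots:
    // no leading dot, no trailing dot, no doubled dot.
    private static final Pattern DOT_ATOM =
            Pattern.compile(ATEXT + "+(?:\\." + ATEXT + "+)*");

    static boolean isDotAtom(String localPart) {
        return DOT_ATOM.matcher(localPart).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDotAtom("www.welbog")); // dots in the middle
        System.out.println(isDotAtom(".welbog"));    // leading dot
        System.out.println(isDotAtom("welbog."));    // trailing dot
        System.out.println(isDotAtom("wel..bog"));   // doubled dot
    }
}
```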



  • We all know the simple, clean way to validate email addresses is: "^[a-zA-Z][\w.-][a-zA-Z0-9]@[a-zA-Z0-9][\w.-][a-zA-Z0-9].[a-zA-Z][a-zA-Z.]*[a-zA-Z]$"

    What's he thinking?

    (PS: Yes, I'm making fun of regular expressions. Kinda.)



  • Doesn't this block all email addresses with a "." in the user part? (As an aside, my company email has a "." in it, and it's surprising how many websites reject it.)
     

    @code said:

        if (!Tools.contains(email, "@")
                || (Tools.contains(email, " ") || !Tools.contains(email, "."))) {
            return false;
        }


  • No, it only blocks addresses that don't contain any dot at all...
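
    To spell that out, here is the first check in isolation. Tools.contains isn't shown in the post, so this sketch assumes it behaves like String.contains; the class and method names are hypothetical.

```java
// Hypothetical stand-in for the first check of the posted method.
class FirstCheck {
    // Assumption: Tools.contains(s, x) is equivalent to s.contains(x).
    static boolean passesFirstCheck(String email) {
        return email.contains("@") && !email.contains(" ") && email.contains(".");
    }

    public static void main(String[] args) {
        // A dot in the user part is enough to satisfy the "contains a dot" test.
        System.out.println(passesFirstCheck("john.doe@example.com"));
        // No dot anywhere in the address, so this one is rejected.
        System.out.println(passesFirstCheck("john@localhost"));
    }
}
```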



  • @rbowes said:

    We all know the simple, clean way to validate email addresses is: "^[a-zA-Z][\w.-][a-zA-Z0-9]@[a-zA-Z0-9][\w.-][a-zA-Z0-9].[a-zA-Z][a-zA-Z.]*[a-zA-Z]$"

    What's he thinking?

    (PS: Yes, I'm making fun of regular expressions. Kinda.)

    Bah, that is, like, totally wrong.

    "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])" is much more accurate, for sure.

     



  • @vax said:

    @Welbog said:

    This makes me want to register www@gmail.com. Just 'cause.

    You can do this, but not www.Welbog@gmail.com. His pattern includes a dot... 

     

    There's absolutely nothing preventing you from registering www.welbog@gmail.com.  I have several email addresses with dots in them.



  • @halcyon said:

    @rbowes said:
    We all know the simple, clean way to validate email addresses is: "^[a-zA-Z][\w.-][a-zA-Z0-9]@[a-zA-Z0-9][\w.-][a-zA-Z0-9].[a-zA-Z][a-zA-Z.]*[a-zA-Z]$"

    What's he thinking?

    (PS: Yes, I'm making fun of regular expressions. Kinda.)

    Bah, that is, like, totally wrong.

    "(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])" is much more accurate, for sure.

     

     

    You're about 6k light.

    http://www.tuxick.net/docs/regex/chBB.html

     

    (and even that's not truly exhaustive...) 



  • @merreborn said:

    You're about 6k light.

    http://www.tuxick.net/docs/regex/chBB.html

    (and even that's not truly exhaustive...)

     

    This thing looks suspiciously like a Borg cube... 



  • @merreborn said:

    You're about 6k light.

    http://www.tuxick.net/docs/regex/chBB.html

     

    (and even that's not truly exhaustive...) 

    First one to convert that to BNF wins a holiday... in an asylum.



  • It's amazing how many sites reject email addresses with ' (apostrophe) in them.  Or think that the entire address is case insensitive.  And even start down the road of trying to accept all of the quote / comment formats (which are all sort of deprecated anyway).  But Sean.O'Sullivan@example.com is perfectly valid, and not the same thing as sean.o'sullivan@example.com either.  Though for the purposes of validation, doing a toLower() on it isn't going to hurt anything.  The truly irritating thing is some of the most common (in users) mail systems aren't case sensitive, so users don't know their addresses are case sensitive, either.  You can either handle it the pedantically correct way, or let the user log in with dummy@gmail.com when they registered with Dummy@gmail.com.  Either way will cause some amount of trouble.  I suppose it depends on the audience you're pitching to.
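
    One middle road, sketched here under my own assumptions: leave the local part exactly as typed and lowercase only the domain, since domain names are case-insensitive. The class and method names are hypothetical, and the sketch simply splits at the last '@' without trying to parse quoted local parts or comments.

```java
import java.util.Locale;

// Hypothetical helper: normalises only the part of an address that is safe to normalise.
class AddressNormalizer {
    // Lowercases the domain and keeps the local part untouched.
    static String normalize(String email) {
        int at = email.lastIndexOf('@');
        if (at < 0) {
            return email; // no '@' at all; leave it for the validator to reject
        }
        String local = email.substring(0, at);
        String domain = email.substring(at + 1).toLowerCase(Locale.ROOT);
        return local + "@" + domain;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Sean.O'Sullivan@Example.COM"));
    }
}
```

    Whether to also fold the local part when matching a login is the judgment call described in the post.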



  • @poochner said:

    It's amazing how many sites reject email addresses with ' (apostrophe) in them.  Or think that the entire address is case insensitive.  And even start down the road of trying to accept all of the quote / comment formats (which are all sort of deprecated anyway).  But Sean.O'Sullivan@example.com is perfectly valid, and not the same thing as sean.o'sullivan@example.com either.  Though for the purposes of validation, doing a toLower() on it isn't going to hurt anything.  The truly irritating thing is some of the most common (in users) mail systems aren't case sensitive, so users don't know their addresses are case sensitive, either.  You can either handle it the pedantically correct way, or let the user log in with dummy@gmail.com when they registered with Dummy@gmail.com.  Either way will cause some amount of trouble.  I suppose it depends on the audience you're pitching to.

     

    Why should any email address be case sensitive (morally speaking, ignoring any standards or history...).

     I mean, if I live at 742 Evergreen Terrace, shouldn't I get mail addressed to 742 EverGREen TerRaCe ?  Why not get email at my address (ned.flanders.SUCKS@compuglobalmeganet.com) when somebody sends it to Ned.Flanders.sucks@CompuGlobalMegaNet.com ?

     


     



  •         if (m.find()) {
    // System.out.println("Email addresses don't start" + " with \"www.\", only web pages
    // do.");
    return false;

     

     

    Account: www.Test.Q.Testerson.com@gmail.com

    Password: ProofOfConcept 

     

    Ha! 



  • @iAmNotACantalope said:

    Why should any email address be case sensitive (morally speaking, ignoring any standards or history...).

     I mean, if I live at 742 Evergreen Terrace, shouldn't I get mail addressed to 742 EverGREen TerRaCe ?  Why not get email at my address (ned.flanders.SUCKS@compuglobalmeganet.com) when somebody sends it to Ned.Flanders.sucks@CompuGlobalMegaNet.com ?



    Agreed.  I don't understand why anyone would think it's a good idea to have something with a domain name attached to it, but with validation rules that are almost the complete opposite of the ones for domains.  And why the hell would you want to include comments in an address?

    EDIT:  What the hell is with the word wrap?



  • Exactly!

    If you were to register Poochner@gmail.com and I went along and registered poochner@gmail.com, I would get a fair hunk of your mail from people who can't be bothered capitalising and don't realise it matters. Having case sensitivity in email addresses just leaves everybody open to identity theft.

     

    Kane Elson 



  • @Cap'n Steve said:

    @iAmNotACantalope said:

    Why should any email address be case sensitive (morally speaking, ignoring any standards or history...).

    I mean, if I live at 742 Evergreen Terrace, shouldn't I get mail addressed to 742 EverGREen TerRaCe ? Why not get email at my address (ned.flanders.SUCKS@compuglobalmeganet.com) when somebody sends it to Ned.Flanders.sucks@CompuGlobalMegaNet.com ?



    Agreed. I don't understand why anyone would think it's a good idea to have something with a domain name attached to it, but with validation rules that are almost the complete opposite of the ones for domains. And why the hell would you want to include comments in an address?

    This standard was written by people who thought it makes sense to make a commonly validated kind of string so complex that you need a finite state machine to properly validate it. We can be happy that they didn't define mail addresses as a Turing-complete language.



  • It reminds me of my pondering on why we have case-sensitive filenames on some operating systems. To me, the advantage of case-insensitive, case-preserving file systems is that they cater to my sense of aesthetics and pedantry (and in some cases, genuine accuracy) without requiring me to remember or care precisely what the capitalisation was when auto-completing the filename.

    I am guessing that case-sensitive file systems were introduced simply to improve look-up speed when scanning a directory for a record, since you can throw away mismatches a lot faster. The really fun part with case-insensitive file systems is when you go Unicode and have to start dealing with the depth of horror that is truly international case handling. (I have never tried this at the Windows command line).

    E-mail addresses simply look like twisted sadomasochism.

    The really fun one of course is whether or not accents are considered important in matching, especially in SQL ... For example, MySQL 5's full-text search won't find "vidéo" if I ask for "video", which sucks.
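
    On the application side, one common workaround is to strip combining marks before comparing. A minimal Java sketch using java.text.Normalizer; the class and method names are made up for illustration, and this says nothing about how MySQL itself should be configured.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for accent- and case-insensitive matching.
class AccentFolding {
    // Decomposes accented letters (NFD), then drops the combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "vid\u00e9o" is "video" with an e-acute, written as an escape
        // so the source file encoding doesn't matter.
        System.out.println(fold("vid\u00e9o").equals(fold("video")));
    }
}
```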



  • @Daniel Beardsmore said:

    To me, the advantage of case-insensitive, case-preserving file systems is that they cater to my sense of aesthetics and pedantry (and in some cases, genuine accuracy) without requiring me to remember or care precisely what the capitalisation was when auto-completing the filename.

    Case-preserving filesystems are a performance disaster. It is extremely difficult to determine rapidly whether or not a given filename exists, and you pay for it with extra storage.
     



  • @j6cubic said:

    @Cap'n Steve said:


    Agreed. I don't understand why anyone would think it's a good idea to have something with a domain name attached to it, but with validation rules that are almost the complete opposite of the ones for domains. And why the hell would you want to include comments in an address?

    This standard was written by people who thought it makes sense to make a commonly validated kind of string so complex that you need a finite state machine to properly validate it. We can be happy that they didn't define mail addresses as a Turing-complete language.

    It was not created in a vacuum - email was designed as a simpler replacement for existing, more complicated systems. They felt it necessary to attempt rough feature-compatibility with the earlier systems. When you start with the requirements of creating something that has account names, full names, common names, and comments, and not resorting to something like X.400 addresses, there aren't any simple solutions left. 



  • @asuffield said:

    Case-preserving filesystems are a performance disaster. It is extremely difficult to determine rapidly whether or not a given filename exists, and you pay for it with extra storage.

    File system design is never straightforward. For example, Apple took a bizarre route of putting all file records into a single balanced tree, with hierarchy done as per a database table. File search across a whole volume was far in advance of contemporary file systems. There are trade-offs but I was always glad to find mislaid files rapidly.

    Where things get screwy is when case sensitivity becomes *optional*, which is the route Apple took with Mac OS X. If you take advantage of the perverse ability of a case-sensitive file system to store files which differ only in case (something that will especially delight nerds), you'll have screwed yourself over when you try putting the files on a case-insensitive Mac volume :) (or a Windows PC, for that matter)

    I've never considered case preservation to be a performance issue at the desktop level, but it seems to be a good desktop/server split. Desktop PCs have all sorts of performance curiosities. For example, Windows will launch all your StartUp items simultaneously, which seems to decrease disc load performance and make the process take forever. I am biased though because, while my OS 9 Mac loads all my apps serially and extremely efficiently, software in those days *was* efficient and did load rapidly and use very little RAM, something I sorely miss.



  • Case-preserving filesystems are a performance disaster. It is extremely difficult to determine rapidly whether or not a given filename exists, and you pay for it with extra storage.

    If you're "paying for it with extra storage", that means you're storing a case-insensitive version along with the real filename, and thus not having any problem finding it rapidly if you index by the case-insensitive version. And the extra storage is negligible in any case, so I don't see the "disaster".
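
    A sketch of that idea: key the directory by a folded version of the name and store the real name as the value. The names here are hypothetical, and lowercasing with Locale.ROOT is a deliberate simplification that sidesteps proper Unicode case folding.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory "directory" that is case-insensitive but case-preserving.
class FoldedDirectory {
    // Folded name -> name exactly as the user typed it.
    private final Map<String, String> entries = new HashMap<>();

    private static String fold(String name) {
        return name.toLowerCase(Locale.ROOT); // simplification, not full Unicode case folding
    }

    // Returns false if a name differing only in case already exists.
    boolean create(String name) {
        return entries.putIfAbsent(fold(name), name) == null;
    }

    // Returns the stored spelling, or null if there is no such file.
    String lookup(String name) {
        return entries.get(fold(name));
    }

    public static void main(String[] args) {
        FoldedDirectory dir = new FoldedDirectory();
        dir.create("ReadMe.txt");
        System.out.println(dir.lookup("README.TXT")); // the stored spelling is preserved
        System.out.println(dir.create("readme.txt")); // clashes with the existing entry
    }
}
```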



  • @Random832 said:

    Case-preserving filesystems are a performance disaster. It is extremely
    difficult to determine rapidly whether or not a given filename exists,
    and you pay for it with extra storage.

    If you're "paying for it with extra storage", that means you're storing a case-insensitive version along with the real filename, and thus not having any problem finding it rapidly if you index by the case-insensitive version. And the extra storage is negligible in any case, so I don't see the "disaster".

    There is no method for constructing a case-insensitive version of an arbitrary Unicode string. (There is also a secondary problem: whether two strings match when disregarding case depends on the locale you are comparing them in.)
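
    The locale point is easy to demonstrate in Java, where the Turkish rules lowercase 'I' to a dotless i rather than a plain 'i'. A small sketch (the class and method names are mine):

```java
import java.util.Locale;

// Hypothetical demo: the same pair of strings matches or not depending on the locale.
class LocaleCaseDemo {
    static boolean sameWhenLowercased(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under the root locale, "FILE" lowercases to "file".
        System.out.println(sameWhenLowercased("FILE", "file", Locale.ROOT));
        // Under Turkish rules, the 'I' becomes a dotless i, so the strings differ.
        System.out.println(sameWhenLowercased("FILE", "file", turkish));
    }
}
```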

