The richness of language

Samus_

At least in human languages, a rich one allows you to say the same thing in many different ways... at least in human languages...

Today I was wandering around some community and came across a post from someone asking for an explanation of this JavaScript code:

(please don't complain about the indentation or wrapping, this is the way that community's software shows it, in fact I'm copying it directly from the page's source)

Well, I explained it and made my own version:

function emailCheck(email) {
// declare valid TLDs
var TLD = 'aero|biz|cat|com|coop|info|jobs|mobi|museum|name|net|org|pro|travel|gov|edu|mil|int';
var ccTLD = 'ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|' +
'ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|' +
'ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cx|cy|cz|' +
'de|dj|dk|dm|do|dz|' +
'ec|ee|eg|eh|er|es|et|eu|' +
'fi|fj|fk|fm|fo|fr|' +
'ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|' +
'hk|hm|hn|hr|ht|hu|' +
'id|ie|il|im|in|io|iq|ir|is|it|' +
'je|jm|jo|jp|' +
'ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|' +
'la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|' +
'ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|' +
'na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|' +
'om|' +
'pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|' +
'qa|' +
're|ro|rs|ru|rw|' +
'sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|' +
'tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|' +
'ua|ug|uk|um|us|uy|uz|' +
'va|vc|ve|vg|vi|vn|vu|' +
'wf|ws|' +
'ye|yt|yu|' +
'za|zm|zw';

// define 'email-like' regexp
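// (backslashes are doubled because inside a string literal '\.' collapses to '.', which matches any character)
var re = new RegExp('^[A-Z0-9._%-]+@[A-Z0-9-]+(\\.[A-Z0-9-]+)*(\\.(' + TLD + '|' + ccTLD + '))$', 'i');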

// trim parameter
email = email.toString().replace(/^\s*|\s*$/g, '');

return re.test(email);
}

I must admit I felt a bit like this, but it left me thinking how good it is to have lots of ways to do the same thing in programming languages...

Jivlain

*shudders in inhuman terror and fear*

*and then literally rolls on the floor laughing my flaming ass off for a while at the comic*

SpComb

Or just use a regexp to match the bit after the @ and then just use DNS to check that the domain exists. Better yet, send an email to that address asking for confirmation.

fennec

@SpComb said:

Or just use a regexp to match the bit after the @ and then just use DNS to check that the domain exists. Better yet, send an email to that address asking for confirmation.

You can't do that with JavaScript*.

It's an open question why this is being done in JavaScript to begin with, mind you, but...

(* Unless you have some sort of AJAX routine on your server there to help, or something like that. )
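
For what it's worth, here is a rough sketch of what the server-side half could look like, assuming the server happens to run Node.js (nothing in the thread says it does). The function names extractDomain and domainAcceptsMail are made up for illustration; the only library call is the built-in dns module's promise-based resolveMx. Treat it as a starting point rather than a finished check: a domain with no MX record can still receive mail through its plain address record, so this can reject working addresses.

```javascript
// Server-side sketch (Node.js assumed); uses only the built-in dns module.
const dns = require('node:dns').promises;

// Return the part after the last '@' in lower case, or null if there is none.
function extractDomain(email) {
  const match = /@([^@\s]+)$/.exec(String(email).trim());
  return match ? match[1].toLowerCase() : null;
}

// Resolve to true if the domain publishes at least one MX record.
// Any lookup failure (unknown domain, timeout, no records) counts as "no".
async function domainAcceptsMail(email) {
  const domain = extractDomain(email);
  if (domain === null) return false;
  try {
    const records = await dns.resolveMx(domain);
    return records.length > 0;
  } catch (err) {
    return false;
  }
}
```

The page would then call something like domainAcceptsMail through whatever AJAX endpoint the server exposes, and it still says nothing about whether the mailbox itself exists.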

m0ffx

Grr... thought there was a bug, then re-read. But the forum won't let me delete this post.

quamaretto

Looks like you aren't trying to eliminate all invalid email addresses, nor support all valid ones (! paths).

Under the circumstances, I would have just used "/^.+\@.+\..+$/". Leaves a bit to be desired, but stops the idiots who think gmail.com is an e-mail address. You also don't have to change it when there are new TLDs. :)
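
A quick sketch of how that pattern behaves in practice. The wrapper name looksLikeEmail is made up for illustration; the regex is the one above, unchanged.

```javascript
// Something, an @, something, a dot, something. Nothing more.
var LOOSE_EMAIL = /^.+\@.+\..+$/;

function looksLikeEmail(input) {
  return LOOSE_EMAIL.test(String(input).trim());
}

console.log(looksLikeEmail('gmail.com'));         // false: no @
console.log(looksLikeEmail('someone@gmail.com')); // true
console.log(looksLikeEmail('someone@localhost')); // false: no dot after the @
console.log(looksLikeEmail('a b@c d.e'));         // true: the pattern really is that permissive
```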

(Reminds me, I just did the TEST problem in Sphere Online Judge with a regex. Overkill, no? The solution was "awk -e '/^42$/{exit}{print}'".)

realmerlyn

But those are flawed! They don't match valid RFC822/2822 email. Don't use them.

Cap_n_Steve

At least he's thorough. I always just look for a dot followed by at least two letters and count that as a TLD. Of course if he was really thorough he would have used that page-long regexp posted here a while back.
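
Roughly what that check amounts to, as a sketch. The name hasPlausibleTld is invented here, and anchoring the match at the end of the string is an assumption about where "a dot followed by at least two letters" is supposed to sit.

```javascript
// True when the text ends in a dot followed by two or more letters.
function hasPlausibleTld(address) {
  return /\.[a-z]{2,}$/i.test(String(address).trim());
}

console.log(hasPlausibleTld('someone@example.museum')); // true
console.log(hasPlausibleTld('someone@example.c'));      // false: only one letter
console.log(hasPlausibleTld('someone@example'));        // false: no dot at all
```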

masklinn

@realmerlyn said:

But those are flawed! They don't match valid RFC822/2822 email. Don't use them.

On the other hand, RFC 822 and 2822 define many e-mail address schemes you're not likely to encounter (local network addresses and the like).

To validate e-mail addresses coming from the intarwebs, one only needs to match against a subset of the RFC822 address space.

Cthulhu

So the almost infinite majority of non-existing email addresses would get through, but a minor subset of those with invalid suffixes wouldn't. All this only at the cost of development, maintenance and risk. What a bargain.

RevEng

That's a great point. What is the point of validating email addresses using regexes? Especially client-side.

Here are the problems:

There are many possible email address schemes if you truly want to follow the RFCs. Last I saw, somebody made a regexp that could almost match them all, but it was several pages long and almost impossible to read.
You're only verifying that it could be an address according to the RFC. That doesn't mean that the email address actually exists or belongs to the person entering it.
If you're validating with JavaScript, anyone with a good reason to get past the check can disable JavaScript, edit the script locally, or just submit the form manually, so it is trivial to break.

I use a fake email address all the time. It is no@body.com. It's a completely valid email address (in fact, it's even a valid domain). Short of sending email there with a confirmation, there's little that one could do to disprove my rightful ownership of it.

The only real reason to verify it is for the user's sake (so they don't type in their home address instead of their email address), and that can be done simply and quickly by searching for an @. If you really need to confirm the legitimacy of their email address, only a confirmation email will accomplish that.
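
To make that concrete, here is a minimal sketch of the two halves: the cheap "is there an @" check for the user's sake, and a confirmation token for when legitimacy actually matters. It assumes a Node.js server and keeps pending confirmations in memory; hasAtSign, startConfirmation and finishConfirmation are hypothetical names, and actually sending the email is left out entirely.

```javascript
const crypto = require('node:crypto');

// Pending confirmations: token -> email. In-memory only, for illustration.
const pending = new Map();

// Cheap sanity check for the user's sake: is there an @ at all?
function hasAtSign(input) {
  return String(input).includes('@');
}

// Create a one-time token to embed in the confirmation link.
function startConfirmation(email) {
  if (!hasAtSign(email)) return null;
  const token = crypto.randomBytes(16).toString('hex');
  pending.set(token, email);
  return token; // a real system would now mail a link containing this token
}

// Called when the user follows the link; returns the confirmed address or null.
function finishConfirmation(token) {
  if (!pending.has(token)) return null;
  const email = pending.get(token);
  pending.delete(token); // tokens are single-use
  return email;
}
```

Only an address whose owner follows the link ever comes back out of finishConfirmation, which is the one thing no regex can tell you.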

Derrick_Pallas

That's why I wrote a page about regular expression email address validation.

@masklinn said:

@realmerlyn said:
But those are flawed! They don't match valid RFC822/2822 email. Don't use them.
On the other hand, RFC 822 and 2822 define many e-mail address schemes you're not likely to encounter (local network addresses and the like). To validate e-mail addresses coming from the intarwebs, one only needs to match against a subset of the RFC822 address space.

savar

@RevEng said:

There are many possible email address schemes if you truly want to follow the RFCs. Last I saw, somebody made a regexp that could almost match them all, but it was several pages long and almost impossible to read.

You're only verifying that it could be an address according to the RFC. That doesn't mean that the email address actually exists or belongs to the person entering it.

I also wonder, given how complex the RFC is, if there aren't mail servers which accept addresses that are NOT compliant.