Parsing email addresses

Erufael

I love how email addresses don't have to contain . but everyone always writes code that requires it.

My email address contains a . .

LaoC

@ben_lubar pope@va is a valid email address. Sure, there aren't many cases where something like this happens in practice, but it's totally valid.

va. doesn't have an MX record but of course they could. I like their NS records though, and I wonder if the guys running the backup NS are trolling.

lolwhat

@RaceProUK said in Parsing email addresses:

When I validate an email address, I use a very complex check: I check there are some characters, then an @, then some characters.

How well does it handle user@z̨̻̯̻̻͔a̻̲̭̩̟̘̤ͨ͑ͭ̿ͧ̒l̦̹̯̟̗̼̖ͭg̘͗̈̿o̖̔̑̾̓̐̄̊?

RaceProUK

@lolwhat There's some characters, then an @, then some characters, so it passes ;)
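For reference, the "some characters, then an @, then some characters" check can be sketched in a few lines of Python. This is only an illustration of the idea described above, not anyone's production code; the function name `looks_like_email` is made up for the example.

```python
def looks_like_email(value: str) -> bool:
    """Return True if value is 'something@something'.

    Hypothetical helper: splits on the LAST '@' and only checks that
    both sides are non-empty. Nothing else is validated.
    """
    local, sep, domain = value.rpartition("@")
    return bool(sep) and bool(local) and bool(domain)


if __name__ == "__main__":
    for candidate in ("pope@va", "user@example.com", "@example.com", "user@", "no-at-sign"):
        print(candidate, looks_like_email(candidate))
```

Dotless addresses such as pope@va pass, which is the point: the check deliberately leaves real verification to actually sending a mail.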

Edit: To show the difficulties of validating email addresses, this is a regex that enforces RFC1035:

\A(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\z

Bechemel

And that's why I don't bother :D

asdf

@RaceProUK said in Parsing email addresses:

To show the difficulties of validating email addresses, this is a regex that enforces RFC1035

That only proves that regular expressions are a shitty tool for more complex grammars and completely unmaintainable.

PleegWat

How do you prevent a user from entering multiple email addresses, separated by commas, and using you as a spam relay?
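One conservative way to approach that question, sketched in Python under stated assumptions: refuse any input containing list separators, angle brackets, whitespace or line breaks, and require exactly one @. The name `is_single_recipient` and the set of forbidden characters are choices made for this example. The check is intentionally stricter than the RFCs, so it rejects some technically valid quoted local parts such as "a,b"@example.com.

```python
# Characters that could separate several recipients or inject extra headers.
# Assumption for this sketch: rejecting them outright is acceptable.
FORBIDDEN = set(",;<>\r\n")


def is_single_recipient(value: str) -> bool:
    """Hypothetical guard: True only if value looks like ONE address."""
    if any(ch in FORBIDDEN or ch.isspace() for ch in value):
        return False
    # Exactly one '@' rules out "a@x.org b@y.org"-style lists without separators.
    return value.count("@") == 1


if __name__ == "__main__":
    print(is_single_recipient("user@example.com"))
    print(is_single_recipient("a@example.com,b@example.com"))
    print(is_single_recipient("a@example.com\nBcc: b@example.com"))
```

A guard like this is meant to sit in front of whatever mail library does the sending; it does not replace passing the address to that library as a single recipient value rather than splicing it into a header string.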

LaoC

@asdf said in Parsing email addresses:

@RaceProUK said in Parsing email addresses:

To show the difficulties of validating email addresses, this is a regex that enforces RFC1035

That only proves that regular expressions are a shitty tool for more complex grammars and completely unmaintainable.

Think of them as the assembler of parsing. The engines are highly optimized but they're easy to screw up. You can write the above example (which is still b0rken BTW as it allows leading dashes for instance) in a readable way by breaking it up into parts:

my $dn_chars = q{[a-z0-9!#$%&'*+/=?^_`{|}~-]+};  # q{} so that $% is not interpolated
my $domain_name = "$dn_chars (?: \\. $dn_chars )*";  # \\. keeps the dot escaped in the final regex
...
my $rfc1035 = qr/ \A (?: $domain_name | $mail_like | $ip_addr ) \z /x;

If you simply translate something like the grammar from RFC 1035 to this form without much thinking, that's less work than writing a single RE and at the same time it stays readable.
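The same build-it-from-named-parts approach, sketched in Python rather than Perl so it can be run as-is. This version translates only the host name part of the RFC 1035 grammar (a label is a letter, optionally followed by letters, digits or hyphens, and ending in a letter or digit), so leading and trailing dashes are rejected. The constant names and `is_rfc1035_name` are made up for the example, and the 63-octet label limit is not enforced.

```python
import re

# Building blocks, named after the RFC 1035 grammar productions.
LETTER = r"[a-z]"
LET_DIG = r"[a-z0-9]"
LET_DIG_HYP = r"[a-z0-9-]"

# <label> ::= <letter> [ [ <ldh-str> ] <let-dig> ]
LABEL = rf"{LETTER} (?: {LET_DIG_HYP}* {LET_DIG} )?"

# <subdomain> ::= <label> | <subdomain> "." <label>
SUBDOMAIN = rf"{LABEL} (?: \. {LABEL} )*"

# re.VERBOSE ignores the spaces used for readability, like Perl's /x.
DOMAIN_RE = re.compile(rf"\A {SUBDOMAIN} \Z", re.VERBOSE | re.IGNORECASE)


def is_rfc1035_name(name: str) -> bool:
    """Hypothetical helper: True if name matches the sketch grammar above."""
    return DOMAIN_RE.match(name) is not None


if __name__ == "__main__":
    for name in ("va", "example.com", "-example.com", "example-.com"):
        print(name, is_rfc1035_name(name))
```

Because the grammar is followed literally, labels starting with a digit are rejected too, even though later RFCs relax that rule.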

masonwheeler

@LaoC said in Parsing email addresses:

Which I find hard to believe but then again I don't have much of a clue about C++.

That's OK. Neither does anyone else, including the C++ standardization committee.