Parsing email addresses
-
(oh god)
I need to parse a list of addresses into unique entities (so a user inputted line can be turned into a bunch of "pills"). I looked at our current code and choked.
CString curId = id.Tokenize(_T(",; "), pos);
Yeah, that works well with a copy/pasted line from gmail like
"user 1" <user1@example.com>, "last, first" <user2@example.com>
I'm looking for some pointers to code examples (for C++) - my (admittedly fast, not very thorough) search didn't find much except for regexes.
-
So do you want just a list of email addresses, discarding the names? Do you want valid email addresses or a subset?
-
@LB_ said in Parsing email addresses:
So do you want just a list of email addresses, discarding the names? Do you want valid email addresses or a subset?
Looks like a subset. (The server does have a pattern we have to match - but not all of those valid addresses pass)
It looks like the biggest thing will be stripping all the comments out.
-
What formats do you have them in? Is it always
"sender name" <sender@emailaddre.ss>
? If it is, a regular expression is your friend. Maybe something like this (untested):
"(.*?)" \<(.*?@(?:.*?\.)+\w+)\>
-
-
@TimeBandit bah, humbug.
-
@AlexMedia said in Parsing email addresses:
What formats do you have them in,
LOL - format. User typed. Or copy pasted from unknown source.
-
@dcon So, not even the brackets are a given?
-
@AlexMedia Right. Though what we'll typically see is a user types "name@domain.tld" or copy/paste from something like gmail with the bracket format.
-
@dcon
Is this plain C++ or is it managed C++? If it is managed, you can use the System.Net.Mail.MailAddress class and pass the string in as a parameter when constructing the object.
-
Right, that's pretty shitty.
How about this?
(\w.*?@(?:.*?\.)+\w+)
It should capture any email address, including ones with dots or plus signs in the local part as well as multi-level subdomains.
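In plain C++ a search like this could be driven with std::regex. A rough sketch only: the pattern below is a narrower stand-in (a fixed set of allowed characters) rather than the expression above, and findAddressLikeTokens is a made-up name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every substring that looks like local@domain.tld.
// The character classes are an assumption, not the RFC grammar.
std::vector<std::string> findAddressLikeTokens(const std::string& line) {
    static const std::regex pattern(
        R"([A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)");
    std::vector<std::string> found;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

It has no idea about quoting, so an address sitting inside a quoted display name gets picked up like any other.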
-
@lucas1 said in Parsing email addresses:
@dcon is this plain C++ or is it managed C++?
Plain. Old school win32.
-
@dcon Oh okay. I solved the same problem in C# last week.
-
@dcon You could search for every @ in the string and expand outward from them along valid characters.
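Roughly what that could look like, as a sketch. The sets of "valid characters" here are an assumption (the RFC allows far more), and expandFromAtSigns, isLocalChar and isDomainChar are made-up names.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Assumed character sets; the real rules are much more permissive.
static bool isLocalChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) ||
           c == '.' || c == '_' || c == '+' || c == '-';
}
static bool isDomainChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '.' || c == '-';
}

std::vector<std::string> expandFromAtSigns(const std::string& text) {
    std::vector<std::string> found;
    std::size_t at = text.find('@');
    while (at != std::string::npos) {
        std::size_t begin = at;  // walk left over the local part
        while (begin > 0 && isLocalChar(text[begin - 1])) --begin;
        std::size_t end = at + 1;  // walk right over the domain
        while (end < text.size() && isDomainChar(text[end])) ++end;
        while (end > at + 1 && text[end - 1] == '.') --end;  // drop trailing dots
        if (begin < at && end > at + 1) {  // need something on both sides
            found.push_back(text.substr(begin, end - begin));
        }
        at = text.find('@', at + 1);
    }
    return found;
}
```

An @ with nothing usable on one side is simply skipped.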
-
@LB_ said in Parsing email addresses:
@dcon You could search for every @ in the string and expand outward from them along valid characters.
user(this @ is valid)@example.com
"user@home.com" <user@example.com>
@AlexMedia, that regex would pick up the user@home.com too, wouldn't it?
edit: tho in that case, I could run it thru a sanitizer that simply removes the "string"s. Anyone see a problem with that?
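A sanitizer like that is only a few lines. Rough sketch: stripQuotedStrings is a made-up name, backslash-escaped quotes are not handled, and an unterminated quote swallows the rest of the line.

```cpp
#include <string>

// Remove every "..." run, quote marks included.
std::string stripQuotedStrings(const std::string& line) {
    std::string out;
    bool inQuotes = false;
    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;  // toggle on every double quote
        } else if (!inQuotes) {
            out += c;              // keep only text outside quotes
        }
    }
    return out;
}
```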
-
@dcon In the first case, the first @ would be ignored as being an invalid address and then be gobbled by the second @. In the second example, your edit seems like a good solution.
-
It would, yeah. But how often do you see people using quotes and parentheses in their email addresses?
I have never seen that in practical use, despite it being allowed according to the RFCs.
-
@AlexMedia copypasting from the To: field of some email services would generally net you the "name" <address> syntax, wouldn't it?
-
@AlexMedia said in Parsing email addresses:
But how often do you see people using quotes and parentheses in their email addresses?
Parens? Never. Quotes? Every email program I've used. Oh wait, that's usually more like name last <user@example.com>.
-
@LB_ If you copy/paste from Outlook, yes. Although I don't think Outlook adds quotes.
-
@dcon said in Parsing email addresses:
@AlexMedia said in Parsing email addresses:
But how often do you see people using quotes and parentheses in their email addresses?
Parens? Never. Quotes? Every email program I've used. Oh wait, that's usually more like name last <user@example.com>.
Just playing with pasting into gmail's To box. It pasted with both quotes and no quotes correctly.
I think the place I've usually seen quotes is when the mail comes in as "last, first" <user@example.com> - the comma requires quotes.
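Which is exactly where tokenizing on ",; " falls over. A sketch of splitting on separators only when outside quotes; splitOutsideQuotes is a made-up name, spaces are deliberately not treated as separators, and escaped quotes are ignored.

```cpp
#include <string>
#include <vector>

// Split on ',' or ';' unless the separator sits inside a "..." run.
std::vector<std::string> splitOutsideQuotes(const std::string& line) {
    std::vector<std::string> entries;
    std::string current;
    bool inQuotes = false;
    auto flush = [&]() {
        // Trim surrounding blanks and skip entries that are empty.
        std::size_t first = current.find_first_not_of(" \t");
        if (first != std::string::npos) {
            std::size_t last = current.find_last_not_of(" \t");
            entries.push_back(current.substr(first, last - first + 1));
        }
        current.clear();
    };
    for (char c : line) {
        if (c == '"') inQuotes = !inQuotes;
        if (!inQuotes && (c == ',' || c == ';')) {
            flush();
        } else {
            current += c;
        }
    }
    flush();
    return entries;
}
```

Each entry still carries its display name, so pulling the address out of the angle brackets is a separate step.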
-
-
@lucas1 said in Parsing email addresses:
Hmmm. That could be interesting... Just have to translate to C++...
-
@dcon Pretty much my thoughts. I am sure this is the port of the same .NET 4.6 code I was using.
-
@lucas1
I want to upvote this more than once for being the only suggestion involving an actual parser instead of a broken regex. Also, parsing the address in reverse (as this solution does) actually sounds like a good idea to reduce complexity.
-
@asdf It took me quite a few read-throughs to get it. But yeah, it is a nice way of doing it.
-
@dcon said in Parsing email addresses:
@LB_ said in Parsing email addresses:
@dcon You could search for every @ in the string and expand outward from them along valid characters.
user(this @ is valid)@example.com
"user@home.com" <user@example.com>
@AlexMedia, that regex would pick up the user@home.com too, wouldn't it?
edit: tho in that case, I could run it thru a sanitizer that simply removes the "string"s. Anyone see a problem with that?
How about this way: split the string on spaces, then discard every element that either contains no @, or does contain " (even if it also has @ in it), plus, of course, every element that has characters which aren't valid in an email address anyway.
In case you expect people to do things like
"Wil.i.@m" <whatever@example.com>
you could also check whether there's, say, at least one dot after the @, and if not, discard the element as well.
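A sketch of that filter, under a couple of assumptions: angle brackets and trailing list punctuation are peeled off each element first (otherwise the bracketed address itself would be thrown out), and the check for otherwise invalid characters is left out. filterTokens is a made-up name.

```cpp
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> filterTokens(const std::string& line) {
    std::vector<std::string> kept;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {  // split on whitespace
        // Assumption: strip trailing ',', ';', '>' and a leading '<'.
        while (!token.empty() && (token.back() == ',' ||
                                  token.back() == ';' ||
                                  token.back() == '>')) {
            token.pop_back();
        }
        if (!token.empty() && token.front() == '<') token.erase(0, 1);

        std::size_t at = token.find('@');
        if (at == std::string::npos) continue;               // no @ at all
        if (token.find('"') != std::string::npos) continue;  // quoted name part
        if (token.find('.', at + 1) == std::string::npos) continue;  // no dot after @
        kept.push_back(token);
    }
    return kept;
}
```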
-
@Gurth said in Parsing email addresses:
How about this way: split the string on spaces, then discard every element that either contains no @, or does contain " (even if it also has @ in it), plus, of course, every element that has characters which aren't valid in an email address anyway.
In case you expect people to do things like
"Wil.i.@m" <whatever@example.com>
you could also check whether there's, say, at least one dot after the @, and if not, discard the element as well.
"Donald 'reallycoolguy@whitehouse.biz' Trump" <saladhater420@mail.ru>
-
@ben_lubar said in Parsing email addresses:
"Donald 'reallycoolguy@whitehouse.biz' Trump" <saladhater420@mail.ru>
Is a single quote mark a legal character in the actual address? If not, my suggestion still works (yes, I know, until someone does
"Donald ' reallycoolguy@whitehouse.biz ' Trump" <saladhater420@mail.ru>
). So combine with @dcon's earlier suggestion: remove everything in quote marks.
-
-
@masonwheeler said in Parsing email addresses:
This.
Please for the love of $DEITY, guys, this should be fucking obvious: don't write your own parsers for email addresses! If you do it correctly, you'll waste lots of work, and if you don't they'll suck. If you really cannot find a parser for your language, you can be almost 100% correct (save for rather pathological cases that have comments in their address) with the regexp used by Perl's Mail::RFC822::Address. It's not perfect but at least it won't have me sending you angry comments about stupidly rejecting my plus-extension addressing.
-
@LaoC said in Parsing email addresses:
@masonwheeler said in Parsing email addresses:
This.
Please for the love of $DEITY, guys, this should be fucking obvious: don't write your own parsers for email addresses! SEND A DAMN EMAIL TO VALIDATE EMAIL ADDRESSES!
Fixed that for you.
the ONLY way to validate that what you have been provided is a valid emailbox that can receive email is to ACTUALLY SEND THE EMAIL so get off your tail and send the email!
-
Don't use a regex
Just send an email and then
Check if it bounces
-
@accalia yeah, sending the email is going to be really easy if you don't know the address.
This topic was about parsing email address lists last time I checked.
-
@ben_lubar said in Parsing email addresses:
@accalia yeah, sending the email is going to be really easy if you don't know the address.
This topic was about parsing email address lists last time I checked.
-
@accalia after THREADS ARE FREE became a thing, I was no longer able to post JUST TO RECAP due to the reduced topic drift.
-
@LaoC I posted a parser that is written by the .NET team. Porting it I think is the best option.
-
@lucas1 said in Parsing email addresses:
@LaoC I posted a parser that is written by the .NET team. Porting it I think is the best option.
After working thru C#->C++ differences, fixing syntax errors, and removing some unused functions (yea! didn't have to figure out what to do with StringBuilder), it compiled. Tested it - OMG - it. just. worked.
Now we'll be able to translate a list of pasted addresses into "pills" (like how various emailers work).
It's not how I would structure things: the classes are composed only of static methods, some of which are private. I debated making those class-private-static functions file-static, but decided against it. It makes keeping my translation in sync with the base C# code easier...
-
@dcon StringBuilder is used mostly because strings are immutable in C# so you don't waste memory when concatenating. I have no idea with C++.
-
@lucas1 said in Parsing email addresses:
@dcon StringBuilder is used mostly because strings are immutable in C# so you don't waste memory when concatenating. I have no idea with C++.
I know what it is - I just wasn't sure how I was going to translate it - std::stringstream? just be inefficient and pass a CString? Luckily those functions weren't used, so I didn't have to figure it out!
-
@dcon tbh unless you are building up massive strings it doesn't matter.
-
I love how email addresses don't have to contain a "." but everyone always writes code that requires it.
-
@dcon said in Parsing email addresses:
didn't have to figure out what to do with StringBuilder
std::ostringstream
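For instance, something along these lines. A sketch only: joinAddresses is a made-up helper, not anything from the .NET parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Build one string out of many pieces, the way StringBuilder.Append would.
std::string joinAddresses(const std::vector<std::string>& addresses,
                          const std::string& separator) {
    std::ostringstream builder;  // plays the role of StringBuilder
    for (std::size_t i = 0; i < addresses.size(); ++i) {
        if (i != 0) builder << separator;
        builder << addresses[i];
    }
    return builder.str();  // roughly StringBuilder.ToString()
}
```

For short strings, plain std::string with += does the same job.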
-
@Magus said in Parsing email addresses:
I love how email addresses don't have to contain a "." but everyone always writes code that requires it.
Are there any top level domains with MX records? I'm pretty sure IPv6 addresses can't stand in for domains, but I've been wrong before.
-
@ben_lubar I think it's more about the fact that, say, user@localhost is valid.
-
@accalia said in Parsing email addresses:
Please for the love of $DEITY, guys, this should be fucking obvious: don't write your own parsers for email addresses! SEND A DAMN EMAIL TO VALIDATE EMAIL ADDRESSES!
Fixed that for you.
the ONLY way to validate that what you have been provided is a valid emailbox that can receive email is to ACTUALLY SEND THE EMAIL so get off your tail and send the email!
Depends on your application. You certainly don't want to do that if you have a list of a million addresses to check. And if it's a single one or a single user, it would still be nice to give them some feedback that what they just entered in the email field can't possibly be an email address rather than saying "I sent you an email, wait for it" and have them twiddle their thumbs and curse the provider's spam filter while your MTA bounces the verification mail to postmaster just because they entered their credit card number in the wrong form field.
-
IMO the best course of action is to first reject unusual emails and say in the error message "if you are sure this email address is correct, submit the form again and we'll accept it this time".
-
When I validate an email address, I use a very complex check: I check there are some characters, then an @, then some characters.
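In C++ that check comes out as roughly this (looksLikeAddress is a made-up name, and it keys on the first @ only):

```cpp
#include <string>

// True when there is at least one character before the first '@'
// and at least one character after it.
bool looksLikeAddress(const std::string& s) {
    std::size_t at = s.find('@');
    return at != std::string::npos && at > 0 && at + 1 < s.size();
}
```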
-
@ben_lubar pope@va is a valid email address. Sure, there aren't many cases where something like this happens in practice, but it's totally valid.
-
@lucas1 said in Parsing email addresses:
@LaoC I posted a parser that is written by the .NET team. Porting it I think is the best option.
If nobody has done such a thing for C++ yet, that's indeed a good one. Which I find hard to believe but then again I don't have much of a clue about C++.