Validate email addresses using a regex [duplicate]

This question already has answers here:
How can I validate an email address using a regular expression?
(79 answers)
Closed 7 years ago.
I am trying to validate email addresses using a regex. This is what I have now:
^([\w-.]+)@([\w-]+).(aero|asia|be|biz|com.ar|co.in|co.jp|co.kr|co.sg|com|com.ar|com.mx|com.sg|com.ph|co.uk|coop|de|edu|fr|gov|in|info|jobs|mil|mobi|museum|name|net|net.mx|org|ru)*$
I found many solutions using non-capturing groups but did not know why. Also, can you tell me whether this is the correct regex, and whether there are any valid emails that it rejects or invalid ones that it accepts?

Don’t bother; there are many ways to validate an email address. Now that internationalized domain names exist, there’s no point in listing TLDs. On the other hand, if you want to limit your acceptance to only a selection of domains, you’re on the right track. Regarding your regex:
You have to escape dots so they become literals: . matches almost anything, \. matches “.”
In the domain part, you use [\w-] (without dot) which won’t work for “@mail.example.com”.
You probably should take a look at the duplicate answer.
This article shows you a monstrous, yet RFC 5322 compliant regular expression, but also tells you not to use it.
I like this one: /^.+@.+\...+$/ It tests for one or more characters, an at sign, one or more characters, a dot, and at least two more characters. This will suffice to check the general format of an entered email address. In all likelihood, users will make typing errors that are impossible to prevent, like typing john@hotmil.com. He won’t get your mail, but you successfully validated his address format.
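A minimal Python sketch of that loose format check, written with a literal @. The helper name looks_like_email is made up for illustration, and the check only confirms the general shape; it says nothing about whether the mailbox exists.

```python
import re

# Loose shape check: something, an at sign, something, a dot,
# and then at least two more characters.
LOOSE_EMAIL = re.compile(r"^.+@.+\...+$")


def looks_like_email(text):
    """Return True if text has the rough shape of an email address."""
    return LOOSE_EMAIL.match(text) is not None


# A mistyped domain still passes, because only the format is checked.
print(looks_like_email("john@hotmil.com"))
# No at sign, so this one fails.
print(looks_like_email("john.example.com"))
```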
In response to your comment: if you use a non-capturing group by using (?:…) instead of (…), the match won’t be captured. For instance, all email addresses have an at sign, so you don’t need to capture it. Hence, (john)(?:@)(example\.com) will provide the name and the server, not the at sign. Non-capturing groups are a general regex feature; they have nothing to do with email validation.
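To make the difference concrete, here is a small Python sketch that runs the same pattern once with a capturing group around the at sign and once with a non-capturing one. The variable names are only illustrative.

```python
import re

address = "john@example.com"

# Every parenthesised group is captured, including the at sign.
capturing = re.match(r"(john)(@)(example\.com)", address)

# (?:...) groups without capturing, so the at sign is left out.
non_capturing = re.match(r"(john)(?:@)(example\.com)", address)

print(capturing.groups())      # ('john', '@', 'example.com')
print(non_capturing.groups())  # ('john', 'example.com')
```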

Related

Improve exim regex to catch everything but specified addresses

I'm using this regex to catch any incoming e-mails excluding mails from specific people.
^(.(?!(zulgrib@exemple.com|zulgrib@example.org)).)*$/i
This regex correctly lets through these scenarios:
Zulgrib at example.com <Zulgrib@example.com>
<Zulgrib@example.com>
<Zulgrib@example.com> In behalf of Robot
The regex correctly catches these kinds of headers:
Associate@example.org
Your Associate Associate@example.com
If an excluded e-mail address is alone, it will catch it; I would like to prevent that. Example:
zulgrib@exemple.org
What should be modified to allow this to work, and why is my current method not correct?
If I understand the documentation, . matches any character, void is not a character, but using * is not working.
First, some issues in your current regex:
exemple has a different spelling than example
Literal dots need to be escaped. So \.com instead of .com.
There are two dots (.) in the outermost group, which means you only capture text with an even number of characters, and don't exclude the case where the email addresses start at the beginning of the string. The first dot should not be there.
To make an exception for when the email address is the only thing in the input, I fear you'll have to specify that as a separate alternative in which (unfortunately) you'll have to repeat those email addresses:
^(?:zulgrib@example\.com|zulgrib@example\.org)$|^(?!(?:.*(?:zulgrib@example\.com|zulgrib@example\.org))).*$
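As a rough check of how the two alternatives behave, here is a sketch using Python's re module rather than exim itself, on the assumption that both engines treat these constructs the same way. The names BLOCKED and PATTERN are invented for the example; building the pattern from one string avoids typing the addresses twice.

```python
import re

# The excluded addresses, written once with a literal @.
BLOCKED = r"(?:zulgrib@example\.com|zulgrib@example\.org)"

# First alternative: the excluded address on its own.
# Second alternative: any line that does not contain an excluded address.
PATTERN = re.compile(rf"^{BLOCKED}$|^(?!(?:.*{BLOCKED})).*$", re.IGNORECASE)

samples = [
    "zulgrib@example.org",
    "<Zulgrib@example.com>",
    "Zulgrib at example.com <Zulgrib@example.com>",
    "Associate@example.org",
    "Your Associate Associate@example.com",
]

for sample in samples:
    print(repr(sample), bool(PATTERN.match(sample)))
```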

How to write a conditional in regex

I have the following line of regex (JavaScript):
/^[a-z0-9_.\-]+@(yahoo|gmail|excite})\.com$/
However, I am unsure of how to make this include subdomains (if one is present).
So this expression should match uk.yahoo.com and yahoo.com email addresses as well. How can this be done?
Well, if you want just the subdomain uk.yahoo.com:
/^[a-z0-9_.\-]+@((?:uk\.)?yahoo|gmail|excite)\.com$/
The addition of (?:uk\.)? specifies an optional non-capturing group that matches either 0 or 1 occurrence of the pattern "uk.".
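The same pattern can be tried out in Python, whose re module handles these constructs like JavaScript does. This is only a sketch; the sample addresses are made up, and the pattern is written with a literal @.

```python
import re

# (?:uk\.)? makes the "uk." prefix optional, but only in front of yahoo.
PATTERN = re.compile(r"^[a-z0-9_.\-]+@((?:uk\.)?yahoo|gmail|excite)\.com$")

for address in [
    "bob@yahoo.com",
    "bob@uk.yahoo.com",
    "bob@gmail.com",
    "bob@uk.gmail.com",  # rejected: the optional prefix belongs to yahoo only
    "bob@fr.yahoo.com",  # rejected: only the "uk." subdomain is allowed
]:
    print(address, bool(PATTERN.match(address)))
```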
However, using regexes to validate email addresses is an awful idea. RFC2822 is a very complex standard. It's much better to blindly send an email to whatever minimally-validated address the user enters, fail early, and give the user a chance to correct the mistake.

Regular expression - for email spam filtering, match email address variants other than the original

I am an email spam quarantine administrator and I can write regular expression rules to block email messages. There is a common class of email spam hitting our domain in which the username of one of our email addresses is spoofed in front of some other domain.
For example, suppose my email address is jwclark@domain.com. In that case, spammers are writing to me from all kinds of other domains that start with my username such as:
jwclark1234@whatever.com
jwclark@wrongdomain.com
jwclark@a.domain.com
How can I write a regular expression rule to match everything including jwclark and any wildcards, but not match the original jwclark@domain.com? I would like a regex that matches everything above except for my actual example email address jwclark@domain.com.
I've made this regexp:
^jwclark.*[@](?!domain\.com).*$
It's in JavaScript format, but it should be easy to adapt to PHP or something else.
Given the nature of your problem, you might be better off making a regex builder function that makes the proper regexp for you, given the parameters.
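One possible shape for such a builder, sketched in Python. The function name build_spoof_pattern and its parameters are hypothetical; it simply fills the username and the legitimate domain into the pattern given above, escaping both so that dots are treated literally.

```python
import re


def build_spoof_pattern(username, domain):
    """Build a regex that matches addresses starting with username whose
    part after the at sign does not begin with the legitimate domain."""
    return re.compile(
        "^" + re.escape(username) + ".*[@](?!" + re.escape(domain) + ").*$"
    )


pattern = build_spoof_pattern("jwclark", "domain.com")

for sender in [
    "jwclark1234@whatever.com",
    "jwclark@wrongdomain.com",
    "jwclark@a.domain.com",
    "jwclark@domain.com",  # the real address, which must not match
]:
    print(sender, bool(pattern.match(sender)))
```

Like the hand-written pattern, this only looks at how the domain begins, so an address such as jwclark1234@domain.com is not flagged either.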
Or, actually use a different approach. I recently found out how to parse ranges of floating point numbers with regexp, but that doesn't make it the proper solution to finding numbers within ranges. :P
edit - fixed silly redundancy thanks to zx81
edit - change to comply with strange limitations:
^jwclark.{0,25}[@][^d][^o][^m][^a][^i][^n].{0,25}\.com.{0,25}$

CAtlRegExp for a regular expression that matches 4 characters max

Short version:
How can I get a regex that matches a@a.aaaa but not a@a.aaaaa using CAtlRegExp?
Long version:
I'm using CAtlRegExp http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx to try to match email addresses. I want to use the regex
^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
extracted from here.
But the syntax that CAtlRegExp accepts is different than the one used there. This regex returns the error REPARSE_ERROR_BRACKET_EXPECTED, you can check for yourself using this app: http://www.codeproject.com/KB/string/mfcregex.aspx
Using said app, I created this regex:
^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]$
But the problem is that this matches a@a.aaaaa as valid; I need it to match 4 characters maximum for the top-level domain.
So, how can I get a regex that matches a@a.aaaa but not a@a.aaaaa?
Try: ^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+\c\c\c?\c?$
This expression replaces the [A-Z]{2,4} sequence which CAtlRegExp doesn't support with \c\c\c?\c?
\c serves as an abbreviation of [a-zA-Z]. The question marks after the 3rd and 4th \c's indicate they can match either zero or one characters. As a result, this portion of the expression matches 2, 3 or 4 characters, but neither more nor less.
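The counting logic can be checked outside ATL. Python's re module has no \c abbreviation, so this sketch spells it out as [a-zA-Z]; that substitution, and the names LETTER, TLD and PATTERN, are assumptions made for the example rather than CAtlRegExp syntax.

```python
import re

LETTER = "[a-zA-Z]"  # stands in for \c, which CAtlRegExp provides

# Two mandatory letters followed by two optional ones: 2, 3 or 4 in total.
TLD = LETTER + LETTER + LETTER + "?" + LETTER + "?"

PATTERN = re.compile(r"^[a-zA-Z0-9._%+\-]+@([a-zA-Z0-9-]+\.)+" + TLD + "$")

for address in ["a@a.a", "a@a.aa", "a@a.aaa", "a@a.aaaa", "a@a.aaaaa"]:
    print(address, bool(PATTERN.match(address)))
```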
You are trying to match email addresses, a very widely used critical element of internet communication.
To which I would say that this job is best done with the most widely used most correct regex.
Since email address format rules are described by RFC822, it seems useful to do internet searches for something like "RFC822 email regex".
For Perl the answer seems to be easy: use Mail::RFC822::Address: regexp-based address validation
RFC 822 Email Address Parser in PHP
Thus, to achieve the most correct handling of email addresses, one should either locate the most precise regex that there is out somewhere for the particular toolkit (ATL in your case) or - in case there's no suitable existing regex yet - adapt a very precise regex of another toolkit (Perl above seems to be a very complete albeit difficult candidate).
If you're trying to match a specific sub part of email addresses (as seems to be the case given your question), then it probably still makes sense to start with the most up-to-date/correct/universal regex and specifically limit it to the parts that you require.
Perhaps I stated the obvious, but I hope it helped.

Question about URL Validation with Regex [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 14 years ago.
I have the following regex that does a great job matching urls:
((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)
However, it does not handle URLs without a prefix, i.e. stackoverflow.com or www.google.com do not match. Does anyone know how I can modify this regex to not care whether there is a prefix or not?
EDIT: Is my question too vague? Does it need more details?
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\)))?[\w\d:#@%/;$()~_?\+-=\\\.&]*)
I added a ()? around the protocols like Vinko Vrsalovic suggested, but now the regex will match nearly any string, as long as it has valid URL characters.
My use case: I manage the contents of a database, and one field holds either plain text, a phone number, a URL or an email address. I was looking for an easy way to validate the input so I can format it properly, i.e. creating anchor tags for the URL/email, and formatting the phone number the way the other numbers are formatted throughout the site. Any suggestions?
The below regex is from the wonderful Mastering Regular Expressions book. If you are not familiar with the free spacing/comments mode, I suggest you get familiar with it.
\b
# Match the leading part (proto://hostname, or just hostname)
(
    # ftp://, http://, or https:// leading part
    (ftp|https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with our more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | name\b
        | coop\b
        | aero\b
        | museum\b
        | [a-z][a-z]\b # two-letter country codes
    )
)
# Allow an optional port number
( : \d+ )?
# The rest of the URL is optional, and begins with / . . .
(
    /
    # The rest are heuristics for what seems to work well
    [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
    (?:
        [.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
    )*
)?
To explain this regex briefly (for a full explanation get the book) - URLs have one or more dot separated parts ending with either a limited list of final bits, or a two letter country code (.uk .fr ...). In addition the parts may have any alphanumeric characters or hyphens '-', but hyphens may not be the first or last character of the parts. Then there may be a port number, and then the rest of it.
To extract this from the website, go to http://regex.info/listing.cgi?ed=3&p=207 It is from page 207 of the 3rd edition.
And the page says "Copyright © 2008 Jeffrey Friedl" so I'm not sure what the conditions for use are exactly, but I would expect that if you own the book you could use it so ... I'm hoping I'm not breaking the rules putting it here.
If you read section 5 of the URL specification (http://www.isi.edu/in-notes/rfc1738.txt) you'll see that the syntax of a URL is at a minimum:
scheme ':' schemepart
where scheme is 1 or more characters and schemepart is 0 or more characters. Therefore if you don't have a colon, you don't have a URL.
That said, users don't care whether they've given you a URL; to them it looks like one. So here's what I do:
BEFORE validation, if there isn't a colon in it, prepend http://, then run it through whatever validator you want. This turns any legitimate hostname (which may not include domain info, after all) into something that looks like a URL.
frob -> http://frob
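A minimal sketch of that pre-validation step in Python. The function name ensure_scheme and the default_scheme parameter are made up for illustration; the rule itself is the one described above: no colon means no scheme, so http:// is prepended.

```python
def ensure_scheme(text, default_scheme="http"):
    """Prepend a scheme when the input contains no colon at all."""
    if ":" in text:
        # Something that may be a scheme is already present; leave it alone.
        return text
    return f"{default_scheme}://{text}"


print(ensure_scheme("frob"))               # http://frob
print(ensure_scheme("www.google.com"))     # http://www.google.com
print(ensure_scheme("ftp://example.org"))  # ftp://example.org
```

Because the test is just "is there a colon", an input such as example.com:8080 is left untouched, so a later validator still has to deal with it.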
(Nearly) the only rule for the host part is that it can't begin with a digit if it contains no dots. Now, there are specific validations that should be performed for specific schemes, which none of the regexes given thus far accomplish. But, spec compliance is probably not what you want to 'validate'. Therefore a dns query on the hostname portion may be useful, but unless you're using the same resolver in the same context as your user, it isn't going to work in all cases.
Your regexp matches everything starting with one of those protocols, including a lot of things that cannot possibly be real URLs. If you relax the protocol part (making it optional with ?), then you'll just be matching almost everything, including the empty string.
In other words, it does a great job matching URLs because it matches almost anything starting with http://,https://,ftp:// and so on. Well, it also matches ftp:\\ and ms-help://, but let's ignore that.
It may make sense, depending on actual usage, because the other regexp approach of whitelisting valid domains becomes non maintainable quickly enough, but making the protocol part optional does not make sense.
An example (with the relaxed protocol part in place):
>>> import re
>>> r = re.compile('(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:#@%/;$()~_?\+-=\\\.&]*)')
>>> r.search('oompaloompa_is_not_an_ur%&%%l').groups()[0]
'oompaloompa_is_not_an_ur%&%%l' #Matches!
>>> r.search('oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk').groups()[0]
'oompaloompa_isdfjakojfsdi.sdnioknfsdjknfsdjk.fsdnjkfnsdjknfsdjk' #Matches!
>>>
Given your edit, I suggest you either make the user select what they are adding (storing it in an enum column), or create a simpler regex that checks for at least a dot, besides the valid characters and maybe some common domains.
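One way the "simpler regex" option might look, as a Python sketch. The pattern and the name looks_like_url are assumptions for illustration, not something from the question: it accepts an optional http, https or ftp prefix, then requires at least one dot between runs of word characters or hyphens, with an optional port and path.

```python
import re

SIMPLE_URL = re.compile(
    r"^(?:(?:https?|ftp)://)?"  # optional protocol
    r"[\w-]+(?:\.[\w-]+)+"      # host with at least one dot
    r"(?::\d+)?"                # optional port
    r"(?:[/?#]\S*)?$"           # optional path, query or fragment
)


def looks_like_url(text):
    """Return True if text contains a dotted host, with or without a prefix."""
    return SIMPLE_URL.match(text) is not None


for value in ["stackoverflow.com", "www.google.com", "test",
              "oompaloompa_is_not_an_ur%&%%l"]:
    print(value, looks_like_url(value))
```

This is still a heuristic: any dotted run of word characters passes, so it narrows things down without proving that the host exists.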
A third alternative, which will be VERY SLOW and should only be used when URL validation is REALLY REALLY IMPORTANT, is to actually access the URL with a HEAD request; if you get a host-not-found or another error, you know it's not valid. For emails you could check whether the MX host exists and has port 25 open. If both fail, it'll be plain text. (I'm not suggesting this either.)
You can surround the prefix part in brackets and match 0 or 1 occurrences
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?
So the whole regex will become
(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)?[\w\d:#@%/;$()~_?\+-=\\\.&]*)
The problem with that is it's going to match more or less any word. For example "test" would also be a match.
Where are you going to use that regex? Are you trying to validate a hostname or are you trying to find hostnames inside a paragraph?
Just use:
.*
i.e. match everything.
The things you want to match are just hostnames, not URLs (technically).
There's no structure you can use to definitively identify hostnames.
Perhaps you could look for things that end in ".com" but then you'll miss any .co.uk, .net, .org, etc.
Edit:
In other words: if you remove the requirement that the URL-like things start with a protocol, you won't have anything to match on.
Depending on what you are using the regular expression on:
Treat everything as a URL
Keep the requirement for a protocol
Hack checks for common endings for hostnames (e.g. .com .net .org) and accept you'll miss some.
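The third option could be sketched like this in Python. The list of endings and the name has_common_ending are placeholders chosen for the example; the point is that anything outside the list is missed, as warned above.

```python
# Hypothetical short list of endings to accept; extend as needed.
COMMON_ENDINGS = (".com", ".net", ".org")


def has_common_ending(text):
    """Return True if text ends in one of the listed hostname endings."""
    # str.endswith accepts a tuple and checks each suffix in turn.
    return text.lower().endswith(COMMON_ENDINGS)


print(has_common_ending("stackoverflow.com"))  # True
print(has_common_ending("www.example.org"))    # True
print(has_common_ending("bbc.co.uk"))          # False: not in the list
```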