I was looking at email validation. I read in the RFC specs that consecutive dots (.) are not allowed, as in mail..me@server.com.
But are different special characters allowed to occur consecutively, as in mail.$me@server.com?
And if so, how do I write a regular expression that accepts only a single occurrence of each special character in a row, as long as adjacent ones are different? It shouldn't accept .. && $$, but it should accept &$ .$ &.
And since there's a big list of special characters allowed, I don't think a regex like \^(&&|$$|..)\ etc. is an option.
There are a few RFC compliant email validation regexes. They are not pretty; in fact they are pretty awful, spanning hundreds of characters. You really don't want to create one yourself: either use an existing one or write regular code you can understand and maintain.
This is one of the RFC compliant regexes
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Check this link for expanded information and alternative (more practical) regexes http://www.regular-expressions.info/email.html
I finally used something like this:
/^([a-zA-Z0-9]+([\.\!\'\#\$\%\&\*\+\-\/\=\?\^\_\`\{\|\}\~]{0,1}))*[a-zA-Z0-9]+\@(([a-zA-Z0-9\-]+[\.]?[a-zA-Z0-9]+){0,2})[\.][a-zA-Z]{2,4}$/
Not pretty :)
but very much served my specifications.
Different characters like $ are allowed to occur multiple times in a row, yes. sam$$iam@example.com is a completely valid email address.
I would use a simple regex of email validation + another regex that checks double chars like /[.&$]{2}/
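If you still want the stricter rule from the question (reject `..` or `$$`, accept `&$` or `.$`), note that a plain `[.&$]{2}` also matches two different characters in a row. A backreference narrows it to the same character repeated. Here is a minimal sketch in Python; the `SPECIALS` list and the function name are my own assumptions, so adjust them to whatever characters you actually allow:

```python
import re

# Hypothetical list of "special" characters to police; extend or trim as needed.
SPECIALS = r".!#$%&'*+/=?^_`{|}~-"

# Capture one special character, then require the very same character again (\1).
REPEATED_SPECIAL = re.compile(r"([" + re.escape(SPECIALS) + r"])\1")


def has_repeated_special(local_part: str) -> bool:
    """Return True if any special character appears twice in a row."""
    return REPEATED_SPECIAL.search(local_part) is not None
```

With this, `has_repeated_special("mail..me")` is true while `has_repeated_special("mail.$me")` is false, and there is no need to enumerate every doubled pair by hand.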
I suppose it depends on what you're doing with this email validation, but I've done this for years in online ASP.NET regex validators for form entry purposes.
For a few months I thought I had what was a pretty cool regular expression to take care of this. I found it online and it seemed to be a popular one. However, on several occasions I'd get a call from a customer trying to fill out the application where the form validation didn't like their email address. And who knows how many people had the same problem but didn't call.
I learned the lesson the hard way that it's better to err on the side of greediness than to try to be too strict. In other words, since there are soooooo many rules in defining what makes an email address valid (and invalid), I simply define a loose open-ended regex to cover all of my bases. It may match some invalid email addresses as well, but for my purposes that's not as big of a deal. Besides, quite honestly -- most of the time if the user is screwing up their email address it's going to be a misspelling which regex isn't going to catch anyways.
So here's what I use now:
^[^<>\s\@]+(\@[\w\-]+(\.[\w\-]+)+)$
And here's a working example to test this:
http://regexhero.net/tester/?id=b90d359f-0dda-4b2a-a9b7-286fc513cf40
This doesn't address your primary concern as this will still match consecutive dots, dashes, etc. And I still can't claim this will match every valid email address because I honestly don't know. But I can say that I've been using it for the past 3 years with over 25,000 users and not a single complaint.
See these answers:
stackoverflow.com/questions/997078/email-regular-expression
stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses
stackoverflow.com/questions/36261/test-expand-my-email-regex
Just remember, as stated before: the only way to tell if an email address is truly valid is to send email to it!
I've heard that it is a bad thing to validate email addresses with a regex, and that it actually can cause harm. Why is that?
I thought it never could be a bad thing to validate data. Maybe unnecessary, but never a bad thing provided that you perform the validation correctly. Why is this right or wrong? If it can cause harm, please give an example.
In general, yes - using regular expressions to validate email addresses is harmful. This is because of bad (incorrect) assumptions by the author of the regular expression.
As klutt indicated, an email address has two parts, the local-part and the domain. It's worth noting some things about these parts that aren't immediately obvious:
The local-part can contain escaped characters and even additional @ characters.
The local-part can be case sensitive, however it is up to the mail server at that specific domain how it wants to distinguish case.
The domain part can contain zero or more labels separated by a period (.), though in practice there are no MX records corresponding to the root (zero labels) or on the TLDs (one label) themselves.
So, there are some checks that you can do without rejecting valid email addresses that correspond with the above:
Address contains at least one @
The local-part (everything to the left of the rightmost @) is non-empty
The domain part (everything to the right of the rightmost @) contains at least one period (again, this isn't strictly true, but pragmatic)
That's it. As others have pointed out, it's best practice to test deliverability to that address. This will establish two important things:
Whether the email currently exists; and
That the user has access to the email address (is the legitimate user or owner)
If you build email activation processes into your business process, you don't need to worry about complicated regular expressions that have issues.
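The three pragmatic checks listed above can be sketched in a few lines without any regex. This is only an illustration in Python; the function name is my own and not part of any library:

```python
def looks_like_email(address: str) -> bool:
    """Apply only the three pragmatic checks: an @, a local-part, a dotted domain."""
    if "@" not in address:
        return False
    # Split on the rightmost @, because the local-part may itself contain an @.
    local_part, _, domain = address.rpartition("@")
    return bool(local_part) and "." in domain
```

Anything that passes still has to be confirmed by actually sending mail to it.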
Some further reading for reference:
RFC 5321: Simple Mail Transfer Protocol
OWASP: Input Validation Cheat Sheet
TL;DR
Don't use regexes for validating emails, unless you have a good reason to. Use a verification mail instead. In most cases, a regex that simply checks that the string contains an @ is enough.
Short version
Constructing regexes for validating emails can be a good and fun exercise, but in general, you should really avoid it in production code. The proper way of verifying an email address is in most cases to send a verification mail. Trying to verify if a mail address matches the specification is very tricky, and even if you get it right, it's still often useless information unless you know that it's a mail address that you can send mails to and that someone reads.
Think of it. How often do you have use for storing a mail address that's wrong?
If you just want to make sure that a user does not mix up input fields, check that the mail address contains an @ character. That's enough. Well, it would not catch those who insist on that character in user names or passwords, but that's their headache. ;)
Long version
In a majority of the cases where you would want to use this, just knowing that the email address is valid does not mean a thing. What you really want to know is if it is the right email address.
The reason may differ. You may want to send newsletters, use it for regular communication, password recovery or something else. But whatever it is, it's important that it is the right address. It's not important to know if the address fulfills a complicated standard. The only important thing is to know if it can be used for the purpose you have of storing the address.
The proper way to verify this is by sending a mail with a verification link.
If you have verified the email address with a verification link, there's often no point in checking if it is a correct email address, since you know it works. It could however be used for basically checking that the user is entering the email address in the correct field. My advice in this case is to be extremely forgiving. I'd say it's enough to just check that there is an @ in the field. It's a simple check and ALL email addresses include an @. If you want to make it more complicated than that, I would suggest just warning the user that there might be something wrong with the address, but not forbidding it. A pretty simple regex that would have extremely few false negatives (if any) is
.+@.+\..+
This means a non-empty string before the @, followed by a non-empty domain, a dot and a non-empty top-level domain. But actually, I'd just stick with @.+ which means that the right part is non-empty, and I don't know of any DNS server that would accept an empty server name.
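To make the "warn, don't forbid" idea concrete, here is a rough Python sketch. The function name and the message strings are invented for illustration; only the `.+@.+\..+` pattern comes from the text above:

```python
import re
from typing import Optional

# The forgiving pattern from above: something, an @, something, a dot, something.
LOOSE_EMAIL = re.compile(r".+@.+\..+")


def check_email_field(address: str) -> Optional[str]:
    """Return None if the input looks fine, otherwise a message for the user."""
    address = address.strip()
    if "@" not in address:
        # The only hard failure: this is probably the wrong input field.
        return "error: an email address must contain an @"
    if LOOSE_EMAIL.fullmatch(address) is None:
        # Soft failure: show a warning, but still let the user submit.
        return "warning: this address looks unusual, please double-check it"
    return None
```

The caller would block the form only on messages starting with "error" and merely display the ones starting with "warning".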
Properly checking an email against the standard is actually really tricky
But one worse concern is that a regex for accurately verifying an email address is actually a very complex matter. If you try to create a regex on your own, you will almost certainly make mistakes. One thing worth mentioning here is that the standard RFC 5322 does allow comments within parentheses. To make things worse, nested comments are allowed. A standard regex cannot match nested patterns. You will need extended regex for this. While extended regexes are not unusual, it does say something about the complexity. And even if you get it right, will you update the regex when a new standard comes?
The mail server might support non-standard addresses
And one more thing, even if you get it 100% right, that still may not be enough. An email address has the local part on the left side of the @ and the domain part on the right. Everything in the local part is meant to be handled by the server. Sure, RFC 5322 is pretty detailed about what a valid local part looks like, but what if a particular email server accepts addresses that are not valid according to RFC 5322? Are you really sure you don't want to allow a particular email address that does work just because it does not follow the standard? Do you want to lose customers for your business just because they have chosen an obscure email provider?
If you really want to check if an address is correct in production code, then use MailAddress class or something equivalent. But first take a minute to ponder if this really is what you want. Ask yourself if the address has any value if it is not the correct address. If the answer is no, then you don't. Use verification links instead.
That being said, it can be a good thing to validate input. The important thing is to know why you are doing it. Validating the email with a regex or (preferably) something like the Mailaddress class could give some protection against malicious input, such as SQL injections and such. But if this is the only method you have to protect you against malicious input, then you're doing something else very wrong.
In addition to other answers, I would like to point out, that regex engines that use backtracking are susceptible to ReDoS - regex denial of service attacks. The attack is based on the fact that many non-trivial regular expressions have inputs that can take an extraordinary amount of CPU cycles to produce a non-match.
Crafting such an input might cause trouble to the availability of the site even with small botnet.
Mitigations of the issue:
it is often possible to rewrite the regex expression to avoid catastrophic backtracking; or:
using a regex engine without support for backtracking - while most support it, engines without such support do exist - a notable example would be the RE2 regex engine used by Go/Golang.
For more information: "Regular Expressions Denial of the Service (ReDoS) Attacks"
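As an illustration of the first mitigation, here is a small Python sketch (Python's `re` is a backtracking engine). The two patterns are made-up examples, not taken from any real validator, and the length cap is an assumed value:

```python
import re

# Nested quantifiers: on a long run of letters with no "@", a backtracking
# engine may try exponentially many ways to split the run before giving up.
RISKY = re.compile(r"^([a-zA-Z0-9]+)*@")

# Same language with a single quantifier: only one way to consume the run.
SAFER = re.compile(r"^[a-zA-Z0-9]*@")

MAX_LEN = 254  # assumed cap; bounding the input length limits the damage either way


def starts_with_simple_local_part(text: str) -> bool:
    """Match with the rewritten pattern, after rejecting oversized input."""
    if len(text) > MAX_LEN:
        return False
    return SAFER.match(text) is not None
```

Both patterns accept exactly the same strings; the difference only shows up in how long a failed match takes on hostile input.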
If your regular expression is ill-formed then you might deny valid email addresses. This goes for any "email validation" rule.
I know of an email address that is regularly rejected by forms even though it doesn't contain any email oddities; it's merely long. It really annoys the person it belongs to because the part before the @ is their legal name - an obvious choice for an email address.
That is part of the potential harm of email validation done incorrectly: annoying users by denying valid email addresses from entering the system.
It is not inherently bad to validate email addresses.
It is not even inherently bad to validate email addresses using regexes ... though there are arguably better ways to validate them [1].
The real issues are that validation of email addresses based on the syntax is ineffective:
It does not tell you if the address corresponds to a valid, working mailbox.
It does not tell you if it is an address for the correct user (or agent).
Since users often accidentally (or deliberately [2]) enter syntactically valid but incorrect email addresses, you need to do something else if you need to know if the address is the correct address for the person involved. For example, you could send some kind of "activation" or "confirmation" email to the address provided.
So, assuming that you are going to implement the second stage of checking, the first stage of syntax checking the email address is relatively unimportant, and not even strictly necessary.
1 - Creating a regex that correctly deals with all of the edge-cases in the email syntax is non-trivial. However, it may be acceptable to disallow some of the more abstruse edge-cases, provided it doesn't unduly inconvenience a significant number of users.
2 - Regex validation is next to useless for filtering out deliberately fake email addresses.
I've heard that it is a bad thing to validate email addresses with a regex, and that it actually can cause harm. Why is that?
This is correct. The regex solution is attractive, because an email address is a structured string, and regex is used to find structure in strings.
It is also the wrong solution, because when you ask the user for an email address, it is usually so you can contact them.
The validation is incorrect because:
the address may be valid, but not an address the user has access to. I could fill in the address billgates@microsoft.com on any form, and it will probably be accepted as a valid email address ( disclaimer: I am not Bill Gates :) ).
the syntax for email addresses is very tricky to get correctly (see the examples here) - by defining your own regex for email validation, you will end up rejecting valid addresses, and accepting invalid ones.
I thought it never could be a bad thing to validate data.
It's not bad to validate data. In this case though, you will provide a feature in your application, that is defective by design:
Your application looks to your developers as if it is validating the input, but the validation is unnecessary, probably incomplete, and at the end of the validation, you don't know if you have an address that will allow you to contact the user.
Maybe unnecessary, but never a bad thing provided that you perform the validation correctly.
It is not unnecessary; it is necessary. It's just that regex is the wrong tool for it.
At the end of the day, the best way to check that the address is valid for the user is unique token exchange for that address:
send an email to the address, containing a unique random token (store token with user data)
ask user in the email to "click the link/button", effectively sending you the token back.
verify the token.
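A bare-bones sketch of that token exchange in Python, leaving out the actual sending of the email. The in-memory dictionary and both function names are placeholders of mine; a real application would store the token with the user record and expire it:

```python
import hmac
import secrets

# Hypothetical in-memory store, standing in for "store token with user data".
pending_tokens = {}


def issue_token(address: str) -> str:
    """Create a unique random token, remember it, and return it for the email link."""
    token = secrets.token_urlsafe(32)
    pending_tokens[address] = token
    return token


def confirm(address: str, token: str) -> bool:
    """Verify the token the user sent back; each token works only once."""
    expected = pending_tokens.get(address)
    if expected is None:
        return False
    # Constant-time comparison, on bytes so that odd input cannot raise.
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return False
    del pending_tokens[address]
    return True
```

Once `confirm` returns true you know the address works and that the user can read mail sent to it, which no regex can tell you.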
Regex is not harmful.
Use a good email regex to filter the impatient fake user.
If you are selling to that individual, you might want to contact them for further validation, though sellers don't care about email too much and just validating the credit card is good enough for them.
Otherwise, the only other place where validation is necessary is when someone wants to access and interact with your forum, and for some reason you want to get remuneration by selling their email to mass advertisers, even though you say you won't do that.
A general email regex in the HTML5 specification is this -
^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$
http://www.w3.org/TR/html5/forms.html#valid-e-mail-address
^
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+
@
[a-zA-Z0-9]
(?:
[a-zA-Z0-9-]{0,61}
[a-zA-Z0-9]
)?
(?:
\.
[a-zA-Z0-9]
(?:
[a-zA-Z0-9-]{0,61}
[a-zA-Z0-9]
)?
)*
$
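The same expanded layout can be kept in real code with a verbose-mode regex. Below is a sketch in Python using `re.VERBOSE`; the comments are my own reading of the pattern, not wording from the HTML5 specification:

```python
import re

HTML5_EMAIL = re.compile(
    r"""
    ^
    [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+        # local part: one or more allowed characters
    @
    [a-zA-Z0-9]                             # first label starts with a letter or digit
    (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?   # and may continue, up to 63 characters in total
    (?:
        \.                                  # further labels, each introduced by a dot
        [a-zA-Z0-9]
        (?: [a-zA-Z0-9-]{0,61} [a-zA-Z0-9] )?
    )*
    $
    """,
    re.VERBOSE,
)
```

In verbose mode the `#` inside the character class is still a literal, while a `#` outside a class starts a comment. Note that this pattern happily accepts consecutive dots in the local part and a domain with a single label.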
A regular expression is probably the best way to validate an email address; so long as you use the correct one. Once you've checked the address with a regular expression, there's only a few additional requirements that must be checked (that the address is not too long, and that it is valid UTF-8).
This is because the ABNF grammar that defines the form of email addresses is "regular", which means it can be described exactly as a regular expression; without backtracking, recursion, or any non-regular features.
It's only a matter of understanding the specification; but once you do that, it turns out the regular expression for email address is actually very simple: How can I validate an email address using a regular expression?
When I was first learning how to use regular expressions, we were taught how to parse things like phone numbers (obviously always 5 digits, an optional space and a further 6 digits) and email addresses (obviously always alphanumerics, then a single '@', then alphanumerics followed by a '.' and three letters), which we should always do to validate the data that the user enters.
Of course as I've developed I've learned how silly the basic approach can be, but the more I look, the more I question the concept altogether. The most open, careful, correct validation of something like an email address through regexes ends up being hundreds if not thousands of characters long in order to both accept all the legal cases and correctly reject only the illegal ones. Even worse, all that effort does absolutely nothing for the actual validity: the user may have accidentally added an 'a', or may not use that email address at all, or may even be using someone else's address, or may use a '+' symbol which is being flagged inappropriately.
Yet at the same time seemingly every site I come across still does this kind of technical checking, preventing me from putting more obscure characters in an email address or name, or objecting to the idea that someone would have more or less than a single title, then a single firstname and a single lastname, all made purely from latin characters yet without any form of check that it's my real name.
Is there a benefit to this? Once injection attacks are handled (which should be through methods other than sterilizing the input) is there any other point to these checks?
Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?
Overly validating things is indeed one of the banes of the internet. Especially if the person writing the validation code has no actual knowledge of the problem domain. No, you probably do not actually know what the valid syntax for email addresses is. Or real-world addresses, especially internationally. Or telephone numbers. Or people's names.
Looking at a few localised examples (my email address) and extrapolating to rules covering all possible values within the domain (all email addresses) is madness. Unless you have perfect domain knowledge, you should not come up with rules about the domain. In the case of email addresses this leads to only a very narrow subset of possible email addresses actually being usable in daily life. Gee, thanks, guys.
As for people's names, whatever a person tells you is their name is by definition their name. It's what you call them by. You cannot validate it automatically; they'd have to send in a copy of their birth certificate for actual official validation. And even then, is that really what you're interested in knowing? Or do you merely need a "handle" to greet and identify them on your forum page?
Facebook does (did?) strict name validation in order to force people to use their real names to register. Well, many people I know on Facebook still use some made up nonsense name. The filter obviously doesn't work. Having said this, perhaps it works well enough for Facebook so that most people use their actual name because they couldn't be bothered to figure out which particular pattern will pass the validation. In that sense, such a filter can serve some purpose.
In the end it's up to you to decide on reasons for validation and the specific limits you want to enforce. The issue is that people often do not think about the bigger picture before writing validation code and they have no good reason for their specific limits. Don't fall into that trap.
is there any other point to these checks?
Certainly. Knowing that your data is valid is very important. In the case of email addresses, for example, sending an email to an address you haven't validated will, at the very least, lead to bounces. Enough bounces and your mailhost might block you for spamming. Not validating a phone number could lead to unnecessary costs if your app tries to send SMS to them. The list goes on and on.
Or on the other hand, is there actually a sure fire way to actually validate user details other than to 'use' them in whatever way makes sense contextually and see if it falls over?
Yes, but regex is generally a bad way to validate data. If a phone number is supposed to be "5 digits a space then 6 digits", then your check is going to fail if I type "5 digits two spaces then 6 digits" or "5 digits a dash then 6 digits" or "11 digits". Use common sense, and expect any crazy format the user provides. Know what the absolute minimal requirement is. For example, if you need 11 digits total, then strip everything that's not a digit first. Then formatting doesn't matter.
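For instance, the "strip everything that's not a digit first" step might look like this in Python. The function name and the default of 11 digits are just taken from the example above, not from any standard:

```python
import re
from typing import Optional


def normalize_phone(raw: str, required_digits: int = 11) -> Optional[str]:
    """Drop everything except ASCII digits, then check only the digit count."""
    digits = re.sub(r"[^0-9]", "", raw)
    return digits if len(digits) == required_digits else None
```

All of "01234 567890", "01234  567890", "01234-567890" and "01234567890" then normalize to the same value, so the format the user typed no longer matters.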
Also, read the RFCs. I can't count the number of times my email address has been rejected because it has a plus sign in it. A disappointing number of those rejections came from large tech-oriented companies with programmers who should know better.
In my website, I am using the below regex to validate email.
^[a-zA-Z0-9]+[a-zA-Z0-9_.-]+[a-zA-Z0-9_-]+@[a-zA-Z0-9]+[a-zA-Z0-9.-]+[a-zA-Z0-9]+.[a-z]{2,4}$
My doubts are:
Can I use the below regex for the same functionality?
^[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+.[a-z]{2,4}$
The reason I ask is that I tried to study what the regex means, and I am confused about whether
[a-zA-Z0-9_.-] covers everything matched by [a-zA-Z0-9] and [a-zA-Z0-9_-]
I am not sure about this, as I am a beginner.
I got the regex from
http://regexlib.com/
I checked both regexes at http://regex101.com/#pcre and couldn't find any difference in the results. Maybe that is because of my limited knowledge.
Please give a clarification. Thanks to all in advance
Maybe it's not the answer you were looking for, but I have to say that I ended up with this kind of email validation: ^.+@.+\..{2,}$ after trimming the input.
What does it check? The existence of some symbols, the @ itself, some other symbols, a dot, and at least two symbols for the top-level domain. It says "Dude, there should be an email, not your hilarious username". And that's enough, I guess.
By the way, .[a-z]{2,4}$ is a huge mistake for checking the TLD, since there are a few popular domains that are longer than 4 symbols (e.g., .travel) and a lot of less popular ones.
Why do I think that you don't need a detailed validation? First of all, there are a lot of requirements which you'll miss anyway. Do you know that Cyrillic symbols are allowed in an email address?
And please think about what you want from this validation. To avoid incorrect emails? You won't. Somebody will enter an email that meets all of your requirements but is incorrect anyway. Is email@gmial.com a good one? No. Will a regexp catch it? I'm afraid the answer is "no" once again.
So it's better to explain that the user should provide a valid email to get a confirmation mail, and to include a note like "if you aren't the one who registered at mysite.com, please just ignore this letter" in the email text.
Because a regexp will never filter enough, but you can lose a couple of users with strange email addresses because of it.
And since this should be an answer for your question:
It won't be the same functionality, but both regexp's are far from being perfect.
The long regexp checks that the first symbol in the login isn't a dot, dash or underscore and that the last symbol isn't a dot, while the other symbols may be; but it overlooks the fact that a login might be shorter than 3 symbols. The short regexp is better (= simpler), but it doesn't enforce the requirements mentioned above.
So, if you want to use your variant, just remove the 4 from it (so {2,4} becomes {2,}). If you need the logic of the original one, you can't make it shorter.
The differences can be seen with these examples: .mail@mail.com, a@gmail.com, gmail@a.com
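If you want to see the difference yourself, a quick Python check along these lines should do. The two patterns are the ones from the question, written with @; the names `LONG`, `SHORT` and `compare` are just labels I made up:

```python
import re

# Both patterns copied from the question. The unescaped "." before the TLD
# matches any character, exactly as in the originals.
LONG = re.compile(
    r"^[a-zA-Z0-9]+[a-zA-Z0-9_.-]+[a-zA-Z0-9_-]+@"
    r"[a-zA-Z0-9]+[a-zA-Z0-9.-]+[a-zA-Z0-9]+.[a-z]{2,4}$"
)
SHORT = re.compile(r"^[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+.[a-z]{2,4}$")


def compare(address: str) -> tuple:
    """Return (long pattern matches, short pattern matches) for one address."""
    return (LONG.match(address) is not None, SHORT.match(address) is not None)
```

For each of the three example addresses, `compare` gives `(False, True)`: the long pattern rejects them and the short one accepts them, while an ordinary address such as user@example.com passes both.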
I am new to regular expressions and have just started learning some. I was wondering what some of the regular expressions most commonly used by programmers are. To put it another way, I would like to know what regular expressions are most useful for. How can they help me in my everyday tasks? I would prefer to know regular expressions useful for everyday programming, not occasionally used regular expressions such as email address matching.
Anyone? Thanks.
Edit: Most of the answers include regular expressions to match email addresses, URLs, dates, phone numbers etc. Please note that not all programmers have to worry about these things in their everyday tasks. I would like to know some more generic uses of regular expressions, if there are any, which programmers in general (may) use regardless of what language or domain they are working in.
Regular expression examples for
Decimals input
Positive Integers ^\d+$
Negative Integers ^-\d+$
Integer ^-?\d+$
Positive Number ^\d*\.?\d+$
Negative Number ^-\d*\.?\d+$
Positive Number or Negative Number ^-?\d*\.?\d+$
Phone number ^\+?[\d\s]{3,}$
Phone with code ^\+?[\d\s]+\(?[\d\s]{10,}$
Year 1900-2099 ^(19|20)\d{2}$
Date (dd mm yyyy, d/m/yyyy, etc.)
^([1-9]|0[1-9]|[12][0-9]|3[01])\D([1-9]|0[1-9]|1[012])\D(19[0-9][0-9]|20[0-9][0-9])$
IP v4:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])(\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])){3}$
Alphabetic input
Personal Name ^[\w.']{2,}(\s[\w.']{2,})+$
Username ^[\w\d_.]{4,}$
Password at least 6 symbols ^.{6,}$
Password or empty input ^.{6,}$|^$
email ^[_]*([a-z0-9]+(\.|_*)?)+@([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
domain ^([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
Other regular expressions
- Match no input ^$
- Match blank input ^\s\t*$
- Match New line [\r\n]|$
- Match white Space ^\s+$
- Match Url = ^http\:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}$
I would take a different angle on this and say that it's most helpful to know when to use regular expressions and when NOT to use them.
For example, imagine this problem: "Figure out if a string ends with a whitespace character." A regular expression could be used here, but if you're using C#, this code is much faster:
bool EndsWithWhitespace(string s)
{
    return !string.IsNullOrEmpty(s) && char.IsWhiteSpace(s[s.Length - 1]);
}
Regular expressions are powerful, and it's important to know when they're too powerful for the problem you're trying to solve.
Think about input fields that require validation, such as zip codes, telephone numbers, et cetera. Regular expressions are widely used to validate those. Also, take a look at this site, which contains many tutorials and many more examples, some of which I present next:
Numeric Ranges. Since regular expressions work with text rather than numbers, matching specific numeric ranges requires a bit of extra care.
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There's a lot of controversy about what is a proper regex to match email addresses. It's a perfect example showing that you need to know exactly what you're trying to match (and what not), and that there's always a trade-off between regex complexity and accuracy.
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Finding or Verifying Credit Card Numbers. Validate credit card numbers entered on your order form. Find credit card numbers in documents for a security audit.
And many, many, many more possible applications.
E-mail address
Website
File-Paths
Phone-numbers/Fax/ZIP and other numbers used in business (chemistry numbers, etc.)
file content (check if the file can be a valid XML-file,...)
code modification and formatting (with replacement)
data types (GUID, parsing of integers,...)
...
Up to closing tag
([^<]*)
Seriously. I use combinations of that way too often for comfort... We should all ditch regex:en for peg-parsers, especially since there's a nice regex-like grammar style for them.
Well... I kind of think your question is wrong. It sounds like you're asking about regular expressions that could/should be as much a part of one's coding, or nearly so, as things like mathematical operators. Really, if your code depends that pervasively on regular expressions, you're probably doing something very wrong. For pervasive use throughout code, you want to use data structures that are better defined and more efficient to work with than regular-expression-managed strings.
The closest thing to what you're asking for that would make much sense to me would be something like /\s+/ used for splitting strings on arbitrary amounts of whitespace.
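In Python, for example, that split might be written like this (the helper name is made up for the illustration):

```python
import re


def split_words(text: str) -> list:
    """Split on runs of whitespace, ignoring leading and trailing whitespace."""
    stripped = text.strip()
    return re.split(r"\s+", stripped) if stripped else []
```

Python's built-in `text.split()` with no arguments gives the same result, which rather supports the point that you often don't need a regex at all.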
This is a little like asking 'what are the most useful words for programmers?'
It depends what you're going to use them for, and it depends which language. And you didn't say.
Some programmers never need to worry about matching email addresses, phone numbers, ZIP codes and IP addresses.
My copy of
Mastering Regular Expressions, O'Reilly, 3rd Edition, 2006
devotes a lot of space to the flavours of regex used by different languages.
It's a great reference, but I found the 2nd edition more readable.
How can they help me in my every day tasks?
A daily use for programmers could include
search/replace of sample data for testing purposes
searching through log files for String patterns (Exceptions, for example)
searching a directory structure for files of a certain type (as simple as dir *.txt does this)
to name just a few
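As a rough idea of the log-file case, here is a Python sketch that pulls exception-like names out of log text. The pattern is my own guess at what an "exception name" looks like (an optional dotted prefix, then a word ending in Exception or Error), so tune it for your logs:

```python
import re

# Assumed shape: optional package prefix, then a name ending in Exception/Error.
EXCEPTION_NAME = re.compile(r"\b(?:\w+\.)*\w+(?:Exception|Error)\b")


def find_exceptions(log_text: str) -> list:
    """Return every exception-like name found in the given log text."""
    return EXCEPTION_NAME.findall(log_text)
```

The match is case-sensitive on purpose, so a log level such as ERROR is not reported as an exception.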
E-mail
Website URL
Phone-numbers
ZIP Code
Alphanumeric (a user name consisting of alphanumeric characters and starting with an alphabetic character)
IP Address
This will be completely dependent on what domain you work in. For some it will be phone numbers and SSN's and others it will be email addresses, IP addresses, URLs. The most important thing is knowing when you need a regex and when you don't. For example, if you're trying to parse data from an XML or HTML file, it's usually better to use a library specifically designed to parse that content than to try and write something yourself.
Duplicate: Using a regular expression to validate an email address
There seem to be an awful lot of different variants of this on the web, and I was wondering if there is a definitive answer.
Preferably using the .net (Regex) dialect of regular expressions.
This question has been asked and answered several times:
Using a regular expression to validate an email address
Why are people using regexp for email and other complex validation?
Regexp recognition of email address hard?
Specifically related to .NET:
Validating e-mail with regular expression VB.Net
regular-expressions says that this one matches about 99%
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
The definitive answer? Or the normal answer?
I ask because the formal email address specification allows all sorts of weird things (parenthesis, quoted phrases, etc) that most people don't bother to account for.
See this page for a list of both comprehensive and normal regex'es.
I don't think there's a silver bullet for email regex verification.
What people commonly do is check only for obvious mistakes, like the absence of an @ and a dot, and then send a verification email to that address. It's the only way to be sure that the email is actually valid.
I had the same problem some time ago. RFC 2822 defines it, and according to this page this one is useful and is the one I picked: "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b"
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
Probably want to add A-Z next to all the lower case versions in order to allow uppercase letters as well.
I don't know if there's one definitive answer for this one, but if you put aside actually checking if the domain exists, email addresses boil down to <username>@<domain>, where <domain> contains at least one dot and two to four characters in the suffix. You can do all kinds of things to check for illegal/special characters, but the simplest one would be:
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$