How to allow "+" character in email validation? - regex

We're using the validate.js plugin for form validation. I want to allow "+" in the email field so I can use multiple test accounts with Gmail. However, the plugin validation doesn't allow it.
I'm not very good with regex. How would I alter the code below to allow the + character? (I tried the URL in the comment, but the page no longer seems to exist.)
email: function( value, element ) {
    // contributed by Scott Gonzalez: http://projects.scottsplayground.com/email_address_validation/
    return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i.test(value);
},

That is a terribly written regular expression. Not only is it far too long and complicated, but the author also escaped characters inside a [...] character class, which is unnecessary for most of them and makes the pattern harder to read.
Instead, try this Regex, from here (these guys know their Regex).
/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/i
I also agree with Bergi's comment - don't validate e-mail addresses using complex Regex. Even the one I suggest is overkill for most of today's applications.
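As a rough illustration, the shorter pattern can be checked against a few plus-style addresses. This is only a sketch: the `^` and `$` anchors are an addition (the pattern as quoted would also match a substring), `@` is used as the separator, and `isPlausibleEmail` is a hypothetical helper name, not part of the validation plugin.

```javascript
// Shorter pattern from the answer, with ^ and $ anchors added (an assumption)
// so that the whole input has to match rather than just a substring of it.
const emailPattern = /^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/i;

// Hypothetical helper: true when the value looks like an email address.
function isPlausibleEmail(value) {
  return emailPattern.test(value);
}

console.log(isPlausibleEmail("tester+account1@gmail.com")); // plus tag in the local part
console.log(isPlausibleEmail("no-at-sign.example.com"));    // missing separator
```

In the plugin, a pattern like this would presumably replace the long one inside the `email` method, keeping the `this.optional(element) ||` prefix.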

Related

Regular expression - for email spam filtering, match email address variants other than the original

I am an email spam quarantine administrator, and I can write regular expression rules to block email messages. A common class of spam hitting our domain spoofs the username of one of our email addresses in front of some other domain.
For example, suppose my email address is jwclark@domain.com. In that case, spammers are writing to me from all kinds of other domains that start with my username, such as:
jwclark1234@whatever.com
jwclark@wrongdomain.com
jwclark@a.domain.com
How can I write a regular expression rule that matches everything including jwclark and any wildcards, but does not match the original jwclark@domain.com? I would like a regex that matches everything above except for my actual example email address jwclark@domain.com.
I've made this regexp here
^jwclark.*[@](?!domain\.com).*$
it's in javascript format, but it should be easy to adapt to php or something else.
Given the nature of your problem, you might be better off making a regex builder function that makes the proper regexp for you, given the parameters.
Or, actually use a different approach. I recently found out how to parse ranges of floating point numbers with regexp, but that doesn't make it the proper solution to finding numbers within ranges. :P
edit - fixed silly redundancy thanks to zx81
edit - change to comply with strange limitations:
^jwclark.{0,25}[@][^d][^o][^m][^a][^i][^n].{0,25}\.com.{0,25}$
demo for the strange one
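The regex-builder idea mentioned in this answer could be sketched roughly as follows. The function names `escapeRegExp` and `buildSpoofPattern` are made up for illustration, and the generated pattern is one possible variant, not the answerer's exact expression: it uses a negative lookahead that ends in `$`, so only the exact legitimate address is excluded. It also assumes the engine supports lookahead, which the "strange limitations" mentioned above may rule out.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: the returned pattern matches any sender that starts
// with `user`, except the exact address user@domain.
function buildSpoofPattern(user, domain) {
  const u = escapeRegExp(user);
  const d = escapeRegExp(domain);
  // (?!@domain$) rejects only the genuine address; [^@]* allows extra
  // characters after the username, as in jwclark1234@whatever.com.
  return new RegExp("^" + u + "(?!@" + d + "$)[^@]*@.+$", "i");
}

const spoofPattern = buildSpoofPattern("jwclark", "domain.com");
console.log(spoofPattern.test("jwclark@wrongdomain.com")); // spoofed sender
console.log(spoofPattern.test("jwclark@domain.com"));      // genuine sender
```

Generating the pattern from parameters keeps the escaping of the dot in the domain in one place, which is easy to forget when writing one rule per mailbox by hand.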

HTML5 Input Pattern vs. Non-Latin Letters

I want to pre-validate an input form with the new HTML5 pattern attribute. My data is a domain name, so the built-in <input type="url"> validation doesn't apply.
The problem is that I can't use A-Za-z because of IDNs (internationalized domain names).
So the question is: is there any way to use <input pattern=""> to validate arbitrary non-English letters?
I tried \w, of course, but it only works for Latin letters...
Maybe someone has a set of \xNN-\xNN ranges that covers ALL Unicode alphabetic characters, or some other way?
edit: "This question may already have an answer here:" - no, there is no answer.
Based on my testing, the HTML5 pattern attribute supports Unicode code points in exactly the same way that JavaScript does (and does not):
It only supports \u notation for Unicode code points, so \u00a1 will match '¡'.
Because these define characters, you can use them in character ranges like [\u00a1-\uffff]
. will match Unicode characters as well.
You don't really specify how you want to pre-validate so I can't really help you more than that, but by looking up the unicode character values, you should be able to work out what you need in your regex.
Keep in mind that the pattern regex execution is rather dumb overall and isn't universally supported. I recommend progressive enhancement with some javascript on top of the pattern value (you can even re-use the regex more or less).
As always, never trust user input - It doesn't take a genius to make a request to your form endpoint and pass more or less whatever data they like. Your server-side validation should necessarily be more explicit. Your client-side validation can be more generous, depending upon whether false positives or false negatives are more problematic to your use case.
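The progressive-enhancement suggestion above, re-using the pattern value from script, might look something like this. The pattern string and the `matchesPattern` helper are illustrative assumptions; the range `\u00a1-\uffff` is deliberately loose and is not a precise definition of valid IDN characters.

```javascript
// Hypothetical loose pattern for a domain-name field: ASCII letters, digits,
// dots and hyphens, plus any character from U+00A1 to U+FFFF.
const domainPattern = "[A-Za-z0-9\\u00a1-\\uffff.-]+";

// Browsers match the pattern attribute against the entire value, so the
// script-side check wraps the pattern in ^(?: ... )$ to behave the same way.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

console.log(matchesPattern(domainPattern, "münchen.de"));   // non-Latin letter in range
console.log(matchesPattern(domainPattern, "exa mple.com")); // space is not allowed
```

In a page, the same string could be read from the input's `pattern` attribute instead of being hard-coded, so the markup and the script share one definition.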
I know this isn't what you want to hear, but...
The HTML5 pattern attribute isn't really for the programmer so much as it's for the user. So, considering the unfortunate limitations of pattern, you are best off providing a "loose" pattern--one that doesn't give false negatives but allows for a few false positives. When I've run into this problem, I found that the best thing to do was a pattern consisting of a blacklist + a couple minimum requirements. Hopefully, that can be done in your case.

RegEx expression for form validation

I am trying to add form validation to my HTML site in order to prevent XSS injection attacks.
I am using a simple JavaScript form validator, genvalidator_v4.js, that lets me use regular expressions to determine what is allowed in a text box. I am trying to write one that rejects "<", ">", and anything else that could be used for tags in this kind of attack, but still allows alphanumeric characters, punctuation, and other special characters.
Any ideas? Also open to other methods of preventing xss attacks but I am very inexperienced in this area so please keep it as simple as possible.
You are trying to blacklist dangerous input. That's very tricky, and it's easy to get wrong because of the sheer number of tokens that could be dangerous.
Thus, the following two practices are recommended instead:
Escape everything read from the database before outputting it on a web page. If you correctly HtmlEncode everything (your language of choice surely has a library method for that), it doesn't matter if a user entered <script>/* do something evil */</script> and that code got stored in your database. Correctly encoded, this will just be printed verbatim and do no harm.
If you still want to filter input (which might be useful as an additional layer of security), whitelists are generally safer than blacklists. So, instead of saying that < is harmful, you say that letters, digits, punctuation, etc. are safe. What exactly is safe depends on what type of field you are filtering.
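Both practices can be sketched in a few lines. This is a minimal illustration, not a complete defence: `escapeHtml` is a hand-written stand-in for whatever encoding function your platform provides, and the whitelist is for a hypothetical "display name" field, so the allowed characters and the length limit are assumptions.

```javascript
// Practice 1: encode on output. Replace the characters that are significant
// in HTML so stored text is displayed verbatim instead of being interpreted.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Practice 2: whitelist on input. Hypothetical rule for a display-name field:
// letters, digits, spaces and a little punctuation, 1 to 50 characters.
const safeNamePattern = /^[A-Za-z0-9 .,'-]{1,50}$/;

function isSafeName(value) {
  return safeNamePattern.test(value);
}

console.log(escapeHtml("<script>alert(1)</script>"));
console.log(isSafeName("Mary-Jane O'Neil"));
```

The whitelist is deliberately narrow; a free-text comment field would need a different rule, which is why encoding on output remains the primary protection.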

Find the regex used by HTML5 forms for validation

Some HTML5 input elements accept the pattern attribute, which is a regex for form validation. Other HTML5 input elements, such as input type=email, do the validation automatically.
It seems that validation is handled differently across browsers. Given a specific browser, say Chrome, is it possible to programmatically extract the regex used for validation? Or is there documentation out there?
The HTML5 spec currently lists a valid email address as one matching the ABNF:
1*( atext / "." ) "@" ldh-str *( "." ldh-str )
which is elucidated in this question. @SLaks's answer provides a regex equivalent.
That said, a little digging through the source shows that WebKit implemented email address validation using basically the same regex as in SLaks's answer, i.e.,
[a-z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-z0-9-]+(\.[a-z0-9-]+)*
However, there is no requirement that email addresses be validated by a regex. For example, Mozilla (Gecko) implemented email validation using a pretty basic finite state machine. Hence, there needn't be a regex involved in email validation.
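To illustrate that no regex is required, the ABNF quoted above can be checked with plain character scanning. This is a simplified sketch and not Gecko's actual state machine; the function names are hypothetical, and it follows the ABNF literally, so it does not enforce length limits or forbid labels that start or end with a hyphen.

```javascript
// Symbols allowed by `atext` in addition to letters and digits.
const ATEXT_SYMBOLS = "!#$%&'*+/=?^_`{|}~-";

function isLetterOrDigit(ch) {
  return (ch >= "a" && ch <= "z") || (ch >= "A" && ch <= "Z") || (ch >= "0" && ch <= "9");
}

// 1*( atext / "." ): at least one character, each an atext character or a dot.
function isValidLocalPart(local) {
  if (local.length === 0) return false;
  for (const ch of local) {
    if (!isLetterOrDigit(ch) && ch !== "." && !ATEXT_SYMBOLS.includes(ch)) return false;
  }
  return true;
}

// ldh-str: a non-empty run of letters, digits and hyphens.
function isValidLabel(label) {
  if (label.length === 0) return false;
  for (const ch of label) {
    if (!isLetterOrDigit(ch) && ch !== "-") return false;
  }
  return true;
}

// local-part "@" ldh-str *( "." ldh-str ), with exactly one "@".
function isValidEmailByAbnf(value) {
  const at = value.indexOf("@");
  if (at === -1 || at !== value.lastIndexOf("@")) return false;
  const labels = value.slice(at + 1).split(".");
  return isValidLocalPart(value.slice(0, at)) && labels.every(isValidLabel);
}

console.log(isValidEmailByAbnf("user+tag@example.com"));
console.log(isValidEmailByAbnf("user@example..com"));
```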
The HTML5 spec now gives a (non-normative) regex which is supposed to exactly match all email addresses that it specifies as valid. There's a copy of it on my blog here:
http://blog.gerv.net/2011/05/html5_email_address_regexp/
and in the spec itself:
https://html.spec.whatwg.org/#e-mail-state-(type=email)
The version above is incorrect only in that it does not limit domain components to max 255 characters and does not prevent them beginning or ending with a "-".
Gerv
this works for me:
pattern="[^@]+@[^@]+\.[a-zA-Z]{2,6}"

Regular expression for email [duplicate]

Possible Duplicate:
What is the best regular expression for validating email addresses?
I am using this particular regular expression for checking emails.
"^[A-Za-z0-9](([a-zA-Z0-9,=\.!\-#|\$%\^&\*\+/\?_`\{\}~]+)*)@(?:[0-9a-zA-Z-]+\.)+[a-zA-Z]{2,9}$"
The issue I have is that this allows a period immediately before the "@" symbol.
Is there any way this expression can be modified so that this does not happen, while all other conditions are maintained?
a.@b.com should be invalid
Thanks in advance
-Rollin
The best answer I've seen so far. Honestly, if you gave some indication of which language or toolset you were using, I would point you to the library that does it for you rather than telling you how to hand-roll a regular expression for this.
Edit: Given the additional information that this is on .NET, I would use the MailAddress class and abandon the thought of using regular expressions altogether like so:
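// Requires: using System; using System.Net.Mail;
public bool IsAddressValid(string text)
{
    try
    {
        // The constructor parses the address and throws if it is malformed.
        MailAddress address = new MailAddress(text);
        return true;
    }
    catch (FormatException)
    {
        return false;
    }
}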
If there are additional requirements over and above validating the address itself (like making sure it is from a particular set of domains or some such) then you can do that with much simpler tests after you have verified that the address is valid as I suggested in another post.
A strange game. The only winning move
is not to play.
Seriously, the only winning way to validate email addresses with a regular expression is to not validate email addresses with a regular expression. The grammar that accepts email addresses is not regular. It's true that modern regular expressions can support some non-regular grammars, but even so, maintaining a regular expression for such a complex grammar is (as you can see) nearly impossible.
The only reasonable use of regular expressions with email addresses that I can think of is to do some minimal checking on the client side (does it contain an @ symbol?).
What you should do is:
1. Send an email to the email address with a link for the user to click. If the user clicks the link, the email address is valid. Furthermore, it exists, and the user is probably the one who entered the email address into your form. Not only does this 100% validate the email address, it gives you even more guarantees.
2. If you can't do 1, use a prepackaged email validator. Better, use both 1 and 2.
3. If you can't do 1 or 2, write a real parser to validate email addresses.
You could put [^.] before the @ so that it will allow any character except the dot.
Of course, this is probably not what you want, so you could instead put a [] with the legal characters in it.
In case someone has an email name (I mean the part before the @) that is just one character, you might need to get creative with the |.
If your regex engine has lookbehind assertions, then you can just add a "(?<!\.)" before the "@".
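The lookbehind suggestion can be tried out as follows. This is a sketch that assumes a JavaScript engine with lookbehind support (current Node.js and most current browsers). It is based on the pattern from the question, with `@` as the separator and one simplification: the nested `([...]+)*` group is flattened to `[...]*`, which accepts the same strings but avoids heavy backtracking. The helper name is hypothetical.

```javascript
// Pattern from the question with (?<!\.) inserted before the @, so the
// character immediately before the separator cannot be a period.
const noTrailingDotPattern =
  /^[A-Za-z0-9][a-zA-Z0-9,=.!\-#|$%^&*+/?_`{}~]*(?<!\.)@(?:[0-9a-zA-Z-]+\.)+[a-zA-Z]{2,9}$/;

// Hypothetical helper wrapping the test.
function isAcceptedAddress(value) {
  return noTrailingDotPattern.test(value);
}

console.log(isAcceptedAddress("a.b@b.com")); // period inside the local part
console.log(isAcceptedAddress("a.@b.com"));  // period right before the @
```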
If you're doing this in Perl - the following script is an example
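my $string = 'name@domain.com';
# \w matches letters, digits and underscore, but not a period
if ($string =~ /(\w+\@[a-zA-Z_]+?\.[a-zA-Z]{2,9})/)
{
    print "gotcha!";
}
else
{
    print "nope :(";
}
The Perl character class \w does not match a period, so the character immediately before the @ must be a letter, digit, or underscore. If you change $string to "name.@domain.com", the match will fail. Note that the pattern is not anchored, so it only checks the part of the string around the @.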