Find the regex used by HTML5 forms for validation - regex

Some HTML5 input elements accept the pattern attribute, which is a regex for form validation. Other HTML5 input elements, such as input type=email, do the validation automatically.
Now it seems that validation is handled differently across browsers. Given a specific browser, say Chrome, is it possible to programmatically extract the regex used for validation? Or maybe there is documentation out there?

The HTML5 spec currently lists a valid email address as one matching the ABNF:
1*( atext / "." ) "@" ldh-str *( "." ldh-str )
which is elucidated in this question. @SLaks's answer provides a regex equivalent.
That said, a little digging through the source shows that WebKit implemented email address validation using basically the same regex as SLaks's answer, i.e.,
[a-z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-z0-9-]+(\.[a-z0-9-]+)*
However, there is no requirement that email addresses be validated by a regex. For example, Mozilla (Gecko) implemented email validation using a pretty basic finite state machine. Hence, there needn't be a regex involved in email validation.
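As a rough illustration, the WebKit-style pattern quoted above (with "@" as the separator) can be tried directly in JavaScript. This is only a sketch: the ^...$ anchors, the i flag, and the helper name looksLikeEmail are assumptions added here, not something taken from the WebKit source or from any browser API.

```javascript
// Anchored, case-insensitive version of the pattern quoted above.
// Assumption: ^...$ is added so the whole string must match.
const WEBKIT_STYLE_EMAIL_RE =
  /^[a-z0-9!#$%&'*+\/=?^_`{|}~.-]+@[a-z0-9-]+(\.[a-z0-9-]+)*$/i;

// Hypothetical helper name; not part of any browser API.
function looksLikeEmail(value) {
  return WEBKIT_STYLE_EMAIL_RE.test(value);
}

console.log(looksLikeEmail("user+tag@example.com"));
console.log(looksLikeEmail("no-at-sign.example.com"));
```

Note that this pattern accepts a domain without a dot (for example user@localhost), which is consistent with the ABNF above.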

The HTML5 spec now gives a (non-normative) regex which is supposed to exactly match all email addresses that it specifies as valid. There's a copy of it on my blog here:
http://blog.gerv.net/2011/05/html5_email_address_regexp/
and in the spec itself:
https://html.spec.whatwg.org/#e-mail-state-(type=email)
The version above is incorrect only in that it does not limit domain components to max 255 characters and does not prevent them beginning or ending with a "-".
Gerv

This works for me:
pattern="[^@]+@[^@]+\.[a-zA-Z]{2,6}"

Related

RegEx equivalent for C# data annotation [DataType(DataType.Password)]

I have an iOS native login that works with a custom API for a site with .Net's Identity.
I need a regEx expression (for setting the password when signing up) that matches the requirements for the data annotation [DataType(DataType.Password)] in C#.
Does anyone know where to look?
DataType.Password doesn't trigger any specific (regex) validation. If you use Html.EditorFor on a password field, it simply generates HTML for a password input, which masks the typed characters (*****).
Otherwise, password strength is validated by the membership provider (or whatever you are using to store your users). Even then, it often can't easily be captured in a single regex, since it contains requirements such as
- at least 1 digit
- at least 1 lower & uppercase letter
- at least 6 characters long.
- etc
Those kinds of requirements often turn into very nasty regular expressions:
([a-z]+[A-Z]+[a-zA-Z])|([A-Z]+[a-z]+[a-zA-Z]) ....
It becomes easier if you split each requirement into its own regular expression.
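A minimal sketch of the "one regular expression per requirement" idea follows. It is written in JavaScript purely for illustration; the rule set mirrors the example requirements listed above and is an assumption, not what DataType.Password or any particular membership provider enforces. The names PASSWORD_RULES and failedPasswordRules are hypothetical.

```javascript
// Each rule is one small regular expression; a password must pass all of them.
// Assumption: these rules only mirror the example list above.
const PASSWORD_RULES = [
  { name: "at least 1 digit", re: /\d/ },
  { name: "at least 1 lowercase letter", re: /[a-z]/ },
  { name: "at least 1 uppercase letter", re: /[A-Z]/ },
  { name: "at least 6 characters", re: /^[\s\S]{6,}$/ },
];

// Returns the names of the rules the password fails (empty array = valid).
function failedPasswordRules(password) {
  return PASSWORD_RULES
    .filter((rule) => !rule.re.test(password))
    .map((rule) => rule.name);
}

console.log(failedPasswordRules("Abc123"));
console.log(failedPasswordRules("abc"));
```

Keeping the rules separate also makes it easy to tell the user which requirement failed, which a single combined expression cannot do.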

Regular expression - for email spam filtering, match email address variants other than the original

I am an email spam quarantine administrator and I can write regular expression rules to block email messages. A common class of spam hitting our domain spoofs the username of one of our email addresses in front of some other domain.
For example, suppose my email address is jwclark@domain.com. In that case, spammers are writing to me from all kinds of other domains that start with my username such as:
jwclark1234@whatever.com
jwclark@wrongdomain.com
jwclark@a.domain.com
How can I write a regular expression rule to match everything including jwclark and any wildcards, but not match the original jwclark@domain.com? I would like a regex that matches everything above except for my actual example email address jwclark@domain.com.
I've made this regexp here:
^jwclark.*[@](?!domain\.com).*$
It's in JavaScript format, but it should be easy to adapt to PHP or something else.
Given the nature of your problem, you might be better off making a regex builder function that makes the proper regexp for you, given the parameters.
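Such a builder might look roughly like the sketch below. The function names escapeRegExp and buildSpoofRegex are hypothetical, and two details are assumptions that go beyond the pattern above: the lookahead ends with $ so that only the exact legitimate domain is excluded, and the i flag makes the match case-insensitive.

```javascript
// Escape regex metacharacters so the user name and domain match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: matches sender addresses that start with `user`
// but whose domain is not exactly `domain`.
function buildSpoofRegex(user, domain) {
  return new RegExp(
    "^" + escapeRegExp(user) + ".*@(?!" + escapeRegExp(domain) + "$).*$",
    "i"
  );
}

const spoof = buildSpoofRegex("jwclark", "domain.com");
console.log(spoof.test("jwclark1234@whatever.com"));
console.log(spoof.test("jwclark@domain.com"));
```

Whether a given quarantine product supports lookaheads at all depends on its regex engine, so this should be checked before relying on it.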
Or, actually use a different approach. I recently found out how to parse ranges of floating point numbers with regexp, but that doesn't make it the proper solution to finding numbers within ranges. :P
edit - fixed silly redundancy thanks to zx81
edit - change to comply with strange limitations:
^jwclark.{0,25}[@][^d][^o][^m][^a][^i][^n].{0,25}\.com.{0,25}$
demo for the strange one

How to allow "+" character in email validation?

We're using the validate.js plugin for form validation. I want to allow "+" in the email field so I can use multiple test accounts with Gmail. However, the plugin validation doesn't allow it.
I'm not very good with regex, how would I alter the below to allow the + character? (I tried going to that commented URL but the page no longer exists it seems.)
email: function( value, element ) {
// contributed by Scott Gonzalez: http://projects.scottsplayground.com/email_address_validation/
return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i.test(value);
},
That is a terribly written regular expression. Not only is it far too long and complicated, but for some reason the author decided it was a good idea to escape characters in a [...] character class, which you absolutely should not do.
Instead, try this Regex, from here (these guys know their Regex).
/[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/i
I also agree with Bergi's comment - don't validate e-mail addresses using complex Regex. Even the one I suggest is overkill for most of today's applications.
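To check that the shorter pattern accepts "+" addresses, it can be exercised outside the plugin first. The sketch below is an assumption-laden stand-in: the ^...$ anchors are added here so the whole value must match, and isEmail is a hypothetical function, not the validate.js API.

```javascript
// The shorter pattern suggested above, anchored for whole-string matching.
// Assumption: the ^ and $ anchors are an addition for this sketch.
const SIMPLE_EMAIL_RE =
  /^[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/i;

// Hypothetical stand-in for the body of the plugin's email method.
function isEmail(value) {
  return SIMPLE_EMAIL_RE.test(value);
}

console.log(isEmail("tester+account1@gmail.com"));
console.log(isEmail("tester@@gmail.com"));
```

Unlike the HTML5-style pattern, this one requires at least one dot in the domain, so an address such as tester+1@gmail is rejected.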

How can I accept mail from only one domain in smtp-gated?

I need a Perl-compatible regular expression for filtering my email with smtp-gated. I only want to allow one domain ('mydomain.com') and reject everything else. How do I do this in a foolproof way? (regex_reject_mail_from)
I know this question halfway belongs on serverfault, but basically it's a Perl regex question so I think it fits stackoverflow more.
EDIT:
This should match so I can reject it:
"Someone" <someone@somedomain.com>
This should not match:
"Me" <me@mydomain.com>
This shouldn't match either:
you@mydomain.com
-
I'd suggest the following:
\b[A-Z0-9._%+-]+@(?!mydomain\.com)[A-Z0-9.-]+\.[A-Z]{2,6}\b
Use the /i option to make it case-insensitive.
This will match most valid (and some invalid) e-mail addresses that don't have mydomain.com after the @. Keep in mind that e-mail validation is hard with regexes.
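The behaviour of the negative lookahead can be checked against the three samples from the question. smtp-gated uses Perl-compatible regular expressions; the sketch below uses JavaScript only as a convenient way to try the same pattern, and the names REJECT_RE and shouldReject are hypothetical.

```javascript
// Same pattern as above, with the i flag for case-insensitivity.
// A match means the sender is NOT at mydomain.com and should be rejected.
const REJECT_RE = /\b[A-Z0-9._%+-]+@(?!mydomain\.com)[A-Z0-9.-]+\.[A-Z]{2,6}\b/i;

// Hypothetical helper wrapping the test.
function shouldReject(mailFrom) {
  return REJECT_RE.test(mailFrom);
}

console.log(shouldReject('"Someone" <someone@somedomain.com>'));
console.log(shouldReject('"Me" <me@mydomain.com>'));
console.log(shouldReject("you@mydomain.com"));
```

Because the lookahead only checks a prefix, a domain that merely starts with mydomain.com (for example mydomain.com.example.org) is also let through; adding a boundary after the domain inside the lookahead would tighten this.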
If your regex is going to be applied to the MAIL FROM line in the MTA communication then you do not need to concern yourself with the full 'email address' specification. MAIL FROM lines are just the email address enclosed in '<>', so any regex that tests for @mydomain.com> should work.
\b(?:"[ a-zA-Z]+")?\s*<?[a-zA-Z0-9_.]+@(?!mydomain\.com)\w+(?:\.\w{2,})+>?\b
UPDATE: Note that this regex is fa{9,}r from being perfect. Check the official regex for email addresses for more info (Scroll down to the <p/> titled RFC 2822).

CAtlRegExp for a regular expression that matches 4 characters max

Short version:
How can I get a regex that matches a@a.aaaa but not a@a.aaaaa using CAtlRegExp?
Long version:
I'm using CAtlRegExp http://msdn.microsoft.com/en-us/library/k3zs4axe(VS.80).aspx to try to match email addresses. I want to use the regex
^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$
extracted from here.
But the syntax that CAtlRegExp accepts is different than the one used there. This regex returns the error REPARSE_ERROR_BRACKET_EXPECTED, you can check for yourself using this app: http://www.codeproject.com/KB/string/mfcregex.aspx
Using said app, I created this regex:
^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+[a-zA-Z]$
But the problem is that this matches a@a.aaaaa as valid; I need it to match 4 characters maximum for the top-level domain.
So, how can I get a regex that matches a@a.aaaa but not a@a.aaaaa?
Try: ^[a-zA-Z0-9\._%\+\-]+@([a-zA-Z0-9-]+\.)+\c\c\c?\c?$
This expression replaces the [A-Z]{2,4} sequence which CAtlRegExp doesn't support with \c\c\c?\c?
\c serves as an abbreviation of [a-zA-Z]. The question marks after the 3rd and 4th \c's indicate they can match either zero or one characters. As a result, this portion of the expression matches 2, 3 or 4 characters, but neither more nor less.
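The same "two required, two optional" expansion can be sanity-checked in any regex engine. The sketch below uses JavaScript only to illustrate the technique, not CAtlRegExp itself: \c is written out as [a-zA-Z] because \c has a different meaning in JavaScript, and the names LETTER, TLD and ATL_STYLE_EMAIL_RE are hypothetical.

```javascript
// CAtlRegExp's \c (any alphabetic character) is spelled out as [a-zA-Z] here.
const LETTER = "[a-zA-Z]";

// Two required letters followed by two optional ones: matches 2, 3 or 4 letters.
const TLD = LETTER + LETTER + LETTER + "?" + LETTER + "?";

const ATL_STYLE_EMAIL_RE = new RegExp(
  "^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9-]+\\.)+" + TLD + "$"
);

console.log(ATL_STYLE_EMAIL_RE.test("a@a.aaaa"));
console.log(ATL_STYLE_EMAIL_RE.test("a@a.aaaaa"));
```

The trailing $ is what makes the upper bound effective: without it, a fifth letter would simply be left unmatched and the address would still be accepted.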
You are trying to match email addresses, a very widely used and critical element of internet communication, so this job is best done with the most widely used, most correct regex available.
Since email address format rules are described by RFC 822, it seems useful to do internet searches for something like "RFC822 email regex".
For Perl the answer seems to be easy: use Mail::RFC822::Address: regexp-based address validation
RFC 822 Email Address Parser in PHP
Thus, to achieve the most correct handling of email addresses, one should either locate the most precise regex that there is out somewhere for the particular toolkit (ATL in your case) or - in case there's no suitable existing regex yet - adapt a very precise regex of another toolkit (Perl above seems to be a very complete albeit difficult candidate).
If you're trying to match a specific sub part of email addresses (as seems to be the case given your question), then it probably still makes sense to start with the most up-to-date/correct/universal regex and specifically limit it to the parts that you require.
Perhaps I stated the obvious, but I hope it helped.