HTML5 Pattern Attribute: Exclude Keywords - regex

I'm trying to write an HTML5 pattern to prevent users from entering free email accounts. So far I have this...
<input name="email"
placeholder="Work Email"
required
type="email"
title="Enter a valid work email address (No free email services)"
pattern="^((?!hotmail)(?!gmail)(?!ymail)(?!googlemail)(?!live)(?!gmx)(?!yahoo)(?!outlook)(?!msn)(?!icloud)(?!facebook)(?!aol)(?!zoho)(?!mail)(?!yandex)(?!hushmail)(?!lycox)(?!lycosmail)(?!inbox)(?!myway)(?!aim)(?!fastmail)(?!goowy)(?!juno)(?!shortmail)(?!atmail)(?!protonmail).)*$"/>
It is close, but is missing two key rules...
Should only look at what comes between the '#' and the '.'
Should be case insensitive.
Any ideas on how to get this working?
UPDATE
To avoid unhelpful comments about if I should be doing this, consider other use cases where an input field should not contain a list of known keywords. A similar use case could be swear words or an ID prefix where multiple prefixes exist, but you want to avoid just one type being entered... ID should never contain IXT and user enters... WIN-09880, IXT-2342, NTS-23422.

This is a really bad idea, but in the spirit of answering your question, here is my answer.
You can use:
pattern="^.+#((?!hotmail)(?!gmail)(?!ymail)(?!googlemail)(?!live)(?!gmx)(?!yahoo)(?!outlook)(?!msn)(?!icloud)(?!facebook)(?!aol)(?!zoho)(?!mail)(?!yandex)(?!hushmail)(?!lycox)(?!lycosmail)(?!inbox)(?!myway)(?!aim)(?!fastmail)(?!goowy)(?!juno)(?!shortmail)(?!atmail)(?!protonmail).)+\..+$"
to only look between the '#' and the '.'. HTML5 doesn't support the i flag for case-insensitivity, so you will either need to use JavaScript or hardcode case-insensitivity into the pattern.
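As a rough illustration of hardcoding case-insensitivity, each keyword can be expanded into per-letter character classes before it is pasted into the pattern. This is only a sketch; `toCaseInsensitive` is a hypothetical helper name, not part of any browser API.

```javascript
// Hypothetical helper: expand a keyword so every letter matches in either case.
// Characters without a case distinction are regex-escaped and passed through.
function toCaseInsensitive(keyword) {
  return Array.from(keyword, (ch) => {
    const lower = ch.toLowerCase();
    const upper = ch.toUpperCase();
    if (lower !== upper) {
      return `[${lower}${upper}]`;
    }
    // Escape characters that are special in a regular expression.
    return ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }).join("");
}

console.log(toCaseInsensitive("gmail")); // [gG][mM][aA][iI][lL]
```

Each `(?!gmail)` in the pattern would then become `(?![gG][mM][aA][iI][lL])`, which gets long quickly and is one reason the JavaScript route may be easier to maintain.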

You can't:
I've found a non-exhaustive list of free email providers which contains 2840 entries.
You'll block users who work at Microsoft, Google, Facebook, Yahoo, ProtonMail, Free, Orange, SFR and many others.
What will you do with users who have their own domain names?

Here's a corrected (and shortened) version of your regex that targets only the domain portion of the address:
^(?!.*#(?:hotmail|gmail|ymail|googlemail|live|gmx|yahoo|outlook|msn|icloud|facebook|aol|zoho|mail|yandex|hushmail|lycox|lycosmail|inbox|myway|aim|fastmail|goowy|juno|shortmail|atmail|protonmail)\.\w+$).*$
You can shorten it further if you need to:
^(?!.*#(?:live|gmx|yahoo|outlook|msn|icloud|facebook|aol|zoho|yandex|lycox|inbox|myway|aim|goowy|juno|(?:hot|[gy]|google|short|at|proton|hush|lycos|fast)?mail)\.\w+$).*$
You can't make it case insensitive because the JavaScript regex flavor, very annoyingly, doesn't support inline modifiers. But do you have to use a regex for this? I would prefer a code solution using an updatable list of banned domains.
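A code solution along those lines might look like the sketch below. It assumes the `#` in the patterns above stands for `@`; the `bannedDomains` list and the `isBannedEmail` name are made up for illustration, and a real list would be loaded from somewhere that can be updated.

```javascript
// Hypothetical, deliberately tiny list; a real one would come from a file
// or service so it can be updated without touching the code.
const bannedDomains = new Set(["gmail.com", "hotmail.com", "yahoo.com"]);

// Returns true when the address's domain (the part after the last "@")
// is on the banned list. The comparison is case-insensitive.
function isBannedEmail(email, banned = bannedDomains) {
  const at = email.lastIndexOf("@");
  if (at === -1) {
    return false; // not an email address; leave that to other validation
  }
  const domain = email.slice(at + 1).trim().toLowerCase();
  return banned.has(domain);
}

console.log(isBannedEmail("jane@GMail.com"));   // true
console.log(isBannedEmail("jane@example.org")); // false
```

In a browser, the result could be fed to the input's `setCustomValidity()` so the form still reports the error through the normal validation UI.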

Related

Regex fix for GSuite content compliance

I have a regex that I am trying to check for phishing emails.
The emails come in like:
Principal-joe smith <officeemailxyz#gmail.com>
I need to identify any email that has
principal*#gmail.com or #hotmail.com or #yahoo.com.
This is my regex:
(\W|^)(?i)pr[i!1]nc[i!1]p[a#]l#(yahoo|hotmail|gmail)\.com(\W|$)
(\W|^)(?i)pr[i!1]nc[i!1]p[a#]l---WHAT GOES HERE---#(yahoo|hotmail|gmail)\.com(\W|$)
Or is there a better way to do this?
First, I think you have to make the search case insensitive with the option /i.
Then, you should include any character that is valid in an email address, plus the space, if I understand your example correctly.
I ran a couple of tests and the following seems to catch all cases.
/^[\w\s]*principal[\<+\s*a-zA-Z0-9._-]*?#(?:yahoo|hotmail|gmail)\.com[\>]?$/gmi
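For the "what goes here" part of the question, one possible filler is `[^@]*`, which allows anything except another at-sign between "principal" and the address separator. The sketch below is JavaScript for illustration only and assumes the `#` in the patterns above stands for `@`; the function name is hypothetical, and whether the target filter accepts the same syntax is not something this example can confirm.

```javascript
// Sketch only: "[^@]*" fills the gap between "principal" and the "@",
// so display names and unrelated local parts are allowed in between.
// The "i" flag replaces the inline (?i), which JavaScript has
// traditionally not accepted.
const phishingPattern =
  /(\W|^)pr[i!1]nc[i!1]p[a@]l[^@]*@(yahoo|hotmail|gmail)\.com(\W|$)/i;

function looksLikePrincipalPhish(fromHeader) {
  return phishingPattern.test(fromHeader);
}

console.log(looksLikePrincipalPhish("Principal-joe smith <officeemailxyz@gmail.com>")); // true
console.log(looksLikePrincipalPhish("principal@ourschool.example")); // false
```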

Regular expression - for email spam filtering, match email address variants other than the original

I am an email spam quarantine administrator and I can write regular expression rules to block email messages. A common class of spam hitting our domain spoofs the username of one of our email addresses in front of some other domain.
For example, suppose my email address is jwclark#domain.com. In that case, spammers are writing to me from all kinds of other domains that start with my username such as:
jwclark1234#whatever.com
jwclark#wrongdomain.com
jwclark#a.domain.com
How can I write a regular expression rule to match everything including jwclark and any wildcards, but not match the original jwclark#domain.com? I would like a regex that matches everything above except for my actual example email address jwclark#domain.com.
I've made this regexp here
^jwclark.*[#](?!domain\.com).*$
It's in JavaScript format, but it should be easy to adapt to PHP or something else.
Given the nature of your problem, you might be better off making a regex builder function that makes the proper regexp for you, given the parameters.
Or, actually use a different approach. I recently found out how to parse ranges of floating point numbers with regexp, but that doesn't make it the proper solution to finding numbers within ranges. :P
edit - fixed silly redundancy thanks to zx81
edit - change to comply with strange limitations:
^jwclark.{0,25}[#][^d][^o][^m][^a][^i][^n].{0,25}\.com.{0,25}$
demo for the strange one
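A regex builder of the kind mentioned above could be sketched as follows. It assumes the `#` in the addresses above stands for `@`, and the names `escapeRegExp` and `buildSpoofRegex` are hypothetical. Unlike the first pattern in this answer, the sketch anchors the exclusion to the whole address, so only the exact original address is let through.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical builder: the result matches senders whose address starts
// with `user` but is not exactly `user@domain` (compared case-insensitively).
function buildSpoofRegex(user, domain) {
  const u = escapeRegExp(user);
  const d = escapeRegExp(domain);
  return new RegExp(`^(?!${u}@${d}$)${u}[^@]*@.+$`, "i");
}

const spoof = buildSpoofRegex("jwclark", "domain.com");
console.log(spoof.test("jwclark1234@whatever.com")); // true
console.log(spoof.test("jwclark@a.domain.com"));     // true
console.log(spoof.test("jwclark@domain.com"));       // false
```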

REGEX rule to validate a domain field

One of the products I offer is only available to people with certain domain extensions.
On the order form there is a field for them to enter their domain, and the system I am using does allow me to validate that field before continuing the order process.
I can add a 'Validation REGEX' to be run on the value entered in the domain field.
The TLDs that are supported are: .com, .net, .org, .biz, .info, .name, .tv, .cc, .me, .pro, .mobi, .cm, .co, .com.co, .nom.co, .net.co, .ws
I am trying to find out what REGEX validation to use to determine if the domain entered in the field matches one of those TLDs.
I can't change any of the code for this task. I just have a field to enter the REGEX validation rule.
I appreciate any ideas or suggestions you may have.
If it's just a domain they enter, use
\.(com|net|org|biz|info|name|tv|cc|me|pro|mobi|cm|co|ws)$
This matches a domain ending in a point followed by one of the TLD's you specified.
Since you're already allowing .co as a TLD, there's no need to check for com.co, nom.co, or net.co; they're valid since they end in .co.
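To see why the compound extensions need no separate entry, the pattern can be tried against a few sample domains. This is a JavaScript sketch for illustration; the `i` flag and the `hasSupportedTld` name are additions of this example, and the regex flavor of the order system is unknown.

```javascript
// Same alternation as above; the "i" flag is an assumption of this sketch
// so that upper-case input is accepted too.
const tldPattern = /\.(com|net|org|biz|info|name|tv|cc|me|pro|mobi|cm|co|ws)$/i;

function hasSupportedTld(domain) {
  return tldPattern.test(domain.trim());
}

console.log(hasSupportedTld("example.com"));    // true
console.log(hasSupportedTld("example.com.co")); // true, because it ends in .co
console.log(hasSupportedTld("example.de"));     // false
```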
([^\s]+(\.(?i)(com|net|org|biz|info|name|tv|cc|me|pro|mobi|cm|co|nom|ws|com.co|nom.co|net.co))$)
This should work for:
Extensions only:
^\.(?i)(net.co|nom.co|com.co|com|net|org|biz|info|name|tv|cc|me|pro|mobi|co|cm|ws)$
name.extension:
\.(?i)(net.co|nom.co|com.co|com|net|org|biz|info|name|tv|cc|me|pro|mobi|co|cm|ws)$

Find the regex used by HTML5 forms for validation

Some HTML5 input elements accept the pattern attribute, which is a regex for form validation. Some other HTML5 input elements, such as input type=email, do the validation automatically.
Now it seems that the way validation is handled differs across browsers. Given a specific browser, say Chrome, is it possible to programmatically extract the regex used for validation? Or maybe there is documentation out there?
The HTML5 spec currently lists a valid email address as one matching the ABNF:
1*( atext / "." ) "#" ldh-str *( "." ldh-str )
which is elucidated in this question. SLaks's answer provides a regex equivalent.
That said, a little digging through the source shows that WebKit implemented email address validation using basically the same regex as SLaks's answer, i.e.,
[a-z0-9!#$%&'*+/=?^_`{|}~.-]+#[a-z0-9-]+(\.[a-z0-9-]+)*
However, there is no requirement that email addresses be validated by a regex. For example, Mozilla (Gecko) implemented email validation using a pretty basic finite state machine. Hence, there needn't be a regex involved in email validation.
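To illustrate what a regex-free validator can look like, here is a small state machine that follows the ABNF quoted above. It is a sketch written for this answer, not Gecko's actual code. It assumes the separator shown as "#" in the ABNF is the at-sign, takes `atext` from RFC 5322, and treats `ldh-str` as one or more letters, digits or hyphens; the function name is hypothetical.

```javascript
const ATEXT = /[A-Za-z0-9!#$%&'*+\/=?^_`{|}~-]/; // atext per RFC 5322
const LDH = /[A-Za-z0-9-]/;                      // letter, digit or hyphen

// Walks the input one character at a time through four states.
function isValidEmail(input) {
  let state = "local-start";
  for (const ch of input) {
    switch (state) {
      case "local-start": // local part needs at least one atext or "."
        if (ATEXT.test(ch) || ch === ".") state = "local";
        else return false;
        break;
      case "local":
        if (ch === "@") state = "label-start";
        else if (!(ATEXT.test(ch) || ch === ".")) return false;
        break;
      case "label-start": // every domain label needs at least one character
        if (LDH.test(ch)) state = "label";
        else return false;
        break;
      case "label":
        if (ch === ".") state = "label-start";
        else if (!LDH.test(ch)) return false;
        break;
    }
  }
  // Only accept if the input ended inside a non-empty domain label.
  return state === "label";
}

console.log(isValidEmail("user@example.com")); // true
console.log(isValidEmail("user@example."));    // false
```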
The HTML5 spec now gives a (non-normative) regex which is supposed to exactly match all email addresses that it specifies as valid. There's a copy of it on my blog here:
http://blog.gerv.net/2011/05/html5_email_address_regexp/
and in the spec itself:
https://html.spec.whatwg.org/#e-mail-state-(type=email)
The version above is incorrect only in that it does not limit domain components to max 255 characters and does not prevent them beginning or ending with a "-".
Gerv
This works for me:
pattern="[^#]+#[^#]+\.[a-zA-Z]{2,6}"

How can I accept mail from only one domain in smtp-gated?

I need a Perl-compatible regular expression for filtering my email with smtp-gated. I only want to allow one domain ('mydomain.com') and reject everything else. How do I do this in a foolproof way? (regex_reject_mail_from)
I know this question halfway belongs on serverfault, but basically it's a Perl regex question so I think it fits stackoverflow more.
EDIT:
This should match so I can reject it:
"Someone" <someone#somedomain.com>
This should not match:
"Me" <me#mydomain.com>
This shouldn't match either:
you#mydomain.com
I'd suggest the following:
\b[A-Z0-9._%+-]+#(?!mydomain\.com)[A-Z0-9.-]+\.[A-Z]{2,6}\b
Use the /i option to make it case-insensitive.
This will match most valid (and some invalid) e-mail addresses that don't have mydomain.com after the #. Keep in mind that e-mail validation is hard with regexes.
If your regex is going to be applied to the MAIL FROM line in the MTA communication then you do not need to concern yourself with the full 'email address' specification. MAIL FROM lines are just the email address enclosed in '<>', so any regex that tests for #mydomain.com> should work.
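A pattern along those lines can be tried out as below. The sketch is in JavaScript purely for illustration and assumes the `#` in the examples above stands for `@`. It uses a negative lookahead, which Perl-compatible engines also support, but how smtp-gated applies the expression (case sensitivity, which part of the line it sees) is an assumption here, and `shouldReject` is a hypothetical name.

```javascript
// Sketch: a pattern meant to MATCH senders that should be rejected.
// It finds an "@" that is not followed by exactly "mydomain.com",
// an optional ">" and the end of the string.
const rejectPattern = /@(?!mydomain\.com>?$)/i;

function shouldReject(mailFrom) {
  return rejectPattern.test(mailFrom.trim());
}

console.log(shouldReject('"Someone" <someone@somedomain.com>')); // true
console.log(shouldReject('"Me" <me@mydomain.com>'));             // false
console.log(shouldReject("you@mydomain.com"));                   // false
```

Anchoring the lookahead to the end of the string keeps look-alike domains such as mydomain.com.evil.example from slipping through.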
\b(?:"[ a-zA-Z]+")?\s*<?[a-zA-Z0-9_.]+#(?!mydomain\.com)\w+(?:\.\w{2,})+>?\b
UPDATE: Note that this regex is fa{9,}r from being perfect. Check the official regex for email addresses for more info (Scroll down to the <p/> titled RFC 2822).