Regular Expression to not allow disposable email addresses - regex

I'm trying to create a regex that does not allow disposable email addresses but allows everything else. So far, here is what I have:
^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(((?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9]))(?=.*(?!.*mailinator.com))(?=.*(?!.*trbvm.com))(?=.*(?!.*guerrillamail.com))(?=.*(?!.*guerrillamailblock.com))(?=.*(?!.*sharklasers.com))(?=.*(?!.*guerrillamail.net))(?=.*(?!.*guerrillamail.org))(?=.*(?!.*guerrillamail.biz))(?=.*(?!.*spam4.me|grr.la))(?=.*(?!.*guerrillamail.de))(?=.*(?!.*grandmasmail.com))(?=.*(?!.*zetmail.com))(?=.*(?!.*vomoto.com))(?=.*(?!.*abyssmail.com))(?=.*(?!.*anappthat.com))(?=.*(?!.*eelmail.com))(?=.*(?!.*yopmail.com))(?=.*(?!.*fakeinbox.com)))$
Right now, it accepts all email addresses.

Try this slightly modified regex using lookbehind:
^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(((?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9]))(?<!mailinator\.com)(?<!trbvm\.com)(?<!guerrillamail\.com)(?<!guerrillamailblock\.com)(?<!sharklasers\.com)(?<!guerrillamail\.net)(?<!guerrillamail\.org)(?<!guerrillamail\.biz)(?<!spam4\.me)(?<!grr\.la)(?<!guerrillamail\.de)(?<!grandmasmail\.com)(?<!zetmail\.com)(?<!vomoto\.com)(?<!abyssmail\.com)(?<!anappthat\.com)(?<!eelmail\.com)(?<!yopmail\.com)(?<!fakeinbox\.com))$
It matches bob#gmail.com but does not match bob#mailinator.com.
Fundamentally, you had a regex to match any email address, followed by positive and negative lookaheads like (?=.*(?!.*mailinator.com)). By the time those lookaheads are executed, you're already at the end of the string (further enforced by the $).
Looking ahead from the end of the string there is… nothing. Any lookahead (positive or negative) into nothingness will either always pass, or always fail, regardless of the input string. E.g. a lookahead of (?=.*) at the end of a string will always pass (.* matches the empty string), whereas one of (?=.) will always fail (. does not match the empty string).
In your case, lookaheads like (?=.*(?!.*mailinator.com)) are satisfied by the nothingness beyond the end of the input string, so they always pass. The regex behaves exactly as if they were not there at all.
The simple fix, without overhauling the regex entirely, is to look behind with the (?<!) construct, instead of ahead. You're at the end of the string, and want to ensure it didn't end with one of the disposable email domains you have listed. To do that for one domain, it would be (?<!mailinator\.com).
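To make the difference concrete, here is a minimal Python sketch. It assumes a much simplified address pattern (ADDRESS is a stand-in invented for brevity, not the full pattern above) and keeps # as the separator, as written in this thread.

```python
import re

# Simplified stand-in for the full address pattern (an assumption, for brevity).
ADDRESS = r"[a-z0-9.]+#(?:[a-z0-9-]+\.)+[a-z0-9]+"

# Lookahead placed after the address: .* can match the empty remainder,
# so this check never rejects anything.
with_lookahead = re.compile(rf"^{ADDRESS}(?=.*(?!.*mailinator\.com))$")

# Lookbehind evaluated at the end of the string: rejects addresses
# whose final characters are "mailinator.com".
with_lookbehind = re.compile(rf"^{ADDRESS}(?<!mailinator\.com)$")

for address in ("bob#gmail.com", "bob#mailinator.com"):
    print(address,
          bool(with_lookahead.match(address)),
          bool(with_lookbehind.match(address)))
```

One thing to keep in mind: the lookbehind only inspects the last characters of the string, so it would also reject a hypothetical domain such as notmailinator.com.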

There are many disposable email domains and they are constantly changing. Writing a regex for them is only going to capture a small number and will require constant maintenance and updating.
You may want to look at using an open source list, e.g. https://github.com/disposable/disposable, and then build a way to keep it updated.
Alternatively, you can use something like Upollo's free tier, which does this for you.
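A list-based check could be sketched roughly as follows. The three domains in DISPOSABLE_DOMAINS are only placeholders taken from the question, and is_disposable is a hypothetical helper; in practice the set would be loaded from whichever maintained list you choose. The separator defaults to # to match the addresses in this thread.

```python
# Placeholder blocklist; a real one would be loaded from a maintained list.
DISPOSABLE_DOMAINS = {"mailinator.com", "yopmail.com", "guerrillamail.com"}


def is_disposable(address: str, separator: str = "#") -> bool:
    """Return True if the address's domain, or a parent domain, is blocklisted."""
    _, _, domain = address.rpartition(separator)
    labels = domain.lower().split(".")
    # Check "a.b.c", then "b.c", so subdomains of a listed domain are caught too.
    return any(
        ".".join(labels[i:]) in DISPOSABLE_DOMAINS
        for i in range(len(labels) - 1)
    )


print(is_disposable("bob#mailinator.com"))   # True
print(is_disposable("bob#gmail.com"))        # False
```

Because this compares whole domain labels rather than string endings, a lookalike such as notmailinator.com is not rejected by accident.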

Related

Using PCRE2 regex with repeating groups to find email addresses

I need to find all email addresses with an arbitrary number of alphanumeric words, separated by periods. To test the regex, I'm using the website https://regex101.com/.
The structure of a valid email address is word1.word2.wordN#word1.word2.wordN.word.
The regex /[a-zA-Z0-9.]+#[a-zA-Z0-9.]+.[a-zA-Z0-9]+/gm finds all email addresses included in the document string, but also includes invalid addresses like ........#....com, if present.
I tried to group the repeating parts by using round brackets and a Kleene star, but that causes the regex engine to collapse.
Invalid regex:
/([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+#([a-zA-Z0-9]+.?)*[a-zA-Z0-9]+.[a-zA-Z0-9]+/gm
Although there are many posts concerning regex groups, I was unable to find an explanation of why the regex engine fails. It seems that the engine gets stuck while trying to find a match.
How can I avoid this problem, and what is the correct solution?
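I think the main issue that caused you trouble is this:
. (outside of []) matches any character; you probably meant to specify \. instead (which only matches a literal dot character).
Also, there is no need to make it optional with ?, because the non-dot part of your regex will match the alphanumeric characters anyway. With the optional, unescaped dot, ([a-zA-Z0-9]+.?)* can divide the same run of characters in a huge number of ways, and the engine tries them all before giving up on a non-matching position; that is why it appears to get stuck.
I also reduced the right part (x*x is the same as x+), added a case-insensitive flag and ended up with this: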
/([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+/gmi
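As a quick check, here is a small Python sketch of the corrected pattern. The sample text is made up for illustration, and # is kept as the separator as written in this thread.

```python
import re

# The corrected pattern from above; re.IGNORECASE plays the role of the i flag.
pattern = re.compile(
    r"([a-z0-9]+\.)*[a-z0-9]+#([a-z0-9]+\.)+[a-z0-9]+",
    re.IGNORECASE,
)

text = "contact john.doe#example.com or ........#....com for details"

# finditer is used instead of findall because the pattern contains groups,
# and findall would return the group contents rather than the whole match.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)
```

The dotted junk address is skipped because every dot now has to be preceded by at least one alphanumeric character.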

Regex for Google Analytics Goals

I've searched all the other Regex on Google Analytics questions but I can't use the answers as this is pretty specific to my problem.
I want to set a goal but use Regex to flag it as a goal IF string includes
/client-thank-you/ AND anything EXCEPT hire
so in other words
/client-thank-you/hire is not correct
/client-thank-you/anything/else is correct
Each of the following regexes will match any string that contains /client-thank-you/ and does not contain hire, depending on what assumption(s) you make about where "hire" is in the string.
Solution
Where can "hire" be located in the string?
Anywhere:
((?!hire).)*?/client-thank-you/((?!hire).)*
Only following the "/client-thank-you/":
.*?/client-thank-you/((?!hire).)*
Only immediately following the "/client-thank-you/":
.*?/client-thank-you/(?!hire).*
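The three variants differ only in edge cases, which a small Python sketch can make visible. The sample paths and the accepted helper are made up for illustration, and fullmatch is used because each regex is meant to match the entire string. This needs an engine with lookahead support, which Python's re module has.

```python
import re

anywhere = re.compile(r"((?!hire).)*?/client-thank-you/((?!hire).)*")
following = re.compile(r".*?/client-thank-you/((?!hire).)*")
immediately = re.compile(r".*?/client-thank-you/(?!hire).*")

paths = [
    "/client-thank-you/hire",
    "/client-thank-you/anything/else",
    "/client-thank-you/a/hire",
    "/hire/client-thank-you/x",
]


def accepted(regex, candidates):
    """Return the candidates that the regex matches in full."""
    return [p for p in candidates if regex.fullmatch(p)]


for name, regex in [("anywhere", anywhere),
                    ("following", following),
                    ("immediately", immediately)]:
    print(name, accepted(regex, paths))
```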
Notes
Optimization:
Each of these regexes will match the entire string. If your tool lets you determine if a string contains a substring match (rather than naively attempting to match the entire string), then you could optimize the second and third regexes by removing the leading .*?. Likewise, the third regex could be further optimized by removing the trailing .* as well.
Positively require "anything":
Note that all of these regexes assume that a string that ends with "/client-thank-you/" (with nothing after it) is valid. If this assumption is incorrect (i.e. the string .*/client-thank-you/$ is not a match), then change the trailing * on every regex to +. This would also mean that you have to keep the last .* on the third regex as a .+ (i.e. don't optimize that away).
EDIT:
The above will not work since GA uses a very limited version of regex (that does not include lookaround). If there is no other GA tool (other than a single regex) that you can use that meets your needs, then you could use the following as a last-ditch effort:
([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$)
And in expanded form for illustration purposes only:
(
  [-._~!$&'()*+,;=:#/0-9A-Za-gi-z] |
  h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z] |
  hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z] |
  hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z] |
  .{1,3}$
)
This regex will match 1-4 characters that do not form "hire". It does so by matching the minimum number of characters necessary to verify that the match is neither "hire" nor can serve as a prefix of "hire". It takes into account end-of-line (e.g. "hir" is valid if there is nothing else after it). The characters that it matches are all valid characters that can occur in the path component of a URL as specified in RFC 3986.
You use this regex by substituting it for every ((?!hire).) in any of the solutions given above. For example:
.*?/client-thank-you/([-._~!$&'()*+,;=:#/0-9A-Za-gi-z]|h[-._~!$&'()*+,;=:#/0-9A-Za-hj-z]|hi[-._~!$&'()*+,;=:#/0-9A-Za-qs-z]|hir[-._~!$&'()*+,;=:#/0-9A-Za-df-z]|.{1,3}$).*
This matches any url that contains "/client-thank-you/" but not "/client-thank-you/hire".
Do be careful, though. Doubled "h"s will make this workaround fail (e.g. "hhire"). However, if "hire" will only ever follow a path delimiter (i.e. /hire/), then that shouldn't be a problem.
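Both the workaround and its doubled-"h" caveat can be tried out with a short Python sketch. BASE and NOT_HIRE are names introduced here only to avoid repeating the character list; the character classes themselves are copied from the regex above. Python is used purely for checking the behavior, since the pattern itself uses no lookaround.

```python
import re

# Characters shared by all four classes, copied from the regex above.
BASE = "-._~!$&'()*+,;=:#/0-9A-Z"

# Lookaround-free stand-in for ((?!hire).)
NOT_HIRE = (
    rf"([{BASE}a-gi-z]|h[{BASE}a-hj-z]|hi[{BASE}a-qs-z]"
    rf"|hir[{BASE}a-df-z]|.{{1,3}}$)"
)

# Used once, as in the example above.
once = re.compile(r".*?/client-thank-you/" + NOT_HIRE + r".*")
# Used repeatedly, as a replacement for ((?!hire).)*
repeated = re.compile(r".*?/client-thank-you/" + NOT_HIRE + r"*")

print(bool(once.fullmatch("/client-thank-you/anything/else")))  # accepted
print(bool(once.fullmatch("/client-thank-you/hire")))           # rejected
print(bool(repeated.fullmatch("/client-thank-you/hhire")))      # accepted: the caveat
```

In the last case, "hh" is consumed as one unit by the h[...] alternative, so the "hire" that starts at the second "h" is never examined.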
If you can't use a lookahead like Travis suggested, then I suggest setting the goal to fire on an event instead of a pageview.
If you're using Google Tag Manager, you'll have the ability to write a more advanced regex, or at least set a blocking rule for the event that prevents it from firing when 'hire' is in the page URL.

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regular expressions, and I am stuck in a situation where I want to apply a regex to a JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, with a minimum length of 1 and a maximum length of 5.
Multiple values must be separated by commas (,).
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
With the help of Stack Overflow, so far I have only been able to achieve this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier; ensures that each value is between 1 and 5 characters long.
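A quick way to try the pattern is a short Python sketch like the one below. The sample inputs come from the question; the last two are added here for illustration.

```python
import re

pattern = re.compile(r"^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$")

samples = [
    "3kd-R",                     # single value
    "kdk30,3.K-4,ER--U,2,.I3",   # comma-separated values
    "kdk30,3.K-4,ER--U,2,.I3,",  # trailing comma: empty last value
    "toolong",                   # 7 characters in one value
]

for sample in samples:
    print(repr(sample), bool(pattern.match(sample)))
```

Note that the multi-value example in the question ends with a comma. This pattern rejects that form, because every comma must be followed by another value of 1 to 5 characters.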
You've done pretty well. You need to add the hyphen and the dot to that first character class. Note: since the hyphen designates ranges within a character class, you need to position it where it cannot be read as specifying a range. That doesn't mean putting it where it would form an invalid-looking range, e.g., 7-., but where positionally it cannot form a range at all, i.e., first or last. So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but a number of widely used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). Not every engine does (java.util.regex, for one, does not), so check yours first. They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1, which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parentheses is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which tells the regex engine that this is a non-capturing group. Without going into great detail, there is overhead in maintaining the match locations; it isn't complex, per se, just potentially highly repetitive. Regex engines match, store the information, and then, when the match fails, "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they store that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.; the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by enclosing our expression in ^ ... $ we are asserting that the subject matches in its entirety, front to back. These assertions have zero length; they consume no data and only assert a relative position. You can operate on them, e.g., s/^/  / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex, if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/  /)
Also, you didn't say whether you need to capture the results to do something with them. My impression was that it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parentheses: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9.-]{1,5})(?:,(?1))*$
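Where the engine has no subroutine support, the same "define the segment once" idea can be approximated by building the pattern from a string. The sketch below uses Python, whose standard re module does not accept (?1); SEGMENT and LIST are names made up for this illustration.

```python
import re

# The segment definition lives in exactly one place.
SEGMENT = r"[a-zA-Z 0-9.-]{1,5}"

# Equivalent in effect to ^(SEGMENT)(?:,(?1))*$ in an engine with subroutines.
LIST = re.compile(rf"^{SEGMENT}(?:,{SEGMENT})*$")

for subject in ("kdk30,3.K-4,ER--U,2,.I3", "abcdef", "a,,b"):
    print(repr(subject), bool(LIST.match(subject)))
```

If the segment rules change, only SEGMENT needs editing, which is the same maintenance benefit the subroutine gives.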
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions.info, and a wonderful regex development and testing tool is regex101.com. And I can never say enough about Stack Overflow!

C# regular express for list ips 65.232.211.[001-175]

I want to match an IP against my IP list, which is stored in an ArrayList, but the entries are in this format:
65.232.211.[001-175]
E.g., 68.232.211.133 must match,
68.232.211.199 must not match.
I want a regular expression for this scenario, but I don't know what it would look like.
I tried, but I'm not getting the correct answer.
Please help.
You could use something like this: ^68\\.232\\.211\\.0*([1-9][0-9]?|1[0-6][0-9]|17[0-5])$. The last part matches the numerical range you are after (courtesy of Regex_For_Range). The ^ and $ anchors matter: without them, 68.232.211.199 would still be accepted, because the pattern can match the leading 68.232.211.19.
Since the period character in regex is a special character (denoting any character), it needs to be escaped. This is done by adding a backslash in front of it, like so: \.. Since you are using C# (it seems), you need to escape the backslash as well, since it is a special character in C# string literals.
You could, alternatively (and even better than the above) use the following regex to split the IP in 2 and do what ever validation you need: ^([\d.]+?)\.(\d+)$. This regex would yield 2 groups, so taking 68.232.211.133 as an example, it would yield 68.232.211 and 133.
The above lets you compare the initial part of the IP as a string, then take the last section of the IP, convert it to a number and perform the range check with ordinary comparison operators.
In my opinion, the second approach should be favoured, since it is easier to maintain.
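The second approach might look roughly like the following. It is sketched in Python rather than C# to keep it short, and in_range is a hypothetical helper; its default prefix and bounds are taken from the example above and would normally come from the stored list entry.

```python
import re

# Splits "a.b.c.d" into "a.b.c" and "d".
SPLIT = re.compile(r"^([\d.]+?)\.(\d+)$")


def in_range(ip: str, prefix: str = "68.232.211",
             low: int = 1, high: int = 175) -> bool:
    """Return True if ip starts with prefix and its last octet is in [low, high]."""
    m = SPLIT.match(ip)
    if m is None:
        return False
    # String comparison for the prefix, numeric comparison for the last octet.
    return m.group(1) == prefix and low <= int(m.group(2)) <= high


print(in_range("68.232.211.133"))  # True
print(in_range("68.232.211.199"))  # False
```

Changing the allowed range then means changing two numbers instead of rewriting a digit-by-digit alternation.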

How to accept a dot within email address with Regex?

I have this regex [a-z0-9]*#metu\.edu which checks user input userside, using HTML5. I want to accept a dot (.) within the username, only a single dot. Such as: herp.derp#metu.edu
Use the following:
^[a-z0-9]+[.]?[a-z0-9]+#metu\.edu$
[a-z0-9][a-z0-9.]*#metu\.edu\.tr
This will require at least one (lower case character or number) and then (lower case character or number or dot), so email addresses cannot begin with or only consist of a single dot. However you should probably reconsider using this regex for email validations, as the acceptable email addresses contain a lot more cases (-, +, _ etc.) (and don't forget size limitations as well)
In general, e-mail addresses shouldn't be matched with regexes. However, in your specific case, it seems that you have a distinct pattern that you want to match against.
[a-z0-9]+(\.[a-z0-9]+)?#metu\.edu
That assumes the single dot is optional. If it's mandatory, use this:
[a-z0-9]+\.[a-z0-9]+#metu\.edu
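A small Python sketch can show how the optional-dot pattern behaves, and how it differs from the \.? variants that also appear in this thread. fullmatch is used to mirror the HTML5 pattern attribute, which must match the whole value; the test usernames are made up for illustration.

```python
import re

# Optional ".word" group: at most one dot, never leading or trailing.
optional_dot = re.compile(r"[a-z0-9]+(\.[a-z0-9]+)?#metu\.edu")

# The \.? variant: same dot rules, but needs at least two characters in the name.
two_runs = re.compile(r"[a-z0-9]+\.?[a-z0-9]+#metu\.edu")

for address in ("herp.derp#metu.edu", "herp#metu.edu", "a#metu.edu",
                "a.b.c#metu.edu", ".herp#metu.edu", "herp.#metu.edu"):
    print(address,
          bool(optional_dot.fullmatch(address)),
          bool(two_runs.fullmatch(address)))
```

The only input the two patterns disagree on here is the single-character username, which the \.? form rejects because both [a-z0-9]+ runs need at least one character.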
Add a dot to the character range: [a-z0-9.]*
To exclude dots at the beginning or end of the name, and allow only a single dot, use multiple character classes:
[a-z0-9]+\.?[a-z0-9]+#metu\.edu
You should not be trying to parse e-mail addresses yourself using regex, as you WILL fail. Please consider this: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
What I usually do is accept any string as e-mail. If it's just for user registration or whatever, there is really no need to validate it. It will fail when you send out an email, then you know it's wrong.