Can anyone explain this regular expression to me in detail? - regex

I have a regex here and I need to know whether it will reject 100% of bad email addresses. I do not fully understand regular expressions, so I need to call on the community experts.
The string is as follows:
^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Thank you in advance!

Please, please, don't try to validate email addresses using regular expressions; this is a wheel that does not need re-inventing, and unless you write a horrendously hairy regular expression, you will let through invalid email addresses or reject valid ones.
There are plenty of modules on CPAN like Email::Valid which will take care of it all for you and are tried-and-tested.
Simple example:
use Email::Valid;
print (Email::Valid->address('someone#example.com') ? 'yes' : 'no');
Much simpler, and will just work.
Alternatively, using Mail::RFC822::Address:
if (Mail::RFC822::Address::valid('someone#example.com')) { ...}
For an example of how hairy a regular expression would have to be to successfully handle all RFC822-compliant addresses, take a look at this beauty.
People who try to hand-roll their own email address validation tend to end up with code that lets syntactically-invalid addresses slip through, and perhaps worse, reject perfectly valid addresses.
For example, some people use + in their address, like bob+amazon#example.com - this is known as an "address tag" or "sub-addressing". Quite a few naive attempts at validation would refuse that, and the customer will end up going elsewhere.
Also, in the past some people used to assume the TLD would always be 2 or 3 characters; when e.g. .info was launched, people with addresses at those domains would be told their perfectly-valid email address wasn't acceptable.
Finally, there are some pathological cases such as "Mickey Mouse"#example.com, bob#[1.2.3.4] which are syntactically-valid, but most people's hand-rolled validation would refuse.

^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Piece by piece
^ Start of the string
[_a-zA-Z0-9-]+ One or more characters of "_" (no quotes), a letter (a-z, A-Z), a number (0-9), or "-" (no quotes)
(.[_a-zA-Z0-9-]+)* Zero or more substrings of the form .something, .123, or .a123. Each substring must consist of a . followed by one or more characters from the same class as before. So "." alone is not valid, but ".a", ".1", or ".-" is.
(up until now it will accept for example my.name12 or my.name12.surname34)
# a "#" (like max#something)
[a-zA-Z0-9-]+ One or more characters with the same pattern as before
(.[a-zA-Z0-9-]+)* Zero or more substrings of type ".something"... just as before
(.[a-zA-Z]{2,3}) A "." (dot) and 2 or 3 letters (a-z or A-Z)
$ The end of the string
So we have an email address where you can't have something.#somethingelse.ss (no "dangling" dot before the #) or .something#somethingelse.ss (no leading dot). The domain must start with a letter, digit, or hyphen and can't have a dot just before the top-level domain (.com/.uk/??), so no something#x..com. The top-level domain must have 2 or 3 letters (no numbers).
There is an error, the . (dot) must be escaped, so it should be \. . Depending on the language, the \ must be escaped in a string (so it could be \\.)
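A minimal Python sketch of the escaping point above, using the pattern exactly as posted (with # as the separator). The names UNESCAPED, ESCAPED and accepts, and the sample strings, are illustrative assumptions; Python's re module stands in for whatever regex engine is actually in use.

```python
import re

# Pattern as posted: every "." is unescaped, so it matches ANY character.
UNESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$"
)

# Same pattern with each "." escaped so it only matches a literal dot.
ESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,3})$"
)


def accepts(pattern, address):
    """Return True if the compiled pattern matches the address."""
    return pattern.match(address) is not None


# Print how each variant treats a few sample strings.
for sample in ["my.name12#example.com", "a#a#a#a#aa", "bob#example.info"]:
    print(sample, accepts(UNESCAPED, sample), accepts(ESCAPED, sample))
```

With the unescaped pattern, a#a#a#a#aa is accepted because each . is free to match a #. The escaped version rejects it, but it still rejects four-letter TLDs such as .info and addresses containing +.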

If I see it correctly, the following would be valid according to your regex: a#a#a#a#aa
The dot is the sign for any character!
Additionally, the following valid email address would not be accepted, although it should:
Someone%special#domain.de

Simple answer: it won't.
Besides the fact that a bad email address isn't necessarily a wrongly formatted one (this_email_address_does_not_exist#someprovider.com is correctly formatted but is still bad), the RegEx will accept some bad addresses as well.
For example, the rightmost part ((.[a-zA-Z]{2,3})$) states that the string should end with a dot followed by two or three letters. This will accept non-existent top-level domain names (e.g. .aa) and will block four-letter TLDs (e.g. .info).

This RegEx will accept email addresses beginning with an underscore. That is (mostly) unacceptable.
You haven't placed any minimum limit on the size of the "username" (i.e. the part before the "#" symbol). Thus, single-character usernames will get through. Combined with the previous point, email IDs of the form _#something.com might escape undetected.
The . (dot) operator accepts any character. So, after the "#" part, (invalid) domains of type ##.com etc might be undetected.
Only top-level domains of 2 or 3 characters are accepted; all others are rejected.

[_a-zA-Z0-9-]
This means you only allow these characters (any alphanumeric character, '-' or '_') in the email address, but an address can also be valid with any of these characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
The first part (before #) can be at most 64 characters long ({1,64}) and the second part (after #) at most 255 characters long ({4,255}). (Wrap the first or second group in parentheses before applying the length quantifier.)
If you want to know the standard for email addresses, see the Wikipedia article on email addresses.

No, it will not exclude 100% of bad email addresses. Short of rejecting all addresses, this is impossible for a regex to accomplish because the vast majority of syntactically-valid addresses are for accounts which do not exist, such as shgercnhlch#stackoverflow.com.
The only way to truly verify the legitimacy of an email address is to attempt to send mail to it - and even that will only tell you that mail is accepted at that address, not that it is received by a human (as opposed to being fed to a script or silently discarded) and, even if it is received by a human, you have no guarantee that it's the human who claimed to own it. ("You insist that I have to give you a deliverable email address? Fine. My email address is president#whitehouse.gov.")

perhaps this regular expression will do?
^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})$
taken from
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/

To all the writers above that identify that the . accepts any character, I have found that in writing a response to another RegEx question, this edit-capture widget eats backslashes.
(IT'S A PROBLEM!)
Ok... Let's write it correctly:
^\s*([_a-zA-Z0-9]+(\\.[_a-zA-Z0-9\\-\\%]+)\*)#([a-zA-Z0-9]+(\\.[a-zA-Z0-9\\-]+)\*(\\.[a-zA-Z]{2,4}))\s*$
This also incorporates the % character as an allowed-inside value. The problem with this routine is that while it actually does a pretty good job parsing email addresses, it also is not very efficient, since RegEx is "greedy" and the terminating condition (which is supposed to match things like .com and .edu) will overshoot, then need to backtrack, costing considerable CPU time.
The real answer is to use the routines that are specific to this, as other posters have recommended. But if you don't have the CPAN modules, or the target environment does not, then the RegEx hack is arguably acceptable.

Related

Custom email validation regex pattern not working properly

So I've got /.+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(.{1})\w{2,}/ pattern I want to use for email validation on client-side, which doesn't work as expected.
I know that my pattern is simple and doesn't cover every standard possibility, but it's part of my regex training.
Local part of address should be valid only when it has at least one digit [0-9] or letter [a-zA-Z] and can be mixed with comma or plus sign or underscore (or all at once) and then # sign, then domain part, but no IP address literals, only domain names with at least one letter or digit, followed by one dot and at least two letters or two digits.
In test string form it doesn't validate a#b.com and does validate baz_bar.test+private#e-mail-testing-service..com, which is wrong - it should be vice versa - validate a#b.com and not validate baz_bar.test+private#e-mail-testing-service..com
What specific error have I made, and where?
I can't locate it, sorry.
You need to change your regex
From: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(\.{1})\w{2,}
To: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]?\#[\w+-]+(\.{1})\w{2,}
Notice that I added a ? before the # sign and removed the ? from the first "group" after the # sign. Adding that ? tells the regex that the whole "group" is optional.
See it working here: https://regex101.com/r/iX5zB5/2
You're requiring the local part (before #) to be at least two characters with the .+ followed by the character class [^...]. It's looking for any character followed by another character not in the list of exclusions you specify. That explains why "a#b.com" doesn't match.
The second problem is partly caused by the character class range +-? which includes the . character. I think you wanted [-\w+?]+. (Do you really want question marks?) And then later I think you wanted to look for a literal . character but it really ends up matching the first character that didn't match the previous block.
Between the regex provided and the explanatory text I'm not sure what rules you intend to implement though. And since this is an exercise it's probably better to just give hints anyway.
You will also want to use the ^ and $ anchors to make sure the entire string matches.
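The two problems described above can be checked with a short Python sketch. The pattern is the one from the question, unchanged; the names RANGE_CLASS, LITERAL_CLASS, ORIGINAL and check are assumptions made for illustration, and Python's re module is used in place of the client-side engine.

```python
import re

# The range "+-?" spans code points 0x2B..0x3F, which includes "." (0x2E).
RANGE_CLASS = re.compile(r"[\w+-?]")
# Moving "-" to the front makes it a literal hyphen instead of a range.
LITERAL_CLASS = re.compile(r"[-\w+?]")

# Pattern from the question, unchanged and unanchored.
ORIGINAL = re.compile(
    r".+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(.{1})\w{2,}"
)


def check(address):
    """Return True if ORIGINAL finds a match anywhere in the address."""
    return ORIGINAL.search(address) is not None


print(RANGE_CLASS.fullmatch(".") is not None)    # does the range class match "."?
print(LITERAL_CLASS.fullmatch(".") is not None)  # does the literal class match "."?
print(check("a#b.com"))
print(check("baz_bar.test+private#e-mail-testing-service..com"))
```

The first address fails because .+ followed by the negated class needs at least two characters before the #. The second passes because the +-? range lets the domain class swallow the dots.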

Regular Expression to not allow disposable email addresses

I'm trying to create a regex that does not allow disposable email addresses but allows everything else. So far, here is what I have:
^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(((?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9]))(?=.*(?!.*mailinator.com))(?=.*(?!.*trbvm.com))(?=.*(?!.*guerrillamail.com))(?=.*(?!.*guerrillamailblock.com))(?=.*(?!.*sharklasers.com))(?=.*(?!.*guerrillamail.net))(?=.*(?!.*guerrillamail.org))(?=.*(?!.*guerrillamail.biz))(?=.*(?!.*spam4.me|grr.la))(?=.*(?!.*guerrillamail.de))(?=.*(?!.*grandmasmail.com))(?=.*(?!.*zetmail.com))(?=.*(?!.*vomoto.com))(?=.*(?!.*abyssmail.com))(?=.*(?!.*anappthat.com))(?=.*(?!.*eelmail.com))(?=.*(?!.*yopmail.com))(?=.*(?!.*fakeinbox.com)))$
Right now, it accepts all email addresses.
Try this slightly modified regex using lookbehind:
^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(((?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9]))(?<!mailinator\.com)(?<!trbvm\.com)(?<!guerrillamail\.com)(?<!guerrillamailblock\.com)(?<!sharklasers\.com)(?<!guerrillamail\.net)(?<!guerrillamail\.org)(?<!guerrillamail\.biz)(?<!spam4\.me)(?<!grr\.la)(?<!guerrillamail\.de)(?<!grandmasmail\.com)(?<!zetmail\.com)(?<!vomoto\.com)(?<!abyssmail\.com)(?<!anappthat\.com)(?<!eelmail\.com)(?<!yopmail\.com)(?<!fakeinbox\.com))$
It matches bob#gmail.com but does not match bob#mailinator.com.
Fundamentally, you had a regex to match any email address, followed by positive and negative lookaheads like (?=.*(?!.*mailinator.com)). By the time those lookaheads are executed, you're already at the end of the string (further enforced by the $).
Looking ahead from the end of the string there is… nothing. Any lookahead (positive or negative) into nothingness will either always pass, or always fail, regardless of the input string. E.g. a lookahead of (?=.*) at the end of a string will always pass (.* matches the empty string), whereas one of (?=.) will always fail (. does not match the empty string).
In your case, the lookaheads like (?=.*(?!.*mailinator.com)) are okay with the nothingness beyond the end of the input string, so always pass. It's identical to if you didn't have them in the regex at all.
The simple fix, without overhauling the regex entirely, is to look behind with the (?<!) construct, instead of ahead. You're at the end of the string, and want to ensure it didn't end with one of the disposable email domains you have listed. To do that for one domain, it would be (?<!mailinator\.com).
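To make the difference concrete, here is a small Python sketch. It swaps in a deliberately simplified address pattern (\S+#\S+) for the long one, so it only illustrates the lookahead versus lookbehind behaviour at the end of the string; LOOKAHEAD and LOOKBEHIND are names chosen for this example.

```python
import re

# Lookahead placed at the end of the string: nothing is left to inspect,
# so the negative lookahead nested inside it can never fail.
LOOKAHEAD = re.compile(r"^\S+#\S+(?=.*(?!.*mailinator\.com))$")

# Lookbehind at the end of the string inspects the text just consumed.
LOOKBEHIND = re.compile(r"^\S+#\S+(?<!mailinator\.com)$")

for address in ["bob#gmail.com", "bob#mailinator.com"]:
    print(
        address,
        LOOKAHEAD.search(address) is not None,
        LOOKBEHIND.search(address) is not None,
    )
```

One caveat of the plain suffix lookbehind: it only inspects the trailing characters, so it would also reject an unrelated domain such as notmailinator.com.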
There are many disposable email domains and they are constantly changing. Writing a regex for them is only going to capture a small number and will require constant maintenance and updating.
You may want to look at using some open source lists eg. https://github.com/disposable/disposable and then build a way to update them.
Alternatively you can use something like Upollo's free tier which does this for you.
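The list-based approach suggested above could look roughly like the following Python sketch. DISPOSABLE_DOMAINS is a hypothetical three-entry sample, not the contents of any real list, and is_disposable and its separator parameter are names invented for illustration. The default separator is # because that is how addresses are written on this page.

```python
# Hypothetical, tiny sample; a real deployment would load a maintained list.
DISPOSABLE_DOMAINS = {"mailinator.com", "guerrillamail.com", "yopmail.com"}


def is_disposable(address, separator="#", blocklist=DISPOSABLE_DOMAINS):
    """Return True if the address's domain, or any parent domain, is blocklisted."""
    # Everything after the last separator is treated as the domain.
    _, _, domain = address.rpartition(separator)
    labels = domain.lower().split(".")
    # Check "a.b.c", then "b.c", then "c" against the blocklist.
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


print(is_disposable("bob#mailinator.com"))
print(is_disposable("bob#gmail.com"))
```

Comparing whole domain labels against a set avoids the suffix problem of a regex lookbehind, and updating the list does not require touching any pattern.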

C# regular expression for a list of IPs 65.232.211.[001-175]

I want to match an IP against my IP list, which is stored in an ArrayList, but the list is in this format:
65.232.211.[001-175]
e.g. 68.232.211.133 must match
68.232.211.199 must not match
I want a regular expression for this scenario, but I don't know what it would look like.
I tried, but I am not getting the correct answer.
Please help me.
You could use something like so: 68\\.232\\.211\\.0*([1-9][0-9]?|1[0-6][0-9]|17[0-5]). The last part should match the numerical range you are after (courtesy of Regex_For_Range).
Since the period character in regex is a special character (denoting any character), it needs to be escaped. This is done by adding a backslash, like so: \.. Since you are using C# (it seems), you need to escape the backslash as well, since it is a special character in C# string literals.
You could, alternatively (and even better than the above) use the following regex to split the IP in 2 and do what ever validation you need: ^([\d.]+?)\.(\d+)$. This regex would yield 2 groups, so taking 68.232.211.133 as an example, it would yield 68.232.211 and 133.
The above will allow you to match the initial part of the IP as a string, and then take the last section of the IP, convert it to a numerical value and perform range checks using mathematical operators.
In my opinion, the second approach should be favoured since it is (in my opinion) easier to maintain.
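As a rough illustration of the second approach, here is a Python sketch (Python is used only for illustration; the same idea carries over to C#). The function name ip_in_range and its parameters are assumptions, and the default prefix is taken from the example addresses in the question.

```python
import re

# Splits "a.b.c.d" into the leading part "a.b.c" and the last section "d".
SPLIT_IP = re.compile(r"^([\d.]+?)\.(\d+)$")


def ip_in_range(ip, prefix="68.232.211", low=1, high=175):
    """Return True if ip starts with prefix and its last section is in [low, high]."""
    match = SPLIT_IP.match(ip)
    if match is None:
        return False
    head = match.group(1)
    last = int(match.group(2))  # "001" and "1" both become 1
    return head == prefix and low <= last <= high


print(ip_in_range("68.232.211.133"))
print(ip_in_range("68.232.211.199"))
```

Keeping the range bounds as plain integers means a change to the range needs no new regex.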

Looking to build some regex to validate domain names (RFC 952/ RFC 1123)

One of our clients validates email addresses in their own software prior to firing them via an API call to our system. The issue, however, is that their validation rules do not match those of our system, so they are parsing and accepting addresses which break our rules. This is causing lots of failed calls.
They are parsing stuff like "dave#-whatever.com", this goes against RFC 952/RFC 1123 rules as it begins with a hyphen. They have asked that we provide them with our regex list so they can update validation on their platform to match ours.
So, I need to find or build a regex that accepts only RFC 952/RFC 1123 compliant domain names. I found this in another SO thread (I'm a lurker :)); would it be suitable and prevent these illegal domains from being sent?
"^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$";
A domain part has a max length of 255 characters and can only consist of digits, ASCII letters and hyphens; a hyphen cannot come first.
Checking the validity of one domain component can be done using this regex, case insensitive, length notwithstanding:
[a-z0-9]+(-[a-z0-9]+)*
This is the normal* (special normal*)* pattern again, with normal being [a-z0-9] and special being -.
Then you take all this in another normal* (special normal*)* pattern as the normal part, and the special being ., and anchor it at the beginning and end:
^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$
If you cannot afford case insensitive matching, add A-Z to the character class.
But please note that it won't check for the max length of 255. It may be done using a positive lookahead, but the regex will become very complicated, and it is shorter to be using a string length function ;)
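A minimal Python sketch of that combination, with the regex from this answer doing the shape check and a plain length test alongside it. The name is_valid_domain and the max_length parameter are assumptions for illustration; re.ASCII is added so that case-insensitive matching stays within ASCII letters.

```python
import re

# Regex from the answer; IGNORECASE replaces adding A-Z to each class.
DOMAIN = re.compile(
    r"^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$",
    re.IGNORECASE | re.ASCII,
)


def is_valid_domain(domain, max_length=255):
    """Return True if domain is within max_length and matches DOMAIN."""
    return len(domain) <= max_length and DOMAIN.match(domain) is not None


print(is_valid_domain("whatever.com"))
print(is_valid_domain("-whatever.com"))
```

Note that this pattern requires at least one dot, so a single-label name such as localhost is rejected.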

REGEX password validation without special characters

I am using this regex to validate my password.
My password -
should be alphanumeric ONLY,
contains at least 8 characters,
at least 2 numbers
and at least 2 letters.
My regex is
^.*(?=.{8,})(?=.*\d*\d)(?=.*[a-zA-Z]*[a-zA-Z])(?!.*\W).*$
but unfortunately it still matches if I try to put special characters at the beginning.
For example #password12, !password12.
Because your pattern begins and ends with .*, it will match anything at the beginning or end of the string, including special characters.
You shouldn't be solving this problem with a single regular expression, it makes the code hard to read and hard to modify. Write one function for each rule using whatever makes sense for that rule, then your validation script becomes crystal clear:
if is_alpha_only(password) &&
len(password) >= 8 &&
has_2_or_more_numbers(password) &&
has_2_or_more_alpha(password) ...
Seriously, what's the point of cramming all of that into a single regular expression?
And why disallow special characters? There's simply no reason for that.
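The one-function-per-rule idea might be sketched in Python as follows. The helper names are taken from the pseudocode above; their bodies, the ASCII-only interpretation of "alphanumeric", and the wrapper is_valid_password are assumptions made for this example.

```python
import re


def is_alpha_only(password):
    """True if the password consists solely of ASCII letters and digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", password) is not None


def has_2_or_more_numbers(password):
    """True if the password contains at least two ASCII digits."""
    return len(re.findall(r"[0-9]", password)) >= 2


def has_2_or_more_alpha(password):
    """True if the password contains at least two ASCII letters."""
    return len(re.findall(r"[A-Za-z]", password)) >= 2


def is_valid_password(password):
    """Combine the individual rules; each one can be changed on its own."""
    return (
        is_alpha_only(password)
        and len(password) >= 8
        and has_2_or_more_numbers(password)
        and has_2_or_more_alpha(password)
    )


print(is_valid_password("password12"))
print(is_valid_password("#password12"))
```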
You can use the following regex in case insensitive mode:
^(?=[a-z]*[0-9][a-z]*[0-9])^(?=[0-9]*[a-z][0-9]*[a-z])[a-z0-9]{8,}$
I had a similar situation in which the client needed 4 alpha, 1 number, and between 8 and 20 characters. I've adapted my solution to your problem:
^(?=(?:[a-zA-Z0-9]*[a-zA-Z]){2})(?=(?:[a-zA-Z0-9]*\d){2})[a-zA-Z0-9]{8,}$
I understand the other answers dissuading you from this route, but sometimes the client wants what the client wants, regardless of your arguments to the contrary.
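For anyone who wants to compare the patterns side by side, here is a small Python sketch that runs the question's regex and the two suggested replacements against the same inputs. The names ORIGINAL, FIXED_A and FIXED_B and the sample passwords are assumptions for illustration.

```python
import re

# Regex from the question: the leading ".*" can swallow special characters.
ORIGINAL = re.compile(
    r"^.*(?=.{8,})(?=.*\d*\d)(?=.*[a-zA-Z]*[a-zA-Z])(?!.*\W).*$"
)

# First suggested replacement, used in case-insensitive mode.
FIXED_A = re.compile(
    r"^(?=[a-z]*[0-9][a-z]*[0-9])^(?=[0-9]*[a-z][0-9]*[a-z])[a-z0-9]{8,}$",
    re.IGNORECASE,
)

# Second suggested replacement.
FIXED_B = re.compile(
    r"^(?=(?:[a-zA-Z0-9]*[a-zA-Z]){2})(?=(?:[a-zA-Z0-9]*\d){2})[a-zA-Z0-9]{8,}$"
)

for candidate in ["password12", "#password12", "!password12", "password1"]:
    print(
        candidate,
        ORIGINAL.match(candidate) is not None,
        FIXED_A.match(candidate) is not None,
        FIXED_B.match(candidate) is not None,
    )
```

Both replacements consume the whole string with an explicit alphanumeric class, which is what keeps special characters out.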