E-mail address validation using Regular Expressions - regex

I'm writing a simple, small app that allows me to share information. I have a question on using regx to validate email address.
I'm kind learning on my own. But when it comes to real-world examples, such that strings that can be validated with regular expressions, I'm kind stuck.
Exercise:
Untangle the following regular expression that validates an email address:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
It looks like a jumble of characters.
Can someone please explain to me how does this work?
I try to use this online resources by by Jan Goyvaerts.
Any help I will appreciate it.

First of all, there is a good thread about totally the same thing:
Using a regular expression to validate an email address
Then, below there is the explanation of your regular expression:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
- The square brackets represent the symbol class, containing all the symbols which are in the square brackets. The plus sign ('+') is a quantifier, which means that the sequence of symbols, represented by this symbol class must be at least one character long.
Also, the '+' is greedy, and, therefore, this part of the pattern will match the symbol sequence of the maximal possible length.
Talking about the square brackets contents, 'a-z' means any symbol in a range, which could be described mathematically as [a, z], and '0-9' is similar. All the other symbols are just symbols in this case.
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
- In Regular Expressions, the brackets represent grouping, and the asterisk ('*') is a greedy quantifier, which means "occurs zero or more times". So here we are not sure if we are going to find the brackets content, but we do not rule out the possibility.
Then, inside the brackets, we see the ?: character combination, which, being put inside brackets tells us that the symbol group inside should not be captured as a sub-string for the further reference.
Going further, \. means just a usual dot (see Escape sequence), since a dot symbol is a meta-symbol in Regex.
After the dot we see again the character of symbols, explained above.
#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
- Here we see the at symbol ('#'), which is just a symbol here, then there is a non-capturing symbol group, which will occur one or more times (because of + after it), and which includes a single symbol of [a-z0-9] class and another non-capturing group of symbols, which contents you can totally describe using my explanations above except for a question mark sign ('?'), which means "either once or not at all" in this context (i.e. if it is used as a quantifier).
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- This last part is similar to what is found in a symbol group, explained above, so I believe you have now enough information to understand it.
More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.
A good Regular Expressions reference: Regular Expression Language - Quick Reference
Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing
About special characters: Regex Tutorial - Literal Characters and Special Characters

Regex statements can be a fun yet tricky to follow. There are 5 parts to this statement.
One valid characters for a username
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
check for a single '.' and any additional amount of characters
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
The '#' symbol
Valid second / lower level domain
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
A valid top level domain
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.

I've found a remarkable tool for visualizing regular expressions here: http://regexper.com
It shows me that your regular expression breaks down like this. Hopefully this helps explain it.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+
This looks for at least one of of the characters given here (a-z, 0-9, and those special characters).
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)
This looks for the same as above, but only when it stands after a dot. This part is optional and can be repeated indefinitely. It prevents dots at the end of the name.
#
Matches the # symbol
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
This matches a-z, 0-9 ending with a dot and optional - in the middle ending with a dot. This has to be matched at least once.
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
This looks for a-z or 0-9, optionally followed by a-z, 0-9, -, but it cant end with a - again.

Two Suggestions I have for you.
Escaping special characters is messy. 2. Email addresses are complicated. I probably recommend you to study this post if you are really interested. Please check out this other posts: Validation in Regex and Regex Help.

See this answer. The problem is probably too difficult to solve. Two problems you have here. 1. RegEx are not easy. 2. Escaping special characters is messy. Finally, Email addresses are complicated. I probably recommend you to study this post if you are really interested.

Related

What does a-z-A-Z mean in a regular expression?

I've been working with someone else's code and I ran across the regular expression [^0-9a-z-A-Z]. This bears close resemblance to the common [^0-9a-zA-Z] which is meant to exclude non-alphanumeric characters, but note the extra dash in the middle, between the lowercase z and uppercase A.
I'm not very familiar with regular expressions, but I've read several pages on them now, and none of the rules I've seen seem to cover what this syntax would mean. Perhaps it's not even valid syntax, but the Golang regex interpreter doesn't seem to mind. I'd appreciate any clarification. Thanks.
A dash in a character class in a place where it cannot be interpreted as a range is interpreted as a literal dash. So the expression excludes the characters 0 to 9, a to z, A to Z, and -. That's why there's no syntax error.
It's probably a typo though. If the dash is meant to be there, then to prevent confusion it should be escaped and/or moved out from between the ranges, such as [^0-9a-zA-Z\-]
It excludes the minus sign.
You can test regex handily here: http://www.regexr.com/

htaccess regular expression explaination

I have been tasked with changing an .htaccess file. Unfortunately, I know very little about regular expressions, and so most of the file is unreadable for me. In particular, I have these two REs...
1: ^(?!((www|web3|web4|web5|web6|cm|test)\.mydomain\.com)|(?:(?:\d+\.){3}(?:\d+))$).*$
2: ^/([^/][^/])/([^/][^/])/([^/]+)/Job-Posting/$ /Misc/jobposting\.asp\?country=$1&state=$2&city=$3
For the first one, I understand the first half, more or less. it's trying to match against something that ISN'T www.mydomain.com, or web3.mydomain.com, etc., and that it may match that zero or one times. What I'm not clear on is what the second half of that does. My research suggests that ?: implies some sort of flag, but I didn't see any example that explained what exactly that meant. Please explain what this part means, as well as provide an example that would match it.
For the second one, the comments say this is applicable for a url containing /US/NY/Rochester/Job-Posting/. From this I can infer that ^/ means one character, but again, I couldnt find that in my research so far. What is the formal definition of ^/ ? What is the significance of putting it into square brackets [^/] ?
If I can get a handle on these two RE I should be able to adapt them to my needs. Your help is much appreciated.
?: doesn't match anything in particular, it modifies the behavior of the parenthesis. The ?: means the parenthesis are non-capturing, and thus cannot be referenced in the rule. Non capturing parens are good to use when you don't need to reference the captured text because the system doesn't have to 'remember' the text, which saves resources.
the code in question:
(?:(?:\d+\.){3}(?:\d+))
matches one or more digits followed by a period times three, then one or more digit. This will match IP addresses (ex 127.0.0.1). This will also match 123456.1.1.3456789, so you might want to restrict the number of characters allowed (?:(?:\d{1,3}.){3}(?:\d{1,3})), thought I haven't tested this so take it with a grain of salt.
Info on non capturing groupings.
The second item revolves around using square brackets as a character set. Square brackets match anything noted inside them, with ^ negating the match. So [ad02] will match any of the four characters a,d,0 or 2, while [^ad02] will match any character that is not a,d,0, or 2. So, ^/ means any character that is not /.
One of the tricky things about square brackets is the number of items they will match. [^/] will match one character, but so does [ad02]. It doesn't matter how many characters you have in the set, it still obeys the modifiers on the brackets. So [^/]{3} will match any series of 3 characters that does not contain a forward slash, while [^/]{2} will match a 2 character string with the same restriction.
For more info on character sets see Character Classes or Character Sets

Optimal UUID RegEx with optional delimeters

I have already found a question for UUID regular expressions here, but these expressions do not account missing delimeters.
I have come up with the following expression, but is there a more optimal RegEx?
/\b([0-9a-f]{8}-?([0-9a-f]{4}-?){3}[0-9a-f]{12})\b/i
I am assuming by optimal that you mean a shorter expression. I simplified your regex down to the following:
/[\da-f]{8}-?([\da-f]{4}-?){3}[\da-f]{12}/i, which you can see in action here.
I removed the outer parentheses and \b because everything was correctly matched without them.
I was also able to shave three characters off by replacing [0-9a-f] with [\da-f].
I originally had [0-F], but after examining the ASCII sequence, I realized that matched
0123456789:;<=>?#ABCDEF, which includes some extra symbols that we do not want to match.
In conclusion, my expression is synonymous with yours but contains nine fewer characters.

What does (.+_)* mean when using Henry Spencer regular expression library?

With reference to Henry spencer regex library I want to know the difference between (.+_)* and (.)*.
(.+_)* tries to match the string from back as well. From my understanding . matches any single character, .+ will mean non zero occurrences of that character. _ will mean space or { or } or , etc.
Parentheses imply that any one can be considered for a match and the final * signifies 0 or more occurrences.
I feel (.)* would also achieve the same thing. The + after . might be redundant.
Can someone explain me the subtle difference between the two?
For example, aa aa will be matched by (.+_)* but not by (._)* because the latter expects only one character before the space.
I don't recall that underscore has any special meaning. The special thing about Henry Spencer regex library is that it combines both regex engine techniques - deterministic and non-determinstic.
This has a pro and a con.
The pro is that you regexps will be the fastest possible, simply built, while in other engines you might to use look a head and advanced regexp techniques (like making it fail early if there is no match) to achieve the same speed.
The con is that the entire regexp will be either greedy or non greedy. That is, if you used the * or + withouth a following a ?, then the entire regexp will be greedy, even though you use ? after that. If the first time you use a * or + you follow it by a ?, then the entire regexp will be non greedy.
This makes it a slightly trickier to craft the regexp, but really slightly.
The Henry Speced library is the engine behind tcl's regexp command, which makes this language very efficient for regexps.
As I know the _ doesn't have a special meaning, it is just a "_". See regular-expressions.info
Your two regexes are not the same.
(._)* will match one character followed by an underscore (if the underscore has a special meaning in your implementation replace "underscore" by that meaning), this sequence will be matched 0 or more times, e.g. "a_%_._?_"
(.+_)* will match at least one character followed by an underscore, this sequence will be matched 0 or more times, e.g. "abc45_%_.;,:_?#'+*~_"
(.+_)* will match everything that can be matched by (._)* but not the other way round.

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parenthesis. To capture a part of a regular expression, put it in parenthesis. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string" you might be better off using a Xml parser?