Unclear Complex Regular expression - regex

I'm new to regular expressions and have stumbled upon an expression I do not really understand.
The expression is:
.*[^0-9](?P<ref>[0-9]{3})[^0-9].*
and I think I understand the first part and the last part, but the part within parenthesis eludes me. I would be most grateful for an explanation or some links where I could find help with this.
Thanks.

The part within parentheses is a named capturing group that matches exactly three digits (and lets that group be referenced by the name ref). This feature was added because in very long, complex expressions, it's much clearer to use named groups than the usual numbered groups (which require counting parentheses to see which group is which).
Exactly how the named capture referencing is done depends on the regular expression library and/or language being used. For example, in Python:
>>> import re
>>> match = re.search(r'.*[^0-9](?P<ref>[0-9]{3})[^0-9].*', 'a234b')
>>> match.group('ref')
'234'

Here's a short explanation of your expression:
.*[^0-9](?P<ref>[0-9]{3})[^0-9].*
In words:
.* matches any number of any characters
[^0-9] matches one character that is not a digit
(?P<ref>[0-9]{3}) matches exactly three digits and gives that group the name ref in the result
[^0-9] again matches one character that is not a digit
.* again matches any number of any characters
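The same breakdown can be written directly into the pattern. The following is a minimal sketch using Python's re.VERBOSE flag, which allows whitespace and comments inside a pattern; the sample strings are made up for illustration.

```python
import re

# The expression from the question, one piece per line.
pattern = re.compile(r"""
    .*                   # any number of any characters (greedy)
    [^0-9]               # exactly one character that is not a digit
    (?P<ref>[0-9]{3})    # exactly three digits, captured under the name 'ref'
    [^0-9]               # exactly one character that is not a digit
    .*                   # any number of any characters
""", re.VERBOSE)

m = pattern.search("order a234b shipped")
by_name = m.group("ref")      # access by name
by_number = m.group(1)        # the same group is also group number 1
named_groups = m.groupdict()  # all named groups as a dict

# Four digits in a row do not qualify: the three digits must have
# a non-digit on both sides.
no_match = pattern.search("a1234b")

# Because the leading .* is greedy, the last candidate in the string wins.
last_one = pattern.search("a111b a222c").group("ref")
```

Note that a bare "234" does not match either, because the expression insists on one non-digit character before and after the digits.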

Related

Regular Expression, testing for numbers

I'm just starting to learn regular expressions, and one of the exercises was to match different types of numbers. The ones that I need to match are below:
3.14529
-255.34
128
1.9e10
123,340.00
My regex: -?\d+,?\d+\.?e?\d+
However, my regular expression fails to match the first one and the fourth one. I saw the solution but I did not quite understand why it uses brackets. Can anyone explain? Thank you!
Your regex does not allow digits to follow a literal dot when only one digit precedes it. This is because you have \d+ twice before the dot is matched. In general, you have three mandatory \d+ in there, so you cannot match anything with fewer than three digits.
I would suggest this regex:
^-?\d{1,3}(,\d{3})*(\.\d+)?(e\d+)?$
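To see the difference, here is a small sketch in Python that runs both expressions against the five sample numbers from the question. The variable names are only for illustration, and fullmatch is used so that each pattern has to cover the whole string.

```python
import re

original = re.compile(r"-?\d+,?\d+\.?e?\d+")
suggested = re.compile(r"^-?\d{1,3}(,\d{3})*(\.\d+)?(e\d+)?$")

samples = ["3.14529", "-255.34", "128", "1.9e10", "123,340.00"]

# Collect the samples each pattern accepts as a whole string.
original_ok = [s for s in samples if original.fullmatch(s)]
suggested_ok = [s for s in samples if suggested.fullmatch(s)]

for s in samples:
    print(s, s in original_ok, s in suggested_ok)
```

The parentheses in the suggested regex group each optional part (the thousands blocks, the fraction, the exponent) so that the following quantifier applies to the whole part rather than to a single character.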

Automata - Regular Expression

I've been trying to write a regular expression for the language below:
L = {01, 0011, 000111, 00001111, 0000011111, 000000111111, ...}
but I just could not figure it out. The first thing that came to my mind was
0(0)^* 1(1)^*
but I'm not sure if that is the answer to the language. Is there an app where I could test it out?
If this can't be done through a regular expression, can it be done with an NFA or DFA?
Could some good Samaritan kindly help me with this? Appreciate it.
A subroutine may suit your needs:
(?<!0)(0(?1)?1)(?!1)
(?1) means recall the pattern captured in the first group, i.e. between the parens. This isn't available in all regex engines though - neither is the (negative) lookbehind (?<!...) by the way.
The difference between (?1) and \1 is that (?1) recalls the captured pattern while \1 recalls the captured data.
I'm not sure what you meant when you said that it should be a regex, because the question also mentions automata.
As per automata theory:
The language you describe (an equal number of 0's and 1's, with all the 0's before the 1's) is not a regular language, so no regular expression in the formal sense describes it, and no DFA or NFA recognizes it either. This can be proved using the pumping lemma.
The language can, however, be written as {0^i 1^i | i > 0}, where i is a positive integer.
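For testing strings in practice, a plain regular expression plus a length comparison is enough. The sketch below is in Python, whose built-in re module (as far as I know) does not support the (?1) subroutine syntax; the function name in_language is made up for this example. It also shows why the attempt 0(0)^* 1(1)^* is too permissive: it accepts strings with unequal counts.

```python
import re

def in_language(s):
    """Return True if s consists of i zeros followed by i ones, for some i > 0."""
    m = re.fullmatch(r"(0+)(1+)", s)
    # The regex only checks the shape 0...01...1; the counting is done separately.
    return m is not None and len(m.group(1)) == len(m.group(2))

# The attempt from the question, written in Python syntax.
attempt = re.compile(r"00*11*")

# "001" has the right shape but unequal counts.
attempt_accepts_001 = attempt.fullmatch("001") is not None
language_accepts_001 = in_language("001")
```

The counting step is exactly the part a finite automaton cannot do, since it would need to remember an unbounded number of zeros.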

E-mail address validation using Regular Expressions

I'm writing a simple, small app that allows me to share information. I have a question on using regex to validate an email address.
I'm kind of learning on my own. But when it comes to real-world examples, such as strings that can be validated with regular expressions, I'm kind of stuck.
Exercise:
Untangle the following regular expression that validates an email address:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
It looks like a jumble of characters.
Can someone please explain to me how this works?
I tried to use this online resource by Jan Goyvaerts.
I would appreciate any help.
First of all, there is a good thread about exactly the same thing:
Using a regular expression to validate an email address
Then, below there is the explanation of your regular expression:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
- The square brackets define a character class, which matches any one of the characters listed inside them. The plus sign ('+') is a quantifier, which means that the sequence matched by this character class must be at least one character long.
Also, the '+' is greedy, and therefore this part of the pattern will match the longest possible sequence of such characters.
As for the contents of the square brackets, 'a-z' means any character in the range from a to z inclusive, and '0-9' is similar. All the other characters simply stand for themselves, including the final '-', which is literal because it comes last in the class.
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
- In regular expressions, parentheses represent grouping, and the asterisk ('*') is a greedy quantifier, which means "occurs zero or more times". So the content of the parentheses may or may not be present.
Then, inside the parentheses, we see the ?: character combination which, placed right after the opening parenthesis, tells us that the group should not be captured as a substring for later reference.
Going further, \. means a literal dot (see Escape sequence), since an unescaped dot is a metacharacter in regex.
After the dot we again see the character class explained above.
@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
- Here we see the at symbol ('@'), which just stands for itself. Then there is a non-capturing group, which occurs one or more times (because of the + after it). It contains a single character of the [a-z0-9] class, followed by another non-capturing group and a literal dot. You can read the inner group using my explanations above, except for the question mark ('?'), which means "either once or not at all" in this context (i.e. when it is used as a quantifier).
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- This last part is the same as the content of the group explained above, without the trailing dot, so I believe you now have enough information to understand it.
More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.
A good Regular Expressions reference: Regular Expression Language - Quick Reference
Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing
About special characters: Regex Tutorial - Literal Characters and Special Characters
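One way to make the structure visible is to build the expression from named pieces. This is only a sketch in Python: the names ATOM, LABEL, EMAIL and looks_like_email are made up for illustration, and it assumes the separator between the local part and the domain is the usual @.

```python
import re

# One run of allowed local-part characters (the character class explained above).
ATOM = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+"
# One domain label: starts and ends with a letter or digit, hyphens allowed inside.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

EMAIL = re.compile(
    ATOM + r"(?:\." + ATOM + r")*"   # atoms separated by single dots
    + "@"                            # the literal at sign
    + r"(?:" + LABEL + r"\.)+"       # one or more labels, each followed by a dot
    + LABEL                          # the final label, with no trailing dot
)

def looks_like_email(text):
    """Return True if the whole string matches the expression."""
    return EMAIL.fullmatch(text) is not None

for candidate in ["john.smith@example.com", "john..smith@example.com",
                  "john@example", "john@-example.com"]:
    print(candidate, looks_like_email(candidate))
```

As written, the classes only contain lowercase letters, so an address with uppercase letters is rejected unless the pattern is compiled with re.IGNORECASE.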
Regex statements can be fun yet tricky to follow. There are 5 parts to this statement.
1. One or more valid characters for a username
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
2. A check for a single '.' followed by any additional amount of characters
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
3. The '@' symbol
4. A valid second / lower level domain
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
5. A valid top level domain
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.
I've found a remarkable tool for visualizing regular expressions here: http://regexper.com
It shows me that your regular expression breaks down like this. Hopefully this helps explain it.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
This looks for at least one of the characters given here (a-z, 0-9, and those special characters).
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
This looks for the same as above, but only when it stands after a dot. This part is optional and can be repeated indefinitely. It prevents dots at the end of the name.
@
Matches the @ symbol
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
This matches a label that starts and ends with a-z or 0-9, may contain - in the middle, and is followed by a dot. This has to be matched at least once.
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
This looks for a-z or 0-9, optionally followed by more of a-z, 0-9 and -, but it can't end with a -.
I have two suggestions for you:
1. Escaping special characters is messy.
2. Email addresses are complicated. I recommend studying this post if you are really interested.
Please also check out these other posts: Validation in Regex and Regex Help.
See this answer. The problem is probably too difficult to solve. You have two problems here:
1. RegEx are not easy.
2. Escaping special characters is messy.
Finally, email addresses are complicated. I recommend studying this post if you are really interested.

What is the difference between the regex (.*?) and (.*)?

I've been using regex for a while, but I'm not an expert on the subtleties of what particular rules do. I've always used (.*?) for matching with restriction: as I understood it, it would stop at the first chance it got, whereas (.*)? would continue and be more greedy.
But I have no real reason for thinking that; I just believe it because I read it once upon a time.
Now I'd like to know: is there a difference? And if so, what is it?
(.*?) is a group containing a non-greedy match.
(.*)? is an optional group containing a greedy match.
Others have pointed out the difference between greedy and non-greedy matches. Here is an example of the different results you can see in practice. Since regular expressions are often embedded in a host language, I'm going to use Perl as the host. In Perl, enclosing part of a pattern in parentheses assigns whatever that part matched to special variables. So in this case both patterns match, but what's assigned to those variables differs:
For example, let's say your match string is 'hello'. Both patterns would match it, but the matched portions ($1) differ:
'hello' =~ /(.*?)l/;
# $1 == 'he'
'hello' =~ /(.*)?l/;
# $1 == 'hel'
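For comparison, here is a sketch of the same experiment in Python (the variable names are mine). It also shows the overall match, which differs between the two patterns as well.

```python
import re

text = "hello"

# Lazy group: grows one character at a time and stops before the first 'l'.
lazy = re.search(r"(.*?)l", text)
# Optional greedy group: grabs everything, then backs off until an 'l' follows.
greedy = re.search(r"(.*)?l", text)

print(lazy.group(1), lazy.group(0))      # he hel
print(greedy.group(1), greedy.group(0))  # hel hell
```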
Because * means "zero or more", it all gets slightly confusing. Both ?'s are quite different, which can be more clearly shown with a different example of each:
fo*? will match only f if you supply it foo. That is, this ? makes the match non-greedy. Removing it makes it match foo.
fo? will match f, but also fo. That is, this ? makes the match optional: the part that it applies to (in this case only o) must be present 0 or 1 times. Removing it makes the match required: it must then be present exactly once, so only fo will still match.
And while we're at different meanings of the ? in regexps, there's one more: a ? immediately following a ( is a prefix for several special operations, such as lookaround. That is, its meaning is not like any of the things you ask.
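The three roles of ? described above can be checked quickly. This is a small Python sketch; the variable names are just for illustration, and the lookahead in the last line is one example of the special groups that start with (?.

```python
import re

# '?' after a quantifier makes it lazy: match as little as possible.
lazy_star = re.match(r"fo*?", "foo").group()
greedy_star = re.match(r"fo*", "foo").group()

# '?' after a character makes it optional: zero or one occurrence.
optional_f = re.fullmatch(r"fo?", "f") is not None
optional_fo = re.fullmatch(r"fo?", "fo") is not None
required_f = re.fullmatch(r"fo", "f") is not None  # without '?', the 'o' is required

# '?' right after '(' introduces a special group, here a lookahead:
# 'f' must be followed by 'oo', but the 'oo' is not part of the match.
lookahead = re.match(r"f(?=oo)", "foo").group()

print(lazy_star, greedy_star, optional_f, optional_fo, required_f, lookahead)
```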
The ? has different meanings.
When it follows a character or a group it is a quantifier, matching 0 or 1 occurrence of the preceding construct. See here for details
When it follows a quantifier it modifies the matching behaviour of that quantifier, making it lazy (non-greedy). See here for details

What does ?: do in regex

I have a regex that looks like this
/^(?:\w+\s)*(\w+)$*/
What is the ?:?
It indicates that the subpattern is a non-capture subpattern. That means whatever is matched in (?:\w+\s), even though it's enclosed by () it won't appear in the list of matches, only (\w+) will.
You're still looking for a specific pattern (in this case, a single whitespace character following at least one word character), but you don't care what's actually matched.
It means: group this part, but do not remember what it matched.
By default ( ) tells the regex engine to remember the part of the string that matches the pattern between them. But at times we just want to group a pattern without triggering the regex memory. To do that we use (?: in place of (
Further to the excellent answers provided, its usefulness is also to simplify the code required to extract groups from the matched results. For example, your (\w+) group is known as group 1 without having to be concerned about any groups that appear before it. This may improve the maintainability of your code.
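A short Python sketch of that effect, using a pattern modeled on the one in the question (the sample string is made up): with ?: the interesting word is group 1, and without it the numbering shifts.

```python
import re

text = "hello big world"

# Non-capturing prefix group: only the last word is captured.
non_capturing = re.match(r"^(?:\w+\s)*(\w+)$", text)
# Capturing prefix group: it becomes group 1, and the last word moves to group 2.
capturing = re.match(r"^(\w+\s)*(\w+)$", text)

print(non_capturing.groups())  # ('world',)
print(capturing.groups())      # ('big ', 'world')
```

A repeated capturing group keeps only its last iteration, which is why the second result contains 'big ' and not 'hello '.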
Let's understand this with an example.
Say we are given the string s = "a eeee".
Your regex, simplified to /^(?:\w+\s)(\w+)$/, starts at the beginning of the string and matches 'a' followed by the whitespace character. Without the ?:, that first group would be captured and returned as 'a ' (a with the whitespace character).
If you don't want that in the result, you write (?:\w+\s) instead: the group still has to match 'a ', but it is left out of the captured groups, so only what (\w+) matched ('eeee') is returned.
PS: I am a beginner in regex. This is what I have understood about ?:. Feel free to point out anything I explained wrongly.