Difference between * and + regex - regex

Can anybody tell me the difference between the * and + operators in the example below:
[<>]+ [<>]*

Both are quantifiers. The star quantifier (*) means that the preceding expression can match zero or more times; it is the same as {0,}. The plus quantifier (+) means that the preceding expression MUST match one or more times; it is the same as {1,}.
So to recap:
a* ---> a{0,} ---> Matches a, aa, aaaaa, or an empty string
a+ ---> a{1,} ---> Matches a, aa, or aaaa, but not an empty string
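The recap can be checked with a minimal sketch. Python's re module is an assumption here (the question does not name a language), and the helper name accepts is made up for illustration. re.fullmatch is used so that the whole string has to match.

```python
import re

STAR = re.compile(r"a*")  # same as a{0,}
PLUS = re.compile(r"a+")  # same as a{1,}

def accepts(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

# Print, for each sample, whether a* and a+ accept it.
for sample in ["", "a", "aaaa"]:
    print(repr(sample), accepts(STAR, sample), accepts(PLUS, sample))
```

Only the empty string separates the two patterns: a* accepts it and a+ does not.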

* means zero-or-more, and + means one-or-more. So the difference is that the empty string would match the second expression but not the first.

+ means one or more of the previous atom. ({1,})
* means zero or more. This can match nothing, in addition to the characters specified in your square-bracket expression. ({0,})
Note that + is available in Extended and Perl-Compatible Regular Expressions, and is not available in Basic RE. * is available in all three RE dialects. Which dialect you're using most likely depends on the language you're in.
Pretty much, the only things in modern operating systems that still default to BRE are grep and sed (both of which have ERE capability as an option) and non-vim vi.

* means zero or more of the previous expression.
In other words, the expression is optional and may also repeat.
You might define an integer like this:
-*[0-9]+
In other words, zero or more negative signs followed by one or more digits. To allow at most one sign, use -?[0-9]+ instead.
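A small sketch of the integer pattern, assuming Python's re module (the answer does not name an engine; is_integer is a hypothetical helper). It also shows that -* accepts any number of minus signs, whereas -? accepts at most one.

```python
import re

ZERO_OR_MORE_SIGNS = re.compile(r"-*[0-9]+")  # pattern from the answer
AT_MOST_ONE_SIGN = re.compile(r"-?[0-9]+")    # stricter variant

def is_integer(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

print(is_integer(ZERO_OR_MORE_SIGNS, "--42"))  # True
print(is_integer(AT_MOST_ONE_SIGN, "--42"))    # False
```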

They are quantifiers.
+ means 1 or many (at least one occurrence for the match to succeed)
* means 0 or many (the match succeeds regardless of the presence of the search string)

[<>]+ is the same as [<>][<>]*

Here is an example to extend the answers above. Suppose we have this text:
100test10
test10
test
If we write \d+test\d+, the expression matches only 100test10, because it requires at least one digit on each side of test. \d*test\d* matches all three lines.
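The three sample lines can be run through both patterns. This is a sketch assuming Python's re module; matching_lines is a hypothetical helper that searches each line separately.

```python
import re

LINES = ["100test10", "test10", "test"]

def matching_lines(pattern, lines):
    """Return the lines in which pattern is found anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

plus_hits = matching_lines(r"\d+test\d+", LINES)
star_hits = matching_lines(r"\d*test\d*", LINES)

print(plus_hits)  # ['100test10']
print(star_hits)  # ['100test10', 'test10', 'test']
```

With +, a line needs at least one digit before and after test, so only the first line qualifies.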

Related

Why the * regular expression indicates what can or cannot be it's previous character

Take this for an example which I found in some blog,
"How about searching for apple word which was spelled wrong in a given file where apple is misspelled as ale, aple, appple, apppple, apppppple etc. To find all patterns
grep 'ap*le' filename
Readers should observe that the above pattern will match even ale word as * indicates 0 or more of previous character occurrence."
Now it's saying that "ale" will be accepted when we have ap*le. Aren't the "ap" and "le" fixed?
The * is a quantifier meaning 0 or more times for the previous pattern -- in this case a single literal p. You can also write * as an explicit interval quantifier:
ap{0,}le
The interesting question is sometimes 'what is the previous pattern?' It is often helpful to put a pattern in a group to make clear what the 'previous pattern' is.
Consider wanting to find any of:
ale, aple, appple, apppple, apppppple, able, abbbbbbble
Your first try might be:
/ap|b*le/
  ^       literal 'p' is the first alternative  # WRONG: the regex will use 'ap'
   ^      or
    ^     literal 'b'
What you want in this case is:
/a(?:p|b)*le/
If you do not want to match ale and only match aple, appple, apppple, apppppple, use the + instead of the * which means one or more:
/ap+le/
This is equivalent to /ap{1,}le/
And if you want to only match aple, appple and leave out the variants with more than 3 'p's use the additional max quantifier:
/ap{1,3}le/
All the variants above will match the correctly spelled apple. If you want only aple and appple, and do not want to match apple, use alternation:
/a(?:p|p{3})le/
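The patterns in this answer can be compared side by side. The sketch below assumes Python's re module, whose syntax for these particular patterns matches the answer's; the word list and the helper name accepted are made up for illustration. re.fullmatch is used so that each word must match in full.

```python
import re

WORDS = ["ale", "aple", "apple", "appple", "apppple", "able", "abbble"]

def accepted(pattern, words=WORDS):
    """Return the words that the pattern matches in full."""
    return [word for word in words if re.fullmatch(pattern, word)]

print(accepted(r"ap*le"))         # ['ale', 'aple', 'apple', 'appple', 'apppple']
print(accepted(r"ap+le"))         # ['aple', 'apple', 'appple', 'apppple']
print(accepted(r"ap{1,3}le"))     # ['aple', 'apple', 'appple']
print(accepted(r"a(?:p|p{3})le")) # ['aple', 'appple']
print(accepted(r"a(?:p|b)*le"))   # every word in WORDS
print(accepted(r"ap|b*le"))       # [] because it means 'ap' OR 'b*le'
```

The last line shows the precedence problem: without a group, the alternation splits the whole pattern, so it accepts ap or bble but none of the intended words.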
No, it's not.
"*" in your case means zero or more occurrences of p, while a and le are fixed. If you need ap and le to be fixed, then this is what you need:
ap+le
"+" means at least once, with no limit on the number of occurrences.
This now requires one or more p after a and before le, so it won't match ale.

Shell script variable assignment with two values (regular expression)

I'm trying to set a variable with two values. Here is an example:
letter='[[:alpha:]]'
digit='[[:digit:]]'
integer='$digit'
float='$digit.$digit'
The integer pattern must match a digit one or more times. The float pattern should match the first field (before the dot) zero or more times. How can I do this?
Thanks for help!
-- UPDATE --
It's very good to have the support of all of you. Below is the solution that worked for me:
letter='[[:alpha:]]'
digit='[[:digit:]]'
integer="${digit}+"
float="[0-9]*\\.[0-9]+"
Thank you guys! :D
The expr command (which I assume you are using) matches with basic regular expressions, so grouping parentheses are written \( \) and a literal period must be escaped. Depending on your expr implementation, you may need something like [a-zA-Z] instead of [[:alpha:]] and similar substitutions. Assuming you have chosen the right values for letter and digit, this should work:
expr match "$string" "\(${digit}*\.${digit}*\)"
or, using your float variable:
float="\(${digit}*\.${digit}*\)"
expr match "$string" "$float"
Remove the escaped parens if you just want the match length as the return value rather than the actual text matched.
Any of the following would be equivalent regexes for the integer:
integer="(${digit}+)"
integer="(${digit}{1,})"
integer="(${digit}${digit}*)"
Be aware that there are different "flavors" of regex, and that a character which must be escaped in one context may not need escaping in another.
For egrep and grep -E on the bash command line:
float: [0-9]*\\.[0-9]+
integer: [0-9]+
See the chart of egrep regexes at http://www.cyberciti.biz/faq/grep-regular-expressions/ for some hints, but test them for your specific situation.
For Perl and Java:
float: [0-9]*?\.[0-9]+?
integer: [0-9]+?
+ matches preceding char or char class >= 1 times
* matches preceding char or char class >= 0 times
. matches any char
\. matches an uninterpreted period
[0-9] matches the class of any digit
? after a quantifier forces reluctant (non-greedy) matching
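The idea of building the integer and float patterns from smaller pieces can be sketched outside the shell. The example below assumes Python's re module rather than bash, and the names DIGIT, INTEGER, FLOAT and is_match are made up for illustration. It uses the greedy forms, since greedy and reluctant quantifiers accept the same strings when the whole input has to match.

```python
import re

DIGIT = r"[0-9]"
INTEGER = DIGIT + "+"                      # one or more digits
FLOAT = DIGIT + "*" + r"\." + DIGIT + "+"  # optional digits, literal dot, digits

def is_match(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_match(INTEGER, "42"))   # True
print(is_match(FLOAT, ".5"))     # True
print(is_match(FLOAT, "3x14"))   # False: the dot is escaped, so x is rejected
```

Escaping the dot matters: an unescaped . would accept any character between the two digit groups.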

How to restrict the expression in QLineEdit

I need a QLineEdit which must represent a range.
For example, (1,2]. For this representation I want to set a validator so that the user cannot type other symbols.
In this case I have char + int + char + int + char as shown in example below.
Does Qt have any feature to handle this?
Thanks in advance.
You can use Qt's Input Validator feature to achieve this goal.
The following snippet will restrict the input on a line edit as you specified.
QRegExp re("^[[,(]{1,1}(0|[1-9]{1,1}[0-9]{0,9})[,]{1,1}(0|[1-9]{1,1}[0-9]{0,9})[],)]{1,1}$");
QRegExpValidator *validator = new QRegExpValidator(re, this);
ui->lineEdit->setValidator(validator);
Edit
Updated the regex
QRegExp expr("^[[,(]{1,1}(0|[1-9]{1,1}[0-9]{0,9})[,]{1,1}(0|[1-9]{1,1}[0-9]{0,9})[],)]{1,1}$");
This is what I wanted! I must allow more than one leading zero.
It is not possible to write a regexp that accepts only valid ranges. The reason is that a regexp can check the syntax but not the numeric values (unless the regexp engine has some extensions). The difference between
[1234,5678)
and
[5678,1234)
is not in the syntax (what regexps are about), but in the semantics (where regexps are not that powerful).
For checking just the syntax a regexp could be
\[\d+,\d+\)
or, if you also allow other types of interval boundary conditions:
[\[(]\d+,\d+[\])]
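The split between syntax and semantics can be sketched as two separate checks. This example assumes Python rather than Qt; parse_range is a hypothetical helper, and rejecting a range whose lower bound exceeds its upper bound is an assumed rule, not something the question specifies.

```python
import re

# Syntax only: opening bracket, digits, comma, digits, closing bracket.
RANGE_SYNTAX = re.compile(r"([\[(])(\d+),(\d+)([\])])")

def parse_range(text):
    """Return (opening, low, high, closing), or None if text is not a valid range."""
    match = RANGE_SYNTAX.fullmatch(text)
    if match is None:
        return None  # wrong syntax
    opening, low, high, closing = match.groups()
    low, high = int(low), int(high)
    if low > high:
        return None  # right syntax, wrong semantics (assumed rule)
    return opening, low, high, closing

print(parse_range("[1234,5678)"))  # ('[', 1234, 5678, ')')
print(parse_range("[5678,1234)"))  # None
```

Both inputs pass the regex; only the numeric comparison after the match tells them apart.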
I would recommend not allowing all chars but only the needed ones. Example:
QRegExp("[\\\\\\(\\)\\{\\}]\\d[\\\\\\(\\)\\{\\}]\\d[\\\\\\(\\)\\{\\}]");
I'll explain:
[] contains the characters that may match for your char. In the C++ source, \\\\ is four backslashes: the string literal reduces them to \\, which the regular expression reads as one literal \ sign. Likewise, \\( in the source becomes \( in the regular expression, a literal opening bracket, and so on. You can add all the chars you would like to be matched. A good help for this is the Regular Expression Cheat Sheet.
\d is matching a single digit, if you want to have more than one digit you could use \d+ for at least one or \d{3} for exactly 3 digits. (+ 1 or more, ? 0 or 1, * 0 or more)
Another example would be:
QRegExp("[\\\\\\(\\)\\{\\}]\\d[,\\.]\\d[\\\\\\(\\)\\{\\}]");
for having the center character to be a . or a , sign.

Nongreedy regex with alternation and repetition [duplicate]

I am trying to match the contents between AB and BA using extended regex, for instance using awk.
Consider the two example strings AB12BABA and AB123BABA, I tried the following regex
AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA
But it matches the whole string (greedy) for both examples.
Can anyone explain how the regex engine works in this case, and how I should change my regex so that it works?
The BRE and ERE engines match using the Leftmost Longest Rule, which differs from how Perl and other NFA-based regex engines match a regex.
The documentation from Boost library is more detailed in regards to the technical aspect, so I quote it here:
The Leftmost Longest Rule
Often there is more than one way of matching a regular expression at a particular location, for POSIX basic and extended regular expressions, the "best" match is determined as follows:
1. Find the leftmost match, if there is only one match possible at this location then return it.
2. Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it.
3. If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found.
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only one such match possible then return it.
5. Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it.
6. Repeat steps 4 and 5 for each additional marked sub-expression.
7. If there is still more than one possible match remaining, then they are indistinguishable; return the first one found.
Marked sub-expression, as mentioned in the text, refers to () capturing groups. Note that they only capture; back-references are not supported.
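The difference between the two matching rules can be illustrated with a rough sketch. Python's re is a backtracking engine, so it shows the "first alternative wins" behaviour directly. The function longest_match is a hypothetical helper that only emulates the "longest" step of the rule for a flat list of alternatives at the start of the text, not a POSIX engine.

```python
import re

def backtracking_match(alternatives, text):
    """What Python's re returns: the first alternative that matches wins."""
    match = re.match("|".join(alternatives), text)
    return match.group() if match else None

def longest_match(alternatives, text):
    """Try each alternative separately at position 0 and keep the longest result."""
    found = [m.group() for m in (re.match(alt, text) for alt in alternatives) if m]
    return max(found, key=len) if found else None

print(backtracking_match(["a", "ab"], "ab"))  # a
print(longest_match(["a", "ab"], "ab"))       # ab
```

An engine following the Leftmost Longest Rule would report ab for a|ab, which is why lazy behaviour has to be built into the pattern itself.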
Therefore, to get lazy matching, you need to construct a regular expression that matches the repeated part while avoiding the tail part until the very end. Since ERE and BRE are equivalent to theoretical regular expressions, as long as you can construct a DFA, there exists an equivalent regex that does the trick (though constructing it is not a trivial task in some cases).
For your requirement, this regex shall work:
AB([^B]|B+[^AB])*B*BA
The part ([^B]|B+[^AB])*B* matches any string that does not contain the string "BA".
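A quick check of this regex against the two example strings, assuming Python's re module. Python is a backtracking engine rather than a leftmost-longest one, but the pattern should give the same result in both kinds of engine, since the middle part can never contain BA. The helper name first_match is made up for illustration.

```python
import re

BETWEEN = re.compile(r"AB([^B]|B+[^AB])*B*BA")

def first_match(text):
    """Return the first match of the AB...BA pattern, or None."""
    match = BETWEEN.search(text)
    return match.group() if match else None

print(first_match("AB12BABA"))   # AB12BA
print(first_match("AB123BABA"))  # AB123BA
```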
Derivation
This is the DFA for matching a string that does not contain the string "BA".
The notation here is not standard, so I will explain a bit:
State q1/B means that the state is named q1 (just like how you name a variable), B is the current progress towards matching BA.
* means any character in the alphabet. [^B] means any character in the alphabet except for B.
In the DFA, q0 and q1 are final states, q0 is the initial state. Note that q2 is a trap state, since it is a non-final state, and there is no transition out of this state.
Use the steps here, or just use JFLAP to derive the regular expression. (In JFLAP, you should use some character, such as C to represent [^AB]).
Since q2 is a trap state, we can exclude it from the formula:
R0 = [^B]R0 + BR1 + λ
R1 = [^AB]R0 + BR1 + λ
Apply Arden's theorem to R1:
R1 = B*([^AB]R0 + λ)
Substitute R1 to R0:
R0 = [^B]R0 + BB*([^AB]R0 + λ) + λ
Distribute BB* over ([^AB]R0 + λ):
R0 = [^B]R0 + BB*[^AB]R0 + BB*λ + λ
Group together:
R0 = ([^B] + BB*[^AB])R0 + (BB* + λ)
Apply Arden's theorem to R0:
R0 = ([^B] + BB*[^AB])*(BB* + λ)
(BB* OR λ (empty string)) is equivalent to B*:
R0 = ([^B] + BB*[^AB])*B*
Let us rewrite it in awk's syntax: ([^B]|B+[^AB])*B*, which is what is shown above.
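The derived expression can be sanity-checked by brute force. The sketch below assumes Python's re module and a three-letter alphabet in which C stands in for every character other than A and B. The helper name mismatches is made up for illustration. It compares the regex with a plain substring test on every string up to a given length.

```python
import re
from itertools import product

NO_BA = re.compile(r"([^B]|B+[^AB])*B*")
ALPHABET = "ABC"  # C represents any character other than A and B

def mismatches(max_length):
    """Return the strings on which the regex and the substring test disagree."""
    bad = []
    for length in range(max_length + 1):
        for letters in product(ALPHABET, repeat=length):
            text = "".join(letters)
            regex_accepts = NO_BA.fullmatch(text) is not None
            if regex_accepts != ("BA" not in text):
                bad.append(text)
    return bad

print(mismatches(6))  # [] means the two tests agree on every string tried
```

This is a finite check rather than a proof, but it covers every path through the small DFA many times over.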
Use lookarounds and a non-greedy quantifier:
(?<=AB).*?(?=BA)
If you want to match the delimiters too, simply:
AB.*?BA
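These two patterns need an engine that supports lookarounds and lazy quantifiers, which POSIX ERE (and therefore awk) does not. A minimal sketch, assuming Python's re module and two hypothetical helper names:

```python
import re

def between(text):
    """Return the text between the first AB and the next BA, or None."""
    match = re.search(r"(?<=AB).*?(?=BA)", text)
    return match.group() if match else None

def with_delimiters(text):
    """Return the same span including the AB and BA delimiters, or None."""
    match = re.search(r"AB.*?BA", text)
    return match.group() if match else None

print(between("AB12BABA"))          # 12
print(with_delimiters("AB12BABA"))  # AB12BA
```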

(V)C++ (2010) regular expressions, "recursive captures"

I want to match and capture the operators and operands of an expression like:
1
x
1 + x
x + y + 3 + 10
etc...
So on regexpal,
(\w+)(\s*([+])\s*(\w+))*
appears to do it, but how do I obtain the matched captures? Notice that [+] and (\w+) are already inside one capture.
Unfortunately this is not possible (at least in any regex flavor that I know of). If one capturing group is used multiple times, the capture will always be filled with the last thing it captured. Simple example: ([a-z])* applied to abc will give you only c.
I recommend that you use the regex just to check for a valid format. Then you can split the string at the matches of \s*\b\s*. This should then result in an array containing x, +, y, +, 3, +, 10 for your last example.
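The "validate first, then tokenize" approach can be sketched as follows. The example assumes Python's re module rather than C++. It collects tokens with findall on \w+|\+ instead of splitting at \s*\b\s*, which is a substitution made here for simplicity. The helper names are hypothetical.

```python
import re

EXPRESSION = re.compile(r"(\w+)(\s*([+])\s*(\w+))*")  # regex from the question
TOKEN = re.compile(r"\w+|\+")                         # an operand or a plus sign

def last_operand_capture(text):
    """Show that a repeated group keeps only its last capture."""
    match = EXPRESSION.fullmatch(text)
    return match.group(4) if match else None

def tokens(text):
    """Validate the format, then return all operands and operators in order."""
    if EXPRESSION.fullmatch(text) is None:
        return None
    return TOKEN.findall(text)

print(last_operand_capture("x + y + 3 + 10"))  # 10
print(tokens("x + y + 3 + 10"))  # ['x', '+', 'y', '+', '3', '+', '10']
```

The repeated group only ever reports 10, while the separate tokenizing pass recovers every operand and operator.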
Here is some example code that shows how to use regexes to split strings, using boost::regex.
Maybe this would be a better job for System.CodeDom.Compiler than for Regexes.
If boost is an option for you, then you can use boost::regex with the boost::match_extra flag; match_results::captures and sub_match::captures then contain a list of all captured items.