Difference between regex quantifiers plus and star

I am trying to extract the error number from strings like "Wrong parameters - Error 1356":
Pattern p = Pattern.compile("(\\d*)");
Matcher m = p.matcher(myString);
m.find();
System.out.println(m.group(1));
This prints nothing, which seemed strange to me, because according to Wikipedia * "matches the preceding element zero or more times".
I also tested \d* on www.regexr.com and regex101.com, and the result was the same: nothing.
Then I started testing some variations (all on the sites mentioned above):
(\d)* doesn't work
\d{0,} doesn't work
[\d]* doesn't work
[0-9]* doesn't work
\d{4} works
\d+ works
(\d+) works
[0-9]+ works
So I searched the web for an explanation. The best I could find was the Quantifier section here, which states:
\d? Optional digit (one or none).
\d* Eat as many digits as possible (but none if necessary)
\d+ Eat as many digits as possible, but at least one.
\d*? Eat as few digits as necessary (possibly none) to return a match.
\d+? Eat as few digits as necessary (but at least one) to return a match.
The question
As English is not my primary language, I'm having trouble understanding the difference (mainly the "but none if necessary" part). Could the regex experts here explain this in simple words, please?
The closest question I found here on SO was this one: Regex: possessive quantifier for the star repetition operator, i.e. \d**, but it does not explain the difference.

The * quantifier matches zero or more occurrences.
In practice, this means that
\d*
will match at any position of any input, because it is allowed to match the empty string. So your regex matches at the start of the input string, where there is no digit, and returns the empty string.
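To see this on the string from the question, here is a minimal Java sketch. The class name StarVersusPlus and the helper firstMatch are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlus {
    // Returns the text of the first match of regex in input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "Wrong parameters - Error 1356";
        // \d* succeeds right away at index 0 with an empty match.
        System.out.println("[" + firstMatch("(\\d*)", input) + "]"); // prints "[]"
        // \d+ cannot succeed until the scan reaches the first digit.
        System.out.println("[" + firstMatch("(\\d+)", input) + "]"); // prints "[1356]"
    }
}
```

The brackets are only there to make the empty match visible in the output.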

"But none if necessary" means that the pattern does not fail when no digit is present. So \d* matches zero or more digits.
For example,
\d*[a-z]*
will match
abcdef
but \d+[a-z]*
will not match
abcdef
because \d+ requires at least one digit.
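The same comparison can be checked in Java. This is only a sketch; OptionalDigits and matchesWhole are hypothetical names, and Pattern.matches tests the whole input against the regex.

```java
import java.util.regex.Pattern;

class OptionalDigits {
    // True if the whole input is matched by the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \d* is satisfied by zero digits, so the letters alone are enough.
        System.out.println(matchesWhole("\\d*[a-z]*", "abcdef"));   // true
        // \d+ needs at least one digit, and there is none.
        System.out.println(matchesWhole("\\d+[a-z]*", "abcdef"));   // false
        // With a leading digit both patterns are satisfied.
        System.out.println(matchesWhole("\\d+[a-z]*", "42abcdef")); // true
    }
}
```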

\d* Eat as many digits as possible (but none if necessary)
\d* matches a digit zero or more times. The engine tries the first position of your input first; there is no digit there, so \d* matches zero digits (the empty string) and the search stops. That is why nothing is printed.
\d+
It matches a digit one or more times, so the engine has to move forward until it finds a digit, and then it matches that digit and any digits that follow.

With the pattern \d+, at least one digit has to be found, and the match then includes all subsequent digits until a non-digit character is reached.
\d* will match all the empty strings (zero digits) as well as the run of digits. The .NET regex engine returns all of these empty matches in its set of matches.
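The answer above talks about .NET; as far as I know, Java's Matcher reports empty matches in a similar way, so here is a Java sketch that lists every match. The names EmptyMatches and allMatches and the input "ab12" are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatches {
    // Collects the text of every successive match, including empty ones.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \d* matches emptily before 'a', before 'b' and at the end, plus the run "12".
        System.out.println(allMatches("\\d*", "ab12")); // prints "[, , 12, ]"
        // \d+ reports only the run of digits.
        System.out.println(allMatches("\\d+", "ab12")); // prints "[12]"
    }
}
```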

Simply:
\d* implies zero or more times
\d+ means one or more times

Related

LookAround or default regex if symbol is not present

I have got this regex
^\d+(?<=\d)_?(?=\d)\d*
My original goal is to match these patterns:
5
55
5_5
55_5
But ignore
_5
5_
_
As far as I understand, it matches at least 1 digit from the beginning of the line, and an underscore only if it is surrounded by digits. Pretty simple. So,
5_5 passes,
555_555 also passes,
_5 does not pass, as expected,
_ does not pass either.
In addition, 55 passes, which is fine.
But for some reason 5 does not pass. Why? It is a single digit, and it should pass even though there is no underscore after it. Any ideas why this is happening? Thanks.
Tested on https://regex101.com/

The reason is that the pattern has to match at least 2 digits.
This is due to the ^\d+ followed by the lookahead (?=\d), which asserts another digit to the right.
In your pattern, you can remove the lookaround assertions, as you are also matching the digits that you are asserting so they are redundant.
Your pattern can be written as ^\d+_?\d+ where you can see that you have to match at least 2 digits with an optional underscore.
To get the matches that you want, you might write the pattern as:
^\d+(?:_\d+)?$
Explanation
^ Start of string
\d+ Match 1+ digits
(?:_\d+)? Optionally match _ and 1+ digits (to prevent an underscore at the end)
$ End of the string
Regex demo
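A quick way to compare the two patterns on the inputs from the question is a small Java sketch like the one below. The class and method names are hypothetical, and find() is used so that the anchors in the patterns do the work.

```java
import java.util.regex.Pattern;

class DigitsWithUnderscore {
    // The pattern from the question: effectively needs two digits.
    private static final Pattern ORIGINAL = Pattern.compile("^\\d+(?<=\\d)_?(?=\\d)\\d*");
    // The suggested pattern: digits, optionally followed by one underscore and more digits.
    private static final Pattern FIXED = Pattern.compile("^\\d+(?:_\\d+)?$");

    static boolean originalFinds(String s) {
        return ORIGINAL.matcher(s).find();
    }

    static boolean fixedFinds(String s) {
        return FIXED.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"5", "55", "5_5", "55_5", "_5", "5_", "_"};
        for (String s : inputs) {
            System.out.println(s + "  original=" + originalFinds(s) + "  fixed=" + fixedFinds(s));
        }
    }
}
```

The only input where the two patterns disagree is the single digit 5, which the original rejects and the suggested pattern accepts.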

Regex to find a line with two capture groups that match the same regex but are still different

I am trying to analyse my source code (written in C) for mismatched timer variable comparisons/assignments. I have a range of timers with different timebases (2-250 milliseconds). Every timer variable contains its granularity in milliseconds in its name (e.g. timer10ms) as well as every timer-photo and define (e.g. fooTimer10ms, DOO_TIMEOUT_100MS).
Here are some example lines:
fooTimer10ms = timer10ms;
baaTimer20ms = timer10ms;
if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)
if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)
I want to match those lines where the timebases do not correspond (in this case the second, third and fourth line). So far I have this regex:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))
that is capable of finding every line where there are two of those granularities. So instead of just line 2, 3 and 4 it matches all of them. The only idea I had to narrow it down is to add a negative lookbehind with a back-reference, like so:
(\d{1,3}(?i)ms(?-i)).*[^\d](\d{1,3}(?i)ms(?-i))(?<!\1)
but this is not allowed because a negative lookbehind has to have a fixed length.
I found these two questions (one, two), but the first does not have the restriction that both capture groups are of the same kind, and the second is looking for equal instances of the capture group.
If what I want can be achieved more easily by using something other than regex, I would be happy to know. My mind is just stuck due to my belief that regex is capable of this and I am just not creative enough to use it properly.

One option is to match the timer part followed by the digits and use a negative lookahead with a backreference to assert that the same text does not occur to the right.
For the example data, a more specific pattern that restricts the number to the range 2-250 might be:
.*?(timer(?:2[0-4]\d|250|1?\d\d|[2-9])ms)\b\S*[^\S\r\n]*[<>]?=[^\S\r\n]*\b(?!\S*\1)\S+
The pattern matches
.*? Match any char except a newline, as few as possible (non-greedy)
( Capture group 1
timer Match literally
(?:2[0-4]\d|250|1?\d\d|[2-9]) Match a number in the range 2-250
ms Match literally
)\b Close group and a word boundary
\S*[^\S\r\n]* Match optional non whitespace chars and optional spaces without newlines
[<>]?= Match an optional < or > and =
[^\S\r\n]*\b Match optional whitespace chars without a newline and a word boundary
(?!\S*\1) Negative lookahead, assert no occurrence of what is captured in group 1 in the value
\S+ Match 1+ non whitespace chars
Regex demo
Or perhaps a broader pattern matching 1-3 digits and optional whitespace chars which might also match a newline:
.*?(timer\d{1,3}ms\b)\S*\s*[<>]?=\s*\b(?!.*\1)\S+
Regex demo
Note that {1-3} should be {1,3} and could also match 999
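Since the question also asks whether something other than a single regex would be easier, here is an alternative sketch in Java. It is not the pattern from the answer above: it simply pulls every number that is directly followed by "ms" (in any letter case) out of a line and flags the line when more than one distinct value appears. TimebaseCheck, timebases and hasMismatch are names invented for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimebaseCheck {
    // Digits directly followed by "ms" in any letter case, e.g. 10ms or 100MS.
    private static final Pattern TIMEBASE =
            Pattern.compile("(\\d{1,3})ms", Pattern.CASE_INSENSITIVE);

    // Collects the distinct timebase values that occur on one source line.
    static Set<String> timebases(String line) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = TIMEBASE.matcher(line);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // A line is suspicious when it mentions more than one distinct timebase.
    static boolean hasMismatch(String line) {
        return timebases(line).size() > 1;
    }

    public static void main(String[] args) {
        String[] lines = {
            "fooTimer10ms = timer10ms;",
            "baaTimer20ms = timer10ms;",
            "if (DIFF_100MS(dooTimer10ms) >= DOO_TIMEOUT_100MS)",
            "if (DIFF_100MS(dooTimer10ms) < DOO_TIMEOUT_100MS)"
        };
        for (String line : lines) {
            if (hasMismatch(line)) {
                System.out.println(line + "   // timebases: " + timebases(line));
            }
        }
    }
}
```

On the four example lines this reports the second, third and fourth line. It assumes that every number followed by "ms" on a line is a timebase, which may be too broad for real code.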

Using regex to match numbers which have 5 increasing consecutive digits somewhere in them

First off, this has sort of been asked before. However I haven't been able to modify this to fit my requirement.
In short: I want a regex that matches an expression if and only if it only contains digits, and there are 5 (or more) increasing consecutive digits somewhere in the expression.
I understand the logic of
^(?=\d{5}$)1*2*3*4*5*6*7*8*9*0*$
however, this limits the expression to 5 digits. I want to allow digits before and after the increasing part. So 1111345671111 should match, while 11111 shouldn't.
I thought this might work:
^[0-9]*(?=\d{5}0*1*2*3*4*5*6*7*8*9*)[0-9]*$
which I interpret as:
^$: The entire expression must only contain what's between these 2 symbols
[0-9]*: Any digits between 0-9, 0 or more times followed by:
(?=\d{5}0*1*2*3*4*5*6*7*8*9*): A part where at least 5 increasing digits are found followed by:
[0-9]*: Any digits between 0-9, 0 or more times.
However this regex is incorrect, as for example 11111 matches. How can I solve this problem using a regex? So examples of expressions to match:
00001459000
12345
This shouldn't match:
abc12345
9871234444

While this problem can be solved using pure regular expressions (the set of strictly ascending five-digit strings is finite, so you could just enumerate all of them), it's not a good fit for regexes.
That said, here's how I'd do it if I had to:
^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$
Core idea: 0?1?2?3?4?5?6?7?8?9? matches an ascending numeric substring, but it doesn't restrict its length. Every single part is optional, so it can match anything from "" (empty string) to the full "0123456789".
We can force it to match exactly 5 characters by combining a look-ahead of five digits and an arbitrary suffix (which we capture) with a backreference \1 (which must match exactly the suffix captured by the look-ahead, ensuring we've now walked ahead 5 characters in the string).
Live demo: https://regex101.com/r/03rJET/3
(By the way, your explanation of (?=\d{5}0*1*2*3*4*5*6*7*8*9*) is incorrect: It looks ahead to match exactly 5 digits, followed by 0 or more occurrences of 0, followed by 0 or more occurrences of 1, etc.)
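For anyone who wants to try the backreference pattern outside regex101, here is a Java sketch that runs it against the examples from the question. The class AscendingRun and the method matches are hypothetical names; the pattern itself is the one given above, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

class AscendingRun {
    // Look-ahead captures the suffix after 5 digits; \1 then forces the
    // ascending part 0?1?...9? to be exactly those 5 digits.
    private static final Pattern ASCENDING =
            Pattern.compile("^\\d*(?=\\d{5}(\\d*)$)0?1?2?3?4?5?6?7?8?9?\\1$");

    static boolean matches(String s) {
        return ASCENDING.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"00001459000", "12345", "1111345671111",
                           "11111", "abc12345", "9871234444"};
        for (String s : inputs) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

The first three inputs should be accepted and the last three rejected.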

Because the starting position of the increasing digits isn't known in advance, and the consecutive increasing digits don't end at the end of the string, the linked answer's concise pattern won't work here. I don't think this is possible without being repetitive; alternate between all possibilities of increasing digits. A 0 must be followed by [1-9]. (0(?=[1-9])) A 1 must be followed by [2-9]. A 2 must be followed by [3-9], and so on. Alternate between these possibilities in a group, and repeat that group four times, and then match any digit after that (the lookahead in the last repeated digit in the previous group will ensure that this 5th digit is in sequence as well).
First lookahead for digits followed by the end of the string, then match the alternations described above, followed by one or more digits:
^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|2(?=[3-9])|3(?=[4-9])|4(?=[5-9])|5(?=[6-9])|6(?=[7-9])|7(?=[89])|8(?=9)){4}\d+
Separated out for better readability:
^(?=\d+$)\d*?
(?:
0(?=[1-9])|
1(?=[2-9])|
2(?=[3-9])|
3(?=[4-9])|
4(?=[5-9])|
5(?=[6-9])|
6(?=[7-9])|
7(?=[89])|
8(?=9)
){4}
\d+
The lazy quantifier in the first line there \d*? isn't necessary, but it makes the pattern a bit more efficient (otherwise it initially greedily matches the whole string, requiring lots of failing alternations and backtracking until at least 5 characters before the end of the string)
https://regex101.com/r/03rJET/2
It's ugly, but it works.
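If the repetition in the pattern is the main concern, the alternation can be generated instead of typed out. The Java sketch below builds it in a loop; the class and method names are made up, and the generated branches use ranges such as [8-9] and [9-9], which are equivalent to the [89] and 9 written by hand above.

```java
import java.util.regex.Pattern;

class AscendingAlternation {
    private static final Pattern PATTERN = build();

    // Builds ^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|...|8(?=[9-9])){4}\d+
    private static Pattern build() {
        StringBuilder alternation = new StringBuilder();
        for (int d = 0; d <= 8; d++) {
            if (d > 0) {
                alternation.append('|');
            }
            // Digit d must be followed by a strictly larger digit.
            alternation.append(d).append("(?=[").append(d + 1).append("-9])");
        }
        return Pattern.compile("^(?=\\d+$)\\d*?(?:" + alternation + "){4}\\d+");
    }

    static boolean hasAscendingRun(String s) {
        return PATTERN.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"00001459000", "12345", "1111345671111",
                           "11111", "abc12345", "9871234444"};
        for (String s : inputs) {
            System.out.println(s + " -> " + hasAscendingRun(s));
        }
    }
}
```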

Extract information through regexp

I have a question about groups in a rule I created to extract dates from text.
Let's consider the following string:
fherfrefercr17hfeuetvbyeituew
The string consists of arbitrary characters at the beginning, then a number of one or two digits, and then arbitrary characters again. I need to extract only the number "17" from the string above.
With the following rule I extract only 7 and not 17.
.*(\d{1,2}).*
Can anyone help me with that please?

Overview
Given your pattern:
.*(\d{1,2}).*
This works in the following way:
.* Match any character any number of times
The quantifier here is considered to be greedy because it will match as many characters as possible so long as the pattern matches the string.
\d{1,2} Since your pattern says to match 1 or 2 digits and the previous token is greedy, the regex is just going to match a single digit because this still satisfies the pattern (the previous token stole the first digit).
Code
There are multiple ways you can fix this issue
Method 1
This will simply extract all numbers (1+ digits) from the string. If you want to only match 1 or two digits use \d\d? or \d{1,2} instead.
\d+
\d\d?
\d{1,2}
Method 2
This method turns the greedy quantifier * (in .*) into a lazy quantifier .*?. This will match any character any number of times, but as few as possible. The drawback to this method is that it's expensive because the engine needs to backtrack.
.*?(\d{1,2}).*
Method 3
This method matches any non-digit character any number of times, then it matches one or two digits. This is likely the solution you're looking for.
\D*(\d{1,2}).*
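The difference between the greedy, lazy and \D* variants can be checked with a small Java sketch. ExtractNumber and group1 are hypothetical names; each pattern is written with a capture group around the digits so that group 1 can be compared.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractNumber {
    // Matches the regex against the whole input and returns capture group 1, or null.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "fherfrefercr17hfeuetvbyeituew";
        // Greedy .* leaves only the last digit for the group.
        System.out.println(group1(".*(\\d{1,2}).*", input));  // prints "7"
        // Lazy .*? stops at the first digit, so the group takes both digits.
        System.out.println(group1(".*?(\\d{1,2}).*", input)); // prints "17"
        // \D* cannot consume digits at all.
        System.out.println(group1("\\D*(\\d{1,2}).*", input)); // prints "17"
    }
}
```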

RegEx expression to handle multiple conditions of breaking sentences

I am trying to make a regex that is used in an exception.
Therefore it must return false for these sentences (the leading digits are included in the strings):
3.{17} this is italics and should break.{18} 
4. this is another sentence and should break. 
5. This is another sentence and should break. 
And it must return true for these:
There are 2 reasons for this 1. you are here and 2. you are communicating. 
Is it 2? they wanted to know. 
1 digit at the beginning but with 1. with a period should return true.
In other words, if the beginning of the string is a number followed by a period, it should return false (even if "\{\d+\}" follows it optionally) and the character following the space does not matter. And it must return true if the number and period (or ! or ?) is embedded in the sentence followed by a lower case character, in other cases it must be false.
As a further note: this goes into a java properties file, and the value is then passed to a perl5 regex engine to return broken text.
I tried to express it in one expression, but somehow I cannot get it right.
This is what I have come up with so far:
^([^0-9\.]+[\.]|
[^\.!\?]*[\?!]+[\?!\.]+|
[0-9]+[^\?!\.]+[\?!\.]+|
[^0-9]*[0-9]+[^\?!\.]+[\?!\.]+)
(\{\d+\}[\u0020\u00A0]|
[\u0020\u00A0]*)[a-z]
I seem to have arrived at an impasse and can't see what I have wrong.
Thanks for any advice.
Update:
A simpler format with look-ahead: ^(?!\d+\.)[^.!?]*[.!?]+(\{\d+\}\s|\s*)\p{Ll} based on the comments.

You may use
^(?!\d+\.)[^.!?]*[.!?]+(\{\d+\}\s|\s*)\p{Ll}
See the regex demo.
The pattern matches:
^ - start of string anchor
(?!\d+\.) - a negative lookahead that will fail the match if its pattern is matched at the start of the string: 1+ digits followed with a dot
[^.!?]* - 0+ chars other than ., ! and ?
[.!?]+ - 1 or more ., ! or ? symbols
(\{\d+\}\s|\s*) - either a { + 1 or more digits + } followed by a whitespace, or 0+ whitespaces (if you are not interested in the value captured with this capturing group, you may turn it into a non-capturing one by adding ?: after the opening parenthesis).
\p{Ll} - a lowercase letter (if a u modifier is used, it will also match all Unicode lowercase letters).
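The question says the expression ends up in a perl5 engine; the sketch below uses Java's java.util.regex only as a convenient way to try the pattern against the sample sentences, so behaviour in the real target engine is an assumption. SentenceBreak and isException are invented names.

```java
import java.util.regex.Pattern;

class SentenceBreak {
    // No leading "digits + dot"; then text up to ., ! or ?, an optional {n} tag
    // or whitespace, and a lowercase letter.
    private static final Pattern EXCEPTION =
            Pattern.compile("^(?!\\d+\\.)[^.!?]*[.!?]+(\\{\\d+\\}\\s|\\s*)\\p{Ll}");

    static boolean isException(String sentence) {
        return EXCEPTION.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "3.{17} this is italics and should break.{18} ",
            "4. this is another sentence and should break. ",
            "5. This is another sentence and should break. ",
            "There are 2 reasons for this 1. you are here and 2. you are communicating. ",
            "Is it 2? they wanted to know. ",
            "1 digit at the beginning but with 1. with a period should return true."
        };
        for (String s : inputs) {
            System.out.println(isException(s) + "  " + s);
        }
    }
}
```

The first three sentences should give false and the last three true.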
