Regex quantifier not restricting match [duplicate] - regex

This question already has an answer here:
Restricting character length in a regular expression
(1 answer)
Closed 4 years ago.
I would like to match 1 or more capital letters, [A-Z]+ followed by 0 or more numbers, [0-9]* but the entire string needs to be less than or equal to 8 characters in total.
No matter what regex I come up with the total length seems to be ignored. Here is what I've tried.
^[A-Z]+[0-9]*{1,8}$ //Range ignored, will not work on regex101.com but will on rubular.com/
^([A-Z]+[0-9]*){1,8}$ //Range ignored
^(([A-Z]+[0-9]*){1,8})$ //Range ignored
Is this not possible in regex? Do I just need to do the range check in the language I'm writing in? That's fine, but I thought it would be cleaner to keep it all in regex syntax. Thanks

The behaviour is expected. When you write the following pattern:
^([A-Z]+[0-9]*){1,8}$
The {1,8} quantifier tells the regex engine to repeat the preceding element, in this case the capturing group, between one and eight times. Because [A-Z]+ and [0-9]* inside the group are themselves unbounded, a single repetition can already match a string of any length, so the quantifier never limits the total length.
You need to use a lookahead to obtain the desired behaviour:
^(?=.{1,8}$)[A-Z]+[0-9]*$
^ Assert beginning of string.
(?=.{1,8}$) Ensure that the string that follows is between one and eight characters in length.
[A-Z]+[0-9]* Match one or more upper case letters followed by zero or more digits.
$ Asserts position end of string.
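As a quick check of the lookahead pattern above, here is a minimal Python sketch. The helper name `is_valid` and the sample strings are made up for illustration, and Python's `re` module is only one possible engine; the pattern itself is taken from the answer.

```python
import re

# The lookahead caps the total length at 8 characters;
# the rest of the pattern enforces "letters first, then digits".
PATTERN = re.compile(r"^(?=.{1,8}$)[A-Z]+[0-9]*$")


def is_valid(s: str) -> bool:
    """True if s is 1+ capital letters, then 0+ digits, at most 8 chars in total."""
    return PATTERN.match(s) is not None


if __name__ == "__main__":
    # Hypothetical sample inputs, not from the original question.
    for sample in ["A", "ABC123", "ABCD1234", "ABCDE1234", "123ABC", "A1B2"]:
        print(f"{sample!r}: {is_valid(sample)}")
```

Only the first three samples pass: "ABCDE1234" is nine characters long, "123ABC" starts with digits, and "A1B2" has a letter after a digit.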

The regex ^([A-Z]+[0-9]*){1,8}$ matches [A-Z]+[0-9]* one to eight times. For example, it matches A1A1A1A1A1A1A1A1 (eight repetitions of A1) but not A1A1A1A1A1A1A1A1A1 (nine repetitions).
You might use a positive lookahead (?=[A-Z0-9]{1,8}$) to assert the length of the string:
^(?=[A-Z0-9]{1,8}$)[A-Z]+[0-9]*$
That would match
^ From the start of the string
(?=[A-Z0-9]{1,8}$) Positive lookahead to assert that what follows matches any of the characters in the character class [A-Z0-9] 1 - 8 times and assert the end of the string.
[A-Z]+[0-9]*$ Match an uppercase character one or more times, followed by a digit zero or more times, and assert the end of the string.
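To see that {1,8} on the group counts repetitions rather than characters, the following Python sketch compares the question's pattern with the lookahead version. The variable names and the sample string `long_single` are assumptions added for illustration.

```python
import re

# Pattern from the question: {1,8} applies to the whole group.
GROUP_REPEATED = re.compile(r"^([A-Z]+[0-9]*){1,8}$")
# Pattern from the answer: the lookahead limits the total length.
WITH_LOOKAHEAD = re.compile(r"^(?=[A-Z0-9]{1,8}$)[A-Z]+[0-9]*$")

eight = "A1" * 8               # 16 characters, needs 8 repetitions of the group
nine = "A1" * 9                # 18 characters, needs 9 repetitions of the group
long_single = "ABCDEFGHIJ123"  # 13 characters, but only 1 repetition


def matches(pattern: re.Pattern, s: str) -> bool:
    return pattern.match(s) is not None


if __name__ == "__main__":
    for s in (eight, nine, long_single, "ABC123"):
        print(s, matches(GROUP_REPEATED, s), matches(WITH_LOOKAHEAD, s))
```

The group-repeated pattern accepts the 13-character `long_single` because it is a single repetition, while the lookahead pattern rejects everything longer than 8 characters.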

Related

trying to understand what this regex means [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
Trying to understand what the below regex means.
/^[0-9]{2,3}[- ]{0,1}[0-9]{3}[- ]{0,1}[0-9]{3}$/
Sorry not exactly a coding question.
Let's break this regex into a few different parts:
^: asserts position at start of the string
[0-9]{2,3}: Matches a digit (0 to 9) two or three times
[- ]{0,1}: Matches a dash or a space zero or one time (optional separator)
[0-9]{3}: Matches a digit (0 to 9) exactly 3 times
[- ]{0,1}: Matches a dash or a space zero or one time (optional separator)
[0-9]{3}: Matches a digit (0 to 9) exactly 3 times
$: asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
Here are a few strings that would pass this regex:
123-123-123
123123123
12-123-123
12123123
Here's a good resource to learn/test regexes: regex101.com
It matches two or three digits followed by (optionally) a dash or space, then 3 digits, again optional dash or space and 3 digits. It seems to try to match a telephone number written in different formats.
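The breakdown above can be tried out with a short Python sketch. The name `PHONE_LIKE`, the helper `accepts`, and the extra sample strings (the ones with spaces and the rejected ones) are assumptions for illustration; the pattern is the one from the question without the surrounding slashes.

```python
import re

PHONE_LIKE = re.compile(r"^[0-9]{2,3}[- ]{0,1}[0-9]{3}[- ]{0,1}[0-9]{3}$")


def accepts(s: str) -> bool:
    return PHONE_LIKE.match(s) is not None


if __name__ == "__main__":
    samples = [
        "123-123-123",   # dashes as separators
        "12 123 123",    # spaces as separators
        "12123123",      # no separators
        "1-123-123",     # only one leading digit
        "123--123-123",  # two separators in a row
    ]
    for s in samples:
        print(f"{s!r}: {accepts(s)}")
```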

Regex for excluding strings that start with consecutive leading zeroes or are only alphabets [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I am looking for a regex to select only the strings that do not start with consecutive zeroes or consecutive letters before the underscore in the strings below.
For ex:
ABC_DE-001 is invalid
abc is invalid (only alphabets)
0_DE-001 is invalid (1 zero before underscore)
000_DE-001 is invalid (sequence of 3 consecutive zeroes)
00_DE-001 is invalid (sequence of 2 consecutive zeroes)
01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)
One of the approaches I tried was:
(0[1-9]+|[1-9][0-9]+|0[0*$][1-9])_[A-Z0-9]+[-][0-9]{3}
I am not sure though if any scenario is missed with this. Also, how can the same thing be achieved using negative or positive lookaround?
For your example data, you might match using an optional zero, ^0?, as a zero can occur at the start but not more than once.
^0?[1-9][0-9]*_[A-Z]+-[0-9]{3}$
That will match
^0? An optional zero at the start of the string
[1-9][0-9]* Match a digit 1-9 followed by 0+ digits
_[A-Z]+ Match an _ followed by 1+ times A-Z
-[0-9]{3} Match - followed by 3 digits
$ Assert the end of the string
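A small Python sketch can run this pattern against the example data from the question. The names `PATTERN`, `EXPECTED` and `is_valid` are chosen here for illustration only.

```python
import re

PATTERN = re.compile(r"^0?[1-9][0-9]*_[A-Z]+-[0-9]{3}$")

# Example strings from the question, with the validity stated there.
EXPECTED = {
    "ABC_DE-001": False,
    "abc": False,
    "0_DE-001": False,
    "000_DE-001": False,
    "00_DE-001": False,
    "01_DE-001": True,
    "10_DE-001": True,
    "100_DE-001": True,
}


def is_valid(s: str) -> bool:
    return PATTERN.match(s) is not None


if __name__ == "__main__":
    for s, expected in EXPECTED.items():
        print(f"{s!r}: got {is_valid(s)}, expected {expected}")
```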
You can try with negative look ahead groups:
grep -Pi '^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))' test.txt
Explanation:
-Pi - use PCRE (-P) and ignore case (-i). These options are specific to grep; adapt them to your tool. If you cannot make your regex engine ignore case, replace [a-z] with [a-zA-Z]. PCRE support is required for the lookahead.
^ - beginning of the line
(?!rgx) - look forward without moving the cursor to check the line doesn't match the enclosed regular expression rgx.
[a-z]+(?:_|$|\s)|0+(?:_|$|\s) :
don't keep consecutive letters ([a-z]+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
don't keep consecutive zeroes (0+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
(?:) is a non-capturing group (the matched content is not stored, which can improve performance)
Output:
01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)
Since grep prints only the matching lines by default, the lines that are not displayed are the ones treated as invalid.
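For readers without grep -P, the same filter can be sketched in Python, with `re.IGNORECASE` standing in for the -i option. The list `LINES` reproduces the example lines from the question; the names `KEEP`, `LINES` and `kept` are assumptions for this sketch.

```python
import re

# Negative lookahead at the start of the line; it consumes no characters.
KEEP = re.compile(r"^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))", re.IGNORECASE)

LINES = [
    "ABC_DE-001 is invalid",
    "abc is invalid (only alphabets)",
    "0_DE-001 is invalid (1 zero before underscore)",
    "000_DE-001 is invalid (sequence of 3 consecutive zeroes)",
    "00_DE-001 is invalid (sequence of 2 consecutive zeroes)",
    "01_DE-001 is valid (0 followed by some other number is valid)",
    "10_DE-001 is valid (starts with 1)",
    "100_DE-001 is valid (starts with 1)",
]

# A match object is always truthy, even for an empty match.
kept = [line for line in LINES if KEEP.search(line)]

if __name__ == "__main__":
    print("\n".join(kept))
```

Note that this pattern only rejects the unwanted prefixes; it does not validate the rest of the line.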

Using regex to match numbers which have 5 increasing consecutive digits somewhere in them

First off, this has sort of been asked before. However I haven't been able to modify this to fit my requirement.
In short: I want a regex that matches an expression if and only if it only contains digits, and there are 5 (or more) increasing consecutive digits somewhere in the expression.
I understand the logic of
^(?=\d{5}$)1*2*3*4*5*6*7*8*9*0*$
however, this limits the expression to 5 digits. I want there to be able to be digits before and after the expression. So 1111345671111 should match, while 11111 shouldn't.
I thought this might work:
^[0-9]*(?=\d{5}0*1*2*3*4*5*6*7*8*9*)[0-9]*$
which I interpret as:
^$: The entire expression must only contain what's between these 2 symbols
[0-9]*: Any digits between 0-9, 0 or more times followed by:
(?=\d{5}0*1*2*3*4*5*6*7*8*9*): A part where at least 5 increasing digits are found followed by:
[0-9]*: Any digits between 0-9, 0 or more times.
However this regex is incorrect, as for example 11111 matches. How can I solve this problem using a regex? So examples of expressions to match:
00001459000
12345
This shouldn't match:
abc12345
9871234444
While this problem can be solved using pure regular expressions (the set of strictly ascending five-digit strings is finite, so you could just enumerate all of them), it's not a good fit for regexes.
That said, here's how I'd do it if I had to:
^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$
Core idea: 0?1?2?3?4?5?6?7?8?9? matches an ascending numeric substring, but it doesn't restrict its length. Every single part is optional, so it can match anything from "" (empty string) to the full "0123456789".
We can force it to match exactly 5 characters by combining a look-ahead for five digits followed by an arbitrary suffix (which we capture) with a backreference \1 (which must match exactly the suffix captured by the look-ahead, ensuring that we have consumed exactly 5 characters in between).
Live demo: https://regex101.com/r/03rJET/3
(By the way, your explanation of (?=\d{5}0*1*2*3*4*5*6*7*8*9*) is incorrect: It looks ahead to match exactly 5 digits, followed by 0 or more occurrences of 0, followed by 0 or more occurrences of 1, etc.)
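The backreference pattern can be checked in Python against the examples from the question. The function `has_ascending_run` is a plain-Python reference written for this sketch (it is not part of the answer) and is used only to cross-check the regex on ASCII digit strings.

```python
import re

ASCENDING_5 = re.compile(r"^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$")


def has_ascending_run(s: str, run: int = 5) -> bool:
    """Reference check: s is all ASCII digits and contains `run`
    consecutive strictly increasing digits."""
    if not s or any(c not in "0123456789" for c in s):
        return False
    streak = 1
    for prev, cur in zip(s, s[1:]):
        # Extend the streak on a strict increase, otherwise start over.
        streak = streak + 1 if cur > prev else 1
        if streak >= run:
            return True
    return streak >= run


if __name__ == "__main__":
    for s in ["1111345671111", "00001459000", "12345",
              "11111", "abc12345", "9871234444"]:
        print(s, ASCENDING_5.match(s) is not None, has_ascending_run(s))
```

The regex and the loop agree on all six examples; the loop is arguably easier to read and to adapt to a different run length.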
Because the starting position of the increasing digits isn't known in advance, and the run of increasing digits doesn't have to end at the end of the string, the linked answer's concise pattern won't work here. I don't think this is possible without being repetitive: alternate between all possibilities of increasing digits. A 0 must be followed by [1-9], written as 0(?=[1-9]). A 1 must be followed by [2-9], a 2 by [3-9], and so on. Put these alternatives in a group, repeat that group four times, and then match any digit after that (the lookahead on the last repeated digit ensures that this 5th digit is in sequence as well).
First lookahead for digits followed by the end of the string, then match the alternations described above, followed by one or more digits:
^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|2(?=[3-9])|3(?=[4-9])|4(?=[5-9])|5(?=[6-9])|6(?=[7-9])|7(?=[89])|8(?=9)){4}\d+
Separated out for better readability:
^(?=\d+$)\d*?
(?:
0(?=[1-9])|
1(?=[2-9])|
2(?=[3-9])|
3(?=[4-9])|
4(?=[5-9])|
5(?=[6-9])|
6(?=[7-9])|
7(?=[89])|
8(?=9)
){4}
\d+
The lazy quantifier \d*? in the first line isn't necessary, but it makes the pattern a bit more efficient. Otherwise \d* initially greedily matches the whole string, which requires lots of failing alternations and backtracking until it has backed off to at least 5 characters before the end of the string.
https://regex101.com/r/03rJET/2
It's ugly, but it works.
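In Python, the separated-out layout can be kept as-is by compiling with `re.VERBOSE`, which ignores whitespace inside the pattern. This is a sketch under that assumption; the name `ASCENDING_5_ALT` and the helper `matches` are made up here.

```python
import re

# re.VERBOSE lets the pattern keep the multi-line layout shown above.
ASCENDING_5_ALT = re.compile(
    r"""
    ^(?=\d+$)\d*?
    (?:
        0(?=[1-9])|
        1(?=[2-9])|
        2(?=[3-9])|
        3(?=[4-9])|
        4(?=[5-9])|
        5(?=[6-9])|
        6(?=[7-9])|
        7(?=[89])|
        8(?=9)
    ){4}
    \d+
    """,
    re.VERBOSE,
)


def matches(s: str) -> bool:
    return ASCENDING_5_ALT.match(s) is not None


if __name__ == "__main__":
    for s in ["1111345671111", "00001459000", "12345",
              "11111", "abc12345", "9871234444"]:
        print(s, matches(s))
```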

Regex to match given amount of characters in undefined order [duplicate]

This question already has answers here:
Regex to match exactly n occurrences of letters and m occurrences of digits
(3 answers)
Closed 4 years ago.
I am looking for a regex that matches the following:
2 times the character 'a' and 3 times the character 'b'.
Additionally, the characters do not have to be subsequent, meaning that not only 'aabbb' and 'bbbaa' should be allowed, but also 'ababb', 'abbab' and so forth.
By the sound of it this should be an easy task, but atm I just can't wrap my head around it. Redirection to a good read is appreciated.
You need to use positive lookaheads. This is the same as the password validation problem described here.
Edit:
A positive lookahead lets you check a pattern against the string without changing where the next part of the regex starts matching. This means that you can test several patterns at the same position in the string; for the regex to match, all of the positive lookaheads have to match.
In your case you are looking for 2 a's and 3 b's. The regex to match exactly 2 a's anywhere in the string is /^[^a]*a[^a]*a[^a]*$/, and for exactly 3 b's it is /^[^b]*b[^b]*b[^b]*b[^b]*$/. We now combine these so that both must match: /^(?=[^a]*a[^a]*a[^a]*$)(?=[^b]*b[^b]*b[^b]*b[^b]*$).*$/. This starts at the beginning of the string with the ^ anchor, then looks ahead for exactly 2 a's up to the end of the string. Because (?= ... ) is a positive lookahead, the match position does not move, so we are still at the start of the string when the second lookahead checks for exactly 3 b's. After both lookaheads we are still at the beginning of the string, but we now know that it contains 2 a's and 3 b's, so we match the whole string with .*$.
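The combined lookahead pattern can be exercised with a short Python sketch. `TWO_A_THREE_B` is the answer's pattern; `STRICT` is a variant added here as an assumption, for the case where no characters other than a and b should be allowed (exactly two a's in a five-character string over a and b implies exactly three b's).

```python
import re

# Pattern from the answer: counts a's and b's, allows other characters.
TWO_A_THREE_B = re.compile(
    r"^(?=[^a]*a[^a]*a[^a]*$)(?=[^b]*b[^b]*b[^b]*b[^b]*$).*$"
)
# Hypothetical stricter variant (not from the answer): only a and b, length 5.
STRICT = re.compile(r"^(?=[^a]*a[^a]*a[^a]*$)[ab]{5}$")


def ok(pattern: re.Pattern, s: str) -> bool:
    return pattern.match(s) is not None


if __name__ == "__main__":
    for s in ["aabbb", "ababb", "abbab", "bbbaa", "bbaaa", "aabb", "xaabbbx"]:
        print(s, ok(TWO_A_THREE_B, s), ok(STRICT, s))
```

The string "xaabbbx" passes the answer's pattern because [^a] and [^b] also match other characters; the strict variant rejects it.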

Regex - Exactly 7 digits no more no less

I am looking for help here. I want to write a regex to help me find EXACTLY a 7 digit in string - no more or less.
For instance in this string:
1234567 RE:TKT-2744870-R6P1G0: Gentle Reminder
It should return only 1234567
In this one:
12345678 RE:TKT-2744870-R6P1G0: Gentle Reminder
It should return none.
Can you help me with this one.
thanks in advance.
The proper regex should include \d{7} (7 digits) and 2 "border criteria",
for both start and end of the match, to block matching of a fragment
from longer sequence of digits.
My first thought was that there must be no digit immediately before or after the match.
But as I see from your example, these border criteria should be extended.
The set of "forbidden" chars (either before or after the match) should
also include - and letters.
E.g. 2744870 in your example data contains just 7 digits (no more, no less),
but you still don't want it to be matched, apparently because it is surrounded by - chars.
To keep the regex short, I propose:
(?<![\w-])\d{7}(?![\w-])
Details:
(?<![\w-]) - Negative lookbehind for word char or -.
\d{7} - 7 digits.
(?![\w-]) - Negative lookahead for word char or -.
If you decide to extend the set of "forbidden" chars in both border criteria,
just add them to [...] fragments in lookbehind / lookahead (but - char
should remain at the end, otherwise it must be quoted with \).
A regex like (\d{7})[^\d] (from another answer) is wrong,
as it also matches the last 7 digits of any longer sequence of digits
(no "front border criterion").
It also matches 2744870 (surrounded by - chars), which is not
to be matched.
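The difference between the two patterns discussed in this answer can be shown with Python's `re.findall` on the two example strings from the question. The variable names are assumptions for this sketch.

```python
import re

# Lookbehind and lookahead reject word characters and "-" on both sides.
SEVEN_DIGITS = re.compile(r"(?<![\w-])\d{7}(?![\w-])")
# The other proposal: no check in front of the digits.
NAIVE = re.compile(r"(\d{7})[^\d]")

S1 = "1234567 RE:TKT-2744870-R6P1G0: Gentle Reminder"
S2 = "12345678 RE:TKT-2744870-R6P1G0: Gentle Reminder"

if __name__ == "__main__":
    for s in (S1, S2):
        print("bordered:", SEVEN_DIGITS.findall(s))
        print("naive:   ", NAIVE.findall(s))
```

With the bordered pattern, only 1234567 in the first string is found. The naive pattern additionally returns 2744870 in both strings and 2345678 (the tail of 12345678) in the second.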
This one should do for your examples:
(\d{7})[^\d]
The first matching group contains the seven digits.
Alternatively, as suggested in the comments, you can use a negative lookahead to match only the seven digits without needing a matching group:
^\d{7}(?!\d)
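A brief Python sketch of this anchored variant follows. The helper `leading_seven` and the last two sample strings are made up for illustration. Because of the ^ anchor, the pattern only finds seven digits at the very start of the string, and the lookahead only rejects a following digit, not a letter or a dash.

```python
import re
from typing import Optional

LEADING_SEVEN = re.compile(r"^\d{7}(?!\d)")


def leading_seven(s: str) -> Optional[str]:
    """Return the seven leading digits, or None if there is no such prefix."""
    m = LEADING_SEVEN.search(s)
    return m.group(0) if m else None


if __name__ == "__main__":
    samples = [
        "1234567 RE:TKT-2744870-R6P1G0: Gentle Reminder",
        "12345678 RE:TKT-2744870-R6P1G0: Gentle Reminder",
        "RE: 1234567 Gentle Reminder",  # hypothetical: digits not at the start
        "1234567-R6",                   # hypothetical: followed by a dash
    ]
    for s in samples:
        print(f"{s!r}: {leading_seven(s)}")
```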