Match list of incrementing integers using regex - regex

Is it possible to match a list of comma-separated decimal integers, where the integers in the list always increment by one?
These should match:
0,1,2,3
8,9,10,11
1999,2000,2001
99,100,101
These should not match (in their entirety - the last two have matching subsequences):
42
3,2,1
1,2,4
10,11,13

Yes, this is possible when using a regex engine that supports backreferences, conditionals and named subpattern calls such as (?&cons) (PCRE, for example).
First, the list of consecutive numbers can be decomposed into a list in which each adjacent pair of numbers is consecutive:
(?=(?&cons))\d+
(?:,(?=(?&cons))\d+)*
,\d+
Here (?=(?&cons)) is a placeholder for a predicate that ensures that two numbers are consecutive. This predicate might look as follows:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?:9(?= 9*,\g{x}\d (?<y>\g{y}?+ 0)))*
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
(?(y)\g{y})
# handle the 999 => 1000 case separately
| (?:9(?= 9*,1 (?<z>\g{z}?+ 0)))+
,1\g{z}
)\b)
For a brief explanation, the second case handling 999,1000 type pairs is easier to understand -- there is a very detailed description of how it works in this answer concerned with matching a^n b^n. The connection between the two is that in this case we need to match 9^n ,1 0^n.
The first case is slightly more complicated. The largest part of it handles the simple case of incrementing a decimal digit, which is relatively verbose due to the number of said digits:
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
The first block records that the digit is N by capturing it into group aN, and the second block then uses conditionals to check which of these groups was used. If group aN participated in the match, the next digit must be N+1.
The remainder of the first case handles cases like 1999,2000. This again falls into the pattern N 9^n, N+1 0^n, so this is a combination of the method for matching a^n b^n and incrementing a decimal digit. The simple case of 1,2 is handled as the limiting case where n=0.
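The two cases described above (N 9^n becomes N+1 0^n, and 9^n becomes 1 0^n) can also be written out as ordinary string manipulation, which is handy as a reference when testing the regex. This is only a sketch in Python, not part of the original answer; the names successor and is_incrementing_list are made up for illustration.

```python
def successor(digits: str) -> str:
    """Return the decimal string for digits + 1 using the two regex cases."""
    stripped = digits.rstrip("9")          # drop the trailing 9^n
    nines = len(digits) - len(stripped)    # n
    if not stripped:
        # Second case: 9^n -> 1 0^n (e.g. 999 -> 1000)
        return "1" + "0" * nines
    # First case: prefix x, a digit N in 0..8, then 9^n -> x, N+1, 0^n
    prefix, digit = stripped[:-1], stripped[-1]
    return prefix + str(int(digit) + 1) + "0" * nines


def is_incrementing_list(text: str) -> bool:
    """True if text is a comma-separated list of 2+ integers, each one more than the last."""
    parts = text.split(",")
    if len(parts) < 2:
        return False  # a single number such as 42 should not match
    if not all(p and all(c in "0123456789" for c in p) for p in parts):
        return False
    return all(successor(a) == b for a, b in zip(parts, parts[1:]))
```

The n=0 case (for example 1,2) falls out naturally because rstrip removes nothing and no zeros are appended.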
Complete regex: https://regex101.com/r/zG4zV0/1
Alternatively the (?&cons) predicate can be implemented slightly more directly if recursive subpattern references are supported:
(?<cons>\b(?:
(?<x>\d*)
(?:(?<a0>0)|(?<a1>1)|(?<a2>2)|(?<a3>3)|(?<a4>4)
|(?<a5>5)|(?<a6>6)|(?<a7>7)|(?<a8>8))
(?<y>
,\g{x}
(?(a0)1)(?(a1)2)(?(a2)3)(?(a3)4)(?(a4)5)
(?(a5)6)(?(a6)7)(?(a7)8)(?(a8)9)
| 9 (?&y) 0
)
# handle the 999 => 1000 case separately
| (?<z> 9,10 | 9(?&z)0 )
)\b)
In this case the two grammars 9^n ,1 0^n, n>=1 and prefix N 9^n , prefix N+1 0^n, n>=0 are pretty much just written out explicitly.
Complete alternative regex: https://regex101.com/r/zG4zV0/3

Related

Regex: match at least N number of search terms but with patterns dependent on position

My question is similar to that in regex: Match at least two search terms, but with added complexity:
Given a set of M numerical strings of same length:
11001100
11101010
10010010
00101101
And given substring patterns of the type "11 at position 0" or "10 at position 6" (with the position being any multiple of 2), how can I search for strings matching at least N of these patterns?
For example: ^(11|\d{2}10|\d{6}10) matches all strings. However if I add {3,} to the regex to match "11101010" only (because it satisfies three out of three of those OR cases), it fails. Does anyone know how I can structure a regex like this?
If it matters, the patterns can also cover the same substring position, so for example it could be (11|\d{6}10|\d{6}00), and this ideally would match both the first and second lines in my example if I wanted to only catch strings with two or more matches.
Is this the expected result?
(\b(11\d{6}|10\d{6}|\d{6}01)\n?){3,}
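If a single regex is not a hard requirement, one way to get "at least N of these patterns" is to test each positional pattern separately and count the hits. The following is a minimal Python sketch under that assumption; count_hits and matching_at_least are hypothetical helper names, and the patterns are the ones from the question, anchored at the start by re.match.

```python
import re


def count_hits(s, patterns):
    """Count how many of the start-anchored patterns match the string."""
    return sum(1 for p in patterns if re.match(p, s))


def matching_at_least(strings, patterns, n):
    """Return the strings that satisfy at least n of the patterns."""
    return [s for s in strings if count_hits(s, patterns) >= n]


strings = ["11001100", "11101010", "10010010", "00101101"]
patterns = [r"11", r"\d{2}10", r"\d{6}10"]
print(matching_at_least(strings, patterns, 3))
```

Because every pattern is checked independently, two patterns that cover the same position (such as \d{6}10 and \d{6}00) do not interfere with each other.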

Optimization of Regular Expression to match numbers bigger or equal to 50

I want to check if a number is 50 or more using a regular expression. This in itself is no problem but the number field has another regex checking the format of the entered number.
The number will be in the continental format: 123.456,78 (a dot between groups of three digits and always a comma with 2 digits at the end)
Examples:
100.000,00
50.000,00
50,00
34,34
etc.
I want to capture numbers which are 50 or more. So from the four examples above the first three should be matched.
I've come up with this rather complicated one and am wondering if there is an easier way to do this.
^(\d{1,3}[.]|[5-9][0-9]|\d{3}|[.]\d{1,3})*[,]\d{2}$
EDIT
I want to match continental numbers here. The numbers have this format due to internal regulations and specify a price.
Example: 1000 EUR would be written as 1.000,00 EUR
50000 as 50.000,00 and so on.
It's a matter of taste, obviously, but using a negative lookahead gives a simple solution.
^(?!([1-4]?\d),)[1-9](\d{1,2})?(\.\d{3})*,\d{2}\b
In words: starting at the beginning of the string, reject any number whose integer part is a single digit, or two digits with the first being 1, 2, 3 or 4, followed by a comma.
Check on regex101.com
Try (edited):
^(.{3,}|[5-9]\d),\d{2}$
It checks if:
there are 3 chars or more before the ,
there are 2 digits before the , and the first is between 5 and 9
and then a , and 2 digits
I don't know if it answers your question, as it will also return true for:
aa50,00
1sdf,54
But this assumes that your original string is a number in the format you expect (as it was not a requirement in your question).
EDIT 3
The regex below tests whether the number is valid in the continental format and whether it is equal to or greater than 50.
Regex: ^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$
Explanation (d is a number):
([1-9]\d{0,2}\.): either d., dd. or ddd. one time with the first d between 1 and 9.
(\d{3}\.){0,}: ddd. zero or more times
\d{3}: ddd, 3 digits
These 3 parts combined match any number equal to or greater than 1000, like 1.000, 22.002 or 100.000.000.
([1-9]\d{2}): any number between 100 and 999.
([5-9]\d): a digit between 5 and 9 followed by any digit. Matches anything between 50 and 99.
So it's either one of the parts above or this one.
Then ,\d{2}$ matches the comma and the two last digits.
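To see the three alternatives at work, the regex can be run against the examples from the question. This is a small Python sketch; the regex is copied unchanged from above, while the names AT_LEAST_50 and matches_at_least_50 are invented for the example.

```python
import re

# The "EDIT 3" regex, unchanged: thousands groups | 100-999 | 50-99, then ,dd
AT_LEAST_50 = re.compile(
    r"^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$"
)


def matches_at_least_50(text: str) -> bool:
    """True if text is a well-formed continental number that is 50 or more."""
    return AT_LEAST_50.match(text) is not None


for sample in ["100.000,00", "50.000,00", "50,00", "34,34", "aa50,00"]:
    print(sample, matches_at_least_50(sample))
```

Unlike the shorter ^(.{3,}|[5-9]\d),\d{2}$ variant, this one also rejects malformed input such as aa50,00.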
I have named all inner groups to make it easier to see which part of the number each group matches. Once you understand how it works, change all ?P<..> to ?:.
This one is for any dec number in the continental format.
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})*|0)(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})*,)|0,|,)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
This one is for the same with the limit number>=50
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})+|(?P<int_short>[1-9]\d{2}|[5-9]\d))(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})+,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
If the integer part is always at most 999.999 and the fractional part is always 2 digits, it becomes a bit simpler:
^(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})?,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d)(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_end>\d{1,2})))?$
If you can guarantee that the number is correctly formed -- that is, that the regex isn't expected to detect that 5,0.1 is invalid, then there are a limited number of passing cases:
ends with \d{3}
ends with [5-9]\d
contains \d{3},
contains [5-9]\d,
It's not actually necessary to do anything with \.
The easiest regex is to code for each of these individually:
(\d{3}$|[5-9]\d$|\d{3},|[5-9]\d,)
You could make it more compact and efficient by merging some of the cases:
(\d{3}|[5-9]\d)(,|$)
If you need to also validate the format, you will need extra complexity. I would advise against attempting to do both in a single regex.
However unless you have a very good reason for having to do this with a regex, I recommend against it. Parse the string into an integer, and compare it with 50.
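The parse-and-compare approach could look roughly like the Python sketch below. It assumes the input has already been validated as a continental number (dots as thousands separators, a comma before the decimals); parse_continental and is_at_least_50 are illustrative names, and Decimal is used instead of an integer so that the two decimal places are kept.

```python
from decimal import Decimal


def parse_continental(text: str) -> Decimal:
    """Convert e.g. '1.000,00' to Decimal('1000.00'); assumes well-formed input."""
    return Decimal(text.replace(".", "").replace(",", "."))


def is_at_least_50(text: str) -> bool:
    """Compare the parsed value against the threshold numerically."""
    return parse_continental(text) >= Decimal(50)


for sample in ["100.000,00", "50.000,00", "50,00", "34,34"]:
    print(sample, is_at_least_50(sample))
```

Changing the threshold is then a one-character edit rather than a rewrite of the pattern.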

Is it possible for a regex to identify all equal value subarrays?

An equal value subarray is a subarray containing one or more consecutive elements of the same value.
For example, let's say our array is:
1,1,3
There are four equal value sub-arrays:
[1], [1], [3], [1,1]
Note that elements can be part of more than one subarray.
I know [\d] matches digits, but this requirement is defeating me. I am asking for a regex solution out of curiosity.
There's no way to do this with one regex. In fact, I recommend that you use more than one version of the string.
This regex should work:
^(\d+)(,\1){n}\b
I've made some adjustments to ensure a more robust regex:
Allows for numbers greater than 10 (the trailing \b stops 1,1 from matching the start of 1,11)
Will only match at the start, ensuring the count is not thrown off
For an array of length 4, you should replace n with 0, 1, 2, 3. This means that you will have to match against four regexes.
(Note that n=0 is the same as ^(\d+))
Furthermore, you will have to "behead" the string, meaning that you would first match against 1,1,1,3 (new example) and then 1,1,3, and then 1,3, and then 3.
Fun fact: you can use a regex to behead the string (group 1 will have the beheaded string):
^\d+,(.*)
(Obviously, you will need to ensure that you're not trying to behead an array of size 1.)
For an array of size 4, you will need to match against 4+3+2+1=10 regexes. You should test to see if the regex matched; if it did, you know to increment your count by 1. (Note that 10 is the maximum number of consecutive combinations for an array of 4.)
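Put together, the match-then-behead loop might be sketched in Python as follows. This is an illustration of the procedure described above rather than a definitive implementation; count_equal_value_subarrays is a made-up name, and the pattern used here ends in \b so that 1,1 does not match the start of 1,11.

```python
import re


def count_equal_value_subarrays(csv: str) -> int:
    """Count equal-value subarrays of a comma-separated list of integers."""
    count = 0
    rest = csv
    while True:
        length = rest.count(",") + 1
        # Try every run length n+1 that could start at the head of the string.
        for n in range(length):
            if re.match(r"^(\d+)(,\1){%d}\b" % n, rest):
                count += 1
        # Behead: drop the first element, stop once only one element is left.
        beheaded = re.match(r"^\d+,(.*)", rest)
        if not beheaded:
            break
        rest = beheaded.group(1)
    return count


print(count_equal_value_subarrays("1,1,3"))
```

For an array of size 4 the inner loop runs 4+3+2+1=10 times in total, which is the 10 regex matches mentioned above.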
Here's an explanation of why you need to use more than one string. Take this regex:
(\d)(,?\1){n}
Again, n needs to be replaced. You would also need to use the g modifier (or its equivalent).
I'll use the example 1,1,1,1:
n=0 gives 4 matches
n=1 gives 2 matches
n=2 gives 1 match
n=3 gives 1 match
As you can see, it does not handle overlapping matches very well, because that's not how regex was designed.
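The counts listed above can be reproduced with a global search, here sketched in Python with re.finditer standing in for the g modifier. The helper name count_non_overlapping is an assumption made for this example.

```python
import re


def count_non_overlapping(text: str, n: int) -> int:
    """Count non-overlapping matches of (\\d)(,?\\1){n} in text."""
    pattern = r"(\d)(,?\1){%d}" % n
    return sum(1 for _ in re.finditer(pattern, text))


for n in range(4):
    print(n, count_non_overlapping("1,1,1,1", n))
```

Each match consumes its characters, so for n=1 the middle pair (the second and third 1) is never reported; that is the overlap the single-string approach misses.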

Regex for minimum and range of number

I am trying to create a regex for an exactly five-digit number which should be in the range 90000 – 96163.
I created a regex for exactly 5 digits
#"^\d{5}$"
Now how do I make sure that it is between the range of 90000 – 96163?
Anything smaller than 90001 and over 96162 should not work.
Thanks
This is most easily achieved using a regular numeric comparison (using < and > operators) in your language.
You can do a range check using regular expressions, but it's tedious to implement and anything but readable. For the sake of completeness, here's a possible pattern:
9([0-5][0-9]{3}|6(0[0-9]{2}|1([0-5][0-9]|6[0-3])))
Broken up, the pattern reads as follows:
9 # The first digit must be a 9
(
[0-5][0-9]{3} # Covering the range 90000-95999
|
6 # Matching 96xxx
(
0[0-9]{2} # Covering the range 96000-96099
|
1 # Matching 961xx
(
[0-5][0-9] # Covering the range 96100-96159
|
6[0-3] # Covering the range 96160-96163
)
)
)
Please don't do this if it can be avoided. Just consider what happens when the range boundaries change: Imagine you have to check whether a value is between 7243 and 132843 — not fun.
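For comparison, here is a hedged Python sketch of the numeric check next to the regex above. The pattern is copied from the answer; in_range_numeric and in_range_regex are illustrative names, and the inclusive bounds 90000 and 96163 are taken from the pattern's own comments.

```python
import re

RANGE_RE = re.compile(r"9([0-5][0-9]{3}|6(0[0-9]{2}|1([0-5][0-9]|6[0-3])))")


def in_range_numeric(text: str) -> bool:
    """Check the format with a trivial regex, then compare as a number."""
    return re.fullmatch(r"[0-9]{5}", text) is not None and 90000 <= int(text) <= 96163


def in_range_regex(text: str) -> bool:
    """Check format and range with the single hand-built pattern."""
    return RANGE_RE.fullmatch(text) is not None


# Exhaustive cross-check over every five-digit string
assert all(
    in_range_regex(f"{i:05d}") == in_range_numeric(f"{i:05d}")
    for i in range(100000)
)
```

If the boundaries change, only the two integers in in_range_numeric need to be edited.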
Digit by digit:
/(?!^90000$)(^9([0-5]\d{3}|6(0\d{2}|1([0-5]\d|6[0-2]))))$/
(9[0-5][0-9]{3}|960[0-9]{2}|961[0-5][0-9]|9616[0-3])
http://gamon.webfactional.com/regexnumericrangegenerator/
This generator should be enough to produce a regex for any numeric range, so there is no need to ask about individual range problems.

Regex Verification of String in Correct Order with Delimiters in PHP

I'm trying to make an expression to verify that the string supplied is in a valid format, but it seems that if I don't use regex for a few months, I forget everything I learned and have to relearn it.
My expression is supposed to match a format like this: 010L0404FFCCAANFFCC00M000000XXXXXX
The four delimiters (L, N, K, M), which aren't in the 0-9A-F hexadecimal range so that they stay unique, must appear in that order or not at all. Each delimiter can only exist once!
It breaks down to this:
Starts off with a 3-digit number, which is simply ^([0-9]{3}) and is always required
Second set begins with L, and must be 2 digits + 2 digits + 6 hexadecimal characters and is optional
Third set begins with N and must be 6 hexadecimal characters and is optional
The fourth set K is simply any amount of digits and is optional
The fifth set is M and can be any 6 hexadecimal characters or XXXXXX to indicate nothing; its length must be a non-zero multiple of 6, like 336699 (6) or 336699XXXXXXFFCC00 (18), and it is optional
The hardest part, which I can't figure out, is making it require the sets in that order, and in multiples: for example, the L delimiter must always come before K if it's there (so that I don't get variations of the same string which mean the same thing with the delimiters swapped). I can already parse it; I just want to verify the string is in the correct format.
Any help would be appreciated, thanks.
Requiring the order isn't too bad. Just make each set optional. The regex will still match in order, so if the L section, for example, isn't there and the next character is N, it won't let L occur later since it won't match any of the rest of the regex.
I believe a direct translation of your requirements would be:
^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6}|X{6})+)?$
No real tricks, just making each group optional except for the first three digits, and adding an internal alternative for the two patterns of six digits in the M block.
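A quick way to convince yourself that the optional groups really enforce the order is to run the pattern against a few strings. The sketch below uses Python rather than PHP purely for illustration (the pattern itself is unchanged and uses no engine-specific syntax); FORMAT_RE and is_valid are names invented for the example.

```python
import re

FORMAT_RE = re.compile(
    r"^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?"
    r"(M([0-9A-F]{6}|X{6})+)?$"
)


def is_valid(code: str) -> bool:
    """True if the sets appear at most once each and in L, N, K, M order."""
    return FORMAT_RE.match(code) is not None


print(is_valid("010L0404FFCCAANFFCC00M000000XXXXXX"))  # example from the question
print(is_valid("010NFFCC00L0404FFCCAA"))               # N before L
```

Once the regex has moved past the optional L group it can never go back to it, so an L that shows up after N is left over as unmatched text and fails against $.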
^([0-9]{3})(L[0-9]{4}[0-9A-F]{6})?(N[0-9A-F]{6})?(K[0-9]+)?(M([0-9A-F]{6})+|MX{6})$