Regex - Why does the question mark behave like this?

I'm learning regex. When I match this:
\d[^\w]\d
on this
30-01-2003 15:20
I get 4 matches: 0-0, 1-2, 3 1, and 5:2.
When I try adding a question mark at the end of the regex (\d[^\w]\d?), my matches don't change.
When I move the question mark to after the square bracket (\d[^\w]?\d), the matches are now 30, 01, 20, 03, 15, and 20.
When I move the question mark to before the square bracket (\d?[^\w]\d), my matches are the same as in the first case.
Why is this? I know the ? operator marks the preceding character as optional, so I expected the behaviour in the second case, but not in the first or third.

Because ? is a greedy match. It will attempt to consume as much as possible. So, if a \d is present, it will always grab it.
Think of the ? at the end as defining two regexes: \d[^\w]\d and \d[^\w]. In your test case, you never have a match where the first regex doesn't match and the second one does (without overlaps, again, it's greedy). That's why your matches never change. If, however, you changed your test case to this:
30-01-2003 15:20/
you'll get an extra match of 0/ when the question mark is at the end of the regex, and no such match without it.
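A quick way to check this is to run both patterns over both strings. The sketch below uses Python's re module (Python is an assumption here, since the question does not name a language), and the variable names are only illustrative.

```python
import re

plain = "30-01-2003 15:20"
with_slash = "30-01-2003 15:20/"

base = r"\d[^\w]\d"        # original pattern
optional = r"\d[^\w]\d?"   # trailing digit made optional

# On the original string the optional digit is always available, so nothing changes.
plain_base = re.findall(base, plain)
plain_optional = re.findall(optional, plain)

# With a trailing "/", only the optional version can match "0/".
slash_base = re.findall(base, with_slash)
slash_optional = re.findall(optional, with_slash)

print(plain_base)      # ['0-0', '1-2', '3 1', '5:2']
print(plain_optional)  # ['0-0', '1-2', '3 1', '5:2']
print(slash_base)      # ['0-0', '1-2', '3 1', '5:2']
print(slash_optional)  # ['0-0', '1-2', '3 1', '5:2', '0/']
```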

Your first and third cases produce the same results as the original only because of the particular string you're searching - they are NOT equivalent searches in general. Specifically, every occurrence of \d[^\w] in your string happens to be followed by a digit, so making the trailing digit optional does not change any of the matches. Likewise, every occurrence of [^\w]\d happens to be preceded by a digit. If your string had two spaces together, or a doubled punctuation mark somewhere, the results would differ for each case.
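To see the three patterns diverge, here is a small sketch in Python (an assumed language; the test string with two spaces is made up for illustration).

```python
import re

doubled = "30-01-2003  15:20"  # hypothetical input: two spaces between date and time

original = re.findall(r"\d[^\w]\d", doubled)
trailing_optional = re.findall(r"\d[^\w]\d?", doubled)
leading_optional = re.findall(r"\d?[^\w]\d", doubled)

print(original)           # ['0-0', '1-2', '5:2']
print(trailing_optional)  # ['0-0', '1-2', '3 ', '5:2']
print(leading_optional)   # ['0-0', '1-2', ' 1', '5:2']
```

Each variant now finds something the others do not: "3 " has no trailing digit and " 1" has no leading digit.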

You just need this.
-Two Solutions-
1. REGEXP:
\d+
1. Explanation:
\d => a digit
+ => 1 or more
2. REGEXP:
[0-9]+
2. Explanation:
[0-9] => a digit
+ => 1 or more
Either solution matches every run of digits.
Original Text:
30-01-2003 15:20
Result:
30
01
2003
15
20
Enjoy.
See: https://regex101.com/r/xXaLgN/6
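As a rough check of both solutions, here is a Python sketch (the language is an assumption; the regexes are the ones given above).

```python
import re

text = "30-01-2003 15:20"

# \d+ and [0-9]+ both match each run of one or more digits in this string.
with_backslash_d = re.findall(r"\d+", text)
with_char_class = re.findall(r"[0-9]+", text)

print(with_backslash_d)  # ['30', '01', '2003', '15', '20']
print(with_char_class)   # ['30', '01', '2003', '15', '20']
```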

Related

Negate a character group to replace all other characters

I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?
The reason the one you have isn't working is that you are asserting that the line starts right before the hours and minutes, which isn't the case. This can be fixed by removing the assertion (^).
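The effect of the anchor can be reproduced outside Kotlin. The following is a minimal sketch in Python (a swapped-in language, chosen only for illustration) using the same pattern with and without ^.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# With ^ the time would have to sit at the very start of the string.
anchored = re.search(r"^([0-9][0-9]:[0-9][0-9])", text)

# Without ^ the engine is free to find the first hh:mm anywhere.
unanchored = re.search(r"([0-9][0-9]:[0-9][0-9])", text)

print(anchored)             # None
print(unanchored.group(1))  # 22:00
```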
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind is basically just telling the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This will match all word characters or spaces between the beginning of the line and where the pattern that comes after it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note you will have to escape the \w in order for it to be in a string and not be angry about it).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.
You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.
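For readers not using Kotlin, the difference between the two regexes can be sketched in Python (a swapped-in language; the variable names and the "99:99:99" input are made up for illustration).

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

validating = re.compile(r"\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b")
simple = re.compile(r"\b(\d{1,2}:\d{2}):\d{2}\b")

m1 = validating.search(text)
m2 = simple.search(text)
first = m1.group() if m1 else None    # whole match
second = m2.group(1) if m2 else None  # Group 1 only
print(first, second)  # 22:00 22:00

# On an implausible time the validating regex finds nothing,
# while the simple one still captures the first two pairs.
bad = "99:99:99"
bad_validating = validating.search(bad)
bad_simple = simple.search(bad)
print(bad_validating)       # None
print(bad_simple.group(1))  # 99:99
```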
I am not sure what language you are using, but why use negation when you can directly match the first digits in the hh:mm format?
This assumes the date string always contains an hh:mm in that format.
This regex snippet should have the first group match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)

How to select with regex this character?

For example, I have these four IP addresses:
10.100.0.11; wrong
10.100.1.12; good
10.100.11.4; good
10.100.44.1; wrong
The task has simple rules: the 3rd place can't be 0, and the 4th place can't be a lone 1.
I need to select them from IP tables on different routers, and I know only these rules.
My solution:
^(10.100.[1-9]{1,3}.[023456789]{1,3})$
but in this case every number containing a 1, like 10 or 100, is rejected, so this solution is wrong.
^(10.100.[1-9]{1,3}.[1-9]{2,3})$
This solves the problem of the single 1, but creates another one.
From the rules you have given, this regex should work:
10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})
It also matches a 3rd-position number that contains a 0, and even one that starts with 0 as long as it has two or more digits,
so 10.100.10.4 will match, as will 10.100.02.4.
I don't know if that's the intended behavior since I'm not familiar with IP addresses.
The last part \.([^1]$|\d{2,}) reads like this:
"after the 3rd dot is either
a character which is not 1 followed by the end of the line
or two or more digits"
If you want to avoid matching malformed strings that contain a non-digit character, like 10.100.12.a, you should replace [^1] with [023456789] or, lazier (and therefore better ;), with [02-9]
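The effect of that replacement can be checked with a short Python sketch (Python and the names loose and strict are assumptions made for illustration; the regexes are the ones discussed above).

```python
import re

loose = re.compile(r"10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})")
strict = re.compile(r"10\.100\.([1-9]\d*|\d{2,})\.([02-9]$|\d{2,})")

samples = ["10.100.0.11", "10.100.1.12", "10.100.11.4", "10.100.44.1", "10.100.12.a"]

# Map each sample to whether the pattern finds a match in it.
loose_hits = {s: bool(loose.search(s)) for s in samples}
strict_hits = {s: bool(strict.search(s)) for s in samples}

for s in samples:
    print(s, loose_hits[s], strict_hits[s])
# 10.100.0.11 False False
# 10.100.1.12 True True
# 10.100.11.4 True True
# 10.100.44.1 False False
# 10.100.12.a True False
```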
I use https://regex101.com to debug regex. It's just awesome.
You might use
^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$
The pattern matches
^ Start of string
10\.100\. Match 10.100. (note to escape the dot to match it literally)
[1-9]{1,3} Match 1 to 3 digits in the range 1-9
\. Match a dot
(?: Non capture group
[02-9] Match a digit 0 or 2-9
| Or
\d{2,3} Match 2 or 3 digits 0-9
) Close the group
$ End of string
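The pattern breakdown above can be tried against the four addresses from the question. This is a sketch in Python (an assumed language), with one extra made-up address to show how a 0 inside the third number is treated.

```python
import re

pattern = re.compile(r"^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$")

addresses = [
    "10.100.0.11",  # wrong: 3rd place is 0
    "10.100.1.12",  # good
    "10.100.11.4",  # good
    "10.100.44.1",  # wrong: 4th place is a lone 1
    "10.100.10.4",  # extra case: 3rd place contains a 0
]

results = {a: bool(pattern.match(a)) for a in addresses}
for address, ok in results.items():
    print(address, ok)
# 10.100.0.11 False
# 10.100.1.12 True
# 10.100.11.4 True
# 10.100.44.1 False
# 10.100.10.4 False
```

Note that [1-9]{1,3} rejects any third number containing a 0, such as 10, which may or may not be what the rules intend.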

Using regex to match numbers which have 5 increasing consecutive digits somewhere in them

First off, this has sort of been asked before. However I haven't been able to modify this to fit my requirement.
In short: I want a regex that matches an expression if and only if it only contains digits, and there are 5 (or more) increasing consecutive digits somewhere in the expression.
I understand the logic of
^(?=\d{5}$)1*2*3*4*5*6*7*8*9*0*$
however, this limits the expression to 5 digits. I want there to be able to be digits before and after the expression. So 1111345671111 should match, while 11111 shouldn't.
I thought this might work:
^[0-9]*(?=\d{5}0*1*2*3*4*5*6*7*8*9*)[0-9]*$
which I interpret as:
^$: The entire expression must only contain what's between these 2 symbols
[0-9]*: Any digits between 0-9, 0 or more times followed by:
(?=\d{5}0*1*2*3*4*5*6*7*8*9*): A part where at least 5 increasing digits are found followed by:
[0-9]*: Any digits between 0-9, 0 or more times.
However this regex is incorrect, as for example 11111 matches. How can I solve this problem using a regex? So examples of expressions to match:
00001459000
12345
This shouldn't match:
abc12345
9871234444
While this problem can be solved using pure regular expressions (the set of strictly ascending five-digit strings is finite, so you could just enumerate all of them), it's not a good fit for regexes.
That said, here's how I'd do it if I had to:
^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$
Core idea: 0?1?2?3?4?5?6?7?8?9? matches an ascending numeric substring, but it doesn't restrict its length. Every single part is optional, so it can match anything from "" (empty string) to the full "0123456789".
We can force it to match exactly 5 characters by combining a look-ahead of five digits and an arbitrary suffix (which we capture) and a backreference \1 (which must match exactly the suffix captured by the look-ahead, ensuring we've now walked ahead 5 characters in the string).
Live demo: https://regex101.com/r/03rJET/3
(By the way, your explanation of (?=\d{5}0*1*2*3*4*5*6*7*8*9*) is incorrect: It looks ahead to match exactly 5 digits, followed by 0 or more occurrences of 0, followed by 0 or more occurrences of 1, etc.)
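The pattern can be checked against the examples from the question. This is a sketch in Python (an assumed language; Python's re module supports both the capturing look-ahead and the backreference used here).

```python
import re

pattern = re.compile(r"^\d*(?=\d{5}(\d*)$)0?1?2?3?4?5?6?7?8?9?\1$")

should_match = ["1111345671111", "00001459000", "12345"]
should_fail = ["11111", "abc12345", "9871234444"]

matched = [s for s in should_match + should_fail if pattern.search(s)]
print(matched)  # ['1111345671111', '00001459000', '12345']
```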
Because the starting position of the increasing digits isn't known in advance, and the increasing digits don't necessarily end at the end of the string, the linked answer's concise pattern won't work here. I don't think this is possible without being repetitive: alternate between all possibilities of increasing digits. A 0 must be followed by [1-9], written 0(?=[1-9]). A 1 must be followed by [2-9]. A 2 must be followed by [3-9], and so on. Alternate between these possibilities in a group, repeat that group four times, and then match any digit after that (the lookahead in the fourth repetition ensures that this 5th digit is in sequence as well).
First lookahead for digits followed by the end of the string, then match the alternations described above, followed by one or more digits:
^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|2(?=[3-9])|3(?=[4-9])|4(?=[5-9])|5(?=[6-9])|6(?=[7-9])|7(?=[89])|8(?=9)){4}\d+
Separated out for better readability:
^(?=\d+$)\d*?
(?:
0(?=[1-9])|
1(?=[2-9])|
2(?=[3-9])|
3(?=[4-9])|
4(?=[5-9])|
5(?=[6-9])|
6(?=[7-9])|
7(?=[89])|
8(?=9)
){4}
\d+
The lazy quantifier \d*? in the first line isn't necessary, but it makes the pattern a bit more efficient (otherwise \d* initially greedily matches the whole string, requiring lots of failing alternations and backtracking until it reaches a position at least 5 characters before the end of the string).
https://regex101.com/r/03rJET/2
It's ugly, but it works.
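Here is a sketch in Python (an assumed language) that runs the single-line pattern against the examples from the question; the list names are only illustrative.

```python
import re

pattern = re.compile(
    r"^(?=\d+$)\d*?(?:0(?=[1-9])|1(?=[2-9])|2(?=[3-9])|3(?=[4-9])|4(?=[5-9])"
    r"|5(?=[6-9])|6(?=[7-9])|7(?=[89])|8(?=9)){4}\d+"
)

should_match = ["1111345671111", "00001459000", "12345"]
should_fail = ["11111", "abc12345", "9871234444"]

accepted = [s for s in should_match + should_fail if pattern.search(s)]
print(accepted)  # ['1111345671111', '00001459000', '12345']
```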

Decoding a regex... I know what its function is but I want to understand exactly what is happening

I have a regular expression that I'm going to be using to verify that an inputted number is in standard U.S. telephone format (i.e. (###) ###-####). I am new to regex and still having some trouble figuring out the exact function of each character. If someone would go through this piece by piece and verify that I am understanding it, I would really appreciate it. Also, if the regex is wrong I would obviously like to know that.
\D*?(\d\D*?){10}
What I think is happening:
\D*?( indicates an escape sequence for the parenthesis metacharacter... not sure why the \D*? is necessary
\d indicating digits
\D*? indicating there is a non-digit character (-) followed by the closing parenthesis.
{10} for the 10 digits
I feel very unsure explaining this, like my understanding is very vague in terms of why the regex is in the order that it is etc. Thanks in advance for help/explanations.
EDIT
It seems like this is not the best regex for what I want. Another possibility was [(][0-9]{3}[)] [0-9]{3}-[0-9]{4}, but I was told this would fail. I suppose I'll have to do a little more work with regular expressions to figure this out.
\D matches any non-digit character.
* means that the previous item is repeated 0 or more times.
*? means that the previous item is repeated 0 or more times, but as few times as possible (a lazy quantifier): the engine extends the repetition only when the rest of the regex cannot match otherwise. It is a bit difficult perhaps at the start, but in your regex, the next item is \d, meaning \D*? will match the fewest non-digit characters needed to reach the next \d character.
( ... ) is a capture group, and is also used to group things. For instance {10} means that the previous character or group is repeated 10 times exactly.
Now, \D*?(\d\D*?){10} will match 10 digits, with or without leading non-digit characters, and with non-digit characters in between the digits if they are present.
[(][0-9]{3}[)] [0-9]{3}-[0-9]{4}
This regex is a bit better since it doesn't just accept anything (like the first regex does) and will match the format (###) ###-#### (notice the space is a character in regex!).
The new things introduced here are the square brackets. These represent character classes. [0-9] means any character between 0 and 9 inclusive, which means it will match 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. Adding {3} after it makes it match that character class 3 times, and since this character class contains only digits, it will match exactly 3 digits.
A character class can be used to escape certain characters, such as ( or ) (note I mentioned earlier they are for capturing groups, or grouping) and thus, [(] and [)] are literal ( and ) instead of being used for capturing/grouping.
You can also use backslashes (\) to escape characters. Thus:
\([0-9]{3}\) [0-9]{3}-[0-9]{4}
Will also work. I would also recommend the use of line anchors ^ and $ if you're only trying to see if a phone number matches the above format. This ensures that the string has only the phone number, and nothing else. ^ matches the beginning of a line and $ matches the end of a line. Thus, the regex will become:
^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$
However, I don't know all the combinations of the different formats of phone numbers in the US, so this regex might need some tweaking if you have different phone number formats.
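A quick sketch of the anchored regex in Python (an assumed language; the phone numbers are made-up examples):

```python
import re

phone = re.compile(r"^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$")

candidates = [
    "(555) 123-4567",       # expected format
    "555-123-4567",         # no parentheses
    "call (555) 123-4567",  # extra text before the number
    "(555) 123-45678",      # one digit too many
]

valid = [c for c in candidates if phone.match(c)]
print(valid)  # ['(555) 123-4567']
```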
\D is "not a digit"; \d is "digit". With that in mind:
This matches zero or more non-digits, then it matches a digit and any number of non-digit characters 10 times. This won't actually verify that the number is formatted properly, just that it contains 10 digits. I suspect that the regex isn't what you want in the first place.
For example, the following will match your regex:
this is some bad text 1 and some more 2 and more 34567890
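This is easy to confirm with a small sketch in Python (an assumed language); the 9-digit string is a made-up counterexample.

```python
import re

loose = re.compile(r"\D*?(\d\D*?){10}")

bad_text = "this is some bad text 1 and some more 2 and more 34567890"

accepts_bad_text = loose.search(bad_text) is not None
accepts_phone = loose.search("(555) 123-4567") is not None
accepts_nine_digits = loose.search("123456789") is not None

print(accepts_bad_text)     # True: the text contains 10 digits
print(accepts_phone)        # True
print(accepts_nine_digits)  # False: only 9 digits
```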
\D matches a character that is not a digit
* repeats the previous item 0 or more times
? after * makes the repetition lazy (it matches as few characters as possible)
\d matches a digit
so your group matches a digit followed by optional non-digits, and it is repeated 10 times

Why does this regex not match

I'm wondering why the following regex works for some strings and does not work for some others:
/^([0-3]+)(?!4|.*5)[0-9]+$/
1151 -> this does not match
1141 -> this does match, but why? since I can consider .* as empty and the regex becomes /^([0-3]+)(?!4|5)[0-9]+$/
I think that I'm misunderstanding the way the look-ahead works...
Let's look at how the regular expression would parse your string, step by step.
^([0-3]+)(?!4|.*5)[0-9]+$
First, some clarification. (?!4|.*5) is a negative look-ahead that checks if either 4 or .*5 follow the last consumed character. If it does, the current match fails and steps back. It could also be written as (?!(4|.*5)) if you wanted it to be slightly more clear about how exactly | affects it.
Let's start by looking at 1141
First, [0-3]+ consumes as many characters as possible, so it will consume up to and including the 11 in 1141. What's leftover is 41. The regular expression now checks to see if 4 is after the current characters, and since ?! is a negative look-ahead, the match will fail if it is found. Since 4 follows 11, the match fails and the regular expression steps backwards and tries again.
Instead of matching two 1s, it now attempts a shorter match and matches a single 1, with 141 left over. (?!4|.*5) checks whether 4 is the next character, and what do you know, it's not there. There is no 5 anywhere ahead either, so the .*5 branch fails too. The regex leaves the negative look-ahead since neither branch matched, and continues on to the rest of the regular expression. 141 is matched by the final [0-9]+, and thus the entire 1141 string is matched. Remember that look-arounds do not consume characters.
Now let's look at 1151
The same thing happens as last time: 11 is consumed and we have 51 left over. Now we look at the negative look-ahead, and evaluate the rest of the string against it. Obviously, 4 is nowhere in this string so we can ignore that, so let's look at .*5.
So the look-ahead .*5 tries to match 51. If it does end up matching, just as before the match will fail and the regular expression will step back. Now if you know any regex at all, it is obvious that .*5 will match the beginning of 51 since .* can evaluate to empty.
So we step back, and now we've matched a single 1 instead of both, and we're at the negative look-ahead again.
We have currently consumed 1, still have 151 left to match, and are on the (?!4|.*5) portion of the regex. Here, 4 is obviously nonexistent in our string so it is not going to match, so let's look at .*5 again.
.*5 will match a portion of 151 since .* will consume the first 1, and the 5 will finish off by matching 5. This should also be obvious if you know regex.
So we've made a match in a negative look-ahead again, which is bad... so we step back again. We have no more digits to give back from [0-3]+, and since a + can't match 0 digits, the entire string fails to match the regular expression.
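The walkthrough can be verified by inspecting what the first group ends up holding. This is a sketch in Python (an assumed language; the question does not say which regex flavour is in use).

```python
import re

pattern = re.compile(r"^([0-3]+)(?!4|.*5)[0-9]+$")

m_1141 = pattern.match("1141")
m_1151 = pattern.match("1151")

# After backtracking, [0-3]+ keeps only the first "1" of 1141.
print(m_1141.group(1))  # 1
# For 1151 every backtracking position still has a 5 ahead, so there is no match.
print(m_1151)           # None
```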
1141 matches because the regular expression engine can backtrack from matching 11 with the [0-3]+ to just matching the first 1, leaving the remaining digits to be matched by the [0-9]+.
As the next character after the first 1 is 1 and not 4, and no 5 appears later in the string, neither branch of the negative look-ahead prevents the match.
The 1151 does not match because the negative look-ahead with the added .* prevents it.
With the added .* put before the 5 the look-ahead now means 'don't match if the next character is 4, or after any number of any characters the next character is 5' (ignoring newlines).
So even if the engine backtracks to make [0-3]+ match just the first 1 of 1151, there is still a 5 ahead in the string, so a match is prevented.
Remember that look-aheads and look-behinds are zero-width.
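To make the role of .* concrete, the sketch below compares the original look-ahead with the simplified (?!4|5) from the question, in Python (an assumed language; the names are illustrative).

```python
import re

with_dot_star = re.compile(r"^([0-3]+)(?!4|.*5)[0-9]+$")
without_dot_star = re.compile(r"^([0-3]+)(?!4|5)[0-9]+$")

# (?!4|5) only inspects the very next character, so backtracking to a
# single leading 1 is enough to get past it for both inputs.
simple_1141 = without_dot_star.match("1141") is not None
simple_1151 = without_dot_star.match("1151") is not None

# (?!4|.*5) also rejects a 5 anywhere later in the string.
full_1141 = with_dot_star.match("1141") is not None
full_1151 = with_dot_star.match("1151") is not None

print(simple_1141, simple_1151)  # True True
print(full_1141, full_1151)      # True False
```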
If you want it to match 4 or 5 the best option would be
/^[0-3]+[45][0-9]+$/
but without a better explanation of what it's supposed to do it's hard to suggest anything more than that...
What regex flavour is that?
/^([0-3]+)(?!4|.*5)[0-9]+$/
Honestly the only way I would see it match 1141 and not 1151 is if the (?!4|.*5) part of the regex would be evaluated as NOT 4 or .* followed by 5. If it was that case then the regex engine would fail to find a match for 1141 as it would match the 4 but would miss the 5 to make the inner match complete.
However, usually the alternation would be understood as 4 or .*5 - which is still not equivalent to 4 or 5, because the expression .* can prove quite powerful in case when the engine wants to make a match work.
What are you testing the expression in?