Regex Quantifier Which Number of Occurrence Gets Tested First? - regex

I'm completely new to regex and recently started learning it. Here's a part of my test string from which I'd like to find matches.
24 bit:
Black #000000
12 bit:
Black #000
My question is the following. When I use the regex #(\w{1,2}), the group matches 00 in both 24-bit Black and 12-bit Black. However, when I use the regex #(\w{1,2})\1\1, the group matches 00 in 24-bit Black but 0 in 12-bit Black. Although I'm not familiar with how regex works, I'm curious what the logic behind this is. When I use the curly-brace quantifier {a,b} to indicate a <= (# occurrences) <= b, which of the numbers a, a+1,...,b is tried first? For example, with #(\w{1,2}) it seems 2 occurrences are tried first. But after adding \1\1, it seems that regex was somehow able to see that using 1 occurrence instead of 2 would result in matching 12-bit Black?

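The quantifier {1,2} is greedy, so the engine tries 2 occurrences first and only gives a character back when the rest of the pattern fails to match. The pattern #(\w{1,2})\1\1 can match both #000000 and #000 because \w{1,2} can backtrack one position so that the backreferences \1\1 still fit.
You can make the pattern a bit more specific: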
#([0-9a-fA-F]{1,2})\1\1
Or if there should be no surrounding non whitespace characters:
(?<!\S)#([0-9a-fA-F]{1,2})\1\1(?!\S)
See a regex101 demo.
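The backtracking described above can be observed directly. The following is a minimal sketch in Python (the language choice and the variable names are assumptions for illustration, not part of the original answer); it only relies on the standard `re` module.

```python
import re

# Greedy {1,2}: the engine tries two word characters first,
# then falls back to one if the rest of the pattern cannot match.
with_backrefs = re.compile(r"#(\w{1,2})\1\1")
without_backrefs = re.compile(r"#(\w{1,2})")

long_form = with_backrefs.search("Black #000000")
short_form = with_backrefs.search("Black #000")
greedy_only = without_backrefs.search("Black #000")

print(long_form.group(1))    # 00  (two characters fit three times)
print(short_form.group(1))   # 0   (backtracked from "00" to "0")
print(greedy_only.group(1))  # 00  (nothing forces a backtrack)
```

In the 12-bit case the group first captures 00, the backreference \1 then fails because only one 0 is left, and the engine retries with a single 0.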

Related

Regex for all strings over L={0,1} ending with an even number of 0s

Sorry for the novice question, but I've looked and can't seem to find a question that addresses this.
I want a regex that describes all strings over L={0,1} ending with an even number of 0s.
Examples: 00, 0000, 100, 0100, 001100... basically anything starting with 0 or 1 and ending with an even number of 0s
This is what I've got so far: ((0|1)*1)00+ but this doesn't allow me to get 00 since that must be a 1 always. I can't find a way to put as many 0s as I want at the beginning without having to put that 1.
Thanks a lot.
You could write the pattern as:
^([01]*1)?(00)+$
^ Start of string
( Capture group
[01]*1 Match zero or more repetitions of either 0 or 1 followed by matching 1
)? Close the group and make it optional using ?
(00)+ Match one or more repetitions of 00
$ End of string
See a Regex demo.
If supported, you can also use non-capturing groups (?:...) instead, for example ^(?:[01]*1)?(?:00)+$
An even number of 0s is (00)*. It needs to be at the end, so that part of the regex will be (00)*$.
What precedes that even number of 0s? Either nothing or an arbitrary sequence of 0s and 1s ending with a 1. So that's (|[01]*1).
Putting it together, we have:
^(|[01]*1)(00)*$
(I'm assuming extended regex syntax, where (, ), and | don't have to be escaped. Adjust the syntax as needed.)
I have not tested this.
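A quick check of both suggested patterns, sketched in Python (the sample list and the helper name `check` are made up for illustration). The two patterns differ only on inputs with zero trailing 0s: the second one treats zero as an even count, so it also accepts the empty string and strings ending in 1.

```python
import re

first = re.compile(r"^([01]*1)?(00)+$")   # at least one pair of trailing 0s
second = re.compile(r"^(|[01]*1)(00)*$")  # zero trailing 0s also counts as even

def check(s):
    """Return (matches_first, matches_second) for the string s."""
    return bool(first.search(s)), bool(second.search(s))

for s in ["00", "0000", "100", "0100", "001100", "0", "1000", "", "1"]:
    print(repr(s), check(s))
```

Which behavior is wanted for the empty string and for strings such as 1 depends on whether "an even number of 0s" is meant to include zero.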

Negate a character group to replace all other characters

I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?
The reason your pattern isn't working is that the ^ asserts that the line starts right before the hours and minutes, which isn't the case. This can be fixed by removing the assertion (^).
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind basically tells the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This covers all word characters or spaces between the beginning of the line and the point where the pattern that follows it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note that you will have to escape the backslash in \w for it to be accepted inside a regular string literal).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.
You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00. See regex #1 demo and regex #2 demo.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
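(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3 (note that hour 0 or 00 is not matched; replace 0?[1-9]|1\d with [01]?\d if midnight hours must match)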
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.
I am not sure what language you are using, but why use negation when you can directly match the first digits in the hh:mm format?
Assuming that the date string always contains a hh:mm part, this regex snippet should have the first group match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)
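The same two ideas (capture the hh:mm part directly, or replace everything around it) can be sketched in Python rather than Kotlin. The variable names are illustrative, and the second pattern is an assumption built from the answers above, not something the original posters wrote.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# Approach 1: find hh:mm:ss and keep only the first two parts (group 1).
m = re.search(r"\b(\d{1,2}:\d{2}):\d{2}\b", text)
found = m.group(1) if m else None

# Approach 2: replace the whole line with the captured hh:mm,
# which is the "replace everything else" behavior the question asked for.
replaced = re.sub(r"^.*?(\d\d:\d\d).*$", r"\1", text)

print(found)     # 22:00
print(replaced)  # 22:00
```

The lazy .*? in the second approach makes sure the first hh:mm in the line is the one that gets captured.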

Positive Lookbehind greedy

I think I have some misunderstanding about how a positive Lookbehind works in Regex, here is an example:
12,2 g this is fully random
89 g random string 2
0,6 oz random stuff
1 really random stuff
Let's say I want to match everything after the measuring unit, so I want "this is fully random", "random string 2", "random stuff" and "really random stuff".
In order to do that I tried the following pattern:
(?<=(\d(,\d)?) (g|oz)?).*
But as "?" means 0 or 1, it seems that the pattern prioritizes 0 over 1 in that case, so the measuring unit ends up in the match (for example, "g this is fully random" for the first line).
But the measuring unit has to stay "optional" as it won't necessarily be in the string (cf. the fourth instance)...
Any idea on how to deal with that issue? Thanks!
It is easier to look at the positions that it matches to see what happens. The assertion (?<=(\d(,\d)?) (g|oz)?) is true at any position where what is directly to the left is (\d(,\d)?), a space, and an optional (g|oz).
The pattern goes from left to right, and the assertion is true at multiple places. Because the unit is optional, the first such place is right after the space that follows the number, before the unit. There it matches .* (0 or more times any char), which runs until the end of the line.
See the positions on regex101
What you might do instead is match the digit part and make the space followed by g or oz optional and use a capturing group for the second part.
\d+(?:,\d+)?(?: g| oz)? (.*)
Regex demo
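A short Python sketch of the capturing-group approach applied to the four sample lines from the question (the list and function names are illustrative only):

```python
import re

pattern = re.compile(r"\d+(?:,\d+)?(?: g| oz)? (.*)")

lines = [
    "12,2 g this is fully random",
    "89 g random string 2",
    "0,6 oz random stuff",
    "1 really random stuff",
]

def text_after_unit(line):
    """Return the part after the number and optional unit, or None."""
    m = pattern.search(line)
    return m.group(1) if m else None

for line in lines:
    print(text_after_unit(line))
```

Because the optional unit is consumed by the match itself instead of being checked in a lookbehind, the greedy ? takes the unit whenever it is present.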

Regex extracting specific text ignoring part of result

I have a text.
The overall system blur must be smaller than 12 mrad and diameter shall be less than 20 meters.
I want to extract:
must be 12
I'm using:
^(.*?(\bmust be\b[\s*?\D]*\d+)[^$]*)$
And I get
must be smaller than 12
Any way to do this directly? Or better to try to do different groups somehow?
In the pattern that you tried, \D matches any char except a digit, and whitespace, * and ? are all non-digits already, so [\s*?\D]* could be shortened to just \D*.
This part [^$]* matches any char except a dollar sign. If you intend to match the rest of the line, you can use .* instead.
You can use 2 capturing groups instead.
^.*?\b(must be)\b\D*(\d+).*$
Regex demo
In the replacement, use the 2 capturing groups, for example $1 $2
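As a sketch of that replacement in Python (where the replacement string uses \1 \2 instead of $1 $2; the variable names are illustrative):

```python
import re

text = ("The overall system blur must be smaller than 12 mrad "
        "and diameter shall be less than 20 meters.")

# Group 1 captures "must be", \D* skips the non-digits in between,
# and group 2 captures the first number that follows.
result = re.sub(r"^.*?\b(must be)\b\D*(\d+).*$", r"\1 \2", text)

print(result)  # must be 12
```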

Match sequence at the very beginning of a string along with other sequences in the middle

I'm trying to find a way to match 2 patterns in the same regular expression, but I can't seem to be able to combine both. These are a few strings I'm trying to match against:
0 800-204-4000
0800-204 4000
0 800 -204 - 4000
800 204 4000
What I'm trying to do is find a regular expression that matches a zero at the beginning of a string if it exists and all subsequent white spaces and dashes. So I was able to match the first zero if it exists using /^0?/ and match all empty spaces and dashes using /[\s-]*/g but how exactly can I combine both into the same expression?
Edit:
So I want to match the first 0 IF it exists and all following spaces and dashes. So in the examples above, what should be matched is in brackets:
[0 ]800[-]204[-]4000
[0]800[-]204[ ]4000
[0 ]800[ -]204[ - ]4000
800[ ]204[ ]4000
The regexes provided in the answers do not work. Check it out: https://regex101.com/r/dMN6xR/1
(^0|[\s-])[\s-]*
To avoid empty matches, use a group that matches either the starting zero (^0) or a single whitespace/dash character ([\s-]), and then match the rest of the whitespace/dash characters with [\s-]*.
To allow whitespace before the zero, add that right after the start anchor, like this:
(^[\s-]*0|[\s-])[\s-]*
You combine the regex by using the or operator:
/(^0?)|[\s-]*/g
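A tighter variant that avoids the empty matches would be the following, although it requires at least one space or dash after a leading zero, so it misses the 0 in 0800: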
/(^0)?[\s-]+/g
Drawing on the others' answers, you can use /^0[-\s]*|[-\s]+/g. I've used [- ] in the demo because I don't like how difficult it is to read when newlines are removed.
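To see what the last pattern matches on the sample strings, here is a small Python sketch (Python has no /g flag; `findall` and `sub` already work on all matches, and the helper name is made up for illustration):

```python
import re

pattern = re.compile(r"^0[-\s]*|[-\s]+")

samples = [
    "0 800-204-4000",
    "0800-204 4000",
    "0 800 -204 - 4000",
    "800 204 4000",
]

def strip_prefix_and_separators(s):
    """Remove a leading 0 plus every run of spaces and dashes."""
    return pattern.sub("", s)

for s in samples:
    print(pattern.findall(s), strip_prefix_and_separators(s))
```

The first alternative only applies at the start of the string, so zeros inside the number (as in 204 or 4000) are left alone.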