Negate a character group to replace all other characters - regex

I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?

The reason the one you have isn't working is that the ^ inside the group is an anchor, not a negation: you are asserting that the line starts right before the hours and minutes, which isn't the case. Removing the assertion (^) lets the pattern match.
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind is basically just telling the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This will match all word characters or spaces between the beginning of the line and where the pattern that comes after it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note you will have to escape the \w in order for it to be in a string and not be angry about it).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.

You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00. See regex #1 demo and regex #2 demo.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.

I am not sure what language you are using, but why use negation when you can directly match the first digits in the hh:mm format?
Assuming that the date string always contains an hh:mm part, this regex snippet should have the first group match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)
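As a rough illustration of "match instead of replace", here is a minimal sketch in Python (chosen only for illustration, since this answer does not assume a language; the variable names are made up). It takes the first hh:mm-shaped match from the sample string rather than stripping everything around it.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# search() returns the leftmost match; in this string that is the hh:mm part.
match = re.search(r"\d\d:\d\d", text)
hours_minutes = match.group() if match else None

print(hours_minutes)  # 22:00
```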

Related

Regular Expression Stopping at Specified Value

I have to use a regular expression to parse values out of a swift message and there are some situations where the behaviour is not what I want.
Let's say I am after something with a particular pattern, in this case a BIC (6 letters, followed by 2 letters or digits, followed by an optional XXX or 3 digits):
([A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
this is fine but now I want to look for these bank codes in particular fields. In swift a field is denoted with : and has some numbers and sometimes a letter.
so I want to match a BIC value in field 52A
I can do the following
(52A:[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
which would match 52A:AAAAAAAAXXX
my problem is you can have things before and after this value - and the value itself might not exist in the field you want
so I can wildcard the reg ex to allow for things before it for example
(52A:.*?[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
matches 52A:somerubbishAAAAAAAAXXX
but if there isn't something within this field, the regex continues to search for the pattern, and this is where I have a problem.
for example the above reg ex matches this 52A:somerubbish:57D:AAAAAAAAXXX
Question
I need the regex to stop at the first field that comes after it (it might not always be 57D, but it will always follow the format [0-9]{2}[A-Z]{0,1}),
so the above example shouldn't return a match, as the pattern I am after is not contained in the 52A section.
Does anyone know how I can do this?
Change .*? to [^:]*? like this:
(52A:[^:]*?[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
[^:] means "any character except :", which ensures the match doesn't run into the next field.
See live demo.
Also, unless your situation requires you to match your target as group 1, you don't need the outer brackets: the entire match (ie group 0) will be your target.
I suspect instead of [XXX0-9]{0,3} you want (XXX|\d{3})? (XXX or 3 digits, but optionally) or perhaps (XXX|\d{1,3})? (XXX or up to 3 digits, but optionally)
Using [XXX0-9]{0,3} (which is the same as [X0-9]{0,3}) is a character class notation, repeating 0-3 times an X char or a digit.
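A minimal Python sketch of the [^:]*? fix, using the two sample messages from the question. The optional suffix is written as (?:XXX|\d{3})?, following the suggestion above; the function name find_bic_field is made up for this example.

```python
import re

# [^:]*? can skip "rubbish" inside field 52A but cannot cross the ':'
# that introduces the next field.
BIC_IN_52A = re.compile(r"52A:[^:]*?[A-Z]{6}[A-Z0-9]{2}(?:XXX|\d{3})?")

def find_bic_field(message):
    """Return the matched text for field 52A, or None if there is no BIC in it."""
    m = BIC_IN_52A.search(message)
    return m.group() if m else None

print(find_bic_field("52A:somerubbishAAAAAAAAXXX"))       # 52A:somerubbishAAAAAAAAXXX
print(find_bic_field("52A:somerubbish:57D:AAAAAAAAXXX"))  # None
```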
If the value itself can also contain a colon, you can match any character as "rubbish" as long as what is directly to the right is not the field format.
52A:(?:(?![0-9]{2}[A-Z]?:).)*[A-Z]{6}[A-Z0-9]{2}(?:[0-9]{3}|XXX)?
The pattern matches:
52A: Match literally
(?:(?![0-9]{2}[A-Z]?:).)* Match any character asserting not 2 digits, optional char A-Z and : directly to the right
[A-Z]{6}[A-Z0-9]{2} Match 6 chars A-Z and 2 chars A-Z or 0-9
(?:[0-9]{3}|XXX)? Optionally match 3 digits or XXX
See a regex demo.
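To see how the lookahead-guarded repetition behaves, here is a small Python sketch. The first test message, with a colon inside the value, is an invented example; the second is the one from the question. The helper name has_bic_in_52a is hypothetical.

```python
import re

# Each '.' is consumed only if a field tag (2 digits, optional letter, ':')
# does not start at that position.
PATTERN = re.compile(
    r"52A:(?:(?![0-9]{2}[A-Z]?:).)*[A-Z]{6}[A-Z0-9]{2}(?:[0-9]{3}|XXX)?"
)

def has_bic_in_52a(message):
    return PATTERN.search(message) is not None

print(has_bic_in_52a("52A:some:rubbishAAAAAAAAXXX"))      # True
print(has_bic_in_52a("52A:somerubbish:57D:AAAAAAAAXXX"))  # False
```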

How to select with regex this character?

For the example i have these four ip address:
10.100.0.11; wrong
10.100.1.12; good
10.100.11.4; good
10.100.44.1; wrong
The task has simple rules. The 3rd place can't be 0, and the 4th place can't be a solo 1.
I need to select them from an IP table in different routers, and I know only these rules.
My solution:
^(10.100.[1-9]{1,3}.[023456789]{1,3})$
but in this case every number with 1 like 10, 100 etc is missing, so in this way this solution is wrong.
^(10.100.[1-9]{1,3}.[1-9]{2,3})$
This solve the problem of the single 1, but make another one.
From the rules you have given, this regex should work:
10\.100\.([123456789]\d*|\d{2,})\.([^1]$|\d{2,})
it also matches 3rd position number containing a 0 but not in the first place.
so 10.100.10.4 will match as well as 10.100.02.4
I don't know if that's the intended behavior since I'm not familiar with IP addresses.
The last part \.([^1]$|\d{2,}) reads like this:
"after the 3rd dot is either
a character which is not 1 followed by the end of the line
or two or more digits"
If you want to prevent malformed strings containing a non-digit character, like 10.100.12.a, from being matched, you should replace [^1] with [023456789] or, lazier (and therefore better ;), with [02-9]
I use https://regex101.com to debug regex. It's just awesome.
Here is your regex if you want to play with it
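A short Python sketch comparing the [^1] version with the [02-9] version on a malformed address. The names LOOSE, STRICT and accepts are invented for this example, and [1-9] is used as shorthand for [123456789].

```python
import re

# Same pattern twice; only the "last single character" alternative differs.
LOOSE = re.compile(r"10\.100\.([1-9]\d*|\d{2,})\.([^1]$|\d{2,})")
STRICT = re.compile(r"10\.100\.([1-9]\d*|\d{2,})\.([02-9]$|\d{2,})")

def accepts(pattern, address):
    return pattern.search(address) is not None

print(accepts(LOOSE, "10.100.12.a"))   # True: [^1] also accepts letters
print(accepts(STRICT, "10.100.12.a"))  # False
print(accepts(STRICT, "10.100.1.12"))  # True
print(accepts(STRICT, "10.100.44.1"))  # False
```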
You might use
^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$
The pattern matches
^ Start of string
10\.100\. Match 10.100. (note to escape the dot to match it literally)
[1-9]{1,3} Match 1 to 3 times a digit 1-9
\. Match a dot
(?: Non capture group
[02-9] Match a digit 0 or 2-9
| Or
\d{2,3} Match 2 or 3 digits 0-9
) Close the group
$ End of string
Regex demo
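As a quick check, this Python sketch runs the pattern above against the four addresses from the question (ROUTE and samples are names made up for the example).

```python
import re

ROUTE = re.compile(r"^10\.100\.[1-9]{1,3}\.(?:[02-9]|\d{2,3})$")

samples = ["10.100.0.11", "10.100.1.12", "10.100.11.4", "10.100.44.1"]
accepted = [ip for ip in samples if ROUTE.match(ip)]

print(accepted)  # ['10.100.1.12', '10.100.11.4']
```

Note that with this pattern a third number such as 10 is rejected as well, because [1-9] excludes every 0; whether that is wanted depends on how the rule is read.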

Regex : extract the biggest number from x to y figures

I have a URL formatted as follows: https://www.mywebsite.com/subdomain/123456789.htm. I know that the webpage number is built with exactly 9 or 10 digits. I would like to extract this number using a Regex.
The Regex I use to perform this operation is :
^https://www.mywebsite.com/[A-Za-z0-9_.-~/]+([0-9]{9,10}).htm$
The problem is that when the number is 10 digits long, I get a match which is good but only the last 9 digits are captured. For example : https://www.mywebsite.com/subdomain/1234567890.htm captures 234567890 only.
I could easily create two regexes (one with 9 digits and one with 10) and take the longest number if both matches, but is there any elegant way to solve this problem using Regex?
EDIT
Following remarks which have been made below, there is actually a mistake in my original Regex : the first character group matches the first digit of the 10, and leaves only the 9 others for the capturing group. I've added a screenshot below. Adding a forward slash to the Regex before the capturing group solved the issue, thanks!
As per #TheFourthBird, you are missing a match on the forward slash. Maybe a slightly different approach to yours would be a non-capturing group:
^https://www.mywebsite.com/(?:[^/]+/)+(\d{9,10}).htm$
The character class [A-Za-z0-9_.-~/]+ matches all the characters that follow until the end of the line.
This part ([0-9]{9,10}). will then backtrack until it can match the remaining digits, which it can as soon as 9 digits are available, and those will be in the capturing group.
Note to either escape the hyphen \- or place it at the start or end of the character class, or else it could possibly match a range.
One option is to use a word boundary \b before matching the digits
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+\b([0-9]{9,10})\.htm$
Regex demo
Another way could be matching the / right before the digits.
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+/([0-9]{9,10})\.htm$
Regex demo
If there can also be chars a-zA-Z or an underscore before the digits and a lookbehind is supported, you could also assert that there is not a digit before (?<!\d)
^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+(?<!\d)([0-9]{9,10})\.htm$
Regex demo
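The backtracking effect and the three fixes can be reproduced with a short Python sketch. The dictionary keys and the helper capture are invented for this example; the patterns are the ones shown above.

```python
import re

url = "https://www.mywebsite.com/subdomain/1234567890.htm"

patterns = {
    # The greedy class eats the first digit and leaves only 9 for the group.
    "original": r"^https://www.mywebsite.com/[A-Za-z0-9_.-~/]+([0-9]{9,10}).htm$",
    "boundary": r"^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+\b([0-9]{9,10})\.htm$",
    "slash": r"^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+/([0-9]{9,10})\.htm$",
    "lookbehind": r"^https://www\.mywebsite\.com/[A-Za-z0-9_.~/-]+(?<!\d)([0-9]{9,10})\.htm$",
}

def capture(name, text):
    m = re.match(patterns[name], text)
    return m.group(1) if m else None

print(capture("original", url))    # 234567890
print(capture("boundary", url))    # 1234567890
print(capture("slash", url))       # 1234567890
print(capture("lookbehind", url))  # 1234567890
```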
One more approach. This captures the run of digits immediately before .htm
(\d+)(?=\.htm)
RegexDemo

RegEx to check 24 hours time format fails

I have the following RegEx that is supposed to do 24 hours time format validation, which I'm trying out in https://rubular.com
/^[0-23]{2}:[0-59]{2}:[0-59]{2}$/
But the following times fails to match even if they look correct
02:06:00
04:05:00
Why this is so?
In character classes, you're supposed to denote the range of characters allowed (in contrast to the numbers you want to match in your example). For minutes and seconds, this is relatively straight-forward - the following expression
[0-5][0-9]
...will match any numerical string from "00" to "59".
But for the hours, you need two separate expressions:
[01][0-9]|2[0-3]
...one to match "00" to "19" and one to match "20" to "23". Due to the alternative used (| character), these need to be grouped, which adds another bit of syntax (?:...). Finally we're just adding the anchors ^ and $ for beginning and end of string, which you already had where they belong.
^(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$
You can check this solution out at regex101, if you like.
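For a quick sanity check outside an online tester, here is a minimal Python sketch of the same pattern. The question used Rubular, so Python is only an illustration here, and the function name is_valid_time is made up.

```python
import re

TIME_24H = re.compile(r"^(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$")

def is_valid_time(value):
    """True if value looks like HH:MM:SS on a 24-hour clock."""
    return TIME_24H.match(value) is not None

print(is_valid_time("02:06:00"))  # True
print(is_valid_time("04:05:00"))  # True
print(is_valid_time("23:59:59"))  # True
print(is_valid_time("24:00:00"))  # False
print(is_valid_time("12:60:00"))  # False
```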
Your problem is that you misunderstand character ranges: [0-23] doesn't mean "match any number from 0 to 23". It is a character class made of the range 0-2 plus the single character 3, so it matches exactly one character: 0, 1, 2 or 3.
Try this pattern: (?:[01][0-9]|2[0-3])(?::[0-5][0-9]){2}
Explanation:
(?:...) - non-capturing group
[01][0-9]|2[0-3] - alternation: match either 0 or 1 followed by any digit from 0 to 9, OR 2 followed by 0, 1, 2 or 3 (a number from 00 to 23)
(?::[0-5][0-9]){2} - match : and [0-5][0-9] (basically number from 00-59) twice
Demo
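To make the character-class point concrete, this small Python sketch lists which single characters the classes from the question accept, and shows that the original pattern rejects a valid time while accepting an invalid one. Variable names are invented for the example.

```python
import re

digits = "0123456789"

# A character class always matches ONE character; ranges work per character.
hour_class = re.findall(r"[0-23]", digits)
minute_class = re.findall(r"[0-59]", digits)
print(hour_class)    # ['0', '1', '2', '3']
print(minute_class)  # ['0', '1', '2', '3', '4', '5', '9']

original = re.compile(r"^[0-23]{2}:[0-59]{2}:[0-59]{2}$")
rejects_valid = original.match("02:06:00") is None        # '6' is not in [0-59]
accepts_invalid = original.match("33:99:99") is not None  # every char is in its class
print(rejects_valid, accepts_invalid)  # True True
```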
use this (([0-1]\d|[2][0-3])):(([0-5][0-9])):(([0-5][0-9]))
Online demo

Regex is possible to match?

I have files with these filename:
ZATR0008_2018.pdf
ZATR0018_2018.pdf
ZATR0218_2018.pdf
Where the 4 digits after ZATR is the issue number of magazine.
With this regex:
([1-9][0-9]*)(?=_\d)
I can extract 8, 18 or 218 but I would like to keep minimum 2 digits and max 3 digits so the result should be 08, 18 and 218.
How is it possible to do that?
You may use
0*(\d{2,3})_\d
and grab Group 1 value. See the regex demo.
Details
0* - zero or more 0 chars
(\d{2,3}) - Group 1: two or three digits
_\d - a _ followed with a digit.
Here is a PCRE variation that grabs the value you need into a whole match:
0*\K\d{2,3}(?=_\d)
See another regex demo
Here, \K makes the regex engine omit the text matched so far (zeros) and then matches 2 to 3 digits that are followed with _ and a digit.
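A minimal Python sketch of the Group 1 variant applied to the three file names from the question. The function name issue_number is made up; the \K variant is left out because Python's built-in re module does not support \K.

```python
import re

# 0* skips leading zeros, but backtracks so that at least two digits
# remain for the capturing group.
ISSUE = re.compile(r"0*(\d{2,3})_\d")

def issue_number(filename):
    m = ISSUE.search(filename)
    return m.group(1) if m else None

print(issue_number("ZATR0008_2018.pdf"))  # 08
print(issue_number("ZATR0018_2018.pdf"))  # 18
print(issue_number("ZATR0218_2018.pdf"))  # 218
```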
(?:[1-9][0-9]?)?[0-9]{2}(?=_[0-9])
or perhaps:
(?:[1-9][0-9]+|[0-9]{2})(?=_[0-9])
(https://www.freeformatter.com/regex-tester.html, which claims to use the XRegExp library and which you mention in another answer, doesn't seem to backtrack into the (?:)? in my first suggestion where necessary. That makes it very different from any regex engine I've encountered before and makes it prefer to match just the 18 of 218 even though it starts later in the string. But it does work with my second suggestion.)
([1-9]\d{2,3})(?=_\d)
{x,y} will match from x to y times the previous pattern, in this case \d
Edit: from your own regex it looked as you wanted the part of the number which starts with a non-zero. However since your examples include leading 0s, maybe you really wanted :
(\d{2,3})(?=_\d)
Which will give you the last 3 digits before underscore unless there are only 2 digits.
I propose you:
^ZATR0*(\d{2,3})_\d+\.pdf$
demo code here. Result:
Match 1 Full match 0-17 ZATR0008_2018.pdf Group 1. 6-8 08
Match 2 Full match 18-35 ZATR0018_2018.pdf Group 1. 24-26 18
Match 3 Full match 36-53 ZATR0218_2018.pdf Group 1. 41-44 218
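The result above can be reproduced with a short Python sketch, assuming the three file names are tested as one multi-line string (as an online tester would do). The re.MULTILINE flag and the variable names are choices made for this example.

```python
import re

text = "ZATR0008_2018.pdf\nZATR0018_2018.pdf\nZATR0218_2018.pdf"

# MULTILINE makes ^ and $ match at the start and end of every line.
FILENAME = re.compile(r"^ZATR0*(\d{2,3})_\d+\.pdf$", re.MULTILINE)

results = [(m.span(), m.span(1), m.group(1)) for m in FILENAME.finditer(text)]
for full_span, group_span, issue in results:
    print(full_span, group_span, issue)
# (0, 17) (6, 8) 08
# (18, 35) (24, 26) 18
# (36, 53) (41, 44) 218
```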