Regular Expression: Find a specific group within other groups in VB.Net - regex

I need to write a regular expression that has to replace everything except for a single group.
E.g
IN
OUT
OK THT PHP This is it 06222021
This is it
NO MTM PYT Get this content 111111
Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space Charachter
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY

You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1 match any char as least as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it

My approach does not work with groups and does use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 ocurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(#"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;

Related

Pattern to match everything except a string of 5 digits

I only have access to a function that can match a pattern and replace it with some text:
Syntax
regexReplace('text', 'pattern', 'new text'
And I need to return only the 5 digit string from text in the following format:
CRITICAL - 192.111.6.4: rta nan, lost 100%
Created Time Tue, 5 Jul 8:45
Integration Name CheckMK Integration
Node 192.111.6.4
Metric Name POS1
Metric Value DOWN
Resource 54871
Alert Tags 54871, POS1
So from this text, I want to replace everything with "" except the "54871".
I have come up with the following:
regexReplace("{{ticket.description}}", "\w*[^\d\W]\w*", "")
Which almost works but it doesn't match the symbols. How can I change this to match any word that includes a letter or symbol, essentially.
As you can see, the pattern I have is very close, I just need to include special characters and letters, whereas currently it is only letters:
You can match the whole string but capture the 5-digit number into a capturing group and replace with the backreference to the captured group:
regexReplace("{{ticket.description}}", "^(?:[\w\W]*\s)?(\d{5})(?:\s[\w\W]*)?$", "$1")
See the regex demo.
Details:
^ - start of string
(?:[\w\W]*\s)? - an optional substring of any zero or more chars as many as possible and then a whitespace char
(\d{5}) - Group 1 ($1 contains the text captured by this group pattern): five digits
(?:\s[\w\W]*)? - an optional substring of a whitespace char and then any zero or more chars as many as possible.
$ - end of string.
The easiest regex is probably:
^(.*\D)?(\d{5})(\D.*)?$
You can then replace the string with "$2" ("\2" in other languages) to only place the contents of the second capture group (\d{5}) back.
The only issue is that . doesn't match newline characters by default. Normally you can pass a flag to change . to match ALL characters. For most regex variants this is the s (single line) flag (PCRE, Java, C#, Python). Other variants use the m (multi line) flag (Ruby). Check the documentation of the regex variant you are using for verification.
However the question suggest that you're not able to pass flags separately, in which case you could pass them as part of the regex itself.
(?s)^(.*\D)?(\d{5})(\D.*)?$
regex101 demo
(?s) - Set the s (single line) flag for the remainder of the pattern. Which enables . to match newline characters ((?m) for Ruby).
^ - Match the start of the string (\A for Ruby).
(.*\D)? - [optional] Match anything followed by a non-digit and store it in capture group 1.
(\d{5}) - Match 5 digits and store it in capture group 2.
(\D.*)? - [optional] Match a non-digit followed by anything and store it in capture group 3.
$ - Match the end of the string (\z for Ruby).
This regex will result in the last 5-digit number being stored in capture group 2. If you want to use the first 5-digit number instead, you'll have to use a lazy quantifier in (.*\D)?. Meaning that it becomes (.*?\D)?.
(?s) is supported by most regex variants, but not all. Refer to the regex variant documentation to see if it's available for you.
An example where the inline flags are not available is JavaScript. In such scenario you need to replace . with something that matches ALL characters. In JavaScript [^] can be used. For other variants this might not work and you need to use [\s\S].
With all this out of the way. Assuming a language that can use "$2" as replacement, and where you do not need to escape backslashes, and a regex variant that supports an inline (?s) flag. The answer would be:
regexReplace("{{ticket.description}}", "(?s)^(.*\D)?(\d{5})(\D.*)?$", "$2")

Regex to extract last period and md5 string

I have the following regular expression:
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash, for example: if I have the following string "hello world .305eef9f x1xxx 304ccf9f test1232" it will return "304ccf9f"
I also have the following regular expression:
/.[^.]*$/ --- This expression extracts a string after the last period (included), for example, if I have "hello world.this.is.atest.case9.23919sd3xxxs" it will return ".23919sd3xxxs"
Thing is, I've readen a bit about regex but I can't join both expressions in order to find the md5 string after the last period (included), for example:
topLeftLogo.93f02a9d.controller.99f06a7s ----> must return ".99f06a7s"
Thanks in advance for your time and help!
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash
Yes but it doesn't return "304ccf9f" from "hello world .305eef9f x1xxx 304ccf9f test1232" because ^ in regex means start of string. How is it possible for it to match in middle of a string?
/.[^.]*$/ --- This expression extracts a string after the last period
No. It will do if you escape first dot only \.
To combine these two you have to replace ^ with \.:
\.[a-f0-9]{8}$
To match your characters 8 times after the last dot in this range [a-f0-9] you might use (if supported) a positive lookahead (?!.*\.) to match your values and assert that what follows does not contain a dot:
\.[a-f0-9]{8}(?!.*\.)
Regex demo
If you want to match characters from a-z instead of a-f like 99f06a7s you could use [a-z0-9]
About the first example
This regex ^[a-f0-9]{8}$ will match one of the ranges in the character class 8 times from the start until the end of the string due to the anchors ^ and $. It would not find a match in hello world .305eef9f x1xxx 304ccf9f test1232 on the same line.
About the second example
.[^.]*$ will match any character zero or more times followed by matching not a dot. That would for example also match a single a and is not bound to first matching a dot because you have to escape the dot to match it literally.
I'm adding this just in case people needs to solve a similar casuistic:
Case 1: for example, we want to get the hexadecimal ([a-f0-9]) 8 char string from our filename string
between the last period and the file extension, in order, for example, to remove that "hashed" part:
Example:
file.name2222.controller.2567d667.js ------> returns .2567d667
We will need to use the following regex:
\.[a-f0-9]{8}(?=\.\w+$)
Case 2: for example, we want the same as above but ignoring the first period:
Example:
file.name2222.controller.2567d667.js ------> returns 2567d667
We will need to use the following regex
[a-f0-9]{8}(?=\.\w+$)

Regular expression not working

I want to extract from the following regex (?<=^\d+\s*).*?\t trying to extract from the following text just the resources\blahblah:
10 _Resources\index.test FAIL
11 _Resources\index.test FAIL
12 Resources\index.test FAIL
13set\Relicensing Statement.test FAIL
but it captures the following text:
0 _Resources\index.test
1 _Resources\index.test
2 Resources\index.test
3set\Relicensing Statement.test
I just want the lines like Resources\index.test and not the starting numbers, no spaces, why is failing? If I just execute ^\d+\s*and matches with the any number of digits and space, but do not works with prefix.
Since you commented you were using Notepad++, how about matching ^\d+\s*([^\t]*).*$ and replacing by \1 ?
From NSRegularExpression (I saw it was tagged):
Look-behind assertion. True if the parenthesized pattern matches text
preceding the current input position, with the last character of the
match being the input character just before the current position. Does
not alter the input position. The length of possible strings matched
by the look-behind pattern must not be unbounded (no * or +
operators.)
The same problem holds in most of the languages.
Can't you extract $1 from (?:^\d+\s*)(.*?\t)?

Regular Expression begining of string with special characters

Using this for an example string
+$43073$7
and need the 5 number sequence from it I'm using the Regex expression
#"\$+(?<lot>\d{5})"
which is matching up any +$ in the string. I tried
#"^\$+(?<lot>\d{5})"
as the +$ are always at the beginning of the string. What will work?
If you use anchor ^, you need to include the + symbol at the first and don't forget to escape it because + is a special meta character in regex which repeats the previous token one or more times.
#"^\+\$(?<lot>\d{5})"
And without the anchor, it would be like
#"\$(?<lot>\d{5})"
And get the 5 digit number you want from group index 1.
DEMO
I would match what you want:
\d+
or if you only want digits after "special" characters at the start of input:
^\W+(\d+)
grabbing group 1

C# regular expression to match square brackets

I'm trying to use a regular expression in C# to match a software version number that can contain:
a 2 digit number
a 1 or 2 digit number (not starting in 0)
another 1 or 2 digit number (not starting in 0)
a 1, 2, 3, 4 or 5 digit number (not starting in 0)
an option letter at the end enclosed in square brackets.
Some examples:
10.1.23.26812
83.33.7.5
10.1.23.26812[d]
83.33.7.5[q]
Invalid examples:
10.1.23.26812[
83.33.7.5]
10.1.23.26812[d
83.33.7.5q
I have tried the following:
string rex = #"[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?";
(note: if I try without the "#" and just escape the square brackets by doing "\[" I get an error saying "Unrecognised escape sequence")
I can get to the point where the version number is validating correctly, but it accepts anything that comes after (for example: "10.1.23.26812thisShouldBeWrong" is being matched as correct).
So my question is: is there a way of using a regular expression to match / check for square brackets in a string or would I need to convert it to a different character (eg: change [a] to a and match for *s instead)?
This happens because the regex matches part of the string, and you haven't told it to force the entire string to match. Also, you can simplify your regex a lot (for example, you don't need all those capturing groups:
string rex = #"^[0-9]{2}\.[1-9][0-9]?\.[1-9][0-9]?\.[1-9][0-9]{0,4}(?:\[[a-zA-Z]\])?$";
The ^ and $ are anchors that match the start and end of the string.
The error message you mentioned has to do with the fact that you need to escape the backslash, too, if you don't use a verbatim string. So a literal opening bracket can be matched in a regex as "[[]" or "\\[" or #"\[". The latter form is preferred.
You need to anchor the regex with ^ and $
string rex = #"^[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?$";
the reason the 10.1.23.26812thisShouldBeWrong matches is because it matches the substring 10.1.23.26812
The regex can be simplfied slightly for readability
string rex = #"^\d{2}\.([1-9]\d?\.){2}[1-9]\d{0,4}(\[[a-zA-Z]\])?$";
In response to TimCross warning - updated regex
string rex = #"^[0-9]{2}\.([1-9][0-9]?\.){2}[1-9][0-9]{0,4}(\[[a-zA-Z]\])?$";