I am using a regular expression to extract all non-numeric characters between two underscores from a string.
JohnDoe_King234_sample
I need the following output from the string: King
I have tried the following regular expression: (?<=_).\D*(?=_)
(Look positively forward for _ then match non numeric characters then look positively behind _ )
If my string is:
JohnDoe_King_sample
then my expression returns King. If my string is:
JohnDoe_King234_sample
then my expression does not match.
(?<=_).\D*(?=_)
Expected results: King
Actual results:
You may use
(?<=_)[^_\d]+(?=\d*_)
See the regex demo
Details
(?<=_) - a _ should be right before the current location
[^_\d]+ - any 1 or more chars other than _ and digits -
(?=\d*_) - there must be 0 or more digits followed with one _ immediately to the right of the current location.
NOTE: In case you may have digits anywhere inside that substring between underscores, if you have a way to process the string with some programming language, you might consider a _([^_]+)_ regex to extract the first match, then grab Group 1 value and remove all digits from it using a simple \d+ pattern with a regex replace method/function.
Related
there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the grouping. For instance, the below 2 strings doesn't match at all as the VALUESTU ends with "U" here :
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.
I have to use a regular expression to parse values out of a swift message and there are some situations where the behaviour is not what I want.
Lets say I am after something with a particular pattern - in this case a BIC (6 letters, followed by 2 letters or digits followed by optional XXX or 3 digits)
([A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
this is fine but now I want to look for these bank codes in particular fields. In swift a field is denoted with : and has some numbers and sometimes a letter.
so I want to match a BIC value in field 52A
I can do the following
(52A:[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
which would match 52A:AAAAAAAAXXX
my problem is you can have things before and after this value - and the value itself might not exist in the field you want
so I can wildcard the reg ex to allow for things before it for example
(52A:.*?[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
matches 52A:somerubbishAAAAAAAAXXX
but if there isnt something within this field - the reg ex continues to search for the pattern and this is where i have a problem.
for example the above reg ex matches this 52A:somerubbish:57D:AAAAAAAAXXX
Question
I need the reg ex to stop on the first field that is after it (it might not always be 57D but it will always follow the format [0-9]{2}[A-Z]{0,1})
so the above example shouldnt return a match as the pattern I am after is not contained in the 52A section
Does anyone know how I can do this?
Change .*? to [^:]*?:
(52A:[^:]*?[A-Z]{6}[A-Z0-9]{2}[XXX0-9]{0,3})
[^:] means "any character except :", which ensures the match doesn't run into the next field.
See live demo.
Also, unless your situation requires you to match your target as group 1, you don't need the outer brackets: the entire match (ie group 0) will be your target.
I suspect instead of [XXX0-9]{0,3} you want (XXX|\d{3})? (XXX or 3 digits, but optionally) or perhaps (XXX|\d{1,3})? (XXX or up to 3 digits, but optionally)
Using [XXX0-9]{0,3} (which is the same as [X0-9]{0,3}) is a character class notation, repeating 0-3 times an X char or a digit.
If the value itself can also contain a colon, you can match any character as "rubbish" as long as what is directly to the right is not the field format.
52A:(?:(?![0-9]{2}[A-Z]?:).)*[A-Z]{6}[A-Z0-9]{2}(?:[0-9]{3}|XXX)?
The pattern matches:
52A: Match literally
(?:(?![0-9]{2}[A-Z]?:).)* Match any character asserting not 2 digits, optional char A-Z and : directly to the right
[A-Z]{6}[A-Z0-9]{2} Match 6 chars A-Z and 2 chars A-Z or 0-9
(?:[0-9]{3}|XXX)? Optionally match 3 digits or XXX
See a regex demo.
I want to expect some characters only if a prior regex matched. If not, no characters (empty string) is expected.
For instance, if after the first four characters appears a string out of the group (A10, B32, C56, D65) (kind of enumeration) then a "_" followed by a 3-digit number like 123 is expected. If no element of the mentioned group appears, no other string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group like A12 or C56, then I expect that this element if followed by an underscore followed by 1 to 3 digits. If the filename does not contain a string of that group (no character or a character sequence different from the strings in the group) then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs ion the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up in this regex:
^XXX_(?:(?A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenApi #Pattern annotation I have the problem that I get the error:
Conditionals are not supported in this regex dialect.
As #The fourth bird suggested alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause would not work as you have already matched one of A12|B43|D14
The named capture group is not optional, so the if clause will always be true.
What you can do instead is use an alternation to match either the numeration part followed by an underscore and 3 digits, or match an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
Regex demo
If you want to make use of the if/else clause, you can make the named capture group optional, and then check if group 1 exists.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
Regex demo
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)?
(?(DT) If we have group DT
(?:_\d{5})? Optionally match _ and 5 digits
_\d{3}(?!\d) Match _ and 3 digits
| Or
(?! Negative lookahead, assert not to the right
A12|B43|D14| Match one of the alternatives, or
[A-Z]\d{2}_\d{3}(?!\d) Match 1 char A-Z, 2 digits _ 3 digits not followed by a digit
) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
Regex demo
I need to write a regular expression that has to replace everything except for a single group.
E.g
IN
OUT
OK THT PHP This is it 06222021
This is it
NO MTM PYT Get this content 111111
Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space Charachter
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY
You don't need 4 groups, you can use a single group 1 to be in the replacement and match 6-8 digits for the last part instead of only 6.
Note that this \w{0,2} will also match an empty string, you can use \w{1,2} if there has to be at least a single word char.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1 match any char as least as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Regex demo
Example code
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)
Output
This is it
My approach does not work with groups and does use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 ocurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(#"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value;
I have the following regular expression:
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash, for example: if I have the following string "hello world .305eef9f x1xxx 304ccf9f test1232" it will return "304ccf9f"
I also have the following regular expression:
/.[^.]*$/ --- This expression extracts a string after the last period (included), for example, if I have "hello world.this.is.atest.case9.23919sd3xxxs" it will return ".23919sd3xxxs"
Thing is, I've readen a bit about regex but I can't join both expressions in order to find the md5 string after the last period (included), for example:
topLeftLogo.93f02a9d.controller.99f06a7s ----> must return ".99f06a7s"
Thanks in advance for your time and help!
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash
Yes but it doesn't return "304ccf9f" from "hello world .305eef9f x1xxx 304ccf9f test1232" because ^ in regex means start of string. How is it possible for it to match in middle of a string?
/.[^.]*$/ --- This expression extracts a string after the last period
No. It will do if you escape first dot only \.
To combine these two you have to replace ^ with \.:
\.[a-f0-9]{8}$
To match your characters 8 times after the last dot in this range [a-f0-9] you might use (if supported) a positive lookahead (?!.*\.) to match your values and assert that what follows does not contain a dot:
\.[a-f0-9]{8}(?!.*\.)
Regex demo
If you want to match characters from a-z instead of a-f like 99f06a7s you could use [a-z0-9]
About the first example
This regex ^[a-f0-9]{8}$ will match one of the ranges in the character class 8 times from the start until the end of the string due to the anchors ^ and $. It would not find a match in hello world .305eef9f x1xxx 304ccf9f test1232 on the same line.
About the second example
.[^.]*$ will match any character zero or more times followed by matching not a dot. That would for example also match a single a and is not bound to first matching a dot because you have to escape the dot to match it literally.
I'm adding this just in case people needs to solve a similar casuistic:
Case 1: for example, we want to get the hexadecimal ([a-f0-9]) 8 char string from our filename string
between the last period and the file extension, in order, for example, to remove that "hashed" part:
Example:
file.name2222.controller.2567d667.js ------> returns .2567d667
We will need to use the following regex:
\.[a-f0-9]{8}(?=\.\w+$)
Case 2: for example, we want the same as above but ignoring the first period:
Example:
file.name2222.controller.2567d667.js ------> returns 2567d667
We will need to use the following regex
[a-f0-9]{8}(?=\.\w+$)