Regex: exclude string from matched pattern

Input string:
hrStorageDescr{hrStorageDescr="devfs: dev file system, mounted on: /.mount/dev"}
Regex to match value of hrStorageDescr only:
.*hrStorageDescr="(.*?)",.*
How to write this regex in order to preserve matching function, but exclude everything in the value, if devfs string is matched?

You could match hrStorageDescr preceded by a word boundary \b
First match =" and assert what is directly to the right is not devfs followed by a word boundary using a negative lookahead (?!devfs\b)
If that assertion succeeds, capture in the group matching any char except a " using a negated character class and close the group before matching the closing double quote ([^"]+)
Using .* will match the last occurrence of the pattern, using .*? will match the first. If you want to match all occurrences, you could omit that part, assuming you are allowed to collect all matches instead of a single match.
.*?\bhrStorageDescr="(?!devfs\b)([^"]+)"
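A minimal Python sketch of the negative-lookahead pattern above. The leading .*? is dropped because re.search already scans forward; the helper name storage_descr and the second sample line are assumptions made for illustration, not part of the question.

```python
import re

# Pattern from the answer, without the leading .*? (re.search scans for us).
PATTERN = re.compile(r'\bhrStorageDescr="(?!devfs\b)([^"]+)"')


def storage_descr(line):
    """Return the hrStorageDescr value, or None when the value starts with devfs."""
    m = PATTERN.search(line)
    return m.group(1) if m else None


devfs_line = 'hrStorageDescr{hrStorageDescr="devfs: dev file system, mounted on: /.mount/dev"}'
# Hypothetical second sample, not taken from the question.
other_line = 'hrStorageDescr{hrStorageDescr="/var: mounted on: /var"}'

print(storage_descr(devfs_line))   # no match, because the value starts with devfs
print(storage_descr(other_line))   # the captured value
```

The lookahead is checked right after =" and consumes nothing, so the capture group still starts at the first character of the value.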

Related

Regex positive lookahead multiple occurrence

I have below sample string
abc,com;def,med;ghi,com;jkl,med
I have to grep the string which is coming before keyword ",com" (all occurrences)
Final result which is I am looking for is something like -
abc,ghi
I have tried below positive lookahead regex -
[\s\S]*?(?=com)
But this is only fetching abc, not the ghi.
What modification do I need to make in above regex?
The character class [\s\S] matches any character, so it also matches the , and ; separators.
What you can do instead is match one or more non-whitespace characters other than , and ; using a negated character class; that way the quantifier does not have to be non-greedy either.
Then assert ,com to the right (followed by a word boundary to prevent a partial word match):
[^\s,;]+(?=,com\b)
Instead of using a lookahead, you might also use a capture group:
([^\s,;]+),com\b
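Both forms, the lookahead described above and the capture group, can be tried with Python's re.findall. This is only a sketch on the sample string from the question; the variable names are illustrative.

```python
import re

sample = "abc,com;def,med;ghi,com;jkl,med"

# Lookahead form: the match itself is the text before ",com".
with_lookahead = re.findall(r'[^\s,;]+(?=,com\b)', sample)

# Capture-group form: ",com" is consumed, findall returns group 1.
with_group = re.findall(r'([^\s,;]+),com\b', sample)

print(with_lookahead)
print(with_group)
```

With a single capture group in the pattern, re.findall returns the group values rather than the full matches, so both calls yield the same list here.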

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
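Python's built-in re module has no \G anchor, so the pattern above cannot be used there as written. As a rough sketch, the same "continue from the end of the previous match" behaviour can be emulated with Pattern.match(string, pos), which only matches starting exactly at pos. The function name group_names is an assumption made for illustration.

```python
import re

START = re.compile(r'\bgroups=')          # plays the role of the \bgroups= branch
ITEM = re.compile(r'[^(]*\(([^)]+)\)')    # the part of the pattern after the group


def group_names(id_output):
    """Collect parenthesised names that follow 'groups=' in `id` output."""
    start = START.search(id_output)
    if start is None:
        return []
    names = []
    pos = start.end()
    while True:
        # match() anchored at pos stands in for \G (end of the previous match).
        m = ITEM.match(id_output, pos)
        if m is None:
            break
        names.append(m.group(1))
        pos = m.end()
    return names


line = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
print(group_names(line))
```

The names in the uid= and gid= fields are skipped because the loop only starts after groups= has been found.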
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.

Match all instances of a certain character inside every word preceded by a certain word and not delimited by a space

Given a string such as below:
word.hi. bla. word.
I want to construct a regex which will match all "."s preceded by "word" and any other non space character
So, in the above example I would want the first, second and last dots to be matched.
While matching the first and last dots would be easy with global flag (/(?:word.*)\K./gU), I'm not sure how to construct a regex that would also match the second dot.
Appreciate any pointers.
You might match word and then get all consecutive matches using the \G anchor excluding matching whitespace chars or a dot.
(?:\bword|\G(?!\A))[^.\s]*\K\.
In parts
(?: Non capture group
\bword Match word preceded by a word boundary
| Or
\G(?!\A) Assert the position at the end of the previous match, not at the start
) Close non capture group
[^.\s]* Match 0+ occurrences of any char except . or a whitespace char
\K Clear the match buffer (forget what is matched until now)
\. Match a dot
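Neither \G nor \K exists in Python's built-in re module. The following is a hedged emulation rather than a direct translation: search finds the next word, and match anchored at the current position stands in for the \G branch. It returns the index of each dot instead of the dot itself; the name dot_positions is hypothetical.

```python
import re

WORD = re.compile(r'\bword')      # the \bword branch
STEP = re.compile(r'[^.\s]*\.')   # [^.\s]* followed by the dot we want


def dot_positions(text):
    """Return indexes of dots reachable from 'word' without crossing whitespace."""
    positions = []
    pos = 0
    while True:
        start = WORD.search(text, pos)
        if start is None:
            break
        pos = start.end()
        while True:
            # Anchored match at pos emulates \G(?!\A).
            m = STEP.match(text, pos)
            if m is None:
                break
            positions.append(m.end() - 1)  # index of the dot (what \K leaves)
            pos = m.end()
    return positions


sample = "word.hi. bla. word."
print(dot_positions(sample))
```

On the sample string this yields the indexes of the first, second and last dots; the dot after bla is skipped because the chain is broken by the space before it.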

REGEX: Select all text between last underscore and dot

I'm having trouble retrieving specific information of a string.
The string is as follows:
20190502_PO_TEST.pdf
This includes the .pdf part. I need to retrieve the part between the last underscore (_) and the dot (.) leaving me with TEST
I've tried this:
[^_]+$
This however, returns:
TEST.pdf
I've also tried this:
_(.+)\.
This returns:
PO_TEST
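The pattern [^_]+$ matches one or more characters other than an underscore up to the end of the string, so it also matches the dot and the extension.
In the pattern _(.+). the dot has to be escaped to match it literally, as in _(.+)\. and then the value is in the first capturing group. The match starts at the first underscore and .+ can itself match underscores, which is why the group contains PO_TEST.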
What you also might use:
^.*_\K[^.]+
^.*_ Match from the start of the string up to and including the last underscore
\K Forget what was matched so far
[^.]+ Match 1+ times any char except a dot
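In engines without \K, such as Python's built-in re module, a capture group gives the same result. A small sketch, assuming the file name from the question; the helper name last_segment is illustrative only.

```python
import re

# Same idea as ^.*_\K[^.]+ but with a capture group instead of \K.
PATTERN = re.compile(r'^.*_([^.]+)')


def last_segment(filename):
    """Return the text between the last underscore and the next dot, or None."""
    m = PATTERN.search(filename)
    return m.group(1) if m else None


print(last_segment("20190502_PO_TEST.pdf"))
```

The greedy .* first runs to the end of the string and then backtracks to the last underscore that is followed by at least one non-dot character.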

Regex ignore part of the string in matches

Suppose I have a tags object as such:
["warn-error-fatal-failure-exception-ok","parsefailure","anothertag","syslog-warn-error-fatal-failure-exception-ok"]
I would like to be able to use regex to match on "failure" but exclude "warn-error-fatal-failure-exception-ok".
So in the above case if I used my regex to search for failure it should only match failure on parsefailure and ignore the rest.
How can this be accomplished using regex?
NOTE: The regex has to exclude the whole string "warn-error-fatal-failure-exception-ok"
EDIT
After documenting the answer below, I realized that maybe what you are looking for is:
(?<!warn-error-fatal-)failure(?!-exception-ok)
So I'm adding it here in case that it fits what you are looking for better. This regex is just looking for "failure" but using a Negative Lookbehind and a Negative Lookahead to specify that "failure" may not be preceded by "warn-error-fatal-" or followed by "-exception-ok".
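A short Python sketch of this lookaround pattern, applied to the tags from the question after parsing them into a list. Parsing the tags first is an assumption made here for illustration; the question treats them as one string.

```python
import re

# "failure" not preceded by "warn-error-fatal-" and not followed by "-exception-ok".
PATTERN = re.compile(r'(?<!warn-error-fatal-)failure(?!-exception-ok)')

tags = [
    "warn-error-fatal-failure-exception-ok",
    "parsefailure",
    "anothertag",
    "syslog-warn-error-fatal-failure-exception-ok",
]

matching = [tag for tag in tags if PATTERN.search(tag)]
print(matching)
```

Both lookarounds are fixed-width and zero-length, so the match itself is always just the word failure.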
ANSWER DEVELOPED FROM COMMENTS:
The following regex captures the "failure" substring in the "parsefailure" tag, and it puts it in Group 1:
^.*"(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)[^"]*(failure)[^"]*".*$
DETAIL
I will break the regex in parts, and I'll explain each. First, let's forget about everything in between the first set of parentheses, and let's just look at the rest.
^.*"[^"]*(failure)[^"]*".*$
The important part of the regex is what we are trying to capture in the group, which is the word "failure" which itself is a part of a tag surrounded by double-quotes. The regular expression above matches the whole test string, but it focuses on a tag surrounded by double-quotes and containing the substring "failure".
^.*" matches any character from the beginning of the string to a quote
"[^"]*(failure)[^"]*" matches a tag surrounded by double-quotes and containing the substring "failure". Literally: a double-quote, followed by zero or more characters that are not double-quotes, followed by "failure", followed by zero or more characters that are not double-quotes, followed by a double-quote. The parentheses capture the word "failure" in group 1.
".*$ matches any character from the double-quote to the end of the test string
Because [^"]*(failure)[^"]* matches all tags containing the substring "failure", ^.*"[^"]*(failure)[^"]*".*$ will capture the substring "failure" from the last tag containing the string, since the leading .* is greedy. In other words, it will capture "failure" from the syslog-warn-error-fatal-failure-exception-ok tag, which is not what we want, so we must exclude any tag containing warn-error-fatal-failure-exception-ok from being a possible match to the tag portion of the regex: [^"]*(failure)[^"]*. This is achieved with a Negative Lookahead:
(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)
This Negative Lookahead basically means: "The text that follows this position can't match [^"]*warn-error-fatal-failure-exception-ok[^"]*". Because [^"]* cannot cross a double-quote, the check is limited to the current tag. The (?! and ) are just part of the syntax.
MORE BREAKDOWN
^ matches the beginning of the test string
.* matches any character zero or more times
" matches a double-quote character
[^"]* matches any character other than the double-quote character zero or more times
(failure) matches the word "failure", and since it is in parentheses, it will capture it in a group; in this case, it will be captured in group 1 because there is only one set of capturing parentheses. The parentheses of the Negative Lookahead are non-capturing.
$ matches the end of the test string
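The full pattern can be checked in Python against the raw tags string from the question. This is only a sketch; the helper tag_of is hypothetical and simply recovers the quoted tag that surrounds the captured group.

```python
import re

PATTERN = re.compile(
    r'^.*"(?![^"]*warn-error-fatal-failure-exception-ok[^"]*)'
    r'[^"]*(failure)[^"]*".*$'
)

raw = ('["warn-error-fatal-failure-exception-ok","parsefailure",'
       '"anothertag","syslog-warn-error-fatal-failure-exception-ok"]')


def tag_of(text, match):
    """Return the quoted tag that contains capture group 1."""
    start = text.rfind('"', 0, match.start(1)) + 1
    end = text.index('"', match.end(1))
    return text[start:end]


m = PATTERN.match(raw)
print(m.group(1))        # the captured word
print(tag_of(raw, m))    # the tag it was captured from
```

Only one capture is produced per call, so a string with several acceptable tags would need a different approach, for example matching each tag separately.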
Regular expression: [A-Za-z-]*(?<!("warn-error-fatal-))failure
It recognizes the failure in parsefailure and in "syslog-warn-error-fatal-failure-exception-ok", but not the one in "warn-error-fatal-failure-exception-ok".