Regex that excludes a character in one line but not another

I have the following two lines for which I'm trying to create an expression:
Auth.NAS-IP-Address=0.0.0.0,
auth.Alerts=Failed to construct filter=select COALESCE('%{Endpoint:intel_endpoint_BLOCK_locations}','none') ilike '%;' || '%{Device:Location}' || ';%' as is_blocked.
I am trying to create a capture group that captures everything after the first '=' in each line. (Everything after "Address=" and "Alerts="). However, I'd like to exclude the ',' at the end of the first line.
This is the closest I've come:
^([\S]+)=(.+)(,$)?
My goal here was to capture everything except for a comma that occurs right before the end of the line. That didn't work.
The following expression will exclude the ',' on the first line, but also stops the capture group at the comma in the second line and therefore doesn't capture the entire value.
^([\S]+)=(.+),
Is this something that's even possible with Regex? Can I create an expression that will exclude a character on one line but not another?

You can try making the second group non-greedy, so that an optional trailing comma is left outside the capture:
^(\S+)=(.+?),?$
Regex demo.
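For anyone who wants to try the non-greedy version outside a regex tester, here is a minimal Python sketch. It assumes Python's re module with re.MULTILINE so that ^ and $ work per line; the names LOG, PAIR and pairs are only illustrative.

```python
import re

# Sample input from the question: two "key=value" lines.
LOG = (
    "Auth.NAS-IP-Address=0.0.0.0,\n"
    "auth.Alerts=Failed to construct filter=select COALESCE("
    "'%{Endpoint:intel_endpoint_BLOCK_locations}','none') ilike '%;' || "
    "'%{Device:Location}' || ';%' as is_blocked."
)

# Lazy (.+?) stops as early as possible, so an optional comma right
# before the end of the line is consumed by ",?" and not by group 2.
PAIR = re.compile(r"^(\S+)=(.+?),?$", re.MULTILINE)

pairs = PAIR.findall(LOG)
for key, value in pairs:
    print(key, "->", value)
```

Commas in the middle of the second line stay inside the value, because ,? only succeeds when the end of the line follows immediately.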

As I understand, in the first line you wish to capture everything after the first equals sign, except for a comma at the end of that line; in all other lines you wish to capture everything after the first equals sign, regardless of whether the line ends with a comma. You can do that with the following regular expression.
(?:\A(\S+)=(.+),$|^(?!\A)(\S+)=(.+))
The regular expression can be broken down as follows.
(?: # begin non-capture group
\A # match beginning of the string
(\S+) # match >= 1 non-whitespace chars, save to group 1
= # match equals sign
(.+) # match >= 1 chars other than line terminators, save to group 2
, # match a comma
$ # match end of line
| # or
^ # match beginning of a line
(?!\A) # assert location is not at the beginning of the string
(\S+) # match >= 1 non-whitespace chars, save to group 3
= # match equals sign
(.+) # match >= 1 chars other than line terminators, save to group 4
) # end non-capture group
(?!\A) is a negative lookahead. One could alternatively use the negative lookbehind (?<!\A).
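A rough Python sketch of how the four groups might be read back, assuming re.MULTILINE. The input TEXT is a shortened, made-up variant of the question's data (its second line ends with a comma on purpose), and key_values is a hypothetical helper name.

```python
import re

# Hypothetical two-line input; the second line ends with a comma on purpose.
TEXT = "Auth.NAS-IP-Address=0.0.0.0,\nauth.Alerts=a, b=c,"

PATTERN = re.compile(r"(?:\A(\S+)=(.+),$|^(?!\A)(\S+)=(.+))", re.MULTILINE)

def key_values(text):
    """Return (key, value) pairs, reading groups 1-2 or groups 3-4."""
    result = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            # First alternative: the first line, trailing comma dropped.
            result.append((m.group(1), m.group(2)))
        else:
            # Second alternative: any later line, kept as-is.
            result.append((m.group(3), m.group(4)))
    return result

print(key_values(TEXT))
```

Only one pair of groups is set per match, which is why the helper has to check which alternative matched.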

Related

Regex to capture optional characters

I want to pull out a base string (Wax) or (noWax) from a longer string, along with potentially any data before and after if the string is Wax. I'm having trouble getting the last item in my list below (noWax) to match.
Can anyone flex their regex muscles? I'm fairly new to regex so advice on optimization is welcome as long as all matches below are found.
What I'm working with in Regex101:
/(?<Wax>Wax(?:Only|-?\d+))/mg
Original string -> need to extract in a capturing group
Loc3_341001_WaxOnly_S212 -> WaxOnly
Loc4_34412-a_Wax4_S231 -> Wax4
Loc3a_231121-a_Wax-4-S451 -> Wax-4
Loc3_34112_noWax_S311 -> noWax
Here is one way to do so, using a conditional:
(?<Wax>(no)?Wax(?(2)|(?:Only|-?\d+)))
See the online demo.
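(no)?: Optional capture group (group 2; the named group Wax is group 1).
(?(2)...|...): Conditional, read as "if group 2 took part in the match, then ..., else ...".
(2): Test whether capture group 2, (no), matched. If it did, match nothing more.
|: Otherwise.
(?:Only|-?\d+): Match Only, or an optional - followed by one or more digits.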
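If your engine or tooling makes conditionals awkward, the same idea can be sketched with a plain alternation. This is a swapped-in rewrite, not the pattern above, and I am only assuming it behaves the same for the four sample strings. Python syntax, with extract_wax as a made-up helper name.

```python
import re

# Plain-alternation rewrite of the conditional: either "noWax" on its own,
# or "Wax" followed by "Only" or by an optionally hyphenated number.
WAX = re.compile(r"(?P<Wax>noWax|Wax(?:Only|-?\d+))")

def extract_wax(s):
    """Return the Wax token found in s, or None if there is none."""
    m = WAX.search(s)
    return m.group("Wax") if m else None

for sample in (
    "Loc3_341001_WaxOnly_S212",
    "Loc4_34412-a_Wax4_S231",
    "Loc3a_231121-a_Wax-4-S451",
    "Loc3_34112_noWax_S311",
):
    print(sample, "->", extract_wax(sample))
```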
I assume the following match is desired.
the match must include 'Wax'
'Wax' is to be preceded by '_' or by '_no'. If the latter, 'no' is included in the match.
'Wax' may be followed by:
'Only' followed by '_', in which case 'Only' is part of the match, or
one or more digits, followed by '_', in which case the digits are part of the match, or
'-' followed by one or more digits, followed by '-', in which case
'-' followed by one or more digits is part of the match.
If these assumptions are correct the string can be matched against the following regular expression:
(?<=_)(?:(?:no)?Wax(?:(?:Only|\d+)?(?=_)|\-\d+(?=-)))
The regular expression can be broken down as follows.
(?<=_) # positive lookbehind asserts previous character is '_'
(?: # begin non-capture group
(?:no)? # optionally match 'no'
Wax # match literal
(?: # begin non-capture group
(?:Only|\d+)? # optionally match 'Only' or >=1 digits
(?=_) # positive lookahead asserts next character is '_'
| # or
\-\d+ # match '-' followed by >= 1 digits
(?=-) # positive lookahead asserts next character is '-'
) # end non-capture group
) # end non-capture group
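The breakdown above maps fairly directly onto a verbose-mode pattern. A small Python sketch, assuming re.VERBOSE (whitespace and # comments inside the pattern are ignored); wax_token is a hypothetical helper name.

```python
import re

WAX = re.compile(
    r"""
    (?<=_)                     # previous character is '_'
    (?:no)?Wax                 # optional 'no', then the literal 'Wax'
    (?:
        (?:Only|\d+)?(?=_)     # 'Only', digits or nothing; '_' must follow
      | -\d+(?=-)              # or '-' plus digits; '-' must follow
    )
    """,
    re.VERBOSE,
)

def wax_token(s):
    """Return the matched token (no capture groups needed), or None."""
    m = WAX.search(s)
    return m.group(0) if m else None

print(wax_token("Loc3a_231121-a_Wax-4-S451"))
```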

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
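The same lookaround pattern should also work unchanged in engines without \G. Here is a Python port as a sketch; it assumes Python's re.findall, which returns whole matches when the pattern has no capture groups. The names ID_OUTPUT and names are illustrative.

```python
import re

ID_OUTPUT = (
    "uid=1001(kawsay) gid=1001(kawsay) "
    "groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
)

# Text between parentheses, but only where "groups=" no longer appears
# later on the same line (i.e. we are already past it).
IN_GROUPS = re.compile(r"(?<=\()(?!.*\bgroups=).*?(?=\))")

names = IN_GROUPS.findall(ID_OUTPUT)
print(names)
```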
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.
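For comparison, the two-step approach could look roughly like this in Python. This is a sketch, not a port of the Ruby code line by line: it uses a capture group in each step instead of lookbehinds, and group_names is a hypothetical helper name.

```python
import re

def group_names(id_output):
    """Return the names listed after 'groups=' in the output of `id`."""
    # Step 1: keep only the text after "groups=" (to the end of the line).
    tail = re.search(r"\bgroups=(.*)", id_output)
    if tail is None:
        return []
    # Step 2: collect whatever sits between "(" and the next ")".
    return re.findall(r"\(([^)\r\n]+)", tail.group(1))

ID_OUTPUT = (
    "uid=1001(kawsay) gid=1001(kawsay) "
    "groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
)
print(group_names(ID_OUTPUT))
```

Splitting the work this way keeps each regex trivial, which is most of why it is easy to test.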

Regex match text after last '-'

I am really stuck with the following regex problem:
I want to remove the last piece of a string, but only if '-' occurs more than once in the string.
Example:
BOL-83846-M/L -> Should match -M/L and remove it
B0L-026O1 -> Should not match
D&F-176954 -> Should not match
BOL-04134-58/60 -> Should match -58/60 and remove it
BOL-5068-4 - 6 jaar -> Should match -4 - 6 jaar and remove it (maybe in multiple search/replace steps)
It would be no problem if the regex needs two (or more) steps to remove it.
Now I have
[^-]*$
But in sublime it matches B0L-026O1 and D&F-176954
Need your help please
You can match the first - in a capture group, and then match the second - till the end of the string to remove it.
In the replacement use capture group 1.
^([^-\n]*-[^-\n]*)-.*$
^ Start of string
( Capture group 1
[^-\n]*-[^-\n]* Match the first - between chars other than - (or a newline if you don't want to cross lines)
) Close group 1
-.*$ Match the second - and the rest of the line
Regex demo
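A quick way to check the replacement against the examples from the question, sketched in Python (re.sub with \1 as the replacement; strip_after_second_hyphen is a made-up name).

```python
import re

# Group 1 keeps everything up to (not including) the second hyphen.
SECOND_HYPHEN_ON = re.compile(r"^([^-\n]*-[^-\n]*)-.*$", re.MULTILINE)

def strip_after_second_hyphen(code):
    """Drop the second '-' and everything after it; otherwise no change."""
    return SECOND_HYPHEN_ON.sub(r"\1", code)

for code in ("BOL-83846-M/L", "B0L-026O1", "D&F-176954",
             "BOL-04134-58/60", "BOL-5068-4 - 6 jaar"):
    print(code, "->", strip_after_second_hyphen(code))
```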
You can match the following regular expression.
^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))
If the string contains two or more hyphens this returns the beginning of the string up to, but not including, the second hyphen; else it returns the entire string.
The regular expression can be broken down as follows.
^ # match the beginning of the string
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?: # begin a non-capture group
$ # match the end of the string
| # or
- # match a hyphen
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?= # begin a positive lookahead
- # match a hyphen
| # or
$ # match the end of the string
) # end positive lookahead
) # end non-capture group
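Since this variant matches the part to keep instead of the part to remove, it is used as a match, not a substitution. A minimal Python sketch, with keep_prefix as a hypothetical helper that falls back to the input if nothing matches:

```python
import re

# Matches from the start up to, but not including, the second hyphen;
# with fewer than two hyphens it matches the entire string.
KEEP = re.compile(r"^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))")

def keep_prefix(code):
    """Return the part of code before its second hyphen."""
    m = KEEP.match(code)
    return m.group(0) if m else code

print(keep_prefix("BOL-5068-4 - 6 jaar"))
print(keep_prefix("D&F-176954"))
```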

Capture function and its value using Regex

I have a text that contains the following function calls:
set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
Line
Value");
And I am trying to capture its data so that I get back:
'name'
or
'attribute'
The value just after "set_"
As well as the inside content:
value:"this is a test"
And
name:"description", value:"Some
Multi
Line
Value"
Respectively
I tried using this regex:
set_([A-Za-z_]+)\s*\(([\S\s]*?)\)
but it will fail if this is the set_attribute value:
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");
Because the (first) ) found there is captured by the regex
I am looking for a regex that would return "attribute" and the content via two group captures:
name:"description", value:"Some
Multi
(Line)
Value"
The desired strings could be extracted with the following regular expression, provided the single-line (DOTALL) flag is set, which causes the dot to match line terminators.
(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)
The first match is the substring between set_ and (; the second match is the substring between ( and ).
In Ruby, for example, this regex could be used as follows.
str = 'set_name(value:"this is a test");'
r = /(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)/m
after_set, inside_parens = str.scan(r)
after_set #=> "name"
inside_parens #=> "value:\"this is a test\""
Note that in Ruby single-line or DOTALL mode (dot matches line terminators) is denoted /m.
The regex engine performs the following operations.
/
(?<=^set_) : positive lookbehind asserts match is preceded by `set_` at
the beginning of the string
\w+ : match 1+ word characters
(?=\() : positive lookahead asserts following character is '('
| : or
(?<=\() : positive lookbehind asserts match is preceded by '('
.*? : match 0+ characters, as few as possible
(?=\);$) : positive lookahead asserts match is followed by ');' at
the end of the line
/m : flag to cause '.' to match line terminators
Each call ends with a semicolon, so you could require that character in the regex right after the closing ):
set_([A-Za-z_]+)\s*\(([\S\s]*?)\);
You may use
(?ms)^set_(\w+)\((.*?)\);$
See the regex demo.
Details
(?ms) - multiline (^ and $ match start/end of the line now) and dotall (. matches line break chars) modes are ON
^ - start of a line
set_ - a literal string
(\w+) - Group 1: one or more word chars
\( - a ( char
(.*?) - Group 2: any 0 or more chars, as few as possible
\); - ); substring...
$ - at the end of the line.
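Here is how that pattern might be exercised in Python, as a sketch. It assumes Python's re module, where the inline (?ms) flags are accepted at the very start of the pattern; SOURCE reuses the question's sample, including the tricky "(Line)".

```python
import re

SOURCE = '''set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");'''

# (?m): ^ and $ anchor at line boundaries; (?s): . also matches newlines.
CALL = re.compile(r"(?ms)^set_(\w+)\((.*?)\);$")

calls = CALL.findall(SOURCE)
for name, body in calls:
    print(repr(name), repr(body))
```

The ) inside "(Line)" is skipped because it is not followed by ; at the end of a line.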
Another way to get the values using the 2 capturing groups is to repeatedly match the key:values pairs between the opening and the closing parenthesis in group 2.
^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);
Explanation
^set_ Match set_ from the start of the string
( Capture group 1
[A-Za-z_]+ Match 1+ times any of the listed
) Close group 1
\s*\( Match 0+ whitespace chars and opening (
( Capture group 2
\w+:"[^"]+" Match 1+ word chars, then from opening " till closing "
(?:, ?\w+:"[^"]+")* Optionally repeat the previous pattern preceded by a comma and optional space
) Close group 2
\); Match the closing ) and the ;
Regex demo
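Because group 2 is built from key:"value" pairs, it can be taken apart again with a second, simpler regex. A Python sketch under the assumption that values never contain a double quote; parse_calls and the pair-splitting step are my own additions for illustration, not part of the pattern above.

```python
import re

CALL = re.compile(
    r'^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);',
    re.MULTILINE,
)
PAIR = re.compile(r'(\w+):"([^"]+)"')

def parse_calls(source):
    """Map each set_<name>(...) call to a dict of its key:"value" pairs."""
    return {
        name: dict(PAIR.findall(body))
        for name, body in CALL.findall(source)
    }

SOURCE = '''set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");'''
print(parse_calls(SOURCE))
```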

Regex remove last newline

Given the following ; delimited string
a;; z
toy;d;hh
toy
;b;;jj
z;
d;23
d;23td
;;io;
b y;b;12
z
a;b;bb;;;34
z
and this regex
^(?!(?:(a|d))(?:;|$)).*(\s*\z|$)\R*
I am looking to match the full lines whose first column is not a or d and remove them, so that substituting the matches with an empty string gives this:
a;; z
d;23
d;23td
a;b;bb;;;34
Please see the demo
In the Substitution panel, there is a 5th empty line, which needs to be removed.
I have used this \s*\z in the past for this purpose. As implemented here, it does not seem to work.
Any help is appreciated
I think the reason your regex won't remove the last newline, is that it is part of the end of the last part that you want to keep, so without matching it you can't remove it.
So I rewrote the regex to match the line you want to keep, but also to include everything above and below the match that is not another match.
The key difference is using a conditional to only match the newline of the group you want to keep if it is followed by another match.
regex (linebreaks for readability):
((?!(a|d)).*(\s*\z|$)\R*)*
(^(a|d).*(?(?=\R*(.*\s*\R+)*(a|b))\R))
((?!(a|d)).*(\s*\z|$)\R*)*
replace with $4 -->
a;; z
d;23
d;23td
a;b;bb;;;34
For readability I removed some of the non-capturing and string separator logic you had, if they are necessary you can add them back in.
Logic breakdown of the parts:
(?(?=\R*(.*\s*\R+)*(a|b))\R) is the conditional, it only matches the newline \R if (?) it is followed by (?=) any non-matching lines (.*\s*\R+)* that end in a newline followed by (a|b).
The middle part (^(a|d).*(?(?=\R*(.*\s*\R+)*(a|b))\R)) containing this ends up as the replacing group $4. It thus matches lines starting with (a|d), and all but the last match also match the newline at the end of their line.
The beginning and end of the regex ((?!(a|d)).*(\s*\z|$)\R*)* is exactly the same, and matches all the unneeded stuff so that it gets removed.
You could match what you want to remove, and capture in a group what you want to keep.
To avoid removing the newline sequences between kept lines, you could use a conditional, (?(...)...), that matches 0+ Unicode newline sequences only when no later line starts with [ad]; (a or d followed by a semicolon).
In the replacement, use group 1 ($1).
^(?:(?![ad];).*\R*)*|^([ad];.*(?:\R[ad];.*)*)(?(?![\s\S]*\R[ad];)\R*)
Explanation
^ Start of line
(?: Non capture group
(?![ad];) If the line does not start with a or d followed by ;
.*\R* Match the whole line and 0+ times a unicode newline sequence
)* Close group and repeat 0+ times to match all consecutive lines
| Or
^ Start of line
( Capture group 1
[ad];.* Match a or d followed by ; and the rest of the line
(?: Non capture group
\R[ad];.* Match newline, a or d followed by ; and the rest of the line
)* Close group and repeat 0+ times to match all consecutive lines
) Close group 1
(? If clause, only match a unicode newline sequence if the [ad]; pattern does not occur anymore
(?! Negative lookahead, assert what follows is not
[\s\S]*\R[ad]; Match the [ad]; pattern
) Close lookahead.
\R* If the assertion is true, Match 0+ unicode newline sequences
) Close if clause
See a Regex demo
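If a single substitution is not a hard requirement, the trailing-newline problem goes away when the text is filtered line by line and re-joined. This is a different technique from the patterns above, sketched in Python; keep_a_or_d is a hypothetical helper name.

```python
import re

DATA = "\n".join([
    "a;; z", "toy;d;hh", "toy", ";b;;jj", "z;", "d;23",
    "d;23td", ";;io;", "b y;b;12", "z", "a;b;bb;;;34", "z",
])

# First column is exactly "a" or "d": followed by ";" or the end of line.
FIRST_COL_A_OR_D = re.compile(r"(?:a|d)(?:;|$)")

def keep_a_or_d(text):
    """Keep lines whose first column is a or d; no trailing newline."""
    kept = [ln for ln in text.splitlines() if FIRST_COL_A_OR_D.match(ln)]
    return "\n".join(kept)

print(keep_a_or_d(DATA))
```

Joining with "\n" puts newlines only between kept lines, so there is no empty last line to clean up.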