Capture function and its value using Regex - regex

I have a text that contains the following function calls:
set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
Line
Value");
And I am trying to capture its data so that I get back:
'name'
or
'attribute'
The value just after "set_"
As well as the inside content:
value:"this is a test"
And
name:"description", value:"Some
Multi
Line
Value"
Respectively
I tried using this regex:
script_([A-Za-z_]+)\s*\(([\S\s]*?)\)
but it will fail if this is the set_attribute value:
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");
Because the (first) ) found there is captured by the regex
I am looking for a regex that would return "attribute" and the content via two group captures:
name:"description", value:"Some
Multi
(Line)
Value"

The desired strings could be extracted with the following regular expression, with the single-line or DOTALL flag is set, causing dot to match line terminators.
(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)
The first match is the substring between set_ and (; the second match is the substring between ( and ).
In Ruby, for example, this regex could be used as follows.
str = 'set_name(value:"this is a test");'
r = /(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)/m
after_set, inside_parens = str.scan(r)
after_set #=> "name"
inside_parens #=> "value:\"this is a test\""
Note that in Ruby single-line or DOTALL mode (dot matches line terminators) is denoted /m.
Start your engine!.
The regex engine performs the following operations.
/
(?<=^set_) : positive lookbehind asserts match is preceded by `set_` at
the beginning of the string
\w+ : match 1+ word characters
(?=\() : positive lookahead asserts following character is '('
| : or
(?<=\() : positive lookbehind asserts match is preceded by '('
.*? : match 0+ characters, as few as possible
(?=\);$) : positive lookahead asserts match is followed by ');' at
: the end of the line
/m : flag to cause '.' to match line terminators

Each line ends with character semicolon. You could add the character in regex after character ).
set_([A-Za-z_]+)\s*\(([\S\s]*?)\);
Demo

You may use
(?ms)^set_(\w+)\((.*?)\);$
See the regex demo.
Details
(?ms) - multiline (^ and $ match start/end of the line now) and dotall (. matches line break chars) modes are ON
^ - start of a line
set_ - a literal string
(\w+) - Group 1: one or more word chars
\( - a ( char
(.*?) - Group 2: any 0 or more chars, as few as possible
\); - ); substring...
$ - at the end of the line.

Another way to get the values using the 2 capturing groups is to repeatedly match the key:values pairs between the opening and the closing parenthesis in group 2.
^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);
Explanation
^set_ Match set_ form the start of the string
( Capture group 1
[A-Za-z_]+ Match 1+ times any of the listed
) Close group 1
\s*\( Match 0+ whitespace chars and opening (
( Capture group 2
\w+:"[^"]+" Match 1+ word chars, then from opening " till closing "
(?:, ?\w+:"[^"]+")* Optionally repeat the previous pattern preceded by a comma and optional space
) Close group 2
\); Match the closing )
Regex demo

Related

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Demo
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string with "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
Demo
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.* # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by`'groups'`,
# which is preceded by a word boundary
.* # match zero of more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\)), where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.

regular expression with If condition question

I have the following regular expressions that extract everything after first two alphabets
^[A-Za-z]{2})(\w+)($) $2
now I want to the extract nothing if the data doesn't start with alphabets.
Example:
AA123 -> 123
123 -> ""
Can this be accomplished by regex?
Introduce an alternative to match any one or more chars from start to end of string if your regex does not match:
^(?:([A-Za-z]{2})(\w+)|.+)$
See the regex demo. Details:
^ - start of string
(?: - start of a container non-capturing group:
([A-Za-z]{2})(\w+) - Group 1: two ASCII letters, Group 2: one or more word chars
| - or
.+ - one or more chars other than line break chars, as many as possible (use [\w\W]+ to match any chars including line break chars)
) - end of a container non-capturing group
$ - end of string.
Your pattern already captures 1 or more word characters after matching 2 uppercase chars. The $ does not have to be in a group, and this $2 should not be in the pattern.
^[A-Za-z]{2})(\w+)$
See a regex demo.
Another option could be a pattern with a conditional, capturing data in group 2 only if group 1 exist.
^([A-Z]{2})?(?(1)(\w+)|.+)$
^ Start of string
([A-Z]{2})? Capture 2 uppercase chars in optional group 1
(? Conditional
(1)(\w+) If we have group 1, capture 1+ word chars in group 2
| Or
.+ Match the whole line with at least 1 char to not match an empty string
) Close conditional
$ End of string
Regex demo
For a match only, you could use other variations Using \K like ^[A-Za-z]{2}\K\w+$ or with a lookbehind assertion (?<=^[A-Za-z]{2})\w+$

Is there a regex to grab all spaces that separate key value pairs

Is there a regex to extract all spaces that separate key+value pairs and ignoring those delimited by double quotes
sample:
key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2" nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""
this is where i come for so far: (?<!,) (?=\w+=), but of course it doesn't work.
[^\s="]+\s*=\s*(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^\s=]+)\K[ \t]+
PCRE demo
No need to write back. just matches space delimiters.
can replace with new delimiter
([^\s="]+\s*=\s*(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^\s=]+))([ \t]+)
Python demo
Can write back \1 or \2 if needed.
can replace with new delimiter
note - the part of the above expressions matching the field info
could benifit by placing Atomic group around (?>) but not strictly
necessary as the field structure is fairly concise.
are other options to garantee integrity as well like matching every
character with the use of the \G anchor if availibul.
let me know if need this approach.
many ways to go here
Here is another option:
".*?(?<!\\)"(*SKIP)(*F)| +
See the online demo
Please do let me know if it actually does what is required as I'm unsure. Anyways, here is a breakdown:
" - A literal double quote.
.*? - Anything but newline zero or more times but lazy.
(?<!\\) - A negative lookbehind for \.
" - A literal double quote.
(*SKIP)(*F) - Consume all characters of matches, force a failure and continue matching.
| - Alternation.
+ - One or more space characters.
If it's Python you are using, you'll need a reference to the PypI regex module.
You could do that with the following PCRE-compatible regular expression.
\G[^" \n]*(?:(?<!\\)"(?:[^\n"]|(?<=\\)")*(?<!\\)"[^" \n]*)*\K +
Start your engine!
\G : assert position at the end of the previous match
or the start of the string for the first match
[^" \n]* : match 0+ chars other than those in char class
(?: : begin non-capture group
(?<!\\) : use negative lookbehind to assert next char is not
preceded by a backslash
" : match double-quote
(?: : begin non-capture group
[^"\n] : match a char other than those in char class
| : or
(?<=\\) : use positive lookbehind to assert next char is
preceded by a backslash
" : match double-quote
) :end non-capture group
* : match non-capture group 0+ times
(?<!\\) : use negative lookbehind to assert next char is not
" : match double-quote
[^" \n]* : match 0+ chars other than those in char class
) : end non-capture group
* : match non-capture group 0+ times
\K : forget everything matched so far and reset start of match
\ + : match 1+ spaces

Exclude curly brace matches

I have the following strings:
logger.debug('123', 123)
logger.debug(`123`,123)
logger.debug('1bc','test')
logger.debug('1bc', `test`)
logger.debug('1bc', test)
logger.debug('1bc', {})
logger.debug('1bc',{})
logger.debug('1bc',{test})
logger.debug('1bc',{ test })
logger.debug('1bc',{ test})
logger.debug('1bc',{test })
Instead of debug there can be other calls like warn, fatal etc.
All quote pairs can be "", '' or ``.
I need to create a regular express which matches case 1 - 5 but not 6 - 11.
That's what I've come up with:
logger.*\(['`].*['`],\s*.([^{.*}])
This also matches 8 - 11, so I'm suspecting this part is wrong ([^{.*}]) but I don't get it why.
You can try this
logger\.[^(]+\((?:"(?:\\"|[^"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`),[^{}]*?\)
Regex Demo
P.S:- This pattern can be shorten if we are sure there won't be any mismatch of quotes, also if there won't be any escaped quote inside string
If there's no escaped string
logger\.[^(]+\((?:"[^"]*"|'[^']*'|`[^`]*`),[^{}]*?\)
If there's no quotes in between string. i.e no strings like "mr's jhon
logger\.[^(]+\(([`"'])[^"'`]*\1,[^{}]*?\)
If there are no quotes between the quoted parts, you could make use of a capturing group to match one of the quote types (['`"]) and use a backreference \1 to match the closing quote type.
The \r\n in the negated character class is to not cross newline boundaries.
The pattern will match either the quoted parts or 1+ times a word character for the first part.
The second part matches any char except { or } or ) using a negated character class.
logger\.[^(\r\n]+\((?:(['`"])[^'`"]+\1|\w+),[^{})\r\n]+\)
That will match
logger\. Match logger.
[^(\r\n]+ Match 1+ times any char except ( or a newline
\( Match (
(?: Non capture group
(['`"]) Capture group 1
[^'`"]+\1 Match 1+ times any char except the quote types, backreference to the captured
| or
\w+ Match 1+ word chars
), Close non capture group and match ,
[^{})\r\n]+ Match 1+ times any char except { } ) or a newline
\) Match )
Regex demo

regex match after word

I would like to know how to capture text only if the beginning of a line matching a certain string... but i dont want to capture the begining string...
for example if i have the text:
BEGIN_TAG: Text To Capture
WRONG_TAG: Text Not to Capture
i want to capture:
Text To Capture
From the line that begin with BEGIN_TAG: not the line that begin with WRONG_TAG:
I know the how to select the line that begin with the desired text: ^BEGIN_TAG:\W?(.*)
but this selects the text "BEGIN_TAG:"... i dont want this only want the text after "BEGIN_TAG"
I am using PCRE regex
Instead of a positive lookbehind that does not allow unknown width patterns, you may use a match reset operator \K:
^BEGIN_TAG:\W?\K.*
See the regex demo
Details:
^ - in Sublime, start of a line
BEGIN_TAG: - a string of literal chars
\W? - 1 or 0 non-word chars
\K - the match reset operator that discards all text matched so far
.* - any 0+ chars other than linebreak characters (the rest of the line) that are the only chars that will be kept in the matched text.
You can use lookbehind. Then, the text in the lookbehind group isn't part of the whole match. You can see it as an anchor like \b, ^, etc.
You then get:
(?<=^BEGIN_TAG:\W)(\w.*)$
Explained:
(?<= # Positive lookbehind group
^ # Start of line / string
BEGIN_TAG: # Literal
\W # A non-word character ([^a-zA-Z_])
)
( # First and only matching group (probably not needed)
\w # A word character ([a-zA-Z_])
.* # Any character, any number of times
)
$ # End of line / string