Can't save pattern matches in an array using Perl and regex

I am trying to save matched patterns in an array using Perl and a regex. The problem is that the saved match is missing some characters.
Example:
my @array;
my @temp_array;
my @types_U8 = ("uint8","vuint8","UCHAR");
foreach my $type (@types_U8)
{
    @temp_array = $str =~ /\(\s*\Q$type\E\s*\)\s*(0x[0-9ABCDEF]{3,}|\-[1-9]+)/g;
    push(@array,@temp_array);
    @temp_array = ();
}
So if $str = "any text (uint8)-1", the string saved in @temp_array is only ever "-1", not "(uint8)-1".

Your current regular expression is:
/\(\s*\Q$type\E\s*\)\s*(0x[0-9ABCDEF]{3,}|\-[1-9]+)/g
This means:
1. match a literal left paren: \(
2. match zero or more whitespace characters: \s*
3. match the value that is stored in $type: \Q$type\E
4. match zero or more whitespace characters: \s*
5. match a literal right paren: \)
6. match zero or more whitespace characters: \s*
7. START capturing group: (
8. match a hexadecimal number of 3 or more digits prefixed with 0x: 0x[0-9ABCDEF]{3,}
   OR
   match a literal dash, followed by 1 or more digits from 1 to 9: \-[1-9]+
9. END capturing group: )
As you can see above, your capturing group doesn't start until step 7, but you also want to capture $type and the literal parens.
Extend your capturing group to enclose those areas:
/(\(\s*\Q$type\E\s*\)\s*(?:0x[0-9ABCDEF]{3,}|\-[1-9]+))/;
This means:
1. START capturing group: (
2. match a literal left paren: \(
3. match zero or more whitespace characters: \s*
4. match the value that is stored in $type: \Q$type\E
5. match zero or more whitespace characters: \s*
6. match a literal right paren: \)
7. match zero or more whitespace characters: \s*
8. START non-capturing group: (?:
9. match a hexadecimal number of 3 or more digits prefixed with 0x: 0x[0-9ABCDEF]{3,}
   OR
   match a literal dash, followed by 1 or more digits from 1 to 9: \-[1-9]+
10. END non-capturing group: )
11. END capturing group: )
(Note: I removed the g (global) modifier because it is unnecessary when you expect only one match per string; keep it if $str can contain several matches.)
This change gives me a result of (uint8)-1
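For comparison, the same idea can be sketched outside Perl. The snippet below is a minimal Python illustration, not the asker's code: find_casts and TYPES_U8 are hypothetical names, re.escape stands in for \Q...\E, and the outer capture group is what makes findall return the whole "(type)value" text instead of only the value.

```python
import re

# Hypothetical list of type names, mirroring the question's @types_U8.
TYPES_U8 = ("uint8", "vuint8", "UCHAR")

def find_casts(text, types=TYPES_U8):
    """Return every '(type)value' occurrence, including the parenthesised type."""
    found = []
    for type_name in types:
        # The outer (...) captures the whole cast; (?:...) only groups the alternation.
        pattern = re.compile(
            r"(\(\s*" + re.escape(type_name) + r"\s*\)\s*(?:0x[0-9A-F]{3,}|-[1-9]+))"
        )
        found.extend(pattern.findall(text))
    return found

print(find_casts("any text (uint8)-1"))  # ['(uint8)-1']
```

Because findall collects every match, this corresponds to the Perl version with the g modifier kept.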

Related

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
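The lookaround expression above is not Ruby-specific. As a rough cross-check, here is a minimal Python sketch of the same pattern; the variable names are illustrative only, and Python's re module accepts this pattern because the lookbehind has a fixed width.

```python
import re

id_output = (
    "uid=1001(kawsay) gid=1001(kawsay) "
    "groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
)

# Same pattern as rgx: after '(', refuse to match if 'groups=' still lies ahead.
LOOKAROUND_RE = re.compile(r"(?<=\()(?!.*\bgroups=).*?(?=\))")

names = LOOKAROUND_RE.findall(id_output)
print(names)  # ['kawsay', 'sudo', 'video', 'gpio']
```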
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.
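The two-step approach carries over to other languages with little change. The following is a minimal Python sketch under that assumption; group_names is a hypothetical helper, and it uses plain capture groups rather than lookbehinds, which is a substitution of mine and not part of the answer above.

```python
import re

def group_names(id_output):
    """Return the names listed after 'groups=' in the output of `id`."""
    # Step 1: keep only the text that follows 'groups=' (up to the end of the line).
    tail = re.search(r"\bgroups=(.*)", id_output)
    if tail is None:
        return []
    # Step 2: collect whatever sits between each pair of parentheses.
    return re.findall(r"\(([^)\r\n]+)\)", tail.group(1))

sample = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
print(group_names(sample))  # ['kawsay', 'sudo', 'video', 'gpio']
```

Splitting the work this way keeps each regex trivial, so each step can be tested on its own.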

Capture function and its value using Regex

I have a text that contains the following function calls:
set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
Line
Value");
And I am trying to capture its data so that I get back:
'name'
or
'attribute'
The value just after "set_"
As well as the inside content:
value:"this is a test"
And
name:"description", value:"Some
Multi
Line
Value"
Respectively
I tried using this regex:
set_([A-Za-z_]+)\s*\(([\S\s]*?)\)
but it will fail if this is the set_attribute value:
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");
Because the lazy match stops at the first ) it finds, which here is the one right after Line.
I am looking for a regex that would return "attribute" and the content via two group captures:
name:"description", value:"Some
Multi
(Line)
Value"
The desired strings can be extracted with the following regular expression when the single-line (DOTALL) flag is set, which causes the dot to match line terminators.
(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)
The first match is the substring between set_ and (; the second match is the substring between ( and the closing );.
In Ruby, for example, this regex could be used as follows.
str = 'set_name(value:"this is a test");'
r = /(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)/m
after_set, inside_parens = str.scan(r)
after_set #=> "name"
inside_parens #=> "value:\"this is a test\""
Note that in Ruby single-line or DOTALL mode (dot matches line terminators) is denoted /m.
The regex engine performs the following operations.
/
(?<=^set_) : positive lookbehind asserts match is preceded by `set_` at
the beginning of the string
\w+ : match 1+ word characters
(?=\() : positive lookahead asserts following character is '('
| : or
(?<=\() : positive lookbehind asserts match is preceded by '('
.*? : match 0+ characters, as few as possible
(?=\);$) : positive lookahead asserts match is followed by ');' at
the end of the line
/m : flag to cause '.' to match line terminators
Each call ends with a semicolon, so you could add ; after \) in your regex:
set_([A-Za-z_]+)\s*\(([\S\s]*?)\);
You may use
(?ms)^set_(\w+)\((.*?)\);$
Details
(?ms) - multiline (^ and $ match start/end of the line now) and dotall (. matches line break chars) modes are ON
^ - start of a line
set_ - a literal string
(\w+) - Group 1: one or more word chars
\( - a ( char
(.*?) - Group 2: any 0 or more chars, as few as possible
\); - ); substring...
$ - at the end of the line.
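To see the two flags working together, here is a minimal Python sketch of the same pattern. CALL_RE and source are illustrative names, and re.MULTILINE | re.DOTALL is Python's spelling of the inline (?ms).

```python
import re

# MULTILINE lets ^ and $ anchor at line boundaries; DOTALL lets . cross newlines.
CALL_RE = re.compile(r"^set_(\w+)\((.*?)\);$", re.MULTILINE | re.DOTALL)

source = (
    'set_name(value:"this is a test");\n'
    'set_attribute(name:"description", value:"Some\n'
    'Multi\n'
    '(Line)\n'
    'Value");\n'
)

calls = CALL_RE.findall(source)
for name, args in calls:
    print(name, "->", repr(args))
```

The ) after Line no longer ends the match, because it is not followed by ; at the end of a line.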
Another way to get the values using the 2 capturing groups is to repeatedly match the key:value pairs between the opening and the closing parenthesis in group 2.
^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);
Explanation
^set_ Match set_ from the start of the string
( Capture group 1
[A-Za-z_]+ Match 1+ times any of the listed
) Close group 1
\s*\( Match 0+ whitespace chars and opening (
( Capture group 2
\w+:"[^"]+" Match 1+ word chars, then from opening " till closing "
(?:, ?\w+:"[^"]+")* Optionally repeat the previous pattern preceded by a comma and optional space
) Close group 2
\); Match the closing ) and the ;
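Matching the quoted values explicitly also guards against a case the lazy patterns cannot handle: a ); that ends a line inside a quoted value. The Python sketch below contrasts the two. The input string tricky is a made-up example, and LAZY_RE and PAIRS_RE are illustrative names.

```python
import re

# Lazy "anything up to );" pattern, with ^/$ per line and . matching newlines.
LAZY_RE = re.compile(r"^set_(\w+)\((.*?)\);$", re.MULTILINE | re.DOTALL)

# Pattern that walks over key:"value" pairs, so quoted text is consumed as a unit.
PAIRS_RE = re.compile(
    r'^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);',
    re.MULTILINE,
)

# Made-up input: the quoted value itself contains ");" at the end of a line.
tricky = 'set_note(value:"first);\nsecond");'

lazy_match = LAZY_RE.search(tricky)
pairs_match = PAIRS_RE.search(tricky)
print(repr(lazy_match.group(2)))   # 'value:"first'  (cut short)
print(repr(pairs_match.group(2)))  # 'value:"first);\nsecond"'
```

The trade-off is that the stricter pattern only accepts arguments of the form key:"value", and it does not allow escaped quotes inside a value.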

Exclude curly brace matches

I have the following strings:
logger.debug('123', 123)
logger.debug(`123`,123)
logger.debug('1bc','test')
logger.debug('1bc', `test`)
logger.debug('1bc', test)
logger.debug('1bc', {})
logger.debug('1bc',{})
logger.debug('1bc',{test})
logger.debug('1bc',{ test })
logger.debug('1bc',{ test})
logger.debug('1bc',{test })
Instead of debug there can be other calls like warn, fatal etc.
All quote pairs can be "", '' or ``.
I need to create a regular expression that matches cases 1-5 but not 6-11.
This is what I've come up with:
logger.*\(['`].*['`],\s*.([^{.*}])
This also matches 8-11, so I suspect that the ([^{.*}]) part is wrong, but I don't understand why.
You can try this
logger\.[^(]+\((?:"(?:\\"|[^"])*"|'(?:\\'|[^'])*'|`(?:\\`|[^`])*`),[^{}]*?\)
P.S. This pattern can be shortened if we are sure the quotes are never mismatched and the strings never contain escaped quotes.
If there are no escaped quotes inside the strings:
logger\.[^(]+\((?:"[^"]*"|'[^']*'|`[^`]*`),[^{}]*?\)
If there are no quotes at all inside the strings, i.e. no strings like "mr's jhon":
logger\.[^(]+\(([`"'])[^"'`]*\1,[^{}]*?\)
If there are no quotes between the quoted parts, you could make use of a capturing group to match one of the quote types (['`"]) and use a backreference \1 to match the closing quote type.
The \r\n in the negated character class is to not cross newline boundaries.
The pattern will match either the quoted parts or 1+ times a word character for the first part.
The second part matches any char except { or } or ) using a negated character class.
logger\.[^(\r\n]+\((?:(['`"])[^'`"]+\1|\w+),[^{})\r\n]+\)
That will match
logger\. Match logger.
[^(\r\n]+ Match 1+ times any char except ( or a newline
\( Match (
(?: Non capture group
(['`"]) Capture group 1
[^'`"]+\1 Match 1+ times any char except the quote types, then a backreference to the quote captured in group 1
| or
\w+ Match 1+ word chars
), Close non capture group and match ,
[^{})\r\n]+ Match 1+ times any char except { } ) or a newline
\) Match )
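A quick way to check a pattern like this is to run it over all eleven cases from the question. The Python sketch below does that for the last pattern shown; LOGGER_RE, cases and matched are illustrative names, and re.search is used on the assumption that each call sits on its own line.

```python
import re

# The pattern from the answer above, in a raw triple-quoted string so that
# all three quote characters can appear unescaped.
LOGGER_RE = re.compile(
    r'''logger\.[^(\r\n]+\((?:(['`"])[^'`"]+\1|\w+),[^{})\r\n]+\)'''
)

cases = [
    "logger.debug('123', 123)",
    "logger.debug(`123`,123)",
    "logger.debug('1bc','test')",
    "logger.debug('1bc', `test`)",
    "logger.debug('1bc', test)",
    "logger.debug('1bc', {})",
    "logger.debug('1bc',{})",
    "logger.debug('1bc',{test})",
    "logger.debug('1bc',{ test })",
    "logger.debug('1bc',{ test})",
    "logger.debug('1bc',{test })",
]

matched = [bool(LOGGER_RE.search(line)) for line in cases]
for number, (line, hit) in enumerate(zip(cases, matched), start=1):
    print(number, "match" if hit else "no match", line)
```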

Regex pattern for key and value in Kotlin

I am trying to split this string into key-value pairs using a regex
val x = "title=MyTitle, active=true, title2=MyTitle, Subtitle, new=false, title3=My Title#subtitle1"
I tried this pattern
([\w]+)=(.*?)([\w]+)
the output is
title=MyTitle
active=true
title2=MyTitle
new=false
title3=My
Any clue how to modify the regex so that the output becomes
title=MyTitle
active=true
title2=MyTitle, Subtitle
new=false
title3=My Title#subtitle1
A look-ahead works pretty well
(\w+)=(.*?(?=,\s*\w+=|\s*$))
Breakdown
( # group 1 (key)
\w+ # word characters
) # end group 1
= # equals sign
( # group 2 (value)
.*? # anything, non-greedy
(?= # look-ahead ("followed by")...
,\s* # comma and spaces
\w+= # word characters and an equals sign (the next key)
| # or
\s*$ # spaces and the end of string
) # end look-ahead
) # end group 2
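The look-ahead pattern is not tied to Kotlin. As a rough check of the breakdown above, here is a minimal Python sketch that runs it over the string from the question; PAIR_RE and pairs are illustrative names.

```python
import re

# Group 1 is the key; group 2 is the value, which stops right before
# ", nextKey=" or before trailing whitespace at the end of the string.
PAIR_RE = re.compile(r"(\w+)=(.*?(?=,\s*\w+=|\s*$))")

x = "title=MyTitle, active=true, title2=MyTitle, Subtitle, new=false, title3=My Title#subtitle1"

pairs = PAIR_RE.findall(x)
for key, value in pairs:
    print(f"{key}={value}")
```

The value "MyTitle, Subtitle" stays in one piece because the comma after MyTitle is not followed by a key and an equals sign.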
You might use 2 capturing groups.
In the first group capture 1+ word chars. In the second group capture the words asserting what is on the right is not a whitespace char followed by a word and an equals sign.
(\w+)=([\w+,]+(?:(?!\s\w+=)\s[#\w]+)*)(?:,|$)
(\w+) Capture group 1, match 1+ word chars
= Match =
( Capture group 2
[\w+,]+ Match 1+ word chars, + or ,
(?: Non capturing group
(?!\s\w+=) Assert what is on the right is not a whitespace, 1+ word chars and =
\s[#\w]+ Match a whitespace and 1+ word chars or #
)* Close group and repeat 0+ times
) Close group 2
(?:,|$) Match either , or assert the end of the string
Or you might use a broader match for the value part using a negated character class instead of matching the specified characters.
(\w+)=([^=\s]+(?:(?!\s\w+=)\s[^=\s,]+)*)(?:,|$)

Perl regex to extract digits from string with parenthesis

I have the following string:
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another variable.
The following statement does not work:
my ($NAdapter) = $string =~ /\((\d+)\)/;
What is the correct syntax?
\d+(?=[^(]*\))
You can use this. See the demo. Yours will not work because there is more inside the () than just \d+.
https://regex101.com/r/fM9lY3/57
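You could try something like
my ($NAdapter) = $string =~ /\(.*?(\d+).*\)/;
After that, $NAdapter should contain the number that you want. (The first quantifier is lazy, .*?, because a greedy .* would leave only the last digit of a multi-digit number for the capture group.)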
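Greediness matters in a pattern of this shape: a greedy .* in front of (\d+) swallows as many digits as it can and leaves only the last one for the group. The minimal Python sketch below shows the difference on a multi-digit number; the sample string is adapted from the question and the variable names are illustrative.

```python
import re

text = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8"

# Greedy .* backtracks only far enough to leave a single digit for (\d+).
greedy = re.search(r"\(.*(\d+).*\)", text).group(1)

# Lazy .*? stops at the first digit, so (\d+) gets the whole number.
lazy = re.search(r"\(.*?(\d+).*\)", text).group(1)

print(greedy)  # 4
print(lazy)    # 1234
```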
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another
variable
Your regex (with some spaces for clarity):
/ \( (\d+) \) /x;
says to match:
A literal opening parenthesis, immediately followed by...
A digit, one or more times (captured in group 1), immediately followed by...
A literal closing parenthesis.
Yet, the substring you want to match:
(NIC 1)
is of the form:
A literal opening parenthesis, immediately followed by...
Some capital letters
STOP EVERYTHING! NO MATCH!
As an alternative, your substring:
(NIC 1)
could be described as:
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
(\d+) #Match any digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
#Parentheses have a special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
1234
Another description of your substring:
(NIC 1)
could be:
A literal opening parenthesis, immediately followed by...
Some non-digits, immediately followed by...
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
789
If there might be spaces on some lines and not others, such as:
     spaces
      ||
      VV
(NIC 1 )
(NIC 2)
You can insert a \s* (any whitespace, zero or more times) in the appropriate place in the regex, for instance:
my ($match) = $string =~ /
#Parentheses have special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\s* #any whitespace, zero or more times, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
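The final pattern can be exercised against each variant discussed above. Here is a minimal Python sketch of it; adapter_number and ADAPTER_RE are hypothetical names, and returning None when nothing matches is a choice made for this sketch.

```python
import re

# '(' + non-digits + captured digits + optional whitespace + ')'
ADAPTER_RE = re.compile(r"\(\D+(\d+)\s*\)")

def adapter_number(line):
    """Return the digits inside the parentheses as a string, or None if absent."""
    match = ADAPTER_RE.search(line)
    return match.group(1) if match else None

print(adapter_number("Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8"))       # 1
print(adapter_number("Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8"))  # 789
print(adapter_number("Ethernet FlexNIC (NIC 1 ) LOM1:1-a"))                        # 1
```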