I'm trying to get all the words between the parentheses after a specific word and the end of the string.
For example, I have this case:
p " some other text in downcase LOREM (foo, bar)".scan(/ LOREM \((.*?)\)\z/m)
# [["foo, bar"]]
The regex captures foo, bar, which is between the parentheses. That's fine, but I'd like to get them as two separate elements within a single array, meaning:
["foo", "bar"]
That is to say, the regex should return each word as a separate element.
My intention is to get everything between LOREM ( and the last closing parenthesis ).
I've tried adding (\b\w+\b), which groups every word in the string. But when I add it to my attempt to get the words from inside the parentheses, it returns nothing.
You may use
.scan(/(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/)
See the Ruby demo and the Rubular regex demo. You may replace \w+ with [[:alnum:]]+, or \p{L}+ (to match only letters), or [^\s,()]+ (to match any 1+ chars other than whitespace, ,, ( and )); it all depends on what you want to match inside the parentheses.
Details
(?:\G(?!\A)\s*,\s*|\sLOREM\s+\() - either the end of the previous successful match and a , enclosed with 0+ whitespaces, or whitespace, LOREM, 1+ whitespaces and (
\K - omit the text matched so far
\w+ - consume 1+ word chars
(?=[^()]*\)\z) - immediately to the right, there must be 0 or more chars other than ( and ) and then ) at the end of the string.
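As a rough sketch of how this pattern behaves with String#scan, the snippet below runs it on the sample string from the question. The constant name WORDS_IN_PARENS and the helper lorem_words are made up for illustration; only the regex comes from the answer above.

```ruby
# Hypothetical names; only the pattern itself comes from the answer.
WORDS_IN_PARENS = /(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/

# Returns each word inside the trailing "LOREM (...)" group as its own element.
def lorem_words(str)
  str.scan(WORDS_IN_PARENS)
end

p lorem_words(" some other text in downcase LOREM (foo, bar)")
# => ["foo", "bar"]

# Nothing is returned when the closing parenthesis is not at the end of the string.
p lorem_words(" some other text LOREM (foo, bar) trailing")
# => []
```

Because the pattern has no capture groups, scan returns a flat array of whole matches rather than an array of arrays.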
r = /
(?<= # begin a positive lookbehind
LOREM[ ] # match 'LOREM '
\( # match left paren
| # or
,[ ] # match a comma followed by a space
) # end positive lookbehind
(?: # begin a non-capture group
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
| # or
\" # match a double quote
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
\" # match a double quote
) # end non-capture group
(?= # begin a positive lookahead
.*\) # match any number of characters followed by a right paren
) # end positive lookahead
/x # free-spacing regex definition mode
Conventionally this is written
r = /(?<=LOREM \(|, )(?:[^, ")]+|\"[^, ")]+\")(?=.*\))/
Let's try it.
str = "some other text in downcase LOREM (foo, \"bar\", \"baz), daz"
str.scan(r)
#=> ["foo", "\"bar\""]
The first match, "foo", matches
str.scan /(?<=LOREM \()[^, ")]+/
#=> ["foo"]
That is, this matches one or more characters other than a comma, space, double quote or right parenthesis, immediately preceded by "LOREM " followed by a left parenthesis.
The next attempted match begins at the end of "foo". There is no match of "L" in "LOREM", so an attempt is made to match ", ", which succeeds. [^, ")]+ does not match "bar" (it begins with a double quote), so an attempt is made to match \"[^, ")]+\", which is successful. As ", " is matched within the lookbehind, it is not part of the match returned. This matches '"bar"'.
\"baz is not matched because it has no closing double quote, and daz is not matched because no right parenthesis follows it.
I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Demo
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
See the regex demo. Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
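A minimal sketch of this pattern in use, assuming Ruby's regex engine (which also supports \G and \K) and the sample id output from the question. The constant name GROUP_NAMES is hypothetical.

```ruby
# Sample output of the id command, as given in the question.
id_output = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"

# The pattern from the answer above; GROUP_NAMES is a hypothetical constant name.
GROUP_NAMES = /(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+/

names = id_output.scan(GROUP_NAMES)
p names
# => ["kawsay", "sudo", "video", "gpio"]
```

The names in the uid= and gid= fields are skipped because a match can only start at groups= or directly after the previous match.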
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
Demo
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.
I am really stuck with the following regex problem:
I want to remove the last piece of a string, but only if '-' occurs more than once in the string.
Example:
BOL-83846-M/L -> Should match -M/L and remove it
B0L-026O1 -> Should not match
D&F-176954 -> Should not match
BOL-04134-58/60 -> Should match -58/60 and remove it
BOL-5068-4 - 6 jaar -> Should match -4 - 6 jaar and remove it (maybe in multiple search/replace steps)
It would be no problem if the regex needs two (or more) steps to remove it.
Now I have
[^-]*$
But in Sublime Text it matches B0L-026O1 and D&F-176954
Need your help please
You can match the first - in a capture group, and then match the second - till the end of the string to remove it.
In the replacement use capture group 1.
^([^-\n]*-[^-\n]*)-.*$
^ Start of string
( Capture group 1
[^-\n]*-[^-\n]* Match the first - between chars other than - (or a newline if you don't want to cross lines)
) Close capture group 1
-.*$ Match the second - and the rest of the line
Regex demo
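To illustrate the replacement step, here is a small sketch in Ruby (an assumption, since the question mentions Sublime Text rather than a language). The constant and method names are hypothetical; the pattern is the one shown above, and '\1' stands for capture group 1.

```ruby
# Pattern from the answer above; TRAILING_PART is a hypothetical constant name.
TRAILING_PART = /^([^-\n]*-[^-\n]*)-.*$/

# Keeps everything before the second hyphen; strings with fewer than two
# hyphens do not match and are returned unchanged.
def strip_after_second_hyphen(code)
  code.sub(TRAILING_PART, '\1')
end

samples = ["BOL-83846-M/L", "B0L-026O1", "D&F-176954", "BOL-04134-58/60", "BOL-5068-4 - 6 jaar"]
samples.each { |s| puts "#{s} -> #{strip_after_second_hyphen(s)}" }
# BOL-83846-M/L -> BOL-83846
# B0L-026O1 -> B0L-026O1
# D&F-176954 -> D&F-176954
# BOL-04134-58/60 -> BOL-04134
# BOL-5068-4 - 6 jaar -> BOL-5068
```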
You can match the following regular expression.
^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))
Demo
If the string contains two or more hyphens this returns the beginning of the string up to, but not including, the second hyphen; else it returns the entire string.
The regular expression can be broken down as follows.
^ # match the beginning of the string
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?: # begin a non-capture group
$ # match the end of the string
| # or
- # match a hyphen
[^-\r\n]* # match zero or more characters other than hyphens,
# carriage returns and linefeeds
(?= # begin a positive lookahead
- # match a hyphen
| # or
$ # match the end of the string
) # end positive lookahead
) # end non-capture group
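Since this expression extracts the part to keep rather than removing the tail, it can be applied with String#[] in Ruby. The following is only a sketch under that assumption; the constant and method names are made up for illustration.

```ruby
# Pattern from the answer above; UP_TO_SECOND_HYPHEN is a hypothetical name.
UP_TO_SECOND_HYPHEN = /^[^-\r\n]*(?:$|-[^-\r\n]*(?=-|$))/

# Returns the text up to, but not including, the second hyphen,
# or the whole string when it has fewer than two hyphens.
def keep_up_to_second_hyphen(code)
  code[UP_TO_SECOND_HYPHEN]
end

p keep_up_to_second_hyphen("BOL-83846-M/L")        # => "BOL-83846"
p keep_up_to_second_hyphen("B0L-026O1")            # => "B0L-026O1"
p keep_up_to_second_hyphen("BOL-5068-4 - 6 jaar")  # => "BOL-5068"
```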
I have lines which consist of:
optional: "no ",
optional: two non whitespace characters,
several non whitespace characters.
I want to capture string from each line which consist of:
optional: two non whitespace characters (but not "no " part),
several non whitespace characters.
Example lines:
ab123
ab 123
no abc123
no ab 123
I want to capture:
ab123
ab 123
abc123
ab 123
My regexp (works only for examples without "no ").
^
(?! no \s) # not "no "
( # match it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space, BUT NOT GET "no " (it doesn't work)
\S+ # non whitespace characters
)
$
Example online (4 unit tests): https://regex101.com/r/70soe2/1
Maybe I should use a negative lookahead (?! no \s) or a negative lookbehind (?<! no \s) in some way? But I don't know how to use them.
You cannot actually rely on lookarounds here, you need to consume the optional no + whitespace part of the string.
It is best to use a non-capturing optional group at the start:
^
(?: no \s)? # optional "no ", consumed but not captured
( # capture it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space
\S+ # non whitespace characters
)
$
See the regex demo
The value you need is inside Group 1.
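As a quick check of the optional non-capturing group approach, the sketch below runs it over the four example lines in Ruby (an assumed language; the question only links to regex101). LINE_PATTERN is a hypothetical name, and line[LINE_PATTERN, 1] reads capture group 1.

```ruby
LINE_PATTERN = /
  ^
  (?: no \s )?          # optional prefix: consumed, but outside the capture group
  (                     # capture group 1: the part to keep
    (?: \S{1,2} \s )?   # optional 1-2 non-whitespace characters and one whitespace
    \S+                 # one or more non-whitespace characters
  )
  $
/x

lines = ["ab123", "ab 123", "no abc123", "no ab 123"]
captured = lines.map { |line| line[LINE_PATTERN, 1] }
p captured
# => ["ab123", "ab 123", "abc123", "ab 123"]
```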
If your regex engine supports \K construct, you may use this instead:
^
(?:no \s \K)? # optional "no "; \K drops it from the reported match
( # match it
(?: \S{1,2} \s )? # optional: 1-2 non whitespace characters and one space
\S+ # non whitespace characters
)
$
The \K in (?:no \s \K)? will omit the consumed string part from the match value, and you will get the expected result as a whole match value.
See the regex demo
I'm using regular expressions in a custom text editor to in effect whitelist certain modules (assert and crypto). I'm close to what I need but not quite there. Here it is:
/require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)/
I want the regular expression to match any line with require('foo'); where foo is anything except 'assert' or 'crypto'. The case that fails is require('assert ');, which my regex does not match, whereas require(' assert'); is correctly matched.
https://regexr.com/4i6ot
If you don't want to match assert or crypto between ', you could change the lookahead to assert exactly that. You can omit the word boundaries matching the words right after the '.
If what follows should match until the first occurrence of ', you could use a negated character class [^'\r\n]* to match any char except ' or a newline.
require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)
Regex demo
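A short sketch of how this pattern treats the cases from the question, written in Ruby as an assumption (the question's editor and its regex flavor are not specified). The constant and method names are hypothetical.

```ruby
# Pattern from the answer above; the constant name is hypothetical.
NON_WHITELISTED_REQUIRE = /require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)/

# True when the line requires a module other than exactly 'assert' or 'crypto'.
def flagged?(line)
  NON_WHITELISTED_REQUIRE.match?(line)
end

p flagged?("require('foo');")      # => true
p flagged?("require('assert');")   # => false
p flagged?("require('crypto');")   # => false
p flagged?("require('assert ');")  # => true
p flagged?("require(' assert');")  # => true
```

The closing ' inside the lookahead is what makes 'assert ' (with a trailing space) count as a different module name.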
You can use: require\s*\(\s*'(?!(\bassert'|\bcrypto')).*'\s*\)
Online demo
The difference is that I replaced the trailing word boundary \b with ' at the end of the module names. With \b, a module name of 'assert ' satisfied the pattern inside the negative lookahead, because there is a word boundary between t and the following space, so the overall match was rejected. In the new version, we require ' at the end of the name of the module.
EDIT
As Cary Swoveland advised, leading \b are not required:
require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)
Demo
I assume from the flawed regex that if there is a match the string between "('" and "')" is to be captured. One way to do that follows.
r = /
require # match word
\ * # match zero or more spaces (note escaped space)
\( # match a left paren
(?! # begin a negative lookahead
' # match a single quote
(?:assert|crypto) # match either word
' # match a single quote
(?=\)) # match a right paren in a forward lookahead
) # end negative lookahead
' # match a single quote
(.*?) # match any number of characters lazily in capture group 1
' # match a single quote
\) # match a right paren
/x # free-spacing regex definition mode
As the capture group is followed by a single quote, matching characters in the capture group lazily ensures that a single quote is not matched in the capture group. I could have instead written ([^']*). In conventional form this regex is written as follows:
r = /require *\((?!'(?:assert|crypto)'(?=\)))'(.*?)'\)/
Note that in free-spacing regex definition mode spaces will be removed unless they are escaped, put in a character class ([ ]), replaced with \p{Space} and so on.
"require ('victory')" =~ r #=> 0
$1 #=> "victory"
"require (' assert')" =~ r #=> 0
$1 #=> " assert"
"require ('assert ')" =~ r #=> 0
$1 #=> "assert "
"require ('crypto')" =~ r #=> nil
"require ('assert')" =~ r #=> nil
"require\n('victory')" =~ r #=> nil
Notice that had I replaced the space character in the regex with \s, the last example would have returned:
"require\n('victory')" =~ r #=> 0
$1 #=> "victory"
I don't think you need anything remotely that complicated, this simple pattern will work just fine:
require\((?!'crypto'|'assert')'.*'\);
regex101 demo
The string I want to match shouldn't have a following alphabetical letter except for 's' but it can have any following digit or symbol.
Note: Alphabetical letters are allowed after the string, but only if they are preceded by whitespace or a symbol.
For the root msl,
Should match: msl, msls, msl123, msl123s, mslss, mslss xxx, x_msl, x_msl_x
Shouldn't match: msled, mslsxxx, xmsl_x
"msl" matches ".*" + "word_msl" + "(What Regex to put here?).*"
[The question originally had perl as a tag]
my $root = "msl";
/
(?<![^\s_]) # At the start of a "word" or after a "_"
\Q$root\E # Match the value of $root literally
(?: \S* s # Non-whitespace characters ending with "s", or
| [^\ss]* # Non-whitespace, non-"s" characters
)
(?!\S) # At the end of a "word"
/x
Optimized:
my $root = "msl";
/
(?<![^\s_]) # At the start of a "word" or after a "_"
\Q$root\E # Match the value of $root literally
[^\ss]*+ # Non-whitespace, non-"s" characters
(?: s (?: [^\ss]*+ s )*+ )?+ # Optionally, non-whitespace characters ending with "s"
(?!\S) # At the end of a "word"
/x
In this case, a "word" is considered to be a sequence of non-whitespace characters delimited by whitespace, the start of the string and/or the end of the string.