Regular expression to match requiring a module? - regex

I'm using regular expressions in a custom text editor to in effect whitelist certain modules (assert and crypto). I'm close to what I need but not quite there. Here it is:
/require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)/
I want the regular expression to match any line with require('foo'); where foo is anything except 'assert' or 'crypto'. The case that fails is require('assert '); (with a trailing space), which my regex does not match, whereas require(' assert'); is matched correctly.
https://regexr.com/4i6ot

If you don't want to match assert or crypto between ', you could change the lookahead to assert exactly that. You can omit the word boundaries matching the words right after the '.
If what follows should match until the first occurrence of ', you could use a negated character class [^'\r\n]* to match any char except ' or a newline.
require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)
Regex demo
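A quick way to check the pattern against the cases from the question is a small Python sketch. The name is_blocked and the sample lines are illustrative assumptions, not part of the original editor setup, and the inner group is written as non-capturing here.

```python
import re

# Pattern from the answer above, with the inner group made non-capturing.
BLOCKED_REQUIRE = re.compile(
    r"require\s*\(\s*'(?!(?:assert|crypto)')[^'\r\n]*'\s*\)"
)


def is_blocked(line: str) -> bool:
    """Return True when the line requires a module other than assert or crypto."""
    return BLOCKED_REQUIRE.search(line) is not None


if __name__ == "__main__":
    samples = [
        "require('foo');",
        "require('assert');",
        "require('assert ');",
        "require(' assert');",
        "require('crypto');",
    ]
    for sample in samples:
        print(sample, "blocked" if is_blocked(sample) else "allowed")
```

Because the lookahead ends with the closing quote, only the exact names assert and crypto are let through; anything longer or padded with spaces is still reported.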

You can use: require\s*\(\s*'(?!(\bassert'|\bcrypto')).*'\s*\)
Online demo
The difference is that I replaced the trailing word boundary \b with ' at the end of each module name. With \b, a module name of 'assert ' was still caught by the negative lookahead, because \b matches at the boundary between t and the following space. In the new version, the lookahead requires ' immediately after the module name.
EDIT
As Cary Swoveland advised, leading \b are not required:
require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)
Demo
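To see the effect of the trailing \b, the question's pattern and the corrected one can be compared side by side. This is a minimal Python sketch; the names ORIGINAL, FIXED and matches are made up for the illustration.

```python
import re

# The pattern from the question: \b after "assert" is satisfied by a following space.
ORIGINAL = re.compile(r"require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)")

# The corrected pattern: the lookahead insists on the closing quote instead.
FIXED = re.compile(r"require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)")


def matches(pattern: re.Pattern, line: str) -> bool:
    """Return True when the pattern is found anywhere in the line."""
    return pattern.search(line) is not None


if __name__ == "__main__":
    line = "require('assert ');"
    print("original:", matches(ORIGINAL, line))
    print("fixed:   ", matches(FIXED, line))
```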

I assume from the flawed regex that if there is a match the string between "('" and "')" is to be captured. One way to do that follows.
r = /
require # match word
\ * # match zero or more spaces (note escaped space)
\( # match a left paren
(?! # begin a negative lookahead
' # match a single quote
(?:assert|crypto) # match either word
' # match a single quote
(?=\)) # match a right paren in a forward lookahead
) # end negative lookahead
' # match a single quote
(.*?) # match any number of characters lazily in a capture group 1
' # match a single quote
\) # match a right paren
/x # free-spacing regex definition mode
As the capture group is followed by a single quote and a right paren, matching lazily makes the capture group stop at the first single quote that is followed by a right paren. I could have instead written ([^']*), which never lets a single quote into the capture group. In conventional form this regex is written as follows:
r = /require *\((?!'(?:assert|crypto)'(?=\)))'(.*?)'\)/
Note that in free-spacing regex definition mode spaces will be removed unless they are escaped, put in a character class ([ ]), replaced with \p{Space} and so on.
"require ('victory')" =~ r #=> 0
$1 #=> "victory"
"require (' assert')" =~ r #=> 0
$1 #=> " assert"
"require ('assert ')" =~ r #=> 0
$1 #=> "assert "
"require ('crypto')" =~ r #=> nil
"require ('assert')" =~ r #=> nil
"require\n('victory')" =~ r #=> nil
Notice that had I replaced the space character in the regex with "\s", the last example would have given:
"require\n('victory')" =~ r #=> 0
$1 #=> "victory"
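For readers outside Ruby, roughly the same pattern can be sketched in Python with re.VERBOSE, which plays the role of /x. The names LAZY, STRICT and captured are assumptions for this illustration; STRICT is the ([^']*) variant mentioned above.

```python
import re

LAZY = re.compile(
    r"""
    require                  # the literal word
    \ *                      # zero or more spaces (escaped: VERBOSE drops bare spaces)
    \(                       # a left paren
    (?!                      # negative lookahead:
      '(?:assert|crypto)'    #   an allowed module name in quotes
      (?=\))                 #   directly followed by a right paren
    )
    '(.*?)'                  # capture the module name lazily
    \)                       # a right paren
    """,
    re.VERBOSE,
)

# Same pattern in one line, with a negated class instead of the lazy dot.
STRICT = re.compile(r"require *\((?!'(?:assert|crypto)'(?=\)))'([^']*)'\)")


def captured(pattern: re.Pattern, text: str):
    """Return capture group 1, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    for text in ["require ('victory')", "require ('assert ')", "require ('crypto')"]:
        print(repr(text), "->", repr(captured(LAZY, text)))
```

The two variants differ on input with a stray quote: for require('a'b') the lazy version captures a'b, while the negated class refuses to match at all.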

I don't think you need anything remotely that complicated, this simple pattern will work just fine:
require\((?!'crypto'|'assert')'.*'\);
regex101 demo

Related

RegEx pattern to get text surrounded with quotes that ends with backslash [duplicate]

I'm trying to write a regex that matches strings as the following:
translate("some text here")
and
translate('some text here')
I've done that:
preg_match ('/translate\("(.*?)"\)*/', $line, $m)
but how to add if there are single quotes, not double. It should match as single, as double quotes.
You could go for:
translate\( # translate( literally
(['"]) # capture a single/double quote to group 1
.+? # match anything except a newline lazily
\1 # up to the formerly captured quote
\) # and a closing parenthesis
See a demo for this approach on regex101.com.
In PHP this would be:
<?php
$regex = '~
translate\( # translate( literally
([\'"]) # capture a single/double quote to group 1
.+? # match anything except a newline lazily
\1 # up to the formerly captured quote
\) # and a closing parenthesis
~x';
if (preg_match($regex, $string)) {
// do sth. here
}
?>
Note that the backslash before the single quote in ([\'"]) is only there because the pattern sits inside a single-quoted PHP string; inside a character class neither quote needs escaping for the regex itself.
Bear in mind though, that this is rather error-prone (what about whitespaces, escaped quotes ?).
In the comments the discussion came up that you cannot say anything BUT the first captured group. Well, yes, you can (thanks to Obama here), the technique is called a tempered greedy token which can be achieved via lookarounds. Consider the following code:
translate\(
(['"])
(?:(?!\1).)*
\1
\)
It opens a non-capturing group with a negative lookahead that makes sure not to match the formerly captured group (a quote in this example).
This eradicates matches like translate("a"b"c"d") (see a demo here).
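The tempered greedy token can be tried out with a short Python sketch. The second capture group around the token and the name translated_text are additions made for this illustration, so that the quoted text can be read back.

```python
import re

# (?:(?!\1).)* consumes one character at a time, but only while that character
# is not the quote captured in group 1.
TEMPERED = re.compile(r"""translate\((['"])((?:(?!\1).)*)\1\)""")


def translated_text(call: str):
    """Return the quoted text of a translate(...) call, or None if it does not fit."""
    m = TEMPERED.fullmatch(call)
    return m.group(2) if m else None


if __name__ == "__main__":
    print(translated_text('translate("some text here")'))
    print(translated_text("translate('some text here')"))
    print(translated_text('translate("a"b"c"d")'))
```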
The final expression to match all given examples is:
translate\(
(['"])
(?:
.*?(?=\1\))
)
\1
\)
#translate\(
([\'"]) # capture quote char
((?:
(?!\1). # not a quote
| # or
\\\1 # escaped one
)* #
[^\\\\]?)\1 # match unescaped last quote char
\)#gx
Fiddle:
ok: translate("some text here")
ok: translate('some text here')
ok: translate('"some text here..."')
ok: translate("a\"b\"c\"d")
ok: translate("")
no: translate("a\"b"c\"d")
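The fiddle cases can also be reproduced in Python. Note that the sketch below is not the exact expression above but a common variant of the same idea: each step consumes either a backslash plus the character it escapes, or a single character that is neither the opening quote nor a backslash. The names ESCAPED_AWARE and quoted_argument are made up for the example.

```python
import re

ESCAPED_AWARE = re.compile(
    r"""translate\((['"])((?:\\.|(?!\1)[^\\])*)\1\)"""
)


def quoted_argument(call: str):
    """Return the raw text between the quotes, or None when the call is malformed."""
    m = ESCAPED_AWARE.fullmatch(call)
    return m.group(2) if m else None


if __name__ == "__main__":
    cases = [
        r'translate("some text here")',
        r"translate('some text here')",
        r'translate("a\"b\"c\"d")',
        r'translate("")',
        r'translate("a\"b"c\"d")',
    ]
    for case in cases:
        print("ok:" if quoted_argument(case) is not None else "no:", case)
```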
You can alternate expression components using the pipe (|) like this:
preg_match ('/translate(\("(.*?)"\)|\(\'(.*?)\'\))/', $line, $m)
Edit: previous also matched translate("some text here'). This should work but you will have to escape the quotes in some languages.

Regex Ruby How to group every word within parentheses

I'm trying to get all the words between the parentheses after a specific word and the end of the string.
For example, I have this case:
p " some other text in downcase LOREM (foo, bar)".scan(/ LOREM \((.*?)\)\z/m)
# [["foo, bar"]]
The regex is getting foo, bar which is between the parenthesis, it's okay, but I'd like to get them like two separate elements within a single array, meaning:
["foo", "bar"]
That's to say, the regex should group every words as a separate element.
My intention is to get everything between LOREM ( and the last closing parenthesis ).
I've tried adding (\b\w+\b), which groups every word in the string. But when adding it to the attempt to get the words from the parenthesis, it returns nothing.
You may use
.scan(/(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/)
See the Ruby demo and the Rubular regex demo. You may replace \w+ with [[:alnum:]]+, or \p{L}+ (to only match letters), or [^\s,()]+ (to match any 1+ chars other than whitespace, ,, ( and )); it all depends on what you want to match inside the parentheses.
Details
(?:\G(?!\A)\s*,\s*|\sLOREM\s+\() - either the end of the previous successful match and a , enclosed with 0+ whitespaces, or whitespace, LOREM, 1+ whitespaces and (
\K - omit the text matched so far
\w+ - consume 1+ word chars
(?=[^()]*\)\z) - immediately to the right, there must be 0 or more chars other than ( and ) and then ) at the end of the string.
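Python's standard re module supports neither \G nor \K, so the single-regex approach above does not carry over directly. A two-step sketch gets the same result for the example input: first capture the text inside the trailing parentheses, then pull the words out of it. The function name and the marker parameter are assumptions made for this illustration.

```python
import re


def words_in_trailing_parens(text: str, marker: str = "LOREM") -> list:
    """Return the words inside '(...)' that follows the marker at the end of text."""
    # Step 1: whitespace, the marker, whitespace, then a parenthesised group
    # that must end the string (\Z is end-of-string in Python).
    m = re.search(r"\s" + re.escape(marker) + r"\s+\(([^()]*)\)\Z", text)
    if m is None:
        return []
    # Step 2: split the captured content into words.
    return re.findall(r"\w+", m.group(1))


if __name__ == "__main__":
    print(words_in_trailing_parens(" some other text in downcase LOREM (foo, bar)"))
```

Splitting the work in two keeps each pattern simple, at the cost of a second pass over the captured text.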
r = /
(?<= # begin a positive lookbehind
LOREM[ ] # match 'LOREM '
\( # match left paren
| # or
,[ ] # match a comma followed by a space
) # end positive lookbehind
(?: # begin a non-capture group
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
| # or
\" # match a double quote
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
\" # match a double quote
) # end non-capture group
(?= # begin a positive lookahead
.*\) # match any number of characters followed by a right paren
) # end positive lookahead
/x # free-spacing regex definition mode
Conventionally this is written
r = /(?<=LOREM \(|, )(?:[^, ")]+|\"[^, ")]+\")(?=.*\))/
Let's try it.
str = "some other text in downcase LOREM (foo, \"bar\", \"baz), daz"
str.scan(r)
#=> ["foo", "\"bar\""]
The first match, "foo", matches
str.scan /(?<=LOREM \()[^, ")]+/
#=> ["foo"]
That is, this matches one or more characters other than a comma, space, double quote or right parenthesis, immediately preceded by "LOREM " followed by a left parenthesis.
The next attempted match begins at the end of "foo". There is no match of "L" in "LOREM" so an attempt is made to match ", ", which is met with success. [^, ")]+ does not match "bar", so an attempt is made to match \"[^, ")]+\", which is successful. As ", " is matched within the lookaround it is not part of the match returned. This matches '"bar"'.
\"baz is not matched because it has no closing double quote.

Perl Non-greedy Matching -- Is the "?" character used correctly?

I am trying to match the parameter name of a parameter declaration line such as below:
parameter BWIDTH = 32;
The Perl regular expression used is:
$line =~ /(\w+)\s*=/
where the parameter name, BWIDTH, is captured into $1. Most parameters I encountered are declared in such a way that the name precedes the equal sign, "=", which is the reason the regular expression is designed with the "=" in it (/(\w+)\s*=/).
However there are special cases where the parameter is declared:
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
In this case, the parameter name that I am trying to capture is PORT_WIDTH. Revising the regular expression to match this instance does not capture PORT_WIDTH successfully, although it does capture BWIDTH fine.
$line =~ /(\w+)(\s*\[.*?\])*\s*=/
where (\s*\[.*?\])* matches reg [31:0] PORT_WIDTH [BWIDTH-1:0] which is greedy matching.
I am baffled as to why the metacharacter ? does not halt the greedy matching? How should I revise the regular expression?
Replace the .*? with [^][]* to match 0+ chars other than ] and [:
/(\w+)(\s*\[[^][]*])*\s*=/
You may also turn the second capturing group into a non-capturing one if you are not using that value.
Pattern details:
(\w+) - Group 1: one or more word chars
(\s*\[[^][]*])* - a capturing group (add ?: after ( to make it non-capturing) zero or more occurrences of:
\s* - 0+ whitespaces
\[ - a literal [
[^][]* - a negated character class matching zero or more chars other than ] and [
] - a literal ]
\s* - zero or more whitespaces
= - an equal sign.
Greediness vs. non-greediness affects where a match ends, but it still starts as early as possible. Basically, a greedy match is the leftmost-longest possible match, while non-greedy is leftmost-shortest. But non-greedy is still leftmost, not rightmost.
To get what you want, I would use a more explicit description of what I want matched: /(\w+)(\s*\[[^]]*\])?\s*=/ In English, that's a word (\w+), optionally followed by some text in square brackets ((\s*\[[^]]*\])?), and then optional whitespace and an equals sign. Note that I used a negated character class ([^]]) instead of a non-greedy match for what's inside the brackets - IMO, negated character classes are generally a better option than non-greedy matching.
Results with this regex:
$ perl -E '$x = q(parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
PORT_WIDTH
$ perl -E '$x = q(parameter BWIDTH = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
BWIDTH
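The "leftmost" point can be checked with a small Python sketch, which behaves the same way as Perl for these patterns. The names LINE, LAZY, NEGATED and param_name are illustrative; the brackets inside the character class are escaped here to keep the Python pattern unambiguous.

```python
import re

LINE = "parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;"

# The question's pattern: the lazy .*? may still grow across "] PORT_WIDTH ["
# because the match is allowed to start at the earlier word "reg".
LAZY = re.compile(r"(\w+)(\s*\[.*?\])*\s*=")

# Negated character class: a bracket group can never span another bracket.
NEGATED = re.compile(r"(\w+)(?:\s*\[[^\]\[]*\])*\s*=")


def param_name(pattern: re.Pattern, line: str):
    """Return capture group 1 of the first match, or None."""
    m = pattern.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    print("lazy:   ", param_name(LAZY, LINE))
    print("negated:", param_name(NEGATED, LINE))
```

With the lazy pattern the engine finds a match starting at reg before it ever tries PORT_WIDTH, which is why making the quantifier non-greedy does not help.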
You have information available to you which you are choosing not to use. You know the basic structure of each statement you are trying to parse. The statements have mandatory and optional parts. So, put the information you have into the match. For example:
#!/usr/bin/env perl
use strict;
use warnings;
my $stuff_in_square_brackets = qr{ \[ [^\]]+ \] }x;
my $re = qr{
^
parameter \s+
(?: reg \s+)?
(?: $stuff_in_square_brackets \s+)?
(\w+) \s+
(?: $stuff_in_square_brackets \s+)?
= \s+
(\w+) ;
$
}x;
while (my $line = <DATA>) {
if (my($p, $v) = ($line =~ $re)) {
print "'$p' = '$v'\n";
}
}
__DATA__
parameter BWIDTH = 32;
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
Output:
'BWIDTH' = '32'
'PORT_WIDTH' = '32'

Matching first letter of word

I want to match the first letter of a word in one string to another with the similar letter. In this example the letter H:
25HB matches to HC
I am using the match operator shown below:
my ($match) = ( $value =~ m/^d(\w)/ );
to not match the digit, but the first matching word character. How could I correct this?
That regex doesn't do what you think it does:
m/^d(\w)/
It matches the start of the line, then a literal letter d, then a single word character.
You may want:
m/^\d+(\w)/
Which will then match one or more digits from the start of line, and grab the first word character after that.
E.g.:
my $string = '25HC';
my ( $match ) =( $string =~ m/^\d+(\w)/ );
print $match,"\n";
Prints H
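The same check can be sketched in Python. One caveat worth seeing in action: \w also matches digits, so for an all-digit input the engine backtracks and hands the last digit to the capture group. The second function, which uses [A-Za-z] instead, is an assumption about what is wanted rather than part of the answer above; both function names are made up.

```python
import re


def first_word_char_after_digits(value: str):
    """Mirror of m/^\\d+(\\w)/: leading digits, then one word character."""
    m = re.match(r"\d+(\w)", value)
    return m.group(1) if m else None


def first_letter_after_digits(value: str):
    """Stricter variant: the captured character must be an ASCII letter."""
    m = re.match(r"\d+([A-Za-z])", value)
    return m.group(1) if m else None


if __name__ == "__main__":
    for value in ["25HC", "25", "HC"]:
        print(value, first_word_char_after_digits(value), first_letter_after_digits(value))
```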
You are not clear about what you want. If you want to match the first letter in a string to the same letter later in the string:
m{
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern
See perldoc perlre for more details.
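A minimal Python sketch of the same backreference idea follows. Python's re module has no [[:alpha:]] class, so [A-Za-z] is used as an ASCII-only stand-in, and re.DOTALL plays the role of /s; the function name is made up for the example.

```python
import re

# A letter, the shortest possible run of any characters, then the same letter again.
REPEATED_LETTER = re.compile(r"([A-Za-z]).*?\1", re.DOTALL)


def first_repeated_letter(text: str):
    """Return the first letter that occurs again later in the text, or None."""
    m = REPEATED_LETTER.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(first_repeated_letter("25HB HC"))
```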
Addendum:
If by word, you mean any alphanumeric sequence, this may be closer to what you want:
m{
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern
You could try ^.*?([A-Za-z]).
The following code returns:
ITEM: 22hb
MATCH: h
ITEM: 33HB
MATCH: H
ITEM: 3333
MATCH:
ITEM: 43 H
MATCH: H
ITEM: HB33
MATCH: H
Script.
#!/usr/bin/perl
my @array = ('22hb','33HB','3333','43 H','HB33');
for my $item (@array) {
# capture the first ASCII letter; list assignment leaves $match undefined on no match
my ($match) = $item =~ /^.*?([A-Za-z])/;
$match //= '';
print "ITEM: $item \nMATCH: $match\n\n";
}
I believe this is what you are looking for:
(If you can provide more clear example of what you are looking for we may be able to help you better)
The following code takes two strings and finds the first non-digit character common in both the strings:
my $string1 = '25HB';
my $string2 = 'HC';
#strip all digits
$string1 =~ s/\d//g;
foreach my $alpha (split //, $string1) {
# for each non-digit check if we find a match
if ($string2 =~ /\Q$alpha\E/) {
print "First matching non-numeric character: $alpha\n";
exit;
}
}
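The same two-string comparison can be sketched in Python without a regex at all. A plain substring test sidesteps the question of regex metacharacters in the first string. The function name is an assumption made for the example.

```python
def first_common_non_digit(first: str, second: str):
    """Return the first non-digit character of `first` that also occurs in `second`."""
    for ch in first:
        if ch.isdigit():
            continue  # skip digits, as the s/\d//g step does
        if ch in second:
            return ch
    return None


if __name__ == "__main__":
    print(first_common_non_digit("25HB", "HC"))
```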