In this regex
$line = 'this is a regular expression';
$line =~ s/^(\w+)\b(.*)\b(\w+)$/$3 $2 $1/;
print $line;
Why is $2 equal to " is a regular "? My thought process is that (.*) should be greedy and match all characters until the end of the line and therefore $3 would be empty.
That's not happening, though. The regex matcher is somehow stopping right before the last word boundary and populating $3 with what's after the last word boundary and the rest of the string is sent to $2.
Any explanation?
Thanks.
$3 can't be empty when using this regex because the corresponding capturing group is (\w+), which must match at least one word character or the whole match will fail.
So what happens is (.*) matches "is a regular expression", \b matches the end of the string, and (\w+) fails to match. The regex engine then backtracks to (.*) matching "is a regular " (note the match includes the space), \b matches the word boundary before e, and (\w+) matches "expression".
If you change (\w+) to (\w*), you will end up with the result you expected, where (.*) consumes the rest of the string.
Greedy doesn't mean it gets to match absolutely everything. It just means it can take as much as possible and still have the regex succeed.
This means that since you use the + in group 3 it can't be empty and still succeed as + means 1 or more.
If you want $3 to be empty, just change (\w+) to (\w?). Since ? means 0 or 1, the group can be empty, and therefore the greedy .* takes everything. Note: This seems to work only in Perl, due to how Perl deals with lines.
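A minimal sketch comparing the three quantifiers on the last group, using the string from the question. The helper name `captures` is made up for illustration and is not part of any answer here.

```perl
use strict;
use warnings;

my $line = 'this is a regular expression';

# Hypothetical helper: return the three captures, or an empty list on failure.
sub captures {
    my ($re) = @_;
    return $line =~ $re;
}

for my $last ('\w+', '\w*', '\w?') {
    my $re  = qr/^(\w+)\b(.*)\b(${last})$/;
    my @got = captures($re);
    print "$last => ", join('|', map { "[$_]" } @got), "\n";
}
```

With `\w+` the middle group has to give back the last word; with `\w*` or `\w?` the last group may be empty, so the greedy `.*` keeps everything after the first word.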
In order for the regex to match the whole string, ^(\w+)\b requires that the entire first word be \1. Likewise, \b(\w+)$ requires that the entire last word be \3. Therefore, no matter how greedy (.*) is, it can only capture ' is a regular ', otherwise the pattern won't match. At some point while matching the string, .* probably did take up the entire ' is a regular expression', but then it found that it had to backtrack and let the \w+ get its match too.
The way that you wrote your regexp it doesn't matter if .* is being greedy, or non-greedy.
It will still match.
The reason is that you used \b between .* and \w+.
use strict;
use warnings;
my $string = 'this is a regular expression';
sub test{
  my($match,$desc) = @_;
  print '# ', $desc, "\n" if $desc;
  print "test( qr'$match' );\n";
  if( my @elem = $string =~ $match ){
    print ' 'x4,'[\'', join("']['",@elem), "']\n\n"
  }else{
    print ' 'x4,"FAIL\n\n";
  }
}
test( qr'^ (\w+) \b (.*) \b (\w+) $'x, 'original' );
test( qr'^ (\w+) \b (.*+) \b (\w+) $'x, 'extra-greedy' );
test( qr'^ (\w+) \b (.*?) \b (\w+) $'x, 'non-greedy' );
test( qr'^ (\w+) \b (.*) \b (\w*) $'x, '\w* instead of \w+' );
test( qr'^ (\w+) \b (.*) (\w+) $'x, 'no \b');
test( qr'^ (\w+) \b (.*?) (\w+) $'x, 'no \b, non-greedy .*?' );
# original
test( qr'(?^x:^ (\w+) \b (.*) \b (\w+) $)' );
['this'][' is a regular ']['expression']
# extra-greedy
test( qr'(?^x:^ (\w+) \b (.*+) \b (\w+) $)' );
FAIL
# non-greedy
test( qr'(?^x:^ (\w+) \b (.*?) \b (\w+) $)' );
['this'][' is a regular ']['expression']
# \w* instead of \w+
test( qr'(?^x:^ (\w+) \b (.*) \b (\w*) $)' );
['this'][' is a regular expression']['']
# no \b
test( qr'(?^x:^ (\w+) \b (.*) (\w+) $)' );
['this'][' is a regular expressio']['n']
# no \b, non-greedy .*?
test( qr'(?^x:^ (\w+) \b (.*?) (\w+) $)' );
['this'][' is a regular ']['expression']
I'm using regular expressions in a custom text editor to in effect whitelist certain modules (assert and crypto). I'm close to what I need but not quite there. Here it is:
/require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)/
I want the regular expression to match any line with require('foo'); where foo is anything except for 'assert' or 'crypto'. The case I'm failing is require('assert '); which is not being matched with my regex however require(' assert'); is correctly being matched.
https://regexr.com/4i6ot
If you don't want to match assert or crypto between ', you could change the lookahead to assert exactly that. You can omit the word boundaries matching the words right after the '.
If what follows should match until the first occurrence of ', you could use a negated character class [^'\r\n]* to match any char except ' or a newline.
require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)
                                  ^
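A quick way to check the cases from the question is to run them through the pattern. This is only a sketch in Perl (the pattern syntax is the same as in the JavaScript original); the helper name `flagged` is hypothetical, and the inner group is made non-capturing.

```perl
use strict;
use warnings;

# Pattern from the answer above, with a non-capturing inner group.
my $re = qr/require\s*\(\s*'(?!(?:assert|crypto)')[^'\r\n]*'\s*\)/;

# Hypothetical helper: 1 when the line requires a module that is not whitelisted.
sub flagged {
    my ($line) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $line ("require('foo');", "require('assert');", "require('crypto');",
              "require('assert ');", "require(' assert');") {
    printf "%-22s %s\n", $line, flagged($line) ? 'match' : 'no match';
}
```

Only the exact names followed directly by the closing quote are exempt, so `'assert '` and `' assert'` are both matched.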
You can use: require\s*\(\s*'(?!(\bassert'|\bcrypto')).*'\s*\)
The difference is that I replaced the word boundary \b with ' at the end of the module names. With \b, a module name of 'assert ' was rejected by the negative lookahead, because \b matches at the boundary between t and the following space. In the new version, we require ' right after the module name.
EDIT
As Cary Swoveland advised, leading \b are not required:
require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)
I assume from the flawed regex that if there is a match the string between "('" and "')" is to be captured. One way to do that follows.
r = /
require # match word
\ * # match zero or more spaces (note escaped space)
\( # match a left paren
(?! # begin a negative lookahead
' # match a single quote
(?:assert|crypto) # match either word
' # match a single quote
(?=\)) # match a right paren in a forward lookahead
) # end negative lookahead
' # match a single quote
(.*?) # match any number of characters lazily in a capture group 1
' # match a single quote
\) # match a right paren
/x # free-spacing regex definition mode
As the capture group is followed by a single quote, matching characters in the capture group lazily ensures that a single quote is not matched in the capture group. I could have instead written ([^']*). In conventional form this regex is written as follows:
r = /require *\((?!'(?:assert|crypto)'(?=\)))'(.*?)'\)/
Note that in free-spacing regex definition mode spaces will be removed unless they are escaped, put in a character class ([ ]), replaced with \p{Space} and so on.
"require ('victory')" =~ r #=> 0
$1 #=> "victory"
"require (' assert')" =~ r #=> 0
$1 #=> " assert"
"require ('assert ')" =~ r #=> 0
$1 #=> "assert "
"require ('crypto')" =~ r #=> nil
"require ('assert')" =~ r #=> nil
"require\n('victory')" =~ r #=> nil
Notice that had I replaced the space character in the regex with "\s" in the last example, I would have obtained:
"require\n('victory')" =~ r #=> 0
$1 #=> "victory"
I don't think you need anything remotely that complicated, this simple pattern will work just fine:
require\((?!'crypto'|'assert')'.*'\);
Given the following code:
use strict;
use warnings;
my $text = "asdf(blablabla)";
$text =~ s/(.*?)\((.*)\)/$2/;
print "\nfirst match: $1";
print "\nsecond match: $2";
I expected that $2 would catch my last bracket, yet my output is:
If .* is greedy by default, why did it stop at the bracket?
The .* is a greedy subpattern, but it does not account for grouping. Grouping is defined with a pair of unescaped parentheses (see Use Parentheses for Grouping and Capturing).
See where your group boundaries are:
s/(.*?)\((.*)\)/$2/
  | G1|  |G2|
So, the \( and \) matching ( and ) are outside the groups, and will be part of neither $1 nor $2.
If you need the ) to be part of $2, use
s/(.*?)\((.*\))/$2/
             ^
A regex engine is processing both the string and the pattern from left to right. The first (.*?) is handled first, and it matches up to the first literal ( symbol as it is lazy (matches as few chars as possible before it can return a valid match), and the whole part before the ( is placed into Group 1 stack. Then, the ( is matched, but not captured, then (.*) matches any 0+ characters other than a newline up to the last ) symbol, and places the capture into Group 2. Then, the ) is just matched. The point is that .* grabs the whole string up to the end, but then backtracking happens since the engine tries to accommodate for the final ) in the pattern. The ) must be matched, but not captured in your pattern, thus, it is not part of Group 2 due to the group boundary placement. You can see the regex debugger at this regex demo page to see how the pattern matches your string.
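The effect of the group boundaries can be checked directly by capturing in list context. A small sketch, using the string from the question plus one nested example that is not from the original post:

```perl
use strict;
use warnings;

my $text = 'asdf(blablabla)';

# Closing parenthesis matched outside group 2.
my @outside = $text =~ /(.*?)\((.*)\)/;
# Closing parenthesis matched inside group 2.
my @inside  = $text =~ /(.*?)\((.*\))/;

print "outside: [$outside[0]] [$outside[1]]\n";   # outside: [asdf] [blablabla]
print "inside:  [$inside[0]] [$inside[1]]\n";     # inside:  [asdf] [blablabla)]

# Greedy .* backtracks only as far as the last ')', so inner parentheses stay in the group.
my @nested = 'f(a(b)c)' =~ /(.*?)\((.*)\)/;
print "nested:  [$nested[0]] [$nested[1]]\n";     # nested:  [f] [a(b)c]
```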
I want to match the first letter of a word in one string to another with the similar letter. In this example the letter H:
25HB matches to HC
I am using the match operator shown below:
my ($match) = ( $value =~ m/^d(\w)/ );
to not match the digit, but the first matching word character. How could I correct this?
That regex doesn't do what you think it does:
m/^d(\w)/
Matches 'start of line' - letter d then a single word character.
You may want:
m/^\d+(\w)/
Which will then match one or more digits from the start of line, and grab the first word character after that.
E.g.:
my $string = '25HC';
my ( $match ) =( $string =~ m/^\d+(\w)/ );
print $match,"\n";
Prints H
You are not clear about what you want. If you want to match the first letter in a string to the same letter later in the string:
m{
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern
See perldoc perlre for more details.
Addendum:
If by word, you mean any alphanumeric sequence, this may be closer to what you want:
m{
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern
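One way to apply the pattern above to two separate strings is to join them with a space first. That joining step is an assumption about the asker's intent, and the helper name `shared_first_letter` is hypothetical:

```perl
use strict;
use warnings;

# Compact form of the pattern above: optional digits, a captured letter,
# then the same letter again at the start of a later word (after optional digits).
my $re = qr{ \b \d* ([[:alpha:]]) .*? \b \d* \1 }msx;

# Hypothetical helper: join the two strings with a space and return the
# shared first letter, or undef when the first letters differ.
sub shared_first_letter {
    my ($first, $second) = @_;
    return "$first $second" =~ $re ? $1 : undef;
}

my $letter = shared_first_letter('25HB', 'HC');
print defined $letter ? "shared: $letter\n" : "no shared first letter\n";   # shared: H
```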
You could try ^.*?([A-Za-z]).
The following code returns:
ITEM: 22hb
MATCH: h
ITEM: 33HB
MATCH: H
ITEM: 3333
MATCH:
ITEM: 43 H
MATCH: H
ITEM: HB33
MATCH: H
Script.
#!/usr/bin/perl
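my @array = ('22hb','33HB','3333','43 H','HB33');
for my $item (@array) {
    # List assignment leaves $match undefined when there is no letter;
    # "my $x = ... if ..." has unpredictable behaviour and is best avoided.
    my ($match) = $item =~ /^.*?([A-Za-z])/;
    $match = '' unless defined $match;
    print "ITEM: $item \nMATCH: $match\n\n";
}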
I believe this is what you are looking for:
(If you can provide a clearer example of what you are looking for, we may be able to help you better.)
The following code takes two strings and finds the first non-digit character common in both the strings:
my $string1 = '25HB';
my $string2 = 'HC';
# strip all digits
$string1 =~ s/\d//g;
foreach my $alpha (split //, $string1) {
    # for each non-digit, check whether it occurs in the second string;
    # \Q...\E keeps any regex metacharacter in $alpha literal
    if ($string2 =~ /\Q$alpha\E/) {
        print "First matching non-numeric character: $alpha\n";
        last;
    }
}
I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?
It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that, in some very limited circumstances, you might be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.
$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.
Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )
These regexes match strings of length 1 and greater. In the expressions, (.) represents a capture group where the dot should be replaced with your class of symbols, I guess (see the Unicode gurus' answers, although that does not seem to be the intent of the question).
The context of this regex is single line (/s modifier). It allows the dot to match newlines as well as anything else (like [\s\S]), so newlines can be embedded as well as being the outermost delimiter.
Using \z is the same as $ (in /s mode), except that \z corrects a scenario where $ could match before a trailing newline (matching at the very end of the string is the more common case). If the character in question is a newline and you use a non-greedy quantifier (like .*?) and the target string is "\nasdf\n\n", it could falsely match before the final newline. But that is a moot issue since the match is all greedy. Still, leave it in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols
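To see how the lookahead version differs from the plain `^(.).*\1$` suggested earlier, here is a small comparison. It is a sketch only; the helper name `same_ends` is made up for illustration.

```perl
use strict;
use warnings;

my $simple    = qr/^(.).*\1$/;        # needs at least two characters
my $lookahead = qr/^(?=(.)).*\1\z/s;  # also accepts a single character

# Hypothetical helper: 1 if the string matches the given pattern, else 0.
sub same_ends {
    my ($s, $re) = @_;
    return $s =~ $re ? 1 : 0;
}

for my $s ('a', 'aa', 'abca', 'abcb') {
    printf "%-5s simple=%d lookahead=%d\n", $s,
        same_ends($s, $simple), same_ends($s, $lookahead);
}
```

Because the lookahead captures the first character without consuming it, the backreference can match that same character again when the string has length 1.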
I want a Perl regular expression that will match duplicated words in a string.
Given the following input:
$str = "Thus joyful Troy Troy maintained the the watch of night..."
I would like the following output:
Thus joyful [Troy Troy] maintained [the the] watch of night...
This is similar to one of the Learning Perl exercises. The trick is to catch all of the repeated words, so you need a "one or more" quantifier on the duplication:
$str = 'This is Goethe the the the their sentence';
$str =~ s/\b((\w+)(?:\s+\2\b)+)/[$1]/g;
The features I'm about to use are described in either perlre, when they apply at a pattern, or perlop when they affect how the substitution operator does its work.
If you like the /x flag to add insignificant whitespace and comments:
$str =~ s/
\b
(
(\w+)
(?:
\s+
\2
\b
)+
)
/[$1]/xg;
I don't like that \2 though because I hate counting relative positions. I can use the relative backreferences in Perl 5.10. The \g{-1} refers to the immediately preceding capture group:
use 5.010;
$str =~ s/
\b
(
(\w+)
(?:
\s+
\g{-1}
\b
)+
)
/[$1]/xg;
Counting isn't all that great either, so I can use labeled matches:
use 5.010;
$str =~ s/
\b
(
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$1]/xg;
I can label the first capture ($1) and access its value in %+ later:
use 5.010;
$str =~ s/
\b
(?<dups>
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
)
/[$+{dups}]/xg;
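Wrapped in a function, the named-capture version can be tried on both sample strings from this thread. The function name `bracket_dups` is hypothetical; the pattern is the one shown above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper wrapping the named-capture substitution above.
sub bracket_dups {
    my ($str) = @_;
    $str =~ s/
        \b
        (?<dups>
            (?<word>\w+)
            (?: \s+ \k<word> \b )+
        )
    /[$+{dups}]/xg;
    return $str;
}

say bracket_dups('This is Goethe the the the their sentence');
# This is Goethe [the the the] their sentence
say bracket_dups('Thus joyful Troy Troy maintained the the watch of night...');
# Thus joyful [Troy Troy] maintained [the the] watch of night...
```

The leading `\b` keeps "This is" and "Goethe the" from counting as repeats, and the trailing `\b` inside the repetition stops "the their" from being swallowed.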
I shouldn't really need that first capture though since it's really just there to refer to everything that matched. Sadly, it looks like ${^MATCH} isn't set early enough for me to use it in the replacement side. I think that's a bug. This should work but doesn't:
$str =~ s/
\b
(?<word>\w+)
(?:
\s+
\k<word>
\b
)+
/[${^MATCH}]/pgx; # DOESN'T WORK
I'm checking this on blead, but that's going to take a little while to compile on my tiny machine.
This works:
$str =~ s/\b((\w+)\s+\2)\b/[$1]/g;
You can try:
$str = "Thus joyful Troy Troy maintained the the watch of night...";
$str =~s{\b(\w+)\s+\1\b}{[$1 $1]}g;
print "$str"; # prints Thus joyful [Troy Troy] maintained [the the] watch of night...
Regex used: \b(\w+)\s+\1\b
Explanation:
\b: word boundary
\w+: a word
(): to remember the above word
\s+: whitespace
\1: the remembered word
It effectively finds two full words separated by whitespace and places [ ] around them.
EDIT:
If you want to preserve the amount of whitespace between the words you can use:
$str =~s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
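A short sketch of what the whitespace-preserving version does, including one case it does not cover. The helper name `bracket_pairs` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the whitespace-preserving substitution above.
sub bracket_pairs {
    my ($str) = @_;
    $str =~ s{\b(\w+)(\s+)\1\b}{[$1$2$1]}g;
    return $str;
}

print bracket_pairs("Troy   Troy fell"), "\n";   # [Troy   Troy] fell
print bracket_pairs('the the the cat'), "\n";    # [the the] the cat
```

Since the pattern matches exactly two occurrences, a word repeated three times is bracketed only as a pair; a quantifier on the repeated part would be needed to catch longer runs.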
Try the following:
$str =~ s/\b(\S+)\b(\s+\1\b)+/[$1]/g;