Perl Regex not working as expected - regex

I'm trying to match a digit, followed by a dot, then two digits, followed by a W.
if($_ =~ /\d{1}\.\d{2}\W\/)/g
It does not work. Any ideas what I'm missing here?

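Move the closing parenthesis of the if condition so that the /g modifier sits inside it, and drop the stray \/ before the closing delimiter.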
if($_ =~ /\d\.\d{2}\W/g)
OR
if($_ =~ /\d\.\d{2}W/g)
I've given two patterns since I don't know whether you want \W or W. Note that \W is a special regex escape that matches a non-word character, while W matches a literal W.
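To see the difference between the two patterns, here is a minimal sketch. The helper name check and the sample strings are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (literal_W_matched, non_word_matched).
sub check {
    my ($s) = @_;
    my $literal  = ($s =~ /\d\.\d{2}W/)  ? 1 : 0;   # W is a literal letter W
    my $non_word = ($s =~ /\d\.\d{2}\W/) ? 1 : 0;   # \W is any non-word character
    return ($literal, $non_word);
}

# Hypothetical sample inputs, chosen only for illustration.
for my $s ('1.23W', '1.23 ', '1.234') {
    my ($literal, $non_word) = check($s);
    printf "%-6s W=%d \\W=%d\n", $s, $literal, $non_word;
}
```

Only the first sample matches the literal W pattern, and only the second (which ends in a space) matches the \W pattern.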

Related

Removing multiple consecutive words separated with whitespace

In the code below, the pattern / man / occurs twice in a row. When I substitute that pattern, only the first occurrence is replaced; the second is not.
As I understand the problem, the first match extends to the start of the second (i.e., the space after man is both the end of the first match and the start of the second), so the second occurrence is not matched. How can I match this pattern globally when it occurs consecutively?
use strict;
use warnings;
#my $name =" man sky man "; #this works
my $name =" man man sky"; #this doesn't
$name =~s/ man / nam /g; #expected= 'nam nam sky'
print $name,"\n";
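The regex consumes the characters it matches, so the space that ends the first match is no longer available to start the second one. To avoid this, use lookahead and lookbehind assertions, which check the surrounding whitespace without consuming it. See perlre.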
$name =~ s/(?<=\s)man(?=\s)/nam/g;
Quoting from perlre
Look Ahead:
(?=pattern)
A zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches
a word followed by a tab, without including the tab in $&.
Look Behind:
(?<=pattern) \K
A zero-width positive lookbehind assertion. For example, /(?<=\t)\w+/ matches
a word that follows a tab, without including the tab in $&. Works only for
fixed-width lookbehind.
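As a rough illustration of the difference, the sketch below runs both substitutions on the string from the question. The variable names $consuming and $lookaround are chosen here for illustration only.

```perl
use strict;
use warnings;

my $consuming = " man man sky";
$consuming =~ s/ man / nam /g;            # the space after the first "man" is consumed
print "[$consuming]\n";                   # prints [ nam man sky]

my $lookaround = " man man sky";
$lookaround =~ s/(?<=\s)man(?=\s)/nam/g;  # spaces are checked but not consumed
print "[$lookaround]\n";                  # prints [ nam nam sky]
```

Because the lookarounds are zero-width, the space between the two words can satisfy the lookahead of the first match and the lookbehind of the second.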
I understand you want to replace man when it sits between whitespace characters or at the start/end of the string.
There are two approaches: positive lookarounds containing an alternation that checks for a string boundary or whitespace, or negative lookarounds that check there is no non-whitespace char on either side of the search word.
Use either of the two:
$name =~ s/(?<=^|\s)man(?=\z|\s)/nam/g;
$name =~ s/(?<!\S)man(?!\S)/nam/g;
From the point of view of efficiency, the second option is better since alternation is a bit "expensive".
The (?<=^|\s) positive lookbehind matches a location in the string that is preceded by the start of the string (^) or (|) a whitespace character (\s), and the (?=\z|\s) positive lookahead makes sure there is a whitespace character or the end of the string (\z) immediately after man.
The (?<!\S) negative lookbehind matches a location in the string that is not immediately preceded by a non-whitespace char (i.e. if there is a non-whitespace char there, there will be no match), and the (?!\S) negative lookahead asserts that there is no non-whitespace char right after man.
See more details about Lookaround Assertions at perlre.
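A small sketch of why the string boundaries matter, comparing the whitespace-only lookarounds with the negative ones. The input string and the names $strict and $loose are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical input: "man" at the start, in the middle, inside "woman", and at the end.
my $name = "man man woman man";

(my $strict = $name) =~ s/(?<=\s)man(?=\s)/nam/g;   # needs whitespace on both sides
(my $loose  = $name) =~ s/(?<!\S)man(?!\S)/nam/g;   # also accepts string start and end

print "$strict\n";   # prints man nam woman man
print "$loose\n";    # prints nam nam woman nam
```

Neither version touches the man inside woman, since it is preceded by a non-whitespace character.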

How can I match start of the line or a character in Perl?

For example, this is ok:
my $str = 'I am $name. \$escape';
$str =~ s/[^\\]\K\$([a-z]+)/Bob/g;
print $str; # 'I am Bob. \$escape';
But the result below is not what I expected.
my $str = '$name';
$str =~ s/[^\\]\K\$([a-z]+)/Bob/g;
print $str; # '$name';
How can I correct this?
The circumflex inside a character class loses the meaning of the start-of-string anchor. Instead of a character class, you need to use a non-capturing group:
$str =~ s/(?:^|\\)\K\$([a-z]+)/Bob/g;
          ^^^^^^^^
This (?:^|\\) will either assert the position at the string start or will match \.
If the question is read as "match only if the $ symbol is not escaped", the solution is
$str =~ s/(?<!\\)(?:\\\\)*\K\$([a-z]+)/Bob/g;
Here, the (?<!\\) zero-width assertion is a negative lookbehind that fails the match if the current position is preceded by a \ symbol, (?:\\\\)* consumes any escaped backslashes (if present) before the $, and the \K match-reset operator discards all these backslashes from the match value.
If your goal is to match $ that are not escaped by a backslash, you can change your pattern to:
(?<!\\)(?:\\{2})*\K\$([a-z]+)
This way you don't have to use an alternation since the negative lookbehind matches a position not preceded by a backslash (that includes the start of the string).
In addition, (?:\\{2})* avoids missing cases where the backslash before a $ is itself escaped by another backslash, for example: \\$name
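The behaviour described above can be checked with a short sketch. The helper name replace_unescaped and the extra test inputs are hypothetical; only the pattern comes from the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $word with Bob unless the $ is escaped
# by an odd number of backslashes.
sub replace_unescaped {
    my ($str) = @_;
    $str =~ s/(?<!\\)(?:\\\\)*\K\$([a-z]+)/Bob/g;
    return $str;
}

# Single-quoted so that $ and backslashes stay literal.
# '\\\\$name' is two backslashes followed by $name.
for my $in ('$name', 'I am $name.', '\$escape', '\\\\$name') {
    printf "%-12s => %s\n", $in, replace_unescaped($in);
}
```

The first, second and fourth inputs are rewritten (the fourth keeps its two backslashes), while \$escape is left alone.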

Why does adding the Perl /x switch stop my regex from matching?

I'm trying to match:
JOB: fruit 342 apples to get
The code matches:
$line =~ /^JOB: fruit (\d+) apples to get/
But, when I add the /x switch in:
$line =~ /^JOB: fruit (\d+) apples to get/x
It does not match.
I looked into the /x switch, and it says it just lets you do comments. I don't know why adding /x stops my regex from matching.
The /x modifier tells Perl to ignore most whitespace that isn't escaped in the regex.
For example, let's just focus on apples to get. You could match it with:
$line =~ /apples to get/
But if you try:
$line =~ /apples to get/x
then Perl will ignore the spaces. So it would be like trying to match applestoget.
You can read more about it in perlre. They have this nice example of how you can use the modifier to make the code more readable.
# Delete (most) C comments.
$program =~ s {
    /\*     # Match the opening delimiter.
    .*?     # Match a minimal number of characters.
    \*/     # Match the closing delimiter.
} []gsx;
They also mention how to match whitespace or # again while using the /x modifier.
Use of /x means that if you want real whitespace or # characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes.
Part of allowing comments is also ignoring literal white space. Use \s or [ ] for spaces you wish to match.
For example
$line =~ /^               #beginning of string
    JOB:[ ]fruit[ ]       #some literal text
    (\d+)                 #capture digits to $1
    [ ]apples[ ]to[ ]get  #more literal text
/x
Notice all those spaces before the beginning of the comments. It would stink if they counted....
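Putting the answers together, here is a minimal sketch comparing the original pattern with and without /x, plus two ways of keeping the spaces significant. The variable names are invented for this example.

```perl
use strict;
use warnings;

my $line = 'JOB: fruit 342 apples to get';

# Without /x, the spaces in the pattern are literal.
my $plain = ($line =~ /^JOB: fruit (\d+) apples to get/) ? 1 : 0;

# With /x, unescaped spaces are ignored, so this looks for "JOB:fruit342applestoget".
my $x_naive = ($line =~ /^JOB: fruit (\d+) apples to get/x) ? 1 : 0;

# Spaces inside a bracketed character class still count under /x.
my $x_class = ($line =~ /^ JOB:[ ]fruit[ ] (\d+) [ ]apples[ ]to[ ]get /x) ? 1 : 0;

# Backslash-escaped spaces also count under /x.
my $x_escaped = ($line =~ /^JOB:\ fruit\ (\d+)\ apples\ to\ get/x) ? 1 : 0;
my $count = $1;   # digits captured by the last successful match

print "plain=$plain x_naive=$x_naive x_class=$x_class x_escaped=$x_escaped count=$count\n";
```

Only the naive /x version fails to match; the other three match and capture 342.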

What does this snippet do in perl?

What does the following do in Perl?
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
I understand that [^a-zA-Z0-9]+ is a start of sentence and at least one of a-zA-Z0-9 and \s+ is at least one whitespace.
But I can not figure out what this snippet does as a whole.
First, it replaces any sequence of non-alphanumeric characters (being neither upper case chars, lower case chars nor numbers) in the string with a single space.
After that it replaces all multi-spaces, i.e. any sequence of whitespaces with just one space character.
The first pattern replaces every run of non-alphanumeric characters with a space.
The second replaces any run of whitespace characters (space, tab, newline) with a single space.
Note that you can replace these two substitutions with a single one:
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is more commonly written as
$string =~ s/[^a-zA-Z0-9]+/ /sg;
$string =~ s/\s+/ /sg;
The choice of delimiter isn't significant, but / is used by convention unless the pattern contains many / characters.
Here we have two instances of the substitution operator. Between the first two delimiters is a regular expression pattern to search for. Between the last two delimiters is the string with which to replace the matching text. The trailing s and g are flags.
The s flag affects what . matches. Given that . isn't used, the s flag is useless.
The g flag causes all matches to be replaced instead of just the first one.
The first regex pattern, [^a-zA-Z0-9]+
[...] is a character class that matches a single character among those specified. A leading ^ negates the class, so [^a-zA-Z0-9] matches any character other than unaccented latin letters and numbers.
atom+ matches atom one or more times, so [^a-zA-Z0-9]+ matches a sequence of non-alphanumeric characters (and some alphanumeric characters such as "é").
Therefore, s/[^a-zA-Z0-9]+/ /g replaces all sequences of non-alphanumeric characters (and some alphanumeric characters such as "é") with a single space. For example, "abc - déf :)" becomes "abc d f ".
The second regex pattern, \s+
\s matches any whitespace character (except, in some cases, the vertical tab and the non-breaking space).
Therefore, s/\s+/ /g replaces all sequences of white space with a single space. For example, "abc\tdef ghi\n" becomes "abc def ghi ".
As a whole
When used together, the second statement does absolutely nothing. There will never be any sequences of two or more whitespace characters left in $string after the first statement.
So
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is the same as
$string =~ s/[^a-zA-Z0-9]+/ /g;
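The claim that the second statement is redundant can be checked with a quick sketch. The input string and variable names below are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with punctuation, a tab, newlines and repeated spaces.
my $input = "abc - d\tf :)\n\n  ghi";

# Original two-step version.
(my $after_first = $input) =~ s#[^a-zA-Z0-9]+# #sg;
(my $two_step = $after_first) =~ s#\s+# #sg;

# Single-statement version.
(my $one_step = $input) =~ s/[^a-zA-Z0-9]+/ /g;

print "[$after_first]\n";   # prints [abc d f ghi]
print "[$two_step]\n";      # prints [abc d f ghi]
print "[$one_step]\n";      # prints [abc d f ghi]
```

Whitespace is itself non-alphanumeric, so the first substitution already collapses every run of it into one space and leaves nothing for \s+ to shorten.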

What do these Perl regexes mean?

What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;
See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.
$line =~ /([^:]+):/;
The =~ operator is called the binding operator; it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specifies a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order in which the parentheses appear, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet, . matches one character*).
[] specifies a character class. Character classes match one character from those listed, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match).
The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times (the * quantifier requires the pattern to match zero or more times).
The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character
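The capture and the substitution described above can be sketched together as follows. The variable names $captured and $changed are chosen here for illustration.

```perl
use strict;
use warnings;

my $line = "key: value";

# Match: $1 holds the run of non-colon characters before the first colon.
my $captured;
if ($line =~ /([^:]+):/) {
    $captured = $1;
}
print "captured: $captured\n";   # prints captured: key

# Substitution: the whole match (capture plus colon) is replaced.
my $replace = "option";
(my $changed = $line) =~ s/([^:]+):/$replace/;
print "changed: $changed\n";     # prints changed: option value
```

Copying $line into $changed before substituting keeps the original string intact for comparison.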
The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one replaces that part, this time including the colon, with the contents of the variable $replace.
I may be misunderstanding some of the previous answers, but I think that there's a confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) by $replace. It will replace all of ([^:]+): with $replace - the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bitten:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.
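If the colon should survive the substitution, two possible fixes are sketched below: put the colon back in the replacement, or keep it out of the match with a lookahead. Both are illustrations under the assumptions of the example above, not the only options.

```perl
use strict;
use warnings;

my $url     = 'http://www.example.com/';
my $replace = 'ftp';

# Option 1: re-add the colon in the replacement string.
(my $readded = $url) =~ s/([^:]+):/${replace}:/;

# Option 2: assert the colon with a lookahead so it is never part of the match.
(my $lookahead = $url) =~ s/[^:]+(?=:)/$replace/;

print "$readded\n";     # prints ftp://www.example.com/
print "$lookahead\n";   # prints ftp://www.example.com/
```

The second form also drops the capturing parentheses, since the capture is not used.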
The first one matches one or more characters that are anything but :, followed by a :. The second one matches the same thing but replaces it with $replace.
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
$line =~ /([^:]+):/;
Matches one or more characters other than : that come immediately before a :
If $line = "http://www.google.com", it will match http: (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, the matched text (including the colon) is replaced with the contents of the variable $replace