In Perl, what is the meaning of if (s/^\+//)? - regex

In a Perl/Tk code I found a conditional statement as below
if (s/^\+//)
{
#do something
}
elsif (/^-/)
{
#do another thing
}
It seems some pattern matching is being done, but I cannot understand it. Can anyone help me understand this pattern matching?

They are both regular expressions. You can read up on them at perlre and perlretut. You can play around with them on http://www.rubular.com.
They both implicitly do something with $_. There probably is a while or foreach around your lines of code without a loop variable. In that case, $_ becomes that loop variable. It might for instance contain the current line of a file that is being read.
1. If the current value of $_ has a + (plus) sign as its first character, #do something.
2. Else if it starts with a - (minus) sign, #do another thing.
In case 1, it also replaces that + sign with nothing (i.e. removes it). It does not remove the - in case 2, however.
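As a rough illustration of the behaviour described above, here is a minimal sketch. The sample strings and the @log array are made up for the demonstration and are not part of the original program; the loop only shows that s/^\+// changes $_ while /^-/ leaves it alone.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original program.
my @items = ( '+alpha', '-beta', 'gamma' );
my @log;

for (@items) {        # $_ is aliased to each element in turn
    if (s/^\+//) {    # leading '+' found and removed from $_
        push @log, "plus:$_";
    }
    elsif (/^-/) {    # leading '-' found; $_ is left unchanged
        push @log, "minus:$_";
    }
    else {
        push @log, "other:$_";
    }
}

print "$_\n" for @log;
```

Because $_ is an alias in a for loop, the substitution also modifies the element of @items itself, which is worth keeping in mind when the same list is used again later.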
Let's look at an explanation with YAPE::Regex::Explain.
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^\+/)->explain();
Here it is. Not really helpful in our case, but a nice tool nonetheless. Note that the (?-imsx and ) parts are the default things Perl implies. They are always there unless you change them.
The regular expression:
(?-imsx:^\+)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \+                       '+'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
Update: As Mikko L in the comments pointed out, you should maybe refactor/change this piece of code. While it probably does what it is supposed to, I believe it would be a good idea to make it more readable. Whoever wrote it obviously didn't care about you as the later maintainer. I suggest you do. You could change it to:
# look at the content of $_ (current line?)
if ( s/^\+// )
{
# the line starts with a + sign,
# which we remove!
#do something
}
elsif ( m/^-/ )
{
# the line starts with a - sign
# we do NOT remove the - sign!
#do another thing
}

Those are regular expressions, used for pattern matching and substitution.
You should read up on the concept, but as for your question:
s/^\+//
If the string started with a plus, remove that plus (the "s" means "substitute"), and return true.
/^-/
True if the string starts with a minus.

This code is equivalent to
if ($_ =~ s/^\+//) { # s/// modifies $_ by default
#do something
}
elsif ($_ =~ m/^-/) { # m// searches $_ by default
#do another thing
}
s/// and m// are regexp quote-like operators. You can read about them in perlop.

The other answers have given a summary of how the code works, but not much of why. Here is a simple example of why one might use such logic.
#!/usr/bin/env perl
use strict;
use warnings;
my $args = {};
for ( @ARGV ) {
if ( s/^no// ) {
$args->{$_} = 0;
} else {
$args->{$_} = 1;
}
}
use Data::Dumper;
print Dumper $args;
When you call the script like
./test.pl hi nobye
you get
$VAR1 = {
'hi' => 1,
'bye' => 0
};
The key is the string itself; if it is prefixed with no, that prefix is removed (to get the key in question) and the value is set to 0 instead.
The OP's example is a little more involved, but follows the same logic.
if the key starts with a +, remove it and do something
if the key starts with a -, don't remove it and do something else

Related

grep a pattern and return all characters before and after another specific character bash

I'm interested in searching a variable inside a log file, in case the search returns something then I wish for all entries before the variable until the character '{' is met and after the pattern until the character '}' is met.
To be more precise let's take the following example:
something something {
entry 1
entry 2
name foo
entry 3
entry 4
}
something something test
test1 test2
test3 test4
In this case I would search for 'name foo' which will be stored in a variable (which I create before in a separate part) and the expected output would be:
{
entry 1
entry 2
name foo
entry 3
entry 4
}
I tried finding something on grep, awk or sed. I was able to only come up with options for finding the pattern and then return all lines until '}' is met, however I can't find a suitable solution for the lines before the pattern.
I found a regex in Perl that could be used but I'm not able to use the variable, in case I switch the variable with 'foo' then I will have output.
grep -Poz '.*(?s)\{[^}]*name\tfoo.*?\}'
The regex is quite simple, once the whole file is read into a variable
use warnings;
use strict;
use feature 'say';
die "Usage: $0 filename\n" if not @ARGV;
my $file_content = do { local $/; <> }; # "slurp" file with given name
my $target = qr{name foo};
while ( $file_content =~ /( \{ .*? $target .*? \} )/gsx ) {
say $1;
}
Since we undefine the input record separator $/ inside the do block using local, the following read via the null filehandle <> pulls the whole file in at once, as a single string ("slurps" it). That string is returned by the do block and assigned to the variable. The <> reads from the file(s) named in @ARGV, that is, whatever was given on the command line when the program was invoked.
In the regex pattern, the ? after .* makes it non-greedy, so it matches only up to the first occurrence of the next subpattern: after { the .*? matches up to the first (evaluated) $target, then the $target is matched, then .*? matches everything up to the first }. All of that is captured by the enclosing () and is thus later available in $1.
The /s modifier makes . match newlines, which it normally doesn't and which is necessary in order to match patterns that span multiple lines. With the /g modifier it keeps going through the string, searching for all such matches. With /x literal whitespace in the pattern is ignored, so we can spread out the pattern for readability (even over lines, and use comments!).
The $target is compiled as a proper regex pattern using the qr operator.
See regex tutorial perlretut, and then there's the full reference perlre.
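Since the question was specifically about using a variable as the search term, here is a small variation as a sketch. It is not the code above: it swaps .*? for the negated class [^{}]* so a match cannot run across a brace into a neighbouring block, and it assumes the search term is plain text, so it is quoted with \Q...\E. The sample text and the names $text, $search and @blocks are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input; in practice this would be the slurped file content.
my $text = <<'END';
something something {
entry 1
name foo
entry 2
}
other {
name bar
}
END

my $search = 'name foo';    # assumed to be plain text, not a regex
my @blocks;

# [^{}]* cannot cross a brace, so each match stays inside one block.
# \Q...\E escapes the interpolated text, including its space under /x.
while ( $text =~ / ( \{ [^{}]* \Q$search\E [^{}]* \} ) /gx ) {
    push @blocks, $1;
}

print "$_\n" for @blocks;
```

The /s modifier is not needed here because the negated character class already matches newlines. Like the Awk version, this does not handle nested braces.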
Here's an Awk attempt which tries to read between the lines to articulate an actual requirement. What I'm guessing you are trying to say is that "if there is an opening brace, print all content between it and the closing brace in case of a match inside the braces. Otherwise, just print the matching line."
We accomplish this by creating a state variable in Awk which keeps track of whether you are in a brace context or not. This simple implementation will not handle nested braces correctly; if that's your requirement, maybe post a new and better question with your actual requirements.
awk -v search="foo" 'n { context[++n] = $0 }
/{/ { delete context; n=0; matched=0; context[++n] = $0 }
/}/ && n { if (matched) for (i=1; i<=n; i++) print context[i];
delete context; n=0 }
$0 ~ search { if(n) matched=1; else print }' file
The variable n is the number of lines in the collected array context; when it is zero, we are not in a context between braces. If we find a match and are collecting lines into context, defer printing until we have collected the whole context. Otherwise, just print the current line.

Extract first word after specific word

I'm having difficulty writing a Perl program to extract the word following a certain word.
For example:
Today i'm not going anywhere except to office.
I want the word after anywhere, so the output should be except.
I have tried this
my $words = "Today i'm not going anywhere except to office.";
my $w_after = ( $words =~ /anywhere (\S+)/ );
but it seems this is wrong.
Very close:
my ($w_after) = ($words =~ /anywhere\s+(\S+)/);
   ^        ^                       ^^^
   +--------+                        |
     Note 1                       Note 2
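Note 1: in list context, =~ returns the list of captured items, so the assignment target needs to be a list. In scalar context (as in your attempt) it only returns true or false.
Note 2: allow one or more whitespace characters after anywhere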
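The difference that Note 1 describes can be seen side by side in a short sketch. The variable name $matched is introduced here only for the comparison.

```perl
use strict;
use warnings;

my $words = "Today i'm not going anywhere except to office.";

# Scalar context: the match only reports success (1) or failure ("").
my $matched = ( $words =~ /anywhere\s+(\S+)/ );

# List context: the match returns the captured groups.
my ($w_after) = $words =~ /anywhere\s+(\S+)/;

print "scalar context: $matched\n";
print "list context:   $w_after\n";
```

The parentheses around $w_after on the left-hand side are what switch the assignment to list context.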
In Perl v5.22 and later, you can use \b{wb} to get better results for natural language. The pattern could be
/anywhere\b{wb}.+?\b{wb}(.+?\b{wb})/
"wb" stands for word break, and it will account for words that have apostrophes in them, like "I'll", that plain \b doesn't.
.+?\b{wb}
matches the shortest non-empty sequence of characters that don't have a word break in them. The first one matches the span of spaces in your sentence; and the second one matches "except". It is enclosed in parentheses, so upon completion $1 contains "except".
\b{wb} is documented most fully in perlrebackslash
First, you have to write parentheses around the left-hand side of the = operator to force list context for the regexp evaluation. See m// and // in the perlop documentation.[1] You can also write parentheses around the =~ binding expression to improve readability, but it is not necessary because =~ has fairly high precedence.
Use the \w (word) character class:
my ($w_after) = ($words =~ / \b anywhere \W+ (\w+) \b /x);
Note that I'm using /x so whitespace in the regexp is ignored. Also use the \b word boundary to anchor the regexp correctly.
[1]: I write my ($w_after) just for convenience because you can write my ($a, $b, $c, @rest) as an equivalent of (my $a, my $b, my $c, my @rest), but you can also control the scope of your variables like (my $a, our $UGLY_GLOBAL, local $_, @_).
This regex should match:
my ($expect) = ($words=~m/anywhere\s+([^\s]+)\s+/);
[^\s]+ captures the word between two runs of whitespace.
Thanks.
If you want to also take into consideration the punctuation marks, like in:
my $words = "Today i'm not going anywhere; except to office.";
Then try this:
my ($w_after) = ($words =~ /anywhere[[:punct:]\s]+(\S+)/);

Perl regex: look for keyword which are not starting with

Example 1: "hello this is me. KEYWORD: blah"
Example 2: "KEYWORD: apple"
I just want to be able to catch KEYWORD in example 1, not 2 since in 2, it starts with KEYWORD
if ($line =~/KEYWORD:/x) {
# do something
}
The above code catch both examples. How can I change regex so that it only catches KEYWORD in example 1?
PS Eventually I want example 1 to be KEYWORD: blah
If you are just looking for a keyword, you should be using index and not a regex :
if (index($line, 'KEYWORD') > 0) {
# do something
}
See the documentation: index STR, SUBSTR returns -1 if SUBSTR isn't found in STR, otherwise it returns the index of SUBSTR in STR (starting at 0). A result greater than 0 therefore means the first occurrence of the keyword is not at the very start.
If you want to look for a more complex pattern than a simple keyword, then you should do as @Perl Dog said in his answer.
You are looking for a negative lookbehind assertion, i.e. for 'KEYWORD' that is not preceded by a certain string (in your case the start-of-line marker ^):
if ($line =~/(?<!^)KEYWORD:/x) {
# found KEYWORD in '$line', but not at the beginning
print $line, "\n";
}
Output:
hello this is me. KEYWORD: blah
Update: As stated in the comments, the /x modifier isn't necessary in my first regex but can be used to make the pattern more readable. It allows for whitespace (including newlines) and/or comments in the pattern to improve readability. The downside is that every blank/space character in the actual pattern has to be escaped (to distinguish it from the comments) but we don't have these here. The pattern can thus be re-written as follows (the result is the same):
if ($line =~ / (?<! # huh? (?) ahh, look left (<) for something
# NOT (!) appearing on the left.
^) # oh, ok, I got it, there must be no '^' on the left
KEYWORD: # but the string 'KEYWORD:' should come then
/x ) {
# found KEYWORD in '$line', but not at the beginning
print $line, "\n";
}
The answer is actually quite simple!
/.KEYWORD/ # Not at the start of a line
/.KEYWORD/s # Not at the start of the string
By the way, you might want to add \b before KEYWORD to avoid matching NOTTHEKEYWORD.
I think you need to give better, real examples
On the face of it, all you need is
if ( /KEYWORD/ and not /^KEYWORD/ ) {
...
}
Another simple regex
print if /^.+KEYWORD/;
match
hello this is me. KEYWORD: blah
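For the PS in the question (turning example 1 into "KEYWORD: blah"), the same idea can be turned into a substitution. This is only a sketch under the assumption that everything before the first KEYWORD: should be dropped; the names @lines, @trimmed and $copy are illustrative.

```perl
use strict;
use warnings;

# The two sample lines from the question.
my @lines = (
    "hello this is me. KEYWORD: blah",
    "KEYWORD: apple",
);

my @trimmed;
for my $line (@lines) {
    my $copy = $line;    # work on a copy so @lines stays untouched
    # .+? must consume at least one character, so a line that already
    # starts with KEYWORD: cannot match and is skipped.
    if ( $copy =~ s/^.+?(?=KEYWORD:)// ) {
        push @trimmed, $copy;
    }
}

print "$_\n" for @trimmed;
```

The lookahead (?=KEYWORD:) checks for the keyword without consuming it, so only the text in front of it is removed.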

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the start of the string along with $1 in this case.
Try this pattern
foreach (@list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share a single group number ($1 here). This way, you avoid using a second replacement to trim a zero from the left of the capture.
To remove everything from the beginning up to the number, I use:
.*? # all characters zero or more times
# ( ? -> make the * quantifier lazy to match as little as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
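Putting the two steps of this answer together, a minimal runnable sketch could look like the following. It builds a new list with map instead of modifying @list in place, which is a choice made for the example; @renamed is a name introduced here.

```perl
use strict;
use warnings;

my @list = ( 'R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt' );

# For each name, take the first run of two digits, numify it to drop a
# leading zero, and append the extension. Names without two digits are
# passed through unchanged.
my @renamed = map { /(\d{2})/ ? ( 0 + $1 ) . '.txt' : $_ } @list;

print "$_\n" for @renamed;
```

The parentheses around 0 + $1 make the intended order of addition and concatenation explicit.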
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part of your s/// does match what you think it does, but the second part isn't replacing what you think it should. s/// will only replace what was actually matched. Thus, to replace something like T3_ you will have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

How can I validate a filename with only eight digits and an extension, in Perl?

Perl (no modules loaded and -Tw & strict)
I found much info on regex and pattern matching here but not exactly what I need.
I want to know if this is the correct way to validate a couple things.
Sorry about the beginner attempt here. I am quite new to this.
my $this = "12345678";
if ($this != m/\b[0-9]{8}\b/x) { print "$this is bad"; }
my $that = "12345678.gif";
if ($that != m/\b[0-9]{8}\.gif\b/x) { print "$that is bad"; }
or
if ($that != m/\b[0-9]{8}\.(jpe?g|gif|png)\b/x) { print "$that is bad"; }
my ($ext) = $that =~ m/\.([^\.]+)$/x;
# verify extension
if ($ext != m/\.(jpe?g|png|gif)$/x ){ print "$ext is bad"; }
# for content type
if ($ext eq "jpg") {$ext = "jpeg";}
I use the /x because perl::critic indicated I needed it. Passes with /x so...
\d is not an option and should be avoided from what I have read here.
The file name "that" has to be 8 digits + an image type. The other set of digits "this" is actually a folder name. This provides a little error checking for an image serving script. The root htaccess sends calls to images in a specific folder to said script. I grab the directory and image name off via path info.
If you are running under taint checking, this isn't the way to do it. You need to match the pattern you want and then keep the result in a capture (memory) variable to clear the taint:
my $this = ...;
my $regex = qr/
^ # beginning of string
( # start of $1
[0-9]{8}
\.
(gif|jpg) # extension in $2
)
\z #end of string
/x;
my( $cleansed, $extension ) = do {
if( $this =~ m/$regex/ ) { ( $1, $2 ) }
else { die "Bad filename!" }
};
I'm not sure why you have a \b at the beginning of your regex. It probably doesn't do what you think it does. If you want the file name to be only the digits, use the ^ beginning of string anchor instead. That way, nothing can come before the digits. Likewise, the end of string anchor \z says that nothing can come after the extension.
If you then need to map an extension to a content type for an HTTP response, which I'm guessing you're doing, you can use a hash to make the map:
my %types = (
jpg => 'jpeg',
gif => 'gif',
...
);
Now that you have the hash, you can use it as another level of validation:
unless( exists $types{$extension} ) { die "Unsupported type!" }
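The pieces of this answer (anchored match, captures, and the type map) can be combined into one small routine. This is a sketch, not a drop-in for the image script: parse_image_name is a hypothetical helper name, the set of extensions is taken from the question, and the entries in %types are illustrative.

```perl
use strict;
use warnings;

# Extension => content-type suffix; the entries are illustrative.
my %types = ( jpg => 'jpeg', jpeg => 'jpeg', gif => 'gif', png => 'png' );

# parse_image_name is a hypothetical helper, not part of any module.
# Returns (clean_name, type) on success, or an empty list on failure.
sub parse_image_name {
    my ($name) = @_;
    return unless $name =~ /\A([0-9]{8})\.(jpe?g|gif|png)\z/;
    return ( "$1.$2", $types{$2} );
}

for my $candidate ( '12345678.jpg', 'x12345678.gif', '1234567.png' ) {
    my ( $clean, $type ) = parse_image_name($candidate);
    print defined($clean)
        ? "$candidate ok: image/$type\n"
        : "$candidate is bad\n";
}
```

Building the returned name from $1 and $2 rather than from the input is what matters under taint mode, since only data that came out of a capture is considered clean.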
Most of what you have above looks fine. A few points:
if ($ext != m/pattern/) is wrong - the != operator should be !~
\d is just fine, if you're parsing filenames or anything else not likely to be unicode
avoid /x unless you really need it (you're not splitting the regexp up over multiple lines for readability). For that matter, avoid any of the flags unless you need them
(jpe?g|gif|png) can be modified to (?:jpe?g|gif|png) to disable capturing on that set of parentheses (the efficiency increase is nominal in most cases, but it occasionally can make a difference e.g. in a rapid loop, so I make it a habit to not capture unless I need to)
you don't need to escape . inside a character class - i.e. [^\.] can be [^.] (inside a character class only ], \, ^ at the start, and - between two characters are special) :)
it's "Perl", not "PERL" :)
You need to use =~ and !~ instead of == and != for regex matching. Also after removing redundant code and optimizing, I would write it this way.
my $that = "12345678.gif";
if ($that =~ m/\b[0-9]{8}\.(jpe?g|gif|png)\b/x)
{
my $ext = $1;
if ($ext eq "jpg") {$ext = "jpeg";}
}
else
{
print "$that is bad";
}