What does this Perl regex do?


What does the following snippet do?
if ($str =~ /^:(\w+)/) {
$hash{$1} = 1;
}
It uses the first successful capture as a key in the hash, and $str has to contain one or more words, but I am not sure what the ^: means.

^ start at beginning of string
: match a literal colon
( capture the following string
\w+ match one or more word characters (letters, digits, or underscore)
) end capture
The capture is stored in $1, which then becomes a key in the hash %hash below.
So if you have the string :foo, you will match foo, and get $hash{foo} = 1. The purpose of this code is no doubt to extract certain strings and dedupe them using a hash.
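A minimal sketch of the dedupe idea described above. The list @lines and its contents are made up for illustration; only the regex and the hash assignment come from the question.

```perl
use strict;
use warnings;

# Hypothetical input: some strings start with ":word", some do not.
my @lines = (':foo', ':bar baz', 'no colon', ':foo again', 'x:late');

my %hash;
for my $str (@lines) {
    # ^ anchors at the start, : is a literal colon, (\w+) captures into $1
    if ($str =~ /^:(\w+)/) {
        $hash{$1} = 1;
    }
}

# Duplicates collapse because hash keys are unique.
my @keys = sort keys %hash;
print "@keys\n";    # prints "bar foo"
```

Note that 'x:late' is skipped: the colon is there, but not at the start of the string.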

^: means a ":" character at the start of the string. Also, it will capture only a single "word" after the :

It means: at the beginning of the string (^), a :
e.g.:
$string = q~:Thats~;
$hash{Thats} = 1;
$string2 = q~Thats~;
The if statement succeeds for $string, but it fails for $string2 because $string2 doesn't start with a :.

You said:
'And the $str has to contain one or more words ...'
I'm not sure whether this is simply a typo or whether your intention differs from your small example. As posted, your regex would match a string like :Hello. In Perl, this could also be written as
my %hash = ();
my $str = ':Hello';
$hash{ $1 }++ if $str =~ /^:(\w+)/;
Now, if you change ^: in your regex to (?:^|:), which means your word should be preceded by either the start of the string (^) or a colon (:), your regex can match lines like 'Hello:World:Perl:Script' (maybe this was the real intention). Note that a character class such as [:^] would not do this: inside brackets, ^ is a literal caret, not an anchor.
Such a string could then be dissected in a while loop:
$hash{ $1 }++ while $str =~ /(?:^|:)(\w+)/g;
If you print the captured keys: print "@{[keys %hash]}";
the result would be: Perl Script Hello World (the order of the keys is undefined).
These kinds of strings are widespread in the Unix world, e.g. the environment variables PATH and LD_LIBRARY_PATH; the file /etc/passwd also looks like that.
BTW, this is only an idea - IF your typo wasn't really one ;-)
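For colon-separated strings like these, split is a common alternative to a /g loop. A rough sketch, assuming a made-up sample string in $str:

```perl
use strict;
use warnings;

my $str = 'Hello:World:Perl:Script';   # made-up sample

my %hash;
# split returns the fields between the colons; count each one.
$hash{$_}++ for split /:/, $str;

# Sort the keys, since hash order is undefined.
my @keys = sort keys %hash;
print "@keys\n";   # prints "Hello Perl Script World"
```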

Related

Capture a substring between two characters?

I am trying to write a regex pattern which will capture a substring between two characters. The string is
default_checks/my_checks/VLG6.3: Unsupported system function call
I need to capture VLG6.3. It is between a slash / and a colon :.
I have tried these ideas
my $rule = $line =~ /\/(.*)\:/;
my $rule = $line =~ /\/(.+?)\:/ ;
my $rule = $line =~ /\/(\w+)\:/ ;
But none of them are working. In the best case I get my_checks/VLG6.3

Aside from the issue with assigning a list to a scalar, which ikegami has helpfully pointed out, the regex pattern can use some fixing.
The repeater * in regex is greedy. It gobbles up as many characters as it can as long as it matches. You need to let another repeater do the gobbling up front so that it only leaves just enough for the repeater you really want to match.
my ($rule) = $line =~ /.*\/(.*):/;
Alternatively, in this case you can just use an exclusion class instead of matching any characters.
my ($rule) = $line =~ /\/([^\/]*):/;
Both of the above will end up with $rule assigned with 'VLG6.3'.
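To see the difference side by side, here is a small sketch that runs the asker's greedy pattern and the two fixes on the same line. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Greedy capture starting at the first slash: takes too much.
my ($greedy)   = $line =~ /\/(.*):/;
# A leading .* consumes everything up to the last slash first.
my ($leading)  = $line =~ /.*\/(.*):/;
# Negated class: the capture cannot cross a slash.
my ($excluded) = $line =~ /\/([^\/]*):/;

print "$greedy\n";     # my_checks/VLG6.3
print "$leading\n";    # VLG6.3
print "$excluded\n";   # VLG6.3
```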

You are interested in a non-empty string, meeting the following conditions:
It is preceded by a /.
It is followed by a colon.
It contains neither / nor a colon.
So the intuitive regex, without any capturing group is:
(?<=\/)[^\/:]+(?=:)
(a positive lookbehind, the actual content, and a positive lookahead).
Using such a regex, you can:
Use the result of the =~ operator only to check whether something has been matched.
Print the matched text from the $& variable.
And the example script can look like below:
use strict;
use warnings;
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
print "Source: $line\n";
if ($line =~ /(?<=\/)[^\/:]+(?=:)/) {
print "Rule: $&\n";
} else {
print "No match.\n";
}

The reason you are getting 1 is because you are evaluating the match in scalar context. For the match to return the captures, it needs to be evaluated in list context.
You need to evaluate the match in list context by evaluating the =~ in list context. Unlike the scalar assignment operator you used, the list assignment operator evaluates its operands in list context. You can cause the list assignment operator to be used by replacing my $rule with my ($rule).
my ($rule) = $line =~ /\/(.*)\:/;
See Why are there parentheses around scalar when assigning the return value of regex match in this Perl snippet?.
Furthermore, the match operator will grab more than desired. You can address that by replacing
/\/(.*)\:/
with
/\/([^\/]*)\:/
I would write that as follows:
m{/([^/]*):}
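The context difference is easy to check for yourself. A small sketch (variable names are mine, not from the question):

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Scalar assignment: the match returns a success flag.
my $scalar = $line =~ m{/([^/]*):};
# List assignment: the match returns the captures.
my ($list) = $line =~ m{/([^/]*):};

print "$scalar\n";   # 1
print "$list\n";     # VLG6.3
```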

To capture a string between two characters, capture everything that is not the two characters.
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
my ( $rule ) = $line =~ /\/([^\/:]*):/;
print "$rule\n";
PS: Capturing content between two strings involves skipping earlier occurrences of the starting string.
my $line = 'begin not this begin or this begin wanted end not this end or this end';
my ( $rule ) = $line =~ m{ (?: begin .* )? begin (.*?) end }msx;
print "$rule\n";

How to extract last occurrence of character in Perl?

I am trying to extract a particular part of the string.
Input String : $str = /wave=1/sin2=1/sin1=2/sin0=3
Output String : $str = /wave=1/sin2=1/sin1=2
Method 1 :
@waveSplitArray = split /\//,$str;
$lastOccuranceOfWave = pop @waveSplitArray;
How to use regex to get the desired output?

The easiest way is to use a greedy quantifier with the "keep" escape \K, which excludes everything matched before it from the part that gets replaced.
If you want to keep the value of $str and put the result in a new variable
my $s2 = $str =~ s|.*\K/.*||r;
or
( my $s2 = $str ) =~ s|.*\K/.*||;
If you want to modify the original string, then it's just
$str =~ s|.*\K/.*||;

Try
/(.*\/)[^\/]*/
and you'll have the desired pattern in $1.

You can try this method also
my $str = '/wave=1/sin2=1/sin1=2/sin0=3';
my ($st2) = $str =~m{(.+)/};
print $st2;
Here the braces in m{(.+)/} work as pattern delimiters, just like /.../ (don't confuse them with a quantifier such as \d{n}).
The . matches any character except a newline, and the + quantifier makes it match one or more times. Because + is greedy, .+ first runs to the end of the string and then backtracks until a / can match, so it stops at the last /. The text before that slash is stored by the capturing group () and assigned to $st2.
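A quick way to see the backtracking at work is to compare the greedy quantifier with its lazy counterpart on the same string. This is just a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $str = '/wave=1/sin2=1/sin1=2/sin0=3';

# Greedy: runs to the end, then backtracks to the last slash.
my ($greedy) = $str =~ m{(.+)/};
# Lazy: grows one character at a time and stops at the first slash it can.
my ($lazy)   = $str =~ m{(.+?)/};

print "$greedy\n";   # /wave=1/sin2=1/sin1=2
print "$lazy\n";     # /wave=1
```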

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?

You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}

The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion, that checks that the characters around it are word and non-word.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at start or end of string. Do note that \b also matches next to non-word characters that can appear inside what you consider a word, such as:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions
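To make the difference between the two suggested patterns concrete, here is a small sketch run against a made-up string that contains a hyphenated token:

```perl
use strict;
use warnings;

my $text = 'foo aaa-xxx bar';   # made-up sample

# \b treats the hyphen as a boundary, so both halves are picked up.
my @boundary   = $text =~ /\b(\w+)\b/g;
# The lookarounds demand whitespace (or string start/end) on both sides.
my @whitespace = $text =~ /(?<!\S)(\w+)(?!\S)/g;

print "@boundary\n";     # foo aaa xxx bar
print "@whitespace\n";   # foo bar
```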

regular expressions in perl for extracting information

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?

Use the /s switch so that the any character . will match new lines.
I also suggest that you use non greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'

Your current regular expression does not do what you may think, by placing those characters inside of a character class; it matches any character of: ( ., *, \n, * ) instead of actually matching what you want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;

Please try this:
begin parameters([\S\s]+?)end parameters
Explanation: [\S\s] matches any character that is whitespace or any character that is not whitespace (so, in effect, any character at all, including newlines), and it keeps going lazily until it finds "end parameters".
I hope this is what you expect.

The meta-character . loses its special properties inside of a character class.
So [.*\n*] actually matches a single character: a literal period, an asterisk, or a newline.
What you actually want is to match zero or more of any character, including newlines. You can express that with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
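A short sketch to check this against a made-up block of text. It also uses the /s modifier as a more common way to let the dot match newlines; $text and its contents are assumptions for the demo.

```perl
use strict;
use warnings;

my $text = "begin parameters\nalpha\nbeta\nend parameters\n";   # made-up sample

# The class [.*\n*] matches exactly one character, so this cannot span the block.
my $class_matches = $text =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

# The alternation (?:.|\n) matches any character, newline included.
my $group_matches = $text =~ /begin parameters(?:.|\n)*?end parameters/ ? 1 : 0;

# With /s the dot itself matches newlines; capture the body lazily.
my ($body) = $text =~ /begin parameters(.*?)end parameters/s;

print "$class_matches\n";   # 0
print "$group_matches\n";   # 1
print $body;                # the two lines between the markers
```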

How to split a string with multiple patterns in perl?

I want to split a string with multiple patterns:
ex.
my $string= "10:10:10, 12/1/2011";
my @string = split(/firstpattern/secondpattern/thirdpattern/, $string);
foreach(@string) {
print "$_\n";
}
I want to have an output of:
10
10
10
12
1
2011
What is the proper way to do this?

Use a character class in the regex delimiter to match on a set of possible delimiters.
my $string= "10:10:10, 12/1/2011";
my @string = split /[:,\s\/]+/, $string;
foreach(@string) {
print "$_\n";
}
Explanation
The pair of slashes /.../ denotes the regular expression or pattern to be matched.
The pair of square brackets [...] denotes the character class of the regex.
Inside is the set of possible characters that can be matched: colons :, commas ,, any type of space character \s, and forward slashes \/ (with the backslash as an escape character).
The + is needed to match on 1 or more of the character immediately preceding it, which is the entire character class in this case. Without this, the comma-space would be considered as 2 separate delimiters, giving you an additional empty string in the result.
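The effect of the + is easy to verify by splitting the same string both ways. A small sketch (the array names are mine):

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

# With +, a run like ", " counts as one delimiter.
my @with    = split /[:,\s\/]+/, $string;
# Without +, the comma and the space are two delimiters with an empty field between.
my @without = split /[:,\s\/]/,  $string;

print scalar(@with), "\n";      # 6
print scalar(@without), "\n";   # 7
```

The extra element in @without is the empty string at index 3, between the comma and the space.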

Wrong tool!
my $string = "10:10:10, 12/1/2011";
my @fields = $string =~ /([0-9]+)/g;

You can split on non-digits:
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
my $string= "10:10:10, 12/1/2011";
say for split /\D+/, $string;

my $string= "10:10:10, 12/1/2011";
# General form: split(m[(?:firstpattern|secondpattern|thirdpattern)+], $string);
my @string = split(m[(?:/| |,|:)+], $string);
print join "\n", @string;

To answer your original question:
you were looking for the | operator:
my $string = "10:10:10, 12/1/2011";
my @string = split(/:|,\s*|\//, $string);
foreach(@string) {
print "$_\n";
}
But, as the other answers point out, you can often improve on that with further simplifications or generalizations.

If numbers are what you want, extract numbers:
my @numbers = $string =~ /\d+/g;
say for @numbers;
Capturing parentheses are not required, as specified in perlop:
The /g modifier specifies global pattern matching--that is, matching
as many times as possible within the string. How it behaves depends on
the context. In list context, it returns a list of the substrings
matched by any capturing parentheses in the regular expression. If
there are no parentheses, it returns a list of all the matched
strings, as if there were parentheses around the whole pattern.
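A brief sketch of the two behaviours the quote describes, using the string from the question. The second pattern, with two capture groups, is only there to illustrate the "with parentheses" case.

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

# No parentheses: list context returns every whole match.
my @all   = $string =~ /\d+/g;
# With parentheses: list context returns the captured groups of each match.
my @pairs = $string =~ m{(\d+)/(\d+)}g;

print "@all\n";     # 10 10 10 12 1 2011
print "@pairs\n";   # 12 1
```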

As you're parsing something that is rather obviously a date/time, I wonder if it would make more sense to use DateTime::Format::Strptime to parse it into a DateTime object.
