Capture a substring between two characters? - regex

I am trying to write a regex pattern which will capture a substring between two characters. The string is
default_checks/my_checks/VLG6.3: Unsupported system function call
I need to capture VLG6.3. It is between a slash / and a colon :.
I have tried these ideas
my $rule = $line =~ /\/(.*)\:/;
my $rule = $line =~ /\/(.+?)\:/ ;
my $rule = $line =~ /\/(\w+)\:/ ;
But none of them are working. In the best case I get my_checks/VLG6.3

Aside from the issue with assigning a list to a scalar, which ikegami has helpfully pointed out, the regex pattern can use some fixing.
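The quantifier * in regex is greedy: it consumes as many characters as it can while still allowing the overall pattern to match. The engine also settles for the leftmost possible match, so \/(.*): starts at the first slash. You need to let another greedy quantifier do the gobbling up front so that it leaves just enough for the group you really want to capture.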
my ($rule) = $line =~ /.*\/(.*):/;
Alternatively, in this case you can just use an exclusion class instead of matching any characters.
my ($rule) = $line =~ /\/([^\/]*):/;
Both of the above will end up with $rule assigned with 'VLG6.3'.
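As a minimal sketch, the three patterns can be run side by side on the sample line from the question. The variable names ($greedy, $prefixed, $excluded) are illustrative only and do not appear in the original posts.

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Leftmost match: the capture starts right after the FIRST slash.
my ($greedy) = $line =~ /\/(.*):/;

# A leading greedy .* pushes the capture to start after the LAST slash.
my ($prefixed) = $line =~ /.*\/(.*):/;

# A negated class can never cross a slash, so only the last segment fits.
my ($excluded) = $line =~ /\/([^\/]*):/;

print "$greedy\n";    # my_checks/VLG6.3
print "$prefixed\n";  # VLG6.3
print "$excluded\n";  # VLG6.3
```

The negated-class version tends to be the easier one to reason about, since it says directly which characters may appear inside the capture.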

You are interested in a non-empty string, meeting the following conditions:
It is preceded by a /.
It is followed by a colon.
It contains neither / nor a colon.
So the intuitive regex, without any capturing group is:
(?<=\/)[^\/:]+(?=:) (positive lookbehind, the actual content
and positive lookahead).
Using such a regex, you can:
Use the result of =~ operator only to check whether something has been
matched.
Print the matched text from $& variable.
And the example script can look like below:
use strict;
use warnings;
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
print "Source: $line\n";
if ($line =~ /(?<=\/)[^\/:]+(?=:)/) {
print "Rule: $&\n";
} else {
print "No match.\n";
}

The reason you are getting 1 is because you are evaluating the match in scalar context. For the match to return the captures, it needs to be evaluated in list context.
Unlike the scalar assignment operator you used, the list assignment operator evaluates its right-hand side in list context. You can cause the list assignment operator to be used by replacing my $rule with my ($rule).
my ($rule) = $line =~ /\/(.*)\:/;
See Why are there parentheses around scalar when assigning the return value of regex match in this Perl snippet?.
Furthermore, the match operator will grab more than desired. You can address that by replacing
/\/(.*)\:/
with
/\/([^\/]*)\:/
I would write that as follows:
m{/([^/]*):}
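The difference between the two contexts can be seen in a short sketch. It reuses the sample line from the question; $matched is a hypothetical name introduced here for the scalar-context result.

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Scalar assignment: the match only reports success (1) or failure.
my $matched = $line =~ m{/([^/]*):};

# List assignment: the match returns the list of captured groups.
my ($rule) = $line =~ m{/([^/]*):};

print "$matched\n";  # 1
print "$rule\n";     # VLG6.3
```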

To capture a string between two characters, capture everything that is not the two characters.
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
my ( $rule ) = $line =~ /\/([^\/:]*):/;
print "$rule\n";
PS: Capturing content between two strings involves skipping earlier occurrences of the starting string.
my $line = 'begin not this begin or this begin wanted end not this end or this end';
my ( $rule ) = $line =~ m{ (?: begin .* )? begin (.*?) end }msx;
print "$rule\n";

Related

Extract delimited substrings from a string

I have a string that looks like this
my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3" ;
I want to use capture groups to extract only the PayRate values contained within the first [[...]] block.
$1 = "sDate=05Jul2017,Rate=0.05,eDate=06Sep2017"
I tried this but it returns the entire string.
my $match =~ m/PayRate=\[\[(.*)\]\],/ ;
It is clear that I have to put specific patterns for the series of {(.*)=(.*)} blocks inside. Need expert advice.
You are using a greedy match .*, which consumes as much input as possible while still matching. As a result, you're matching from the first [[ to the last ]].
Instead, use a reluctant match .*?, which matches as little as possible while still matching:
my ( $match) = $source =~ /PayRate=\[\[(.*?)\]\]/;
Use the /x modifier on match (and substitute) so you can use white space for easier reading and comments to tell what is going on. Limit the patterns by matching everything not in the pattern. [^\]]* is better than .*?.
my ( $match ) = $source =~ m{
PayRate = \[\[ # select which part
( [^\]]* ) # capture everything up to the first closing bracket
\]\] # closing delimiter
}x;
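Once the block has been extracted, it can be broken into its key=value fields, which is what the question was ultimately after. The following is only a sketch: it assumes fields are separated by commas and that values never contain a comma, and the names $block and %pay_rate are illustrative.

```perl
use strict;
use warnings;

my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3";

my %pay_rate;
if ( my ($block) = $source =~ /PayRate=\[\[(.*?)\]\]/ ) {
    # Split the block on commas, then split each field on its first '='.
    for my $field ( split /,/, $block ) {
        my ( $key, $value ) = split /=/, $field, 2;
        $pay_rate{$key} = $value;
    }
}

# Print the fields in a stable (sorted) order.
print "$_ => $pay_rate{$_}\n" for sort keys %pay_rate;
```

Splitting after the match keeps the regex simple; trying to capture a variable number of key=value pairs in one pattern is usually harder to read.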

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match on and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+. This is unlikely to do what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion that checks that there is a word character on one side of it and a non-word character (or the edge of the string) on the other.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at start or end of string. Do note that \b also matches partially at non-word characters that can be part of words, such as:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
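The practical difference between the two final patterns shows up on hyphenated input. A small sketch, using a made-up test string and illustrative array names:

```perl
use strict;
use warnings;

my $text = 'aaa-xxx yyy';

# \b finds a boundary next to '-', so both halves of 'aaa-xxx' match.
my @boundary = $text =~ /\b(\w+)\b/g;

# The lookarounds demand whitespace (or the string edge) on both sides.
my @spaced = $text =~ /(?<!\S)(\w+)(?!\S)/g;

print "@boundary\n";  # aaa xxx yyy
print "@spaced\n";    # yyy
```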
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

regular expressions in perl for extracting information

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?
Use the /s switch so that the any character . will match new lines.
I also suggest that you use non greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
Your current regular expression does not do what you may think, by placing those characters inside of a character class; it matches any character of: ( ., *, \n, * ) instead of actually matching what you want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;
Please try this:
begin parameters([\S\s]+?)end parameters
Translation: [\S\s] matches any character that is either non-whitespace or whitespace (so in effect any character, newlines included), and the lazy +? keeps consuming only until the first "end parameters" is found.
I hope it is what you expect.
The meta-character . loses its special properties inside of a character class.
So [.*\n*] actually matches exactly one character: a literal period, a literal asterisk, or a newline.
What you actually want is to match zero or more characters, each of which is either any non-newline character or a newline. You can represent that with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
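To see why the bracketed version from the question fails while the /s version works, here is a short sketch on an inline sample. The sample text and the names $class_hit and $body are assumptions made for illustration.

```perl
use strict;
use warnings;

my $data = "begin parameters\nalpha\nbeta\nend parameters\n";

# Inside brackets '.' and '*' are literal, so the class matches ONE character only.
my $class_hit = $data =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

# With /s the dot also matches newlines; the lazy .*? stops at the first terminator.
my ($body) = $data =~ /begin parameters(.*?)end parameters/s;

print "class version matched: $class_hit\n";  # class version matched: 0
print "captured:$body";
```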

What does this Perl regex do?

What does the following snippet do?
if ($str =~ /^:(\w+)/) {
$hash{$1} = 1;
}
It uses the first successful capture as key in the hash. And the $str has to contain one or more words but I am not sure what the ^: means
^ start at beginning of string
: match a literal colon
( capture the following string
\w+ matching one or more word characters (letters, digits, underscore)
) end capture
The capture is stored in $1, which then becomes a key in the hash %hash below.
So if you have the string :foo, you will match foo, and get $hash{foo} = 1. The purpose of this code is no doubt to extract certain strings and dedupe them using a hash.
^: means a ":" symbol at the start of the line. Also, it will capture only a single "word" after the :.
It means: at the beginning of the line (^), a :.
e.g:
$string = q~:Thats~;
$hash{Thats} = 1;
$string2 = q~Thats~;
The if statement succeeds for $string, but it fails for $string2 because that string doesn't start with a :.
You said:
'And the $str has to contain one or more words ...'
I'm not sure if this is simply a typo or if your intention is different from your small example. Now (according to your post), your regex would match a string like: :Hello. In Perl, this could be written also like
my %hash = ();
my $str = ':Hello';
$hash{ $1 }++ if $str =~ /^:(\w+)/;
Now, if you changed ^: in your regex to (?:^|:), which means that the word must be preceded by either the start of the string ^ or a colon :, the regex could match every field in a line like 'Hello:World:Perl:Script' (maybe this was the real intention). Note that a bracketed class such as [:^] would not do this: inside brackets, a ^ that is not in first position is a literal caret.
Such a string could then be dissected in a while loop:
$hash{ $1 }++ while $str =~ /(?:^|:)(\w+)/g;
If you print the captured keys: print "@{[keys %hash]}";
the result would be: Perl Script Hello World (the order of the keys is undefined).
These kinds of strings are widespread in the unix world, e.g. the environment variables PATH, LD_LIBRARY_PATH, and also the file /etc/passwd looks like that.
BTW, this is only an idea - IF your typo wasn't really one ;-)
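A minimal sketch of dissecting such a colon-separated string follows. It uses the alternation (?:^|:) to accept either the start of the string or a colon before each word. A bracketed class containing ^ in a non-leading position matches a literal caret instead, which the second match illustrates. The array name @bracket is hypothetical.

```perl
use strict;
use warnings;

my $str = 'Hello:World:Perl:Script';
my %hash;

# (?:^|:) matches at the start of the string or right after a colon.
$hash{$1}++ while $str =~ /(?:^|:)(\w+)/g;

# [:^] is a class of two literal characters, ':' and '^', so 'Hello' is missed.
my @bracket = $str =~ /[:^](\w+)/g;

print join( ' ', sort keys %hash ), "\n";  # Hello Perl Script World
print "@bracket\n";                        # World Perl Script
```

The keys are sorted before printing only to make the output order predictable.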

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As defined here.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens indicate a capture, not the whole regex. You have parens around a comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or just leave the comma outside the group since it isn't part of what you want to capture; it won't make a difference in this case. However, the pattern as written puts the comma instead of the number into capture group 1, and because a //g match in list context returns the captured groups, the loop receives the comma rather than the number.
This is how a capture group is retrieved:
$mystring = "The start text always precedes the end of the end text.";
if($mystring =~ m/start(.*)end/) {
print $1;
}
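The variants discussed in this thread can be compared side by side. This is a sketch using the string from the question; the array names are illustrative.

```perl
use strict;
use warnings;

my $text = "241.000,00";

# Group around the comma: the list holds the captured comma.
my @comma = $text =~ /[0-9.]+(,)/g;

# Group around the number: the list holds the number.
my @number = $text =~ /([0-9.]+),/g;

# No group, comma checked by lookahead: the list holds each whole match.
my @whole = $text =~ /[0-9.]+(?=,)/g;

print "@comma\n";   # ,
print "@number\n";  # 241.000
print "@whole\n";   # 241.000
```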