regular expressions in perl for extracting information - regex

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?

Use the /s modifier so that the . metacharacter also matches newlines.
I also suggest non-greedy matching, which you get by adding ? to the quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
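The difference between the greedy and the non-greedy quantifier only shows up when the input holds more than one block. Here is a minimal sketch, assuming a made-up input with two parameter blocks (the contents a=1 and b=2 are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical input with two parameter blocks separated by other text.
my $data = "begin parameters\na=1\nend parameters\n"
         . "noise\n"
         . "begin parameters\nb=2\nend parameters\n";

# Greedy: .* runs on to the LAST 'end parameters'.
my ($greedy) = $data =~ /begin parameters(.*)end parameters/s;

# Non-greedy: .*? stops at the FIRST 'end parameters'.
my ($lazy) = $data =~ /begin parameters(.*?)end parameters/s;

# With /g in list context, the non-greedy form yields one capture per block.
my @blocks = $data =~ /begin parameters(.*?)end parameters/sg;

print "greedy capture contains 'noise': ", ($greedy =~ /noise/ ? "yes" : "no"), "\n";
print "lazy capture: [$lazy]\n";
print "blocks found: ", scalar(@blocks), "\n";
```

The greedy capture swallows the text between the two blocks, while the non-greedy one returns each block body separately.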

Your current regular expression does not do what you may think. Because those characters are inside a character class, [.*\n*] matches exactly one character from the set ( ., *, \n ) rather than any run of text.
You can use the s modifier to make the dot . match newlines as well. By placing a capturing group around what you want to extract, you can access it through $1:
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;

Please try this:
begin parameters([\S\s]+?)end parameters
Translation: [\S\s] matches any character that is whitespace or any character that is not whitespace (so, in effect, any character at all, newlines included), and +? consumes as few characters as possible until "end parameters" is found.
I hope this is what you expect.

The metacharacter . loses its special meaning inside a character class, and so does *.
So [.*\n*] actually matches exactly one character: a literal period, a literal asterisk, or a newline.
What you actually want is to match zero or more of any character, newlines included, which you can represent with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
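To make the character-class point concrete, here is a small sketch that runs both patterns against the same text. The sample text (alpha = 1) is an assumption made up for the demonstration:

```perl
use strict;
use warnings;

my $text = "begin parameters\nalpha = 1\nend parameters\n";

# Original attempt: [.*\n*] is a character class, so it matches exactly ONE
# character that is a literal '.', a literal '*', or a newline.
my $class = $text =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

# Alternation in a non-capturing group: a non-newline character OR a newline,
# repeated as few times as possible.
my $group = $text =~ /begin parameters(?:.|\n)*?end parameters/ ? 1 : 0;

# The class only succeeds when a single such character separates the phrases.
my $one_nl = "begin parameters\nend parameters"
    =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

print "class=$class group=$group one_nl=$one_nl\n";
```

The class-based pattern fails on the multi-line text and succeeds only in the degenerate case where one newline sits between the two phrases.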

Related

Capture a substring between two characters?

I am trying to write a regex pattern which will capture a substring between two characters. The string is
default_checks/my_checks/VLG6.3: Unsupported system function call
I need to capture VLG6.3. It is between a slash / and a colon :.
I have tried these ideas
my $rule = $line =~ /\/(.*)\:/;
my $rule = $line =~ /\/(.+?)\:/ ;
my $rule = $line =~ /\/(\w+)\:/ ;
But none of them are working. In the best case I get my_checks/VLG6.3
Aside from the issue with assigning a list to a scalar, which ikegami has helpfully pointed out, the regex pattern can use some fixing.
The repeater * in regex is greedy. It gobbles up as many characters as it can as long as it matches. You need to let another repeater do the gobbling up front so that it only leaves just enough for the repeater you really want to match.
my ($rule) = $line =~ /.*\/(.*):/;
Alternatively, in this case you can just use an exclusion class instead of matching any characters.
my ($rule) = $line =~ /\/([^\/]*):/;
Both of the above will end up with $rule assigned with 'VLG6.3'.
You are interested in a non-empty string, meeting the following conditions:
It is preceded by a /.
It is followed by a colon.
It contains neither / nor a colon.
So the intuitive regex, without any capturing group is:
(?<=\/)[^\/:]+(?=:) (positive lookbehind, the actual content, and positive lookahead).
Using such a regex, you can:
Use the result of the =~ operator only to check whether something has been matched.
Print the matched text from the $& variable.
And the example script can look like below:
use strict;
use warnings;
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
print "Source: $line\n";
if ($line =~ /(?<=\/)[^\/:]+(?=:)/) {
print "Rule: $&\n";
} else {
print "No match.\n";
}
The reason you are getting 1 is because you are evaluating the match in scalar context. For the match to return the captures, it needs to be evaluated in list context.
The scalar assignment operator you used evaluates its right-hand operand in scalar context, whereas the list assignment operator evaluates its operands in list context. You can cause the list assignment operator to be used by replacing my $rule with my ($rule).
my ($rule) = $line =~ /\/(.*)\:/;
See Why are there parentheses around scalar when assigning the return value of regex match in this Perl snippet?.
Furthermore, the match operator will grab more than desired. You can address that by replacing
/\/(.*)\:/
with
/\/([^\/]*)\:/
I would write that as follows:
m{/([^/]*):}
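The two separate issues in this answer (context and greediness) can be seen side by side in a short sketch. The variable names $flag and $too_much are only illustrative:

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Scalar assignment: the match returns true/false (1 on success).
my $flag = $line =~ m{/([^/]*):};

# List assignment: the match returns the list of captures.
my ($rule) = $line =~ m{/([^/]*):};

# Greedy .* with list assignment still starts at the FIRST slash.
my ($too_much) = $line =~ m{/(.*):};

print "flag=$flag\n";
print "rule=$rule\n";
print "too_much=$too_much\n";
```

Only the combination of list assignment and the negated class [^/] yields VLG6.3.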
To capture a string between two characters, capture everything that is not the two characters.
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
my ( $rule ) = $line =~ /\/([^\/:]*):/;
print "$rule\n";
PS: Capturing content between two strings (rather than two characters) involves skipping earlier occurrences of the starting string.
my $line = 'begin not this begin or this begin wanted end not this end or this end';
my ( $rule ) = $line =~ m{ (?: begin .* )? begin (.*?) end }msx;
print "$rule\n";

Perl: remove a part of string after pattern

I have strings like this:
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
And I want to remove everything after the first number, like this:
trn_425374
trn_12
trn_2003
I tried the following code:
$string =~ s/(?<=trn_\d)\d+//gi;
But returns the same as the input. I have been following examples of similar questions but I don't know what I'm doing wrong. Any suggestion?
If you are running Perl 5 version 10 or later then you have access to the \K ("keep") regular expression escape. Everything matched before the \K is excluded from the substitution, so this removes everything after the first sequence of digits (except newlines):
s/\d+\K.+//;
With earlier versions of Perl, you will have to capture the part of the string you want to keep and put it back in the replacement:
s/(\D*\d+).+/$1/;
Note that neither of these will remove any trailing newline characters. If you want to strip those as well, then either chomp the string first, or add the /s modifier to the substitution, like this
s/\d+\K.+//s;
or
s/(\D*\d+).+/$1/s;
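A quick way to see the newline behaviour is to run the variants over strings that still carry their line ending. This is a sketch assuming Perl 5.10 or later (for \K); the array names are made up:

```perl
use strict;
use warnings;

# Sample lines as they might come from a file, newline included.
my @ids = ("trn_425374_1_94_-\n", "trn_12_1_200_+\n", "trn_2003_2_198_+\n");

my (@kept_nl, @stripped_nl, @captured);
for my $id (@ids) {
    (my $kept     = $id) =~ s/\d+\K.+//;        # '.' stops at the newline
    (my $stripped = $id) =~ s/\d+\K.+//s;       # /s lets '.' consume it too
    (my $capture  = $id) =~ s/(\D*\d+).+/$1/;   # capture-based equivalent
    push @kept_nl,     $kept;
    push @stripped_nl, $stripped;
    push @captured,    $capture;
}

print map({ "[$_]" } @stripped_nl), "\n";
```

Copying into a fresh variable with (my $x = $id) =~ s/.../.../ leaves the original strings untouched, which keeps the three variants comparable.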
Use a capturing group to save the first run of digits and use .* to delete everything from there to the end of the line:
#!/usr/bin/env perl
use warnings;
use strict;
while ( <DATA> ) {
s/(\d+).*$/$1/ && print;
}
__DATA__
trn_425374_1_94_-
trn_12_1_200_+
trn_2003_2_198_+
It yields:
trn_425374
trn_12
trn_2003
Your regex should be:
$string =~ s/(trn_\d+).*/$1/g;
It replaces the whole match with the text captured in $1 (the part of the string you want to preserve).
Use \K to preserve the part of the string you want to keep:
$string =~ s/trn_\d+\K.*//;
To quote the Perl documentation on \K:
\K
This appeared in perl 5.10.0. Anything matched left of \K is not included in $&, and will not be replaced if the pattern is used in a substitution.

What does this Perl regex do?

What does the following snippet do?
if ($str =~ /^:(\w+)/) {
$hash{$1} = 1;
}
It uses the first successful capture as key in the hash. And the $str has to contain one or more words but I am not sure what the ^: means
^ start at beginning of string
: match a literal colon
( capture the following string
\w+ matching one or more word characters (letters, digits, or underscore)
) end capture
The capture is stored in $1, which then becomes a key in the hash %hash below.
So if you have the string :foo, you will match foo, and get $hash{foo} = 1. The purpose of this code is no doubt to extract certain strings and dedupe them using a hash.
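As an illustration of the dedupe idea, the snippet can be wrapped in a loop over a few lines. The input list below is invented for the example:

```perl
use strict;
use warnings;

# Hypothetical input; only lines with a word right after a leading ':' count.
my @lines = (':foo', ':bar baz', 'foo', ':foo again', ': spaced');

my %hash;
for my $str (@lines) {
    if ($str =~ /^:(\w+)/) {
        $hash{$1} = 1;    # a repeated key simply overwrites the same entry
    }
}

print join(',', sort keys %hash), "\n";
```

The plain foo line is ignored because it lacks the colon, and ': spaced' is ignored because \w+ must start immediately after the colon.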
^: means a ":" symbol at the start of the line. Also, it will capture only a single "word" after the :
It means: at the beginning of the line (^), a literal :
e.g:
$string = q~:Thats~;
$hash{Thats} = 1;
$string2 = q~Thats~;
The if statement succeeds for $string, but it fails for $string2 because that string does not start with a :.
You said:
'And the $str has to contain one or more words ...'
I'm not sure if this is simply a typo or if your intention is different from your small example. Now (according to your post), your regex would match a string like: :Hello. In Perl, this could be written also like
my %hash = ();
my $str = ':Hello';
$hash{ $1 }++ if $str =~ /^:(\w+)/;
Now, suppose you want the word to be preceded by either 'start of string' ^ OR a colon :. Note that [:^] does not express this: inside a character class, a ^ that is not the first character is a literal caret, so [:^] matches a colon or a caret and the leading Hello would be skipped. The alternation (?:^|:) does what is intended, and the regex can then match lines like: 'Hello:World:Perl:Script'; (maybe this was the real intention).
Such a string could then be dissected in a while loop:
$hash{ $1 }++ while $str =~ /(?:^|:)(\w+)/g;
If you print the captured keys: print "@{[keys %hash]}";
the result would be: Perl Script Hello World (the order of the keys is undefined).
These kinds of strings are widespread in the Unix world, e.g. the environment variables PATH and LD_LIBRARY_PATH; the file /etc/passwd looks like that too.
BTW, this is only an idea - IF your typo wasn't really one ;-)
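One detail worth checking when anchoring with "start of string OR colon" is how ^ behaves inside square brackets: unless it is the first character of the class, it is a literal caret. A short sketch comparing a character class with an alternation (array names are illustrative):

```perl
use strict;
use warnings;

my $str = 'Hello:World:Perl:Script';

# [:^] is a character class: a colon or a LITERAL caret, not an anchor.
my @class = $str =~ /[:^](\w+)/g;

# (?:^|:) is an alternation: start of string OR a colon.
my @alt = $str =~ /(?:^|:)(\w+)/g;

print "class: @class\n";
print "alt:   @alt\n";
```

The class form misses the first field because nothing precedes Hello, while the alternation picks up all four.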

What do these regular expressions mean?

I'm venturing to read a code in Perl and found the following regular expressions:
$str =~ s/(<.+?>)|(&\w+;)/ /gis;
$str =~ /(\w+)/gis
I wonder what these codes represent.
Can anyone help me?
The first one $str =~ s/(<.+?>)|(&\w+;)/ /gis; does a substitution:
$str : the variable to work on
=~ : binding operator; applies the substitution to $str and modifies it in place
s : substitution operator
/ : beginning of the regex
( : beginning of capture group 1
< : <
.+? : one or more of any char, NOT greedy
> : >
) : end of capture group 1
| : alternation
( : beginning of capture group 2
& : &
\w+ : one or more word chars, i.e. [a-zA-Z0-9_]
; : ;
) : end of capture group 2
/ : end of search part
(space) : a space
/ : end of replace part
gis; : global, case insensitive, single-line (. also matches newlines)
This will replace all tags and encoded entities like &amp; or &lt; with a space.
The second one checks that at least one word is left.
One way to help decipher regular expressions is to use the YAPE::Regex::Explain module from CPAN:
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this snippet is named 'rexplain' you would do:
$ ./rexplain 's/(<.+?>)|(&\w+;)/ /gis'
The first strips out every XML/HTML tag and every character entity, replacing each one with a space. The second finds every substring consisting entirely of word characters.
In detail:
The first part of the first expression first matches a <, then any character with the . (newlines included thanks to the /s flag at the end). The + quantifier would match one or more characters up until the last > found in $str, but the ? after it makes it not greedy, so it only matches up to the first > encountered. The second part matches & followed by one or more word characters and then a ;. Since ; is not a word character, the ? modifier is not needed. The s/ up front means a substitution, and the bit after the second / is what any match is replaced with. The /gis at the end means *g*lobal, case *i*nsensitive, and *s*ingle line.
The second expression finds the first substring of word characters and puts it in $1. If you call it repeatedly in scalar context, the /g at the end means that each call continues from where the previous match ended, so it will step through every instance in $str.
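Putting the two expressions together, a sketch of how they might be used in sequence follows. The HTML fragment is invented for the demonstration, and the while loop is one assumed way of "calling it repeatedly":

```perl
use strict;
use warnings;

# Hypothetical HTML fragment: one tag spans a newline, plus two entities.
my $str = "<p>Tom &amp; Jerry<br\n/>run&nbsp;fast</p>";

# Every tag and every entity becomes a single space.
(my $clean = $str) =~ s/(<.+?>)|(&\w+;)/ /gis;

# In scalar context, /g resumes after the previous match on each iteration.
my @words;
while ($clean =~ /(\w+)/g) {
    push @words, $1;
}

print join('|', @words), "\n";
```

Without /s the tag that contains a newline would not be removed, because . would refuse to cross the line break.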
The first one takes a string and replaces html tags or html character codes with a space
The second one makes sure there is still a word left when done.
These "codes" are regular expressions. Type this to learn more:
perldoc perlre
The code above replaces with blanks some HTML/XML tags and some HTML character entities (anything of the form &name;) in $str. But there are better ways to do this using CPAN modules. The code then tries to match and capture into variable $1 the first word in $str.
Ex:
perl -le '$str = "foo<br> bar<another\ntag>baz"; print $str; $str =~ s/(<.+?>)|(&\w+;)/ /gis; $str =~ /(\w+)/gis; print $str; print $1;'
It prints:
foo<br> bar<another
tag>baz
foo  bar baz
foo

How to split a string with multiple patterns in perl?

I want to split a string with multiple patterns:
ex.
my $string= "10:10:10, 12/1/2011";
my @string = split(/firstpattern/secondpattern/thirdpattern/, $string);
foreach(@string) {
print "$_\n";
}
I want to have an output of:
10
10
10
12
1
2011
What is the proper way to do this?
Use a character class in the regex delimiter to match on a set of possible delimiters.
my $string= "10:10:10, 12/1/2011";
my @string = split /[:,\s\/]+/, $string;
foreach(@string) {
print "$_\n";
}
Explanation
The pair of slashes /.../ denotes the regular expression or pattern to be matched.
The pair of square brackets [...] denotes the character class of the regex.
Inside is the set of possible characters that can be matched: colons :, commas ,, any type of space character \s, and forward slashes \/ (with the backslash as an escape character).
The + is needed to match on 1 or more of the character immediately preceding it, which is the entire character class in this case. Without this, the comma-space would be considered as 2 separate delimiters, giving you an additional empty string in the result.
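The effect of the + quantifier is easy to check by splitting the same string both ways. A minimal sketch (array names are illustrative):

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

# With +, the comma and the space that follows count as ONE delimiter.
my @with_plus = split /[:,\s\/]+/, $string;

# Without +, they are two delimiters with an empty field between them.
my @without_plus = split /[:,\s\/]/, $string;

print scalar(@with_plus), " fields with +\n";
print scalar(@without_plus), " fields without +\n";
print "empty field at index 3\n" if $without_plus[3] eq '';
```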
Wrong tool!
my $string = "10:10:10, 12/1/2011";
my @fields = $string =~ /([0-9]+)/g;
You can split on non-digits:
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
my $string= "10:10:10, 12/1/2011";
say for split /\D+/, $string;
Alternatively, group the patterns in a non-capturing alternation:
my $string= "10:10:10, 12/1/2011";
# general form: split(m[(?:firstpattern|secondpattern|thirdpattern)+], $string)
my @string = split(m[(?:/| |,|:)+], $string);
print join "\n", @string;
To answer your original question:
you were looking for the | operator:
my $string = "10:10:10, 12/1/2011";
my @string = split(/:|,\s*|\//, $string);
foreach(@string) {
print "$_\n";
}
But, as the other answers point out, you can often improve on that with further simplifications or generalizations.
If numbers are what you want, extract numbers:
my @numbers = $string =~ /\d+/g;
say for @numbers;
Capturing parentheses are not required, as specified in perlop:
The /g modifier specifies global pattern matching--that is, matching
as many times as possible within the string. How it behaves depends on
the context. In list context, it returns a list of the substrings
matched by any capturing parentheses in the regular expression. If
there are no parentheses, it returns a list of all the matched
strings, as if there were parentheses around the whole pattern.
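The quoted behaviour can be checked with three variants of the same match in list context. This is only a sketch; the third pattern with two groups is an added illustration, not something from the question:

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

# No parentheses: the list holds each whole match.
my @plain = $string =~ /\d+/g;

# One capturing group around the whole pattern: same list.
my @captured = $string =~ /(\d+)/g;

# Two groups: every match contributes BOTH captures to the list.
my @pairs = $string =~ /(\d+):(\d+)/g;

print "plain:    @plain\n";
print "captured: @captured\n";
print "pairs:    @pairs\n";
```

Note that the two-group pattern matches only once here, because each match consumes the digits on both sides of its colon.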
As you're parsing something that is rather obviously a date/time, I wonder if it would make more sense to use DateTime::Format::Strptime to parse it into a DateTime object.