How to split a string with multiple patterns in Perl?

I want to split a string with multiple patterns:
ex.
my $string= "10:10:10, 12/1/2011";
my @string = split(/firstpattern/secondpattern/thirdpattern/, $string);
foreach(@string) {
print "$_\n";
}
I want to have an output of:
10
10
10
12
1
2011
What is the proper way to do this?

Use a character class in the regex delimiter to match on a set of possible delimiters.
my $string= "10:10:10, 12/1/2011";
my @string = split /[:,\s\/]+/, $string;
foreach(@string) {
print "$_\n";
}
Explanation
The pair of slashes /.../ denotes the regular expression or pattern to be matched.
The pair of square brackets [...] denotes the character class of the regex.
Inside is the set of possible characters that can be matched: colons :, commas ,, any type of space character \s, and forward slashes \/ (with the backslash as an escape character).
The + is needed to match on 1 or more of the character immediately preceding it, which is the entire character class in this case. Without this, the comma-space would be considered as 2 separate delimiters, giving you an additional empty string in the result.
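To see the effect of the + for yourself, here is a small sketch that splits the same string with and without it (the array names @with_plus and @without_plus are just mine, for illustration):

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

# With +, the comma and the following space count as one delimiter.
my @with_plus = split /[:,\s\/]+/, $string;

# Without +, the comma and the space are two delimiters,
# so an empty string appears between them.
my @without_plus = split /[:,\s\/]/, $string;

print scalar(@with_plus), " fields with +\n";
print scalar(@without_plus), " fields without +\n";
print "empty field at index 3\n" if $without_plus[3] eq '';
```

The first split returns 6 fields; the second returns 7, with the empty string sitting between the third 10 and the 12.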

Wrong tool!
my $string = "10:10:10, 12/1/2011";
my @fields = $string =~ /([0-9]+)/g;

You can split on non-digits:
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
my $string= "10:10:10, 12/1/2011";
say for split /\D+/, $string;

my $string= "10:10:10, 12/1/2011";
# General form: my @string = split(m[(?:firstpattern|secondpattern|thirdpattern)+], $string);
my @string = split(m[(?:/| |,|:)+], $string);
print join "\n", @string;

To answer your original question:
you were looking for the | operator:
my $string = "10:10:10, 12/1/2011";
my @string = split(/:|,\s*|\//, $string);
foreach(@string) {
print "$_\n";
}
But, as the other answers point out, you can often improve on that with further simplifications or generalizations.
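As a rough sanity check, the sketch below runs the approaches from the answers here side by side on the sample string. The array names are made up for the comparison; the patterns are the ones quoted in this thread.

```perl
use strict;
use warnings;

my $string = "10:10:10, 12/1/2011";

my @by_class       = split /[:,\s\/]+/, $string;   # character class
my @by_alternation = split /:|,\s*|\//, $string;   # | operator
my @by_nondigit    = split /\D+/, $string;         # split on non-digits
my @by_match       = $string =~ /\d+/g;            # extract instead of split

print "class:       @by_class\n";
print "alternation: @by_alternation\n";
print "non-digit:   @by_nondigit\n";
print "match:       @by_match\n";
```

For this particular input all four give the same six fields. They can diverge on other inputs: for example, split /\D+/ yields a leading empty field if the string starts with a non-digit, while the /\d+/g match does not.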

If numbers are what you want, extract numbers:
my @numbers = $string =~ /\d+/g;
say for @numbers;
Capturing parentheses are not required, as specified in perlop:
The /g modifier specifies global pattern matching--that is, matching
as many times as possible within the string. How it behaves depends on
the context. In list context, it returns a list of the substrings
matched by any capturing parentheses in the regular expression. If
there are no parentheses, it returns a list of all the matched
strings, as if there were parentheses around the whole pattern.
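A minimal sketch of the two cases the documentation describes. The input string and the names $config, @numbers and @pairs are hypothetical, chosen only to illustrate the behavior:

```perl
use strict;
use warnings;

# Hypothetical input, chosen only to illustrate the two cases.
my $config = "width=10, height=20";

# No capturing parentheses: each list element is one whole match.
my @numbers = $config =~ /\d+/g;

# With capturing parentheses: each match contributes its captures.
my @pairs = $config =~ /(\w+)=(\d+)/g;

print "numbers: @numbers\n";
print "pairs:   @pairs\n";
```

Here @numbers holds 10 and 20, while @pairs holds width, 10, height, 20, which is why the flat list is often assigned straight to a hash.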

As you're parsing something that is rather obviously a date/time, I wonder if it would make more sense to use DateTime::Format::Strptime to parse it into a DateTime object.

Related

Capture a substring between two characters?

I am trying to write a regex pattern which will capture a substring between two characters. The string is
default_checks/my_checks/VLG6.3: Unsupported system function call
I need to capture VLG6.3. It is between a slash / and a colon :.
I have tried these ideas
my $rule = $line =~ /\/(.*)\:/;
my $rule = $line =~ /\/(.+?)\:/ ;
my $rule = $line =~ /\/(\w+)\:/ ;
But none of them are working. In the best case I get my_checks/VLG6.3
Aside from the issue with assigning a list to a scalar, which ikegami has helpfully pointed out, the regex pattern can use some fixing.
The repeater * in regex is greedy. It gobbles up as many characters as it can as long as it matches. You need to let another repeater do the gobbling up front so that it only leaves just enough for the repeater you really want to match.
my ($rule) = $line =~ /.*\/(.*):/;
Alternatively, in this case you can just use an exclusion class instead of matching any characters.
my ($rule) = $line =~ /\/([^\/]*):/;
Both of the above will end up with $rule assigned with 'VLG6.3'.
You are interested in a non-empty string, meeting the following conditions:
It is preceded by a /.
It is followed by a colon.
It contains neither / nor a colon.
So the intuitive regex, without any capturing group is:
(?<=\/)[^\/:]+(?=:) (positive lookbehind, the actual content
and positive lookahead).
Using such a regex, you can:
Use the result of =~ operator only to check whether something has been
matched.
Print the matched text from $& variable.
And the example script can look like below:
use strict;
use warnings;
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
print "Source: $line\n";
if ($line =~ /(?<=\/)[^\/:]+(?=:)/) {
print "Rule: $&\n";
} else {
print "No match.\n";
}
The reason you are getting 1 is because you are evaluating the match in scalar context. For the match to return the captures, it needs to be evaluated in list context.
You need to evaluate the match in list context by evaluating the =~ in list context. Unlike the scalar assignment operator you used, the list assignment operator evaluates its operands in list context. You can cause the list assignment operator to be used by replacing my $rule with my ($rule).
my ($rule) = $line =~ /\/(.*)\:/;
See Why are there parentheses around scalar when assigning the return value of regex match in this Perl snippet?.
Furthermore, the match operator will grab more than desired. You can address that by replacing
/\/(.*)\:/
with
/\/([^\/]*)\:/
I would write that as follows:
m{/([^/]*):}
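Putting the two points together, a short sketch on the line from the question (the variable names $scalar_result and $greedy are mine):

```perl
use strict;
use warnings;

my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';

# Scalar assignment: the match is evaluated in scalar context
# and returns true (1) rather than the capture.
my $scalar_result = $line =~ /\/(.*)\:/;

# List assignment returns the capture, but .* is greedy and
# starts at the first slash.
my ($greedy) = $line =~ /\/(.*)\:/;

# Excluding slashes from the capture leaves only the last path component.
my ($rule) = $line =~ m{/([^/]*):};

print "scalar: $scalar_result\n";
print "greedy: $greedy\n";
print "rule:   $rule\n";
```

This prints 1, then my_checks/VLG6.3, then VLG6.3.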
To capture a string between two characters, capture everything that is not the two characters.
my $line = 'default_checks/my_checks/VLG6.3: Unsupported system function call';
my ( $rule ) = $line =~ /\/([^\/:]*):/;
print "$rule\n";
PS: To capture content between two string involves skipping sequences of the starting string.
my $line = 'begin not this begin or this begin wanted end not this end or this end';
my ( $rule ) = $line =~ m{ (?: begin .* )? begin (.*?) end }msx;
print "$rule\n";

How do I capture match-groups of alternation of a regular expression with split?

I have a string
my $foo = 'one#two#three!four#five#six';
from which I want to extract the parts that are separated by either a # or a !. This is easy enough with split:
my @parts = split /#|!/, $foo;
An additional requirement is that I also need to capture the exclamation marks. So I tried
my @parts = split /#|(!)/, $foo;
This however returns either an undef value or the exclamation mark (which is also clearly stated in the specification of split).
So, I weed out the unwanted undef values with grep:
my @parts = grep { defined } split /#|(!)/, $foo;
This does what I want.
Yet I was wondering if I can change the regular expression in a way so that I don't have to also invoke grep.
When you use split, you cannot omit the empty captures once a match is found (there are always as many captures in the match as there are capturing groups defined in the regular expression). You may use a matching approach here, though:
my @parts = $foo =~ /[^!#]+|!/g;
This way, you will match 1 or more chars other than ! and # (with [^!#]+ alternative), or an exclamation mark, multiple times (/g).
Use "empty string followed by an exclamation mark or empty string preceded by an exclamation mark" in place of your second alternative:
my @parts = split /#|(?=!)|(?<=!)/, $foo;
Demo: https://ideone.com/6pA1wx
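For comparison, here is a sketch that runs the grep version from the question and the two suggested alternatives on the same string. The array names are only for this comparison.

```perl
use strict;
use warnings;

my $foo = 'one#two#three!four#five#six';

# Original approach: the capture group yields undef for every '#',
# so the undefs are filtered out afterwards.
my @with_grep = grep { defined } split /#|(!)/, $foo;

# Matching approach: take runs of non-delimiters, or a lone '!'.
my @with_match = $foo =~ /[^!#]+|!/g;

# Lookaround approach: split at '#', and at the empty positions
# just before and just after a '!'.
my @with_lookaround = split /#|(?=!)|(?<=!)/, $foo;

print join(',', @with_grep), "\n";
print join(',', @with_match), "\n";
print join(',', @with_lookaround), "\n";
```

All three print one,two,three,!,four,five,six for this input.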

Split into words by an uncommented comma that is not inside matching parentheses

Consider the following string:
blah, foo(a,b), bar(c,d), yo
I want to extract a list of strings:
blah
foo(a,b)
bar(c,d)
yo
It seems to me that I should be able to use quote words here, but I'm struggling with the regex. Can someone help me out?
Perl has a little thing called regex recursion, so you might be able to look for:
either a bare word like blah containing no parentheses (\w+)
a "call", like \w+\((?R)(, *(?R))*\)
The total regex is (\w+(\((?R)(, ?(?R))*\))?), which seems to work.
You can use the following regex to use in split:
\([^()]*\)(*SKIP)(*F)|\s*,\s*
With \([^()]*\), we match a ( followed with 0 or more characters other than ( or ) and then followed with ). We fail the match with (*SKIP)(*F) if that parenthetical construction is found, and then we only match the comma surrounded with optional whitespaces.
#!/usr/bin/perl
my $string= "blah, foo(a,b), bar(c,d), yo";
my @string = split /\([^()]*\)(*SKIP)(*F)|\s*,\s*/, $string;
foreach(@string) {
print "$_\n";
}
To account for commas inside nested balanced parentheses, you can use
my @string = split /\((?>[^()]|(?R))*\)(*SKIP)(*F)|\s*,\s*/, $string;
With \((?>[^()]|(?R))*\) we match all balanced ()s and fail the match if found with the verbs (*SKIP)(*F), and then we match a comma with optional whitespace around (so as not to manually trim the strings later).
For a blah, foo(b, (a,b)), bar(c,d), yo string, the result is:
blah
foo(b, (a,b))
bar(c,d)
yo
There is a solution given by Borodin for one of your other questions (which is similar to this one). A small change to the regex will give you the desired output (this will not work for nested parentheses):
use strict;
use warnings;
use 5.010;
my $line = q<blah, foo(a,b), bar(c,d), yo>;
my @words = $line =~ / (?: \([^)]*\) | [^,] )+ /xg;
s/^\s+// for @words;    # [^,] also matches the space after each comma
say for @words;
Output:
blah
foo(a,b)
bar(c,d)
yo

regular expressions in perl for extracting information

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?
Use the /s switch so that the any character . will match new lines.
I also suggest that you use non greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
Your current regular expression does not do what you may think, by placing those characters inside of a character class; it matches any character of: ( ., *, \n, * ) instead of actually matching what you want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;
Please try this :
begin parameters([\S\s]+?)end parameters
Translation: [\S\s] matches any character that is whitespace or any character that is not whitespace (so it actually matches any character, newlines included), and the lazy +? stops at the first "end parameters".
I hope this is what you expect.
The meta-character . loses its special properties inside of a character class, and so does the quantifier *.
So [.*\n*] actually matches exactly one character that is a literal period, an asterisk, or a newline.
What you actually want is to match 0 or more of any character, newlines included. You can represent that with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
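To check the difference, here is a small sketch with a made-up two-line parameter block (the sample text and the names $text, $class_matches, $body and $alt are assumptions for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the real document.
my $text = "begin parameters\nalpha = 1\nbeta = 2\nend parameters\n";

# The character class consumes exactly one '.', '*' or newline,
# so it cannot span the lines in between.
my $class_matches = $text =~ /begin parameters[.*\n*]end parameters/ ? 1 : 0;

# With /s the dot also matches newlines; *? keeps the match lazy.
my ($body) = $text =~ /begin parameters(.*?)end parameters/s;

# Same idea without /s, using the (?:.|\n) alternation.
my ($alt) = $text =~ /begin parameters((?:.|\n)*?)end parameters/;

print "character class matched: $class_matches\n";
print "captured with /s: '$body'\n";
print "both captures agree\n" if $body eq $alt;
```

The character-class version does not match at all, while the other two capture everything between the markers, including the surrounding newlines.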

What does this Perl regex do?

What does the following snippet do?
if ($str =~ /^:(\w+)/) {
$hash{$1} = 1;
}
It uses the first successful capture as key in the hash. And the $str has to contain one or more words but I am not sure what the ^: means
^ start at beginning of string
: match a literal colon
( capture the following string
\w+ matching one or more word characters (alphanumerics or underscore)
) end capture
The capture is stored in $1, which then becomes a key in the hash %hash below.
So if you have the string :foo, you will match foo, and get $hash{foo} = 1. The purpose of this code is no doubt to extract certain strings and dedupe them using a hash.
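A small sketch of that dedupe behavior, using a few made-up input strings (the list of strings and the name @keys are mine, not from the question):

```perl
use strict;
use warnings;

my %hash;

# Hypothetical inputs: two start with ":foo", one has no leading colon.
for my $str (':foo bar', ':baz', 'no colon', ':foo again') {
    if ($str =~ /^:(\w+)/) {
        $hash{$1} = 1;    # the same key is simply set again
    }
}

my @keys = sort keys %hash;
print "@keys\n";
```

Only baz and foo end up as keys: the string without a leading colon is skipped, and the repeated :foo collapses into a single entry.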
^: means a ":" symbol at the start of the line. Also, it will capture only a single "word" after the :
It means: at the beginning of the line (^), a colon (:).
e.g.:
$string = q~:Thats~;
$hash{Thats} = 1;
$string2 = q~Thats~;
The if statement succeeds for $string but fails for $string2, because $string2 does not start with a :.
You said:
'And the $str has to contain one or more words ...'
I'm not sure if this is simply a typo or if your intention is different from your small example. Now (according to your post), your regex would match a string like: :Hello. In Perl, this could be written also like
my %hash = ();
my $str = ':Hello';
$hash{ $1 }++ if $str =~ /^:(\w+)/;
Now, if you changed ^: in your regex to (?:^|:), which means that the word must be preceded by either the start of the string ^ OR a colon :, your regex could match every word in a line like 'Hello:World:Perl:Script' (maybe this was the real intention). Note that [:^] would not do this: inside a character class, a ^ that is not the first character is just a literal caret.
Such a string could then be dissected in a while loop:
$hash{ $1 }++ while $str =~ /(?:^|:)(\w+)/g;
If you print the captured keys: print "@{[keys %hash]}";
the result would be: Perl Script Hello World (the order of the keys is undefined).
These kinds of strings are widespread in the Unix world, e.g. the environment variables PATH and LD_LIBRARY_PATH; the file /etc/passwd looks like that too.
BTW, this is only an idea - IF your typo wasn't really one ;-)