Perl split by regexp issue - regex

I'm writing a parser in Perl and have run into a problem with split. Here is my code:
my $str = 'a,b,"c,d",e';
my @arr = split(/,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
# try to split the string on commas, but only where the comma is followed by zero or an even number of quotes
foreach my $val (@arr) {
print "$val\n"
}
I'm expecting the following:
a
b
"c,d"
e
But this is what I actually get:
a
b,"c,d"
b
"c,d"
"c,d"
e
I see that the parts I want are in the array at indices 0, 2, 4 and 6. But how do I avoid the odd b,"c,d" and the other extra elements in the resulting array? Is there an error in my regexp delimiter, or is there some special split option?

You need to use a non-capturing group:
my @arr = split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
                      ^^
Otherwise, the captured texts are output as part of the resulting array.
See perldoc reference:
If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
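A minimal sketch of the difference, using only core Perl. The variable names are illustrative and not taken from the question; the only change between the two patterns is `(...)` versus `(?:...)` inside the lookahead.

```perl
use strict;
use warnings;

my $str = 'a,b,"c,d",e';

# Capturing group inside the lookahead: split also returns one extra
# element per separator, holding whatever the group captured (or undef).
my @with_capture = split /,(?=([^"]*"[^"]*")*[^"]*$)/, $str;

# Non-capturing group: split returns only the fields.
my @without_capture = split /,(?=(?:[^"]*"[^"]*")*[^"]*$)/, $str;

print scalar(@with_capture), " elements with a capturing group\n";
print scalar(@without_capture), " elements with a non-capturing group\n";
print "$_\n" for @without_capture;
```

With three separators and one capturing group, the first list carries three extra elements on top of the four fields.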

What's tripping you up is a feature of split: if the pattern contains a capturing group, split returns the captured text as well.
But rather than using split I would suggest the Text::CSV module, that already handles quoting for you:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new();
my $fields = $csv->getline( \*DATA );
print join "\n", @$fields;
__DATA__
a,b,"c,d",e
Prints:
a
b
c,d
e
My reasoning is fairly simple: you're doing quote matching, and once you have things like quoted or escaped quotes you're effectively trying to do a recursive parse, which is something regex simply isn't well suited to.

You can use parse_line() from Text::ParseWords, if you are not bound to using a regex:
use Text::ParseWords;
my $str = 'a,b,"c,d",e';
my @arr = parse_line(',', 1, $str);
foreach (@arr)
{
print "$_\n";
}
Output:
a
b
"c,d"
e

Do matching instead of splitting.
use strict; use warnings;
my $str = 'a,b,"c,d",e';
my @matches = $str =~ /"[^"]*"|[^,]+/g;
foreach my $val (@matches) {
print "$val\n"
}

Related

Perl: regex for conditional replace?

in this string
ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<
I want to replace each substring between XY and < by either ONE or TWO depending on characters between previous brackets:
if XY after (CN) replace substring by ONE
if XY after (CI) replace substring by TWO
So the result should be:
ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<
XY and following characters should be replaced but not angle bracket <.
This is for modifying HTML and arbitrary characters can occur between XY and <.
I guess I need two regex for (CN) and (CI).
# This one replaces just all XY:
my $s = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
$s =~ s/(XY(.*?))</ONE/g;
# But how to add the conditions to the regex?
You don't need two regexes. Capture the C[NI] and retrieve the corresponding replacement value from a hash:
#!/usr/bin/perl
use warnings;
use strict;
my $s = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
my %replace = (CN => 'ONE', CI => 'TWO');
$s =~ s/(\((C[NI])\).*?)XY.*?</$1$replace{$2}</g;
my $exp = 'ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<';
use Test::More tests => 1;
is $s, $exp;
My guess is that this expression or maybe a modified version of that might work, not sure though:
([a-z]{2}<\([A-Z]{2}\)[a-z]{2})([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)(<\([A-Z]{2}\)[a-z])([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)<
Test
use strict;
use warnings;
my $str = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
my $regex = qr/([a-z]{2}<\([A-Z]{2}\)[a-z]{2})([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)(<\([A-Z]{2}\)[a-z])([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)</mp;
my $subst = '"$1ONE$3TWO$5ONE$7TWO<"';
my $result = $str =~ s/$regex/$subst/rgee;
print $result;
This can be done with a single regex, using the /e modifier and the ternary operator ?: in the replacement part.
The /r modifier (Perl 5.14 and later) makes s/// return the resulting string and leaves the original string $s unmodified.
use strict;
use warnings;
my $s ='ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
print( ($s =~ s/\(([^)]+)\)([^(]+)XY[^(]+</"($1)$2" . ($1 eq 'CN' ? 'ONE' : 'TWO') . "<"/gre) . "\n" ); # strings quoted: barewords are not allowed under strict
Output:
ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<

Perl: how do you assign a variable to a regex match result

How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my @Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push @Qmail, $msg->{Body};
}
}
for(@Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use the Perl variables below to meet your requirements:
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = The string matched by the last pattern match.
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $& variable (or $MATCH if you load the English module). This is covered in the manual page perlvar.
So say you wanted to store the matches from your last for loop in an array called @matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my @matches = ();
foreach (@Qmail) {
next unless /$regex|^\s*description/i;
push @matches, $&; # $MATCH would work here too, but only with "use English"
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regex to the string ^\s*owner #.
Assign $sentence the result of running a match against $regex with the regular expression /^s*owner $/ (which won't match; if it did, $sentence would be 1, but since it didn't, it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have an older Perl on your system like me (5.18 or earlier) and you use $`, $& and $' as in codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.
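A short sketch of the /p approach, reusing the 'abcdefghi' example from the earlier answer. The variable names $before, $matched and $after are illustrative; ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} are available from Perl 5.10 on.

```perl
use strict;
use warnings;

my $text = 'abcdefghi';

my ($before, $matched, $after) = ('', '', '');
if ($text =~ /def/p) {
    # With /p these three variables are filled in for this match,
    # so $`, $& and $' never need to appear in the program.
    ($before, $matched, $after) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}
print "$before:$matched:$after\n";   # prints abc:def:ghi
```

Copying the values into lexical variables inside the if block also answers the original question: they stay usable after the block ends.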

Perl regex return matches from substitution

I am trying to simultaneously remove and store (into an array) all matches of some regex in a string.
To return matches from a string into an array, you could use
my @matches = $string=~/$pattern/g;
I would like to use a similar pattern for a substitution regex. Of course, one option is:
my @matches = $string=~/$pattern/g;
$string =~ s/$pattern//g;
But is there really no way to do this without running the regex engine over the full string twice? Something like
my @matches = $string=~s/$pattern//g
Except that this will only return the number of substitutions, regardless of list context. I would also take, as a consolation prize, a method to use qr// where I could simply turn the quoted regex into a substitution regex, but I don't know if that's possible either (and that wouldn't preclude searching the same string twice).
Perhaps the following will be helpful:
use warnings;
use strict;
my $string = 'I thistle thing am thinking this Thistle a changed thirsty string.';
my $pattern = '\b[Tt]hi\S+\b';
my @matches;
$string =~ s/($pattern)/push @matches, $1; ''/ge;
print "New string: $string; Removed: @matches\n";
Output:
New string: I am a changed string.; Removed: thistle thing thinking this Thistle thirsty
Here is another way to do it without executing Perl code inside the substitution. The trick is that s/// without /g removes one match at a time and returns false once nothing matches, which ends the while loop.
use strict;
use warnings;
use Data::Dump;
my $string = "The example Kenosis came up with was way better than mine.";
my @matches;
push @matches, $1 while $string =~ s/(\b\w{4}\b)\s//;
dd @matches, $string;
__END__
(
"came",
"with",
"than",
"The example Kenosis up was way better mine.",
)

Regular expression match count in Perl

I am matching a string of the form A<=>B!C<=>D!E<=>F... and want to do checks on the letters. Basically I want to tell if the letters are in the class according to a hash I have defined. I had the idea of doing the following regex and then looping through the matched strings:
$a =~ /(.)<=>(.)/g;
But I can't figure out how to tell how many $1, $2 variables have matched. How do I know how many there are? Also, is there a better way to do this? I am using Perl 5.8.8.
You'll want the 'countof' idiom. Note that with capturing groups in the pattern it counts the captured substrings, so each match of this pattern contributes two:
my $count = () = $string =~ /(.)<=>(.)/g;
Replacing the empty list with an array will retain the captures:
my @matches = $string =~ /(.)<=>(.)/g;
Which provides another way to get the $count:
my $count = @matches; # scalar @matches works too
Use a while loop
use warnings;
use strict;
my %letters = map { $_ => 1 } qw(A C F);
my $s = 'A<=>B!C<=>D!E<=>F';
while ($s =~ /(.)<=>(.)/g) {
print "$1\n" if exists $letters{$1};
print "$2\n" if exists $letters{$2};
}
__END__
A
C
F
Create a variable and increment it each time you go through your loop?
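A minimal sketch of that suggestion, built on the while loop from the previous answer. The counter names $pairs and $in_class are illustrative, and %letters is the same example class used above.

```perl
use strict;
use warnings;

my %letters = map { $_ => 1 } qw(A C F);   # example class, as in the answer above
my $s = 'A<=>B!C<=>D!E<=>F';

my $pairs    = 0;   # number of X<=>Y pairs matched
my $in_class = 0;   # number of captured letters found in %letters
while ($s =~ /(.)<=>(.)/g) {
    $pairs++;
    # grep in scalar context returns how many of the two captures are in the class
    $in_class += grep { exists $letters{$_} } $1, $2;
}
print "$pairs pairs, $in_class letters in the class\n";
```

Counting inside the loop keeps the number of matches separate from the number of captures, which the list-assignment idiom does not.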

How can I split a string by whitespace unless inside of a single quoted string?

I'm seeking a solution to splitting a string which contains text in the following format:
"abcd efgh 'ijklm no pqrs' tuv"
which will produce the following results:
['abcd', 'efgh', 'ijklm no pqrs', 'tuv']
In other words, it splits by whitespace unless inside of a single quoted string. I think it could be done with .NET regexps using "Lookaround" operators, particularly balancing operators. I'm not so sure about Perl.
Use Text::ParseWords:
#!/usr/bin/perl
use strict; use warnings;
use Text::ParseWords;
my @words = parse_line('\s+', 0, "abcd efgh 'ijklm no pqrs' tuv");
use Data::Dumper;
print Dumper \@words;
Output:
C:\Temp> ff
$VAR1 = [
'abcd',
'efgh',
'ijklm no pqrs',
'tuv'
];
You can look at the source code for Text::ParseWords::parse_line to see the pattern used.
use strict; use warnings;
my $text = "abcd efgh 'ijklm no pqrs' tuv 'xwyz 1234 9999' 'blah'";
my @out;
my @parts = split /'/, $text;
for ( my $i = 1; $i < $#parts; $i += 2 ) {
push @out, split( /\s+/, $parts[$i - 1] ), $parts[$i];
}
push @out, $parts[-1];
use Data::Dumper;
print Dumper \@out;
So you've decided to use a regex? Now you have two problems.
Allow me to infer a little bit. You want an arbitrary number of fields, where a field is composed of text without containing a space, or it is separated by spaces and begins with a quote and ends with a quote (possibly with spaces inbetween).
In other words, you want to do what a command line shell does. You really should just reuse something. Failing that, you should capture a field at a time, with a regex something like:
^ *('[^']*'|[^ ]+)(.*)
Where you append group one to your list, and continue the loop with the contents of group 2.
A single pass through a regex wouldn't be able to capture an arbitrarily large number of fields. You might be able to split on a regex (Python will do this, not sure about Perl), but since you are matching the stuff outside the spaces, I'm not sure that is even an option.
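The field-at-a-time loop described above could be sketched in Perl roughly as follows. The sub name split_fields is hypothetical, and the quoted alternative is tried first in the pattern because `[^ ]+` would otherwise also accept a field that starts with a quote. It assumes fields are separated by plain spaces and that quotes are not nested or escaped.

```perl
use strict;
use warnings;

# Peel one field off the front of the string per iteration.
sub split_fields {
    my ($rest) = @_;
    my @fields;
    # Group 1 is the next field, group 2 is whatever remains after it.
    while ($rest =~ /^ *('[^']*'|[^ ]+)(.*)/s) {
        my $field = $1;
        $rest = $2;
        $field =~ s/^'(.*)'$/$1/s;   # drop the surrounding quotes, if any
        push @fields, $field;
    }
    return @fields;
}

my @fields = split_fields("abcd efgh 'ijklm no pqrs' tuv");
print "[$_]\n" for @fields;
```

The loop stops as soon as the remainder is empty or contains only spaces, since the pattern needs at least one field character to match.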