Regex matching characters of one string in another in any order - regex

Consider the following string wizard. I want to find if it is in another string in any order and in any case.
I tried the following
while(<>){if($_=~/(?:([wizard])(?!.*\1)){6}/i){print"0"}else{print"1"}}
For the inputs
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
It printed 111111 but it must have been 111011.
So, I tried this instead (for the same inputs)
while(<>){if($_=~/(?=[wizard]{6})(?!.*(.).*\1).*/i){print"0"}else{print"1"}}
It again printed 111111.
In input number 4, we can make WaDriaz but only one a is needed. Anyway, we can spell wizard by rearranging and in any case. Why is it not working?
What is wrong with my code?

The following should be quite fast (especially if you inline the subs):
use feature qw( fc say );
sub make_key {
my %counts;
++$counts{$_} for split //, fc($_[0]) =~ s/\PL//rg;
return \%counts;
}
sub search {
my ($substr, $str) = #_;
$str = make_key($str);
no warnings qw( uninitialized );
return !( grep { $str->{$_} < $substr->{$_} } keys(%$substr) );
}
my $substr = make_key("wizard");
while (<>) {
chomp;
say search($substr, $_) ? 0 : 1;
}
Unlike virtually all of the previous solutions, this one doesn't consider latte to be in late.
The following is a regex based solution (with some prep). This should also be quite fast (especially if you inline the subs).
use feature qw( fc say );
sub make_re {
my $pat = join ".*?", map quotemeta, sort split //, fc($_[0]) =~ s/\PL//rg;
return qr/$pat/s;
}
sub search {
my ($substr, $str) = #_;
return ( join "", sort split //, $str ) =~ $substr;
}
my $substr = make_re("wizard"); # qr/a.*?d.*?i.*?r.*?w.*?z/is
while (<>) {
chomp;
say search($substr, $_) ? 0 : 1;
}
Finally, a purely regexp-based solution.
use feature qw( fc say );
sub make_re {
my %counts;
++$counts{$_} for split //, fc($_[0]) =~ s/\PL//rg;
my $pat =
join "",
map { "(?=".( ( ".*?" . quotemeta($_) ) x $counts{$_} ).")" }
#sort
keys(%counts);
return qr/^$pat/s;
}
my $re = make_re("wizard"); # qr/^(?=.*?a)(?=.*?d)(?=.*?i)(?=.*?r)(?=.*?w)(?=.*?z)/is
while (<>) {
say /$re/ ? 0 : 1;
}
Unlike virtually all of the previous solutions, none of mine consider latte to be in late.

Here is a pure regex: do a positive lookahead for each character
use warnings;
use strict;
use feature 'say';
use List::Util qw(uniq); # before v. 1.45 in List::MoreUtils
my $string = shift // q(wizard);
my $patt = join '', map { qq{(?=.*\Q$_\E)} } uniq split //, $string;
# say $patt;
#--> (?=.*w)(?=.*i)(?=.*z)(?=.*a)(?=.*r)(?=.*d) (for wizard)
while (<DATA>) {
say "Found '$string' in: $_" if /^$patt/is;
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Being all in one regex, with an anchored lookahead and no overhead, this should be very fast.
The \Q...\E are there in case the search string contains regex-sensitive characters.
This code considers words with repeated characters (latte, rare) to "fit" a word without that (later). It is clarified in comments that this is indeed the wanted behavior: repeated characters need only be found once in the target (letter matches later etc).

I find canonicalization of the inputs and the pattern to be a more generalizable and understandable approach:
#!/usr/bin/env perl
use strict;
use warnings;
sub canonchars {
my %c;
$c{$_} = undef for map lc, grep /\S/, split //, $_[0];
sort keys %c;
}
sub pattern {
map "$_.*", canonchars($_[0]);
}
my %canonical;
while (my $line = <DATA>) {
last unless $line =~ /\S/;
push $canonical{join '', canonchars($line)}->#*, $line;
}
my $pat = qr/#{[join '', pattern('wizard')]}/;
for my $k (keys %canonical) {
if ($k =~ $pat) {
print for $canonical{$k}->#*;
}
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Output:
C:\Temp> perl t.pl
Wayne Drimaz
There is a lot of logic you are trying to fit into a regular expression pattern, and every edge case you find and fix will make it more complicated and more fragile.

It's just a simple matter of using a positive lookahead for every character.
my #stars = (
'Garry Kasparov',
'Bobby Fischer',
'Vladimir Kramnik',
'Wayne Drimaz',
'Lionel Messi',
'La Signora'
);
say /^(?=.*w)(?=.*i)(?=.*z)(?=.*a)(?=.*r)(?=.*d)/i ? 0 : 1 for #stars;
This outputs 111011.

No need for regular expressions... they just complicate things, especially if you're not looking for a string you know ahead of time. Just look for each character in turn after normalizing their cases.
#!/usr/bin/env perl
use strict;
use warnings;
sub contains_chars {
my ($needle, $haystack) = #_;
$haystack = lc $haystack;
my %positions;
for my $char (split //, lc $needle) {
my $p = index $haystack, $char, $positions{$char}//0;
return 1 if $p < 0;
$positions{$char} = $p + 1;
}
return 0;
}
while (<DATA>) {
print contains_chars("wIzArD", $_);
}
print "\n";
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora

This a black art way that out of order matching can occur.
Each character in the source string is visited only once.
No need to recurse the string from the beginning for each letter.
use strict;
use warnings;
while (my $line = <DATA>) {
if ( $line =~ /
(?:
.*?
(?:
(?(1)(?!))(w)
| (?(2)(?!))(i)
| (?(3)(?!))(z)
| (?(4)(?!))(a)
| (?(5)(?!))(r)
| (?(6)(?!))(d)
)
){6}/ix ) { print $line, "\n" }
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora
Wayne Drimaz
Captures can exist in two states, defined or undefined.
The essence of this black art is to use the state of captures as flags
to insure all the items are matched in the out-of-order state.
The above can also be written this way with the same result.
use strict;
use warnings;
while (my $line = <DATA>) {
if ( $line =~ /
(?im)
^
(?>
.*?
(?:
w ( ) # (1)
| i ( ) # (2)
| z ( ) # (3)
| a ( ) # (4)
| r ( ) # (5)
| d ( ) # (6)
)
)+
(?= \1 \2 \3 \4 \5 \6 )
/x ) { print $line, "\n" }
}
__DATA__
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Lionel Messi
La Signora

Just adding a simple variant of /w/ && /i/ && /z/ ... solution I mentioned in comments. If you want this solution to match a lot of different strings, instead of lining the regexes together with &&, you can simply loop over the characters. A useful tool is to use the &&= operator to mimic the behaviour of a long string of conditions. This will also allow us to short-circuit the matching if we find a mismatch, giving us a speed benefit.
For example:
/a/ && /b/ && /c/
Equivalent to
my $match = 1;
for my $w (qw(a b c)) {
$match &&= (/$w/); # $match = ($match && /$w/)
}
To remember the count of letters, i.e. whether latte should be considered to be a substring of late, you can simply use the substitution operator s/// instead of match operator m//. I added the multi-letter criteria, and added two test cases to demonstrate.
I like this solution because of the simplicity, but thereby not said it is the best one.
use strict;
use warnings;
my $word = "wizzard";
while (<DATA>) {
print search($_, $word), " $_";
}
sub search {
my ($str, $substr) = #_;
my $match = 1; # assume true
for my $w (split //, $substr) { # for each char in substr...
$match &&= ($str =~ s/\Q$w//i); # ...remove character
return 0 if not $match; # ...return false if no match found
}
return 1 if $match;
}
__DATA__
wizard
wizzard
Garry Kasparov
Bobby Fischer
Vladimir Kramnik
Wayne Drimaz
Wayne Drimazz
Lionel Messi
La Signora
Output:
0 wizard
1 wizzard
0 Garry Kasparov
0 Bobby Fischer
0 Vladimir Kramnik
0 Wayne Drimaz
1 Wayne Drimazz
0 Lionel Messi
0 La Signora
If you do not care about multi-letter matching, just replace s/// with m//.

Related

Finding and replacing matched letters at same position

It is a regex question, but I could not find a proper option for my case in Wiki page so I decided to ask here. May be a simple unknown option of regex can resolve my case.
I have a log file(a.txt) which has multiple lines of strings. I want to compare every two lines (1st vs. 2nd, 3rd vs. 4th...) and replace matched letter (not a string) to "."(or any special character).
a.txt:
1100110010
1100101100
0011001100
0110101111
.
.
.
result.txt:
.....1001.
.....01100
.0.10...00
.1.01...11
.
.
.
This may be XOR problem of two strings, so I tried this way, but it needed to be converted to ASCII and then XOR is doable (may be this approach is not right). I think there is a very simple way to do this job with SED/PERL. Any suggestion and/or guidance is appreciated. Thank you for reading my question.
Perl using bitwise operators:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
while ( !eof(DATA) ) {
chomp( my $line1 = <DATA> );
chomp( my $line2 = <DATA> );
( my $uniq_mask = $line1 ^ $line2 ) =~ s/[^\0]/\xFF/g;
my $uniq1 = $line1;
my $uniq2 = $line2;
for ( $uniq1, $uniq2 ) {
$_ &= $uniq_mask;
s/\0/./g;
}
say for $line1, $line2, $uniq1, $uniq2, '';
}
__DATA__
1100110010
1100101100
0011001100
0110101111
Outputs:
1100110010
1100101100
.....1001.
.....0110.
0011001100
0110101111
.0.10...00
.1.01...11
Here is a Perl version:
#!/usr/bin/env perl
# always use these two
use strict;
use warnings;
# handle errors in open and close
use autodie; # See http://perldoc.perl.org/autodie.html
while( ! eof( DATA ) ){
chomp( my $line1 = <DATA> );
chomp( my $line2 = <DATA> );
my #data1 = split //, $line1;
my #data2 = split //, $line2;
# do the first
for my $i ( 0 .. $#data1 ){
if( $data1[$i] eq $data2[$i] ){
print ".";
}else{
print $data1[$i];
}
}
print "\n";
# do the second
for my $i ( 0 .. $#data2 ){
if( $data1[$i] eq $data2[$i] ){
print ".";
}else{
print $data2[$i];
}
}
print "\n";
}
__DATA__
1100110010
1100101100
0011001100
0110101111
Since you mentioned xor,
my $xor = $s1 ^ $s2;
my $mask = $xor =~ tr/\x01-\xFF/\xFF/r;
my $dots = $xor =~ tr/\x00\x01-\xFF/.\x00/r;
say $s1 & $mask | $dots;
say $s2 & $mask | $dots;
This code assumes the line feed has been removed, and it assumes the length of $s1 and $s2 are the same.
Here is an answer in sed.
It assumes that the lines are always equally long and only contain "0"s and "1"s.
Only "0"s or "1"s especially covers the assumption 'no ">" anywhere'.
It seems to be somewhat robust against differently long lines (I did a few simple tests), but no guarantee.
sed -En "N;s/^(.*)\n(.*)$/>\1\n>\2/;:a;s/>([01])(.*)\n(.*)>\1/.>\2\n\3.>/;ta;s/>([^$\n])/\1>/g;ta;s/>//g;p"
The code means:
-En use extended regexes, do not print automatically
N look at this and next line at once
s/// do a single replace, non-globally because of the absence of g
first replace introduces a cursor ">" at the start of both lines
:a introduce a label for looping
second replace does replace
cursor, 0 or 1, rest of first line,
start of second line, cursor, same 0 or 1
by
dot, cursor, rest of line one,
start of second line, dot cursor
then, in case of successful replace, loop to label
otherwise third replace moves boths cursors one ahead and loops,
except if end of line is reached
fourth replace remoces the cursors
p print result
Output for you sample input (interleaved with sample input):
1100110010
1100101100
.....1001.
.....0110.
0011001100
0110101111
.0.10...00
.1.01...11
The output differs in line two from your stated desired output, "." instead of "0",
but with all due respect, I think your desired output is incorrect there.
Using: GNU sed version 4.2.1

Dynamically capture regular expression match in Perl

I'm trying to dynamically catch regex matching in Perl. I've known that eval will help me do this but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", #_[3,1,2] },
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are useing eval with a BLOCK of code. That's like a try block. If it dies inside of that eval block, it will catch that error. It will not run your string like it was code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says _take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my #captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (#captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, #captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in #captured by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# as the sigil for arrays that usually have an # sigil) in #+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if this only has one element, we don't have capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
Apologies! I realized both my question and sample code were both vague. But after reading your suggestions I came of with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my #matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < #matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
Evaluing the regexp in list context returns the matches. so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my #a = ($str =~/$pattern/);
print Dumper(\#a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my #patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( #patterns ) {
my #captured = ($str =~ $pattern)
or next;
print "'$_'\n" for #captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my #tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern})
or next;
unless (#captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my #tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( #tests ) {
my #captured = ($str =~ $test->{pattern});
ok #captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1

perl match single occurence pattern in string

I have a list of names and I want to look for names containing two given letters asigned using variables.
$one = "A";
$two = "O";
Please note that I want those letters to be present anywhere in the checked names, so that I can get outputs like this:
Jason
Damon
Amo
Noma
Boam
...
But each letter must only be present once per name, meaning that this wouldn't work.
Alamo
I've tried this bit of code but it doesn't work.
foreach my $name (#list) {
if ($name =~ /$one/) {
if ($name =~ /$two/) {
print $name;
}}
else {next}; }
How about this?
for my $name (#list) {
my $ones = () = $name =~ /$one/gi;
my $twos = () = $name =~ /$two/gi;
if ($ones == 1 && $twos == 1) {
print $name;
}
}
#!/usr/bin/env perl
#
# test.pl is the name of this script
use warnings;
use strict;
my %char = map {$_ => 1} grep {/[a-z]/} map {lc($_)} split //, join '', #ARGV;
my #chars = sort keys %char; # the different characters appearing in the command line arguments
while (my $line = <STDIN>)
{
grep {$_ <=> 1} map {scalar(() = $line =~ /$_/ig )} #chars
or print $line;
}
Now:
echo hello world | test.pl fw will print nothing (w occurs exactly once in hello world, but f does not)
echo hello world | test.pl hw will print a line consisting of hello world (both h and w occur exactly once).
One way to get it all into a single regex is to use an expression within the regex pattern to search for the other letter (a or o) based on which one was found first:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;
while(<DATA>){
chomp;
say if m/^
[^ao]* # anything but a or o
([ao]) # an 'a' or 'o'
[^ao]* # anything but a or o
(??{($1 and lc($1) eq 'a') ? 'o' : 'a'}) # the other 'a' or 'o'
[^ao]* $/xi; # anything but a or o
}
__DATA__
Abdi
Chelal
Jucal
Husham
Gallio
Pileser
Tekoa
Kenaz
Raphah
See the perlre section on Extended Expressions for more info.
This is my solution. You don't make it clear whether there will always be just two single-character strings to match but I have assumed that there may be more
Unfortunately the classical way of counting occurrences of a character -- tr/// -- doesn't interpolate variables into its searchlist and doesn't have a case-independent modifier /i. But the pattern-match operator m// does, so that is what I have used
I thoroughly dislike the so-called goatse operator, but there isn't a neater way that I know of that allows you to count the number of times a global regex pattern matches
I could have used a grep for the inner loop, but I went for a regular for loop and a next with a label as I believe it's more readable this way
use strict;
use warnings;
use v5.10.1;
use autodie;
my #list = do {
open my $fh, '<', 'names.txt';
<$fh>;
};
chomp #list;
my ($one, $two) = qw/ A O /;
NAME:
for my $name ( #list ) {
for ( $one, $two) {
my $count = () = $name =~ /$_/gi;
next NAME unless $count == 1;
}
say $name;
}
output
Gallio
Tekoa
Achbor
Clopas
This is the input that I used
Abdi
Chelal
Jucal
Husham
Gallio
Pileser
Tekoa
Kenaz
Raphah
Tiras
Jehudi
Bildad
Shemidah
Meshillemoth
Tabeel
Achbor
Jesus
Osee
Elnaam
Rephah
Asaiah
Er
Clopas
Penuel
Shema
Marsena
Jaare
Joseph
Shamariah
Levi
Aphses

Perl Parsing CSV file with embedded commas

I'm parsing a CSV file with embedded commas, and obviously, using split() has a few limitations due to this.
One thing I should note is that the values with embedded commas are surrounded by parentheses, double quotes, or both...
for example:
(Date, Notional),
"Date, Notional",
"(Date, Notional)"
Also, I'm trying to do this without using any modules for certain reasons I don't want to go into right now...
Can anyone help me out with this?
This should do what you need. It works in a very similar way to the code in Text::CSV_PP, but doesn't allow for escaped characters within the field as you say you have none
use strict;
use warnings;
use 5.010;
my $re = qr/(?| "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = "$line," =~ /$re/g;
say "<$_>" for #fields;
output
<Date, Notional 1>
<Date, Notional 2>
<Date, Notional 3>
Update
Here's a version for older Perls (prior to version 10) that don't have the regex branch reset construct. It produces identical output to the above
use strict;
use warnings;
use 5.010;
my $re = qr/(?: "\( ( [^()""]* ) \)" | \( ( [^()]* ) \) | " ( [^"]* ) " | ( [^,]* ) ) , \s* /x;
my $line = '(Date, Notional 1), "Date, Notional 2", "(Date, Notional 3)"';
my #fields = grep defined, "$line," =~ /$re/g;
say "<$_>" for #fields;
I know you already have a working solution with Borodin's answer, but for the record there is also a simple solution with split (see the results at the bottom of the online demo). This situation sounds very similar to regex match a pattern unless....
#!/usr/bin/perl
$regex = '(?:\([^\)]*\)|"[^"]*")(*SKIP)(*F)|\s*,\s*';
$subject = '(Date, Notional), "Date, Notional", "(Date, Notional)"';
#splits = split($regex, $subject);
print "\n*** Splits ***\n";
foreach(#splits) { print "$_\n"; }
How it Works
The left side of the alternation | matches complete (parentheses) and (quotes), then deliberately fails. The right side matches commas, and we know they are the right commas because they were not matched by the expression on the left.
Possible Refinements
If desired, the parenthess-matching portion could be made recursive to match (nested(parens))
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
I know that this is quite old question, but for completeness I would like to add solution from great book "Mastering Regular Expressions" by Jeffrey Friedl (page 271):
sub parse_csv {
my $text = shift; # record containing comma-separated values
my #fields = ( );
my $field;
chomp($text);
while ($text =~ m{\G(?:^|,)(?:"((?>[^"]*)(?:""[^"]*)*)"|([^",]*))}gx) {
if (defined $2) {
$field = $2;
} else {
$field = $1;
$field =~ s/""/"/g;
}
# print "[$field]";
push #fields, $field;
}
return #fields;
}
Try it against test row:
my $line = q(Ten Thousand,10000, 2710 ,,"10,000",,"It's ""10 Grand"", baby",10K);
my #fields = parse_csv($line);
my $i;
for ($i = 0; $i < #fields; $i++) {
print "$fields[$i],";
}
print "\n";

Extract word before the 1st occurrence of a special string

I have an array that contains elements like
#array=("link_dm &&& drv_ena&&&1",
"txp_n_los|rx_n_lost",
"eof &&& 2 &&& length =!!!drv!!!0");
I want to get all the characters before the first "&&&", and if the element doesn't have a "&&&", then I need to extract the entire element.
This is what I want to extract:
likn_dm
txp_n_los|rx_n_lost
eof
I used
foreach my $row (#array){
if($row =~ /^(.*)\&{3}/){
push #firstelements,$1;
}
}
But I'm getting
link_dm &&& drv_ena
txp_n_los|rx_n_lost
eof &&& 2
Can somebody please suggest how I can achieve this?
Perhaps just splitting would be helpful:
use strict;
use warnings;
my #array = (
"link_dm &&& drv_ena&&&1",
"txp_n_los|rx_n_lost",
"eof &&& 2 &&& length =!!!drv!!!0"
);
foreach my $row (#array){
my ($chars) = split /\&{3}/, $row, 2;
print $chars, "\n"
}
Output:
link_dm
txp_n_los|rx_n_lost
eof
You can write:
#firstelements = map { m/^(.*?) *&&&/ ? $1 : $_ } #array;
Or, if you prefer foreach over map and if over ?::
foreach my $row (#array){
if($row =~ /^(.*)\&{3}/) {
push #firstelements, $1;
} else {
push #firstelements, $row;
}
}
for (#array) {
print "$1\n" if /([^ ]*)(?: *[&]{3}.*)?$/;
}
If you're using regular expressions, use the minimum spanning pattern: .*?. See perldoc perlre http://perldoc.perl.org/perlre.html
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
use Data::Dumper;
# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;
# Set maximum depth for Data::Dumper, zero means unlimited
local $Data::Dumper::Maxdepth = 0;
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
my #array = (
"link_dm &&& drv_ena&&&1",
"txp_n_los|rx_n_lost",
"eof &&& 2 &&& length =!!!drv!!!0",
);
my #first_elements = ();
for my $line ( #array ){
# check for '&&&'
if( my ( $first_element ) = $line =~ m{ \A (.*?) \s* \&{3} }msx ){
push #first_elements, $first_element;
}else{
push #first_elements, $line;
}
}
print Dumper \#first_elements;