How the regex substitution is working in perl? - regex

I have tried the remove duplicates from the strings, "a","b","b","a","c" after removing the result is "a","b","c",. I have achieved this, but I have a doubt about working of regex substitution
use warnings;
use strict;
my $s = q+"a","b","b","a","c"+;
$s=~s/ ("\w"),? / ($s=~s|($1)||g)?"$1,":"" /xge;
#^ ^
#| Consider this as s2
#Consider this as s1
print "\n$s\n\n";
s1 value contain string as "a","b","b","a","c"
Step 1
After substitution:
Guess, what is the data contain s1 variable from the following "a","b","b","c" or "a","b","b","a","c" or ,"b","b",,"c" data.?
I have run the regex with eval grouping
$s=~s/ ("\w"),? (?{print "$s\n"})/ ($s=~s|($1)||g)?"$1,":"" /xge;
The result is
"a","b","b","a","c"
,"b","b",,"c" #This is from after substitution
,,,,"c"
,,,,"c"
,,,,"c"
Now my dobut is s2 variable also $s why it is not concatenated with s1, it means at the second step the result should be "a","b","b","c" (All the string "a" is replaced with empty and a is added in the $s).?
Edited
The result from the eval grouping is (?{print $s})
"a","b","b","a","c"
,"b","b",,"c"
,,,,"c"
,,,,"c"
,,,,"c"
After the substitution line I printed the $s variable it is giving "a","b","c", How this output is coming.?

A regex is (in my opinion) the wrong tool to use here. I would
split the string on commas
remove duplicates from the list returned by split
join the list back into a string
Like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $str = q["a","b","b","a","c"];
my %seen;
$str = join ',',
grep { ! $seen{$_}++ }
split /,/, $str;
say $str;

The proper solution to this is split, filter, rejoin as #Dave Cross has already demonstrated.
...
However, the following regex solution does work and hopefully demonstrates why Dave's solution is superior
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
my $str = q{"a","b","b","a","c"};
1 while $str =~ s{
\A
(?: (?&element) , )*
( (?&element) ) # Capture in \1
(?: , (?&element) )*
\K
,
\1 # Remove the duplicate along with preceding comma
(?= \z | , )
(?(DEFINE)
(?<element>
"
\w
"
)
)
}{}xg;
say $str;
Outputs:
"a","b","c"

Related

Multi-order splitting inside Perl

I have a string which comes from a CSV file:
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
which should be translated (somehow) to
'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168;rs16997168;rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
so that perl's split does not split the single field GSA-rs16997168,rs16997168 into two separate fields
i.e. the comma should be replaced by a semi-colon if it is between the two " I can't find how to do this on Google
What I've tried so far:
$str =~ s/"([^"]+),([^"]+)"/"$1;$2"/g; but this fails with > 2 expressions
It would be great if I could somehow tell perl's split function to count everything within "" as one field even if that text has the , delimiter, but I don't know how to do that :(
I've heard of lookaheads, but I don't see how I can use them here :(
Why try to recreate a CSV parser when perfectly good ones exist?
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
while ( my $row = $csv->get_line($fh) ) {
$row->[5] =~ s/,/;/g
$csv->say(\*STDOUT, $row);
}
My guess is that we wish to capture upto four commas after the last ", for which we would be starting with a simple expression such as:
(.*",.+?,.+?,.+?,.+?),
Demo
Test
use strict;
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
my $regex = qr/(.*",.+?,.+?,.+?,.+?),/mp;
if ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}
RegEx
If this expression wasn't desired and you wish to modify it, please visit this link at regex101.com.
RegEx Circuit
jex.im visualizes regular expressions:
Why use a CSV module and a regex.
Just use a regex and cut out the middle man .
$str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g ;
https://regex101.com/r/tRDCen/1
Read-me version
(?m:
(?: , | ^ )
"
|
(?! ^ )
\G
)
[^",]*
\K
,
(?= [^"]* " )

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a perl regex which converts hyphens to spaces eg:-
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.
You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
RegEx Demo
This will skip words dvi-i and dvi-d for splitting due to use of (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
Perl Code Demo
There is an alternate lookarounds based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
Perl Code Demo 2
$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
While using just (?<!dvi)-(?![id]) you will exclude also dvi-x or x-i, where x can be any character.
It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The #- array contains the starting offsets of matched groups, and the #+ array contains the ends offsets. In both cases, element 0 refers to the whole match.
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe #tchrist can figure out how to bend various assertions to his will.
we can ignore specific words using negative Look-ahead and negative Look-behind
Example :
(?!pattern)
is a negative look-ahead assertion
in your case the pattern is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
output :
use either dvi-d or dvi-i
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

Perl split by regexp issue

I'm writing some parser on Perl and here is a problem with split. Here is my code:
my $str = 'a,b,"c,d",e';
my #arr = split(/,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
# try to split the string by comma delimiter, but only if comma is followed by the even or zero number of quotes
foreach my $val (#arr) {
print "$val\n"
}
I'm expecting the following:
a
b
"c,d"
e
But this is what am I really received:
a
b,"c,d"
b
"c,d"
"c,d"
e
I see my string parts are in array, their indices are 0, 2, 4, 6. But how to avoid these odd b,"c,d" and other rest string parts in the resulting array? Is there any error in my regexp delimiter or is there some special split options?
You need to use a non-capturing group:
my #arr = split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
^^
See IDEONE demo
Otherwise, the captured texts are output as part of the resulting array.
See perldoc reference:
If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
What's tripping you up is a feature in split in that if you're using a group, and it's set to capture - it returns the captured 'bit' as well.
But rather than using split I would suggest the Text::CSV module, that already handles quoting for you:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new();
my $fields = $csv->getline( \*DATA );
print join "\n", #$fields;
__DATA__
a,b,"c,d",e
Prints:
a
b
c,d
e
My reasoning is fairly simple - you're doing quote matching and may have things like quoted/escaped quotes, etc. mean you're trying to do a recursive parse, which is something regex simply isn't well suited to doing.
You can use parse_line() of Text::ParseWords, if you are not really bounded for regex:
use Text::ParseWords;
my $str = 'a,b,"c,d",e';
my #arr = parse_line(',', 1, $str);
foreach (#arr)
{
print "$_\n";
}
Output:
a
b
"c,d"
e
Do matching instead of splitting.
use strict; use warnings;
my $str = 'a,b,"c,d",e';
my #matches = $str =~ /"[^"]*"|[^,]+/g;
foreach my $val (#matches) {
print "$val\n"
}

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";
You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;
One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.