Multi-order splitting inside Perl

Multi-order splitting inside Perl - regex

I have a string which comes from a CSV file:
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
which should be translated (somehow) to
'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168;rs16997168;rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
so that perl's split does not split the single field GSA-rs16997168,rs16997168 into two separate fields
i.e. the comma should be replaced by a semi-colon if it is between the two " I can't find how to do this on Google
What I've tried so far:
$str =~ s/"([^"]+),([^"]+)"/"$1;$2"/g; but this fails with > 2 expressions
It would be great if I could somehow tell perl's split function to count everything within "" as one field even if that text has the , delimiter, but I don't know how to do that :(
I've heard of lookaheads, but I don't see how I can use them here :(

Why try to recreate a CSV parser when perfectly good ones exist?
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
while ( my $row = $csv->get_line($fh) ) {
$row->[5] =~ s/,/;/g
$csv->say(\*STDOUT, $row);
}

My guess is that we wish to capture upto four commas after the last ", for which we would be starting with a simple expression such as:
(.*",.+?,.+?,.+?,.+?),
Demo
Test
use strict;
my $str = 'NA19900,4,111629038,0;0,0;0,"GSA-rs16997168,rs16997168,rs2w34r23424",C,T,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0';
my $regex = qr/(.*",.+?,.+?,.+?,.+?),/mp;
if ( $str =~ /$regex/g ) {
print "Whole match is ${^MATCH} and its start/end positions can be obtained via \$-[0] and \$+[0]\n";
# print "Capture Group 1 is $1 and its start/end positions can be obtained via \$-[1] and \$+[1]\n";
# print "Capture Group 2 is $2 ... and so on\n";
}
# ${^POSTMATCH} and ${^PREMATCH} are also available with the use of '/p'
# Named capture groups can be called via $+{name}
RegEx
If this expression wasn't desired and you wish to modify it, please visit this link at regex101.com.
RegEx Circuit
jex.im visualizes regular expressions:

Why use a CSV module and a regex.
Just use a regex and cut out the middle man .
$str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g ;
https://regex101.com/r/tRDCen/1
Read-me version
(?m:
(?: , | ^ )
"
|
(?! ^ )
\G
)
[^",]*
\K
,
(?= [^"]* " )

Related

How the regex substitution is working in perl?

I have tried the remove duplicates from the strings, "a","b","b","a","c" after removing the result is "a","b","c",. I have achieved this, but I have a doubt about working of regex substitution
use warnings;
use strict;
my $s = q+"a","b","b","a","c"+;
$s=~s/ ("\w"),? / ($s=~s|($1)||g)?"$1,":"" /xge;
#^ ^
#| Consider this as s2
#Consider this as s1
print "\n$s\n\n";
s1 value contain string as "a","b","b","a","c"
Step 1
After substitution:
Guess, what is the data contain s1 variable from the following "a","b","b","c" or "a","b","b","a","c" or ,"b","b",,"c" data.?
I have run the regex with eval grouping
$s=~s/ ("\w"),? (?{print "$s\n"})/ ($s=~s|($1)||g)?"$1,":"" /xge;
The result is
"a","b","b","a","c"
,"b","b",,"c" #This is from after substitution
,,,,"c"
,,,,"c"
,,,,"c"
Now my dobut is s2 variable also $s why it is not concatenated with s1, it means at the second step the result should be "a","b","b","c" (All the string "a" is replaced with empty and a is added in the $s).?
Edited
The result from the eval grouping is (?{print $s})
"a","b","b","a","c"
,"b","b",,"c"
,,,,"c"
,,,,"c"
,,,,"c"
After the substitution line I printed the $s variable it is giving "a","b","c", How this output is coming.?

A regex is (in my opinion) the wrong tool to use here. I would
split the string on commas
remove duplicates from the list returned by split
join the list back into a string
Like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $str = q["a","b","b","a","c"];
my %seen;
$str = join ',',
grep { ! $seen{$_}++ }
split /,/, $str;
say $str;

The proper solution to this is split, filter, rejoin as #Dave Cross has already demonstrated.
...
However, the following regex solution does work and hopefully demonstrates why Dave's solution is superior
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
my $str = q{"a","b","b","a","c"};
1 while $str =~ s{
\A
(?: (?&element) , )*
( (?&element) ) # Capture in \1
(?: , (?&element) )*
\K
,
\1 # Remove the duplicate along with preceding comma
(?= \z | , )
(?(DEFINE)
(?<element>
"
\w
"
)
)
}{}xg;
say $str;
Outputs:
"a","b","c"

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a perl regex which converts hyphens to spaces eg:-
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.

You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
RegEx Demo
This will skip words dvi-i and dvi-d for splitting due to use of (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
Perl Code Demo
There is an alternate lookarounds based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
Perl Code Demo 2

$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
While using just (?<!dvi)-(?![id]) you will exclude also dvi-x or x-i, where x can be any character.

It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The #- array contains the starting offsets of matched groups, and the #+ array contains the ends offsets. In both cases, element 0 refers to the whole match.
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe #tchrist can figure out how to bend various assertions to his will.

we can ignore specific words using negative Look-ahead and negative Look-behind
Example :
(?!pattern)
is a negative look-ahead assertion
in your case the pattern is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
output :
use either dvi-d or dvi-i
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

Perl Regex To Remove Commas Between Quotes?

I am trying to remove commas between double quotes in a string, while leaving other commas intact? (This is an email address which sometimes contains spare commas). The following "brute force" code works OK on my particular machine, but is there a more elegant way to do it, perhaps with a single regex?
Duncan
$string = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
print "Initial string = ", $string, "<br>\n";
# Extract stuff between the quotes
$string =~ /\"(.*?)\"/;
$name = $1;
print "name = ", $1, "<br>\n";
# Delete all commas between the quotes
$name =~ s/,//g;
print "name minus commas = ", $name, "<br>\n";
# Put the modified name back between the quotes
$string =~ s/\"(.*?)\"/\"$name\"/;
print "new string = ", $string, "<br>\n";

You can use this kind of pattern:
$string =~ s/(?:\G(?!\A)|[^"]*")[^",]*\K(?:,|"(*SKIP)(*FAIL))//g;
pattern details:
(?: # two possible beginnings:
\G(?!\A) # contiguous to the previous match
| # OR
[^"]*" # all characters until an opening quote
)
[^",]* #"# all that is not a quote or a comma
\K # discard all previous characters from the match result
(?: # two possible cases:
, # a comma is found, so it will be replaced
| # OR
"(*SKIP)(*FAIL) #"# when the closing quote is reached, make the pattern fail
# and force the regex engine to not retry previous positions.
)
If you use an older perl version, \K and the backtracking control verbs may be not supported. In this case you can use this pattern with capture groups:
$string =~ s/((?:\G(?!\A)|[^"]*")[^",]*)(?:,|("[^"]*(?:"|\z)))/$1$2/g;

One way would be to use the nice module Text::ParseWords to isolate the specific field and perform a simple transliteration to get rid of the commas:
use strict;
use warnings;
use Text::ParseWords;
my $str = '06/14/2015,19:13:51,"Mrs, Nkoli,,,ka N,ebedo,,m" <ubabankoffice93#gmail.com>,1,2';
my #row = quotewords(',', 1, $str);
$row[2] =~ tr/,//d;
print join ",", #row;
Output:
06/14/2015,19:13:51,"Mrs Nkolika Nebedom" <ubabankoffice93#gmail.com>,1,2
I assume that no commas can appear legitimately in your email field. Otherwise some other replacement method is required.

Perl Regex Multiple Matches

I'm looking for a regular expression that will behave as follows:
input: "hello world."
output: he, el, ll, lo, wo, or, rl, ld
my idea was something along the lines of
while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
print "$1-$2 ";
}
But that does something a little bit different.

It's tricky. You have to capture it, save it, and then force a backtrack.
You can do that this way:
use v5.10; # first release with backtracking control verbs
my $string = "hello, world!";
my #saved;
my $pat = qr{
( \pL {2} )
(?{ push #saved, $^N })
(*FAIL)
}x;
#saved = ();
$string =~ $pat;
my $count = #saved;
printf "Found %d matches: %s.\n", $count, join(", " => #saved);
produces this:
Found 8 matches: he, el, ll, lo, wo, or, rl, ld.
If you do not have v5.10, or you have a headache, you can use this:
my $string = "hello, world!";
my #pairs = $string =~ m{
# we can only match at positions where the
# following sneak-ahead assertion is true:
(?= # zero-width look ahead
( # begin stealth capture
\pL {2} # save off two letters
) # end stealth capture
)
# succeed after matching nothing, force reset
}xg;
my $count = #pairs;
printf "Found %d matches: %s.\n", $count, join(", " => #pairs);
That produces the same output as before.
But you might still have a headache.

No need "to force backtracking"!
push #pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;
Though you might want to match any letter rather than the limited set you specified.
push #pairs, "$1$2" while /(\pL)(?=(\pL))/g;

Yet another way to do it. Doesn't use any regexp magic, it does use nested maps but this could easily be translated to for loops if desired.
#!/usr/bin/env perl
use strict;
use warnings;
my $in = "hello world.";
my #words = $in =~ /(\b\pL+\b)/g;
my #out = map {
my #chars = split '';
map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
} #words;
print join ',', #out;
print "\n";
Again, for me this is more readable than a strange regex, YMMV.

I would use captured group in lookahead..
(?=([a-zA-Z]{2}))
------------
|->group 1 captures two English letters
try it here

You can do this by looking for letters and using the pos function to make use of the position of the capture, \G to reference it in another regex, and substr to read a few characters from the string.
use v5.10;
use strict;
use warnings;
my $letter_re = qr/[a-zA-Z]/;
my $string = "hello world.";
while( $string =~ m{ ($letter_re) }gx ) {
# Skip it if the next character isn't a letter
# \G will match where the last m//g left off.
# It's pos() in a regex.
next unless $string =~ /\G $letter_re /x;
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
You can put the "check the next letter" logic inside the original regex with a zero-width positive assertion, (?=pattern). Zero-width meaning it is not captured and does not advance the position of a m//g regex. This is a bit more compact, but zero-width assertions get can get tricky.
while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
# pos() is still where the last m//g left off.
# Use substr to print the character before it (the one we matched)
# and the next one, which we know to be a letter.
say substr $string, pos($string)-1, 2;
}
UPDATE: I'd originally tried to capture both the match and the look ahead as m{ ($letter_re (?=$letter_re)) }gx but that didn't work. The look ahead is zero-width and slips out of the match. Other's answers showed that if you put a second capture inside the look-ahead then it can collapse to just...
say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
I leave all the answers here for TMTOWTDI, especially if you're not a regex master.

In Perl, how can I get the matched substring from a regex?

My program read other programs source code and colect information about used SQL queries. I have problem with getting substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
In result print prints whole line and not a substring I expect. I tried different approach, but I use Perl seldom and probably make basic concept error. ( position of tablename in line is not fixed. Another problem is multiple occurrence i.e.[... SELECT * FROM AADTTAB, BBDTTAB, ...] ). How can I obtain that substring?

Use grouping with parenthesis and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my #tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(#tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread

I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This psuedo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, #rest ) = #_;
my ( $first_capture, $second_capture, #others ) = $feldman =~ /$some_pattern/;
NOTE:: That said, your regex assumes too much about the text to be useful in more than a handful of situations. Not capturing any table name that doesn't have dt as in positions 3 and 4 out of 7? It's good enough for 1) quick-and-dirty, 2) if you're okay with limited applicability.

It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.

Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}

Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;

$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
If you use $& in your code it will slow down all pattern matches.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Multi-order splitting inside Perl - regex

Why try to recreate a CSV parser when perfectly good ones exist? use Text::CSV_XS qw( ); my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 }); while ( my $row = $csv->get_line($fh) ) { $row->[5] =~ s/,/;/g $csv->say(\*STDOUT, $row); }

Why use a CSV module and a regex. Just use a regex and cut out the middle man . $str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]\K,(?=[^"]")/;/g ; https://regex101.com/r/tRDCen/1 Read-me version (?m: (?: , | ^ ) " | (?! ^ ) \G ) [^",]* \K , (?= [^"]* " )

Related

How the regex substitution is working in perl?

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

Perl Regex To Remove Commas Between Quotes?

Perl Regex Multiple Matches

In Perl, how can I get the matched substring from a regex?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Multi-order splitting inside Perl - regex

Why try to recreate a CSV parser when perfectly good ones exist? use Text::CSV_XS qw( ); my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 }); while ( my $row = $csv->get_line($fh) ) { $row->[5] =~ s/,/;/g $csv->say(\*STDOUT, $row); }

Why use a CSV module and a regex. Just use a regex and cut out the middle man . $str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]*\K,(?=[^"]*")/;/g ; https://regex101.com/r/tRDCen/1 Read-me version (?m: (?: , | ^ ) " | (?! ^ ) \G ) [^",]* \K , (?= [^"]* " )

Related

How the regex substitution is working in perl?

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

Perl Regex To Remove Commas Between Quotes?

Perl Regex Multiple Matches

In Perl, how can I get the matched substring from a regex?

Categories

Resources

Why use a CSV module and a regex. Just use a regex and cut out the middle man . $str =~ s/(?m:(?:,|^)"|(?!^)\G)[^",]\K,(?=[^"]")/;/g ; https://regex101.com/r/tRDCen/1 Read-me version (?m: (?: , | ^ ) " | (?! ^ ) \G ) [^",]* \K , (?= [^"]* " )