Perl regex square brackets and single quotes

I have this string:
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
The data is repeated.
I need to remove the []' characters from the data so it looks like this:
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
I'm also trying to split the data to assign it to variables as seen below:
my($currency, $strike, $tenor, $tenor2,$ado_symbol) = split /,/, $_;
This works for everything but the ['TEST'] section. Should I remove the []' characters first then keep my split the same or is there an easier way to do this?
Thanks

Something that's useful to know is that split takes a regex. (It will even let you capture, but the captured text is inserted into the returned list, which is why I've used (?: ... ) for non-capturing groups.)
I observe your data only has [' and '] right next to the delimiter, so how about:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
chomp;
my @fields = split /(?:\'])?,(?:\[\')?/;
print Dumper \@fields;
}
__DATA__
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
Output:
$VAR1 = [
'ABC',
'-0.5',
'10Y',
'10Y',
'TEST',
'ABC.1000145721ABC',
'-0.5',
'20Y',
'10Y',
'TEST',
'ABC.1000145722'
];
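The remark about capturing groups in split can be seen in isolation with a small sketch. The input string 'a,b,c' and the variable names are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# A capturing group in the split pattern adds each delimiter to the result.
my @with_capture = split /(,)/, 'a,b,c';

# A non-capturing group (?:...) groups without adding anything.
my @without_capture = split /(?:,)/, 'a,b,c';

print scalar(@with_capture), " fields vs ", scalar(@without_capture), " fields\n";
```

This is why the pattern above wraps the optional quote-and-bracket parts in (?: ... ) rather than plain parentheses.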

my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
$str =~ s/\['|'\]//g;
print $str;
output is
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
Now you can split.
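Since only single characters are being removed here, tr with the /d modifier is another option. This is a minimal sketch on one record of the sample data (shortened for illustration); $clean is a name introduced here.

```perl
use strict;
use warnings;

my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721";

# Copy the string, then delete every [, ] and ' character from the copy.
(my $clean = $str) =~ tr/[]'//d;

my ($currency, $strike, $tenor, $tenor2, $ado_symbol) = split /,/, $clean;
print "$ado_symbol\n";
```

Unlike the s/// version, tr removes these characters wherever they occur, so it only fits if brackets and quotes never appear in the real values.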

Clean up $ado_symbol after split:
$ado_symbol =~ s/^\['//;
$ado_symbol =~ s/'\]$//;

You can use a global regex match to find all substrings that contain no comma, single quote, or square bracket.
Like this:
use strict;
use warnings 'all';
my $s = q{ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722};
my @data = $s =~ /[^,'\[\]]+/g;
my ( $currency, $strike, $tenor, $tenor2, $ado_symbol ) = @data;
print "\$currency = $currency\n";
print "\$strike = $strike\n";
print "\$tenor = $tenor\n";
print "\$tenor2 = $tenor2\n";
print "\$ado_symbol = $ado_symbol\n";
output
$currency = ABC
$strike = -0.5
$tenor = 10Y
$tenor2 = 10Y
$ado_symbol = TEST

Another alternative
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
my ($currency, $strike, $tenor, $tenor2,$ado_symbol) = map{ s/[^A-Z0-9\.-]//g; $_} split ',',$str;
print "$currency, $strike, $tenor, $tenor2, $ado_symbol",$/;
Output is:
ABC, -0.5, 10Y, 10Y, TEST
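A variant of the map approach, assuming Perl 5.14 or later: the /r modifier makes s/// return the modified copy instead of changing $_ in place, so the block needs no trailing $_. The input is one record of the sample data, shortened for illustration.

```perl
use strict;
use warnings;

my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721";

# s///r returns the cleaned copy of each field and leaves $_ untouched.
my ($currency, $strike, $tenor, $tenor2, $ado_symbol) =
    map { s/[^A-Z0-9.\-]//gr } split /,/, $str;

print "$currency, $strike, $tenor, $tenor2, $ado_symbol\n";
```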

Related

Perl regular expression named groups array

I'm trying to capture named groups in an array.
use strict;
use warnings;
use Data::Dumper;
my $text = 'My aunt is on vacation and eating some apples and banana';
$text =~ m/(?<Letter>A.)/img ;
print Dumper(%-) ;
output is
$VAR1 = 'Letter';
$VAR2 = [
'au'
];
but I would expect or actually was hoping that any occurrence will appear in the array.
$VAR1 = 'Letter';
$VAR2 = [
'au',
'ac',
'at',
'an',
'at',
'an'
];
Any chance to get all groups in one array?
You can either run the m//g in list context to get all the matching substrings (but without the capture names), or you need to match in a loop:
my @matches = $text =~ m/(?<Letter>A.)/img ;
print Dumper(\@matches) ;
or
while ($text =~ m/(?<Letter>A.)/img) {
print Dumper(\%+) ;
}
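If the goal is one array holding every value of the named group, the loop form can collect them as it goes. A minimal sketch; the array name @letters is introduced here for illustration.

```perl
use strict;
use warnings;

my $text = 'My aunt is on vacation and eating some apples and banana';

# %+ holds the named captures of the most recent successful match,
# so each iteration sees exactly one value for 'Letter'.
my @letters;
while ( $text =~ /(?<Letter>a.)/gi ) {
    push @letters, $+{Letter};
}

print join(',', @letters), "\n";
```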

perl regex match using global switch

I am trying to match a word that starts with a letter and is followed by "at".
I use this regex for it
use strict;
use warnings;
use Data::Dumper;
my $str = "fat 123 cat sat on the mat";
my @a = $str =~ /(\s?[a-z]{1,2}(at)\s?)/g;
print Dumper( @a );
The output I am getting is:
$ perl ~/playground/regex.pl
$VAR1 = 'fat ';
$VAR2 = 'at';
$VAR3 = ' cat ';
$VAR4 = 'at';
$VAR5 = 'sat ';
$VAR6 = 'at';
$VAR7 = ' mat';
$VAR8 = 'at';
Why does it match "at" as well, when I clearly say to match just one character before "at"?
Your optional spaces aren't a good way to delimit words: they are optional.
Use the word boundary construct \b for a rough match to the ends of words.
use strict;
use warnings;
use Data::Dumper;
my $str = "fat 123 cat sat on the mat";
my @aa = $str =~ /\b[a-z]+at\b/gi;
print Dumper \@aa;
output
$VAR1 = [
'fat',
'cat',
'sat',
'mat'
];
If you want to be more clever and be certain that the word found isn't preceded or followed by a non-space character then you can write this instead
my @aa = $str =~ /(?<!\S)[a-z]+at(?!\S)/gi;
which produces the same result for the data you show
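As for why "at" showed up separately in the question's output: in list context, m//g returns the contents of every capture group for every match, and the original pattern had two groups, the outer one and the inner (at). A sketch that keeps the question's one-or-two-letter prefix but makes the inner group non-capturing; @words is a name introduced here.

```perl
use strict;
use warnings;

my $str = "fat 123 cat sat on the mat";

# One capturing group for the whole word; (?:at) groups without capturing,
# so each match contributes exactly one element to the list.
my @words = $str =~ /\b([a-z]{1,2}(?:at))\b/g;

print join(',', @words), "\n";
```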

Dynamically capture regular expression match in Perl

I'm trying to dynamically capture regex matches in Perl. I know that eval can help me do this, but I may be doing something wrong.
Code:
use strict;
use warnings;
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
my $str = '1/12/2016';
foreach my $pattern (keys (%testHash)) {
my $value = $testHash{$pattern};
my $result;
eval {
local $_ = $str;
/$pattern/;
print "\$1 - $1\n";
print "\$2 - $2\n";
print "\$3 - $3\n";
eval { print "$value\n"; }
}
}
Is it also possible to store captured regex patterns in an array?
I believe what you really want is a dynamic version of the following:
say $str =~ s/(\d+)\/(\d+)\/(\d+)/$1$2$3/gr;
String::Substitution provides what we need to achieve that.
use String::Substitution qw( gsub_copy );
for my $pattern (keys(%testHash)) {
my $replacement = $testHash{$pattern};
say gsub_copy($str, $pattern, $replacement);
}
Note that $replacement can also be a callback. This permits far more complicated substitutions. For example, if you wanted to convert 1/12/2016 into 2016-01-12, you could use the following:
'(\d+)/(\d+)/(\d+)' => sub { sprintf "%d-%02d-%02d", @_[3,1,2] },
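For a fixed pattern, the same date conversion can be sketched in core Perl with the /e modifier, which evaluates the replacement as code. This is not the String::Substitution API, only a comparison point, and the variable names are introduced here.

```perl
use strict;
use warnings;

my $date = '1/12/2016';

# /e runs the replacement as an expression; $1..$3 are the captures
# (month, day, year in this input).
(my $iso = $date) =~ s{(\d+)/(\d+)/(\d+)}{sprintf '%d-%02d-%02d', $3, $1, $2}e;

print "$iso\n";
```

This only works when the replacement code is known when the program is written; the module is what makes it dynamic.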
To answer your actual question:
use String::Substitution qw( interpolate_match_vars last_match_vars );
for my $pattern (keys(%testHash)) {
my $template = $testHash{$pattern};
$str =~ $pattern # Or /$pattern/ if you prefer
or die("No match!\n");
say interpolate_match_vars($template, last_match_vars());
}
I am not completely sure what you want to do here, but I don't think your program does what you think it does.
You are using eval with a BLOCK of code. That's like a try block: if the code dies inside that eval block, the eval catches the error. It will not run your string as if it were code. You need a string eval for that.
Instead of explaining that, here's an alternative.
This program uses sprintf and numbers the parameters. The %1$s syntax in the pattern says: take the first argument (1$) and format it as a string (%s). You don't need to localize or assign to $_ to do a match. The =~ operator does that on other variables for you. I also use qr{} to create a quoted regular expression (essentially a variable containing a precompiled pattern) that I can use directly. Because of the {} as delimiter, I don't need to escape the slashes.
use strict;
use warnings;
use feature 'say'; # like print ..., "\n"
my %testHash = (
qr{(\d+)/(\d+)/(\d+)} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d+) nomatch} => '%1$s.%2$s.%3$s',
qr{(\d+)/(\d+)/(\d\d\d\d)} => '%3$4d-%2$02d-%1$02d',
qr{\d} => '%s', # no capture group
);
my $str = '1/12/2016';
foreach my $pattern ( keys %testHash ) {
my @captures = ( $str =~ $pattern );
say "pattern: $pattern";
if ($#+ == 0) {
say " no capture groups";
next;
}
unless (@captures) {
say " no match";
next;
}
# debug-output
for my $i ( 1 .. $#- ) {
say sprintf " \$%d - %s", $i, $captures[ $i - 1 ];
}
say sprintf $testHash{$pattern}, @captures;
}
I included four examples:
The first pattern is the one you had. It uses %1$s and so on as explained above.
The second one does not match. We check the number of elements in @captures by looking at it in scalar context.
The third one shows that you can also reorder the result, or even use the sprintf formatting.
The last one has no capture group. We check by looking at the index of the last element ($# is the sigil for that on arrays, which usually have an @ sigil) in @+, which holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. The first element is the end of the overall match, so if @+ has only one element, there are no capture groups.
The output for me is this:
pattern: (?^:(\d+)/(\d+)/(\d\d\d\d))
$1 - 1
$2 - 12
$3 - 2016
2016-12-01
pattern: (?^:(\d+)/(\d+)/(\d+) nomatch)
no match
pattern: (?^:\d)
no capture groups
pattern: (?^:(\d+)/(\d+)/(\d+))
$1 - 1
$2 - 12
$3 - 2016
1.12.2016
Note that the order in the output is mixed up. That's because hashes are not ordered in Perl, and if you iterate over the keys in a hash without sort the order is random.
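If the order of the rules matters, one option is to keep them in an array of pairs instead of a hash. A minimal sketch under that assumption; @rules and @out are names introduced here, and the formats are adapted from the program above.

```perl
use strict;
use warnings;

# Each rule is a [pattern, sprintf format] pair; arrays keep their order.
my @rules = (
    [ qr{(\d+)/(\d+)/(\d+)},     '%1$s.%2$s.%3$s' ],
    [ qr{(\d+)/(\d+)/(\d\d\d\d)}, '%3$04d-%2$02d-%1$02d' ],
);

my $str = '1/12/2016';
my @out;
for my $rule (@rules) {
    my ( $pattern, $format ) = @$rule;
    my @captures = $str =~ $pattern or next;   # skip rules that do not match
    push @out, sprintf $format, @captures;
}

print "$_\n" for @out;
```

A side benefit is that the qr objects stay compiled; as hash keys they are turned into plain strings.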
Apologies! I realized both my question and my sample code were vague. After reading your suggestions I came up with the following code.
I haven't optimized this code yet and there is a limit to the replacement.
foreach my $key (keys %testHash) {
if ( $str =~ $key ) {
my @matchArr = ($str =~ $key); # Capture all matches
# Search and replace (limited from $1 to $9)
for ( my $i = 0; $i < @matchArr; $i++ ) {
my $num = $i+1;
$testHash{$key} =~ s/\$$num/$matchArr[$i]/;
}
$result = $testHash{$key};
last;
}
}
print "$result\n";
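A possible tightening of the loop above, as a sketch only: fill the placeholders in a copy of the template with one global substitution. That leaves the hash untouched for later inputs and is not limited to single-digit placeholders. The hash name %templates and the '$3-$1-$2' template are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical template table: pattern => text with $N placeholders.
my %templates = ( '(\d+)/(\d+)/(\d+)' => '$3-$1-$2' );

my $str = '1/12/2016';
my $result;
for my $pattern ( keys %templates ) {
    my @captures = $str =~ /$pattern/ or next;
    # Work on a copy; each $N placeholder becomes capture number N.
    ( $result = $templates{$pattern} ) =~ s/\$(\d+)/$captures[$1 - 1]/g;
    last;
}

print "$result\n";
```

A placeholder that refers to a capture the pattern does not have would produce an undefined-value warning, so real code may want to check for that.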
Evaluating the regexp in list context returns the captured substrings, so in your example:
use Data::Dumper; # so we can see the result
foreach my $pattern (keys (%testHash)) {
my @a = ($str =~/$pattern/);
print Dumper(\@a);
}
would do the job.
HTH
Georg
Is it also possible to store captured regex patterns in an array?
Of course it is possible to store captured substrings in an array:
#!/usr/bin/env perl
use strict;
use warnings;
my @patterns = map qr{$_}, qw{
(\d+)/(\d+)/(\d+)
};
my $str = '1/12/2016';
foreach my $pattern ( @patterns ) {
my @captured = ($str =~ $pattern)
or next;
print "'$_'\n" for @captured;
}
Output:
'1'
'12'
'2016'
I do not quite understand what you are trying to do with combinations of local, eval EXPR and eval BLOCK in your code and the purpose of the following hash:
my %testHash = (
'(\d+)\/(\d+)\/(\d+)' => '$1$2$3'
);
If you are trying to codify that this pattern should result in three captures, you can do that like this:
my @tests = (
{
pattern => qr{(\d+)/(\d+)/(\d+)},
ncaptures => 3,
}
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern})
or next;
unless (@captured == $test->{ncaptures}) {
# handle failure
}
}
See this answer to find out how you can automate counting the number of capture groups in a pattern. Using the technique in that answer:
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;
my @tests = map +{ pattern => qr{$_}, ncaptures => number_of_capturing_groups($_) }, qw(
(\d+)/(\d+)/(\d+)
);
my $str = '1/12/2016';
foreach my $test ( @tests ) {
my @captured = ($str =~ $test->{pattern});
ok @captured == $test->{ncaptures};
}
done_testing;
sub number_of_capturing_groups {
"" =~ /|$_[0]/;
return $#+;
}
Output:
ok 1
1..1
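The counting trick works because the empty alternative before | always matches the empty string, and after a successful match @+ has one slot per group in the pattern whether or not the group took part. A small sketch of the same idea on two patterns; count_groups is a name introduced here.

```perl
use strict;
use warnings;

# Count capture groups by forcing a match: the empty alternative
# before | always succeeds, and $#+ is then the number of groups.
sub count_groups {
    my ($pattern) = @_;
    "" =~ /|$pattern/;
    return $#+;
}

my $three = count_groups('(\d+)/(\d+)/(\d+)');
my $zero  = count_groups('\d');

print "$three and $zero\n";
```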

Perl regex to replace part of one string with a portion of another

I have a need in Perl to replace a section of one string with most of another. :-) This needs be done for multiple pairs of strings.
For example, I need to replace
"/root_vdm_2/fs_clsnymigration"
within
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
with
rfsn_clsnymigration
so that I end up with
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
(without the leading "/root_vdm_2" part) ... but I am sufficiently sleep-deprived to have lost sight of how to accomplish this.
Help ?
Try this regex:
^\/root_vdm_2\/fs_clsnymigration
Substitute with:
\/rfsn_clsnymigration
example:
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1";
$string=~s/^\/root_vdm_2\/fs_clsnymigration/\/rfsn_clsnymigration/;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
EDIT 1
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02";
$string=~s/^\/root_vdm_.\/fs_[^\/]*/\/rfsn_clsnymigration/gm;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/rfsn_clsnymigration/users/Marketing,rfsw_users
/rfsn_clsnymigration/sandi_users,rfsw_sandi
/rfsn_clsnymigration/Analytics,rfsw_pci
/rfsn_clsnymigration/camnt01/AV,rfsw_camnt01
/rfsn_clsnymigration/sfa,rfss_stcloud01
/rfsn_clsnymigration/data4,rfss_stcloud03
/rfsn_clsnymigration/depart1,rfss_stcloud02
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($lhs, $rhs) = split(/,/, $_, 2);
my @parts = split(/\//, $lhs);
splice(@parts, 1, 2, $rhs);
print join('/', @parts) . "\n";
}
__DATA__
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
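The same per-line rewrite can also be sketched as a single substitution that captures the rest of the path and the name after the comma. The pattern assumes every line has the /root_vdm_N/fs_NAME/... ,REPLACEMENT shape shown in the data; @lines and @new_paths are names introduced here.

```perl
use strict;
use warnings;

my @lines = (
    '/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration',
    '/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi',
);

my @new_paths;
for my $line (@lines) {
    # $1 = remainder of the path, $2 = replacement name after the comma.
    ( my $path = $line ) =~ s{^/root_vdm_\d+/fs_[^/]+(/.*),([^,]+)$}{/$2$1};
    push @new_paths, $path;
}

print "$_\n" for @new_paths;
```

Lines that do not fit the assumed shape are left unchanged by the substitution, so they are easy to spot in the output.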
My challenge was to replace part of $string1 with all of $string2, split on the commas.
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
The difficulty I saw initially was how to replace /root_vdm_2/fs_clsnymigration with rfsn_clsnymigration, and I allowed myself to think that a regexp was the best approach.
Although far less elegant, this gets the job done:
foreach $line (@lines) {
chomp $line;
($orig,$replica) = split /,/, $line;
$orig =~ s{^/}{}; # drop the leading slash
@pathparts = split /\//, $orig;
$rootvdm = shift @pathparts; # discard root_vdm_N
$pathparts[0] = $replica; # replace fs_... with the replica name
$newpath = "/" . join ('/', @pathparts);
print " here's \$newpath:$newpath\n";
}
... which yields something like this:
here's $newpath:/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU
here's $newpath:/rfsw_users/users/Marketing
here's $newpath:/rfsw_sandi/sandi_users
here's $newpath:/rfsw_pci/Analytics
here's $newpath:/rfsw_camnt01/camnt01/AV
here's $newpath:/rfss_stcloud01/sfa
here's $newpath:/rfss_stcloud03/data4
here's $newpath:/rfss_stcloud02/depart1

Counting occurrences of a word in a string in Perl

I am trying to find out the number of occurrences of "The/the". Below is the code I tried:
print ("Enter the String.\n");
$inputline = <STDIN>;
chop($inputline);
$regex="\[Tt\]he";
if($inputline ne "")
{
@splitarr= split(/$regex/,$inputline);
}
$scalar=@splitarr;
print $scalar;
The string is :
Hello the how are you the wanna work on the project but i the u the
The
The output that it gives is 7. However with the string :
Hello the how are you the wanna work on the project but i the u the
the output is 5. I suspect my regex. Can anyone help in pointing out what's wrong.
I get the correct number - 6 - for the first string
However, your method is wrong: if you count the number of pieces you get by splitting on the regex pattern, the result depends on whether the word appears at the beginning or end of the string. You should also put word boundaries \b into your regular expression to prevent the regex from matching something like theory.
Also, it is unnecessary to escape the square brackets, and you can use the /i modifier to do a case-independent match.
Try something like this instead
use strict;
use warnings;
print 'Enter the String: ';
my $inputline = <>;
chomp $inputline;
my $regex = 'the';
if ( $inputline ne '' ) {
my @matches = $inputline =~ /\b$regex\b/gi;
print scalar @matches, " occurrences\n";
}
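To see why counting split pieces is unreliable, compare two strings that each contain the word exactly once. A minimal sketch with made-up inputs: split keeps an empty leading field but drops empty trailing ones, so the piece count differs while the match count does not.

```perl
use strict;
use warnings;

# One occurrence at the start: split yields ('', ' cat').
my @leading = split /the/, 'the cat';

# One occurrence at the end: the trailing empty field is dropped.
my @trailing = split /the/, 'cat the';

# Counting matches directly gives 1 for both strings.
my $count_leading  = () = 'the cat' =~ /the/g;
my $count_trailing = () = 'cat the' =~ /the/g;

printf "split: %d and %d pieces; match: %d and %d\n",
    scalar(@leading), scalar(@trailing), $count_leading, $count_trailing;
```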
With split, you're counting the substrings between the the's. Use match instead:
#!/usr/bin/perl
use warnings;
use strict;
my $regex = qr/[Tt]he/;
for my $string ('Hello the how are you the wanna work on the project but i the u the The',
'Hello the how are you the wanna work on the project but i the u the',
'the theological cathedral'
) {
my $count = () = $string =~ /$regex/g;
print $count, "\n";
my @between = split /$regex/, $string;
print 0 + @between, "\n";
print join '|', @between;
print "\n";
}
Note that both methods return the same number for the two inputs you mentioned (and the first one returns 6, not 7).
The following snippet uses a code side-effect to increment a counter, followed by an always-failing match to keep searching. It produces the correct answer for matches that overlap (e.g. "aaaa" contains "aa" 3 times, not 2). The split-based answers don't get that right.
my $i;
my $string;
$i = 0;
$string = "aaaa";
$string =~ /aa(?{$i++})(?!)/;
print "'$string' contains /aa/ x $i (should be 3)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the The";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 6)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 5)\n";
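Another way to count overlapping matches, offered as an alternative sketch rather than a replacement for the code above: put the pattern in a lookahead, so each match is zero-width and the search resumes one character further on. Variable names are introduced here.

```perl
use strict;
use warnings;

# Zero-width lookahead: matches at every position where 'aa' starts.
my $overlapping = () = 'aaaa' =~ /(?=aa)/g;

# A plain global match consumes the text, so overlaps are skipped.
my $plain = () = 'aaaa' =~ /aa/g;

print "overlapping: $overlapping, plain: $plain\n";
```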
What you need is the 'countof' operator, = () =, to count the number of matches:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/[Tt]he/g;
print $count;
If you want to select only the word the or The, add word boundary:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/\b[Tt]he\b/g;
print $count;