Array splitting with regex

I have an array that holds only one string; it is the value of a hash:
%hash = ("key"=>["Value1 unit(1), Value2 unit(2), Value3 unit"])
How can I strip the " unit" parts from the hash value and save the remaining values to an array?
The new array should look like this:
@array = ("Value1", "Value2", "Value3")
I've tried this:
@array=split(/\s\w\(\w\)\,/, $hash{key});

Split the string on comma, then strip off the unit at the end.
map(s/\s.*$//, @array = split(/,\s*/, $hash{'key'}[0]));

Here's another option:
use strict;
use warnings;
use Data::Dumper;
my %hash = ("key"=>["1567 I(u), 2070 I(m), 2.456e-2 V(m), 417 ---, 12 R(k),"]);
my @array = $hash{'key'}->[0] =~ /(\S+)\s+\S+,?/g;
print Dumper \@array;
Output:
$VAR1 = [
'1567',
'2070',
'2.456e-2',
'417',
'12'
];

And another option:
my %hash = ("key"=>["Value1 unit(1), Value2 unit(2), Value3 unit"]);
my $i;
my @array = grep { ++$i % 2 } split /,?\s/, $hash{key}->[0];
$, = "\n";
print @array;

You can do it with a single relatively straightforward regex:
@array = $hash{'key'}[0] =~ m/\s*(\S+)\s+\S+,?/g;
The important thing to know is that with the /g flag, a match in list context returns all the captured groups ($1, $2, etc.) from every match found in the string. This regex has one such group.
My assumptions are:
The "Value1", "Value2" parts are just chunks of non-whitespace
The "unit(1)", "unit(2)", parts are also just chunks of non-whitespace
If these aren't valid assumptions, you can replace the two \S+ parts of the regex with something more specific that matches your data.
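As a minimal sketch of the list-context behaviour described above, the following runs that regex against the sample data from the question. The join separator is only there to make the output easy to read.

```perl
use strict;
use warnings;

# Sample data copied from the question.
my %hash = ( key => ["Value1 unit(1), Value2 unit(2), Value3 unit"] );

# In list context, m//g returns the captured group from every match.
my @array = $hash{key}[0] =~ m/\s*(\S+)\s+\S+,?/g;

print join( "|", @array ), "\n";    # Value1|Value2|Value3
```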

Related

perl regex square brackets and single quotes

Have this string:
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
The data is repeated.
I need to remove the []' characters from the data so it looks like this:
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
I'm also trying to split the data to assign it to variables as seen below:
my($currency, $strike, $tenor, $tenor2,$ado_symbol) = split /,/, $_;
This works for everything but the ['TEST'] section. Should I remove the []' characters first then keep my split the same or is there an easier way to do this?
Thanks
It is useful to know that split takes a regex. (It will even let you capture, but the captured text is then inserted into the returned list, which is why I've used (?: ... ) non-capturing groups.)
I observe your data only has [' right next to the delimiter - so how about:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
chomp;
my @fields = split /(?:\'])?,(?:\[\')?/;
print Dumper \@fields;
}
__DATA__
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
Output:
$VAR1 = [
'ABC',
'-0.5',
'10Y',
'10Y',
'TEST',
'ABC.1000145721ABC',
'-0.5',
'20Y',
'10Y',
'TEST',
'ABC.1000145722'
];
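To illustrate the remark above about capturing groups in split, here is a small sketch on a shortened, made-up line (the variable names are illustrative only). It compares the non-capturing delimiter with a capturing version of the same pattern.

```perl
use strict;
use warnings;

# Shortened, hypothetical sample in the same shape as the question's data.
my $line = "a,['b'],c";

# Non-capturing groups: the optional quote/bracket debris is consumed
# as part of the delimiter and never appears in the result.
my @clean = split /(?:'\])?,(?:\[')?/, $line;

# Capturing groups: split also returns what each group matched
# (undef when an optional group did not take part in the match).
my @noisy = split /('\])?,(\[')?/, $line;

print scalar(@clean), " fields without captures\n";    # 3
print scalar(@noisy), " fields with captures\n";       # 7
```

With the capturing version, the extra list elements are the delimiter pieces themselves, which is rarely what you want when assigning to a fixed list of variables.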
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
$str =~ s/\['|'\]//g;
print $str;
output is
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
Now you can split.
Clean up $ado_symbol after split:
$ado_symbol =~ s/^\['//;
$ado_symbol =~ s/'\]$//;
You can use a global regex match to find all substrings that are not a comma, a single quote, or a square bracket
Like this
use strict;
use warnings 'all';
my $s = q{ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722};
my @data = $s =~ /[^,'\[\]]+/g;
my ( $currency, $strike, $tenor, $tenor2, $ado_symbol ) = @data;
print "\$currency = $currency\n";
print "\$strike = $strike\n";
print "\$tenor = $tenor\n";
print "\$tenor2 = $tenor2\n";
print "\$ado_symbol = $ado_symbol\n";
output
$currency = ABC
$strike = -0.5
$tenor = 10Y
$tenor2 = 10Y
$ado_symbol = TEST
Another alternative
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
my ($currency, $strike, $tenor, $tenor2,$ado_symbol) = map{ s/[^A-Z0-9\.-]//g; $_} split ',',$str;
print "$currency, $strike, $tenor, $tenor2, $ado_symbol",$/;
Output is:
ABC, -0.5, 10Y, 10Y, TEST

Perl regex to replace part of one string with a portion of another

I have a need in Perl to replace a section of one string with most of another. :-) This needs to be done for multiple pairs of strings.
For example, I need to replace
"/root_vdm_2/fs_clsnymigration"
within
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
with
rfsn_clsnymigration
so that I end up with
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
(without the leading "/root_vdm_2" part) ... but I am sufficiently sleep-deprived to have lost sight of how to accomplish this.
Help ?
Try this regex:
^\/root_vdm_2\/fs_clsnymigration
Substitute with:
\/rfsn_clsnymigration
example:
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1";
$string=~s/^\/root_vdm_2\/fs_clsnymigration/\/rfsn_clsnymigration/;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
EDIT 1
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02";
$string=~s/^\/root_vdm_.\/fs_[^\/]*/\/rfsn_clsnymigration/gm;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/rfsn_clsnymigration/users/Marketing,rfsw_users
/rfsn_clsnymigration/sandi_users,rfsw_sandi
/rfsn_clsnymigration/Analytics,rfsw_pci
/rfsn_clsnymigration/camnt01/AV,rfsw_camnt01
/rfsn_clsnymigration/sfa,rfss_stcloud01
/rfsn_clsnymigration/data4,rfss_stcloud03
/rfsn_clsnymigration/depart1,rfss_stcloud02
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($lhs, $rhs) = split(/,/, $_, 2);
my @parts = split(/\//, $lhs);
splice(@parts, 1, 2, $rhs);
print join('/', @parts) . "\n";
}
__DATA__
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
My challenge was to replace part of $string1 with all of $string2, split on the commas.
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
The difficulty I saw initially was how to replace /root_vdm_2/fs_clsnymigration with rfsn_clsnymigration, and I allowed myself to think that a regexp was the best approach.
Although far less elegant, this gets the job done:
foreach $line (@lines) {
chomp $line;
($orig,$replica) = split /\,/, $line;
chop substr $orig, 0, 1; # drop the leading slash
@pathparts = split /\//, $orig;
$rootvdm = shift @pathparts;
$pathparts[0] = $replica;
$newpath = "/" . join ('/', @pathparts);
print " here's \$newpath:$newpath\n";
}
... which yields something like this:
here's $newpath:/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU
here's $newpath:/rfsw_users/users/Marketing
here's $newpath:/rfsw_sandi/sandi_users
here's $newpath:/rfsw_pci/Analytics
here's $newpath:/rfsw_camnt01/camnt01/AV
here's $newpath:/rfss_stcloud01/sfa
here's $newpath:/rfss_stcloud03/data4
here's $newpath:/rfss_stcloud02/depart1
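For comparison, a regex-based version is also possible. This is only a sketch, assuming every line has the shape "/root_vdm_N/fs_name/rest,replacement" as in the sample data; the array name @newpaths is made up for the example.

```perl
use strict;
use warnings;

# Two lines taken from the sample data above.
my @lines = (
    "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration",
    "/root_vdm_2/fs_users/users/Marketing,rfsw_users",
);

my @newpaths;
for my $line (@lines) {
    # Separate the original path from its replacement name.
    my ( $orig, $replica ) = split /,/, $line, 2;

    # Replace the first two path components with the replacement name.
    ( my $newpath = $orig ) =~ s{^/[^/]+/[^/]+}{/$replica};
    push @newpaths, $newpath;
}

print "$_\n" for @newpaths;
```

Using s{...}{...} instead of s/.../.../ avoids having to escape the slashes in the path.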

How to split a string with a numeric suffix?

I have an input string and I need to split it according to the requirement below.
Input String :
1. "string"
2. "String 12343534"
3. "String_12343534"
4. "Stringone Stringtwo 12343534"
5. "Stringone Stringtwo_12343534"
6. "string 23string 12343534"
7. "string 23string_12343534"
8. "string_23string 12343534"
9. "string_23string_12343534"
10. "string 23string 4545stringthird 12343534"
11. "string 23string 4545stringthird_12343534"
12. "string_23string_stringthird_12343534"
13. "string-23string-stringthird_12343534"
14. "string_23string-stringthird_12343534"
And so on. I have to separate the string part from the numeric part.
The output should look like this:
1. $str = "string" ; $num = ;
2. $str = "String" $num = "12343534";
3. $str = "String" $num = "_12343534";
4. $str = "Stringone Stringtwo" $num = "12343534";
5. $str = "Stringone Stringtwo" $num = "_12343534";
6. $str = "string 23string" $num = "12343534";
7. $str = "string 23string" $num = "_12343534";
8. $str = "string_23string" $num = "12343534";
9. $str = "string_23string" $num = "_12343534";
10. $str = "string 23string 4545stringthird" $num = "12343534";
11. $str = "string 23string 4545stringthird" $num = "_12343534";
12. $str = "string_23string_stringthird" $num = "_12343534";
13. $str = "string-23string-stringthird" $num = "_12343534";
14. $str = "string_23string-stringthird" $num = "_12343534";
Can anyone help me with this? How do I split the given string to get the output shown above?
Since you want to keep everything, you have to split on an anchor point. You can use a lookahead for this. Split on the following pattern:
(?=_\d)|\s+(?=\d)
So:
my ($string, $numerical) = split /(?=_\d)|\s+(?=\d)/, $input;
If an underscore is present before the digits, it will split just before it, otherwise it will split on any whitespace followed by a digit. This is the translation of the regex.
You could also use the following:
(?=_\d+$)|\s+(?=\d+$)
This will ensure there's nothing after the digits by forcing the match to go to the end of the string. If there's a non-digit character at the end, the split won't happen.
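A short sketch of the anchored pattern on two inputs from the question's list follows. It is meant only to show where the split happens, not to cover every case.

```perl
use strict;
use warnings;

# Inputs taken from the question's list.
my @inputs = ( "String_12343534", "string 23string 12343534" );

for my $input (@inputs) {
    # Zero-width split before "_digits" at the end of the string, or a
    # split on whitespace that is followed only by digits up to the end.
    my ( $string, $numerical ) = split /(?=_\d+$)|\s+(?=\d+$)/, $input;
    print "'$string' / '$numerical'\n";
}
```

In the first input the underscore stays with the number because the lookahead consumes nothing. In the second, the space before "23string" is not a split point because digits followed by letters do not reach the end of the string.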
But it's easier to just match what you need instead of splitting IMO:
my ($string, $numerical) = $input =~ /^(.*?)\s*(_?\d+)$/;
This is more readable and better conveys your intent.
Personally I find the solutions using split a little overcomplex, and none of them seem to cope with a string like:
my $input = "code 4 you 12345678";
... where I'd expect the numeric suffix to be 12345678, not "4" or "4 you".
I'd prefer something like:
my ($string, $numerical) = $input =~ /^ (.+?) \s* (_?\d+) $/x;
Update: I think my solution above already covers most of your updated examples: all but the first example where the numeric suffix is empty. To cover the first example, you also need to set $string to the entire input string when the regexp fails to match at all. Something like this:
my ($string, $numerical) = ($input =~ /^ (.+?) \s* (_?\d+) $/x) ? ($1, $2) : ($input);
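The match-with-fallback idea can be wrapped in a small helper and tried on a few inputs. This is a sketch: the name split_suffix is made up for illustration, and returning an empty string for a missing suffix is an assumption about what the asker wants.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (string part, numeric suffix).
# The suffix is '' when the input has no trailing digits.
sub split_suffix {
    my ($input) = @_;
    if ( $input =~ /^ (.+?) \s* (_?\d+) $/x ) {
        return ( $1, $2 );
    }
    return ( $input, '' );
}

for my $input ( "string", "String_12343534", "string 23string 12343534", "code 4 you 12345678" ) {
    my ( $str, $num ) = split_suffix($input);
    print "'$str' / '$num'\n";
}
```

Because (.+?) is lazy and the pattern is anchored with $, digits in the middle of the input stay in the string part and only the final run of digits becomes the suffix.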
You could try the code below:
my ($string, $numerical) = split / (?=\d+)|(?=_\d+)/, $str;
(?=_\d+) is a positive lookahead, which asserts that what follows is an underscore followed by one or more digits. If this condition is true, the regex engine places the match position just before the _\d+. Splitting on this zero-width match gives you the desired results.
Since you want to split on the boundary between numerical and alpha characters, you need to use positive lookahead and lookbehind assertions.
The additional spec for deciding where to include underscores is not entirely clear, but this is my best interpretation of what your intent may be:
use strict;
use warnings;
while (<DATA>) {
chomp;
my @fields = split m{(?<=[a-z])\s*(?=_*\d)|(?<=\d)\s*(?=_*[a-z])}i, $_;
use Data::Dump;
dd @fields;
}
__DATA__
string 123456
string_45645645
stringone stringtwo 23435345345
string one string two_2335345345
Outputs:
("string", 123456)
("string", "_45645645")
("stringone stringtwo", 23435345345)
("string one string two", "_2335345345")
([a-zA-Z\s]*)(.*)$
This will work.
See demo.
http://regex101.com/r/rX0dM7/8

Perl idiom for quickly searching file with elements in array

what is the Perl idiom to search a string or a whole file for array elements occurrences? E.g.:
my @array = qw(word test ...);
my $string = ".......";
I want to search for word or test (can also be words, tester, etc.) inside $string and return whatever is found (i.e. group match).
I searched the docs, seems like map + grep is what I need but I just can’t come up with the code for it. Perl is such fun that I am totally clueless sometimes. :)
Using one example from map:
my @squares = map { $_ * $_ } grep { $_ > 5 } @numbers;
I suppose I can split the string into an array and grep. Am I right?
grep { @array } @string; # something like grep {/(word|test)/} @string but I want to use array
my @word_roots = qw( word test );
my $pat = join '|', map quotemeta, @word_roots;
my $re = qr/\b(?:$pat)\w*\b/; # \w* so that the bare roots match as well as "words", "tester", etc.
my @matches = $string =~ /($re)/g;
How about something like this from a re.pl session:
$ my @array = qw(word test)
$VAR1 = 'word';
$VAR2 = 'test';
$ my $string = ' the word is test, I said'
the word is test, I said
$ my @match_array = map { $string =~ /\b($_)\b/ } @array
$VAR1 = 'word';
$VAR2 = 'test';
The parentheses around $_ capture the match in the regex inside map.
The \b ensures that we only match if the word is found on its own (like "test" or "word"), and not inside longer words that merely contain those characters, like "sword" or "brightest". See http://www.regular-expressions.info/wordboundaries.html for more details on \b.
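If suffixed forms such as "words" or "tested" should also be found, one option is to build a single alternation from the array and allow optional trailing word characters. The following is a sketch under that assumption, with a made-up test sentence.

```perl
use strict;
use warnings;

my @word_roots = qw(word test);

# Hypothetical sample text for illustration.
my $string = 'the words were tested; a sword met the brightest test';

# Escape each root, join them into one alternation, and allow an
# optional run of word characters after the root.
my $pat = join '|', map { quotemeta } @word_roots;
my $re  = qr/\b(?:$pat)\w*\b/;

# In list context, m//g returns every captured match.
my @matches = $string =~ /($re)/g;

print join( ",", @matches ), "\n";    # words,tested,test
```

The leading \b still rejects "sword" and "brightest", because the root does not start at a word boundary there. quotemeta keeps any regex metacharacters in the array elements from being interpreted as part of the pattern.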

Regex to match only innermost delimited sequence

I have a string that contains sequences delimited by multiple characters: << and >>. I need a regular expression to only give me the innermost sequences. I have tried lookaheads but they don't seem to work in the way I expect them to.
Here is a test string:
'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>'
It should return:
but match this
this too
and <also> this
As you can see with the third result, I can't just use /<<[^>]+>>/ because the string may have one character of the delimiters, but not two in a row.
I'm fresh out of trial-and-error. Seems to me this shouldn't be this complicated.
@matches = $string =~ /(<<(?:(?!<<|>>).)*>>)/g;
(?:(?!PAT).)* is to patterns as [^CHAR]* is to characters.
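As a sketch of how this plays out on the test string from the question, the variant below moves the capture group inside the delimiters so that only the inner text is returned. That is an assumption about the desired output, based on the expected results listed in the question.

```perl
use strict;
use warnings;

my $string = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';

# Each "." is consumed only at positions where neither "<<" nor ">>"
# starts, so a match can never contain a nested delimiter pair.
# Capturing only the inside drops the delimiters themselves.
my @inner = $string =~ /<<((?:(?!<<|>>).)*)>>/g;

print "$_\n" for @inner;
```

The outer "<<BUT NOT THIS ... CHILDREN>>" sequence is skipped because the match attempt starting at its opening "<<" runs into the nested "<<" before it can reach a closing ">>".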
$string = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
@matches = $string =~ /(<<(?:[^<>]+|<(?!<)|>(?!>))*>>)/g;
Here's a way to use split for the job:
my $str = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
my @a = split /(?=<<)/, $str;
@a = map { split /(?<=>>)/, $_ } @a;
my @match = grep { /^<<.*?>>$/ } @a;
This keeps the delimiters in the results. If you want them removed, just do:
@match = map { s/^<<//; s/>>$//; $_ } @match;