Splitting a css selector into its components [duplicate] - regex

I have a string in Perl: 'CCCCCCCC^hC^iC^*C^"C^8A'.
I want to split this string using a regular expression: "^[any_character]C". In other words, I want to split it by the actual character ^, followed by any character, followed by a specific letter (in this case C, but it could be A, or any other character).
I have tried looking at other questions/posts and finally came up with my @split_str = split(/\^(\.)C/, $letters), but this does not seem to work.
I'm sure I'm doing something wrong, but I don't know what.

You were very close. There were just a couple of errors in your code. Before I explain them, here's the code I was using to test solutions.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
$_ = 'CCCCCCCC^hC^iC^*C^"C^8A';
my @data = split /\^(\.)C/;
say Dumper @data;
Running this with your original regex, we get this output:
$VAR1 = 'CCCCCCCC^hC^iC^*C^"C^8A';
No splitting has taken place at all. That's because your regex includes \.. The dot matches any character in a string, but by escaping it with the backslash you have told Perl to treat it as an ordinary dot. There are no dots in your string, so the regex doesn't match and the string is not split.
If we remove the backslash, we get this output:
$VAR1 = 'CCCCCCCC';
$VAR2 = 'h';
$VAR3 = '';
$VAR4 = 'i';
$VAR5 = '';
$VAR6 = '*';
$VAR7 = '';
$VAR8 = '"';
$VAR9 = '^8A';
This is better. Some splitting has taken place. But because we have parentheses around the dot ((.)), Perl has "captured" the characters that the dot matches and added them to the list of values that split() returns.
If we remove those parentheses, we get only the values between the split markers.
$VAR1 = 'CCCCCCCC';
$VAR2 = '';
$VAR3 = '';
$VAR4 = '';
$VAR5 = '^8A';
Note that we get a few empty elements. That's because in places like "^hC^iC" in your string, there is no data between two adjacent split markers.
By moving the parentheses around the whole of the regex (split /(\^.C)/), we can get a list which includes all of the split markers together with the data between them.
$VAR1 = 'CCCCCCCC';
$VAR2 = '^hC';
$VAR3 = '';
$VAR4 = '^iC';
$VAR5 = '';
$VAR6 = '^*C';
$VAR7 = '';
$VAR8 = '^"C';
$VAR9 = '^8A';
Which of these options is most useful to you depends on exactly what you're trying to do.
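If the empty elements are not wanted, one option is to filter them out with grep. This is only a minimal sketch, assuming you care about the non-empty pieces alone; the variable names are illustrative and not taken from the question.

```perl
use strict;
use warnings;

my $letters = 'CCCCCCCC^hC^iC^*C^"C^8A';

# Split on a literal caret, any one character, then C,
# and keep only the fields that are not empty strings.
my @pieces = grep { length } split /\^.C/, $letters;

print join(', ', @pieces), "\n";
```

Here grep { length } drops every field whose length is zero, so only the data between the markers survives.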

When you say [any_character], you presumably mean the . pattern. A dot matches any character except a newline; with the s modifier it matches newlines as well.
So, in your case, you just should not have escaped the dot:
@split_str = split /\^.C/, $letters;
Or, with the s modifier:
@split_str = split /\^.C/s, $letters;
The caret must be escaped to denote a literal caret in a regex pattern.
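The difference the s modifier makes only shows up when the "any character" slot holds a newline. A small sketch, using a made-up input string (not from the question) in which that is the case:

```perl
use strict;
use warnings;

# Hypothetical input: the character between ^ and C is a newline.
my $letters = "AAA^\nCBBB";

my @without_s = split /\^.C/,  $letters;   # dot does not match "\n"
my @with_s    = split /\^.C/s, $letters;   # dot matches "\n" as well

print scalar(@without_s), " field(s) without /s\n";
print scalar(@with_s),    " field(s) with /s\n";
```

Without /s the pattern never matches, so the string comes back as a single field; with /s it is split into AAA and BBB.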

There was also a question about counting rather than splitting.
That can be done with a global substitution (s///g), which returns the number of substitutions it made. Note that $_ then contains the modified text, so it is reset before each count:
my $text = 'CCCCCCCC^hC^iC^*C^"C^8C^9A^!B'; # a little longer than yours
$_ = $text ;
my $countanychar = s/\^.C//g ;
print "counting any char and C:\t $countanychar in $text\n";
$_ = $text ;
my $countnormalchar = s/\^\wC//g ; # h, i and 8 in this example; skips the * and "
print "counting normal char and C:\t $countnormalchar in $text\n";
$_ = $text ;
my $countnumber = s/\^\dC//g ; # the 8 in this example
print "counting number and C:\t $countnumber in $text\n";
$_ = $text ;
my $countextended = s/\^.\w//g ; # any character, then any word character: the C, A and B endings
print "counting extended C and A and B:\t $countextended in $text\n";
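If the text should not be modified while counting, a list-context match can be counted instead of a substitution. The following is a sketch of that alternative, not the approach used above; count_matches is a hypothetical helper name.

```perl
use strict;
use warnings;

# count_matches is a hypothetical helper name, not part of any library.
sub count_matches {
    my ($text, $re) = @_;
    # Assigning the match list to () and then to a scalar yields the
    # number of matches; $text itself is left unchanged.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $text = 'CCCCCCCC^hC^iC^*C^"C^8C^9A^!B';

printf "any char and C:  %d\n", count_matches($text, qr/\^.C/);
printf "word char and C: %d\n", count_matches($text, qr/\^\wC/);
printf "digit and C:     %d\n", count_matches($text, qr/\^\dC/);
```

This assumes the pattern has no capture groups; with groups, the list would hold one element per group per match and the count would be inflated.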

Try it like this: @split_str = split(/\^/, $letters)

Related

perl regex match using global switch

I am trying to match a word that starts with a letter and is followed by "at".
I use this regex for it:
use strict;
use warnings;
use Data::Dumper;
my $str = "fat 123 cat sat on the mat";
my @a = $str =~ /(\s?[a-z]{1,2}(at)\s?)/g;
print Dumper( @a );
The output I am getting is:
$ perl ~/playground/regex.pl
$VAR1 = 'fat ';
$VAR2 = 'at';
$VAR3 = ' cat ';
$VAR4 = 'at';
$VAR5 = 'sat ';
$VAR6 = 'at';
$VAR7 = ' mat';
$VAR8 = 'at';
Why does it match "at" as well, when I clearly say to match just 1 character before "at"?
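Your optional spaces aren't a good way to delimit words: because they are optional, they don't force a match to start or end at the edge of a word. The extra "at" elements come from the inner capture group (at): in list context, a /g match returns every capture group for each match.
Use the word boundary construct \b for a rough match to the ends of words: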
use strict;
use warnings;
use Data::Dumper;
my $str = "fat 123 cat sat on the mat";
my @aa = $str =~ /\b[a-z]+at\b/gi;
print Dumper \@aa;
output
$VAR1 = [
'fat',
'cat',
'sat',
'mat'
];
If you want to be more clever and be certain that the word found isn't preceded or followed by a non-space character then you can write this instead
my @aa = $str =~ /(?<!\S)[a-z]+at(?!\S)/gi;
which produces the same result for the data you show
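To see how capture groups affect what a list-context /g match returns, the two patterns below differ only in whether the inner group captures. This is an illustrative sketch; the variable names are not from the question.

```perl
use strict;
use warnings;

my $str = "fat 123 cat sat on the mat";

# Two capturing groups: each match contributes two list elements.
my @two_groups = $str =~ /(\b[a-z]+(at)\b)/g;

# Inner group made non-capturing with (?:...): one element per match.
my @one_group  = $str =~ /(\b[a-z]+(?:at)\b)/g;

print scalar(@two_groups), " elements with (at)\n";
print scalar(@one_group),  " elements with (?:at)\n";
```

With the capturing (at), the list alternates between the whole word and "at"; with (?:at), only the words remain.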

perl regex square brackets and single quotes

I have this string:
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
The data is repeated.
I need to remove the []' characters from the data so it looks like this:
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
I'm also trying to split the data to assign it to variables as seen below:
my($currency, $strike, $tenor, $tenor2,$ado_symbol) = split /,/, $_;
This works for everything but the ['TEST'] section. Should I remove the []' characters first then keep my split the same or is there an easier way to do this?
Thanks
Something that's useful to know is that split takes a regex. (It'll even let you capture, but the captured text is inserted into the returned list, which is why I've used (?: ... ) non-capturing groups.)
I observe that your data only has [' and '] right next to the delimiter, so how about:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
chomp;
my @fields = split /(?:\'])?,(?:\[\')?/;
print Dumper \@fields;
}
__DATA__
ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722
Output:
$VAR1 = [
'ABC',
'-0.5',
'10Y',
'10Y',
'TEST',
'ABC.1000145721ABC',
'-0.5',
'20Y',
'10Y',
'TEST',
'ABC.1000145722'
];
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
$str =~ s/\['|'\]//g;
print $str;
output is
ABC,-0.5,10Y,10Y,TEST,ABC.1000145721ABC,-0.5,20Y,10Y,TEST,ABC.1000145722
Now you can split.
Clean up $ado_symbol after split:
$ado_symbol =~ s/^\['//;
$ado_symbol =~ s/'\]$//;
You can use a global regex match to find all substrings that contain no comma, single quote, or square bracket, like this:
use strict;
use warnings 'all';
my $s = q{ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722};
my @data = $s =~ /[^,'\[\]]+/g;
my ( $currency, $strike, $tenor, $tenor2, $ado_symbol ) = @data;
print "\$currency = $currency\n";
print "\$strike = $strike\n";
print "\$tenor = $tenor\n";
print "\$tenor2 = $tenor2\n";
print "\$ado_symbol = $ado_symbol\n";
output
$currency = ABC
$strike = -0.5
$tenor = 10Y
$tenor2 = 10Y
$ado_symbol = TEST
Another alternative
my $str = "ABC,-0.5,10Y,10Y,['TEST'],ABC.1000145721ABC,-0.5,20Y,10Y,['TEST'],ABC.1000145722";
my ($currency, $strike, $tenor, $tenor2,$ado_symbol) = map{ s/[^A-Z0-9\.-]//g; $_} split ',',$str;
print "$currency, $strike, $tenor, $tenor2, $ado_symbol",$/;
Output is:
ABC, -0.5, 10Y, 10Y, TEST

perl - splitting string with quoted characters

I want to split the following string at the pipe character, without splitting at the escaped pipe:
"123|ABC|x\|yz|123" should result in ["123","ABC","x|yz","123"]
Does anyone have such a split regexp for Perl?
You could use a negative lookbehind:
use warnings 'all';
use strict;
use Data::Dumper;
my $str = '123|ABC|x\|yz|123';
my @bits = split /(?<!\\)\|/, $str;
print Dumper(@bits);
Results in:
$VAR1 = '123';
$VAR2 = 'ABC';
$VAR3 = 'x\\|yz';
$VAR4 = '123';
As pointed out by Wiktor, if your string were of the form:
my $str = '123|ABC|x\|yz|123\\|456|123\\345';
the 123\\ would be grouped with 456 (although the last string, 123\\345, would be okay):
$VAR1 = '123';
$VAR2 = 'ABC';
$VAR3 = 'x\\|yz';
$VAR4 = '123\\|456';
$VAR5 = '123\\345';
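This is because the negative lookbehind only checks for a single backslash immediately before the pipe, so it cannot tell an escaped pipe from a pipe that follows an escaped backslash.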
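One way around that limitation is to stop using split and instead walk the string field by field, treating a backslash plus the following character as a unit. The following is a rough sketch of that idea, not a drop-in replacement: split_unescaped is a hypothetical name, and it assumes the input never ends in a lone backslash.

```perl
use strict;
use warnings;

# split_unescaped is a hypothetical helper name.
# Assumption: the input does not end with a single unpaired backslash.
sub split_unescaped {
    my ($str) = @_;
    my @fields;
    # A field is a run of ordinary characters or backslash-escaped pairs,
    # followed by either a delimiter pipe or the end of the string.
    while ($str =~ /\G((?:[^\\|]|\\.)*)(\||\z)/gs) {
        my ($field, $delim) = ($1, $2);
        $field =~ s/\\(.)/$1/gs;   # drop the escaping backslashes
        push @fields, $field;
        last if $delim eq '';      # matched \z: end of string reached
    }
    return @fields;
}

print join(' / ', split_unescaped('123|ABC|x\|yz|123')), "\n";
```

Because an escaped backslash is consumed as a pair, a pipe that follows it is seen as a real delimiter, which the single-character lookbehind cannot do.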

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
@str = $string =~ /(\w+)\1+/i;
print "@str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need String::LCSS_XS, which calculates longest common substrings. Don't try its Perl-only twin brother String::LCSS, because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my @words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
    for my $j ($i + 1 .. $#words) {
        my $lcss = lcss(@words[$i,$j]);
        unless ($longest and length $lcss <= length $longest) {
            $longest = $lcss;
            ($first, $second) = @words[$i,$j];
        }
    }
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
use List::Util qw(max); # max() is not a Perl builtin
my $var = "home test results results-apr-25 results-apr-251.csv";
my @str = split " ", $var;
my %h;
my $last = pop @str;
while (my $curr = pop @str ) {
    # \Q...\E keeps characters such as "." from acting as regex metacharacters
    if(($curr =~ /^\Q$last\E/) || $last =~ /^\Q$curr\E/) {
        $h{length($curr)} = $curr;
    }
    $last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to do it without a loop, you will need the /g regex modifier.
This will get you all the repeating strings:
my @str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is followed by \s\1, without matching \s\1 itself. This is required to make sure that the next match attempt starts after the first string, not after the second.
Then it is simply a matter of extracting the longest string from @str:
my $longest = (sort { length $b <=> length $a } @str)[0];
(Do note that this is legible but far from the most efficient way of finding the longest value; that is the subject of a different question.)
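For what it's worth, the longest element can also be picked in a single pass with reduce from the core List::Util module, which avoids sorting the whole list. A minimal sketch under that assumption, reusing the string from the question:

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my $string = "home test results results-apr-25 results-apr-251.csv";

# Every non-space run that is immediately repeated after one whitespace character.
my @str = $string =~ /(\S+)(?=\s\1)/g;

# Keep whichever of the two candidates is longer; the earlier one wins ties.
# reduce returns undef if @str is empty.
my $longest = reduce { length($a) >= length($b) ? $a : $b } @str;

print "$longest\n" if defined $longest;
```

reduce walks the list once and carries the current best candidate in $a, so no sorted copy of the list is built.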
How about:
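use feature 'say'; # say is not enabled by default
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
# Try the longest possible repeat first and shrink until one is found
for (my $i=int($l/2); $i; $i--) {
    if ($var =~ /(\S{$i}).*\1/) {
        say "found: $1";
        last;
    }
}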
output:
found: results-apr-25