Interpreting powers in perl using regex - regex

I'm trying to read in two numbers separated by a : and perform a comparison.
Below is some code that illustrates my problems:
use strict;
use warnings;
my #nums = qw(1.23:2.13 0.1:2.11 1.17772e+06:1.32 2:10.2);
for my $number (#nums){
print "actual numbers $number\n";
my ($c, $e) = ($1, $2) if $number =~ /(\d+\.\d+|\d+):(\d+\.\d+|\d+)/;
print "regex matches: $c:$e\n";
}
Which outputs:
actual numbers 1.23:2.13
regex matches: 1.23:2.13
actual numbers 0.1:2.11
regex matches: 0.1:2.11
actual numbers 1.17772e+06:1.32
regex matches: 06:1.32 # not capturing 1.17772e+06
actual numbers 2:10.2
regex matches: 2:10.2
My question is: How can I a) capture 1.17772e+06 and b) evaluate it as a number?

From perldata:
/^(?:[+-]?)(?=\d|\.\d)\d*(?:\.\d*)?(?:[Ee](?:[+-]?\d+))?$/
Or,
use Regexp::Common;
/$RE{num}{real}/
(These assume you want Perl's definition of a number.)

I would just use the split function (split /:/) here.

Related

Distinguishing and substituting decimals in Perl

I want to substitute decimals from commas to fullstops in a file and I wanted to try to do this in perl.
An example of my dataset looks something like this:
Species_1:0,12, Species_2:0,23, Species_3:2,53
I want to substitute the decimals but not all commas such that:
Species_1:0.12, Species_2:0.23, Species_3:2.53
I was thinking it might work using the substitution function like such:
$comma_file= "Species_1:0,12 , Species_2:0,23, Species_3:2,53"
$comma = "(:\d+/,\d)";
#match a colon, any digits after the colon, the wanted comma and digits preceding it
if ($comma_file =~ m/$comma/g) {
$comma_file =~ tr/,/./;
}
print "$comma_file\n";
However, when I tried this, what happened was that all my commas changed into fullstops, not just the ones I was targetting. Is it an issue with the regex or am I just not doing the match substitution correctly?
Thanks!
This :
use strict;
use warnings;
my $comma_file = "Species_1:0,12, Species_2:0,23, Species_3:2,53";
$comma_file =~ s/(\d+),(\d+)/$1.$2/g;
print $comma_file, "\n";
Yields :
Species_1:0.12, Species_2:0.23, Species_3:2.53
The regex searches for commas having at least one digit on both sides and replaces them with a dot.
Your code doesn’t work because you first check for commas surrounded by digits, and, if ok, you then replace ALL commas with dots
From the shown data it appears that a comma to be replaced must always have a number on each side, and that every such occurrence need be replaced. There is a fine answer by GMB.
Another way for this kind of a problem is to use lookarounds
$comma_file =~ s/(?<=[0-9]),(?=[0-9])/./g;
which should be more efficient, as there is no copying into $1 and $2 and no quantifiers.
My benchmark
use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);
my $str = q(Species_1:0,12, Species_2:0,23, Species_3:2,53);
sub subs {
my ($str) = #_;
$str =~ s/(\d+),(\d+)/$1.$2/g;
return $str;
}
sub look {
my ($str) = #_;
$str =~ s/(?<=\d),(?=\d)/./g;
return $str;
}
die "Output not equal" if subs($str) ne look($str);
cmpthese(-3, {
subs => sub { my $res = subs($str) },
look => sub { my $res = look($str) },
});
with output
Rate subs look
subs 256126/s -- -46%
look 472677/s 85% --
This is only one, particular, string but the efficiency advantage should only increase with the length of the string, while longer patterns (numbers here) should reduce that a little.

Perl split by regexp issue

I'm writing some parser on Perl and here is a problem with split. Here is my code:
my $str = 'a,b,"c,d",e';
my #arr = split(/,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
# try to split the string by comma delimiter, but only if comma is followed by the even or zero number of quotes
foreach my $val (#arr) {
print "$val\n"
}
I'm expecting the following:
a
b
"c,d"
e
But this is what am I really received:
a
b,"c,d"
b
"c,d"
"c,d"
e
I see my string parts are in array, their indices are 0, 2, 4, 6. But how to avoid these odd b,"c,d" and other rest string parts in the resulting array? Is there any error in my regexp delimiter or is there some special split options?
You need to use a non-capturing group:
my #arr = split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
^^
See IDEONE demo
Otherwise, the captured texts are output as part of the resulting array.
See perldoc reference:
If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
What's tripping you up is a feature in split in that if you're using a group, and it's set to capture - it returns the captured 'bit' as well.
But rather than using split I would suggest the Text::CSV module, that already handles quoting for you:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new();
my $fields = $csv->getline( \*DATA );
print join "\n", #$fields;
__DATA__
a,b,"c,d",e
Prints:
a
b
c,d
e
My reasoning is fairly simple - you're doing quote matching and may have things like quoted/escaped quotes, etc. mean you're trying to do a recursive parse, which is something regex simply isn't well suited to doing.
You can use parse_line() of Text::ParseWords, if you are not really bounded for regex:
use Text::ParseWords;
my $str = 'a,b,"c,d",e';
my #arr = parse_line(',', 1, $str);
foreach (#arr)
{
print "$_\n";
}
Output:
a
b
"c,d"
e
Do matching instead of splitting.
use strict; use warnings;
my $str = 'a,b,"c,d",e';
my #matches = $str =~ /"[^"]*"|[^,]+/g;
foreach my $val (#matches) {
print "$val\n"
}

Match the nth longest possible string in Perl

The pattern matching quantifiers of a Perl regular expression are "greedy" (they match the longest possible string). To force the match to be "ungreedy", a ? can be appended to the pattern quantifier (*, +).
Here is an example:
#!/usr/bin/perl
$string="111s11111s";
#-- greedy match
$string =~ /^(.*)s/;
print "$1\n"; # prints 111s11111
#-- ungreedy match
$string =~ /^(.*?)s/;
print "$1\n"; # prints 111
But how one can find the second, third and .. possible string match in Perl? Make a simple example of yours --if need a better one.
Utilize a conditional expression, a code expression, and backtracking control verbs.
my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;
The above will use greedy matching, but will cause the largest match to intentionally fail. If you wanted the 3rd largest, you could just set the number of skips to 2.
Demonstrated below:
#!/usr/bin/perl
use strict;
use warnings;
my $string = "111s11111s11111s";
$string =~ /^(.*)s/;
print "Greedy match - $1\n";
$string =~ /^(.*?)s/;
print "Ungreedy match - $1\n";
my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;
print "2nd Greedy match - $1\n";
Outputs:
Greedy match - 111s11111s11111
Ungreedy match - 111
2nd Greedy match - 111s11111
When using such advanced features, it is important to have a full understanding of regular expressions to predict the results. This particular case works because the regex is fixed on one end with ^. That means that we know that each subsequent match is also one shorter than the previous. However, if both ends could shift, we could not necessarily predict order.
If that were the case, then you find them all, and then you sort them:
use strict;
use warnings;
my $string = "111s11111s";
my #seqs;
$string =~ /^(.*)s(?{push #seqs, $1})(*FAIL)/;
my #sorted = sort {length $b <=> length $a} #seqs;
use Data::Dump;
dd #sorted;
Outputs:
("111s11111s11111", "111s11111", 111)
Note for Perl versions prior to v5.18
Perl v5.18 introduced a change, /(?{})/ and /(??{})/ have been heavily reworked, that enabled the scope of lexical variables to work properly in code expressions as utilized above. Before then, the above code would result in the following errors, as demonstrated in this subroutine version run under v5.16.2:
Variable "$skips" will not stay shared at (re_eval 1) line 1.
Variable "#seqs" will not stay shared at (re_eval 2) line 1.
The fix for older implementations of RE code expressions is to declare the variables with our, and for further good coding practices, to localize them when initialized. This is demonstrated in this modified subroutine version run under v5.16.2, or as put below:
local our #seqs;
$string =~ /^(.*)s(?{push #seqs, $1})(*FAIL)/;
Start by getting all possible matches.
my $string = "111s1111s11111s";
local our #matches;
$string =~ /^(.*)s(?{ push #matches, $1 })(?!)/;
This finds
111s1111s11111
111s1111
111
Then, it's just a matter of finding out which one is the second longuest and filtering out the others.
use List::MoreUtils qw( uniq );
my $target_length = ( sort { $b <=> $a } uniq map length, #matches )[1];
#matches = uniq grep { length($_) == $target_length } #matches
if $target_length;

Perl alphabetic sort with alphanumeric string and regex

I am trying to sort alphanumeric string with perl and am facing the following problem:
Strings are like this: "XXXXX-1.0.0" where numbers can be on one or two digits and are representing release versions.
My problem is that when I have the two following strings:
- XXXXX-1.9.9
- XXXXX-1.10.0
XXXXX-1.9.9 is considered as greater than XXXXX-1.10.0 since 9 is greater than 1.
So I am trying to force numbers to be on two digits with a regexp.
There is the piece of code I am testing:
my $string = "XXXXXX-1.9.9";
my $pre = "0";
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
print "$string\n";
This gives me the result "XXXXXX-01.9.09" which is not what I am looking for since it will not be sorted correctly. So I have to do that:
my $string = "XXXXXX-1.9.9";
my $pre = "0";
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
$string =~ s/(\.|-)+(\d{1})($|\.)/$1$pre$2$3/g;
print "$string\n";
To get "XXXXXX-01.09.09"
My question is double:
- Is there a way to sort my strings with Perl without using regex?
- If I have to use regex is there a way to write it so that I do not have to execute it twice?
Thank you in advance.
You can use look around assertions to make sure there are not digits around the digit you are replacing without moving the position over the neighbouring characters.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my #strings = qw( XXXXX-1.9.9
XXXXX-1.9.10
XXXXX-1.10.0 );
s/(?<=[^0-9]) ([0-9]) (?=[^0-9]|$) /0$1/xg for #strings;
#strings = sort #strings;
s/(?<=[^0-9]) 0+ ([0-9]) /$1/xg for #strings;
say for #strings;

How can I keep only the numbers in a variable in Perl?

I have a variable like this below:
G12345(##)
How can I keep in the variable only the numbers 12345. I have done it before in PHP but I cannot find a way in Perl.
$v =~ s/\D//g; should do the trick.
(Regular expression substitute "Not a number" with "nothing", globally)
This can also be done without regular expressions: Transliterate: tr///
use warnings;
use strict;
my $s = 'G12345(##)';
$s =~ tr/0-9//cd;
print "$s\n";
__END__
12345
Substitute any non-numeric characters with an empty string (\D is non-numeric):
$var =~ s/\D+//g;
You can also do it this way:
my ( $number ) = $string =~ /(\d+)/;
This means that were there some other digits to occur after the '(##)' --for whatever reason, that you would not suddenly concatenate those digits to the number that lies between 'G' and '('. So the capture method makes sure you get the first set of contiguous digits.