Distinguishing and substituting decimals in Perl - regex

I want to substitute decimals from commas to fullstops in a file and I wanted to try to do this in perl.
An example of my dataset looks something like this:
Species_1:0,12, Species_2:0,23, Species_3:2,53
I want to substitute the decimals but not all commas such that:
Species_1:0.12, Species_2:0.23, Species_3:2.53
I was thinking it might work using the substitution function like such:
$comma_file= "Species_1:0,12 , Species_2:0,23, Species_3:2,53"
$comma = "(:\d+/,\d)";
#match a colon, any digits after the colon, the wanted comma and digits preceding it
if ($comma_file =~ m/$comma/g) {
$comma_file =~ tr/,/./;
}
print "$comma_file\n";
However, when I tried this, what happened was that all my commas changed into fullstops, not just the ones I was targetting. Is it an issue with the regex or am I just not doing the match substitution correctly?
Thanks!

This :
use strict;
use warnings;
my $comma_file = "Species_1:0,12, Species_2:0,23, Species_3:2,53";
$comma_file =~ s/(\d+),(\d+)/$1.$2/g;
print $comma_file, "\n";
Yields :
Species_1:0.12, Species_2:0.23, Species_3:2.53
The regex searches for commas having at least one digit on both sides and replaces them with a dot.
Your code doesn’t work because you first check for commas surrounded by digits, and, if ok, you then replace ALL commas with dots

From the shown data it appears that a comma to be replaced must always have a number on each side, and that every such occurrence need be replaced. There is a fine answer by GMB.
Another way for this kind of a problem is to use lookarounds
$comma_file =~ s/(?<=[0-9]),(?=[0-9])/./g;
which should be more efficient, as there is no copying into $1 and $2 and no quantifiers.
My benchmark
use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);
my $str = q(Species_1:0,12, Species_2:0,23, Species_3:2,53);
sub subs {
my ($str) = #_;
$str =~ s/(\d+),(\d+)/$1.$2/g;
return $str;
}
sub look {
my ($str) = #_;
$str =~ s/(?<=\d),(?=\d)/./g;
return $str;
}
die "Output not equal" if subs($str) ne look($str);
cmpthese(-3, {
subs => sub { my $res = subs($str) },
look => sub { my $res = look($str) },
});
with output
Rate subs look
subs 256126/s -- -46%
look 472677/s 85% --
This is only one, particular, string but the efficiency advantage should only increase with the length of the string, while longer patterns (numbers here) should reduce that a little.

Related

extract string between two dots

I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in perl?
I tried the following expression but it assigns 1 to sub:
#perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0#
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
#perleval $vars{sub} = $vars{string} ; 0#
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
Try:
/\.([^\.]+)\./
. has a special meaning and would need to be escaped. Then you would want to capture the values between the dots, so use a negative character class like ([^\.]+) meaning at least one non-dot. if you use (.*) you will get:
word1.stuff1.stuff2.stuff3.word2 to result in:
stuff1.stuff2.stuff3
But maybe you want that?
Here is my little example, I do find the perl one liners a little harder to read at times so I break it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you search a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my #matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parenthesis around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (#matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing _null with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my #text;
if ($text =~ /,/) {
#text = split /,(?:null_)?/, $text;
}
else {
#text = $text;
}
for (#text) {
print "1 is : $_\n";
push (#matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)

Replace only the second occurance of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven". Treat each "|" delimiter for each column.
By splitting the string with "|" saving all the columns in array and reading each column into $str
So, I'm trying to do this as
$string =~ s/$str/venky/g if $str =~ /ven/i; # it will do globally.
Which not met the requirement.
On-demand basis, I need to replace string at the particular number of occurrence of the string.
For example, I've a request to change 2nd occurrence of "ven" to venky.
Then how can I met this requirement simply? Is it some-thing like
$string =~ s/ven/venky/2;
As of my knowledge we have 'o' for replace once and 'g' for globally. I'm struggling for the solution to get the replacement at particular occurrence. And I should not use pos() to get the position, because string keeps on change. It becomes difficult to trace it every-time. That's my intention.
Please help me on this regard.
There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist to use one regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - matches any string that isn't part of ven
\K is a cool matcher that drops everything matched so far, so you can sort of use it as a lookbehind with variable length
Currently you're replacing every instance of'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is to substitute 'ven' for 'venky' within your string if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my #elements = split(/\|/, $string);
my $count;
foreach (#elements){
$count++;
s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', #elements);
print "\n";
Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurance you want to replace
$needle is the thing you want to replace – thing needle in a haystack
Note that right now we allow to pass stuff that might contain regular expressions. So you would have to use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
my ($string, $n, $needle, $replacement) = #_;
# take the string appart
my #elements = split /\|/, $string;
my $count = 0; # keep track of ...
foreach my $e (#elements) {
$count++ if $e =~ m/$needle/; # ... how many matches we've found
if ($count == $n) {
$e =~ s/$needle/$replacement/; # replace
last; # and stop processing
}
}
# put it back into the pipe-separated format
return join '|', #elements;
}
Output:
ven|venky|vett|vejj|ven|ven
To replace the n'th occurrence of "ven" to "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the ability with the /e flag to specify the substitution part as an expression.

Perl regex output hyphens multiple "words"

When I have a string with multiple hyphens in it, I seem to be able to find the (only) desired value, but why are there multiple outputs? I'd like to just report the matched string entirely, with hyphens. I've included what the output probably is, along with a way to rebuild the string, but this method seems like unnecessary work.
my $string = "phonenumber123-456-7890";
my #secondStrings = $string =~ m/(\d+)-(\d+)-(\d+)/g;
foreach (#secondStrings){
print $_, "\n";
}
if ($string =~ m/(\d+)-(\d+)-(\d+)/g){
print $1."-".$2."-".$3, "\n";
}
I believe you just want to put the entire phone number (123-456-7890) into 1 capture group, right now you are using 3.
my ($number) = $string =~ m/(\d+-\d+-\d+)/g;
Further reading can be found here: http://perldoc.perl.org/perlre.html#Capture-groups

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as #ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop through at max twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.