Perl regex output hyphens multiple "words" - regex

When I have a string with multiple hyphens in it, I seem to be able to find the (only) desired value, but why are there multiple outputs? I'd like to just report the matched string entirely, with hyphens. I've included what the output probably is, along with a way to rebuild the string, but this method seems like unnecessary work.
my $string = "phonenumber123-456-7890";
my @secondStrings = $string =~ m/(\d+)-(\d+)-(\d+)/g;
foreach (@secondStrings){
print $_, "\n";
}
if ($string =~ m/(\d+)-(\d+)-(\d+)/g){
print $1."-".$2."-".$3, "\n";
}

I believe you just want to put the entire phone number (123-456-7890) into one capture group; right now you are using three, and a match in list context returns one value per group.
my ($number) = $string =~ m/(\d+-\d+-\d+)/g;
Further reading can be found here: http://perldoc.perl.org/perlre.html#Capture-groups
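As a minimal sketch of the difference (the names @parts and $whole are only illustrative): in list context a match returns one value per capture group, so three groups give three values and a single group gives one.

```perl
use strict;
use warnings;

my $string = "phonenumber123-456-7890";

# Three capture groups: list context returns one value per group.
my @parts = $string =~ m/(\d+)-(\d+)-(\d+)/;

# One capture group around the whole number: a single value comes back.
my ($whole) = $string =~ m/(\d+-\d+-\d+)/;

print scalar(@parts), " parts: @parts\n";   # 3 parts: 123 456 7890
print "whole: $whole\n";                    # whole: 123-456-7890
```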

Related

Distinguishing and substituting decimals in Perl

I want to substitute decimals from commas to fullstops in a file and I wanted to try to do this in perl.
An example of my dataset looks something like this:
Species_1:0,12, Species_2:0,23, Species_3:2,53
I want to substitute the decimals but not all commas such that:
Species_1:0.12, Species_2:0.23, Species_3:2.53
I was thinking it might work using the substitution function like such:
$comma_file= "Species_1:0,12 , Species_2:0,23, Species_3:2,53"
$comma = "(:\d+/,\d)";
#match a colon, any digits after the colon, the wanted comma and digits preceding it
if ($comma_file =~ m/$comma/g) {
$comma_file =~ tr/,/./;
}
print "$comma_file\n";
However, when I tried this, all my commas changed into fullstops, not just the ones I was targeting. Is it an issue with the regex or am I just not doing the match substitution correctly?
Thanks!
This:
use strict;
use warnings;
my $comma_file = "Species_1:0,12, Species_2:0,23, Species_3:2,53";
$comma_file =~ s/(\d+),(\d+)/$1.$2/g;
print $comma_file, "\n";
Yields:
Species_1:0.12, Species_2:0.23, Species_3:2.53
The regex searches for commas having at least one digit on both sides and replaces them with a dot.
Your code doesn't work because you first check whether a comma is surrounded by digits and, if so, you then replace ALL commas with dots: tr/,/./ acts on the whole string, not on the part that matched.
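To make the difference concrete, here is a small sketch using the sample data from the question ($with_tr and $with_s are illustrative names). tr/// transliterates every comma in the string, while s///g only rewrites the commas that the pattern actually matched.

```perl
use strict;
use warnings;

my $line = "Species_1:0,12, Species_2:0,23, Species_3:2,53";

# tr/// ignores any earlier match and converts every comma.
(my $with_tr = $line) =~ tr/,/./;

# s///g only touches commas that have a digit on both sides.
(my $with_s = $line) =~ s/(\d),(\d)/$1.$2/g;

print "$with_tr\n";   # Species_1:0.12. Species_2:0.23. Species_3:2.53
print "$with_s\n";    # Species_1:0.12, Species_2:0.23, Species_3:2.53
```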
From the shown data it appears that a comma to be replaced must always have a number on each side, and that every such occurrence need be replaced. There is a fine answer by GMB.
Another way to approach this kind of problem is to use lookarounds:
$comma_file =~ s/(?<=[0-9]),(?=[0-9])/./g;
which should be more efficient, as there is no copying into $1 and $2 and no quantifiers.
My benchmark
use warnings;
use strict;
use feature 'say';
use Benchmark qw(cmpthese);
my $str = q(Species_1:0,12, Species_2:0,23, Species_3:2,53);
sub subs {
my ($str) = @_;
$str =~ s/(\d+),(\d+)/$1.$2/g;
return $str;
}
sub look {
my ($str) = @_;
$str =~ s/(?<=\d),(?=\d)/./g;
return $str;
}
die "Output not equal" if subs($str) ne look($str);
cmpthese(-3, {
subs => sub { my $res = subs($str) },
look => sub { my $res = look($str) },
});
with output
         Rate subs look
subs 256126/s   -- -46%
look 472677/s  85%   --
This is only one particular string, but the efficiency advantage should only increase with the length of the string, while longer patterns (numbers here) should reduce it a little.

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers.
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end).
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string.
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space.
You need parentheses around $cmd_pattern; otherwise, you're getting two options, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression.
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing null_ with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my @text;
if ($text =~ /,/) {
@text = split /,(?:null_)?/, $text;
}
else {
@text = $text;
}
for (@text) {
print "1 is : $_\n";
push (@matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
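Putting the points above together, one possible way to also handle the lone CMD_ID=null line is sketched below. It assumes that a bare value of null should simply be skipped and that a null_ prefix should be stripped from every comma-separated item; the sample $output is a shortened copy of the data in the question.

```perl
use strict;
use warnings;

my $output = <<'END';
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
END

my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;

# /m lets ^ and $ match at every line; /g walks through all matches.
while ($output =~ /^\s*(?:$cmd_pattern)(.+)$/gm) {
    my $value = $1;                     # copy before running further regexes
    for my $item (split /,/, $value) {
        $item =~ s/^null_//;            # strip the prefix when present
        next if $item eq 'null';        # assumption: a bare "null" is not wanted
        push @matching_lines, $item;
    }
}

print "$_\n" for @matching_lines;
```

On the shortened sample this prints the five values listed as the desired output in the question.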

Whole word matching with unexpected insertion in data

Consider the string
my $string = 'String need to be evaluated';
In $string I'm searching for evaluated or any other word.
The problem is that there may be tags inserted into the string,
e.g. Str<data>ing need to be eval<data>ua<data>ted, which is unexpected.
In this case, how could I search for the words?
Here is the code I tried:
my $string = 'Text to be evaluated';
my $string2 = "Te<data>xt need to be eval<data2>ua<data>ted";
# pattern to match
$pattern = "evaluated";
@b = split('',$pattern);
for my $i(@b){
$i="$i"."\(?:<data>\)?";
print "$i#\n";
}
$pattern = join('',@b);
print "\n$pattern\n";
if ($string2 =~ /$pattern/){
print "$pattern found\n";
}
Do you suggest any other method or module to make it easy? I don't know what kind of data will get inserted.
Not sure if that is what you need, but how about:
@b = split('',$pattern);
for my $i(@b){
$i=$i.".*";
print "$i \n";
}
$pattern = join('',@b);
That should match any string that had the pattern before it got random insertions as long as the characters of the pattern are still there and in the correct order.
It does find evaluated in the string esouhgvw8vwrg355#*asrgl/\u[\w]atet(45)<data>efdvd, which is about as noisy as it gets. But of course, if it is impossible to distinguish between insertions and the original string, you will get "false" positives. For example, if the string used to be evaluted and it becomes something like evalu<hereisyourmissinga>ted, you will get a positive. Of course, if you knew that insertions would always be in tags while text is not, the other user's answer is much safer.
As long as you single quote your input string, characters like [\w] (45) and whatnot should not hurt either. I cannot see why they would be interpolated at any point.
Of course, you could use regexp to do the job:
foreach my $s ($string,$string2){
my $cs= $s;
### canonize
$cs =~ s!<[^>]*>!!gs;
### match
if ($cs =~ m!$pattern!i){
print "Found $pattern in $s!\n";
}
}
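If the insertions are known to always be tags, the per-character idea from the question can also be generalised without stripping the string first. The sketch below is one way to do it, assuming every insertion has the form <...> with no nested angle brackets; the helper name tag_tolerant is made up for illustration.

```perl
use strict;
use warnings;

# Build a regex that allows any number of <...> tags between the letters.
sub tag_tolerant {
    my ($word) = @_;
    my $gap  = '(?:<[^>]*>)*';
    # quotemeta keeps characters of the word from acting as regex syntax.
    my $body = join $gap, map { quotemeta($_) } split //, $word;
    return qr/\b$body\b/;
}

my $string2 = "Te<data>xt need to be eval<data2>ua<data>ted";
my $re      = tag_tolerant("evaluated");

print "found\n" if $string2 =~ $re;
```

Unlike the .* variant, this does not match when a letter is only supplied by the inserted data, because nothing outside angle brackets may appear between the letters.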

Perl, match one pattern multiple times in the same line delimited by unknown characters

I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.
One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while(my $w = <FH>) {
my @matches = $w =~ m/(HEY)/g;
my $count = scalar(@matches);
print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}
The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to keep looking through each line, counting every HEY and starting where you left off each time. That requires the /g at the end of the match, inside a while loop so that it keeps looking:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!
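A compact variant of the same idea, shown here as a sketch on the three sample lines from the question rather than on a file: assigning the match to an empty list forces list context, and the scalar assignment around it then receives the number of matches.

```perl
use strict;
use warnings;

my @lines = ("HEY", "HEY HEY", "HEYxjfkdsjfkajHEY");
my $count = 0;

for my $line (@lines) {
    # "() = ..." runs the match in list context; the outer scalar
    # assignment receives the number of elements that were returned.
    my $n = () = $line =~ /HEY/g;
    $count += $n;
}

print "$count\n";   # 5
```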
None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3) so $hash{$1}++ will just count the first match n times.
To get each match, the foreach needs a local variable assigned, which will then contain the match variable. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.
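For the follow-up in the question (storing each matched term with its number of occurrences), a while loop over a /g match in scalar context is another option: it performs one match per iteration, so $1 is refreshed every time. A sketch, using a made-up input string and \w+ standing in for the "real" pattern:

```perl
use strict;
use warnings;

my $text = "foo bar,foo;baz foo bar";
my %seen;

# Scalar-context //g resumes after the previous match on each iteration,
# so $1 always holds the current match.
while ($text =~ /(\w+)/g) {
    $seen{$1}++;
}

print "$_ => $seen{$_}\n" for sort keys %seen;
```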

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contain several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as @ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
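In Perl that could look like the sketch below. The /m modifier is an addition here, so that $ also matches at the end of every line rather than only at the end of the string; the data is the sample from the question.

```perl
use strict;
use warnings;

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
            . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

# The pattern is zero-width: it matches the empty position that has a comma
# behind it and either a comma or an end of line ahead of it.
$rawData =~ s/(?<=,)(?=,|$)/N\/A/gm;

print $rawData;
```

Because nothing is consumed, consecutive commas cannot be skipped, which is the same reason the lookahead version above needs only a single pass.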
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should loop at most twice.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.
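A short sketch of what the limit changes (the sample row is the tail of the second line of the question's data; @without and @with are illustrative names): by default split drops trailing empty fields, whereas a negative limit keeps them.

```perl
use strict;
use warnings;

my $row = "16,6,40,1028,12,WNW,10.4,,,,";

my @without = split /,/, $row;       # trailing empty fields are discarded
my @with    = split /,/, $row, -1;   # negative limit keeps them

print scalar(@without), "\n";   # 7
print scalar(@with), "\n";      # 11

my $filled = join ",", map { $_ eq "" ? "N/A" : $_ } @with;
print "$filled\n";   # 16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
```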