How do I do optional matching in a regular expression using Perl?

I want to extract the size value from a string. The string can be formatted in one of two ways:
Data-Size: (2000 bytes)
or
file Data-Size: (2082 bytes)
If the string is present in a file, it will appear only once.
So far I have:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "</tmp/test";
my $input = do { local $/; <FILE> };
my ($length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/m;
$length or die "could not get data length\n";
print "length: $length\n";
The problem seems to be with making the word file optional. I thought I could do this with:
(file)?
But this seems to be stopping matches when the word file is not present. Also, when the word file is there, it sets $length to the string "file". I think this is because the parentheses around file also capture.
So how do I match either of the two strings and extract the size value?

You want the second capture in $length. To do that, you could use
my (undef, $length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
or
my $length = ( $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/ )[1];
But a much better approach would be to avoid capturing something you're not interested in capturing.
my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
Of course, you'd get the same result from
my ($length) = $input =~ /Data-Size: \((\d+) bytes\)/;
By the way, I removed the needless /m. /m changes the meaning of ^ and $, yet neither is present in the pattern.
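As a quick check of the non-capturing version, here is a minimal sketch that runs the pattern against both input forms from the question. The array names @inputs and @sizes are only illustrative and do not come from the original code.

```perl
use strict;
use warnings;

# The two input forms shown in the question.
my @inputs = (
    "Data-Size: (2000 bytes)",
    "file Data-Size: (2082 bytes)",
);

my @sizes;
for my $input (@inputs) {
    # (?:file)? groups without capturing, so the digits are capture 1.
    my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
    defined $length or die "could not get data length\n";
    push @sizes, $length;
    print "length: $length\n";
}
```

Both forms should yield a size, because the optional group may match nothing at all.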

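Just my 2 cents: you can also make the match optional with an empty alternative:
/(file|)\s*Data-Size: \((\d+) bytes\)/
The literal parentheses must be escaped, and the size is then in the second capture group.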

Related

How to capture multiple words using regex on this particular text?

I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = @ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my @arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my @arr_found = ($content =~ m/"$regex"/g);
but it gave an error. How can I make this work?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) on Sublime Text 2, it finds only the one word titles.
Why not process the file line by line? It is simple and easy.
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (@jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push @jobs, $prev_line;
}
$prev_line = $line;
}
say for @jobs;
This relies on the requirement that the #N line is the first non-empty line after the jobs title.
It prints
Data Scientist
Programmer
SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
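A minimal sketch of that last idea, assuming the sample text sits in a string and that blank lines separate the title from the ranking line (the question's regex suggests so, but the exact layout is an assumption). The hash name %title_for is illustrative.

```perl
use strict;
use warnings;

# Sample text from the question; the blank-line layout is assumed.
my $text = <<'END_TEXT';
Data Scientist

#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate

Programmer

#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate

SAP Module Consultant

#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
END_TEXT

# Read the string through an in-memory filehandle, line by line.
open my $fh, '<', \$text or die "Can't open in-memory file: $!";

my (%title_for, $prev_line);
while (my $line = <$fh>) {
    chomp $line;
    next if $line !~ /\S/;                # skip empty lines
    if ($line =~ /^\s*#([0-9]+)/) {
        $title_for{$1} = $prev_line;      # rank => title
    }
    $prev_line = $line;
}
close $fh;

print "#$_ $title_for{$_}\n" for sort { $a <=> $b } keys %title_for;
```

Because empty lines are skipped, the loop behaves the same whether or not the blank lines are present.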
As for the regex, a few corrections are needed. To compose a regex ahead of matching, which is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my @titles = $content =~ /$regex/g;
which captures a line followed by at least one empty line and then #N on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or, perhaps better, rather than pushing our luck by reversing the list of matches, iterate through the pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
I think that it's fair to say that the regex here is much more complex than the first program.
Answers for your questions:
You are capturing the second word only, and you do not allow for whitespace between the words. That's why it won't match e.g. Data Scientist.
Use the qr// operator to compile regexes with dynamic content. The error stems from the $ in the middle of the regex, which the Perl regex compiler assumes you got wrong, because $ should only come at the end of a regex.
The following code should achieve what you want. Note the two-step approach:
Find matching text
beginning of a line (^)
one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
2 line ends (\n\n)
# followed by a number (\d+)
apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
Split match at line ends (\n) and extract the 1st and the 3rd field
as we know $match will contain three lines, this approach is much easier than writing another regex.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by @Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...
You were not taking whitespace (as in Data Scientist) into account:
^\w+.*$\R+#(\d+)
See a demo on regex101.com.
\R is roughly equivalent to (?>\r\n|\n|\r|\f|\x0b|\x85); it matches Unicode newline sequences, and in Perl it also matches \x{2028} and \x{2029}.

Replacing a single character in a perl regex match

How can I replace the 6th "_" that appears in the regex match?
Here is the literal input to be searched. It is not representing a path to the input:
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa
Here is my code, which parses out what I need. I just now need to replace the last matched "_" with a "/":
#!/usr/bin/perl
use strict;
use warnings;
open(IN, '<', '/Users/roblogan/Test_Database.txt') or die $!;
open(OUT, '>', '/Users/roblogan/Test_Output.txt') or die $!;
while (my $line = <IN>){
if ($line =~ m/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/){
print OUT $1, "\n";
}
}
Current output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30
Desired output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
I have tried:
if ($line =~ s/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0\/[0-9]*)/){
Any help would be appreciated.
This Perl code will do what I think you need, as determined from your subject line and example output.
It finds the sixth occurrence of an underscore in the target string and, if that underscore is followed by decimal digits, it changes the underscore to a slash and removes everything following the digits.
I have used the pipe character | as the delimiter for the substitute operator s/// to avoid the need to escape forward slashes.
use strict;
use warnings 'all';
my $path = q{/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa};
$path =~ s|^(?:[^_]*_){5}[^_]*\K_(\d+).*|/$1|s;
print $path, "\n";
output
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
From your description, the easiest way is:
$line =~ s!(m160505_031746_42156_c100980652550000001823221307061611_s1_p0)_!$1/!;
I've chosen ! as the delimiter because / is used in the replacement part.
$1 is a variable containing the text matched by the first ( ) group in the regex (I didn't want to repeat the whole thing twice).
The final _ is not included in $1 (it's outside of the parens); instead we put / in the replacement part.
See perldoc perlretut for more information.
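A related variant, sketched here as an assumption rather than as either answerer's method: keep the match from the question's loop, but capture the prefix and the digits separately and join them with a slash. The variable $result is illustrative.

```perl
use strict;
use warnings;

# The literal input line from the question.
my $line = '/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa';

my $result;
# Capture everything up to the sixth underscore as $1 and the digits after it as $2.
if ($line =~ m/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0)_([0-9]+)/) {
    $result = "$1/$2";      # rejoin with a slash instead of the underscore
    print "$result\n";
}
```

This leaves $line untouched, which may be convenient if the original path is still needed later.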

Replace only the second occurrence of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven", where each "|" delimits a column.
I split the string on "|", save the columns in an array, and read each column into $str.
So I'm trying to do this:
$string =~ s/$str/venky/g if $str =~ /ven/i; # this replaces globally
which does not meet the requirement.
I need to replace the string at a particular occurrence, on demand.
For example, I have a request to change the 2nd occurrence of "ven" to "venky".
How can I meet this requirement simply? Is there something like
$string =~ s/ven/venky/2;
As far as I know, we have 'o' for replacing once and 'g' for replacing globally. I'm struggling to find a way to replace a particular occurrence. I should not use pos() to get the position, because the string keeps changing, which makes it difficult to track every time.
Please help me with this.
There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist on using a single regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - matches a run of text that does not contain ven
\K is a cool matcher that drops everything matched so far, so you can sort of use it as a lookbehind with variable length
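The N-1 in the pattern above is a placeholder, not valid regex syntax. A minimal sketch of filling it in from a variable, assuming Perl 5.10 or later for \K, could look like this. The helper name replace_nth_ven is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the $n-th "ven" with "venky" using one regex.
sub replace_nth_ven {
    my ($string, $n) = @_;
    my $skip    = $n - 1;                          # occurrences to step over
    my $not_ven = qr/(?:[^v]|v[^e]|ve[^n])*/;      # the answer's "no ven here" pattern
    # \K keeps everything matched so far out of the replaced text.
    $string =~ s/^(?:${not_ven}ven){$skip}${not_ven}\Kven/venky/;
    return $string;
}

my $second = replace_nth_ven('ven|ven|vett|vejj|ven|ven', 2);
my $third  = replace_nth_ven('ven|ven|vett|vejj|ven|ven', 3);
print "$second\n$third\n";
```

The sketch is only exercised on the sample string from the question; it inherits whatever limitations the alternation-based pattern has on other inputs.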
Currently you're replacing every instance of 'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is replace 'ven' with 'venky' within your string if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my @elements = split(/\|/, $string);
my $count;
foreach (@elements){
$count++;
s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', @elements);
print "\n";
Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurrence you want to replace
$needle is the thing you want to replace – think needle in a haystack
Note that right now we allow passing in text that might contain regular expression metacharacters. So you would have to use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
my ($string, $n, $needle, $replacement) = @_;
# take the string apart
my @elements = split /\|/, $string;
my $count = 0; # keep track of ...
foreach my $e (#elements) {
$count++ if $e =~ m/$needle/; # ... how many matches we've found
if ($count == $n) {
$e =~ s/$needle/$replacement/; # replace
last; # and stop processing
}
}
# put it back into the pipe-separated format
return join '|', @elements;
}
Output:
ven|venky|vett|vejj|ven|ven
To replace the n'th occurrence of "ven" with "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the ability with the /e flag to specify the substitution part as an expression.
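The same /e idea can be wrapped in a reusable function. The sketch below is an assumption about how one might package it: the name replace_nth and the counting-up counter $seen are illustrative, and \Q...\E is used so the needle is treated as literal text.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the $n-th occurrence of $needle.
sub replace_nth {
    my ($string, $n, $needle, $replacement) = @_;
    my $seen = 0;
    # With /e the replacement is evaluated as code for every match;
    # only the $n-th match gets the new text, the others are put back unchanged.
    $string =~ s/(\Q$needle\E)/++$seen == $n ? $replacement : $1/eg;
    return $string;
}

print replace_nth('ven|ven|vett|vejj|ven|ven', 2, 'ven', 'venky'), "\n";
print replace_nth('seven given ravens', 3, 'ven', 'venky'), "\n";
```

Counting up in a local variable, instead of decrementing $n as in the one-liner, leaves the caller's value of $n intact.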

How do I substitute with an evaluated expression in Perl?

There's a file dummy.txt
The contents are:
9/0/2010
9/2/2010
10/11/2010
I have to increment the month portion (0, 2, 11) by 1, i.e. to (1, 3, 12).
I wrote the substitution regex as follows
$line =~ s/\/(\d+)\//\/\1+1\//;
It is printing
9/0+1/2010
9/2+1/2010
10/11+1/2010
How do I make it add numerically (giving 3) rather than perform string concatenation (giving 2+1)?
Three changes:
You'll have to use the e modifier to allow an expression in the replacement part.
To make the replacement global, you should use the g modifier. This is not needed if you have one date per line.
Use $1 on the replacement side, not a backreference.
This should work:
$line =~ s{/(\d+)/}{'/'.($1+1).'/'}eg;
Also, if your regex contains the delimiter you're using (/ in your case), it's better to choose a different delimiter ({} above). That way you don't have to escape the delimiter, which keeps the regex clean.
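To see the /e substitution at work on all three sample dates, here is a small sketch. The arrays @dates and @bumped are illustrative names, and each line is copied before substitution so the inputs stay unchanged.

```perl
use strict;
use warnings;

# The three dates from dummy.txt in the question.
my @dates = ('9/0/2010', '9/2/2010', '10/11/2010');

my @bumped;
for my $line (@dates) {
    # Copy, then substitute on the copy; /e evaluates the replacement as Perl code.
    (my $copy = $line) =~ s{/(\d+)/}{'/' . ($1 + 1) . '/'}e;
    push @bumped, $copy;
}

print "$_\n" for @bumped;
```

Because each line holds a single date, the g modifier is left out here.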
This works (e evaluates the replacement as an expression; see the perlrequick documentation):
$line = '8/10/2010';
$line =~ s!/(\d+)/!('/'.($1+1).'/')!e;
print $line;
It helps to use ! or some other character as the delimiter if your regular expression has / itself.
You can also use the technique from the question "Can Perl string interpolation perform any expression evaluation?":
$line = '8/10/2010';
$line =~ s!/(\d+)/!("/@{[$1+1]}/")!e;
print $line;
but if this is a homework question, be ready to explain when the teacher asks you how you reach this solution.
How about this?
$ cat date.txt
9/0/2010
9/2/2010
10/11/2010
$ perl chdate.pl
9/1/2010
9/3/2010
10/12/2010
$ cat chdate.pl
use strict;
use warnings;
open my $fp, '<', "date.txt" or die $!;
while (<$fp>) {
chomp;
my @arr = split (/\//, $_);
my $temp = $arr[1]+1;
print "$arr[0]/$temp/$arr[2]\n";
}
close $fp;
$

How do I use Perl to intersperse characters between consecutive matches with a regex substitution?

The following lines of comma-separated values contains several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawdata =~ s/,([,\n])/,N\/A/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookback assertions, so I tried the following regex out:
$rawdata =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as @ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so the | doesn't avoid doing the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
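In Perl, that zero-width pattern would need /g to hit every gap and /m so that $ matches before each newline in a multi-line string. A minimal sketch, with $raw and $fixed as illustrative names:

```perl
use strict;
use warnings;

# The two data rows from the question, joined into one string.
my $raw = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
        . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";

# The pattern matches the empty position after a comma that is followed by
# another comma or by the end of a line; nothing is consumed, so consecutive
# empty fields are all found in a single pass.
(my $fixed = $raw) =~ s{(?<=,)(?=,|$)}{N/A}mg;

print $fixed;
```

Since both assertions are zero-width, no characters are swallowed between matches, which is what tripped up the first attempt in the question.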
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
Not the fastest code, but the shortest. It should need at most two substituting passes.
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.
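The effect of the limit argument is easy to see by counting fields. This sketch uses the second data row from the question without its newline; @trimmed, @kept and $filled are illustrative names.

```perl
use strict;
use warnings;

# Second data row from the question, without the trailing newline.
my $row = '2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,';

my @trimmed = split /,/, $row;        # trailing empty fields are dropped
my @kept    = split /,/, $row, -1;    # negative limit keeps them

printf "without -1: %d fields, with -1: %d fields\n", scalar @trimmed, scalar @kept;

# Fill the empty fields and put the row back together.
my $filled = join ',', map { $_ eq '' ? 'N/A' : $_ } @kept;
print "$filled\n";
```

Without the -1, the four trailing empty fields would vanish before map ever saw them.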