How do I substitute with an evaluated expression in Perl? - regex

There's a file dummy.txt
The contents are:
9/0/2010
9/2/2010
10/11/2010
I have to change the month portion (0,2,11) to +1, ie, (1,3,12)
I wrote the substitution regex as follows
$line =~ s/\/(\d+)\//\/\1+1\//;
It's is printing
9/0+1/2010
9/2+1/2010
10/11+1/2010
How to make it add - 3 numerically than perform string concat? 2+1??

Three changes:
You'll have to use the e modifier
to allow an expression in the
replacement part.
To make the replacement globally
you should use the g modifier. This is not needed if you've one date per line.
You use $1 on the replacement side, not a backreference
This should work:
$line =~ s{/(\d+)/}{'/'.($1+1).'/'}eg;
Also if your regex contains the delimiter you're using(/ in your case), it's better to choose a different delimiter ({} above), this way you don't have to escape the delimiter in the regex making your regex clean.

this works: (e is to evaluate the replacement string: see the perlrequick documentation).
$line = '8/10/2010';
$line =~ s!/(\d+)/!('/'.($1+1).'/')!e;
print $line;
It helps to use ! or some other character as the delimiter if your regular expression has / itself.
You can also use, from this question in Can Perl string interpolation perform any expression evaluation?
$line = '8/10/2010';
$line =~ s!/(\d+)/!("/#{[$1+1]}/")!e;
print $line;
but if this is a homework question, be ready to explain when the teacher asks you how you reach this solution.

How about this?
$ cat date.txt
9/0/2010
9/2/2010
10/11/2010
$ perl chdate.pl
9/1/2010
9/3/2010
10/12/2010
$ cat chdate.pl
use strict;
use warnings;
open my $fp, '<', "date.txt" or die $!;
while (<$fp>) {
chomp;
my #arr = split (/\//, $_);
my $temp = $arr[1]+1;
print "$arr[0]/$temp/$arr[2]\n";
}
close $fp;
$

Related

How to capture multiple words using regex on this particular text?

I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = #ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my #arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my #arr_found = ($content =~ m/"$regex"/g);
but it gave error. How to make it?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) on Sublime Text 2, it finds only the one word titles.
Why not process line-by-line, simple and easy
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (#jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push #jobs, $prev_line;
}
$prev_line = $line;
}
say for #jobs;
This relies on the requirement that the #N line is the first non-empty line after the jobs title.
It prints
Data Scientist
Programmer
SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
As for the regex, there are a few needed corrections. To compose a regex ahead of matching, what is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my #titles = $content =~ /$regex/g
what captures a line followed by at least one empty line and then #N on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or maybe better not push it with reverse-ing the list of matches but iterate through pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
I think that it's fair to say that the regex here is much more complex than the first program.
Answers for your questions:
you are capturing the second word only and you do not allow for space in between them. That's why it won't match e.g. Data Scientist
use the qr// operator to compile regexes with dynamic content. The error stems from the $ in the middle of the regex which Perl regex compiler assumes you got wrong, because $ should only come at the end of a regex.
The following code should achieve what you want. Note the two-step approach:
Find matching text
beginning of a line (^)
one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
2 line ends (\n\n)
# followed by a number (\d+)
apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
Split match at line ends (\n) and extract the 1st and the 3rd field
as we know $match will contain three lines, this approach is much easier than writing another regex.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by #Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...
You were not taking whitespaces (as in Data Scientist) into account:
^\w+.*$\R+#(\d+)
See a demo on regex101.com.
\R is equal to (?>\r\n|\n|\r|\f|\x0b|\x85) (matches Unicode newlines sequences).

Replacing a single character in a perl regex match

How can I replace the 6th "_" that appears in the regex match?
Here is the literal input to be searched. It is not representing a path to the input:
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa
Here is my code, which parses out what I need. I just now need to replace the last matched "_" with a "/":
#!/usr/bin/perl
use strict;
use warnings;
open(IN, '<', '/Users/roblogan/Test_Database.txt') or die $!;
open(OUT, '>', '/Users/roblogan/Test_Output.txt') or die $!;
while (my $line = <IN>){
if ($line =~ m/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/){
print OUT $1, "\n";
}
}
Current output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30
Desired output:
m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
I have tried:
if ($line =~ s/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0_[0-9]*)/(m160505_031746_42156_c100980652550000001823221307061611_s1_p0\/[0-9]*)/){
Any help would be appreciated.
This Perl code will do what I think you need, determined from your subject line and example output
It finds the sixth occurrence of an underscore in the target string and, if that underscore is followed by decimal digits, it changes the underscore to a slash and removes everything following the digits
I have used the pipe character | as the delimiter for the substitute operator s/// to avoid the need to escape forward slashes
use strict;
use warnings 'all';
my $path = q{/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0_30_0_59.fsa};
$path =~ s|^(?:[^_]*_){5}[^_]*\K_(\d+).*|/$1|s;
print $path, "\n";
output
/Users/rob/Documents/Test/m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30
From your description, the easiest way is:
$line =~ s!(m160505_031746_42156_c100980652550000001823221307061611_s1_pā€Œā€‹ā€Œā€‹0)_!$1/!
I've chosen ! as the delimiter because / is used in the replacement part.
$1 is a variable containing the text matched by the first ( ) group in the regex (I didn't want to repeat the whole thing twice).
The final _ is not included in $1 (it's outside of the parens); instead we put / in the replacement part.
See perldoc perlretut for more information.

regular expression that matches any word that starts with pre and ends in al

The following regular expression gives me proper results when tried in Notepad++ editor but when tried with the below perl program I get wrong results. Right answer and explanation please.
The link to file I used for testing my pattern is as follows:
(http://sainikhil.me/stackoverflow/dictionaryWords.txt)
Regular expression: ^Pre(.*)al(\s*)$
Perl program:
use strict;
use warnings;
sub print_matches {
my $pattern = "^Pre(.*)al(\s*)\$";
my $file = shift;
open my $fp, $file;
while(my $line = <$fp>) {
if($line =~ m/$pattern/) {
print $line;
}
}
}
print_matches #ARGV;
A few thoughts:
You should not escape the dollar sign
The capturing group around the whitespaces is useless
Same for the capturing group around the dot .
which leads to:
^Pre.*al\s*$
If you don't want words like precious final to match (because of the middle whitespace, change regex to:
^Pre\S*al\s*$
Included in your code:
while(my $line = <$fp>) {
if($line =~ /^Pre\S*al\s*$/m) {
print $line;
}
}
You're getting messed up by assigning the pattern to a variable before using it as a regex and putting it in a double-quoted string when you do so.
This is why you need to escape the $, because, in a double-quoted string, a bare $ indicates that you want to interpolate the value of a variable. (e.g., my $str = "foo$bar";)
The reason this is causing you a problem is because the backslash in \s is treated as escaping the s - which gives you just plain s:
$ perl -E 'say "^Pre(.*)al(\s*)\$";'
^Pre(.*)al(s*)$
As a result, when you go to execute the regex, it's looking for zero or more ses rather than zero or more whitespace characters.
The most direct fix for this would be to escape the backslash:
$ perl -E 'say "^Pre(.*)al(\\s*)\$";'
^Pre(.*)al(\s*)$
A better fix would be to use single quotes instead of double quotes and don't escape the $:
$ perl -E "say '^Pre(.*)al(\s*)$';"
^Pre(.*)al(\s*)$
The best fix would be to use the qr (quote regex) operator instead of single or double quotes, although that makes it a little less human-readable if you print it out later to verify the content of the regex (which I assume to be why you're putting it into a variable in the first place):
$ perl -E "say qr/^Pre(.*)al(\s*)$/;"
(?^u:^Pre(.*)al(\s*)$)
Or, of course, just don't put it into a variable at all and do your matching with
if($line =~ m/^Pre(.*)al(\s*)$/) ...
Try removing trailing newline character(s):
while(my $line = <$fp>) {
$line =~ s/[\r\n]+$//s;
And, to match only words that begin with Pre and end with al, try this regular expression:
/^Pre\w*al$/
(\w means any letter of a word, not just any character)
And, if you want to match both Pre and pre, do a case-insensitive match:
/^Pre\w*al$/i

How do I do optional matching in a regular expression using Perl?

I want to extract the size value from a string. The string can be be formatted in one of two ways:
Data-Size: (2000 bytes)
or
file Data-Size: (2082 bytes)
If the string is present in a file, it will appear only once.
So far I have:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "</tmp/test";
my $input = do { local $/; <FILE> };
my ($length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/m;
$length or die "could not get data length\n";
print "length: $length\n";
The problem seems to be with making the word file optional. I thought I could do this with:
(file)?
But this seems to be stopping matches when the word file is not present. Also when the word file is there it sets $length to the string "file". I think this is because the parenthesis around file also mean extraction.
So how do I match either of the two strings and extract the size value?
You want the second capture in $length. To do that, you could use
my (undef, $length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
or
my $length = ( $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/ )[1];
But a much better approach would be to avoid capturing something you're not interested in capturing.
my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
Of course, you'd get the same result from
my ($length) = $input =~ /Data-Size: \((\d+) bytes\)/;
By the way, I removed the needless /m. /m changes the meaning of ^ and $, yet neither are present in the pattern.
Just my 2 cents, you can make optional matching other way:
/(file|)\s*Data-Size: ((\d+) bytes)/

Perl regex which grabs ALL double-letter occurrences in a line

Still plugging away at teaching myself Perl. I'm trying to write some code that will count the lines of a file that contain double letters and then place parentheses around those double letters.
Now what I've come up with will find the first occurrence of double letters, but not any other ones. For instance, if the line is:
Amp, James Watt, Bob Transformer, etc. These pioneers conducted many
My code will render this:
19 Amp, James Wa(tt), Bob Transformer, etc. These pioneers conducted many
The "19" is the count (of lines containing double letters) and it gets the "tt" of "Watt" but misses the "ee" in "pioneers".
Below is my code:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (/(\w)\1/) {
$counter += 1;
s/$&/\($&\)/g;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
What am I overlooking?
use strict;
use warnings;
use 5.010;
use autodie;
my $file = '/path/to/file/electricity.txt';
open my $fh, '<', $file;
my $counter = 0;
while (<$fh>) {
chomp;
if (/(\w)\1/) {
$counter++;
s/
(?<full>
(?<letter>\p{L})
\g{letter}
)
/($+{full})/xg;
$_ = $counter . ' ' . $_;
}
say;
}
You are overlooking a few things. strict and warnings; 5.010 (or higher!) for say; autodie so you don't have to keep typing those 'or die'; Lexical filehandles and the three-argument form of open; A bit nitpicky, but knowing when (not) to use parens for function calls; Understanding why you shouldn't use $&; The autoincrement operator..
But on the regex part specifically, $& is only set on matches (m//), not substitution Actually no, ysth is right as usual. Sorry!
(I took the liberty of modifying your regex a bit; it makes use of named captures - (?) instead of bare parens, accessed through \g{} notation inside the regex, and the %+ hash outside of it - and Unicode-style properties - \p{Etc}). A lot more about those in perlre and perluniprops, respectively.
You need to use a back reference:
#! /usr/bin/env perl
use warnings;
use strict;
my $line = "this is a doubble letter test of my scrippt";
$line =~ s/([[:alpha:]])(\1)/($1$2)/g;
print "$line\n";
And now the test.
$ ./test.pl
this is a dou(bb)le le(tt)er test of my scri(pp)t
It works!
When you do a substitution, you use the $1 to represent what is in the parentheses. When you are referring to a part of the regular expression itself, you use the \1 form.
The [[:alpha:]] is a special POSIX class. You can find out more information by typing in
$ perldoc perlre
at the command line.
You're overcomplicating things by messing around with $&. s///g returns the number of substitutions performed when used in scalar context, so you can do it all in one shot without needing to count matches by hand or track the position of each match:
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = $text =~ s/(\w)\1/($1$1)/g;
print "$doubles $text\n";
Output:
4 James Wa(tt), a pion(ee)r of wa(tt)age engin(ee)ring
Edit: OP stated in comments that the exercise in question says not to use =~, so here's a non-regex-based solution, since all regex matches use =~ (implicitly or explicitly):
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = 0;
for my $i (reverse 1 .. length $text) {
if (substr($text, $i, 1) eq substr($text, $i - 1, 1)) {
$doubles++;
substr($text, $i - 1, 2) = '(' . substr($text, $i - 1, 2) . ')';
}
}
print "$doubles $text\n";
The problem is that you're using $& in the second regex which only matched the first occurance of a double letter set
if (/(\w)\1/) { #first occurance matched, so the pattern in the replace regex will only be that particular set of double letters
Try doing something like this:
s/(\w)\1/\($1$1\)/g; instead of s/$&/\($&\)/g;
Full code after editing:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (s/(\w)\1/\($1$1\)/g) {
$counter++;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
notice that you can use the s///g replace in a conditional statement which is true when a replace occurred.