Readers,
I have the following regex problem:
code
#!/usr/bin/perl -w
use 5.010;
use warnings;
my $filename = 'input.txt';
open my $FILE, "<", $filename or die $!;
while (my $row = <$FILE>)
{ # take one input line at a time
chomp $row;
if ($row =~ /\b\w*a\b/)
{
print "Matched: |$`<$&>$'|\n"; # the special match vars
print "\$1 contains '$1' \n";
}
else
{
#print "No match: |$row|\n";
}
}
input.txt
I like wilma.
this line does not match
output
Matched: |I like <wilma>|
Use of uninitialized value $1 in concatenation (.) or string at ./derp.pl line 14, <$FILE> line 22.
$1 contains ''
I am totally confused. The regex is matching, and I am checking that in a conditional, so why am I getting an empty result for $1? This isn't supposed to be happening. What am I doing wrong? How can I get 'wilma' to be in $1?
I looked here but this didn't help because I am getting a "match".
You don't have any parentheses in your regex. No parentheses, no $1.
I'm guessing you want the "word" value that ends in -a, so that would be /\b(\w*a)\b/.
Alternatively, since your whole regex only matches the bit you want, you can just use $& instead of $1, like you did in your debug output.
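To make the difference concrete, here is a minimal sketch of the fix applied to the two input lines from the question. The helper name word_ending_in_a is hypothetical and exists only for illustration; the only real change is the pair of parentheses in the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the first word ending in "a", or undef.
sub word_ending_in_a {
    my ($row) = @_;
    # The parentheses create capture group 1, which is what fills $1.
    return $row =~ /\b(\w*a)\b/ ? $1 : undef;
}

for my $row ('I like wilma.', 'this line does not match') {
    my $word = word_ending_in_a($row);
    if (defined $word) {
        print "\$1 contains '$word'\n";
    }
    else {
        print "No match: |$row|\n";
    }
}
```

The first line prints the captured word, the second falls through to the no-match branch, and no "uninitialized value" warning appears because $1 is only read after a successful match with a capture group.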
Another example:
my $row = 'I like wilma.';
$row =~ /\b(\w+)\b\s*\b(\w+)\b\s*(\w+)\b/;
print join "\n", "\$&='$&'", "\$1='$1'", "\$2='$2'", "\$3='$3'\n";
The above code produces this output:
$&='I like wilma'
$1='I'
$2='like'
$3='wilma'
Related
I have a file containing regular expressions, e.g.:
City of (.*)
(.*) State
Now I want to read these (line by line), match them against a string, and print out the extraction (matched group). For example: The string City of Berlin should match with the first expression City of (.*) from the file, after that Berlin should be extracted.
This is what I've got so far:
use warnings;
use strict;
my @pattern;
open(FILE, "<pattern.txt"); # open the file described above
while (my $line = <FILE>) {
push @pattern, $line; # store it inside the @pattern variable
}
close(FILE);
my $exampleString = "City of Berlin"; # line that should be matched and
# Berlin should be extracted
foreach my $p (@pattern) { # try each pattern
if (my ($match) = $exampleString =~ /$p/) {
print "$match";
}
}
I want Berlin to be printed.
What happens with the regex inside the foreach loop?
Is it not compiled? Why?
Is there even a better way to do this?
Your patterns contain a newline character which you need to chomp:
while (my $line = <FILE>) {
chomp $line;
push @pattern, $line;
}
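A small sketch of why the trailing newline matters, assuming the pattern line is read exactly as it appears in the file. The helper name extract is hypothetical. A pattern that still ends in "\n" requires a newline after the captured text, and the example string has none.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture of $pattern in $string, or undef.
sub extract {
    my ($pattern, $string) = @_;
    my ($match) = $string =~ /$pattern/;
    return $match;
}

my $string = 'City of Berlin';
my $raw    = "City of (.*)\n";      # as read from the file, newline included

# The unchomped pattern cannot match: the string contains no newline.
print "raw pattern: ", (defined extract($raw, $string) ? "match" : "no match"), "\n";

# After chomp the same pattern matches and captures the city name.
chomp(my $clean = $raw);
print "chomped pattern: ", extract($clean, $string), "\n";
```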
First off - the missing chomp is the root of your problem.
Secondly, your code is also very inefficient. Rather than checking patterns in a foreach loop, consider compiling a single regex in advance:
#!/usr/bin/env perl
use strict;
use warnings;
# open ( my $pattern_fh, '<', "pattern.txt" ) or die $!;
my @patterns = <DATA>; # reads the patterns from a __DATA__ section at the end of the script
chomp(@patterns);
my $regex = join( '|', @patterns );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
Why is this better? Because it scales more efficiently. With your original algorithm, if I have 100 lines in the patterns file and 100 lines to check as example strings, it means making 10,000 comparisons.
With a single regex, you're making one comparison on each line.
Note - normally you'd use quotemeta when interpolating strings read from a file into a regex, which escapes the 'meta' characters. We don't want to do that here, because the lines are meant to be regular expressions.
If you're looking for something even more concise, you can use map to avoid needing an intermediate array:
my $regex = join( '|', map { chomp; $_ } <$pattern_fh> );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
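One caveat with the joined alternation: inside (?:...) every alternative's capture group keeps its own number, so my ($match) only receives $1 and stays undefined when a later alternative is the one that matches. A possible way around this, assuming Perl 5.10 or newer, is the branch-reset group (?|...), which numbers the groups of each alternative from 1 again. The helper names build_regex and extract below are hypothetical, and the patterns are hard-coded instead of read from pattern.txt to keep the sketch self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: joins the patterns into one branch-reset alternation.
sub build_regex {
    my $alternation = join '|', @_;
    return qr/(?|$alternation)/;    # (?|...) restarts group numbering per branch
}

# Hypothetical helper: returns capture group 1 of $regex in $string, or undef.
sub extract {
    my ($regex, $string) = @_;
    my ($match) = $string =~ $regex;
    return $match;
}

my $regex = build_regex('City of (.*)', '(.*) State');

for my $string ('City of Berlin', 'Texas State') {
    my $match = extract($regex, $string);
    print "Matched: $match\n" if defined $match;
}
```

With (?:...) in place of (?|...), the second string would still match, but the extraction would come back undefined because the text lands in $2.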
I'm trying to find occurrences of BLOB_SMUGHO in the file test.out, searching from the bottom of the file. If found, I want to return the chunk of data I'm interested in, which is delimited by the string "2014.10".
I'm getting Use of uninitialized value $cc in pattern match (m//) at
What is wrong with this script?
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::ReadBackwards;
my $find = "BLOB_SMUGHO";
my $chnkdelim = "\n[" . strftime "%Y.%m", localtime;
my $fh = File::ReadBackwards->new('test.out', $chnkdelim, 0) or die "err-file: $!\n";
while ( defined(my $line = $fh->readline) ) {
if(my $cc =~ /$find/){
print $cc;
}
}
close($fh);
In case if this helps, here is a sample content of test.out
2014.10.31 lots and
lots of
gibbrish
2014.10.31 which I'm not
interested
in. It
also
2014.10.31 spans
across thousands of
lines and somewhere in the middle there will be
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31 words
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31
this precious word BLOB_SMUGHO and
which
I
will
be
interested
in.
And I'm expecting to capture all the multiple occurrences of the chunk of the text from bottom of the file.
2014.10.31
this precious word BLOB_SMUGHO and
First, you have written your match incorrectly due to misunderstanding the =~ operator:
if(my $cc =~ /$find/){ # incorrect, like saying if(undef matches something)
If you want to match what is in $line against the pattern between /.../ then do:
if($line =~ /$find/) {
The match operator expects a value on the left side as well as a pattern on the right side. You were using it like an assignment operator.
If you need to capture the match(es) into a variable or list, put a capture group in the pattern and assign the result of the match in list context:
if(my ($cc) = $line =~ /($find)/) { # wrap $cc in () for list context; the pattern needs a capture group
By the way, I think you are better off writing:
if($line =~ /$find/) {
print $line;
or if you want to print what you matched only
print $&;
Since you aren't capturing a substring, it doesn't really matter here.
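The two uses of the match operator can be put side by side in a short sketch. The helper name first_capture is hypothetical; the sample line is taken from the test.out excerpt in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture group of $re in $line, or undef.
sub first_capture {
    my ($line, $re) = @_;
    my ($captured) = $line =~ $re;    # list context: the match returns its captures
    return $captured;
}

my $find = 'BLOB_SMUGHO';
my $line = "this precious word BLOB_SMUGHO and\n";

# Boolean context: the match only tells us whether the line matched.
print "matched\n" if $line =~ /$find/;

# List context with a capture group: the match hands back the captured text.
my $cc = first_capture($line, qr/($find)/);
print "captured: $cc\n" if defined $cc;
```

In both cases the string being tested sits on the left of =~, which is exactly what my $cc =~ /$find/ got wrong.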
Now, as to how to match everything between two patterns, the task is easier if you don't match line by line, but match across newlines using the /s modifier.
In Perl, you can set the record separator to undef and use slurp mode.
local $/ = undef;
my $s = <>; # read all lines into $s
Now to scan $s for patterns
while($s =~ /(START.*?STOP)/gsm) { print "$1\n"; } # print the pattern inclusive of START and STOP
Or to capture between START and STOP
while($s =~ /START(.*?)STOP/gsm) { print "$1\n"; } # print only the text between START and STOP
So in your case the start pattern is 2014.10.31 and stop is BLOB_SMUGHO
while($s =~ /(2014\.10\.31.*?BLOB_SMUGHO)/gsm) {
print "$1\n";
}
NOTE: Regex modifiers in Perl come after the final /. Here I use /gsm: g for global matching (get multiple matches in a loop by remembering the last match position), s so that . also matches newlines, and m for multiline mode (^ and $ match at line boundaries).
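Note that a non-greedy .*? still starts at the first date stamp it finds, so on the sample data the first match can run across several unrelated chunks before it reaches BLOB_SMUGHO. As an alternative sketch (a different technique from the regex loop above, not the original answer's method), the slurped text can be split into chunks at each date stamp and filtered. The helper name chunks_containing is hypothetical, and the sample text is a shortened version of the test.out excerpt.

```perl
use strict;
use warnings;

# Hypothetical helper: splits $text into chunks that each start with $stamp
# at the beginning of a line, and returns the chunks containing $word,
# last chunk first.
sub chunks_containing {
    my ($text, $stamp, $word) = @_;
    # The zero-width lookahead keeps the stamp at the start of each chunk.
    my @chunks = split /^(?=\Q$stamp\E)/m, $text;
    return grep { index($_, $word) >= 0 } reverse @chunks;
}

my $text = <<'END';
2014.10.31 lots and
lots of
gibberish
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31
this precious word BLOB_SMUGHO and
which I will be interested in.
END

print "---\n$_" for chunks_containing($text, '2014.10.31', 'BLOB_SMUGHO');
```

This prints the two chunks that mention BLOB_SMUGHO, starting with the one nearest the end of the text.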
I have the following code to get a substring from inside a string. I'm using regular expressions, but they don't seem to work properly. How can I do it?
I have this string:
vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220,00:00:38,2,0.00%,38.67%,€0.00
and I want to get this substring:
transacciones+banco+de+bogota
The code:
open my $info, $myfile or die "Could not open $myfile: $!";
while (my $line = <$info>) {
if ($line =~ m/textolibre=/) {
my $line =~ m/textolibre=(.*?)&translated/g;
print $1;
}
last if $. == 3521239;
}
close $info;
The errors:
Use of uninitialized value $line in pattern match (m//) at classifier.pl line 10, <$info> line 20007.
Use of uninitialized value $1 in print at classifier.pl line 11, <$info> line 20007.
You are using the wrong tool for the job. You can use the URI module and its URI::QueryParam module to extract the parameters:
use strict;
use warnings;
use URI;
use URI::QueryParam;
my $str = "vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220,00:00:38,2,0.00%,38.67%,0.00";
my $u = URI->new($str);
print $u->query_param('textolibre');
Output:
transacciones banco de bogota
The second declaration of $line is erroneous; drop the my:
$line =~ m/textolibre=(.*?)&translated/g;
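A minimal sketch of the regex approach with the stray my removed, wrapped in a hypothetical helper named textolibre_value so that $1 is only read after a successful match. The sample line is the one from the question, with the trailing currency symbol left out.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the raw value of the textolibre parameter, or undef.
sub textolibre_value {
    my ($line) = @_;
    # No "my" in front of $line here: we match against the existing variable.
    return $line =~ /textolibre=(.*?)&translated/ ? $1 : undef;
}

my $line = 'vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220,00:00:38,2,0.00%,38.67%,0.00';

my $value = textolibre_value($line);
print "$value\n" if defined $value;    # prints transacciones+banco+de+bogota
```

Unlike the URI::QueryParam approach, this returns the value exactly as it appears in the line, so the + signs are not decoded to spaces.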
Sorry, but I've found the answer. At line 10, my $line =~ m/textolibre=(.*?)&translated/g; declares the same variable a second time, and that is what throws the errors.
Thanks!
I have the following in an executable .pl file:
#!/usr/bin/env perl
$file = 'TfbG_peaks.txt';
open(INFO, $file) or die("Could not open file.");
foreach $line (<INFO>) {
if ($line =~ m/[^_]*(?=_)/){
#print $line; #this prints lines, which means there are matches
print $1; #but this prints nothing
}
}
Based on my reading at http://goo.gl/YlEN7 and http://goo.gl/VlwKe, print $1; should print the first match in each line, but it doesn't. Help!
No, $1 should print the string saved by so-called capture groups (created by the bracketing construct - ( ... )). For example:
if ($line =~ m/([^_]*)(?=_)/){
print $1;
# now this will print something,
# unless the string begins with an underscore
# (which still matches the pattern, as * means 'zero or more instances')
# are you sure you don't need `+` here?
}
The pattern in your original code didn't have any capture groups, that's why $1 was empty (undef, to be precise) there. And (?=...) didn't count, as these were used to add a look-ahead subexpression.
$1 prints what the first capture ((...)) in the pattern captured.
Maybe you were thinking of
print $& if $line =~ /[^_]*(?=_)/; # BAD
or
print ${^MATCH} if $line =~ /[^_]*(?=_)/p; # 5.10+
But the following would be simpler (and work before 5.10):
print $1 if $line =~ /([^_]*)_/;
Note: You'll get a performance boost when the pattern doesn't match if you add a leading ^ or (?:^|_) (whichever is appropriate).
print $1 if $line =~ /^([^_]*)_/;
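As a short sketch of the anchored version in use: the helper name prefix_before_underscore and the sample strings are assumptions, since the contents of TfbG_peaks.txt are not shown. It uses + instead of *, so a line that starts with an underscore is treated as a non-match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text before the first underscore, or undef
# when the line has no underscore or starts with one.
sub prefix_before_underscore {
    my ($line) = @_;
    return $line =~ /^([^_]+)_/ ? $1 : undef;
}

for my $line ('TfbG_peaks', 'no underscore here', '_leading') {
    my $prefix = prefix_before_underscore($line);
    print defined $prefix ? "$line: $prefix\n" : "$line: no match\n";
}
```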
Still plugging away at teaching myself Perl. I'm trying to write some code that will count the lines of a file that contain double letters and then place parentheses around those double letters.
Now what I've come up with will find the first occurrence of double letters, but not any other ones. For instance, if the line is:
Amp, James Watt, Bob Transformer, etc. These pioneers conducted many
My code will render this:
19 Amp, James Wa(tt), Bob Transformer, etc. These pioneers conducted many
The "19" is the count (of lines containing double letters) and it gets the "tt" of "Watt" but misses the "ee" in "pioneers".
Below is my code:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (/(\w)\1/) {
$counter += 1;
s/$&/\($&\)/g;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
What am I overlooking?
use strict;
use warnings;
use 5.010;
use autodie;
my $file = '/path/to/file/electricity.txt';
open my $fh, '<', $file;
my $counter = 0;
while (<$fh>) {
chomp;
if (/(\w)\1/) {
$counter++;
s/
(?<full>
(?<letter>\p{L})
\g{letter}
)
/($+{full})/xg;
$_ = $counter . ' ' . $_;
}
say;
}
You are overlooking a few things: strict and warnings; 5.010 (or higher!) for say; autodie, so you don't have to keep typing those 'or die'; lexical filehandles and the three-argument form of open; a bit nitpicky, but knowing when (not) to use parens for function calls; understanding why you shouldn't use $&; the autoincrement operator.
On the regex part specifically, I originally claimed that $& is only set on matches (m//), not on substitutions. That was wrong; ysth is right as usual. Sorry!
(I took the liberty of modifying your regex a bit; it makes use of named captures, (?<name>...) instead of bare parens, accessed through \g{} notation inside the regex and the %+ hash outside of it, and Unicode-style properties, \p{Etc}.) There is a lot more about those in perlre and perluniprops, respectively.
You need to use a back reference:
#! /usr/bin/env perl
use warnings;
use strict;
my $line = "this is a doubble letter test of my scrippt";
$line =~ s/([[:alpha:]])(\1)/($1$2)/g;
print "$line\n";
And now the test.
$ ./test.pl
this is a dou(bb)le le(tt)er test of my scri(pp)t
It works!
When you do a substitution, you use the $1 to represent what is in the parentheses. When you are referring to a part of the regular expression itself, you use the \1 form.
The [[:alpha:]] is a special POSIX class. You can find out more information by typing in
$ perldoc perlre
at the command line.
You're overcomplicating things by messing around with $&. s///g returns the number of substitutions performed when used in scalar context, so you can do it all in one shot without needing to count matches by hand or track the position of each match:
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = $text =~ s/(\w)\1/($1$1)/g;
print "$doubles $text\n";
Output:
4 James Wa(tt), a pion(ee)r of wa(tt)age engin(ee)ring
Edit: OP stated in comments that the exercise in question says not to use =~, so here's a non-regex-based solution, since all regex matches use =~ (implicitly or explicitly):
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = 0;
for my $i (reverse 1 .. length $text) {
if (substr($text, $i, 1) eq substr($text, $i - 1, 1)) {
$doubles++;
substr($text, $i - 1, 2) = '(' . substr($text, $i - 1, 2) . ')';
}
}
print "$doubles $text\n";
The problem is that you're using $& in the second regex, and $& holds only the first occurrence of a double-letter set:
if (/(\w)\1/) { # first occurrence matched, so the pattern in the replace regex will only be that particular set of double letters
Try doing something like this:
s/(\w)\1/\($1$1\)/g; instead of s/$&/\($&\)/g;
Full code after editing:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (s/(\w)\1/\($1$1\)/g) {
$counter++;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
Notice that you can use the s///g replace in a conditional statement, which is true when a replacement occurred.