Load regex from file and match groups with it in Perl - regex

I have a file containing regular expressions, e.g.:
City of (.*)
(.*) State
Now I want to read these (line by line), match them against a string, and print out the extraction (matched group). For example: The string City of Berlin should match with the first expression City of (.*) from the file, after that Berlin should be extracted.
This is what I've got so far:
use warnings;
use strict;
my @pattern;
open(FILE, "<pattern.txt"); # open the file described above
while (my $line = <FILE>) {
push @pattern, $line; # store it inside the @pattern variable
}
close(FILE);
my $exampleString = "City of Berlin"; # line that should be matched and
# Berlin should be extracted
foreach my $p (@pattern) { # try each pattern
if (my ($match) = $exampleString =~ /$p/) {
print "$match";
}
}
I want Berlin to be printed.
What happens with the regex inside the foreach loop?
Is it not compiled? Why?
Is there even a better way to do this?

Your patterns contain a newline character which you need to chomp:
while (my $line = <FILE>) {
chomp $line;
push @pattern, $line;
}
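To see why the trailing newline matters, here is a minimal sketch (the variable names are illustrative, not taken from the question). The unchomped pattern demands a literal newline after the captured text; since `.` does not match a newline and the test string contains none, the match fails:

```perl
use strict;
use warnings;

my $string    = "City of Berlin";
my $unchomped = "City of (.*)\n";        # what <FILE> returns for that line
chomp( my $chomped = $unchomped );       # same text without the newline

# List context: a successful match returns the captured group.
my ($raw_match)   = $string =~ /$unchomped/;
my ($clean_match) = $string =~ /$chomped/;

print defined $raw_match ? "unchomped: $raw_match\n" : "unchomped: no match\n";
print "chomped: $clean_match\n";
```

Only the chomped pattern should capture Berlin.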

First off, the missing chomp is the root of your problem.
Secondly, your code is also inefficient. Rather than checking each pattern in a foreach loop, consider compiling a single regex in advance:
#!/usr/bin/env perl
use strict;
use warnings;
# open ( my $pattern_fh, '<', "pattern.txt" ) or die $!;
my @patterns = <DATA>;
chomp(@patterns);
my $regex = join( '|', @patterns );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
Why is this better? Well, because it scales more efficiently. With your original algorithm - if I have 100 lines in the patterns file, and 100 lines to check as example str, it means making 10,000 comparisons.
With a single regex, you're making one comparison on each line.
Note: normally you'd use quotemeta when interpolating strings read from a file into a regex, which escapes 'meta' characters. We don't want to do that here, because the lines are meant to be regular expressions.
If you want something even more concise, you can use map to avoid needing an intermediate array:
my $regex = join( '|', map { chomp; $_ } <$pattern_fh> );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
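One caveat when joining patterns with `|`: inside a plain `(?:...)` group each alternative keeps its own capture number, so `my ($match)` only receives the first group and is undefined when a later alternative matches. A possible workaround, sketched here under the assumption of Perl 5.10 or newer, is a branch reset group `(?|...)`, which numbers the captures of every alternative from 1. The in-memory pattern list and the second example string are hypothetical stand-ins:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the chomped lines of pattern.txt.
my @patterns = ( 'City of (.*)', '(.*) State' );

# (?|...) is a branch reset group: the capture in each alternative is $1.
my $joined = join '|', @patterns;
my $regex  = qr/(?|$joined)/;

my @results;
for my $example ( 'City of Berlin', 'Washington State' ) {
    if ( my ($match) = $example =~ $regex ) {
        push @results, $match;
        print "Matched: $match\n";
    }
}
```

This only helps when every pattern has the same number of capture groups in the same roles; otherwise looping over individually compiled `qr//` patterns may be the clearer choice.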

Related

Content of unmatched string

I am parsing input from stdin, which is in the following format:
1
2
3
done
My current code is:
while(<> =~ /(\d+)/) {
# do something with $1
}
# I want to access the line following the last number here
which is fine for the 1st part of the file (with the list of numbers).
However, I would like the content of the line immediately following the last number.
Is there any elegant way to do it?
You could use:
my $line;
while( ($line = <>) =~ /^(\d+)$/ ) {
# do something with $1
}
chomp $line;
# Do something with $line (done)
As another answer states, it is a good idea to use chomp with a line of input. It will deal with the newline char or chars in a platform-independent way. If you use a regex to match the number, $1 will be "clean".
Also as another answer states, /(\d+)/ will match a number anywhere in the line. /^(\d+)$/ will match a line that only contains a number.
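The difference between the two patterns can be checked with a small sketch; the sample lines are made up for illustration:

```perl
use strict;
use warnings;

# Made-up input lines, newline included as if read from a filehandle.
my @lines = ( "42\n", "room 101\n", "done\n" );

my ( @loose, @strict );
for my $line (@lines) {
    push @loose,  $1 if $line =~ /(\d+)/;      # a number anywhere in the line
    push @strict, $1 if $line =~ /^(\d+)$/;    # the line is nothing but a number
}

print "loose:  @loose\n";
print "strict: @strict\n";
```

The unanchored pattern also picks up 101 from the middle of "room 101", while the anchored one accepts only the first line. Note that `$` still matches before the trailing newline, so no chomp is needed for the test itself.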
Another way:
foreach my $line (<>) {
if( $line =~ /^(\d+)$/ ) {
# Do something with $1
}
else {
chomp $line;
# Do something with $line (done)
}
}
This will let you deal with multiple or even alternating lines of numbers and other text.
while (<>) {
chomp;
last if !/^\d+\z/;
... $_ ...
}
die("Premature EOF") if !defined($_);
... $_ ...
You can do it this way
my $var = <>;
chomp $var;
while($var =~ /^(\d+)$/)
{
#here do stuff with $var, containing only numbers
$var = <>;
chomp $var;
}
# here do stuff with $var, containing now a word
One other thing: by changing /(\d+)/ to /^(\d+)$/, you make sure the line contains ONLY digits ;)

Substitute a sentence of a text, with the corresponding sentence of another text using Perl

I've a text file like this
mc1s2 L#'|NA|det indice|indice|nc Sensex|NA|adj
progressait|progresser|v de|de|prep
and another file text like this
programmer:_[1]_:_P0_(P1)=1 progresser:_[1]_:_P0=1
prohiber:_[1]_:_P0_P1=1
projeter:_[3]_:_P0_P1=1;_:_P0_P1_(PL)=1;_:_P0_P1_(PP<sur>)=1
I would like to do a replacement in order to create a third text file like this
mc1s2 L#'|NA|det indice|indice|nc Sensex|NA|adj
progresser:_[1]_:_P0=1 de|de|prep
As you can see, I'd like to replace progressait|progresser|v with progresser:_[1]_:_P0=1.
I would like to do this for all verbs.
This script does what I need, but I can't understand the last part of it:
use strict;
use warnings;
use autodie;
my $lookupfile = 'lookup.txt';
# Contains:
# programmer:_[1]_:_P0_(P1)=1
# progresser:_[1]_:_P0=1
# prohiber:_[1]_:_P0_P1=1
# projeter:_[3]_:_P0_P1=1;_:_P0_P1_(PL)=1;_:_P0_P1_(PP<sur>)=1
my $datafile = 'data.txt';
# Contains:
# mc1s2 L#'|NA|det indice|indice|nc Sensex|NA|adj progressait|progresser|v de|de|prep
my %lookup;
open my $fh, '<', $lookupfile;
while (<$fh>) {
chomp;
my ($field) = split ':';
$lookup{$field} = $_;
}
# use Data::Dump; # Used to debug the lookup table.
# dd \%lookup;
open $fh, '<', $datafile;
while (<$fh>) {
s{(?<=\s)(\S+)} {
my $entry = $1;
my @fields = split '\|', $entry;
$lookup{$fields[1]} // $entry;
}eg;
print;
}
I can't understand this :
open $fh, '<', $datafile;
while (<$fh>) {
s{(?<=\s)(\S+)}{
my $entry = $1;
my @fields = split '\|', $entry;
$lookup{$fields[1]} // $entry;
}eg;
Can you help me?
This substitution
s{(?<=\s)(\S+)}{
my $entry = $1;
my @fields = split '\|', $entry;
$lookup{$fields[1]} // $entry;
}eg;
uses the /e modifier, which indicates that the replacement string is not to be used directly, but executed as Perl code to generate the string to replace the match.
The match finds the next sequence of non-space characters that follow a space character, so in this case $1 is initially set to L#'|NA|det
$1 is copied to $entry, and $entry is split on the pipe characters | into @fields
The %lookup hash is indexed with $fields[1] - the second entry in @fields. Here that is the string NA
The code block returns the value of that hash element, or the whole of $entry if there was no hash element with that key. Note that, because $entry is the whole of the matched string, there is no change unless a corresponding element is found in %lookup because the string is replaced with itself
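The steps above can be tried on a single line with a one-entry lookup table. This is only a sketch: the hash contents and the shortened sample line are assumptions made for illustration, and the `// ''` guard is an addition that avoids a warning for tokens without a pipe.

```perl
use strict;
use warnings;

# Hypothetical one-entry lookup table, keyed by the lemma (second field).
my %lookup = ( progresser => 'progresser:_[1]_:_P0=1' );

# Shortened sample line, assumed for illustration.
my $line = 'mc1s2 indice|indice|nc progressait|progresser|v de|de|prep';

# /e runs the replacement as code; /g repeats it for every token that
# follows whitespace. Unknown lemmas fall back to the original token.
$line =~ s{(?<=\s)(\S+)}{
    my $entry  = $1;
    my @fields = split /\|/, $entry;
    $lookup{ $fields[1] // '' } // $entry;
}eg;

print "$line\n";
```

Only the token whose second field is progresser is swapped; every other token is replaced with itself and so appears unchanged.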
I hope this helps

Put regex match only into array, not entire line

I am trying to check each line of a document for a regex match.
If the line has a match, I want to push the match only into an array.
In the code below, I thought that using the g modifier at the end of the regex would make $line's value the regex match only. Instead, $line's value is the entire line of the document containing the match...
my $line;
my @table;
while($line = <$input>){
if($line =~ m/foo/g){
push (@table, $line);
}
}
print @table;
If anyone could help me get my matches into an array, it would be much appreciated.
Thanks.
p.s.
Still learning... so any explanations of concepts I may have missed are also much appreciated.
The g modifier makes a match or substitution global (as in s///g, which replaces every occurrence); it does not change what is stored in $line.
If you just want to push the matching text into an array, you need to capture it by enclosing that part of the pattern in (). Captured text is stored in the variables $1, $2, etc.
Try following modification to your code:
my @table;
while(my $line = <$input>){
if($line =~ m/(foo)/){
push (@table, $1);
}
}
print @table;
Refer to this documentation for more details.
Or if you want to avoid needless use of global variables,
my @table;
while(my $line = <$input>){
if(my @captures = $line =~ m/(foo)/){
push @table, @captures;
}
}
which simplifies to
my @table;
while(my $line = <$input>){
push @table, $line =~ m/(foo)/;
}
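Without /g, the list-context match above yields at most one capture per line. If a line can contain several occurrences, adding /g in list context returns all of them. A small sketch with made-up sample lines:

```perl
use strict;
use warnings;

# Made-up sample lines; the second one contains "foo" twice (in "fool" too).
my @lines = ( "herp de derp foo\n", "yerp fool foo flerp\n", "heyhey\n" );

my ( @first_only, @all );
for my $line (@lines) {
    push @first_only, $line =~ m/(foo)/;     # at most one capture per line
    push @all,        $line =~ m/(foo)/g;    # every occurrence on the line
}

print scalar(@first_only), " vs ", scalar(@all), "\n";
```

A failed match returns an empty list in list context, so nothing is pushed for the line without a match.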
Expanding on jkshah's answer a little, I'm explicitly storing the matches in @matches instead of using the magic variable $1, which I find a little harder to read.
__DATA__ is a simple way to embed lines in a Perl source file and read them through the DATA filehandle.
use strict;
use warnings;
my @table;
while(my $line = <DATA>){
my @matches = $line =~ m/(foo)/;
if(@matches) {
warn "found: " . join(',', @matches );
push(@table,@matches);
}
}
print @table;
__DATA__
herp de derp foo
yerp fool foo flerp
heyhey
If your file is not very big (100-500 MB is fine with 2 GB of RAM), you can use the code below. Here I am extracting every number found in the file. It should be much faster than looping over the lines.
#!/usr/bin/perl
open my $file_h,"<abc" or die "ERROR-$!";
my @file = <$file_h>;
my $file_cont = join(' ',@file);
@file = ();
my @match = $file_cont =~ /\d+/g;
print "@match";

detecting specific text from a line and saving it in array

I have a text file which consists of different lines it looks like
Destination|203.190.242.69|reached|203.190.244.6
Destination|208.109.249.198|reached|212.142.1.1
Destination|94.75.253.170|reached|85.17.100.90
Destination|212.112.234.228|reached|4.69.143.210
Destination|80.146.246.42|reached|192.168.1.1
Destination|122.209.193.217|reached|59.128.3.65
Destination|66.77.197.179|reached|66.77.197.251
Destination|195.254.227.65|reached|213.21.128.141
Destination|125.208.8.253|reached|125.208.15.254
I need to save both IPs in different arrays, so there will
be two arrays, one for Destination and one for reached. How can I do this? At the moment I have written some code to detect the IP, but it does not seem to work.
while (my $line = <$in>) {
my $traceroute;
if ($line =~ /(^Destination)/) {
print "DUDE\n";
my $ip = $line =~ /(\d+\.\d+\.\d+\.\d+)$/s;
#$traceroute = $2;
print "$ip\n";
}
}
This should do the trick:
while (my $line = <DATA>) {
if($line =~ /(^Destination)/){
my($dest, $reach) = $line =~ /(\d+\.\d+\.\d+\.\d+)/g;
print "$dest $reach\n";
}
}
__DATA__
Destination|203.190.242.69|reached|203.190.244.6
Destination|208.109.249.198|reached|212.142.1.1
Destination|94.75.253.170|reached|85.17.100.90
Destination|212.112.234.228|reached|4.69.143.210
Destination|80.146.246.42|reached|192.168.1.1
Destination|122.209.193.217|reached|59.128.3.65
Destination|66.77.197.179|reached|66.77.197.251
Destination|195.254.227.65|reached|213.21.128.141
Destination|125.208.8.253|reached|125.208.15.254
A regex match in scalar context just returns true (1) if it succeeds or false if it fails, not the matched text. Assigning the result of a regex match to an array or a list of variables puts it in list context, in which the match returns the captured values.
The /g modifier matches the regex not just once, but as many times as it fits in the string. Read perldoc perlretut for more.
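The three behaviours can be compared side by side. This is a minimal sketch using one line of the sample data; the variable names are illustrative:

```perl
use strict;
use warnings;

my $line  = "Destination|203.190.242.69|reached|203.190.244.6";
my $ip_re = qr/(\d+\.\d+\.\d+\.\d+)/;

my $scalar_result = $line =~ $ip_re;       # scalar context: success flag only
my ($first_ip)    = $line =~ $ip_re;       # list context: the first capture
my @all_ips       = $line =~ /$ip_re/g;    # list context with /g: every capture

print "$scalar_result\n";
print "$first_ip\n";
print "@all_ips\n";
```

The parentheses around `$first_ip` are what switch the assignment to list context; without them the variable receives the success flag, which is the behaviour seen in the question.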
Since you want to save the addresses in arrays, and none of the previous solutions explicitly does that, I'm posting another answer:
use strict;
my (@destination, @reached);
foreach my $line (<DATA>) {
chomp $line;
my @fields = split '\|', $line;
push @destination, $fields[1];
push @reached, $fields[3];
}
use Data::Dumper;
print "Destinations:\n".Dumper(@destination);
print "Reached:\n".Dumper(@reached);
__DATA__
Destination|203.190.242.69|reached|203.190.244.6
Destination|208.109.249.198|reached|212.142.1.1
Destination|94.75.253.170|reached|85.17.100.90
Destination|212.112.234.228|reached|4.69.143.210
Destination|80.146.246.42|reached|192.168.1.1
Destination|122.209.193.217|reached|59.128.3.65
Destination|66.77.197.179|reached|66.77.197.251
Destination|195.254.227.65|reached|213.21.128.141
Destination|125.208.8.253|reached|125.208.15.254
Split on the separator and save the items at indices 1 and 3.
perl -aF'\|' -lne 'next unless $F[0] eq "Destination"; print "$F[1]|$F[3]"' input >output
You may want a different output format and/or do something more, of course.
while (my $line = <$in>) {
chomp $line;
if ($line =~ /^Destination\|(\d+\.\d+\.\d+\.\d+)\|reached\|(\d+\.\d+\.\d+\.\d+)$/) {
my ($d, $r) = ($1,$2);
print "$d => $r\n";
}
}

How can I extract the substring in the last set of parentheses using Perl?

I am using Perl to parse out sizes in a string. What is the regex that I could use to accomplish this:
Example Data:
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Result:
Twin
Full XL
I know that I need to start at the end of the string and parse out the first set of parentheses; I'm just not sure how to do it.
#!/usr/bin/perl
$file = 'input.csv';
open (F, $file) || die ("Could not open $file!");
while ($line = <F>)
{
($field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9) = split ',', $line;
if ( $field1 =~ /^.*\((.*)\)/ ) {
print $1;
}
#print "$field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9, $1\n";
}
close (F);
Not getting any results. Maybe I am not doing this right.
The answer depends on if the size information you are looking for always appears within parentheses at the end of the string. If that is the case, then your task is simple:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA> ) {
last unless /\S/;
my ($size) = /\( ( [^)]+ ) \)$/x;
print "$size\n";
}
__DATA__
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Output:
C:\Temp> xxl
Twin
Full XL
Note that the code you posted can be better written as:
#!/usr/bin/perl
use strict;
use warnings;
my ($input_file) = @ARGV;
open my $input, '<', $input_file
or die "Could not open '$input_file': $!";
while (my $line = <$input>) {
chomp $line;
my @fields = split /,/, $line;
if ($fields[0] =~ /\( ( [^)]+ ) \)$/x ) {
print $1;
}
print join('|', @fields), "\n";
}
close $input;
Also, you should consider using Text::xSV or Text::CSV_XS to process CSV files.
The following regular expression will match the content at the end of the string:
m/\(([^)]+)\)$/m
The m at the end enables multi-line mode, which changes $ to match at the end of each line, not only at the end of the string.
[edited to add the bit about multi-line strings]
Assuming your data arrives line by line, and you are only interested in the contents of the last set of parens:
if ( $string =~ /^.*\((.*)\)/ ) {
print $1;
}
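The pattern above reaches the last set of parentheses because the leading `.*` is greedy: it consumes as much as possible and backtracks only until the final `(` can match. A short sketch contrasting it with a non-greedy variant, which is added here purely for comparison:

```perl
use strict;
use warnings;

my $string = "Magic Nite (Flip Free design) Mattress (Full XL)";

# Greedy .* skips ahead to the last opening parenthesis.
my ($greedy) = $string =~ /^.*\((.*)\)/;

# Non-greedy .*? stops at the first opening parenthesis instead.
my ($lazy) = $string =~ /^.*?\((.*?)\)/;

print "$greedy\n";
print "$lazy\n";
```

The greedy form captures Full XL, while the non-greedy form captures Flip Free design.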
A fancy regex is not really necessary here; make it easier on yourself. You can split on " (" (a space followed by an opening parenthesis) and take the last element. Of course, this assumes the data you want always comes last and is in parentheses.
while(<>){
@a = split / \(/, $_;
print $a[-1]; # get the last element. do your own trimming
}
This is the answer as expressed in Perl5:
my $str = "Magic Nite (Flip Free design) Mattress (Full XL)";
$str =~ m/.*\((.*)\)/;
print "$1\r\n";