Perl -- regex issue - regex

I have the following code to get a substring inside a string. I'm using regular expressions, but they don't seem to work properly. How can I do it?
I have this string:
vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220,00:00:38,2,0.00%,38.67%,€0.00
and I want to get this substring:
transacciones+banco+de+bogota
The code:
open my $info, $myfile or die "Could not open $myfile: $!";
while (my $line = <$info>) {
if ($line =~ m/textolibre=/) {
my $line =~ m/textolibre=(.*?)&translated/g;
print $1;
}
last if $. == 3521239;
}
close $info;
The errors:
Use of uninitialized value $line in pattern match (m//) at classifier.pl line 10, <$info> line 20007.
Use of uninitialized value $1 in print at classifier.pl line 11, <$info> line 20007.

You are using the wrong tool for the job. You can use the URI module and its URI::QueryParam module to extract the parameters:
use strict;
use warnings;
use URI;
use URI::QueryParam;
my $str = "vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220,00:00:38,2,0.00%,38.67%,0.00";
my $u = URI->new($str);
print $u->query_param('textolibre');
Output:
transacciones banco de bogota

The second declaration of $line is erroneous: my creates a new, undefined $line that hides the one holding the input. Drop the my:
$line =~ m/textolibre=(.*?)&translated/g;
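For reference, here is a minimal sketch of the corrected match wrapped in a small helper. The name extract_textolibre is made up for illustration, and the sample string is a shortened copy of the one in the question. It checks that the match succeeded before touching $1, which avoids the second warning as well.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the raw
# textolibre value, or undef if the line does not contain one.
sub extract_textolibre {
    my ($line) = @_;
    # No "my" in front of $line here: we match against the existing
    # variable instead of a freshly declared, undefined one.
    if ($line =~ /textolibre=(.*?)&translated/) {
        return $1;
    }
    return;
}

my $sample = 'vlex.es/jurisdictions/ES/search?textolibre=transacciones+banco+de+bogota&translated_textolibre=,300,220';
my $value  = extract_textolibre($sample);
print "$value\n" if defined $value;
```

Note that the value comes back undecoded (with + instead of spaces), unlike the URI::QueryParam approach.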

Sorry, but I've found the answer. At line 10, my $line =~ m/textolibre=(.*?)&translated/g; declares the same variable a second time, and that's why it throws the errors.
Thanks!

Related

Issue with Perl Regex

New Perl coder here.
When I copy and paste the text from a website into a text file and read from that file, my Perl script works with no issues. When I use getstore to create a file from the website automatically, which is what I want, the output is a bunch of |'s.
The text looks identical whether I copy and paste it or download it with getstore. I'm unable to figure out the problem. Any help would be highly appreciated.
The output that I desire is as follows:
|www\.arkinsoftware\.in|www\.askmeaboutrotary\.com|www\.assculturaleincontri\.it|www\.asu\.msmu\.ru|www\.atousoft\.com|www\.aucoeurdelanature\.
Here is the code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore("http://www.malwaredomainlist.com/hostslist/hosts.txt", "malhosts.txt");
open(my $input, "<", "malhosts.txt");
while (my $line = <$input>) {
chomp $line;
$line =~ s/.*\s+//;
$line =~ s/\./\\\./g;
print "$line\|";
}
The run of | characters comes from the comment lines at the beginning of the file, which don't fit the pattern. So the solution is to skip every line that doesn't fit.
So instead of
$line =~ s/.*\s+//;
use
next unless $line =~ s/^127.*\s+//;
so that every line is ignored except those starting with 127.
Here's what I'd do:
my $first = 1;
while (<$input>) {
/^127\.0\.0\.1\s+(.+?)\s*$/ or next;
print '|' if !$first;
$first = 0;
print quotemeta($1);
}
This matches your input in a more precise way, and quotemeta takes care of true regex escaping.
I'd probably go with something like:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore( "http://www.malwaredomainlist.com/hostslist/hosts.txt",
"malhosts.txt" );
open( my $input, "<", "malhosts.txt" );
print join ( "|",
map { m/^\d/ && ! m/localhost/ ?
quotemeta ((split)[1]) : () } <$input> );
Gives:
0koryu0\.easter\.ne\.jp|1\-atraffickim\.tf|10\-trafficimj\.tf|109\-204\-26\-16\.netconnexion\.managedbroadband\.co\.uk|11\-atraasikim\.tf|11\.lamarianella\.info|12\-tgaffickvcmb\.tf| #etc.

Perl: Empty $1 regex value when matching?

Readers,
I have the following regex problem:
code
#!/usr/bin/perl -w
use 5.010;
use warnings;
my $filename = 'input.txt';
open my $FILE, "<", $filename or die $!;
while (my $row = <$FILE>)
{ # take one input line at a time
chomp $row;
if ($row =~ /\b\w*a\b/)
{
print "Matched: |$`<$&>$'|\n"; # the special match vars
print "\$1 contains '$1' \n";
}
else
{
#print "No match: |$row|\n";
}
}
input.txt
I like wilma.
this line does not match
output
Matched: |I like <wilma>|
Use of uninitialized value $1 in concatenation (.) or string at ./derp.pl line 14, <$FILE> line 22.
$1 contains ''
I am totally confused. It is matching, and I am checking things in a conditional, so why am I getting an empty result for $1? This isn't supposed to be happening. What am I doing wrong? How can I get 'wilma' to be in $1?
I looked here but this didn't help because I am getting a "match".
You don't have any parentheses in your regex. No parentheses, no $1.
I'm guessing you want the "word" value that ends in -a, so that would be /\b(\w*a)\b/.
Alternatively, since your whole regex only matches the bit you want, you can just use $& instead of $1, like you did in your debug output.
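A small sketch of the fix with the parentheses added. The helper name word_ending_in_a is invented for this example; the two input lines are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first word ending in "a", or undef.
sub word_ending_in_a {
    my ($row) = @_;
    # The parentheses create capture group 1, which is what fills $1.
    return $row =~ /\b(\w*a)\b/ ? $1 : undef;
}

for my $row ('I like wilma.', 'this line does not match') {
    my $word = word_ending_in_a($row);
    print defined $word ? "\$1 contains '$word'\n" : "No match: |$row|\n";
}
```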
Another example:
my $row = 'I like wilma.';
$row =~ /\b(\w+)\b\s*\b(\w+)\b\s*(\w+)\b/;
print join "\n", "\$&='$&'", "\$1='$1'", "\$2='$2'", "\$3='$3'\n";
The above code produces this output:
$&='I like wilma'
$1='I'
$2='like'
$3='wilma'

Put regex match only into array, not entire line

I am trying to check each line of a document for a regex match.
If the line has a match, I want to push the match only into an array.
In the code below, I thought that using the g modifier at the end of the regex delimiters would make $line's value the regex match only. Instead, $line's value is the entire line of the document containing the match...
my $line;
my @table;
while($line = <$input>){
    if($line =~ m/foo/g){
        push (@table, $line);
    }
}
print @table;
If anyone could help me get my matches into an array, it would be much appreciated.
Thanks.
p.s.
Still learning... so any explanations of concepts I may have missed is also much appreciated.
The g modifier, as in s///g, is for global search and replace; it does not turn $line into the match.
If you just want to push the matched text into an array, you need to capture it by enclosing the pattern in (). Captured elements are stored in the variables $1, $2, etc.
Try following modification to your code:
my @table;
while(my $line = <$input>){
    if($line =~ m/(foo)/){
        push (@table, $1);
    }
}
print @table;
Refer to this documentation for more details.
Or if you want to avoid needless use of global variables,
my @table;
while(my $line = <$input>){
    if(my @captures = $line =~ m/(foo)/){
        push @table, @captures;
    }
}
which simplifies to
my @table;
while(my $line = <$input>){
    push @table, $line =~ m/(foo)/;
}
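If a line can contain more than one match, the g modifier does become useful, but in list context. The following is only a sketch: collect_matches is a made-up helper, and the pattern foo\w* and the sample lines are assumptions chosen to show several matches on one line.

```perl
use strict;
use warnings;

# Hypothetical helper: collects every match of $pattern across all lines.
sub collect_matches {
    my ($pattern, @lines) = @_;
    my @table;
    for my $line (@lines) {
        # In list context, m//g returns every match on the line
        # (or every capture, if the pattern contains groups).
        push @table, $line =~ /$pattern/g;
    }
    return @table;
}

my @table = collect_matches(qr/foo\w*/, "herp de derp foo\n", "yerp fool foo flerp\n", "heyhey\n");
print join(',', @table), "\n";   # prints foo,fool,foo
```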
Expanding on jkshah's answer a little, I'm explicitly storing the matches in @matches instead of using the magic variable $1, which I find a little harder to read.
"__DATA__" is a simple way to store lines in a filehandle in a perl source file.
use strict;
use warnings;
my @table;
while(my $line = <DATA>){
    my @matches = $line =~ m/(foo)/;
    if(@matches) {
        warn "found: " . join(',', @matches );
        push(@table,@matches);
    }
}
print @table;
__DATA__
herp de derp foo
yerp fool foo flerp
heyhey
If your file is not very big (100-500 MB is fine with 2 GB of RAM), then you can use the code below. Here I am extracting numbers wherever they match in a line. It will be much faster than the foreach loop.
#!/usr/bin/perl
open my $file_h,"<abc" or die "ERROR-$!";
my @file = <$file_h>;
my $file_cont = join(' ',@file);
@file =();
my @match = $file_cont =~ /\d+/g;
print "@match";

Capturing groups in a variable regexp in Perl

I have a bunch of matches that I need to make, and they all use the same code, except for the name of the file to read from, and the regexp itself. Therefore, I want to turn the match into a procedure that just accepts the filename and regexp as a string. When I use the variable to try to match, though, the special capture variables stopped being set.
$line =~ /(\d+)\s(\d+)\s/;
That code sets $1 and $2 correctly, but the following leaves them undefined:
$regexp = "/(\d+)\s(\d+)\s/";
$line =~ /$regexp/;
Any ideas how I can get around this?
Thanks,
Jared
Use qr instead of quotes:
$regexp = qr/(\d+)\s(\d+)\s/;
$line =~ /$regexp/;
Quote your string using the perl regex quote-like operator qr
$regexp = qr/(\d+)\s(\d+)\s/;
This operator quotes (and possibly compiles) its STRING as a regular expression.
See the perldoc page for more info:
http://perldoc.perl.org/functions/qr.html
Quote your regex string using qr:
my $regex = qr/(\d+)\s(\d+)\s/;
my $file =q!/path/to/file!;
foo($file, $regex);
then in the sub:
sub foo {
    my $file = shift;
    my $regex = shift;
    open my $fh, '<', $file or die "can't open '$file' for reading: $!";
    while (my $line=<$fh>) {
        if ($line =~ $regex) {
            # do stuff
        }
    }
}
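As for why the original string version fails: inside double quotes, \d and \s lose their backslashes, and the slashes become literal characters of the pattern rather than delimiters. Here is a rough sketch of passing a compiled pattern around and getting the captures back. The helper first_captures and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captures from the first matching line.
sub first_captures {
    my ($regex, @lines) = @_;
    for my $line (@lines) {
        # A match in list context returns the capture groups directly.
        my @captures = $line =~ $regex;
        return @captures if @captures;
    }
    return;
}

my $regexp = qr/(\d+)\s(\d+)\s/;
my ($first, $second) = first_captures($regexp, "no digits here\n", "12 34 rest\n");
print "$first $second\n";   # prints 12 34

# A single-quoted string without the slashes also works, because the
# backslashes survive; qr// just compiles it once up front.
my $as_string = '(\d+)\s(\d+)\s';
my @again = first_captures(qr/$as_string/, "56 78 rest\n");
print "@again\n";           # prints 56 78
```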

How can I extract the substring in the last set of parentheses using Perl?

I am using Perl to parse out sizes in a string. What is the regex that I could use to accomplish this:
Example Data:
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Result:
Twin
Full XL
I know that I need to start at the end of the string and parse out the first set of parentheses; I'm just not sure how to do it.
#!/usr/bin/perl
$file = 'input.csv';
open (F, $file) || die ("Could not open $file!");
while ($line = <F>)
{
($field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9) = split ',', $line;
if ( $field1 =~ /^.*\((.*)\)/ ) {
print $1;
}
#print "$field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9, $1\n";
}
close (F);
Not getting any results. Maybe I am not doing this right.
The answer depends on if the size information you are looking for always appears within parentheses at the end of the string. If that is the case, then your task is simple:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA> ) {
last unless /\S/;
my ($size) = /\( ( [^)]+ ) \)$/x;
print "$size\n";
}
__DATA__
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Output:
C:\Temp> xxl
Twin
Full XL
Note that the code you posted can be better written as:
#!/usr/bin/perl
use strict;
use warnings;
my ($input_file) = @ARGV;
open my $input, '<', $input_file
    or die "Could not open '$input_file': $!";
while (my $line = <$input>) {
    chomp $line;
    my @fields = split /,/, $line;
    if ($fields[0] =~ /\( ( [^)]+ ) \)$/x ) {
        print $1;
    }
    print join('|', @fields), "\n";
}
close $input;
Also, you should consider using Text::xSV or Text::CSV_XS to process CSV files.
The following regular expression will match the content at the end of the string:
m/\(([^)]+)\)$/m
The m at the end enables multi-line mode: it changes $ to match at the end of each line, not only at the end of the string.
[edited to add the bit about multi-line strings]
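To see what the m modifier changes, here is a small sketch that runs the same pattern over a two-line string with and without it. The helper trailing_groups and its $multiline flag are made up for this example. When the input is read one line at a time, as in the question, the modifier should not be needed, since $ already matches before a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every parenthesized group that sits at a
# line end. $multiline switches the /m modifier on or off.
sub trailing_groups {
    my ($text, $multiline) = @_;
    # With /m, $ also matches before every embedded newline.
    return $text =~ /\(([^)]+)\)$/mg if $multiline;
    # Without /m, $ matches only at the very end of the string
    # (or just before a final newline).
    return $text =~ /\(([^)]+)\)$/g;
}

my $text = "Sleepwell Mattress (Twin)\nMagic Nite (Flip Free design) Mattress (Full XL)\n";
print join('|', trailing_groups($text, 0)), "\n";   # prints Full XL
print join('|', trailing_groups($text, 1)), "\n";   # prints Twin|Full XL
```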
Assuming your data arrives line by line, and you are only interested in the contents of the last set of parens:
if ( $string =~ /^.*\((.*)\)/ ) {
print $1;
}
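The reason this picks the last set of parentheses is that the leading .* is greedy: it first swallows the whole string and then backtracks only as far as the last "(". A minimal sketch using the two sample strings from the question (the helper name last_parenthesized is an assumption for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the contents of the last (...) in a
# string, or undef if there is none.
sub last_parenthesized {
    my ($string) = @_;
    # The greedy ^.* gives back characters only until the rest of the
    # pattern can match, which happens at the last "(".
    return $string =~ /^.*\((.*)\)/ ? $1 : undef;
}

for my $item ('Sleepwell Mattress (Twin)',
              'Magic Nite (Flip Free design) Mattress (Full XL)') {
    print last_parenthesized($item), "\n";
}
```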
A fancy regex is not really necessary here; make it easier on yourself. You can split on " (" (a space followed by an opening parenthesis) and take the last element. Of course, this only works when the data you want is always last and always in parentheses.
while(<>){
    @a = split / \(/, $_;
    print $a[-1]; # get the last element. do your own trimming
}
This is the answer as expressed in Perl5:
my $str = "Magic Nite (Flip Free design) Mattress (Full XL)";
$str =~ m/.*\((.*)\)/;
print "$1\r\n";