Using Perl to match all words after a particular word

Using Perl to match all words after a particular word - regex

I am using Perl and need to get all domain names from http://www.malwaredomainlist.com/hostslist/hosts.txt into a flat file.
I think the easiest way to do this is to use a regular expression but I can't get my head around how to build the expression.
my code so far:
#!/usr/bin/perl
use LWP::Simple;
$url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
$content = get $url;
open(my $fh, '>', '/home/jay/feed.txt');
#logic here
}
close $fh;
I'm not sure if I should loop over each line and perform an expression on that or if I should take the whole file as a string and work with that.

The page is just a text/plain document, so I think I would just copy and paste the page into my editor and remove the unwanted information. However if you would prefer a Perl program then this is all that is necessary. It uses LWP::Simple::get to fetch the text page and a regex to search it for lines starting with digits and dots, returning the second field of each
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw/ get /;
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
say for get($url) =~ /^[\d.]+\s+(\S+)/gam;
or as a one-liner
perl -MLWP::Simple=get -E"say for get(shift) =~ /^[\d.]+\s+(\S+)/gam" http://www.malwaredomainlist.com/hostslist/hosts.txt

Unless you have a particular need, iterating by line is the way forward. Otherwise you just tie up memory unnecessarily.
However when you're fetching a url, it's a bit academic - I would suggest that fetching it to a file first isn't a bad thing though, so you can re-process it without needing to refetch.
Given source data sample:
for ( split ( "\n", $content ) ) {
next unless m/^\d/; #skip lines that don't start with a digit.
my ( $IP, $hostname ) = split;
my $domainname = $hostname =~ s/^\w+\.//r;
print $domainname,"\n";
}
This doesn't entirely work with your list though, because in that list you have a mix of hostnames and domain names, and it's not actually all that easy to tell the difference.
After all, the 'tld' at the end might be .com or it might be .org.it

127.0.0.1\s+(.*)
should work fine with global modifier.
Demo

Unless saving the list file locally is a requirement (in which case you might be better off just using wget or curl), there is no need to save it in an external file to process it line-by-line.
You can instead open a filehandle to the string itself.
In the script below, extract_hosts would work the same whether you give it a reference to a string or a filename:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp qw( croak );
use LWP::Simple qw( get );
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
my $malware_hosts = get $url;
unless (defined $malware_hosts) {
die "Failed to get content from '$url'\n";
}
my $hosts = extract_hosts(\$malware_hosts);
print "$_\n" for #$hosts;
sub extract_hosts {
my $src = shift;
open my $fh, '<', $src
or croak "Failed to open '$src' for reading: $!";
my #hosts;
while (my $entry = <$fh>) {
next unless $entry =~ /\S/;
next if $entry =~ /^#/;
my (undef, $host) = split ' ', $entry;
push #hosts, $host;
}
close $fh
or croak "Failed to close '$src': $!";
\#hosts;
}
This will give you the list of hosts.

Code to grep the hostnames from the given file.
use LWP::Simple;
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
my $content = get $url;
my #server_names = split(/127\.0\.0\.1\s*/, $content);
open(my $fh, '>', '/home/jay/feed.txt');
print $fh "#server_names";
close $fh;

Here is another implementation. It uses HTML::Tiny which is part of the core so you don't have to install anything.
use HTTP::Tiny;
my $response = HTTP::Tiny->new->get('http://www.malwaredomainlist.com/hostslist/hosts.txt');
die "Failed!\n" unless $response->{success};
my #content;
for my $line ( split ( "\n", $response->{content} ) ){
next if ( $line =~ /^#|^$/);
push #content, ((split ( " ", $line ))[1]);
}
print Dumper (\#content);

Related

Why does this regex in perl work for one word but not another?

I'm new to perl so please excuse me if my question seems obvious. I made a small perl script that just examines itself to extract a particular substring I'm looking for and I'm getting results that I can't explain. Here is the script:
use 5.006;
use strict;
use warnings;
use File::Find;
my #files;
find(
sub { push #files, $File::Find::name unless -d; },
"."
);
my #filteredfiles = grep(/.pl/, #files);
foreach my $fileName (#filteredfiles)
{
open (my $fh, $fileName) or die "Could not open file $fileName";
while (my $row = <$fh>)
{
chomp $row;
if ($row =~ /file/)
{
my ($substring) = $row =~ /file\(([^\)]*)\)/;
print "$substring\n" if $substring;
}
}
close $fh;
}
# file(stuff)
# directory(stuff)
Now, when I run this, I get the following output:
stuff
[^\
Why is it printing the lines out of order? Since the "stuff" line occurs later in the file, shouldn't it print later?
Why is it printing that second line wrong? It should be "\(([^\". It's missing the first 3 characters.
If I change my regex to the following: /directory\(([^\)]*)\)/, I get no output. The only difference is the word. It should be finding the second comment. What is going on here?

use 5.006 kind of odd if you are just beginning to learn Perl ... That is an ancient version.
You should not build a potentially huge list of all files in all locations under the current directory and then filter it. Instead, push only the files you want to the list.
Especially with escaped meta characters, regex patterns can be become hard to read very quickly, so use the /x modifier to insert some whitespace into those patterns.
You do not have to match twice: Just check & capture at the same time.
If open fails, include the reason in the error message.
Your second question above does not make sense. You seem to expect your pattern to match the literal string file\(([^\)]*)\)/, but it cannot.
use strict;
use warnings;
use File::Find;
my #files;
find(
sub {
return if -d;
return unless / [.] pl \z/x;
push #files, $File::Find::name;
},
'.',
);
for my $file ( #files ) {
open my $fh, '<', $file
or die "Could not open file $file: $!";
while (my $line = <$fh>) {
if (my ($substring) = ($line =~ m{ (?:file|directory) \( ([^\)]*) \) }x)) {
print "$substring\n";
}
}
close $fh;
}
# file(stuff)
# directory(other)
Output:
stuff
other

Issue with Perl Regex

new perl coder here.
When I copy and paste the text from a website into a text file and read from that file, my perl script works with no issues. When I use getstore to create a file from the website automatically which is what I want, the output is a bunch of |'s.
The text looks identical when I copy and paste, or download the text with getstore.. I'm unable to figure out the problem. Any help would be highly appreciated.
The output that I desire is as follows:
|www\.arkinsoftware\.in|www\.askmeaboutrotary\.com|www\.assculturaleincontri\.it|www\.asu\.msmu\.ru|www\.atousoft\.com|www\.aucoeurdelanature\.
enter code here
Here is the code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore("http://www.malwaredomainlist.com/hostslist/hosts.txt", "malhosts.txt");
open(my $input, "<", "malhosts.txt");
while (my $line = <$input>) {
chomp $line;
$line =~ s/.*\s+//;
$line =~ s/\./\\\./g;
print "$line\|";
}

The bunch of | you get, is from the unfitting comment-lines at the beginning. So the solution is to ignore all "unfitting" lines.
So instead of
$line =~ s/.*\s+//;
use
next unless $line =~ s/^127.*\s+//;
so you would ignore every line except thos starting with 127.

Here's what I'd do:
my $first = 1;
while (<$input>) {
/^127\.0\.0\.1\s+(.+?)\s*$/ or next;
print '|' if !$first;
$first = 0;
print quotemeta($1);
}
This matches your input in a more precise way, and quotemeta takes care of true regex escaping.

I'd probably go with something like:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
getstore( "http://www.malwaredomainlist.com/hostslist/hosts.txt",
"malhosts.txt" );
open( my $input, "<", "malhosts.txt" );
print join ( "|",
map { m/^\d/ && ! m/localhost/ ?
quotemeta ((split)[1]) : () } <$input> );
Gives:
0koryu0\.easter\.ne\.jp|1\-atraffickim\.tf|10\-trafficimj\.tf|109\-204\-26\-16\.netconnexion\.managedbroadband\.co\.uk|11\-atraasikim\.tf|11\.lamarianella\.info|12\-tgaffickvcmb\.tf| #etc.

How to store regex results in Perl for building substitution strings?

I looked at this question for starters, but I'm not sure I need a hash table to store intermediate results. If so great, but I'm new to Perl, so unsure.
It seems like this would have to be done in a loop, to store each result in a scalar and then apply, then move to the next line. But again I'm new to this.
Scan lines for pattern. In this case, HTML. Yes, I know about HTML and regex, but without regex, how can I build strings dynamically from a search pattern?
If pattern matches, use formed string A to derive new string form B.
Scan lines again and substitute B for A.
In other words:
$stringA = 'alias="#[found by $pattern]"'
$stringB = 'alias="#[prepended string] . [found by $pattern] . [appended string]"'
What I have so far:
my $pattern = 'alias="#(.*?)"';
my %seen = (); # ?
sub read_file {
my ($file) = #_;
open FILE, '<:encoding(UTF-8)', $file or die "Could not open '$file' for reading $!";
local $/ = undef;
while ( my $line = <FILE> ) {
if ( $line =~ /($pattern)/ ) {
$seen{$1}; # store results
return $line;
}
}
close FILE;
}
use Data::Dumper;
say Dumper( \%seen );

I think you want
$line =~ s/($pattern)/ transform($1) /eg;
where transform($1) is the code that derives B from A ($1).
As for a non-regex solution, XPaths can be used as means of identifying HTML nodes using a language that even simpler than regex patterns.
my $xpath = '//#alias[starts-with(., "#")]';
my $doc = XML::LibXML->new->parse_html_file($qfn);
for my $node ($doc->findnodes($xpath)) {
transform($node);
}
$doc->toFile($qfn);

Several comments are in the code. Sample output is below.
Not sure if this does what you want, but hopefully something in it will help at all.
use strict;
use warnings;
my $pattern = 'alias="#(.*?)"';
my %seen = (); # defines an empty hash
sub read_file {
my ($file) = #_;
# open using lexical filehandle
open (my $fp, '<:encoding(UTF-8)', $file)
or die "Could not open '$file' for reading $!";
local $/ = undef; # effects 'slurp mode', that is, lets you read the entire file into one scalar.
my $line = <$fp>;
close ($fp); # it's all read in, so it can be safely closed here.
# loop and use the g modifier to process every match.
# see the perlre man page for full discussion of modifiers
while ( $line =~ /($pattern)/smg ) {
$seen{$1} = 0 if (!exists ($seen{$1}));
++$seen{$1};
}
}
# There was not call to read_file. This is just a "serving suggestion:"
my $filename = $ARGV[0] || die "USAGE: $0 filename\n";
read_file ($filename);
use Data::Dumper;
print Dumper( \%seen ); # use 'print', not 'say'
I ran it with some sample data as indicated by the egrep output:
$ egrep '<(foo|bar)' index.html
<foo alias="#foobar">it's foo!</foo>
<bar alias="#barfoo">it's bar!</bar>
And here is the result:
$ perl foo.pl index.html
$VAR1 = {
'alias="#foobar"' => 1,
'alias="#barfoo"' => 1
};
$

Perl: unable to get the correct match from the file

I need help with my script. I am writing a script that will check if the username is still existing in /etc/passwd. I know this can be done on BASH but as much as possible I want to avoid using it, and just focus on writing using Perl instead.
Okay, so my problem is that, my script could not find the right match in my $password_file. I still got the No root found error even though it is still in the file.
Execution of the script.
jpd#skriv ~ $ grep root /etc/passwd
root:x:0:0:root:/root:/bin/bash
jpd#skriv ~ $ ~/Copy/documents/scripts/my_perl/test.pl root
Applying pattern match (m//) to #id will act on scalar(#id) at /home/jpd/Copy/documents/scripts/my_perl/test.pl line 16.
No root found!
jpd#skriv ~ $
Also, why do I always get this "Applying pattern match..." warning?
Here's the code:
#!/usr/bin/perl
use strict;
use warnings;
my $stdin = $ARGV[0];
my $password_file = '/etc/passwd';
open (PWD, $password_file) || die "Error: $!\n";
my #lines = (<PWD>);
close PWD;
for my $line (#lines) {
my #list = split /:/, $line;
my #id = "$list[0]\n";
if (#id =~ m/$stdin/) {
die "Match found!\n";
} else {
print "No $stdin found!\n";
exit 0;
}
}
Thanks in advance! :)
Regards,
sedawkgrep
Perl Newbie

I have a few things to point out regarding your code:
Good job using use strict; and use warnings;. They should be included in EVERY perl script.
Pick meaningful variable names.
$stdin is too generic. $username does a better job of documenting the intent of your script.
Concerning your file processing:
Include use autodie; anytime you're working with files.
This pragma will automatically handle error messages, and will give you better information than just "Error: $!\n". Also, if you are wanting to do a manual error messages, be sure to remove the new line from your message or die won't report the line number.
Use Lexical file handles and the three argument form of open
open my $fh, '<', $password_file;
Don't load an entire file into memory unless you need to. Instead, use while loop and process the file line by line
Concerning your comparison: #id =~ m/$stdin/:
Always use a scalar to the left of comparison =~
The comparison operator binds a scalar to a pattern. Therefore the line #id =~ m/$stdin/ is actually comparing the size of #id to your pattern: "1" =~ m/$stdin/. This is obviously a bug.
Be sure to escape the regular expression special characters using quotemeta or \Q...\E:
$list[0] =~ m/\Q$stdin/
Since you actually want a direct equality, don't use a regex at all, but instead use eq
You're exiting after only processing the first line of your file.
In one fork you're dying if you find a match in the first line. In your other fork, you're exiting with the assumption that no other lines are going to match either.
With these changes, I would correct your script to the following:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $username = $ARGV[0];
my $password_file = '/etc/passwd';
open my $fh, '<', $password_file;
while (<$fh>) {
chomp;
my #cols = split /:/;
if ($cols[0] eq $username) {
die "Match found!\n";
}
}
print "No $username found!\n";

#!/usr/bin/perl
use strict;
use warnings;
my $stdin = $ARGV[0];
my $password_file = '/etc/passwd';
open (PWD,"<$password_file");
my #lines = <PWD>;
my #varr = grep (m/root/, #lines);
Then check varr array and split it if you need.

You'd be better off using a hash for key lookups, but with minimal modification this should work:
open my $in, '<', 'in.txt';
my $stdin = $ARGV[0];
while(<$in>){
chomp;
my #list = split(/\:/);
my ($id) = $list[0];
if ($id eq $stdin) {
die "Match found\n";
}
}

How to Circumvent Perl's string escaping the replacement string in s/// when it's read from a file?

This question is similar to my last one, with one difference to make the toy script more similar to my actual one.
Here is the toy script, replace.pl (Edit: now with 'use strict;', etc)
#! /usr/bin/perl -w
use strict;
open(REPL, "<", $ARGV[0]) or die "Couldn't open $ARGV[0]: $!!";
my %replacements;
while(<REPL>) {
chomp;
my ($orig, $new, #rest) = split /,/;
# Processing+sanitizing of orig/new here
$replacements{$orig} = $new;
}
close(REPL) or die "Couldn't close '$ARGV[0]': $!";
print "Performing the following replacements\n";
while(my ($k,$v) = each %replacements) {
print "\t$k => $v\n";
}
open(IN, "<", $ARGV[1]) or die "Couldn't open $ARGV[1]: $!!";
while ( <IN> ) {
while(my ($k,$v) = each %replacements) {
s/$k/$v/gee;
}
print;
}
close(IN) or die "Couldn't close '$ARGV[1]': $!";
So, now lets say I have two files, replacements.txt (using the best answer from the last question, plus a replacement pair that doesn't use substitution):
(f)oo,q($1."ar")
cat,hacker
and test.txt:
foo
cat
When I run perl replace.pl replacements.txt test.txt I would like the output to be
far
hacker
but instead it's '$1."ar"' (too much escaping) but the results are anything but (even with the other suggestions from that answer for the replacement string). The foo turns into ar, and the cat/hacker is eval'd to the empty string, it seems.
So, what changes do I need to make to replace.pl and/or replacements.txt? Other people will be creating the replacements.txt's, so I'd like to make that file as simple as possible (although I acknowledge that I'm opening the regex can of worms on them).
If this isn't possible to do in one step, I'll use macros to enumerate all possible replacement pairs for this particular file, and hope the issue doesn't come up again.

Please don't give us non-working toy scripts that don't use strict and warnings. Because one of the first things people will do in debugging is to turn those on, and you've just caused work.
Second tip, use the 3-argument version of open rather than the 2-argument version. It is safer. Also in your error checking do as perlstyle says (see http://perldoc.perl.org/perlstyle.html for the full advice) and include the file name and $!.
Anyways your problem is that the code you were including was q($1."ar"). When executed this returns the string $1."ar". Get rid of the q() and it works fine. BUT it causes warnings. That can be fixed by moving the quoting into the replace script, and out of the original script.
Here is a fixed script for you:
#! /usr/bin/perl -w
use strict;
open(REPL, "<", $ARGV[0]) or die "Couldn't open '$ARGV[0]': $!!";
my %replacements;
while(<REPL>) {
chomp;
my ($orig, $new) = split /,/;
# Processing+sanitizing of orig/new here
$replacements{$orig} = '"' . $new . '"';
}
close(REPL) or die "Couldn't close '$ARGV[0]': $!";
print "Performing the following replacements\n";
while(my ($k,$v) = each %replacements) {
print "\t$k => $v\n";
}
open(IN, "<", $ARGV[1]) or die "Couldn't open '$ARGV[1]': $!!";
while ( <IN> ) {
while(my($k,$v) = each %replacements) {
s/$k/$v/gee;
}
print;
}
close(IN) or die "Couldn't close '$ARGV[1]': $!";
And the modified replacements.txt is:
(f)oo,${1}ar
cat,hacker

You have introduced one more level of interpolation since the last question.
You can get the right result by either:
Lay a 3rd "e" modifier on your substitution
s/$k/$v/geee; # eeek
Remove a layer of interpolation in replacements.txt by making the first line
(f)oo,$1."ar"

Get rid of the q() in the replacement string;
Should be just
(f)oo,$1."ar"
as in ($k,$v) = split /,/, $_;
Warning: using external input data in evals is very, very dangerous
Or, just make it
(f)oo,"${1}ar"
No modification to the code is necessary either way e.g. s///gee.
Edit #drhorrible, if it doesen't work then you have other problems.
use strict;use warnings;
my $str = "foo";
my $repl = '(f)oo,q(${1}."ar")';
my ($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
$str = "foo";
$repl = '(f)oo,$1."ar"';
($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
$str = "foo";
$repl = '(f)oo,"${1}ar"';
($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
output:
${1}."ar"
far
far

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Using Perl to match all words after a particular word - regex

127.0.0.1\s+(.*) should work fine with global modifier. Demo

Code to grep the hostnames from the given file. use LWP::Simple; my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt'; my $content = get $url; my #server_names = split(/127\.0\.0\.1\s*/, $content); open(my $fh, '>', '/home/jay/feed.txt'); print $fh "#server_names"; close $fh;

Related

Why does this regex in perl work for one word but not another?

Issue with Perl Regex

How to store regex results in Perl for building substitution strings?

Perl: unable to get the correct match from the file

How to Circumvent Perl's string escaping the replacement string in s/// when it's read from a file?

Categories

Resources