I'm trying to read multiple files that share the same format and compute some statistics from them using a regex.
Specifically, I want to count identical items that appear within the [] brackets, for example:
NC_013618 NC_013633 ([T(nad6 trnE ,cob trnT ,)])
C_013481 NC_013479 ([T(trnP ,rrnS trnF trnV rrnL nad1 trnI ,)])
NC_013485 NC_003159 ([T(trnC ,trnY ,)])
NC_013554 NC_013254 ([T(trnR ,trnN ,)])
NC_013607 NC_013618 ([T(nad6 trnE ,cob trnT ,)])
The problem is that I'm not getting the right values. Below is my code:
use strict;
use warnings;
my %data;
@FILES = glob("../mitos-crex/*.out");
foreach my $file (@FILES) {
local $/ = undef;
open my $fh, '<', $file;
$data{$file} = <$fh>;
}
my @t;
my $c = 0;
foreach my $line (keys %data) {
foreach my $l ($data{$line}) {
print $l."\n";
($t[$c]) = $l =~ m/(\[.*\])/;
$c++;
}
}
#the problem is here the counter is not giving the right value
print $c;
my %counts;
$counts{$_}++ for @t;
thanks in advance
First of all, always use strict and use warnings. This is vital for all programming, as it quickly reveals simple problems that you might otherwise overlook or waste time debugging. It is especially important, and a simple courtesy, when you are asking others for help with your program.
You seem to have confused slurping an entire file into a single string with reading it into an array of lines. The way you have written it, each element $data{$file} is a single scalar value containing all of the file's data. You then try to iterate over it with foreach my $l ($data{$line}) { ... }, which executes just once and so finds only the first [...] string in the file.
Ordinarily I would say that you shouldn't read in all of your file data in this way, as the problem is likely to have a better streamed solution. But I don't know what else you want to use the captured data for, so my solution follows your own design.
I think you need to read the data into an array instead of a scalar, and then iterate over that in your loops. You must leave $/ defined so that the file is read in lines, and build an anonymous array with [ <$fh> ]. Then you can iterate over the lines with foreach my $line (@{ $data{$file} }) { ... }.
use strict;
use warnings;
my %data;
my @files = glob("../mitos-crex/*.out");
foreach my $file (@files) {
open my $fh, '<', $file or die $!;
$data{$file} = [ <$fh> ];
}
my $c = 0;
my @t;
foreach my $file (keys %data) {
foreach my $line (@{ $data{$file} }) {
($t[$c]) = $line =~ /(\[.*\])/;
$c++;
}
}
print $c;
my %counts;
$counts{$_}++ for @t;
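The streamed approach mentioned above could look like the following minimal sketch. It assumes each line holds at most one bracketed group and that the goal is a tally per distinct group. count_bracketed is a hypothetical helper name, and the sample lines are taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: tally every "[...]" group seen on the given lines.
sub count_bracketed {
    my @lines = @_;
    my %counts;
    for my $line (@lines) {
        # Capture from the first "[" to the last "]" on the line;
        # lines without a bracketed group are skipped.
        next unless $line =~ /(\[.*\])/;
        $counts{$1}++;
    }
    return \%counts;
}

my @sample = (
    "NC_013618 NC_013633 ([T(nad6 trnE ,cob trnT ,)])\n",
    "NC_013485 NC_003159 ([T(trnC ,trnY ,)])\n",
    "NC_013607 NC_013618 ([T(nad6 trnE ,cob trnT ,)])\n",
);

my $counts = count_bracketed(@sample);
print "$counts->{$_}\t$_\n" for sort keys %$counts;
```

In a real run the lines read from each file handle would be passed in place of @sample, so no file ever has to be held in memory as a whole.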
The counter is giving a correct value. Your problem is that you are slurping the file (reading it all in at once), but then only storing the first value found:
($t[$c]) = $data{$line} =~ m/(\[.*\])/; # only finds first value in file
Either loop over each file properly, and use the above regex for each line, or do something like:
push @t, ($data{$line} =~ m/(\[.*\])/g);
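To illustrate why the /g form works on a slurped file: in list context, a match with the /g modifier returns every captured group, and because . does not match a newline by default, each line's bracketed group is found separately. The sketch below uses made-up in-memory text in place of a real file, and all_bracketed is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return every "[...]" group in a slurped string.
# Must be called in list context so that /g yields all captures.
sub all_bracketed {
    my ($text) = @_;
    return $text =~ m/(\[.*\])/g;
}

# Stand-in for one slurped file (assumption: one group per line).
my $slurped = "A B ([T(trnC ,trnY ,)])\nC D ([T(trnR ,trnN ,)])\n";

my @t = all_bracketed($slurped);
print scalar(@t), " groups found\n";
print "$_\n" for @t;
```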
You should always use
use strict;
use warnings;
And solve the errors/warnings that result. Not doing so is a bad idea, and is only hiding the problems in your code -- not solving them.
Also, you should be aware that this statement:
foreach $l ($data{$line}) {
Only iterates once, because each "line" here is an entire file and $data{$line} is a single scalar value. Moreover, you iterate using $l as an alias, but you still use $data{$line} inside the loop, which makes the loop completely redundant.
I'm new to programming so bear with me. I'm working on a Perl script that asks the user the number of different items they want to search for and what those items are, separating them by pressing ENTER. That part works okay.
Then, the script is to open up a file, parse through, and print each line that matches with the items that the user initially listed. This is the part that I haven't been able to figure out yet. I've tried different variations of the code. I saw many people suggest using the index function but I had no luck with it. It does seem to be working when I swap $line =~ $array for $line =~ /TEXT/. I'm hoping someone here can shed some light.
Thanks in advance!
#!usr/bin/perl
use strict;
use warnings;
my $line;
my $array;
print "Enter number of items: ";
chomp(my $n = <STDIN>);
my @arrays;
print "Enter items, press enter to separate: \n";
for (1..$n) {
my $input = <STDIN>;
push @arrays, $input;
}
open (FILE, "file.txt") || die "can't open file!";
chomp(my @lines = <FILE>);
close (FILE);
foreach $array (@arrays) {
foreach $line (@lines) {
if ($line =~ $array) {
print $line, "\n";
}
}
}
@purplekushbear Welcome to Perl! In Perl, there is more than one way to do it (TIMTOWTDI), so please take this in the spirit of teaching in which it is given.
First off, your line one, the #! (shebang) line, is missing the leading / in the path to perl. In Linux/UNIX environments, if your script is executable, the path after the #! is used to run your program. If you do an ls on /usr/bin/perl you should see it. Sometimes it is found at /bin/perl or /usr/local/bin/perl.
When the person mentioned you forgot to chomp, they were referring to where you set the $input variable. Just chomp it like you did for $n and you will be OK.
As for the main part of your program, go back and read what you wanted to do; doing exactly that might be simpler. I think you have a good start on the problem: you seem to know that arrays start with a @ and scalar variables use the $ sigil, and you use strict, which is great.
Here is one way to solve your problem:
#!/usr/bin/perl
use strict;
use warnings;
print "Enter number of items: ";
chomp(my $num = <STDIN>);
my @items = ();
print "Enter items, press enter to separate: \n";
for (1 .. $num)
{
chomp(my $input = <STDIN>);
push @items, $input;
}
open (FILE, "file.txt") || die "can't open file because $!";
while (my $line = <FILE>)
{
foreach my $item (@items)
{
if ($line =~ m/$item/)
{
print $line;
last;
}
}
}
close (FILE);
Notice I used the name @items for your items instead of @arrays, which will make the code easier to understand when you come back to it someday. Always write with an eye towards maintainability. Anyway, ask if you have any questions, but since I left much of the code the same I don't think you will have much trouble figuring it out. Perldoc and Google are your friends. E.g. you can type:
perldoc -f last
to find out how last works. Have fun!
In your script you forgot to chomp the user input. You also need to exit the inner for loop with last once the pattern has matched.
Here is another way to do the same thing with a different method.
I'm using a single variable $regex instead of @array. In $regex I concatenate the user's input values separated by | (in a regex, | behaves like "or"). While concatenating, I apply quotemeta to escape any special characters. Then I precompile the regex in $regex with qr.
#!/usr/bin/perl
use strict;
use warnings;
print "Enter number of items: ";
chomp(my $n = <STDIN>);
my $regex;
print "Enter items, press enter to separate: \n";
for (1..$n)
{
chomp(my $input = <STDIN>);
$regex .= quotemeta($input)."|";
}
chop $regex; #remove the last pipe
$regex = qr($regex);
open my $fh, "<", "file.txt" or die "can't open file!";
while(<$fh>)
{
print if(/$regex/i);
}
Then, as user @ikegami commented, you can use Perl's built-in @ARGV array instead of STDIN, for example:
Your method
my @array = @ARGV;
Another method
my $regex = join "|", map { quotemeta $_ } @ARGV;
Then run the script as perl test.pl input1 input2 input3.
And always use the three-argument form of open to open a file.
I am using Perl and need to get all domain names from http://www.malwaredomainlist.com/hostslist/hosts.txt into a flat file.
I think the easiest way to do this is to use a regular expression but I can't get my head around how to build the expression.
my code so far:
#!/usr/bin/perl
use LWP::Simple;
$url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
$content = get $url;
open(my $fh, '>', '/home/jay/feed.txt');
#logic here
}
close $fh;
I'm not sure if I should loop over each line and perform an expression on that or if I should take the whole file as a string and work with that.
The page is just a text/plain document, so I think I would just copy and paste the page into my editor and remove the unwanted information. However if you would prefer a Perl program then this is all that is necessary. It uses LWP::Simple::get to fetch the text page and a regex to search it for lines starting with digits and dots, returning the second field of each
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw/ get /;
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
say for get($url) =~ /^[\d.]+\s+(\S+)/gam;
or as a one-liner
perl -MLWP::Simple=get -E"say for get(shift) =~ /^[\d.]+\s+(\S+)/gam" http://www.malwaredomainlist.com/hostslist/hosts.txt
Unless you have a particular need, iterating by line is the way forward. Otherwise you just tie up memory unnecessarily.
However when you're fetching a url, it's a bit academic - I would suggest that fetching it to a file first isn't a bad thing though, so you can re-process it without needing to refetch.
Given source data sample:
for ( split ( "\n", $content ) ) {
next unless m/^\d/; #skip lines that don't start with a digit.
my ( $IP, $hostname ) = split;
my $domainname = $hostname =~ s/^\w+\.//r;
print $domainname,"\n";
}
This doesn't entirely work with your list though, because in that list you have a mix of hostnames and domain names, and it's not actually all that easy to tell the difference.
After all, the 'tld' at the end might be .com or it might be .org.it
127.0.0.1\s+(.*)
should work fine with global modifier.
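As a sketch of that idea, the version below escapes the dots so they match literally and uses \S+ instead of .* so trailing whitespace or a carriage return is not captured. Both are small adjustments to the pattern above. hosts_from_text is a hypothetical helper name and the sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the host column of every "127.0.0.1 host" line.
# Must be called in list context so that /g yields all captures.
sub hosts_from_text {
    my ($text) = @_;
    # /m lets ^ match at the start of each line; /g collects every capture.
    return $text =~ /^127\.0\.0\.1\s+(\S+)/mg;
}

my $sample = "# comment line\n127.0.0.1  example.invalid\n127.0.0.1  ads.example.test\n";
print "$_\n" for hosts_from_text($sample);
```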
Unless saving the list file locally is a requirement (in which case you might be better off just using wget or curl), there is no need to save it in an external file to process it line-by-line.
You can instead open a filehandle to the string itself.
In the script below, extract_hosts would work the same whether you give it a reference to a string or a filename:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp qw( croak );
use LWP::Simple qw( get );
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
my $malware_hosts = get $url;
unless (defined $malware_hosts) {
die "Failed to get content from '$url'\n";
}
my $hosts = extract_hosts(\$malware_hosts);
print "$_\n" for @$hosts;
sub extract_hosts {
my $src = shift;
open my $fh, '<', $src
or croak "Failed to open '$src' for reading: $!";
my @hosts;
while (my $entry = <$fh>) {
next unless $entry =~ /\S/;
next if $entry =~ /^#/;
my (undef, $host) = split ' ', $entry;
push @hosts, $host;
}
close $fh
or croak "Failed to close '$src': $!";
\@hosts;
}
This will give you the list of hosts.
Code to grep the hostnames from the given file.
use LWP::Simple;
my $url = 'http://www.malwaredomainlist.com/hostslist/hosts.txt';
my $content = get $url;
my @server_names = split(/127\.0\.0\.1\s*/, $content);
open(my $fh, '>', '/home/jay/feed.txt');
print $fh "@server_names";
close $fh;
Here is another implementation. It uses HTTP::Tiny, which is part of the core, so you don't have to install anything.
use HTTP::Tiny;
use Data::Dumper;
my $response = HTTP::Tiny->new->get('http://www.malwaredomainlist.com/hostslist/hosts.txt');
die "Failed!\n" unless $response->{success};
my @content;
for my $line ( split ( "\n", $response->{content} ) ){
next if ( $line =~ /^#|^$/);
push @content, ((split ( " ", $line ))[1]);
}
print Dumper (\@content);
I am trying to deobfuscate code. This code uses a lot of long variable names which are substituted with meaningful names at the time of running the code.
How do I preserve the state while searching and replacing?
For instance, with an obfuscated line like this:
${${"GLOBALS"}["ttxdbvdj"]}=_hash(${$urqboemtmd}.substr(${${"GLOBALS"}["wkcjeuhsnr"]},${${"GLOBALS"}["gjbhisruvsjg"]}-${$rrwbtbxgijs},${${"GLOBALS"}["ibmtmqedn"]}));
There are multiple mappings in mappings.txt which match above obfuscated line like:
$rrwbtbxgijs = hash_length;
$urqboemtmd = out;
At the first run, it will replace $rrwbtbxgijs with hash_length in the obfuscated line above. Now, when it comes across the second mapping during the next iteration of the outer while loop, it will replace $urqboemtmd with out in the obfuscated line.
The problem is:
When it comes across first mapping, it does the substitution. However, when it comes across next mapping in the same line for a different matching string, the previous search/replace result is not there.
It should preserve the previous substitution. How do I do that?
I wrote a Perl script, which would pick one mapping from mapping.txt and search the entire obfuscated code for all the occurrences of this mapping and replace it with the meaningful text.
Here is the code I wrote:
#! /usr/bin/perl
use warnings;
($mapping, $input) = @ARGV;
open MAPPING, '<', $mapping
or die "couldn't read from the file, $mapping with error: $!\n";
while (<MAPPING>) {
chomp;
$line = $_;
($key, $value) = split("=", $line);
open INPUT, '<', $input;
while (<INPUT>) {
chomp;
if (/$key/) {
$_=~s/\Q$key/$value/g;
print $_,"\n";
}
}
close INPUT;
}
close MAPPING;
To match the literal meta characters inside your string, you can use quotemeta or:
s/\Q$key\E/$replace/
Just tell Perl not to interpret the characters in $key:
s/\Q$key/$value/g
Consider using B::Deobfuscate and gradually enter variable names into its configuration file as you figure out what they do.
I'm a little confused about your request to save state. What exactly are you doing/do you intend to do with the output? Here's an (untested) example of doing all the substitutions in one pass, if that helps?
my %map;
while ( my $line = <MAPPING> ) {
chomp $line;
my ($key, $value) = split("=", $line);
$map{$key} = $value;
}
close MAPPING;
my $search = qr/(@{[ join '|', map quotemeta, sort { length $b <=> length $a } keys %map ]})/;
while ( my $line = <INPUT> ) {
$line =~ s/$search/$map{$1}/g;
print OUTPUT $line;
}
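The one-pass idea above can be made self-contained as in the sketch below. It assumes mapping lines look like those in the question (a name, an equals sign, a replacement, and an optional trailing semicolon). parse_mappings and apply_mappings are hypothetical helper names, and the input line is a shortened stand-in for the obfuscated code.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "name = replacement;" lines into a hash.
sub parse_mappings {
    my %map;
    for my $line (@_) {
        # Tolerate surrounding spaces and an optional trailing semicolon.
        next unless $line =~ /^\s*(\S+)\s*=\s*(.*?)\s*;?\s*$/;
        $map{$1} = $2;
    }
    return \%map;
}

# Replace every mapped name in $text in a single pass, so each line
# keeps all of its substitutions.
sub apply_mappings {
    my ($map, $text) = @_;
    return $text unless %$map;
    # Longest keys first so a short key cannot shadow a longer one.
    my $alternation = join '|',
        map { quotemeta($_) }
        sort { length $b <=> length $a } keys %$map;
    $text =~ s/($alternation)/$map->{$1}/g;
    return $text;
}

my $map = parse_mappings('$rrwbtbxgijs = hash_length;', '$urqboemtmd = out;');
print apply_mappings($map, '_hash(${$urqboemtmd}.substr($x,$y-${$rrwbtbxgijs}))'), "\n";
```

Because every line passes through the substitution exactly once with all keys active, there is no earlier result to lose between iterations.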
open (FH,"report");
read(FH,$text,-s "report");
$fill{"place"} = "Dhahran";
$fill{"wdesc:desc"} = "hot";
$fill{"dayno.days"} = 4;
$text =~ s/%(\w+)%/$fill{$1}/g;
print $text;
This is the content of the "report" template file
"I am giving a course this week in %place%. The weather is %wdesc:desc%
and we're now onto day no %dayno.days%. It's great group of blokes on the
course but the room is like the weather - %wdesc:desc% and it gets hard to
follow late in the day."
For reasons that I won't go into, some of the keys in the hash I'll be using will have dots (.) or colons (:) in them, but the regex stops working for these, so for instance in the example above only %place% gets correctly replaced. By the way, my code is based on this example.
Any help with the regex greatly appreciated, or maybe there's a better approach...
You could loosen it right up and use "any sequence of anything that isn't a %" for the replaceable tokens:
$text =~ s/%([^%]+)%/$fill{$1}/g;
Good answers so far, but you should also decide what you want to do with %foo% if foo isn't a key in the %fill hash. Plausible options are:
Replace it with an empty string (that's what the current solutions do, since undef is treated as an empty string in this context)
Leave it alone, so "%foo%" stays as it is.
Do some kind of error handling, perhaps printing a warning on STDERR, terminating the translation, or inserting an error indicator into the text.
Some other observations, not directly relevant to your question:
You should use the three-argument version of open.
That's not the cleanest way to read an entire file into a string. For that matter, for what you're doing you might as well process the input one line at a time.
Here's how I might do it (this version leaves unrecognized "%foo%" strings alone):
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH, '<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
foreach my $key (keys %fill) {
$line =~ s/\Q%$key%/$fill{$key}/g;
}
print $line;
}
And here's a version that dies with an error message if there's an unrecognized key:
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH, '<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
$line =~ s/%([^%]*)%/Replacement($1)/eg;
print $line;
}
sub Replacement {
my($key) = @_;
if (exists $fill{$key}) {
return $fill{$key};
}
else {
die "Unrecognized key \"$key\" on line $.\n";
}
}
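A third variant, sketched here as one possibility, combines two of the options listed earlier: it leaves an unrecognized %foo% in place and prints a warning on STDERR. fill_template is a hypothetical helper name, and excluding whitespace from the token is an added assumption so that two unrelated percent signs in ordinary prose are not treated as one token.

```perl
use strict;
use warnings;

# Hypothetical helper: fill %key% tokens from a hash reference,
# leaving unknown tokens untouched and warning about them.
sub fill_template {
    my ($text, $fill) = @_;
    $text =~ s{%([^%\s]+)%}{
        exists $fill->{$1}
            ? $fill->{$1}
            : do { warn "Unrecognized key '$1'\n"; "%$1%" }
    }ge;
    return $text;
}

my %fill = (place => 'Dhahran', 'wdesc:desc' => 'hot', 'dayno.days' => 4);
print fill_template("In %place% it is %wdesc:desc%, day %dayno.days% of %unknown%.\n", \%fill);
```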
http://codepad.org/G0WEDNyH
$text =~ s/%([a-zA-Z0-9_\.\:]+)%/$fill{$1}/g;
By default \w equates to [a-zA-Z0-9_], so you'll need to add in the \. and \:.
I'm attempting to compare each line in a CSV file to each and every element (strings) I have stored in an array using Perl. I want to return/print-to-file the line from the CSV file only if it is not matched by any of the strings in the array. I've tried numerous kinds of loops to achieve this, but have not only not found a solution, but none of my attempts is really giving me clues as to where I'm going wrong. Below are a few samples of the loops I've tried:
while (<CSVFILE>) {
foreach $i (@lines) {
print OUTPUTFILE $_ if $_ !~ m/$i/;
}; #foreach
}; #while
AND:
foreach $i (@lines) {
open (CSVFILE , "< $csv") or die "Can't open $csv for read: $!";
while (<CSVFILE>) {
if ($_ !~ m/$i/) {
print OUTPUTFILE $_;
}; #if
}; #while
close (CSVFILE) or die "Cannot close $csv: $!";
}; #foreach
Here is a sample of the CSV file I am attempting:
1,c.03_05delAAG,null,71...
2,c.12T>G,null,24T->G,5...
3,c.87C>T,null,96C->T,82....
And the array elements (with regex escape characters):
c\.12T\>G
c\.97A\>C
Assuming only the above as input data, I would hope to get back:
1,c.03_05delAAG,null,71...
3,c.87C>T,null,16C->T....
since they do not contain any of the elements from the array. Is this a situation where hashes come into play? I don't have a great handle on them yet, aside from the standard "dictionary" definition. If anyone could help me get my head around this problem it would be greatly appreciated. At this point I might just do it manually, as there aren't that many and I need this out of the way ASAP, but since I wasn't able to find any answers searching anywhere else I figured it was worthwhile asking.
Use Perl 5.10.1 or better, so you can apply smart matching.
Also, don't use the implicit $_ when you're dealing with two loops, it gets too confusing and is error prone.
The following code (untested) might do the trick:
use 5.010;
use strict;
use warnings;
use autodie;
...
my @regexes = map { qr{$_} } @lines;
open my $out, '>', $outputfile;
open my $csv, '<', $csvfile;
while (my $line = <$csv>) {
print $out $line unless $line ~~ @regexes;
}
close $csv;
close $out;
The reason your code doesn't work, by the way, is that it will print a line if any of the elements in @lines don't match, and that will always be the case.
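If smart matching is unavailable or unwanted (it has been flagged experimental since Perl 5.18), the same "print only when no pattern matches" logic can be written with a nested grep. This is a minimal sketch using in-memory data modelled on the question's sample, and unmatched_lines is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines matched by none of the patterns.
sub unmatched_lines {
    my ($patterns, @lines) = @_;
    my @regexes = map { qr/$_/ } @$patterns;
    return grep {
        my $line = $_;
        # The inner grep counts the patterns that match; keep the line on zero.
        !grep { $line =~ $_ } @regexes;
    } @lines;
}

my @patterns = ('c\.12T\>G', 'c\.97A\>C');
my @csv = ("1,c.03_05delAAG,null\n", "2,c.12T>G,null\n", "3,c.87C>T,null\n");
print unmatched_lines(\@patterns, @csv);
```

The decision to print is made once per line, after all patterns have been tried, which is the step the loops in the question were missing.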