I am trying to deobfuscate code. This code uses a lot of long variable names which are substituted with meaningful names at the time of running the code.
How do I preserve the state while searching and replacing?
For instance, with an obfuscated line like this:
${${"GLOBALS"}["ttxdbvdj"]}=_hash(${$urqboemtmd}.substr(${${"GLOBALS"}["wkcjeuhsnr"]},${${"GLOBALS"}["gjbhisruvsjg"]}-${$rrwbtbxgijs},${${"GLOBALS"}["ibmtmqedn"]}));
There are multiple mappings in mappings.txt which match above obfuscated line like:
$rrwbtbxgijs = hash_length;
$urqboemtmd = out;
At the first run, it will replace $rrwbtbxgijs with hash_length in the obfuscated line above. Now, when it comes across the second mapping during the next iteration of the outer while loop, it will replace $urqboemtmd with out in the obfuscated line.
The problem is:
When it comes across first mapping, it does the substitution. However, when it comes across next mapping in the same line for a different matching string, the previous search/replace result is not there.
It should preserve the previous substitution. How do I do that?
I wrote a Perl script, which would pick one mapping from mapping.txt and search the entire obfuscated code for all the occurrences of this mapping and replace it with the meaningful text.
Here is the code I wrote:
#! /usr/bin/perl
use warnings;
($mapping, $input) = #ARGV;
open MAPPING, '<', $mapping
or die "couldn't read from the file, $mapping with error: $!\n";
while (<MAPPING>) {
chomp;
$line = $_;
($key, $value) = split("=", $line);
open INPUT, '<', $input;
while (<INPUT>) {
chomp;
if (/$key/) {
$_=~s/\Q$key/$value/g;
print $_,"\n";
}
}
close INPUT;
}
close MAPPING;
To match the literal meta characters inside your string, you can use quotemeta or:
s/\Q$key\E/$replace/
Just tell Perl not to interpret the characters in $key:
s/\Q$key/$value/g
Consider using B::Deobfuscate and gradually enter variable names into its configuration file as you figure out what they do.
I'm a little confused about your request to save state. What exactly are you doing/do you intend to do with the output? Here's an (untested) example of doing all the substitutions in one pass, if that helps?
my %map;
while ( my $line = <MAPPING> ) {
chomp $line;
my ($key, $value) = split("=", $line);
$map{$key} = $value;
}
close MAPPING;
my $search = qr/(#{[ join '|', map quotemeta, sort { length $b <=> length $a } keys %map ]})/;
while ( my $line = <INPUT> ) {
$line =~ s/$search/$map{$1}/g;
print OUTPUT $line;
}
Related
I have a text file which contains a protein sequence in FASTA format. FASTA files have the first line which is a header and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motifs VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position it was found. This is my code:
#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
}
#print $seq;
$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
$j=$i+1;
$motif= substr ($seq,$i,4);
if ($motif=~m/VSE(.)/) {
print "motif $motif found in position $j \n" ;
}
}
I'm pretty sure I have messed up the loop, but I don't know what went wrong. The output I get on cygwin is the following:
motif found in position 2
motif found in position 2
motif found in position 2
So some general perl tips:
Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.
Anyway, in trying to figure out what's going wrong (although the other post correctly points out that perl for loops need semicolons not commas) I rewrote it a little to accomplish what I think is the same result:
#!usr/bin/perl
use strict;
use warnings;
#read in source data
open (my $input_data,' '<', P51170.fasta.txt') or die $!;
#extract 'first line':
#>sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4
my $header = <$input_data>;
#slurp all the rest of the file into the seq string.
#$/ is the 'end of line' separate, thus we temporarily undefine it to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };
#remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;
#printing what either looks like for clarity
print $header, "\n\n";
print $seq, "\n\n";
#iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
# use pos() function to report where matches happened. $1 is the contents
# of the first 'capture bracket'.
print "motif $1 at ", pos($seq), "\n";
}
So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.
Use semicolons in the C-style loop:
for ($i=0; $i<=$l-4; $i++) {
Or, use a Perl style loop:
for my $i (0 .. $l - 4) {
But you don't have to loop over the positions, Perl can do that for you (see pos):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
chomp;
s/ //g;
$seq .= $_;
}
while ($seq =~ /(VSE.)/g) {
say "Motif $1 found at ", pos($seq) - 3;
}
Note that I followed some good practices:
I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.
Your base program can be reduced to a one-liner:
perl -0777 -nwE's/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
-0777 is slurp mode: the whole file is read in one go
-n uses either standard input, or a file name as argument as source for input
-E enable features (say in this case)
s/\s+//g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match
Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:
perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.
The first <> reads the header, which can be used later.
The one-liner as a program file looks like this:
use strict;
use warnings; # never skip these pragmas
use feature 'say';
my $h = <>; # header, skip for now
$_ = do { local $/; <> }; # slurp rest of file
s/\s+//g; # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;
Use it like this:
perl fasta.pl P51170.fasta.txt
I have been working on this for a little while now and can't seem to figure it out. I have a file containing a bunch of lines all structured like the one below meaning each line starts with "!" and has three separators "<DIV>".
!the<DIV>car<DIV>drove down the<DIV>road off into the distance
I am interested in retrieving the last string "road off into the distance" I can't seem to get it to work. Below I have listed the current code I have.
while($line = <INFILE>) {
$line =~ /<SEP>{3}(.*)/;
print $1;
}
Any help would be greatly appreciated!
The statement
#b = $a =~ /^!(.*?)<DIV>(.*?)<DIV>(.*?)<DIV>(.*)/
will split the string into a list, and you can then extract the 4th element with
$b[3]
If you really want only the last one, do this instead:
($text) = $a =~ /^!.*<DIV>(.*)/
I don't know whether you insist on regex or simply didn't think of else, but split will nicely do this
$text = (split '<DIV>', $str)[-1];
If you regularly have such repeating patterns split may well be better for the job than a pure regex. (Split also uses full regular expressions in its pattern, of course.)
ADDED
All this can be done directly, if you simply only need to pull the last thing from each line:
open my $fh, '<', $file;
my #text = map { (split '<DIV>')[-1] } <$fh>;
close $fh;
print "$_\n" for #text;
The split by default uses $_, which inside the map is the current element processed. For lines without a <DIV> this returns the whole line. A file handle in the list context serves all lines as a list; the list context is imposed by map here.
In case you want all text between delimiters you can do
my #rlines = map { [ split '<DIV>' ] } <$fh>;
where [ ] takes a reference to the list returned by split and thus #rlines contains references to arrays, each with text in between <DIV>s on a line. The leading ! is there though and to drop it a little more processing is needed.
Of course, for the map block you can use { (/.*<DIV>(.*)/)[0] } from Jim Garrison's answer for a single match, or modify the regex a little to catch'em all.
If performance is a factor then that's a little different question.
A simple substitution could work too:
while(<DATA>){
chomp;
my $text = (s/.*<DIV>//g, $_);
say $text;
}
Simple regex which answers your question:
my $match= '';
while($line = <INFILE>) {
($match) = $line =~/.*<DIV>(.*)/;
}
print $match, "\n";
I have a file containing regular expressions, e.g.:
City of (.*)
(.*) State
Now I want to read these (line by line), match them against a string, and print out the extraction (matched group). For example: The string City of Berlin should match with the first expression City of (.*) from the file, after that Berlin should be extracted.
This is what I've got so far:
use warnings;
use strict;
my #pattern;
open(FILE, "<pattern.txt"); # open the file described above
while (my $line = <FILE>) {
push #pattern, $line; # store it inside the #pattern variable
}
close(FILE);
my $exampleString = "City of Berlin"; # line that should be matched and
# Berlin should be extracted
foreach my $p (#pattern) { # try each pattern
if (my ($match) = $exampleString =~ /$p/) {
print "$match";
}
}
I want Berlin to be printed.
What happens with the regex inside the foreach loop?
Is it not compiled? Why?
Is there even a better way to do this?
Your patterns contain a newline character which you need to chomp:
while (my $line = <FILE>) {
chomp $line;
push #pattern, $line;
}
First off - chomp is the root of your problem.
However secondly - your code is also very inefficient. Rather than checking patterns in a foreach loop, consider instead compiling a regex in advance:
#!/usr/bin/env perl
use strict;
use warnings;
# open ( my $pattern_fh, '<', "pattern.txt" ) or die $!;
my #patterns = <DATA>;
chomp(#patterns);
my $regex = join( '|', #patterns );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
Why is this better? Well, because it scales more efficiently. With your original algorithm - if I have 100 lines in the patterns file, and 100 lines to check as example str, it means making 10,000 comparisons.
With a single regex, you're making one comparison on each line.
Note - normally you'd use quotemeta when reading in regular expressions, which will escape 'meta' characters. We don't want to do this in this case.
If you're looking for even more concise, you can use map to avoid needing an intermediate array:
my $regex = join( '|', map { chomp; $_ } <$pattern_fh> );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
so i'm currently stuck on this problem:
1. i declare a constant list, say LIST
2. i want to read through a file, which i do so line by line in a while loop, and if the line has a keyword from LIST, i print the line, or so something with it.
this is what i have currently:
use constant LIST => ('keyword1', 'keyword2', 'keyword3');
sub main{
unless(open(MYFILE, $file_read)){die "Error\n"};
while(<MYFILE>){
my $line = $_;
chomp($line);
if($line =~ m//){#here is where i'm stuck, i want is if $line has either of the keywords
print $line;
}
}
}
What should i do in that if statement to match what i want the program to do? and can i do so without having the $line variable and simply using $_? i only used $line because i thought grep would automatically place the constants in LIST into $_.
Thanks!
The easiest way is to define a quoted regular expression as your constant instead of a list:
use strict;
use warnings;
use autodie; # Will kill program on bad opens, closes, and writes
use feature qw(say); # Better than "print" in most situations
use constant {
LIST => qr/keyword1|keyword2|keyword3/, # Now a regular expression.
FILE_READ => 'file.txt', # You're defining constants, make this one too.
};
open my $read_fh, "<", FILE_READ; # Use scalars for file handles
# This isn't Java. You don't have to define "main" subroutine
while ( my $line = <$read_fh> ) {
chomp $line;
if ( $line =~ LIST ) { #Now I can use the constant as a regex
say $line;
}
}
close $read_fh;
By the way, if you don't use autodie, the standard way of opening a file and failing if it doesn't open is to use the or syntax:
open my $fh, "<", $file_name or die qq(Can't open file "$file_name": $!);
If you have to use a list as a constant, then you can use join to make the regular expression:
use constant LIST => qw( keyword1 keyword2 keyword3 );
...
my $regex = join "|", map LIST;
while ( my $line = <$file_fh> ) {
chomp $line;
if ( $line =~ /$regex/ ) {
say $line;
}
}
The join takes a list (in this case, a constant list), and separates each member by the string or character you give it. I hope your keywords contain no special regular expression characters. Otherwise, you need to quote those special characters.
Addendum
my $regex = join '|' => map +quotemeta, LIST; – Zaid
Thanks Zaid. I didn't know about the quotemeta command before. I had been trying various things with \Q and \E, but it started getting too complex.
Another way to do what Zaid did:
my #list = map { quotemeta } LIST;
my $regex = join "|", #list;
The map is a bit difficult for beginners to understand. map takes each element in LIST and runs the quotemeta command against it. This returns list which I assign to #list.
Imagine:
use constant LIST => qw( periods.are special.characters in.regular.expressions );
When I run:
my #list = map { quotemeta } LIST;
This returns the list:
my #list = ( "periods\.are", "special\.characters", "in\.regular\.expressions" );
Now, the periods are literal periods instead of special characters in the regular expression. When I run:
my $regex = join "|", #list;
I get:
$regex = "periods\.are|special\.characters|in\.regular\.expressions";
And that's a valid regular expression.
open (FH,"report");
read(FH,$text,-s "report");
$fill{"place"} = "Dhahran";
$fill{"wdesc:desc"} = "hot";
$fill{"dayno.days"} = 4;
$text =~ s/%(\w+)%/$fill{$1}/g;
print $text;
This is the content of the "report" template file
"I am giving a course this week in %place%. The weather is %wdesc:desc%
and we're now onto day no %dayno.days%. It's great group of blokes on the
course but the room is like the weather - %wdesc:desc% and it gets hard to
follow late in the day."
For reasons that I won't go into, some of the keys in the hash I'll be using will have dots (.) or colons (:) in them, but the regex stops working for these, so for instance in the example above only %place% gets correctly replaced. By the way, my code is based on this example.
Any help with the regex greatly appreciated, or maybe there's a better approach...
You could loosen it right up and use "any sequence of anything that isn't a %" for the replaceable tokens:
$text =~ s/%([^%]+)%/$fill{$1}/g;
Good answers so far, but you should also decide what you want to do with %foo% if foo isn't a key in the %fill hash. Plausible options are:
Replace it with an empty string (that's what the current solutions do, since undef is treated as an empty string in this context)
Leave it alone, so "%foo%" stays as it is.
Do some kind of error handling, perhaps printing a warning on STDERR, terminating the translation, or inserting an error indicator into the text.
Some other observations, not directly relevant to your question:
You should use the three-argument version of open.
That's not the cleanest way to read an entire file into a string. For that matter, for what you're doing you might as well process the input one line at a time.
Here's how I might do it (this version leaves unrecognized "%foo%" strings alone):
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH,,'<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
foreach my $key (keys %fill) {
$line =~ s/\Q%$key%/$fill{$key}/g;
}
print $line;
}
And here's a version that dies with an error message if there's an unrecognized key:
#!/usr/bin/perl
use strict;
use warnings;
my %fill = ( place => 'Dhahran',
'wdesc:desc' => 'hot',
'dayno.days' => 4 );
my $filename = 'report';
open my $FH,,'<', $filename or die "$filename: $!\n";
while (my $line = <$FH>) {
$line =~ s/%([^%]*)%/Replacement($1)/eg;
print $line;
}
sub Replacement {
my($key) = #_;
if (exists $fill{$key}) {
return $fill{$key};
}
else {
die "Unrecognized key \"$key\" on line $.\n";
}
}
http://codepad.org/G0WEDNyH
$text =~ s/%([a-zA-Z0-9_\.\:]+)%/$fill{$1}/g;
By default \w equates to [a-zA-Z0-9_], so you'll need to add in the \. and \:.