Can anyone spot the problem in my for loop in Perl?

I have a text file which contains a protein sequence in FASTA format. FASTA files have the first line which is a header and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motifs VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position it was found. This is my code:
#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
}
#print $seq;
$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
$j=$i+1;
$motif= substr ($seq,$i,4);
if ($motif=~m/VSE(.)/) {
print "motif $motif found in position $j \n" ;
}
}
I'm pretty sure I have messed up the loop, but I don't know what went wrong. The output I get on cygwin is the following:
motif found in position 2
motif found in position 2
motif found in position 2

So some general perl tips:
Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.
Anyway, in trying to figure out what's going wrong (although the other post correctly points out that perl for loops need semicolons not commas) I rewrote it a little to accomplish what I think is the same result:
#!/usr/bin/perl
use strict;
use warnings;
#read in source data
open (my $input_data, '<', 'P51170.fasta.txt') or die $!;
#extract 'first line':
#>sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4
my $header = <$input_data>;
#slurp all the rest of the file into the seq string.
#$/ is the input record separator (the 'end of line' marker), thus we temporarily undefine it to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };
#remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;
#printing what either looks like for clarity
print $header, "\n\n";
print $seq, "\n\n";
#iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
# use pos() function to report where matches happened (it returns the
# offset just past the end of the match). $1 is the contents of the
# first 'capture bracket', i.e. the X in VSEX.
print "motif $1 at ", pos($seq), "\n";
}
So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.
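One caveat worth spelling out: pos() returns the offset just past the end of the match, not where the match starts. Here is a minimal sketch, assuming a made-up sequence string rather than the real P51170 data, that reports the 1-based start of each hit through the @- array ($-[0] holds the offset where the last successful match began):

```perl
use strict;
use warnings;

# Hypothetical sequence, invented for illustration (not the real P51170 data).
my $seq = 'MAVSEKLLVSETQ';

my @hits;
while ( $seq =~ /(VSE.)/g ) {
    # $-[0] is the 0-based offset where the whole match starts;
    # add 1 to report a 1-based position.
    push @hits, sprintf 'motif %s found at position %d', $1, $-[0] + 1;
}
print "$_\n" for @hits;
```

With this sample string the two hits are VSEK at position 3 and VSET at position 9.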

Use semicolons in the C-style loop:
for ($i=0; $i<=$l-4; $i++) {
Or, use a Perl style loop:
for my $i (0 .. $l - 4) {
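As an aside, the comma version is still valid Perl, which is why no syntax error shows up. The small sketch below (the length 10 is an arbitrary assumption) suggests what probably happens: the parenthesised part is read as a three-element list for a foreach-style loop, so the body runs three times with $i stuck at 1. That would be consistent with the three "position 2" lines in the question:

```perl
use strict;
use warnings;

my $l = 10;    # hypothetical sequence length, chosen only for the demonstration
my $i;
my @seen;

# Commas instead of semicolons: Perl parses this as a foreach over a
# three-element list, so the body runs exactly three times.
for ($i = 0, $i <= $l - 4, $i++) {
    # $i was already incremented once while the list was being built.
    push @seen, $i;
}

print "iterations: ", scalar(@seen), ", values of \$i: @seen\n";
```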
But you don't have to loop over the positions, Perl can do that for you (see pos):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
chomp;
s/ //g;
$seq .= $_;
}
while ($seq =~ /(VSE.)/g) {
say "Motif $1 found at ", pos($seq) - 3;
}
Note that I followed some good practices:
I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.

Your base program can be reduced to a one-liner:
perl -0777 -nwE's/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
-0777 is slurp mode: the whole file is read in one go
-n uses either standard input, or a file name as argument as source for input
-E enable features (say in this case)
s/\s+//g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match
Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:
perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.
The first <> reads the header, which can be used later.
The one-liner as a program file looks like this:
use strict;
use warnings; # never skip these pragmas
use feature 'say';
my $h = <>; # header, skip for now
$_ = do { local $/; <> }; # slurp rest of file
s/\s+//g; # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;
Use it like this:
perl fasta.pl P51170.fasta.txt

Related

String replace in Perl

I am trying to deobfuscate code. This code uses a lot of long variable names which are substituted with meaningful names at the time of running the code.
How do I preserve the state while searching and replacing?
For instance, with an obfuscated line like this:
${${"GLOBALS"}["ttxdbvdj"]}=_hash(${$urqboemtmd}.substr(${${"GLOBALS"}["wkcjeuhsnr"]},${${"GLOBALS"}["gjbhisruvsjg"]}-${$rrwbtbxgijs},${${"GLOBALS"}["ibmtmqedn"]}));
There are multiple mappings in mappings.txt which match above obfuscated line like:
$rrwbtbxgijs = hash_length;
$urqboemtmd = out;
At the first run, it will replace $rrwbtbxgijs with hash_length in the obfuscated line above. Now, when it comes across the second mapping during the next iteration of the outer while loop, it will replace $urqboemtmd with out in the obfuscated line.
The problem is:
When it comes across first mapping, it does the substitution. However, when it comes across next mapping in the same line for a different matching string, the previous search/replace result is not there.
It should preserve the previous substitution. How do I do that?
I wrote a Perl script, which would pick one mapping from mapping.txt and search the entire obfuscated code for all the occurrences of this mapping and replace it with the meaningful text.
Here is the code I wrote:
#! /usr/bin/perl
use warnings;
($mapping, $input) = @ARGV;
open MAPPING, '<', $mapping
or die "couldn't read from the file, $mapping with error: $!\n";
while (<MAPPING>) {
chomp;
$line = $_;
($key, $value) = split("=", $line);
open INPUT, '<', $input;
while (<INPUT>) {
chomp;
if (/$key/) {
$_=~s/\Q$key/$value/g;
print $_,"\n";
}
}
close INPUT;
}
close MAPPING;
To match the literal meta characters inside your string, you can use quotemeta or:
s/\Q$key\E/$replace/
Just tell Perl not to interpret the characters in $key:
s/\Q$key/$value/g
Consider using B::Deobfuscate and gradually enter variable names into its configuration file as you figure out what they do.
I'm a little confused about your request to save state. What exactly are you doing/do you intend to do with the output? Here's an (untested) example of doing all the substitutions in one pass, if that helps?
my %map;
while ( my $line = <MAPPING> ) {
chomp $line;
my ($key, $value) = split("=", $line);
$map{$key} = $value;
}
close MAPPING;
my $search = qr/(@{[ join '|', map quotemeta, sort { length $b <=> length $a } keys %map ]})/;
while ( my $line = <INPUT> ) {
$line =~ s/$search/$map{$1}/g;
print OUTPUT $line;
}
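For a version that runs without any files, here is a self-contained sketch of the same single-pass idea. The mapping lines are inlined in an array and the obfuscated line is shortened; both are assumptions made for illustration. Unlike the plain split on "=", it also trims the spaces around the equals sign and the trailing semicolon that appear in the question's mapping format:

```perl
use strict;
use warnings;

# Hypothetical mapping lines in the "$name = replacement;" shape from the question.
my @mapping_lines = (
    '$rrwbtbxgijs = hash_length;',
    '$urqboemtmd = out;',
);

my %map;
for my $line (@mapping_lines) {
    # Capture the key and the value, dropping surrounding whitespace
    # and the trailing semicolon; skip lines that do not fit the shape.
    my ($key, $value) = $line =~ /^\s*(\S+)\s*=\s*(.*?)\s*;?\s*$/ or next;
    $map{$key} = $value;
}

# Longest keys first, so a short name cannot shadow a longer one containing it.
my $alternation = join '|',
    map { quotemeta($_) } sort { length $b <=> length $a } keys %map;
my $search = qr/($alternation)/;

# Shortened, hypothetical stand-in for one obfuscated line.
my $code = 'substr(${$urqboemtmd}, 0, ${$rrwbtbxgijs});';

# Every mapping is applied to the line in one pass, so no earlier
# replacement is lost before the line is printed.
(my $clean = $code) =~ s/$search/$map{$1}/g;
print "$clean\n";
```

Because all keys are tried against each line in a single substitution, there is no state to carry between passes over the input file.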

Perl regex: How to find in a file a word typed by a user

I am writing a script to read a LOG file. I want the user to type a word and then look it up and print the line (from a string) matching the word.
I'm just learning Perl so please be very specific and simple so that I can understand it.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>)
if ($line =~ /userInput/)
print $line;
I know that this is not perfect but I'm just learning.
You were close. You need to expand the variable in the pattern match.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>) {
if ($line =~ /$userInput/) { # note extra dollar sign
print $line;
}
}
Be aware that that is a pattern match, so you are searching with a string that potentially contains wildcards in it. If you want a literal string, put a \Q in front of the variable as you interpolate it: /\Q$userInput/.
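A short sketch of the difference, using an invented search term and log line (both are assumptions, not taken from the question):

```perl
use strict;
use warnings;

my $user_input = 'foo+bar';                 # hypothetical term typed by a user
my $line       = "error near foo+bar\n";    # hypothetical log line

# Unquoted: '+' is a quantifier, so the pattern means
# "fo", then one or more "o", then "bar".
my $as_pattern = $line =~ /$user_input/ ? 'match' : 'no match';

# With \Q every metacharacter is escaped and the term is matched literally.
my $as_literal = $line =~ /\Q$user_input/ ? 'match' : 'no match';

print "as pattern: $as_pattern\n";
print "as literal: $as_literal\n";
```

The unquoted form fails to find the literal text foo+bar, while the \Q form finds it.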
Something like /.*\bWORD\b.*/ might work (though it is not tested)
print $line if ($line =~ /.*\bWORD\b/);
@NewLearner
\b is for word boundaries
http://www.regular-expressions.info/wordboundaries.html
If you're doing just one lookup, using a while loop is fine. Though of course you'll need to fix your syntax.
You could also use grep:
print grep /$userInput/, <INPUT>;
If you want to do multiple lookups, you can either reopen the file handle (if the file is large), or store it in an array:
print grep /$userInput/, @array;
You'll have meta characters in your input, of course. This can be a good thing, or bad, depending on your users. For example, an experienced user would recognize the option to refine his search by entering a search term such as ^foo(?=bar), whereas other people may get very confused when they can't find the string foo+bar.
A way to escape meta characters is by using quotemeta on your input. Another is to use \Q ... \E inside your regex.
$userInput = quotemeta($userInput);
# or
print grep /\Q$userInput\E/, <INPUT>;
I believe if I were you, I would use a subroutine for the lookup. That way you can perform as many lookups as you like rather handily.
use strict;
use warnings; # ALWAYS use these
my $inputfile = 'logfile.txt'; # placeholder path (an assumption); point this at your log file
print "Please Enter the word to find: ";
chomp (my $userInput = <>); # <> is a more flexible handle
print lookup($userInput);
sub lookup {
my $word = shift;
open my $fh, "<", $inputfile or die $!;
my @hits;
while (<$fh>) {
push @hits, $_ if /\Q$word\E/;
}
return @hits;
}

Perl regex which grabs ALL double-letter occurrences in a line

Still plugging away at teaching myself Perl. I'm trying to write some code that will count the lines of a file that contain double letters and then place parentheses around those double letters.
Now what I've come up with will find the first occurrence of double letters, but not any other ones. For instance, if the line is:
Amp, James Watt, Bob Transformer, etc. These pioneers conducted many
My code will render this:
19 Amp, James Wa(tt), Bob Transformer, etc. These pioneers conducted many
The "19" is the count (of lines containing double letters) and it gets the "tt" of "Watt" but misses the "ee" in "pioneers".
Below is my code:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (/(\w)\1/) {
$counter += 1;
s/$&/\($&\)/g;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
What am I overlooking?
use strict;
use warnings;
use 5.010;
use autodie;
my $file = '/path/to/file/electricity.txt';
open my $fh, '<', $file;
my $counter = 0;
while (<$fh>) {
chomp;
if (/(\w)\1/) {
$counter++;
s/
(?<full>
(?<letter>\p{L})
\g{letter}
)
/($+{full})/xg;
$_ = $counter . ' ' . $_;
}
say;
}
You are overlooking a few things:
strict and warnings;
5.010 (or higher!) for say;
autodie, so you don't have to keep typing those 'or die';
lexical filehandles and the three-argument form of open;
a bit nitpicky, but knowing when (not) to use parens for function calls;
understanding why you shouldn't use $&;
the autoincrement operator.
(Retracted: I originally wrote here that, on the regex part specifically, $& is only set on matches (m//), not on substitutions. That was wrong; ysth is right as usual. Sorry!)
(I took the liberty of modifying your regex a bit; it makes use of named captures - (?<name>...) instead of bare parens, accessed through \g{name} notation inside the regex, and the %+ hash outside of it - and Unicode-style properties - \p{Etc}. A lot more about those in perlre and perluniprops, respectively.)
You need to use a back reference:
#! /usr/bin/env perl
use warnings;
use strict;
my $line = "this is a doubble letter test of my scrippt";
$line =~ s/([[:alpha:]])(\1)/($1$2)/g;
print "$line\n";
And now the test.
$ ./test.pl
this is a dou(bb)le le(tt)er test of my scri(pp)t
It works!
When you do a substitution, you use the $1 to represent what is in the parentheses. When you are referring to a part of the regular expression itself, you use the \1 form.
The [[:alpha:]] is a special POSIX class. You can find out more information by typing in
$ perldoc perlre
at the command line.
You're overcomplicating things by messing around with $&. s///g returns the number of substitutions performed when used in scalar context, so you can do it all in one shot without needing to count matches by hand or track the position of each match:
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = $text =~ s/(\w)\1/($1$1)/g;
print "$doubles $text\n";
Output:
4 James Wa(tt), a pion(ee)r of wa(tt)age engin(ee)ring
Edit: OP stated in comments that the exercise in question says not to use =~, so here's a non-regex-based solution, since all regex matches use =~ (implicitly or explicitly):
#!/usr/bin/env perl
use strict;
use warnings;
my $text = 'James Watt, a pioneer of wattage engineering';
my $doubles = 0;
for my $i (reverse 1 .. length $text) {
if (substr($text, $i, 1) eq substr($text, $i - 1, 1)) {
$doubles++;
substr($text, $i - 1, 2) = '(' . substr($text, $i - 1, 2) . ')';
}
}
print "$doubles $text\n";
The problem is that you're using $& in the second regex, which only holds the first occurrence of a double-letter pair
if (/(\w)\1/) { # first occurrence matched, so the pattern in the replace regex will only be that particular pair of double letters
Try doing something like this:
s/(\w)\1/\($1$1\)/g; instead of s/$&/\($&\)/g;
Full code after editing:
$file = '/path/to/file/electricity.txt';
open(FH, $file) || die "Cannot open the file\n";
my $counter=0;
while (<FH>) {
chomp();
if (s/(\w)\1/\($1$1\)/g) {
$counter++;
print "\n\n$counter $_\n\n";
} else {
print "$_\n";
}
}
close(FH);
Notice that you can use the s///g replace in a conditional statement, which is true when a replacement occurred.

parse multiple lines in perl regular expression and extract value

I am a beginner in Perl. I have a text file with text similar to the sample below. I need to extract VALUE="<NEEDED VALUE>". Say, for SPINACH, I should be getting SALAD alone.
How can I use a Perl regex to get the value? I need to parse multiple lines to get it, i.e. between each #ifonly --- #endifonly
$cat check.txt
while (<$file>)
{
if (m/#ifonly .+ SPINACH .+ VALUE=(")([\w]*)(") .+ #endifonly/g)
{
my $chosen = $2;
}
}
#ifonly APPLE CARROT SPINACH
VALUE="SALAD" REQUIRED="yes"
QW RETEWRT OIOUR
#endifonly
#ifonly APPLE MANGO ORANGE CARROT
VALUE="JUICE" REQUIRED="yes"
as df fg
#endifonly
use strict;
use warnings;
use 5.010;
while (<DATA>) {
my $rc = /#ifonly .+ SPINACH/ .. (my ($value) = /VALUE="([^"]*)"/);
next unless $rc =~ /E0$/;
say $value;
}
__DATA__
#ifonly APPLE CARROT SPINACH
VALUE="SALAD" REQUIRED="yes"
QW RETEWRT OIOUR
#endifonly
#ifonly APPLE MANGO ORANGE CARROT
VALUE="JUICE" REQUIRED="yes"
as df fg
#endifonly
This uses a small trick described by brian d foy here. As the link describes, it uses the scalar range operator / flipflop.
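To make the E0 test less mysterious, here is a minimal sketch of what the flip-flop returns on each line. The input lines and the START/END markers are invented for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical input lines, invented for the demonstration.
my @lines = ('before', 'START', 'inside', 'END', 'after');

my @trace;
for (@lines) {
    # In scalar context .. is the flip-flop: it is false until the left
    # test matches, then yields a running sequence number, with "E0"
    # appended on the line where the right test ends the range.
    my $rc = /START/ .. /END/;
    push @trace, "$_=" . ($rc || '-');
}
print "@trace\n";
```

In the answer above, the right-hand test is the VALUE match itself, so the line whose return value ends in E0 is exactly the line where $value was captured.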
In case your file is very big (or you want to read it line by line for some other reason) you could do it as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
my ($file, $keyword);
# now get command line options (see Usage note below)
GetOptions(
"f=s" => \$file,
"k=s" => \$keyword,
);
# if either the file or the keyword has not been provided, display a
# help text and exit
if (! $file || ! $keyword) {
print STDERR<<EOF;
Usage: script.pl -f filename -k keyword
EOF
exit(1);
}
my $found; # indicator that the keyword has been found
my $returned_word; # will store the word you want to retrieve
open FILE, "<$file" or die "Cannot open file '$file': $!";
while (<FILE>) {
if (/$keyword/) {
$found = 1;
}
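# the following condition will be true for every line from one that
# starts with '#ifonly' through the next one that starts with
# '#endifonly' - but only if the keyword has been found!
# (the inner parentheses matter: && binds tighter than .., so without
# them $found would only guard the end test)
if ((/^#ifonly/ .. /^#endifonly/) && $found) {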
if (/VALUE="(\w+)"/) {
$returned_word = $1;
print "looking for $keyword --> found $returned_word\n";
last; # if you want to get ALL values after the keyword
# remove the 'last' statement, as it makes the script
# exit the while loop
}
}
}
close FILE;
You can read the file contents in a string and then search for the pattern in the string:
my $file;
$file.=$_ while(<>);
if($file =~ /#ifonly.+?\bSPINACH\b.+?VALUE="(\w*)".+?#endifonly/s) {
print $1;
}
Your original regex needs some tweaking:
You need to make your quantifiers non-greedy.
Use the s modifier to make . match newlines as well.
Here's another answer based on the flip-flop operator:
use strict;
use warnings;
use 5.010;
while (<$file>)
{
if ( (/^#ifonly.*\bSPINACH\b/ .. /^#endifonly/) &&
(my ($chosen) = /^VALUE="(\w+)"/) )
{
say $chosen;
}
}
This solution applies the second test to all of the lines in the range. The trick @Hugmeir used to exclude the start and end lines isn't needed because the "inner" regex, /^VALUE="(\w+)"/, can never match them anyway (I added the ^ anchor to all regexes to make doubly sure of that).
These two lines in one answer given two days ago
my $file;
$file.=$_ while(<>);
are not very efficient. Perl will likely read the file in big chunks, break those chunks into lines of text for the <> and then the .= will join those lines back to make a big string. It would be more efficient to slurp the file. The basic approach is to alter $/, the input record separator.
undef $/;
$file = <>;
The module File::Slurp (see perldoc File::Slurp) may be even better.

How can I extract the substring in the last set of parentheses using Perl?

I am using Perl to parse out sizes in a string. What is the regex that I could use to accomplish this:
Example Data:
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Result:
Twin
Full XL
I know that I need to start at the end of the string and parse out the first set of parenthesis just not sure how to do it.
#!/usr/bin/perl
$file = 'input.csv';
open (F, $file) || die ("Could not open $file!");
while ($line = <F>)
{
($field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9) = split ',', $line;
if ( $field1 =~ /^.*\((.*)\)/ ) {
print $1;
}
#print "$field1,$field2,$field3,$field4,$field5,$field6,$field7, $field8, $field9, $1\n";
}
close (F);
Not getting any results. Maybe I am not doing this right.
The answer depends on if the size information you are looking for always appears within parentheses at the end of the string. If that is the case, then your task is simple:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA> ) {
last unless /\S/;
my ($size) = /\( ( [^)]+ ) \)$/x;
print "$size\n";
}
__DATA__
Sleepwell Mattress (Twin)
Magic Nite (Flip Free design) Mattress (Full XL)
Output:
C:\Temp> xxl
Twin
Full XL
Note that the code you posted can be better written as:
#!/usr/bin/perl
use strict;
use warnings;
my ($input_file) = @ARGV;
open my $input, '<', $input_file
or die "Could not open '$input_file': $!";
while (my $line = <$input>) {
chomp $line;
my @fields = split /,/, $line;
if ($fields[0] =~ /\( ( [^)]+ ) \)$/x ) {
print $1;
}
print join('|', @fields), "\n";
}
close $input;
Also, you should consider using Text::xSV or Text::CSV_XS to process CSV files.
The following regular expression will match the content at the end of the string:
m/\(([^)]+)\)$/m
The m at the end handles multi-line strings: it changes the $ to match at the end of each line, not just the end of the string.
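A small sketch of that behaviour, assuming the two example lines from the question arrive as one multi-line string:

```perl
use strict;
use warnings;

# The question's two example lines, joined into one string for illustration.
my $text = "Sleepwell Mattress (Twin)\n"
         . "Magic Nite (Flip Free design) Mattress (Full XL)\n";

# With /m, $ matches at the end of every line; /g in list context
# returns the captured group from each match.
my @sizes = $text =~ /\(([^)]+)\)$/mg;
print "$_\n" for @sizes;
```

Only the parenthesised groups that sit at the end of a line are captured, so "(Flip Free design)" is skipped.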
[edited to add the bit about multi-line strings]
Assuming your data arrives line by line, and you are only interested in the contents of the last set of parens:
if ( $string =~ /^.*\((.*)\)/ ) {
print $1;
}
A fancy regex is not really necessary here. Make it easier on yourself: you can split on " (" (a space followed by an opening parenthesis) and take the last element. Of course, this only works when the data you want is always at the end and in parentheses.
while(<>){
@a = split / \(/, $_;
print $a[-1]; # get the last element. do your own trimming
}
This is the answer as expressed in Perl5:
my $str = "Magic Nite (Flip Free design) Mattress (Full XL)";
$str =~ m/.*\((.*)\)/;
print "$1\r\n";