How to remove the whitespaces in fasta file using perl?

How to remove the whitespaces in fasta file using perl? - regex

My fasta file
>1a17_A a.118.8 TPR-like
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS
Else try this http://www.ncbi.nlm.nih.gov/nuccore/?term=keratin for fasta files.
open(fas,'d:\a4.fas');
$s=<fas>;
#fasta = <fas>;
#r1 = grep{s/\s//g} #fasta; #It is not remove the white space
#r2 = grep{s/(\s)$//g} #fasta; #It is not working
#r3 = grep{s/.$//g} #fasta; #It is remove the last character, but not remove the last space
print "#r1\n#r2\n#r3\n";
These codes are give the outputs is:
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRT
ECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAG
DEHKRSVVDSLDIES MTIEDEYS
I expect Remove the whitespaces from line two and above the lines. How can i do it?

Using perl one liner,
perl -i -pe 's|[ \t]||g' a4.fas
removing all white spaces, including new lines,
perl -i -pe 's|\s||g' a4.fas

use strict;
use warnings;
while(my $line = <DATA>) {
$line =~ s/\s+//g;
print $line;
}
__DATA__
PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS
LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY
ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES
MTIEDEYS

grep is the wrong choice to make changes to an array. It filters the elements of the input array, passing as output only those elements for which the expression in the braces { .. } is true.
A substitution s/// is true unless it made no changes to the target string, so of your grep statements,
#r1 = grep { s/\s//g } #fasta
This removes all spaces, including newlines, from the strings in #fasta. It puts in #r1 only those elements that originally contained whitespace, which is probably all of them as they all ended in newline.
#r2 = grep { s/(\s)$//g } #fasta
Because of the anchor $, this removes the character before the newline at the end of the string if it is a whitespace character. It also removes the newline. Any whitespace before the end of the string is untouched. It puts in #r2 only those elements that end in whitespace, which is probably all of them as they all ended in newline.
#r3 = grep { s/.$//g } #fasta;
This removes the character before the newline, whether it is whitespace or not. It leaves the newline, as well as any whitespace before the end. It puts in #r3 only those elements that contain more than just a newline, which again is probably all of them.
I think you want to retain the newlines (which are normally considered as whitespace).
This example will read the whole file, apart from the header, into the variables $data, and then use tr/// to remove spaces and tabs.
use strict;
use warnings;
use 5.010;
use autodie;
my $data = do {
open my $fas, '<', 'D:\a4.fas';
<$fas>; # Drop the header
local $/;
<$fas>;
};
$data =~ tr/ \t//d;
print $data;

Per perlrecharclass:
\h matches any character considered horizontal whitespace; this includes the platform's space and tab characters and several others listed in the table below. \H matches any character not considered horizontal whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.
Therefore the following will display your file with horizontal spacing removed:
perl -pe "s|\h+||g" d:\a4.fas
If you don't want to display the header, just add a condition with $.
perl -ne "s|\h+||g; print if $. > 1" d:\a4.fas
Note: I used double quotes in the above commands since your D:\ volume implies you're likely on Windows.

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

This works:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n/z/gm"
aaazzzbbbz
This doesn't match anything:
echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"
aaa
bbb
How do I fix, so the regex matches two consecutive newlines?

A linefeed is matched by \n
echo "a\n\n\b" | perl -pe's/\n/z/'
This prints azzb, and without the following newline, so with the next prompt on the same line. Note that the program is fed one line at a time so there is no need for /g modifier. (And which is why \n\n doesn't match.) That /m modifier is then unrelated to this example.†
I don't know in what form this is used but I'd imagine not with echo feeding the input? Then better test it with input in a file, or in a multi-line string (in which case /g may be needed).
An example
use warnings;
use strict;
use feature 'say';
# Test with multiline string
my $ml_str = "a\n\nb\n";
$ml_str =~ s/\n/z/g; #--> azzbz (no newline at the end)
print $ml_str;
say ''; # to terminate the line above
# Or to replace two consecutive newlines (everywhere)
$ml_str = "a\n\nb\n"; # restore the example string
$ml_str =~ s/\n\n/z/g; #--> azb\n
print $ml_str;
# To replace the consecutive newlines in a file read it into a string
my $file = join '', <DATA>; # lines of data after __DATA__
$file =~ s/\n\n/z/g;
print $file;
__DATA__
one
two
last
This prints
azzbz
azb
one
twoz
last
As a side note, I'd like to mention that with the modifier /s the . matches a newline as well. (For example, this is handy for matching substrings that may contain newlines by .* (or .+); without /s modifier that pattern stops at a newline.)
See perlrebackslash and search for newline.
† The /m modifier makes ^ and $ also match beginning and end of lines inside a multi-line string. Then
$multiline_string =~ s/$/z/mg;
will replace newlines inside the string. However, this example bears some complexities since some of the newlines stay.

You are applying substitution to only one line at a time, and one line will never have two newlines. Apply the substitution to the entire file instead:
perl -0777 -pe 's/\n\n/z/g'

Remove string after finding specific character

My Perl program only removes the last three characters of the string. Currently, I am finding a way to find the count including + and remove using substr or if there is any built-in function in Perl.
open my $hfile, $ARGV[0] or die "Can't open $ARGV[0] for reading: $!";
while( my $line = <$hfile> ){
if ($line =~ /+/){
$line = substr($line, -3);
print $line;
}
}
close $hfile;
Input file
hello_aba+32
gaww_ajnd_arhb+176
ajnbjsdsjn+416
Output file
hello_aba
gaww_ajnd_arhb
ajnbjsdsjn

This is a classic task for a perl one-liner. I would use some caution when deleting after a control character. First off, anchor it to the end of line. Second, make sure only expected characters are deleted.
Assuming the characters to delete 1) does not include plus-signs +, 2) does not include newlines, I would write:
perl -pe' s/\+[^+\n]*$// ' file.txt
[^ ... ] is a character class, and it is negated with ^ to mean "match any character that does not match what is inside".
While + following nothing is probably considered a literal plus and not a meta character, I think escaping it \+ is proper, and prevents future update errors. Assuring that the rest of the line does not contain + assures that any extra plus signs do not cause us to lose data, e.g.
if foo = 2, then foo + bar = 4+123
# ^ first ^ second
Adding $ for end of line will anchor the match to the end of the line. This will prevent any extra + signs to mess up our input. Otherwise it would delete between the two first plus signs found.
Since we do not delete the line endings \n, the file structure remains unchanged.
Demonstration:
$ cat plus.txt
hello_aba+32
gaww_ajnd_arhb+176
ajnbjsdsjn+416
foo + bar = 3+123
$ perl -pe' s/\+[^+\n]*$//' plus.txt
hello_aba
gaww_ajnd_arhb
ajnbjsdsjn
foo + bar = 3
If you want to change the input file, you can either use redirection:
$ perl -pe ..... > newfile.txt
Or add the -i switch to perform in-place edit:
$ perl -pi.bak -e ....
(.bak will create a backup file with extension .bak). Note that the original file is overwritten, so use caution.

To remove anything after "+" in your lines, use a substitution regex:
$line =~ s/\+.*// && print $line;

How to extract 2+ character words from string in perl

I assume some sort of regex would be used to accomplish this?
I need to get it where each word consists of 2 or more characters, start with a letter, and the remaining characters consist of letters, digits, and underscores.
This is the code I currently have, although it isn't very close to my desired output:
while (my $line=<>) {
# remove leading and trailing whitespace
$line =~ s/^\s+|\s+$//g;
$line = lc $line;
#array = split / /, $line;
foreach my $a (#array){
$a =~ s/[\$##~!&*()\[\];.,:?^ `\\\/]+//g;
push(#list, "$a");
}
}
A sample input would be:
#!/usr/bin/perl -w
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;
And the desired output would be (alphabetical order):
bin
exit
hello
hello
line
perl
print
print
strict
this
use
usr
will
world

my #matches = $string =~ /\b([a-z][a-z0-9_]+)/ig;
If case-insensitive operation need be applied only to a subpattern, can embed it
/... \b((?i)[a-z][a-z0-9_]+) .../
(or, it can be turned off after the subpattern, (?i)pattern(?-i))
That [a-zA-Z0-9_] goes as \w, a "word character", if that's indeed exactly what is needed.
The above regex picks words as required without a need to first split the line on space, done in the shown program. Can apply it on the whole line (or on the whole text for that matter), perhaps after the shown stripping of the various special characters.†
There is a question of some other cases -- how about hyphens? Apostrophes? Tilde? Those aren't found in identifiers, while this appears to be intended to process programming text, but comments are included; what other legitimate characters may there be?
Note on split-ing on whitespace
The shown split / /, $line splits on exactly that one space. Better is split /\s+/, $line -- or, better yet is to use split's special pattern split ' ', $line: split on any number of any consecutive whitespace, and where leading and trailing spaces are discarded.
† The shown example is correctly processed as desired by the given regex alone
use strict;
use warnings;
use feature 'say';
use Path::Tiny qw(path); # convenience, to slurp the file
my $fn = shift // die "Usage: $0 filename\n";
my #matches = sort map { lc }
path($fn)->slurp =~ /\b([a-z][a-z0-9_]+)/ig;
say for #matches;
I threw in sorting and lower-casing to match the sample code in the question but all processing is done with the shown regex on the file's content in a string.
Output is as desired (except that line and world here come twice, what is correct).
Note that lc can be applied on the string with the file content, which is then processed with the regex, what is more efficient. While this is in principle not the same in this case it may be
perl -MPath::Tiny -wE'$f = shift // die "Need filename\n";
#m = sort lc(path($f)->slurp) =~ /\b([a-z]\w+)/ig;
say for #m'
Here I actually used \w. Adjust to the actual character to match, if different.

Curiously, this can be done with one of those long, typical Perl one-liners
$ perl -lwe'print for sort grep /^\pL/ && length > 1, map { split /\W+/ } map lc, <>' a.txt
bin
exit
hello
hello
line
line
perl
print
print
strict
this
use
usr
will
world
world
Lets go through that and see what we can learn. This line reads from right to left.
a.txt is the argument file to read
<> is the diamond operator, reading the lines from the file. Since this is list context, it will exhaust the file handle and return all the lines.
map lc, short for map { lc($_) } will apply the lc function on all the lines and return the result.
map { split /\W+/ } is a multi-purpose operation. It will remove the unwanted characters (the non-word characters), and also split the line there, and return a list of all those words.
grep /^\pL/ && length > 1 sorts out strings that begin with a letter \pL and are longer than 1 and returns them.
sort sorts alphabetically the list coming in from the right and returns it left
for is a for-loop, applied to the incoming list, in the post-fix style.
print is short for print $_, and it will print once for each list item in the for loop.
The -l switch in the perl command will "fix" line endings for us (remove them from input, add them in output). This will make the print pretty at the end.
I won't say this will produce a perfect result, but you should be able to pick up some techniques to finish your own program.

Repeating regex pattern

I have a string such as this
word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>
where, if there is one ore more words enclosed in tags. In those instances where there are more than one words (which are usually separated by - or = and potentially other non-word characters), I'd like to make sure that the tags enclose each word individually so that the resulting string would be:
word <gl>aaa</gl> word <gl>aaa</gl>-<gl>bbb</gl>=<gl>ccc</gl>
So I'm trying to come up with a regex that would find any number of iterations of \W*?(\w+) and then enclose each word individually with the tags. And ideally I'd have this as a one-liner that I can execute from the command line with perl, like so:
perl -pe 's///g;' in out
This is how far I've gotten after a lot of trial and error and googling - I'm not a programmer :( ... :
/<gl>\W*?(\w+)\W*?((\w+)\W*?){0,10}<\/gl>/
It finds the first and last word (aaa and ccc). Now, how can I make it repeat the operation and find other words if present? And then how to get the replacement? Any hints on how to do this or where I can find further information would be much appreciated?
EDIT:
This is part of a workflow that does some other transformations within a shell script:
#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 > tmp
... some other commands ...

This needs a mini nested-parser and I'd recommend a script, as easier to maintain
use warnings;
use strict;
use feature 'say';
my $str = q(word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>);
my $tag_re = qr{(<[^>]+>) (.+?) (</[^>]+>)}x; # / (stop markup highlighter)
$str =~ s{$tag_re}{
my ($o, $t, $c) = ($1, $2, $3); # open (tag), text, close (tag)
$t =~ s/(\w+)/$o$1$c/g;
$t;
}ge;
say $str;
The regex gives us its built-in "parsing," where words that don't match the $tag_re are unchanged. Once the $tag_re is matched, it is processed as required inside the replacement side. The /e modifier makes the replacement side be evaluated as code.
One way to provide input for a script is via command-line arguments, available in #ARGV global array in the script. For the use indicated in the question's "Edit" replace the hardcoded
my $str = q(...);
with
my $str = shift #ARGV; # first argument on the command line
and then use that script in your shell script as
#!/bin/sh
...
script.pl $1 > output_file
where $1 is the shell variable as shown in the "Edit" to the question.
In a one-liner
echo "word <gl>aaa</gl> word <gl>aaa-bbb=ccc</gl>" |
perl -wpe'
s{(<[^>]+>) (.+?) (</[^>]+>)}
{($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
'
what in your shell script becomes echo $1 | perl -wpe'...' > output_file. Or you can change the code to read from #ARGV and drop the -n switch, and add a print
#!/bin/sh
...
perl -wE'$_=shift; ...; say' $1 > output_file
where ... in one-liner indicate the same code as above, and say is now needed since we don't have the -p with which the $_ is printed out once it's processed.
The shift takes an element off of an array's front and returns it. Without an argument it does that to #ARGV when outside a subroutine, as here (inside a subroutine its default target is #_).

This will do it:
s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
The /g at the end is the repeat and stands for "global". It will pick up matching at the end of the previous match and keep matching until it doesn't match anymore, so we have to be careful about where the match ends. That's what the (?=...) is for. It's a "followed by pattern" that tells the repeat to not include it as part of "where you left off" in the previous match. That way, it picks up where it left off by re-matching the second "word".
The s/ at the beginning is a substitution, so the command would be something like:
cat in | perl -pne 's/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;$_' > out
You need the $_ at the end because the result of the global substitution is the number of substitutions made.
This will only match one line. If your pattern spans multiple lines, you'll need some fancier code. It also assumes the XML is correct and that there are no words surrounding dashes or equals signs outside of tags. To account for this would necessitate an extra pattern match in a loop to pull out the values surrounded by gl tags so that you can do your substitution on just those portions, like:
my $e = $in;
while($in =~ /(.*?<gl>)(.*?)(?=<\/gl>)/g){
my $p = $1;
my $s = $2;
print($p);
$s =~ s/(\w+)([\-=])(?=\w+)/$1<\/gl>$2<gl>/g;
print($s);
$e = $'; # ' (stop markup highlighter)
}
print($e);
You'd have to write your own surrounding loop to read STDIN and put the lines read in into $in. (You would also need to not use -p or -n flags to the perl interpreter since you're reading the input and printing the output manually.) The while loop above however grabs everything inside the gl tags and then performs your substitution on just that content. It prints everything occurring between the last match (or the beginning of the string) and before the current match ($p) and saves everything after in $e which gets printed after the last match outside the loop.

Removing newline character from a string in Perl

I have a string that is read from a text file, but in Ubuntu Linux, and I try to delete its newline character from the end.
I used all the ways. But for s/\n|\r/-/ (I look whether it finds any replaces any new line string) it replaces the string, but it still goes to the next line when I print it. Moreover, when I used chomp or chop, the string is completely deleted. I could not find any other solution. How can I fix this problem?
use strict;
use warnings;
use v5.12;
use utf8;
use encoding "utf-8";
open(MYINPUTFILE, "<:encoding(UTF-8)", "file.txt");
my #strings;
my #fileNames;
my #erroredFileNames;
my $delimiter;
my $extensions;
my $id;
my $surname;
my $name;
while (<MYINPUTFILE>)
{
my ($line) = $_;
my ($line2) = $_;
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
#chop($line2);
$line2 =~ s/^\n+//;
print $line2 . " WRONG FORMAT!\n";
}
else {
#print "INSERTED:".$13."\n";
my($id) = $13;
my($name) = $2;
print $name . "\t" . $id . "\n";
unshift(#fileNames, $line2);
unshift(#strings, $line2 =~ /[^\W_]+/g);
}
}
close(MYINPUTFILE);

The correct way to remove Unicode linebreak graphemes, including CRLF pairs, is using the \R regex metacharacter, introduced in v5.10.
The use encoding pragma is strongly deprecated. You should either use the use open pragma, or use an encoding in the mode argument on 3-arg open, or use binmode.
use v5.10; # minimal Perl version for \R support
use utf8; # source is in UTF-8
use warnings qw(FATAL utf8); # encoding errors raise exceptions
use open qw(:utf8 :std); # default open mode, `backticks`, and std{in,out,err} are in UTF-8
while (<>) {
s/\R\z//;
...
}

You are probably experiencing a line ending from a Windows file causing issues. For example, a string such as "foo bar\n", would actually be "foo bar\r\n". When using chomp on Ubuntu, you would be removing whatever is contained in the variable $/, which would be "\n". So, what remains is "foo bar\r".
This is a subtle, but very common error. For example, if you print "foo bar\r" and add a newline, you would not notice the error:
my $var = "foo bar\r\n";
chomp $var;
print "$var\n"; # Remove and put back newline
But when you concatenate the string with another string, you overwrite the first string, because \r moves the output handle to the beginning of the string. For example:
print "$var: WRONG\n";
It would effectively be "foo bar\r: WRONG\n", but the text after \r would cause the following text to wrap back on top of the first part:
foo bar\r # \r resets position
: WRONG\n # Second line prints and overwrites
This is more obvious when the first line is longer than the second. For example, try the following:
perl -we 'print "foo bar\rbaz\n"'
And you will get the output:
baz bar
The solution is to remove the bad line endings. You can do this with the dos2unix command, or directly in Perl with:
$line =~ s/[\r\n]+$//;
Also, be aware that your other code is somewhat horrific. What do you for example think that $13 contains? That'd be the string captured by the 13th parenthesis in your previous regular expression. I'm fairly sure that value will always be undefined, because you do not have 13 parentheses.
You declare two sets of $id and $name. One outside the loop and one at the top. This is very poor practice, IMO. Only declare variables within the scope they need, and never just bunch all your declarations at the top of your script, unless you explicitly want them to be global to the file.
Why use $line and $line2 when they have the same value? Just use $line.
And seriously, what is up with this:
if ($line !~ /^(((\X|[^\W_ ])+)(.docx)(\n|\r))/g) {
That looks like an attempt to obfuscate, no offence. Three nested negations and a bunch of unnecessary parentheses?
First off, since it is an if-else, just swap it around and reverse the regular expression. Second, [^\W_] a double negation is rather confusing. Why not just use [A-Za-z0-9]? You can split this up to make it easier to parse:
if ($line =~ /^(.+)(\.docx)\s*$/) {
my $pre = $1;
my $ext = $2;

You can wipe the linebreaks with something like this:
$line =~ s/[\n\r]//g;
When you do that though, you'll need to change the regex in your if statement to not look for them. I also don't think you want a /g in your if. You really shouldn't have a $line2 either.
I also wouldn't do this type of thing:
print $line2." WRONG FORMAT!\n";
You can do
print "$line2 WRONG FORMAT!\n";
... instead. Also, print accepts a list, so instead of concatenating your strings, you can just use commas.

You can do something like:
=~ tr/\n//
But really chomp should work:
while (<filehandle>){
chomp;
...
}
Also s/\n|\r// only replaces the first occurrence of \r or \n. If you wanted to replace all occurrences you would want the global modifier at the end s/\r|\n//g.
Note: if you're including \r for windows it usually ends its line as \r\n so you would want to replace both (e.g. s/(?:\r\n|\n)//), of course the statement above (s/\r|\n//g) with the global modifier would take care of that anyways.

$variable = join('',split(/\n/,$variable))

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to remove the whitespaces in fasta file using perl? - regex

Using perl one liner, perl -i -pe 's|[ \t]||g' a4.fas removing all white spaces, including new lines, perl -i -pe 's|\s||g' a4.fas

use strict; use warnings; while(my $line = <DATA>) { $line =~ s/\s+//g; print $line; } DATA PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES MTIEDEYS

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

Remove string after finding specific character

How to extract 2+ character words from string in perl

Repeating regex pattern

Removing newline character from a string in Perl

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to remove the whitespaces in fasta file using perl? - regex

Using perl one liner, perl -i -pe 's|[ \t]||g' a4.fas removing all white spaces, including new lines, perl -i -pe 's|\s||g' a4.fas

use strict; use warnings; while(my $line = <DATA>) { $line =~ s/\s+//g; print $line; } __DATA__ PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES MTIEDEYS

Related

perl match consecutive newlines: `echo "aaa\n\n\nbbb" | perl -pe "s/\\n\\n/z/gm"`

Remove string after finding specific character

How to extract 2+ character words from string in perl

Repeating regex pattern

Removing newline character from a string in Perl

Categories

Resources

use strict; use warnings; while(my $line = <DATA>) { $line =~ s/\s+//g; print $line; } DATA PADGALKRAEELKTQANDYFKAKDYENAIKFYSQAIELNPSNAIYYGNRS LAYLRTECYGYALGDATRAIELDKKYIKGYYRRAASNMALGKFRAALRDY ETVVKVKPHDKDAKMKYQECNKIVKQKAFERAIAGDEHKRSVVDSLDIES MTIEDEYS