I've got a Perl script which consumes an XML file on Linux, and occasionally there are CRLFs (hex 0D0A, DOS newlines) in some of the node values.
The system which produces the XML file writes it all as a single line, and it looks as if it occasionally decides that this is too long and writes a CRLF into one of the data elements. Unfortunately there's nothing I can do about the providing system.
I just need to remove these from the string before I process it.
I've tried all sorts of regex replacement using the perl char classes, hex values, all sorts and nothing seems to work.
I've even run the input file through dos2unix before processing and I still can't get rid of the erroneous characters.
Does anyone have any ideas?
Many Thanks,
Typical. After battling for about 2 hours, I solved it within 5 minutes of asking the question.
$output =~ s/[\x0A\x0D]//g;
Finally got it.
$output =~ tr/\x{d}\x{a}//d;
These are both whitespace characters, so if the terminators are always at the end, you can right-trim with
$output =~ s/\s+\z//;
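To make the difference between these approaches concrete, here is a minimal sketch. The helper names `strip_all_eol` and `strip_trailing_ws` and the sample value are made up for illustration; only the `tr` and `s///` expressions come from the answers above.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every CR and LF, wherever they occur.
sub strip_all_eol {
    my ($s) = @_;
    $s =~ tr/\x0D\x0A//d;
    return $s;
}

# Hypothetical helper: remove whitespace only at the very end of the string.
sub strip_trailing_ws {
    my ($s) = @_;
    $s =~ s/\s+\z//;
    return $s;
}

my $node = "first half\r\nsecond half\r\n";   # made-up sample value
print strip_all_eol($node), "\n";      # the CRLF inside the value is removed too
print strip_trailing_ws($node), "\n";  # the embedded CRLF is still there
```

Since the question describes CRLFs appearing inside node values, the right-trim variant alone would probably not be enough for that case.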
A few options:
1. Replace all occurrences of cr/lf with lf: $output =~ s/\r\n/\n/g; #instead of \r\n might want to use \015\012
2. Remove all trailing whitespace: $output =~ s/\s+$//g;
3. Slurp and split:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
sub main{
createfile();
outputfile();
}
main();
sub createfile{
(my $file = $0)=~ s/\.pl/\.txt/;
open my $fh, ">", $file;
print $fh "1\n2\r\n3\n4\r\n5";
close $fh;
}
sub outputfile{
(my $filei = $0)=~ s/\.pl/\.txt/;
(my $fileo = $0)=~ s/\.pl/out\.txt/;
open my $fin, "<", $filei;
local $/; # slurp the file
my $text = <$fin>; # store the text
my @text = split(/(?:\r\n|\n)/, $text); # split on dos or unix newlines
close $fin;
local $" = ", "; # change the list separator used when interpolating arrays
open my $fout, ">", $fileo;
print $fout "@text"; # should output numbers separated by comma space
close $fout;
}
Related
I have a text file which contains a protein sequence in FASTA format. FASTA files have the first line which is a header and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motifs VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position it was found. This is my code:
#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
}
#print $seq;
$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
$j=$i+1;
$motif= substr ($seq,$i,4);
if ($motif=~m/VSE(.)/) {
print "motif $motif found in position $j \n" ;
}
}
I'm pretty sure I have messed up the loop, but I don't know what went wrong. The output I get on cygwin is the following:
motif found in position 2
motif found in position 2
motif found in position 2
So some general perl tips:
Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.
Anyway, in trying to figure out what's going wrong (although the other post correctly points out that perl for loops need semicolons not commas) I rewrote it a little to accomplish what I think is the same result:
#!/usr/bin/perl
use strict;
use warnings;
#read in source data
open (my $input_data, '<', 'P51170.fasta.txt') or die $!;
#extract 'first line':
#>sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4
my $header = <$input_data>;
#slurp all the rest of the file into the seq string.
#$/ is the 'end of line' separator, so we temporarily undefine it to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };
#remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;
#printing what either looks like for clarity
print $header, "\n\n";
print $seq, "\n\n";
#iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
# use pos() function to report where matches happened. $1 is the contents
# of the first 'capture bracket'.
print "motif $1 at ", pos($seq), "\n";
}
So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.
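One detail worth noting: pos() reports the offset just past the end of the match, not where it began. If the start position is what you want, the built-in `@-` array gives the 0-based start offset of the last match in `$-[0]`. The following is a minimal sketch under that assumption; the helper name `find_motifs` and the sample sequence are made up and are not the real P51170 data.

```perl
use strict;
use warnings;

# Hypothetical helper: return [motif, 1-based start position] pairs
# for every VSE followed by any one character.
sub find_motifs {
    my ($seq) = @_;
    my @hits;
    while ( $seq =~ /(VSE.)/g ) {
        # $-[0] is the 0-based offset where the whole match started;
        # pos($seq) would instead be the offset just past its end.
        push @hits, [ $1, $-[0] + 1 ];
    }
    return @hits;
}

# Made-up sequence, used only to exercise the helper.
for my $hit ( find_motifs("AAVSEKTTVSEL") ) {
    print "motif $hit->[0] found at position $hit->[1]\n";
}
```

Capturing the whole four-letter motif, rather than only the final character, also lets the output show the motif itself.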
Use semicolons in the C-style loop:
for ($i=0; $i<=$l-4; $i++) {
Or, use a Perl style loop:
for my $i (0 .. $l - 4) {
But you don't have to loop over the positions, Perl can do that for you (see pos):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
chomp;
s/ //g;
$seq .= $_;
}
while ($seq =~ /(VSE.)/g) {
say "Motif $1 found at ", pos($seq) - 3;
}
Note that I followed some good practices:
I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.
Your base program can be reduced to a one-liner:
perl -0777 -nwE's/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
-0777 is slurp mode: the whole file is read in one go
-n uses either standard input, or a file name as argument as source for input
-E enable features (say in this case)
s/\s+//g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match
Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:
perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.
The first <> reads the header, which can be used later.
The one-liner as a program file looks like this:
use strict;
use warnings; # never skip these pragmas
use feature 'say';
my $h = <>; # header, skip for now
$_ = do { local $/; <> }; # slurp rest of file
s/\s+//g; # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;
Use it like this:
perl fasta.pl P51170.fasta.txt
I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so the year would be '55' for '1955', for example. I am trying to return all entries from years '50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $&, which only ever holds the most recent match.
To get every match on the line, collect them in an array. With a single capture group around the whole date, a list-context match returns one element per date:
while(<$fh>) {
my @dates = /(\d\d\/\d\d\/[5-9][0-9])/g;
print "@dates\n" if @dates;
}
You just need to change the 'if' to a 'while', and the regex will pick up where it left off:
open INPUT, "< a.dat" or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
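As a self-contained illustration of the single-capture form, here is a minimal sketch that collects all qualifying dates from one line in list context. The helper name `dates_50_to_99` is made up; the sample line is the one described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return every mm/dd/yy date in the line whose year is 50-99.
sub dates_50_to_99 {
    my ($line) = @_;
    # In list context, a /g match with one capture group returns all captures.
    my @dates = $line =~ m{(\d\d/\d\d/[5-9]\d)}g;
    return @dates;
}

my $line = "12/12/12, 10/10/57, 10/09/66";   # the sample line from the question
print "$_\n" for dates_50_to_99($line);
```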
You can use map also to get year range 50 to 99 and store in array
open INPUT, "< dates.txt" or die "Can't open input file: $!";
@as = map{$_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g} <INPUT>;
$, = "\n";
print @as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;
I'll start by saying I'm brand new to Perl, and regex and I have never been best buddies.
My problem is, I have a text file full of lines. Each line contains many 'words'. These words can contain letters, numbers, -, =, etc. Pretty much everything except white space. Each word is separated by white space.
In every line there is one word that starts with three unique characters, 'mc='. So the word could be 'mc=abcde123', 'mc=12345hij', 'mc=blah'... you get my drift. I want to extract this word from each line and insert them into a new text file.
#!/usr/bin/perl
use warnings;
my $input = 'input.txt';
my $output = 'output.txt';
open (FILE, "<", $input) or die "Can not open $input $!";
open my $out, '>', $output or die "Can not open $output $!";
while (<FILE>){
/(\s+mc=\/*S)/g;
print $out $_;
}
Not sure how much use any of this code will be to you. I'm well aware the regex is wrong; this just prints the entire content of input.txt into output.txt. Eventually I will be extracting additional values, so if anyone could find it in their heart to help this poor, young, ignorant programmer out, I would be more than grateful!
The only thing you want to match is a string of non-whitespace that begins with mc=, which is preceded by either start of string or whitespace. So the regex you want would be
/(?<!\S)(mc=\S*)/g
Using the negative lookbehind assertion (?<!\S) is a way to assert that we do not have non-whitespace before our keyword. We cannot use (?<=\s|^) (match whitespace or beginning of string) because lookbehind assertions cannot be variable length, so this is a workaround.
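To see what the lookbehind accepts and rejects, here is a minimal sketch. The helper name `mc_words` and the sample lines are made up for illustration; the regex is the one given above.

```perl
use strict;
use warnings;

# Hypothetical helper: return all words in the line that start with "mc=".
# Call it in list context to get the captured words.
sub mc_words {
    my ($line) = @_;
    # (?<!\S) succeeds at the start of the string or right after whitespace.
    return $line =~ /(?<!\S)(mc=\S*)/g;
}

# Made-up sample lines.
print join( ", ", mc_words("mc=abcde123 foo=1 bar") ), "\n";     # word at start of line
print join( ", ", mc_words("foo xmc=nope mc=blah end") ), "\n";  # "xmc=nope" is skipped
```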
You can use for example:
perl -nle 'print for /(?<!\S)(mc=\S*)/g' input.txt > output.txt
A one-liner which will print the matched strings on a new line for each word, and using shell redirection (*nix shell) we print the words to a new file. This will replace your entire script.
You can also use the following to patch your own code:
print $out $_ for /(?<!\S)(mc=\S*)/g;
But making the file names hard coded is unnecessary, I feel, especially when perl has such nice predefined features to use in this case.
#!/usr/bin/perl
use warnings;
my $input = 'input.txt';
my $output = 'output.txt';
open (FILE, "<", $input) or die "Can not open $input $!";
open my $out, '>', $output or die "Can not open $output $!";
while (<FILE>){
my @arr = /(?: ^|\s )(mc=\S*)/xg or next;
print $out "$_\n" for @arr;
}
I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
The $1 variable will then hold the number you want.
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory:
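A minimal sketch of slurping, assuming the header lines look like "BGA 1" as in the question. The helper names `slurp` and `bga_numbers` are made up, and an in-memory filehandle with made-up contents stands in for alldata.txt so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from an open filehandle into one string.
sub slurp {
    my ($fh) = @_;
    local $/;            # undefine the record separator within this sub only
    return scalar <$fh>;
}

# Hypothetical helper: return every number that follows "BGA" and whitespace.
sub bga_numbers {
    my ($text) = @_;
    return $text =~ /BGA\s+(\d+)/g;
}

# Made-up stand-in for alldata.txt, read through an in-memory filehandle.
my $sample = "header\nBGA 1\n0.1 0.2\nBGA 1\nheader\nBGA 17\n0.3 0.4\n";
open my $fh, '<', \$sample or die "Could not open in-memory file: $!";
my $text = slurp($fh);
close $fh;

print join( ",", bga_numbers($text) ), "\n";
```

For a real file you would open it by name with the three-argument form of open instead of the scalar reference.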
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests whether the line literally consists of the string "/BGA/". But that can never be true, because each line will at least have the input record separator (the contents of $/) at the end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.
I gave up on sed and I've heard it is better in Perl.
I would like a script that can be called from the 'unix' command line and converts DOS line endings CRLF from the input file and replaces them with commas in the output file:
like
myconvert infile > outfile
where infile was:
1
2
3
and would result in outfile:
1,2,3
I would prefer more explicit code with some minimal comments over "the shortest possible solution", so I can learn from it, I have no perl experience.
In shell, you can do it in many ways:
cat input | xargs echo | tr ' ' ,
or
perl -pe 's/\r?\n/,/' input > output
I know you wanted this to be longer, but I don't really see the point of writing multi line script to solve such simple task - simple regexp (in case of perl solution) is fully workable, and it's not something artificially shortened - it's the type of code that I would use on daily basis to solve the issue at hand.
#!/bin/perl
while(<>) { # Read from stdin one line at a time
s:\r\n:,:g; # Replace CRLF in current line with comma
print; # Write out the new line
}
use strict;
use warnings;
my $infile = $ARGV[0] or die "$0 Usage:\n\t$0 <input file>\n\n";
open(my $in_fh , '<' , $infile) or die "$0 Error: Couldn't open $infile for reading: $!\n";
my $file_contents;
{
local $/; # slurp in the entire file. Limit change to $/ to enclosing block.
$file_contents = <$in_fh>
}
close($in_fh) or die "$0 Error: Couldn't close $infile after reading: $!\n";
# change DOS line endings to commas
$file_contents =~ s/\r\n/,/g;
$file_contents =~ s/,$//; # get rid of last comma
# finally output the resulting string to STDOUT
print $file_contents . "\n";
Your question text and example output were not consistent. If you're converting all line endings to commas, you will end up with an extra comma at the end, from the last line ending. But your example shows only commas between the numbers. I assumed you wanted the code output to match your example and that the question text was incorrect; however, if you want the last comma, just remove the line with the comment "get rid of last comma".
If any command is not clear, http://perldoc.perl.org/ is your friend (there is a search box at the top right corner).
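The trailing-comma issue described above can also be sidestepped by splitting the text into lines and joining them, since commas then only appear between elements. This is a minimal sketch of that alternative, not the code from the answer; the helper name `lines_to_csv` is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn DOS- or Unix-terminated lines into one comma-separated line.
sub lines_to_csv {
    my ($text) = @_;
    # split drops trailing empty fields, so a final line ending adds no comma.
    my @lines = split /\r?\n/, $text;
    return join( ",", @lines ) . "\n";
}

print lines_to_csv("1\r\n2\r\n3\r\n");   # the infile from the question, with CRLF endings
```

Note that blank lines in the middle of the input would still produce empty fields between commas.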
It's as simple as:
tr '\n' , <infile >outfile
Avoid slurping, don't tack on a trailing comma and print out a well-formed text file (all lines must end in newlines):
#!/usr/bin/perl
use strict;
use warnings;
my $line = <>;
while ( 1 ) {
my $next = <>;
s{(?:\015\012?|\012)+$}{} for $line, $next;
if ( length $next ) {
print $line, q{,};
$line = $next;
}
else {
print $line, "\n";
last;
}
}
__END__
Personally I would avoid having to look a line ahead (as in Sinar's answer). Sometimes you need to but I have sometimes done things wrong in processing the last line.
use strict;
use warnings;
my $outputcomma = 0; # No comma before first line
while ( <> )
{
print ',' if $outputcomma ;
$outputcomma = 1 ; # output commas from now on
s/\r?\n$// ;
print ;
}
print "\n" ;
BTW: In sed, it would be:
sed ':a;{N;s/\r\n/,/;ba}' infile > outfile
with Perl
$\ = "\n"; # set output record separator
$, = ',';
$/ = "\n\n";
while (<>) {
chomp;
@f = split('\s+', $_);
print join($,, @f);
}
in unix, you can also use tools such as awk or tr
awk 'BEGIN{OFS=",";RS=""}{$1=$1}1' file
or
tr "\n" "," < file