I gave up on sed and I've heard it is better in Perl.
I would like a script that can be called from the Unix command line, reads an input file with DOS (CRLF) line endings, and writes an output file in which those line endings are replaced with commas:
like
myconvert infile > outfile
where infile was:
1
2
3
and would result in outfile:
1,2,3
I would prefer more explicit code with some minimal comments over "the shortest possible solution", so that I can learn from it; I have no Perl experience.
In shell, you can do it in many ways:
cat input | xargs echo | tr ' ' ,
or
perl -pe 's/\r?\n/,/' input > output
I know you wanted this to be longer, but I don't really see the point of writing a multi-line script to solve such a simple task. A simple regexp (in the case of the Perl solution) is fully workable, and it's not something artificially shortened: it's the type of code that I would use on a daily basis to solve the issue at hand.
#!/bin/perl
while(<>) { # Read from stdin one line at a time
s:\r\n:,:g; # Replace CRLF in current line with comma
print; # Write out the new line
}
use strict;
use warnings;
my $infile = $ARGV[0] or die "$0 Usage:\n\t$0 <input file>\n\n";
open(my $in_fh , '<' , $infile) or die "$0 Error: Couldn't open $infile for reading: $!\n";
my $file_contents;
{
local $/; # slurp in the entire file. Limit change to $/ to enclosing block.
$file_contents = <$in_fh>
}
close($in_fh) or die "$0 Error: Couldn't close $infile after reading: $!\n";
# change DOS line endings to commas
$file_contents =~ s/\r\n/,/g;
$file_contents =~ s/,$//; # get rid of last comma
# finally output the resulting string to STDOUT
print $file_contents . "\n";
Your question text and example output were not consistent. If you're converting all line endings to commas, you will end up with an extra comma at the end, from the last line ending. But your example shows only commas between the numbers. I assumed you wanted the code output to match your example and that the question text was incorrect. However, if you want the last comma, just remove the line with the comment "get rid of last comma".
If any command is not clear, http://perldoc.perl.org/ is your friend (there is a search box at the top right corner).
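Another way to avoid the trailing comma, rather than removing it afterwards, is to strip the line ending from each line first and then join the pieces. The following is only a sketch: `lines_to_csv` is a hypothetical helper name, and the sample strings stand in for the lines of infile.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): strip the line ending
# from each line, then join the lines with commas.
sub lines_to_csv {
    my @lines = @_;              # work on a copy of the arguments
    s/\r?\n\z// for @lines;      # remove a trailing CRLF or bare LF
    return join ',', @lines;     # commas go between items only
}

# Sample data standing in for the contents of infile.
print lines_to_csv("1\r\n", "2\r\n", "3\r\n"), "\n";   # prints 1,2,3
```

To process a real file, the sample list could be replaced with `<>`, which reads all lines of the files named on the command line in list context.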
It's as simple as:
tr '\n' , <infile >outfile
Avoid slurping, don't tack on a trailing comma and print out a well-formed text file (all lines must end in newlines):
#!/usr/bin/perl
use strict;
use warnings;
my $line = <>;
while ( 1 ) {
my $next = <>;
s{(?:\015\012?|\012)+$}{} for $line, $next;
if ( length $next ) {
print $line, q{,};
$line = $next;
}
else {
print $line, "\n";
last;
}
}
__END__
Personally I would avoid having to look a line ahead (as in Sinar's answer). Sometimes you need to but I have sometimes done things wrong in processing the last line.
use strict;
use warnings;
my $outputcomma = 0; # No comma before first line
while ( <> )
{
print ',' if $outputcomma ;
$outputcomma = 1 ; # output commas from now on
s/\r?\n$// ;
print ;
}
print "\n" ;
BTW: In sed, it would be:
sed ':a;{N;s/\r\n/,/;ba}' infile > outfile
with Perl
$\ = "\n"; # set output record separator
$, = ',';
$/ = "\n\n";
while (<>) {
chomp;
@f = split('\s+', $_);
print join($,, @f);
}
in unix, you can also use tools such as awk or tr
awk 'BEGIN{OFS=",";RS=""}{$1=$1}1' file
or
tr "\n" "," < file
I have a text file which contains a protein sequence in FASTA format. FASTA files have the first line which is a header and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motifs VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position it was found. This is my code:
#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
}
#print $seq;
$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i++){
$j=$i+1;
$motif= substr ($seq,$i,4);
if ($motif=~m/VSE(.)/) {
print "motif $motif found in position $j \n" ;
}
}
I'm pretty sure I have messed up the loop, but I don't know what went wrong. The output I get on cygwin is the following:
motif found in position 2
motif found in position 2
motif found in position 2
So some general perl tips:
Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.
Anyway, in trying to figure out what's going wrong (although the other post correctly points out that perl for loops need semicolons not commas) I rewrote it a little to accomplish what I think is the same result:
#!/usr/bin/perl
use strict;
use warnings;
#read in source data
open (my $input_data, '<', 'P51170.fasta.txt') or die $!;
#extract 'first line':
#>sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4
my $header = <$input_data>;
#slurp all the rest of the file into the seq string.
#$/ is the input record ('end of line') separator, thus we temporarily undefine it to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };
#remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;
#printing what either looks like for clarity
print $header, "\n\n";
print $seq, "\n\n";
#iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
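# use pos() function to report where matches happened. pos() gives the
# offset just past the end of the match, so for this four-character motif
# the 1-based start position is pos($seq) - 3. $1 is the contents
# of the first 'capture bracket', here the residue matched by X.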
print "motif $1 at ", pos($seq), "\n";
}
So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.
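One detail worth spelling out: pos() reports where the match ended, while the special array `@-` holds where it started (`$-[0]` is the 0-based start offset of the whole match). A minimal sketch, assuming the goal is the motif plus its 1-based start position; `find_motifs` and the sample sequence are made up for illustration. It uses a lookahead so that overlapping motifs are reported as well, which a plain `/VSE./g` loop would skip.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [motif, 1-based start position] pairs.
sub find_motifs {
    my ($seq) = @_;
    my @hits;
    # The lookahead (?=...) matches without consuming characters, so
    # overlapping motifs are found too. $-[0] is the 0-based offset
    # at which the current match starts.
    while ( $seq =~ /(?=(VSE.))/g ) {
        push @hits, [ $1, $-[0] + 1 ];
    }
    return @hits;
}

# Made-up sequence, not taken from P51170.
for my $hit ( find_motifs('AAVSEKGGVSELT') ) {
    print "motif $hit->[0] found in position $hit->[1]\n";
}
```

For the made-up sequence this reports VSEK at position 3 and VSEL at position 9.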
Use semicolons in the C-style loop:
for ($i=0; $i<=$l-4; $i++) {
Or, use a Perl style loop:
for my $i (0 .. $l - 4) {
But you don't have to loop over the positions, Perl can do that for you (see pos):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
chomp;
s/ //g;
$seq .= $_;
}
while ($seq =~ /(VSE.)/g) {
say "Motif $1 found at ", pos($seq) - 3;
}
Note that I followed some good practices:
I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.
Your base program can be reduced to a one-liner:
perl -0777 -nwE's/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
-0777 is slurp mode: the whole file is read in one go
-n uses either standard input, or a file name as argument as source for input
-E enable features (say in this case)
s/\s+//g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match
Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:
perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s+//g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt
Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.
The first <> reads the header, which can be used later.
The one-liner as a program file looks like this:
use strict;
use warnings; # never skip these pragmas
use feature 'say';
my $h = <>; # header, skip for now
$_ = do { local $/; <> }; # slurp rest of file
s/\s+//g; # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;
Use it like this:
perl fasta.pl P51170.fasta.txt
I am currently trying to write a Perl script that reads from one file and writes to another. The problem I have been having is removing newline characters from the parsed rows. I feed in a file like this
BetteDavisFilms.txt
1.
Wicked Stepmother (1989) as Miranda
A couple comes home from vacation to find that their grandfather has …
2.
Directed By William Wyler (1988) as Herself
During the Golden Age of Hollywood, William Wyler was one of the …
3.
Whales of August, The (1987) as Libby Strong
Drama revolving around five unusual elderly characters, two of whom …
Ultimately I am trying to put it into a format like this
1,Wicked Stepmother ,1989, as Miranda,A couple comes home from vacation to …
2,Directed By William Wyler ,1988, as Herself,During the Golden Age of …
3,"Whales of August, The ",1987, as Libby Strong,Drama revolving around five…
It successfully recognizes each number, but then I want to remove the \n and replace the "." with a ",". Sadly, the chomp function destroys or hides the data somehow, so when I print $row after chomping, nothing shows. What should I do to correct this?
#!bin/usr/perl
use strict;
use warnings;
my $file = "BetteDavisFilms";
my @stack = ();
open (my $in , '<', "$file.txt" ) or die "Could not open to read \n ";
open (my $out , '>', "out.txt" ) or die "Could not out to file \n";
my @array = <$in>;
sub readandparse() {
for(my $i = 0 ; $i < scalar(@array); $i++) {
my $row = $array[$i];
if($row =~ m/\d[.]/) {
parseFirstRow($row);
}
}
}
sub parseFirstRow() {
my $rowOne = shift;
print $rowOne; ####prints a number
chomp($rowOne);
print $rowOne; ###prints nothing
#$rowOne =~ s/./,/;
}
#call to run program
readandparse();
Your lines end with CR LF. You remove the LF, leaving the CR behind. Your terminal is homing the cursor on CR causing the next line output to overwrite the last line output.
$ perl -e'
print "XXXXXX\r";
print "xxx\n";
'
xxxXXX
Fix your input file
dos2unix file
or remove the CR along with the LF.
s/\s+\z// # Instead of chomp
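To see what chomp actually leaves behind, here is a small sketch with a made-up row; the variable names are illustrative only. It assumes the default input record separator ("\n"), as on Unix or Cygwin.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $row = "1.\r\n";                     # a line as read from a DOS-format file

chomp( my $chomped = $row );            # chomp removes only the "\n"
( my $clean = $row ) =~ s/\r?\n\z//;    # removes "\r\n" or a bare "\n"

printf "chomp leaves %d characters\n", length $chomped;   # 3: '1', '.', "\r"
printf "s///  leaves %d characters\n", length $clean;     # 2: '1', '.'
```

Unlike `s/\s+\z//`, the `s/\r?\n\z//` form keeps trailing spaces or tabs that may be part of the data.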
Hi, I have a requirement where I need to pull text of the form key=value from a large log file.
log file consists of data like this:
[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?
The output I expect is:
accountNumber=0
email=tom.cruise@gmail.com
firstName=Tom
etc.
Can anyone please help ? Also please explain the solution so that I can extend it to cater to my similar needs.
I wrote a one-liner for this:
perl -nle 's/^\[//; for (split(/,/)){s/(?:^\s+|\s+$)//g; print}' input.txt
I also made another line of input to test with:
Matt@MattPC ~/perl/testing/13
$ cat input.txt
[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?
[accountNumber=2, email=john.smith@gmail.com, firstName=John, lastName= , message=What is up with you?
Here is the output:
Matt@MattPC ~/perl/testing/13
$ perl -nle 's/^\[//; for (split(/,/)){s/(?:^\s+|\s+$)//g; print}' input.txt
accountNumber=0
email=tom.cruise@gmail.com
firstName=Tom
lastName=
message=Hello How are you doing today ?
accountNumber=2
email=john.smith@gmail.com
firstName=John
lastName=
message=What is up with you?
Explanation:
Expanded code:
perl -nle '
s/^\[//;
for (split(/,/)){
s/(?:^\s+|\s+$)//g;
print
}'
input.txt
Line by line explanation:
perl -nle calls perl with the command line options -n, -l, and -e. The -n adds a while loop around the program like this:
LINE:
while (<>) {
... # your program goes here
}
The -l adds a newline at the end of every print. And the -e specifies my code which will be in single quotes (').
s/^\[//; removes the first [ if there is one. This searches and replaces on $_ which is equal to the line.
for (split(/,/)){ begins the for loop, which loops through the list returned by split(/,/). Because split was called with just one argument, it splits $_, and it splits on ,. $_ was equal to the line, but inside the for loop $_ is set to the element of the list we are currently on.
s/(?:^\s+|\s+$)//g; this line removes leading and trailing white space.
print will print $_ followed by a newline, which is our string=value.
}' close the for loop and finish the '.
input.txt provide input to the program.
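For extending this to similar needs, the same steps can be written as a small sub in a script file. This is a sketch only: `split_pairs` is a hypothetical name, and the sample line is a shortened version of the log format in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the body of the one-liner as a sub.
sub split_pairs {
    my ($line) = @_;
    $line =~ s/^\[//;                  # drop the leading [ if there is one
    my @pairs = split /,/, $line;      # one element per key=value
    s/^\s+|\s+$//g for @pairs;         # trim both ends of each element
    return @pairs;
}

# Shortened sample line based on the log format in the question.
print "$_\n" for split_pairs('[accountNumber=0, firstName=Tom, lastName= , message=Hello');
```

Like the one-liner, this splits on every comma, so a message that itself contains a comma would be cut into several pieces.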
Going off your specific data and desired output, you could try the following:
use strict;
use warnings;
open my $fh, '<', 'file.txt' or die "Can't open file $!";
my $data = do { local $/; <$fh> };
my @matches = $data =~ /(\w+=\S+),/g;
print join "\n", @matches;
Perl One-Liner
Use this:
perl -0777 -ne 'while(m/[^ ,=]+=[^,]*/g){print "$&\n";}' yourfile
Assuming that each line of the log ends with a closing square bracket, you can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $line = '[accountNumber=0, email=tom.cruise@gmail.com, firstName=Tom, lastName= , message=Hello How are you doing today ?]';
while($line =~ /([^][,\s][^],]*?)\s*[],]/g) {
print $1 . "\n";
}
I am trying to delete the first line and remove leading and trailing whitespace from the subsequent lines using sed.
If I have something like
line1
line2
line3
It should print
line2
line3
So I tried this command on unix shell:
sed '1d;s/^ [ \t]*//;s/[ \t]*$//' file.txt
and it works as expected.
When I try the same in my perl script:
my @templates = `sed '1d;s/^ [ \t]*//;s/[ \t]*$//' $MY_FILE`;
It gives me this message "sed: -e expression #1, char 10: unterminated `s' command" and doesn't print anything. Can someone tell me where I am going wrong
Why would you invoke Sed from Perl anyway? Replacing the sed with the equivalent Perl code is just a few well-planned keystrokes.
my @templates;
if (open (M, '<', $MY_FILE)) {
@templates = map { s/(?:^\s*|\s*$)//g; $_ } <M>;
shift @templates;
close M;
} else {
# die horribly?
}
The backticks work like double-quotes. Perl interpolates variables inside them, as you already know due to your use of $MY_FILE. What you may not know is that $/ is actually a variable, the input record separator (by default a newline character). The same is true for the backslashes before the tab character. Here Perl will interpret \t for you and replace it with the tab character. You'll need a second backslash so that sed sees \t instead of an actual tab character. The latter might work as well, though.
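Because backticks interpolate like double quotes, a double-quoted string can be used to inspect what sed would receive. A small sketch (the variable names are illustrative, and only the last substitution of the sed program is shown):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Backticks interpolate like double quotes, so a double-quoted string
# shows the text that would be handed to sed.
my $unescaped = "s/[ \t]*$//";      # \t becomes a tab, $/ becomes "\n"
my $escaped   = "s/[ \\t]*\$//";    # sed receives s/[ \t]*$// unchanged

print "unescaped contains a newline\n" if $unescaped =~ /\n/;
print "escaped: $escaped\n";
```

The newline that replaces `$/` cuts the s command short, which is consistent with the "unterminated `s' command" error in the question.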
Consider using a safe pipe open instead of backticks, to avoid problems with escaping. For example:
my @templates = do {
open my $fh, "-|", 'sed', '1d;s/^ [ \t]*//;s/[ \t]*$//', $MY_FILE
or die $!;
<$fh>; # list context: one element per line of sed output
};
You have a typo in your expression. You need a semicolon between the 2 substitution statements. You should use the following instead:
my @templates = `sed '1d;s/^ [ \\t]*//;s/[ \\t]*\$//' $MY_FILE`;
escaping $ and \ as suggested in the other answer. I should note that it also worked for me without escaping \ as it was replaced by a literal tab.
As others have mentioned, I would recommend you do this only in Perl, or only in Sed, because there's really no reason to use both for this task. Using Sed in Perl will mean you have to worry about escaping, quoting and capturing the output (unless reading from a pipe) somehow. Obviously, all that complicates things and it also makes the code very ugly.
Here is a Perl one-liner that will handle your reformatting:
perl -le 'my $line = <>; while (<>) { chomp; s/^\s+|\s+$//g; print $_; }' file.txt
Basically, you just take the first line and store in a variable that won't be used, then process the rest of the lines. Below is a small script version that you can add to your existing script.
#!/usr/bin/env perl
use strict;
use warnings;
my $usage = "$0 infile";
my $infile = shift or die $usage;
open my $in, '<', $infile or die "Could not open file: $infile";
my $first = <$in>;
while (<$in>) {
chomp;
s/^\s+|\s+$//g; # /g is needed so that both ends are trimmed
# process your data here, or just print...
print $_, "\n";
}
close $in;
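The same logic can be tried without a file by passing the lines in directly. This is a sketch under that assumption; `trim_all_but_first` is a hypothetical helper and the sample lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop the first line, trim the rest.
sub trim_all_but_first {
    my @lines = @_;
    shift @lines;                  # like sed's 1d
    for (@lines) {
        s/^\s+//;                  # leading whitespace
        s/\s+\z//;                 # trailing whitespace, including the newline
    }
    return @lines;
}

print "$_\n" for trim_all_but_first("line1\n", "  line2 \t\n", "\tline3\n");
```

Using two separate substitutions avoids the need for the /g modifier that the combined `^\s+|\s+$` pattern requires.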
This can also be done with awk
awk 'NR>1 {$1=$1;print}' file
line2
line3
I've got a Perl script which consumes an XML file on Linux, and occasionally there are CRLFs (hex 0D0A, DOS newlines) in some of the node values.
The system which produces the XML file writes it all as a single line, and it looks as if it occasionally decides that this is too long and writes a CRLF into one of the data elements. Unfortunately there's nothing I can do about the providing system.
I just need to remove these from the string before I process it.
I've tried all sorts of regex replacement using the perl char classes, hex values, all sorts and nothing seems to work.
I've even run the input file through dos2unix before processing and I still can't get rid of the erroneous characters.
Does anyone have any ideas?
Many Thanks,
Typical, After battling for about 2 hours, I solved it within 5 minutes of asking the question..
$output =~ s/[\x0A\x0D]//g;
Finally got it.
$output =~ tr/\x{d}\x{a}//d;
These are both whitespace characters, so if the terminators are always at the end, you can right-trim with
$output =~ s/\s+\z//;
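The difference between the two approaches shows up when the CRLF sits in the middle of a value, as described in the question. A minimal sketch with a made-up XML fragment; `strip_crlf` is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete every CR and LF, wherever it occurs.
sub strip_crlf {
    my ($text) = @_;
    $text =~ tr/\r\n//d;           # /d deletes the listed characters
    return $text;
}

my $node = "<name>Jo\r\nhn</name>\r\n";       # made-up sample value

( my $right_trimmed = $node ) =~ s/\s+\z//;   # only trims the end

print strip_crlf($node), "\n";
print "embedded CRLF survives the right-trim\n" if $right_trimmed =~ /\r\n/;
```

Note that deleting every CR and LF also removes any line breaks that were meant to be there, which is fine for input that is supposed to be a single line.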
A few options:
1. Replace all occurrences of cr/lf with lf: $output =~ s/\r\n/\n/g; #instead of \r\n might want to use \015\012
2. Remove all trailing whitespace: $output =~ s/\s+$//g;
3. Slurp and split:
#!/usr/bin/perl -w
use strict;
sub main{
createfile();
outputfile();
}
main();
sub createfile{
(my $file = $0)=~ s/\.pl/\.txt/;
open my $fh, ">", $file;
print $fh "1\n2\r\n3\n4\r\n5";
close $fh;
}
sub outputfile{
(my $filei = $0)=~ s/\.pl/\.txt/;
(my $fileo = $0)=~ s/\.pl/out\.txt/;
open my $fin, "<", $filei;
local $/; # slurp the file
my $text = <$fin>; # store the text
my @text = split(/(?:\r\n|\n)/, $text); # split on dos or unix newlines
close $fin;
local $" = ", "; # change array scalar separator
open my $fout, ">", $fileo;
print $fout "@text"; # should output numbers separated by comma space
close $fout;
}