translating awk script into perl - regex

I'm trying to translate this code into perl.
gawk '/^>c/ {OUT=substr($0,2) ".fa";print " ">OUT}; OUT{print >OUT}' your_input
Can someone help me?

Perl has a utility to do this for you called a2p. If your script is call script.awk then you would run:
$ a2p script.awk
Which produces:
#!/usr/bin/perl
eval 'exec /usr/bin/perl -S $0 ${1+"$#"}'
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
# process any FOO=bar switches
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
while (<>) {
chomp; # strip record separator
if (/^>c/) {
$OUT = substr($_, (2)-1) . '.fa';
&Pick('>', $OUT) &&
(print $fh ' ');
}
;
if ($OUT) {
&Pick('>', $OUT) &&
(print $fh $_);
}
}
sub Pick {
local($mode,$name,$pipe) = #_;
$fh = $name;
open($name,$mode.$name.$pipe) unless $openammeamme}++;
}
To save this to a file, use redirection:
$ a2p script.awk > script.pl
Perl also provides a tool for converting sed scripts: s2p.

#!/usr/bin/perl
my ($outf,$OUT) ;
while(<>){
if(/^>(c.*)/){ $OUT = "$1.fa";
close($outf) if $outf;
open($outf,">",$OUT);
print OUT " \n"}
if($outf){ print $outf $_ }
}
if input is:
>caaa
sdf
sdff
>cbbb
ew
ew
Creats 2 files:
==> caaa.fa <==
>caaa
sdf
sdff
==> cbbb.fa <==
>cbbb
ew
ew

This perl one liner should be equivalent of that awk command:
perl -ane 'if($F[0] =~ /^>c/){$OUT=substr($F[0],1).".fa"; if(OUT==null) {open(OUT,">$OUT");} print OUT " \n"} if ($OUT){print OUT $_} END{close(OUT)}' file
Indented command line:
perl -ane 'if ($F[0] =~ /^>c/) {
$OUT = substr($F[0], 1).".fa";
if (OUT==null) { open(OUT, ">$OUT") }
print OUT " \n"
}
if ($OUT) {
print OUT $_
}
END{close(OUT)
}' file

Related

How can I limit this term to a 3 piece hit?

It would be the real task to sadfsadf! Ghfgh% fgh65 %% of this text to replace the first 3 characters that are true for regexp, so replace it with 'o' sadfsadfoghfghofgh65o%
#!/usr/bin/perl -w
#list=<>;
chomp(#list);
foreach(#list) {
if($_ =~ m/\W/) {
# here is the problem because all the characters you find it
# overwrite it, but I only need to translate 3 characters from it
$_ =~ s/\W/o/g;
print $_."\n";
}
else {
print "->\n";
}
}
#start string => sadfsadf!ghfgh%fgh65%%
#result my program => sadfsadfoghfghofgh65oo
#and I need it => sadfsadfoghfghofgh65o%
change only the first 3 results to 'o'
Here is a way to go:
use feature 'say';
my $in = 'sadfsadf!ghfgh%fgh65%%';
$in =~ s/\W/o/ for 1..3;
say $in;
Output:
sadfsadfoghfghofgh65o%
Solved the problem
Code:
#!/usr/bin/perl -w
while ($be=<STDIN>)
{
chomp $be;
push (#list, $be);
}
foreach $a (#list)
{
$count=0;
if ($a=~/\W/)
{
while (($a=~m/\W/g) && ($count < 3))
{
$count++;
$a=~s/\W/o/;
}
print "$a\n";
}
else {print "->\n";}
}
your data in 'list' file
perl -pe 'for $i(1..3){s/\W/o/} ' list
on bash if you put data such list='sadfsadf!ghfgh%fgh65%%'
for((i=1;i<=3;i++)){ list=`echo $list|sed -E "s/\W/Y/"`; }

Printing both the matched line and lines ±1 upon regex match (1-liner)

How can I print a matched line as well as a line before and after it? I've currently got:
perl -lane 'print if $F[3] > 100000 && $F[2] =~ /^C$/ && print $last; $last = $_'
Which is capable of printing both the matched line and a line before it - but I am not sure how to include the line after.
You can read the next line from the file directly using scalar <>:
perl -lane 'print, print scalar <>
if $F[3] > 100000 && $F[2] =~ /^C$/
&& print $last;
$last = $_' input
Or use a sliding window for overlapping matches:
perl -ane 'BEGIN { #b = map [], 1 .. 3 }
sub out {
shift #b;
if ($b[1][3] > 100_000 && $b[1][2] =~ /^C$/) {
print for map $_->[-1], #b;
}
}
push #b, [#F, $_];
out()
}{ out() # Process the last line
' input
The following handles overlapping matches. $N is the number of lines to printe before and after the matching lines.
perl -lane'
BEGIN { $N = 1 }
if ($F[3] > 100000 && $F[2] =~ /^C$/) { print for splice(#buf), $_; $next=$N }
elsif ($next) { --$next; print }
else { push #buf, $_; splice #buf, 0, -$N }
'
Since we know $N = 1, we can simplify the above into the following:
perl -lane'
if ($F[3] > 100000 && $F[2] =~ /^C$/) { print for splice(#buf), $_; $next=1 }
elsif ($next) { $next=0; print }
else { #buf = $_ }
'
You can also use seek and tell and rewind back one line for overlapping matches:
#!/usr/bin/perl
use strict;
use warnings;
open my $fh ,'<', 'input' or die "unable to open file: $!\n";
my $last="";
while(<$fh>){
my #b=split;
if(($b[3] > 100000) && ($b[2] =~ /^C$/)){
print $last if $last;
print;
my $t=tell $fh;
print scalar <$fh>,"\n";
seek $fh,$t,0; #rewind
}
$last=$_;
}
close($fh);

Perl Parse $1 from ARGV

Is it possible to parse a $1 from ARGV in Perl?
I am trying to parse a passed in $1 from ARGV, but its treating it as a literal string, rather than the $1.
This is just a simple script trying to show proof of concept.
#!/usr/bin/perl
my $from = '^(.*/)([^/]*)$';
my $dir = "/var/foo/baz";
$dir =~ /$from/;
# This works and prints the expected output I would like
print "Dir $1\n";
print "File $2\n";
# This is printing a literal '$1' and '$2'
print "Dir $ARGV[0]\n";
print "File $ARGV[1]\n";
This is what I am running and my output:
$ ./test.pl '$1' '$2'
Dir /var/foo/
File baz
Dir $1
File $2
You want to use String::Substitution.
$ perl -E'
use String::Substitution qw( interpolate_match_vars last_match_vars );
my ($str, $pat, $x_template, $y_template) = #ARGV;
$str =~ $pat
or die("Didn'\''t match\n");
my $x = interpolate_match_vars($x_template, last_match_vars());
my $y = interpolate_match_vars($y_template, last_match_vars());
say "x: $x";
say "y: $y";
' \
abcdef '(.)c(.)' '$1' '$2'
x: b
y: d

match using regex in perl

HI I am trying to exract some data from a text file in perl. My file looks like this
Name:John
FirstName:Smith
Name:Alice
FirstName:Meyers
....
I want my string to look like John Smith and Alice Meyers
I tried something like this but I'm stuck and I don't know how to continue
while (<INPUT>) {
if (/^[Name]/) {
$match =~ /(:)(.*?)(\n) /
$string = $string.$2;
}
if (/^[FirstName]/) {
$match =~ /(:)(.*?)(\n)/
$string = $string.$2;
}
}
What I try to do is that when I match Name or FirstName to copy to content between : and \n but I get confused which is $1 and $2
This will put you first and last names in a hash:
use strict;
use warnings;
use Data::Dumper;
open my $in, '<', 'in.txt';
my (%data, $names, $firstname);
while(<$in>){
chomp;
($names) = /Name:(.*)/ if /^Name/;
($firstname) = /FirstName:(.*)/ if /^FirstName/;
$data{$names} = $firstname;
}
print Dumper \%data;
Through perl one-liner,
$ perl -0777 -pe 's/(?m).*?Name:([^\n]*)\nFirstName:([^\n]*).*/\1 \2/g' file
John Smith
Alice Meyers
while (<INPUT>) {
/^([A-Za-z])+\:\s*(.*)$/;
if ($1 eq 'Name') {
$surname = $2;
} elsif ($1 eq 'FirstName') {
$completeName = $2 . " " . $surname;
} else {
/* Error */
}
}
You might want to add some error handling, e.g. make sure that a Name is always followed by a FirstName and so on.
$1 $2 $3 .. $N , it's the capture result of () inside regex.
If you do something like that , you cant avoid using $1 like variables.
my ($matched1,$matched2) = $text =~ /(.*):(.*)/
my $names = [];
my $name = '';
while(my $row = <>){
$row =~ /:(.*)/;
$name = $name.' '.$1;
push(#$names,$name) if $name =~ / /;
$name = '' if $name =~ / /;
}
`while(<>){
}
`
open (FH,'abc.txt');
my(%hash,#array);
map{$_=~s/.*?://g;chomp($_);push(#array,$_)} <FH>;
%hash=#array;
print Dumper \%hash;

How to extract the text between two patterns using REGEX perl

In the following lines how can I store the lines between "Description:" and "Tag:" in a variable using REGEX PERL and what would be a good datatype to use, string or list or something else?
(I am trying to write a program in Perl to extract the information of a text file with Debian package information and convert it into a RDF(OWL) file(ontology).)
Description: library for decoding ATSC A/52 streams (development)
liba52 is a free library for decoding ATSC A/52 streams. The A/52 standard is
used in a variety of applications, including digital television and DVD. It is
also known as AC-3.
This package contains the development files.
Homepage: http://liba52.sourceforge.net/
Tag: devel::library, role::devel-lib
The code I have written so far is:
#!/usr/bin/perl
open(DEB,"Packages");
open(ONT,">>debianmodelling.txt");
$i=0;
while(my $line = <DEB>)
{
if($line =~ /Package/)
{
$line =~ s/Package: //;
print ONT ' <package rdf:ID="instance'.$i.'">';
print ONT ' <name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</name>'."\n";
}
elsif($line =~ /Priority/)
{
$line =~ s/Priority: //;
print ONT ' <priority rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</priority>'."\n";
}
elsif($line =~ /Section/)
{
$line =~ s/Section: //;
print ONT ' <Section rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Section>'."\n";
}
elsif($line =~ /Maintainer/)
{
$line =~ s/Maintainer: //;
print ONT ' <maintainer rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</maintainer>'."\n";
}
elsif($line =~ /Architecture/)
{
$line =~ s/Architecture: //;
print ONT ' <architecture rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</architecture>'."\n";
}
elsif($line =~ /Version/)
{
$line =~ s/Version: //;
print ONT ' <version rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</version>'."\n";
}
elsif($line =~ /Provides/)
{
$line =~ s/Provides: //;
print ONT ' <provides rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</provides>'."\n";
}
elsif($line =~ /Depends/)
{
$line =~ s/Depends: //;
print ONT ' <depends rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</depends>'."\n";
}
elsif($line =~ /Suggests/)
{
$line =~ s/Suggests: //;
print ONT ' <suggests rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</suggests>'."\n";
}
elsif($line =~ /Description/)
{
$line =~ s/Description: //;
print ONT ' <Description rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Description>'."\n";
}
elsif($line =~ /Tag/)
{
$line =~ s/Tag: //;
print ONT ' <Tag rdf:datatype="http://www.w3.org/2001/XMLSchema#string">'.$line.'</Tag>'."\n";
print ONT ' </Package>'."\n\n";
}
$i=$i+1;
}
my $desc = "Description:";
my $tag = "Tag:";
$line =~ /$desc(.*?)$tag/;
my $matched = $1;
print $matched;
or
my $desc = "Description:";
my $tag = "Tag:";
my #matched = $line =~ /$desc(.*?)$tag/;
print $matched[0];
or
my $desc = "Description:";
my $tag = "Tag:";
(my $matched = $line) =~ s/$desc(.*?)$tag/$1/;
print $matched;
Additional
If your Description and Tag may be on separate lines, you may need to use the /s modifier, to treat it as a single line, so the \n won't wreck it. Example:
$_=qq{Description:foo
more description on
new line Tag: some
tag};
s/Description:(.*?)Tag:/$1/s; #notice the trailing slash
print;
Assuming:
my $example; # holds the example text above
You could:
(my $result=$example)=~s/^.*?\n(Description:)/$1/s; # strip up to first marker
$result=~s/(\nTag:[^\n]*\n).+$/$1/s; # strip everything after second marker line
Or
(my $result=$example)=~s/^.*?\n(Description:.+?Tag:[^\n]*\n).*$/$1/s;
Both assume the Tag: value is contained on a single line.
If this is not the case, you might try:
(my $result=$example)=~s/
( # start capture
Description: # literal 'Description:'
.+? # any chars (non-greedy) up to
Tag: # literal 'Tag:'
.+? # any chars up to
)
(?: # either
\n[A-Z][a-z]+\: # another tagged value name
| # or
$ # end of string
)
/$1/sx;
I believe that the problem is caused by using a line reading loop for data structured by paragraphs. If you can slurp the file into memory and and apply split with a captured delimiter, the processing will be much smoother:
#!/usr/bin/perl -w
use strict;
use diagnostics;
use warnings;
use English;
# simple sample sub
my $printhead = sub {
printf "%5s got the tag '%s ...'\n", '', substr( shift, 0, 30 );
};
# map keys/tags? to functions
my %tagsoups = (
'PackageName' => sub {printf "%5s got the name '%s'\n", '', shift;}
, 'Description' => sub {printf "%5s got the description:\n---------\n%s\n----------\n", '', shift;}
, 'Tag' => $printhead
);
# slurp Packages (fallback: parse using $INPUT_RECORD_SEPARATOR = "Package:")
open my $fh, "<", './Packages-00.txt' or die $!;
local $/; # enable localized slurp mode
my $all = <$fh>;
my #pks = split /^(Package):\s+/ms, $all;
close $fh;
# outer loop: Packages
for (my $p = 1, my $n = 0; $p < scalar #pks; $p +=2) {
my $blk = "PackageName: " . $pks[$p + 1];
my #inf = split /\s*^([\w-]+):\s+/ms, $blk;
printf "%3d %s named %s\n", ++$n, $pks[$p], $inf[ 2 ];
# outer loop: key-value-pairs (or whatever they are called)
for (my $x = 1; $x < scalar #inf; $x += 2) {
if (exists($tagsoups{$inf[ $x ]})) {
$tagsoups{$inf[ $x ]}($inf[$x + 1]);
}
}
}
output for a shortened Packages file from my Ubuntu Linux:
3 Package named abrowser-3.5-branding
got the PackageName:
---------
abrowser-3.5-branding
----------
got the Description:
---------
dummy upgrade package for firefox-3.5 -> firefox
This is a transitional package so firefox-3.5 users get firefox on
upgrades. It can be safely removed.
----------
4 Package named casper
got the PackageName:
---------
casper
----------
got the Description:
---------
Run a "live" preinstalled system from read-only media
----------
got the Tag:
---------
admin::boot, admin::filesystem, implemented-in::shell, protocol::smb, role::plugin, scope::utility, special::c
ompletely-tagged, works-with-format::iso9660
----------
Using a hash for the functions to apply to the extracted parts will keep the details of generating xml out of the parser loops.