Parsing of Text File using Perl - regex

i have a text file which looks like this.
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89
I want to write a perl script that parses this file in the following format
Parameter Field1 Field2 Field3
0 100 0 4
1 873 23 89
Can anyone help me with this. Any help will be greatly appreciated.
i have tried the following so far
my %hash = ();
my $file = "sample.txt";
open (my $fh, "<", $file) or die "Can't open the file $file: ";
while (my $line =<$fh>)
{
chomp ($line);
my($key) = split(" : ", $line);
$hash{$key} = 1;
}
foreach my $key (sort keys %hash)
{
print "$key\n";
}

This Perl program does what you ask. It allows for any number of fields for each parameter (although there must be the same number of fields for every parameter) and takes the header labels for the fields from the data itself.
use strict;
use warnings;
my $file = 'sample.txt';
open my $fh, '<', $file or die qq{Can't open "$file" for input: $!};
my %data;
my #params;
my #fields;
while (<$fh>) {
next unless /\S/;
chomp;
my ($key, $val) = split /\s*:\s*/;
if ($val =~ /\S/) {
push #fields, $key if #params == 1;
push #{ $data{$params[-1]} }, $val if #params;
}
else {
die qq{Unexpected parameter format "$key"} unless $key =~ /parameter\s+(\d+)/i;
push #params, $1;
}
}
my #headers = ('Parameter', #fields);
my #widths = map length, #headers;
my $format = join(' ', map "%${_}s", #widths) . "\n";
printf $format, #headers;
for my $param (#params) {
printf $format, $param, #{ $data{$param} };
}
output
Parameter Field 1 Field 2 Field 3
0 100 0 4
1 873 23 89

use warnings; use strict;
my $file = "sample.txt";
open (my $fh, "<", $file) or die "Can't open the file $file: ";
print "Parameter Field1 Field2 Field3\n";
while (my $line=<$fh>) {
process_parameter($1) if $line =~ /Parameter (\d+):/;
}
sub process_parameter {
my $parameter = shift;
my ($field_1) = (<$fh> =~ /(\d+) *$/);
my ($field_2) = (<$fh> =~ /(\d+) *$/);
my ($field_3) = (<$fh> =~ /(\d+) *$/);
printf " %-2d %-6d %-6d %-6d\n", $parameter, $field_1, $field_2, $field_3;
}

#!/usr/bin/perl
my %hash = ();
my %fields;
my $param;
while ( chomp( my $line = <DATA> ) ) {
if ( $line =~ /Parameter (\d+):/ ) {
$param = $1;
}
next unless ( defined $param );
if ( my ( $f, $v ) = $line =~ /(Field \d+)[\s\t]*: (\d+)/ ) {
$hash{$param} ||= {};
$hash{$param}->{$f} = $v;
$fields{$f} ||= 1;
}
}
my #fields = sort keys %fields;
print join( ',', 'Parameter', #fields ), "\n";
foreach my $param ( sort { $a <=> $b } keys %hash ) {
print join( ',', $param, #{ $hash{$param} }{#fields} ), "\n";
}
__DATA__
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89

Here is a way that accepts any number of fields for each parameter:
my $par;
my %out;
my $max = 0;
while(<DATA>) {
chomp;
next if /^\s*$/;
if (/Parameter\s*(\d+)/) {
$par = $1;
next;
}
my ($k, $v) = $_ =~/Field\s+(\d+)\s*:\s*(\d+)/;
$out{$par}[$k] = $v;
$max = $k if $k > $max;
}
my $cols = 'Param';
$cols .= "\tField $_" for (1..$max);
say $cols;
foreach my $par(sort (keys %out)) {
my $out = $par;
$out .= "\t".($out{$par}[$_]//' ') for (1..$max);
say $out;
}
__DATA__
Parameter 0:
Field 1 : 100
Field 2 : 0
Field 3 : 4
Field 5 :18
Parameter 1:
Field 1 : 873
Field 2 : 23
Field 3 : 89
Field 4 : 123
output:
Param Field 1 Field 2 Field 3 Field 4 Field 5
0 100 0 4 18
1 873 23 89 123

Related

Perl : match multiple lines

With the use of perl regex, if two consecutive lines match than count the number of lines.
I want the number of lines until matches the pattern
D001
0000
open ($file, "$file") || die;
my #lines_f = $file;
my $total_size = $#lines_f +1;
foreach my $line (#lines_f)
{
if ($line =~ /D001/) {
$FSIZE = $k + 1;
} else {
$k++;}
}
Instead of just D001, I also want to check if the next line is 0000. If so $FSIZE is the $file size.
The $file would look something like this
00001
00002
.
.
.
D0001
00000
00000
Here is an example. This sets $FSIZE to undef if it cannot find the marker lines:
use strict;
use warnings;
my $fn = 'test.txt';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
chomp (my #lines = <$fh>);
close $fh;
my $FSIZE = undef;
for my $i (0..$#lines) {
if ($lines[$i] =~ /D0001/) {
if ( $i < $#lines ) {
if ( $lines[$i+1] =~ /00000/ ) {
$FSIZE = $i + 1;
last;
}
}
}
}

Updating and creating new elements in a hash table

After i parse a large .sql file this is the printed output from a %hashtable:
Key:AS_LINR
Value:
Name:DS_LSNE_DDD_TS_A
Type:view
Parents:DM_LINE_END MINA_TI_GRP_V
This is the %hash :
$hashtable{$name}="Name:$name
Type:$type
Parents:#parents"."\n\n
".
"----------------------------"
;
I need to check each parent if he exists as a key in the %hash . If he does i need to update it and add a new filed named children: , i will add as a value to the field children the name where i first found the parent .Like in this example :
Key:DM_LINE_END
Value:
Name:DS_LSNE_DDD_TS_A
Type:view
Children:AS_LINR
And i need to do this for each Parent . I want to update a hash by adding new elements to it , and if the key of the has does not exist i have to create one .
If i must explain better what i have to do please ask it in comments .
Here is my perl code :
my $var=0;
my #joinparents=();
use warnings;
my %hashtable;
open(DATA,'<','NaViews.sql') or die "Error $!";
open(Writer,'>','ResultFile.txt') or die "Error $!";
open(Writer1,'>','AuxResult.txt') or die "Error $!";
my #create_cmds = ();
my $create_cmd = "";
READ_DATA : while (<DATA>) {
chop;
my $ln = $_;
$ln =~ s/^\s+//;
$ln =~ s/\s+$//;
next READ_DATA if($ln =~ /^\-\-/);
next READ_DATA if($ln =~ /^REM/);
if($create_cmd ne "") {
$create_cmd = $create_cmd." ".$ln;
}
if($ln =~ /^create/i) {
$create_cmd = $ln;
}
elsif($ln =~ /\;$/) {
push #create_cmds, $create_cmd;
$create_cmd = "";
}
}
close DATA;
my #views = ();
foreach my $create_cmd (#create_cmds) {
$create_cmd =~ s/\s+/ /;
$create_cmd =~ s/^\s+//;
$create_cmd =~ s/\s+$//;
my $name = get_view($create_cmd);
my $type = get_type($create_cmd);
my $content = substr($create_cmd, 0, -1);
my #parents =();#get_parents();
my #children = ();#get_children();
#------------------------------------------------------------------------
my #froms = split(/ from\s+/i, $create_cmd);
my #joins = split(/ join /i, $create_cmd);
#parcurge mai multe for in aceeasi structura
#FOR FROM
# body...
foreach my $i (1..#froms-1) {
#print Writer1 "$froms[$i]"."\n\n";
my $from = (split(/ where |select | left | left | right | as /i, $froms[$i])) [0];
$from=~s/^\s+//;
$from=~s/\(+//;
my #Spaces = split(/, | , /,$from);
foreach my $x (0..#Spaces-1) {
my $SpaceFrom = (split(/ /,$Spaces[$x])) [0];
$SpaceFrom=~s/;//;
$SpaceFrom=~s/\)+//;
#print Writer1 $SpaceFrom."\n\n";
push(#parents,$SpaceFrom);
# print "\n\n".$SpaceFrom."\n\n";
# print Writer "\n\n".$SpaceFrom."\n\n";
}
foreach my $x (1..#joins-1){
#print "$joins[$i]"."\n\n";
my $join = (split(/ on /i,$joins[$x])) [0];
my $joinspace = (split(/ /i,$joins[$x])) [0];
#print Writer "\n\n".$join."\n\n";
#print Writer1 $joinspace."\n\n";
#"$joinspace\n\n";
push(#parents,$joinspace);
print Writer1"\n\n".$parents[$_]."\n\n";
}
}
push #views, [$name, $type, $content, #parents, #children];
$hashtable{$name}="[0]Name:$name
[1]Type:$type
[2]Content:$content
[3]Parents:#parents"."\n\n
".
"----------------------------";
}
print Writer "Key:$_
Value:
$hashtable{$_}\n" foreach (keys%hashtable);
#------------------------------------------------------------------------------
print_views(\#views);
exit;
#------------------------------------------------------------------------------
sub get_view {
my $create_cmd = $_[0];
my $tmp = (split(/ view | trigger | table /i, $create_cmd))[1];
$tmp =~ s/^\s+//;
my $view = (split(/\s+/, $tmp))[0];
return $view;
}
#-----------------------------------------------------------------------------
sub get_type{
my $create_cmd = $_[0];
my $tmp = (split(/ replace /i, $create_cmd))[1];
$tmp =~ s/^\s+//;
my $view = (split(/\s+/, $tmp))[0];
return $view;
}
#-----------------------------------------------------------------------------
sub get_parents {
}
sub get_children {
}
get_children();
close Writer1;
close Writer;
This is how a chunk of data i have to parse looks like :
create or replace view MINA_TI_GRP_V
as
select NVL(max(t1.interval_group),(select dm_group from sdate_dm_grp_v)) AS DM_GROUP,
(t2.interval_number) INTERVAL_NUMBER , t2.time_interval_s
from MINA_INTERVAL_CONTROL t2
left join DM_TI_GRP_DATE_TIME t1 on t2.time_interval_s >= t1.time_interval_s
group by t2.interval_number , t2.time_interval_s
order by t2.interval_number;
If you want to easily find out what the parents are for an entry in %hashtable, you'll find it much easier if you store the data as another hash rather than one giant string like this...
$hashtable{$name}={"Name" => $name, "Type" => $type, "Parents" => \#Parent};
Then you can reference $hashtable{$key}->{"Parents"} to get an array ref that contains the parents for that data entry which you could use like this...
foreach my $parent (#{$hashtable{$key}->{"Parents"}})
{
if(defined($hashtable{$parent}))
{
# Parent exists in hashtable
}
else
{
# Parent does not exist in hashtable
}
}

perl to print lines with matching fields from two files, error: use of uninitialized value in string

I have two files, which look like:
File1:
chr id position a0 a1
22 rs4820378:39869209:C:T 39869209 C T
22 22:16050075:A:G 16050075 A G
22 22:16050115:G:A 16050115 G A
22 rs199694733:39913976:C:CT 39913976 C CT
22 rs139408809:39937958:GC:G 39937958 GC G
File2:
SNP CHR BP A1 A2
rs4820378 22 39869209 C T
rs4821900 22 39869719 G A
rs1984662 22 39869997 T G
rs35629588 22 39913976 I2 D
rs139408809 22 39937958 D I2
I would like to find lines where
Fields 1 and 3 from File1 match fields 2 and 3 from File2
and also either
Fields 4 and 5 from File1 are matching fields 4 and 5 from File2
Field 4 from both files have more than 1 character
Field 5 from both files have more than 1 character
Then print out field 2 from File1, and fields 1 and 3 from File2
Code below
#! perl -w
use strict;
use warnings;
my %kglocus;
open( my $loci_in, "<", "File1" ) or die $!;
while ( <$loci_in> ) {
next if m/chr/;
my ( $CHR, $id, $BP, $A1, $A2 ) = split;
my $reg = "${CHR}_$BP";
$kglocus{$reg} = [ $CHR, $id, $BP, $A1, $A2 ];
}
close $loci_in;
my $filename = shift #ARGV;
open( my $input, "<", $filename ) or die $!;
while ( <$input> ) {
next if m/SNP/;
my ( $SNP, $CHR, $BP, $A1, $A2 ) = split;
my $reg = "${CHR}_$BP";
if ( $A1 eq $kglocus{$reg}->[3] and $A2 eq $kglocus{$reg}->[4] ) {
print "$kglocus{$reg}->[1] $SNP $BP\n";
}
elsif ( ( length( $A1 ) > 1 && length( $kglocus{$reg}->[3] ) > 1 ) ||
( length( $A2 ) > 1 && length( $kglocus{$reg}->[4] ) > 1 ) ) {
print "$kglocus{$reg}->[1] $SNP $BP\n";
}
}
close( $input );
I'm getting the error below for all input lines:
Use of uninitialized value in string eq at find_ID.hash.chr22.pl line 23
Use of uninitialized value in length at find_ID.hash.chr22.pl line 27
Can anyone point out the problem?
The problem is that the existence of the hash element $kglocus{$reg} forms the first test, that "Fields 1 and 3 from File1 match fields 2 and 3 from File2". But you are treating it as if that test always passes, and simply using it to access elements of the File1 record
You need something like a next unless $kglocus{$reg} in there to make it work correctly. I would also prefer to see that value pulled out as a separate variable to avoid indexing the hash over and over again
Here's a solution that will work for you
use strict;
use warnings;
use v5.10.1;
use autodie;
my %kglocus;
{
open my $in_fh, '<', 'File1';
while ( <$in_fh> ) {
next if /chr/;
my ( $chr, $id, $bp, $a1, $a2 ) = split;
my $key = "${chr}_$bp";
$kglocus{$key} = [ $chr, $id, $bp, $a1, $a2 ];
}
}
{
my ( $filename ) = #ARGV;
open my $in_fh, '<', $filename;
while ( <$in_fh> ) {
next if /SNP/;
my ( $snp, $chr, $bp, $a1, $a2 ) = split;
my $key = "${chr}_$bp";
next unless my $item = $kglocus{$key};
if ( $a1 eq $item->[3] and $a2 eq $item->[4]
or length $a1 > 1 and length $item->[3] > 1
or length $a2 > 1 and length $item->[4] > 1 ) {
print "$item->[1] $snp $bp\n";
}
}
}

Perl: How to extract sequences based on gene number and nucleotide length?

I have 2 files, as follows:
file1.txt:
0 117nt, >gene_73|GeneMark.hm... *
0 237nt, >gene_3097|GeneMark.... *
0 237nt, >gene_579|GeneMark.h... *
0 237nt, >gene_988|GeneMark.h... *
0 189nt, >gene_97|GeneMark.hm... *
0 183nt, >gene_97|GeneMark.hm... *
file2.fasta:
>gene_735|GeneMark.hmm|237_nt|+|798985|799221
TTGTGGTTCGTGCCGCGCGACGCGTTGCGTCTGCAAACGCCCGACGAAGACATCGCGACCTATCTGTTCAACAAGCATGTGATTCGGCATCGGTTCTGTCCGACCTGCGGGATTCATCCGTTCGCGGAAGGCACGGACCCGAAGGGCAACGCGATGGCGGCCGTCAATCTTCGCTGCGTCGACGGCGTCGATCTCGACGCGTTGAGCGTCCGCCATTTCGACGGGCGCGCGCTCTGA
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_1876|GeneMark.hmm|234_nt|-|2168498|2168731
ATGCTGTTCTTTTCGCGCGCGGGCGTGTCGCGTGCGGCCGGCGGCCAATCATGCGGCGAGTCGTTTTGTCGCGGCTCGCGGCGCTTGCCGACGTTGGAATCGCGCGCGCCGATGCGCGGATCGGGGCGGCAACGTTTGCGTATGAGGAATGATGCGTTTGCGCATCGGGAATGGGCGCCTCGCCCCGGTTTCGCCGCGATTCCGCCCGACTCGAGGCAGTCGTTTTTCCGCTAA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|105_nt|+|90122|90226
GTGACGCGTTTCGCGACGCGCGTCGATGGGGCGGGCGCGAAACCCGTTCGCCGCGATGCGGCGGACGGGGTATGGCCGAGCGCCGTCCGTCGCGGCGAGAGTTGA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
>gene_97|GeneMark.hmm|234_nt|+|105494|105727
ATGAAGATTCAAATCGCCATTGTTTATTTTGTCGCCCGTCACGCAAACGAGCAGGCGCGAAGCGGATCGGCGCGCATTGGCGAAGAGCCGGCGCGCATCGGCATCGCGCTCGCGCGACACATGCGCGCCGCGCGCGGCCGGTCGACGCCGGATTCGCCTGTCGATCGATCCGGTGCGCCCCGAGCCGATGAGCGGTACGCTTCGGCGCGCGCGCGACACGCGCGACACGCGTGA
>gene_979|GeneMark.hmm|225_nt|-|1115442|1115666
TTGATCGACGCGCGGGGCCGGCCGGGCCGCGGGGTATCGAAGGCGATCGACGCGCAACACGAATCGCCGCCGCGCGCCGAAACCTCGCTATGCGCGTCGCGCGCACGCGCGGCCGGCGGCGCACGCGCGGGTGTGCGCGGGCCGGCGGCGCGGCCGCTCGCACTGCGCGACCGCTCGCGCGCACGCCTTCCTCGGCACGCGCCGGGAATCCCGGCCCTTCAATGA
The output that I expect is:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
There are 4 sequences with gene number 97, but all in different length. I want the sequence with the correct gene length only which listed in file1.txt to output in the output.fasta file. What I've done so far is as follows (but failed and have some errors):
#!/usr/bin/perl
use strict;
use warnings;
my #genes;
open my $list, '<file1.txt';
while (my $line = <$list>) {
push (#genes, $1) if $line =~/\>(.*?)\|/gs;
}
my $tag1 = "0\t";
my $tag2 = "nt";
while (my $line = <$list>) {
if ($line =~ /$tag1(.*?)$tag2/) {
my $match1 = $1;
}
}
my $input;
{
local $/ = undef;
open my $fasta, '<file2.fasta';
my $tag3 = "GeneMark.hmm";
my $tag4 = "_nt";
while (my $input = <$fasta>) {
if ($input =~ /$tag3(.*?)$tag4/) {
my $match2 = $1; }}
close $fasta;
}
my #lines = split(/>/,$input);
foreach my $l (#lines) {
if ($l =~ /(.+?)\|/) {
my $real_name = $1;
if ($real_name ~~ #genes) {
if ($match2 = $match1) {
open (OUTFILE, '>>output.fasta');
print OUTFILE ">$l"; }
}
}
}
Can anyone give me some guide to correct the code? Or is there any better way to do this? Any help will be very much appreciated! Thanks! :)
Here's an option that uses Bio::SeqIO:
use strict;
use warnings;
use Bio::SeqIO;
my %hash;
open my $fh, '<', $ARGV[0] or die $!;
while (<$fh>) {
push #{ $hash{$2} }, $1 if /\s+(\d+)nt,.+?>(gene_\d+)\|/;
}
close $fh;
my $in = Bio::SeqIO->new( -file => $ARGV[1], -format => 'Fasta' );
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq)
if $seq->id =~ /(gene_\d+)\|.+?\|(\d+)_nt\|/ and grep /$2/, #{ $hash{$1} };
}
Usage: perl script.pl file1.txt file2.fasta [>outFile.fasta]
The second, optional parameter directs output to a file.
Output from your data:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCC
AACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTC
AACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACC
AAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGG
CGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGC
TCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAA
ACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCG
GTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTC
CGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGG
CTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTC
ATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGC
GACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCAC
TAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCG
AACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGC
GATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGC
CATCGATGA
Bio::SeqIO lives to parse fasta (and other such) files, so the above leverages this capability. After creating a hash of arrays (HoA) from file1.txt, the fasta file is processed, and only matching fasta records are printed.
Hope this helps!

Perl regex syntax generation

This is a follow up to the question posted here: Perl Regex syntax
The results from that discussion yielded this script:
#!/usr/bin/env perl
use strict;
use warnings;
my #lines = <DATA>;
my $current_label = '';
my #ordered_labels;
my %data;
for my $line (#lines) {
if ( $line =~ /^\/(.*)$/ ) { # starts with slash
$current_label = $1;
push #ordered_labels, $current_label;
next;
}
if ( length $current_label ) {
if ( $line =~ /^(\d) "(.*)"$/ ) {
$data{$current_label}{$1} = $2;
next;
}
}
}
for my $label ( #ordered_labels ) {
print "$label <- as.factor($label\n";
print " , levels= c(";
print join(',',map { $_ } sort keys %{$data{$label}} );
print ")\n";
print " , labels= c(";
print join(',',
map { '"' . $data{$label}{$_} . '"' }
sort keys %{$data{$label}} );
print ")\n";
print " )\n";
}
__DATA__
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
Essentially, I need to build the Perl to accommodate a syntax shorthand found in the SPSS file. For adjacent columns, SPSS allows one to type something like:
VALUE LABELS
/agree1 to agree5
1 "Strongly disagree"
2 "Disagree"
3 "Neutral"
4 "Agree"
5 "Strongly agree"
As the script currently exists, it will generate this:
agree1 to agree5 <- factor(agree1 to agree5
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
and I need it to produce something like this:
agree1 <- factor(agree1
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
agree2 <- factor(agree2
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
…
use strict;
use warnings;
main();
sub main {
my #lines = <DATA>;
my $vlabels = get_value_labels(#lines);
write_output_delim($vlabels);
}
# Extract the value label information from SPSS syntax.
sub get_value_labels {
my (#vlabels, $i, $j);
for my $line (#_){
if ( $line =~ /^\/(.+)/ ){
my #vars = parse_var_range($1);
$i = #vlabels;
$j = $i + #vars - 1;
push #vlabels, { var => $_, codes => [] } for #vars;
}
elsif ( $line =~ /^\s* (\d) \s+ "(.*)"$/x ){
push #{$vlabels[$_]{codes}}, [$1, $2] for $i .. $j;
}
}
return \#vlabels;
}
# A helper function to handle variable ranges: "agree1 to agree3".
sub parse_var_range {
my $vr = shift;
my #vars = split /\s+ to \s+/x, $vr;
return $vr unless #vars > 1;
my ($stem) = $vars[0] =~ /(.+?)\d+$/;
my #n = map { /(\d+)$/ } #vars;
return map { "$stem" . $_ } $n[0] .. $n[1];
}
sub write_output_delim {
my $vlabels = shift;
for my $vlab (#$vlabels){
print $vlab->{var}, "\n";
print join("\t", '', #$_), "\n" for #{$vlab->{codes}}
}
}
sub write_output_factors {
# You get the idea...
}
__DATA__
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
/agree1 to agree3
1 "Disagree"
2 "Neutral"
3 "Agree"