Dynamic variable names and hash keys - regex

My script written in Perl parse text output of a command which is structured like this:
Controller information
Controller Status : Optimal
Logical device information
Logical Device number 1
Logical Status : Optimal
Logical Device number 2
Logical Status : Optimal
Parser should fill multidimentional structure:
{
"controller": {
"Controller Status": "Optimal"
}
"logical": [
{
"Logical Status": "Optimal"
},
{
"Logical Status": "Optimal"
}
]
}
I try parsing using dynamic variables:
foreach (#lines) {
$variable = "info{controller}" if (m/Controller information/);
$variable = "info{logical}[$1]" if (m/Logical Device number (\d+)/);
${$variable}{1} = $2 if (m/\s+(.*?)\s+:\s(.*)$/);
}
In this case header will set hash key corresponding to the topic and all following parameters will be put into chosen key.
The first problem is that dynamic ${$variable} will not work if $variable contain any hash or array keys. Is there a way to make dynamic variables work with hash keys inside?
The second problem is that dynamic variables are called "always-a-bad-idea", and I wonder if there is short but effective way to build parsing function without dynamic variables?

Use Data::DeepAccess (or Data::Diver with a bit different interface).
Note that I used $1 - 1 as the index, otherwise the first element with index 0 under logical was undef.
#!/usr/bin/perl
use warnings;
use strict;
use Data::DeepAccess qw{ deep_set };
my %info;
my #where;
while (<DATA>) {
#where = ('controller') if /Controller information/;
#where = ('logical', {index => $1 - 1}) if /Logical Device number (\d+)/;
deep_set(\%info, #where, "$1", "$2") if m/\s+(.*?)\s+:\s(.*)$/;
}
use Data::Dumper; print Dumper \%info;
__DATA__
Controller information
Controller Status : Optimal
Logical device information
Logical Device number 1
Logical Status : Optimal
Logical Device number 2
Logical Status : Optimal
Or, use references:
#!/usr/bin/perl
use warnings;
use strict;
my %info;
my $where;
while (<DATA>) {
$where = \$info{controller} if /Controller information/;
$where = \$info{logical}[ $1 - 1 ] if /Logical Device number (\d+)/;
$$where->{$1} = $2 if m/\s+(.*?)\s+:\s(.*)$/;
}
...

Related

How do i read lines from a file into a hash in Perl

I'm working with a file format that has lines of information that looks like this:
ATOM 1 N LYS A 56 20.508 14.774 -7.432 1.00 50.83 N
All i want is the first number, and the three numbers following '56' in the example above; so im using regular expressions to get that information. How do i then put that info into a hash?
So far i have:
my $pdb_file = $ARGV[0];
open (PDBFILE, "<$pdb_file") or die ("$pdb_file not found");
while (<PDBFILE>) {
if ($_=~ /^ATOM\s+(\d+)\s+\w+\s+\w+\s+\w+\s+\d+\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) {
my $atom = $1;
my $xcor = $2;
my $ycor = $3;
my $zcor = $4;
print "AtomNumber: $atom\t xyz: $xcor $ycor $zcor\n";
}
}
Instead of using a regex, I would instead recommend using split to split it into fields on whitespace. This will be faster and more robust, it doesn't depend on a detailed knowledge of the format of each field (which could change, like if a number has a minus sign which you forgot to take into account). And it's a lot easier to understand.
my #fields = split /\s+/, $line;
Then you can pick out the fields (for example, the first number is field 2, so $fields[1]) and put them into your hash.
my %coordinate = (
atom => $fields[1],
x => $fields[6],
y => $fields[7],
z => $fields[8]
);
You're reading a bunch of lines, so you're going to make a bunch of hashes which have to go somewhere. I'd recommend putting them all in another hash with some sort of unique field as the key. Possibly the atom field.
$atoms{$coordinate{atom}} = \%coordinate;

Parse CSV based on random header layout perl

I am trying some code using Tie::Handle::CSV, however I suppose this could be done using other modules or none at all.
What I want to do is take a file of random layout and match it to what I predict the headers will be. Then I want to arrange it to my table structure.
DATA
First Name,last name,date of birth
Jim,Johansen,08/25/1989
OR
2nd Name,1st Name,D.O.B
Johansen,Jim,08/25/1989
OR
2nd Name,1st Name,D.O.B,city,county
Johansen,Jim,08/25/1989,milwaukee,N/A
As you can see, I have varying data structures. I want it to arrange them to my schema, even if the fields I request are empty. The way I want to do this is by searching my input with my column variables.
Here is what I am trying.
Code
use Tie::Handle::CSV;
my $name1 =qr/First Name|Name|1st Name/i;
my $name2 =qr/Last Name|Maiden Name|2nd Name/i;
my $date_of_birth =qr/date of birth|D.O.B/i;
my $city =qr/city|town/i;
my $csv_fh = Tie::Handle::CSV->new('list.txt', header => 1);
while (my $csv_line = <$csv_fh>)
{
print $csv_line->{'$date_of_birth'}.",".$csv_line->{'$name1'}." ".$csv_line->{'$name2'}.",".$csv_line->{'$city'}.\n"; ##note I am searching for the column {$'colummn regex'} instead of {'column'} to see if my input file matches any of the header options.
}
close $csv_fh;
My output is blank since this module is not understanding the regex I am implimenting. However, my output would contain the columns specified if I used their literal names, i.e.
The out put I want would be:
Scenario 1
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen, ##noting also that if there is no 'city' in the input data, leave blank.
Scenario 2
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,
Scenario 3
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,milwaukee
Perhaps there is a better option than a module or even my regex variables. Has anyone had to parse csvs in ever changing layouts?
You never use $name1, $name2, etc, much less in a match (or substitution) operator, so you ever execute any regex match.
my $field_names = $csv_fh->header();
my ($name1_header) = grep /First Name|^Name$|1st Name/i, #$field_names;
my ($name2_header) = grep /Last Name|Maiden Name|2nd Name/i, #$field_names;
my ($dob_header ) = grep /date of birth|D\.O\.B/i, #$field_names;
my ($city_header ) = grep /city|town/i, #$field_names;
my #recognized_fields = ( $name1_header, $name2_header, $dob_header, $city_header );
my %recognized_fields = map { $_ => 1 } #recognized_fields;
my #other_headers = grep !$recognized_fields{$_}, #$field_names;
while (my $row = <$csv_fh>) {
my $name1 = $name1_header ? $row->{$name1_header} : undef;
my $name2 = $name2_header ? $row->{$name2_header} : undef;
my $dob = $dob_header ? $row->{$dob_header } : undef;
my $city = $city_header ? $row->{$city_header } : undef;
my #other_fields = #$row{#other_headers};
...
}

Perl String Regular Expression

i need some help in Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to only keep in $jsonString the itemId value that is above itemName (where itemName equals $itemName)
I would really appreciate your help. I am really amateur in regular expressions.
Thank you!
Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my #match = grep { $_->{'-itemName'} eq $itemName } #{$itemRef};
# extract the item ID
if (#match) {
my $itemID = $match[0]->{'-itemID'};
print $itemID;
}
Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
open( my $json_in, "<", 'test.json' );
local $/;
$json_string = <$json_in>;
}
my $json = decode_json ( $json_string );
print Dumper \$json;
foreach my $item ( #{ $json -> {'ITEM'} } ) {
print $item -> {'-itemID'},"\n";
print $item -> {'-itemName'},"\n";
}
But you have to fix your json first. (There's a trailing comma that shouldn't be there. )
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitively the wrong way to do things.
It might work fine for a good long time, but if your source program changes a little - and change their output, whilst still sticking to the JSON standard - your code will break unexpectedly, and you may not realise. That can set off a domino effect of breakages, making a whole system or site just crash and burn. And worse yet - the source of this crash and burn will be hidden away in some script that hasn't been touched in years, so will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table having for each amino-acid the number of times is coded by the pattern. Here in sequence1, pattern codes only for one amino-acid, but in sequence2 it codes for two. I would like to have this tool working to scale to thousands of sequences. I've been thinking about how to get this done, but I only thought to: replace all nucleotides different than the pattern, translate what remains and get summary of the coded amino-acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so a can imagine they are already in frame (+1) and there is no need to get amino-acids coded in a frame different than +1.
I would like to know for which amino-acid the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
my $pos = 0;
my %counts;
for (split /($pattern)/ => $seq->seq) {
if ($_ eq $pattern) {
my $dist = $pos % 3;
unless ($dist == 0) {
my $num = 3 - $dist;
s/.{$num}//;
chop until length () % 3 == 0;
}
my $table = Bio::Tools::CodonTable->new();
$counts{$_}++ for split (//, $table->translate($_));
}
$pos += length;
}
print $seq->display_id() . ":\n";
map {
print "$_ => $counts{$_}\n"
}
sort {
$counts{$a} <=> $counts{$b}
}
keys %counts;
print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
I will stick to that first version of what you wanted cause the addendum only confused me even more. (frame?)
I only found ATAGTA once in sequence2 but I assume you want the mirror images/reverse sequence as well, which would be ATGATA in this case. Well my script doesn't do that so you would have to write it up twice in the input_sequences file but that should be no problem I would think.
I work with a file like yours which I call "dna.txt" and a input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurences in the dna.txt file (including overlap-results but it can be set to non-overlap as explained in the awk).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
Code is awk calling another awk (but one of them is simple). You have to run
"./match_patterns.awk input_seq.txt" to get the results file generated.:
*match_patterns.awk:*
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{string=$0
do
{
where = match(string, subval)
# code is for overlapping matches (i.e ATA matches twice in ATATAC)
# for non-overlapping replace +1 by +RLENGTH in following line
if (RSTART!=0){count++; string=substr(string,RSTART+1)}
}
while (RSTART != 0)
}
END{print subval","count >> "results.txt"}
Files have to be all in the same directory.
Good luck!

Comparison between 2 files and printing the difference between them in a sequential manner

I have 2 files namely
**a.txt**
cars
bikes
bus
vehicle
atv
**b.txt**
hawk
hero
atv
bus
***result.txt***
cars
bikes
vehicle
hawk
hero
I want to print the difference between the 2 files. Now I have tried a code but it gives me the difference in a random manner; I want it to display it in a proper sequential manner. Can anyone help me out in this.
use strict;
my %results = ();
open FILE1, "<a.txt"
or die "Could not open file: $! \n";
while (my $line = <FILE1>) {
$results{$line}=1;
}
close FILE1;
open FILE2, "<b.txt"
or die "Could not open file: $! \n";
while (my $line = <FILE2>) {
$results{$line}++;
}
close FILE2;
open OUTFILE, ">>result.txt"
or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
print OUTFILE $line if $results{$line} == 1;
}
close OUTFILE;
close OUTFILE1;
The output that I am getting with this code is
***result.txt***
cars
hawk
bikes
hero
vehicle
Hashes store their elements in a random order, therefore we have to carry positional information with us. In the following solution, I use the hash value to carry an unique ID.
Code
#!/usr/bin/perl
use strict; use warnings; use Data::Dumper;
my #a = qw( cars bikes bus vehicle atv );
my #b = qw( hawk hero atv bus );
my $i = 0;
my %ahash = map {$_ => ++$i} #a;
my %bhash = map {$_ => ++$i} #b;
my %different = map {
($ahash{$_} and $bhash{$_})
? ()
: ($_ => ($ahash{$_} or $bhash{$_}))
} (keys %ahash, keys %bhash);
my #sorted = sort {$different{$a} <=> $different{$b}} keys %different;
print Dumper(\#sorted);
Discussion
For the purpose of demonstration, I didn't use files, but stored the records in arrays #a and #b.
I build hashes %ahash and %bhash from these arrays, using the array element as key and using an unique number as value. In the case of the %ahash, these IDs are the same as the line number. You can think of the IDs in %bhash as line numbers with an offset.
I then gather all key-value pairs that differ between the hashes. To do that, I take the list of all keys in %ahash and %bhash. For each of these keys, I test for the existence of this key in both hashes (I don't use exists, but rather test against the value, as I can guarantee that no ID with a false value is assigned—all numbers except 0 evaluate true). If the key exists in both hashes, I return the empty list (). If the key exists in only one of the hashes, I return a two-element list containing the key and the value, which is either in %ahash or %bhash.
In the next step, I sort all different keys in ascending order by their associated value, and store the keys in this order in #sorted. You would print them to a file, I simply dumped them, and got this output:
$VAR1 = [
'cars',
'bikes',
'vehicle',
'hawk',
'hero'
];
I believe this meets your criteria.
Basically, you had a quite elegant algorithm to find differences, but you didn't incorporate the sorting data, as I did with my line-number like IDs.
Style notes
In most cases, using lexical filehandles with my is preferred over using global filehandles. Also, the three-argument form of open is more flexible and worth getting used to. For example, I'd write your third line as
my $filename1 = "a.txt";
open my $file1, '<', $filename1 or die qq{Could not open "$filename1": $!\n};
but then again, TIM TOWTDI.
perl -lne '$X{$_}++;if(eof){$a++;}if($a==2){foreach (keys %X){if($X{$_}==1){print $_}}}' file1 file2
tested below:
> cat temp
cars
bikes
bus
vehicle
atv
> cat temp2
hawk
hero
atv
bus
> perl -lne '$X{$_}++;if(eof){$a++;}if($a==2){foreach (keys %X){if($X{$_}==1){print $_}}}' temp temp2
cars
hawk
hero
vehicle
bikes
>
Since the data is stored as a hash reference the order is not preserved. See How can I print a Perl hash in specific order?
$ diff a.txt b.txt
$ is not part of the code, it is there to denote a bash command prompt.
comm may also be of use to you.