Perl String Regular Expression - regex

I need some help with Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to keep in $jsonString only the itemID value that sits above the itemName entry (the one where itemName equals $itemName).
I would really appreciate your help. I am a real amateur with regular expressions.
Thank you!

Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my @match = grep { $_->{'-itemName'} eq $itemName } @{$itemRef};
# extract the item ID
if (@match) {
my $itemID = $match[0]->{'-itemID'};
print $itemID;
}
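As a self-contained illustration of the same idea, here is a minimal sketch. It assumes the core module JSON::PP rather than JSON, turns on its relaxed mode so the trailing comma in the sample is tolerated, and wraps the lookup in a helper named item_id_for (a name made up for this example).

```perl
use strict;
use warnings;
use JSON::PP;

# Shortened copy of the sample data, trailing comma included on purpose.
my $jsonString = <<'JSON';
{
  "ITEM": [
    { "-itemID": "1000000", "-itemName": "DisneyJuniorLA" },
    { "-itemID": "1000001", "-itemName": "31 minutos" },
  ]
}
JSON

# Hypothetical helper: return the -itemID of the first item whose
# -itemName equals $name, or nothing if no item matches.
sub item_id_for {
    my ($json_text, $name) = @_;
    my $data = JSON::PP->new->relaxed->decode($json_text);
    for my $item (@{ $data->{ITEM} || [] }) {
        return $item->{'-itemID'} if $item->{'-itemName'} eq $name;
    }
    return;
}

my $itemName = '31 minutos';
my $id = item_id_for($jsonString, $itemName);
print defined $id ? "$id\n" : "not found\n";
```

Relaxed mode is only a stopgap here; fixing the producer of the JSON is still the better option.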

Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
open( my $json_in, "<", 'test.json' ) or die "Cannot open test.json: $!";
local $/;
$json_string = <$json_in>;
}
my $json = decode_json ( $json_string );
print Dumper \$json;
foreach my $item ( @{ $json -> {'ITEM'} } ) {
print $item -> {'-itemID'},"\n";
print $item -> {'-itemName'},"\n";
}
But you have to fix your json first. (There's a trailing comma that shouldn't be there. )
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitively the wrong way to do things.
It might work fine for a good long time, but if the source program changes its output a little, while still sticking to the JSON standard, your code will break unexpectedly, and you may not realise. That can set off a domino effect of breakages, making a whole system or site crash and burn. Worse yet, the source of the failure will be hidden away in some script that hasn't been touched in years, so it will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Related

Dynamic variable names and hash keys

My script, written in Perl, parses the text output of a command, which is structured like this:
Controller information
Controller Status : Optimal
Logical device information
Logical Device number 1
Logical Status : Optimal
Logical Device number 2
Logical Status : Optimal
The parser should fill a multidimensional structure:
{
"controller": {
"Controller Status": "Optimal"
},
"logical": [
{
"Logical Status": "Optimal"
},
{
"Logical Status": "Optimal"
}
]
}
I tried parsing using dynamic variables:
foreach (@lines) {
$variable = "info{controller}" if (m/Controller information/);
$variable = "info{logical}[$1]" if (m/Logical Device number (\d+)/);
${$variable}{1} = $2 if (m/\s+(.*?)\s+:\s(.*)$/);
}
In this case a header sets the hash key corresponding to the topic, and all following parameters are put under the chosen key.
The first problem is that the dynamic ${$variable} does not work if $variable contains any hash or array keys. Is there a way to make dynamic variables work with hash keys inside?
The second problem is that dynamic variables are called "always-a-bad-idea", and I wonder whether there is a short but effective way to build a parsing function without dynamic variables.
Use Data::DeepAccess (or Data::Diver, which has a slightly different interface).
Note that I used $1 - 1 as the index, otherwise the first element with index 0 under logical was undef.
#!/usr/bin/perl
use warnings;
use strict;
use Data::DeepAccess qw{ deep_set };
my %info;
my @where;
while (<DATA>) {
@where = ('controller') if /Controller information/;
@where = ('logical', {index => $1 - 1}) if /Logical Device number (\d+)/;
deep_set(\%info, @where, "$1", "$2") if m/\s+(.*?)\s+:\s(.*)$/;
}
use Data::Dumper; print Dumper \%info;
__DATA__
Controller information
Controller Status : Optimal
Logical device information
Logical Device number 1
Logical Status : Optimal
Logical Device number 2
Logical Status : Optimal
Or, use references:
#!/usr/bin/perl
use warnings;
use strict;
my %info;
my $where;
while (<DATA>) {
$where = \$info{controller} if /Controller information/;
$where = \$info{logical}[ $1 - 1 ] if /Logical Device number (\d+)/;
$$where->{$1} = $2 if m/\s+(.*?)\s+:\s(.*)$/;
}
...
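For comparison, here is a self-contained sketch of the reference approach using only core Perl. The function name parse_report and the slightly stricter key/value regex (anchored, so the whole text before the colon becomes the key) are assumptions made for this example, not part of the code above.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: parse the report lines into a nested structure.
sub parse_report {
    my (@lines) = @_;
    my %info;
    my $section;    # reference to the hash currently being filled
    for my $line (@lines) {
        if ($line =~ /Controller information/) {
            $section = $info{controller} ||= {};
        }
        elsif ($line =~ /Logical Device number (\d+)/) {
            # device numbers start at 1, array indexes at 0
            $section = $info{logical}[ $1 - 1 ] ||= {};
        }
        elsif ($section && $line =~ /^\s*(.*?)\s+:\s+(.*?)\s*$/) {
            $section->{$1} = $2;
        }
    }
    return \%info;
}

my @lines = (
    'Controller information',
    '   Controller Status : Optimal',
    'Logical device information',
    'Logical Device number 1',
    '   Logical Status : Optimal',
    'Logical Device number 2',
    '   Logical Status : Optimal',
);
print Dumper(parse_report(@lines));
```

Because $section always points at a real hash inside %info, no symbolic references are needed and the code runs under use strict.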

Parse CSV based on random header layout perl

I am trying some code using Tie::Handle::CSV, however I suppose this could be done using other modules or none at all.
What I want to do is take a file of random layout and match it to what I predict the headers will be. Then I want to arrange it to my table structure.
DATA
First Name,last name,date of birth
Jim,Johansen,08/25/1989
OR
2nd Name,1st Name,D.O.B
Johansen,Jim,08/25/1989
OR
2nd Name,1st Name,D.O.B,city,county
Johansen,Jim,08/25/1989,milwaukee,N/A
As you can see, I have varying data structures. I want it to arrange them to my schema, even if the fields I request are empty. The way I want to do this is by searching my input with my column variables.
Here is what I am trying.
Code
use Tie::Handle::CSV;
my $name1 =qr/First Name|Name|1st Name/i;
my $name2 =qr/Last Name|Maiden Name|2nd Name/i;
my $date_of_birth =qr/date of birth|D.O.B/i;
my $city =qr/city|town/i;
my $csv_fh = Tie::Handle::CSV->new('list.txt', header => 1);
while (my $csv_line = <$csv_fh>)
{
print $csv_line->{'$date_of_birth'}.",".$csv_line->{'$name1'}." ".$csv_line->{'$name2'}.",".$csv_line->{'$city'}."\n"; ## note: I am searching for the column {'$column_regex'} instead of {'column'} to see if my input file matches any of the header options.
}
close $csv_fh;
My output is blank since this module does not understand the regex I am implementing. However, my output would contain the columns specified if I used their literal names, i.e.
The output I want would be:
Scenario 1
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen, ##noting also that if there is no 'city' in the input data, leave blank.
Scenario 2
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,
Scenario 3
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,milwaukee
Perhaps there is a better option than a module or even my regex variables. Has anyone had to parse csvs in ever changing layouts?
You never use $name1, $name2, etc., much less in a match (or substitution) operator, so you never execute any regex match.
my $field_names = $csv_fh->header();
my ($name1_header) = grep /First Name|^Name$|1st Name/i, @$field_names;
my ($name2_header) = grep /Last Name|Maiden Name|2nd Name/i, @$field_names;
my ($dob_header ) = grep /date of birth|D\.O\.B/i, @$field_names;
my ($city_header ) = grep /city|town/i, @$field_names;
# skip headers that were not found, so no undef ends up as a hash key
my @recognized_fields = grep { defined } ( $name1_header, $name2_header, $dob_header, $city_header );
my %recognized_fields = map { $_ => 1 } @recognized_fields;
my @other_headers = grep !$recognized_fields{$_}, @$field_names;
while (my $row = <$csv_fh>) {
my $name1 = $name1_header ? $row->{$name1_header} : undef;
my $name2 = $name2_header ? $row->{$name2_header} : undef;
my $dob = $dob_header ? $row->{$dob_header } : undef;
my $city = $city_header ? $row->{$city_header } : undef;
my @other_fields = @$row{@other_headers};
...
}
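The same header-matching idea can be tried without any CSV module. The sketch below is only an illustration: it assumes simple comma-separated input with no quoted fields, and the names %header_pattern, header_index and reformat_row are invented for this example.

```perl
use strict;
use warnings;

# Patterns for the header spellings we expect (an assumed list, extend as needed).
my %header_pattern = (
    name1 => qr/^(?:first name|1st name|name)$/i,
    name2 => qr/^(?:last name|maiden name|2nd name)$/i,
    dob   => qr/^(?:date of birth|d\.o\.b)$/i,
    city  => qr/^(?:city|town)$/i,
);

# Map each logical field to the column position of the first matching header.
sub header_index {
    my (@headers) = @_;
    my %index;
    for my $field (keys %header_pattern) {
        for my $i (0 .. $#headers) {
            next unless $headers[$i] =~ $header_pattern{$field};
            $index{$field} = $i;
            last;
        }
    }
    return \%index;
}

# Build one output line; fields missing from the input come out empty.
sub reformat_row {
    my ($index, @cells) = @_;
    my %value;
    for my $field (keys %header_pattern) {
        my $i = $index->{$field};
        $value{$field} = (defined $i && defined $cells[$i]) ? $cells[$i] : '';
    }
    return join ',', $value{dob}, "$value{name1} $value{name2}", $value{city};
}

my $index = header_index(split /,/, '2nd Name,1st Name,D.O.B,city,county');
print "Date of Birth,Name,City\n";
print reformat_row($index, split /,/, 'Johansen,Jim,08/25/1989,milwaukee,N/A'), "\n";
```

The patterns are anchored so that a bare "Name" header cannot also match "2nd Name". For real-world files with quoted fields, a proper CSV parser should replace the plain split.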

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino acid when there is a certain pattern in a DNA sequence. For example, the pattern could be ATAGTA. So, given:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table giving, for each amino acid, the number of times it is coded by the pattern. Here in sequence1 the pattern codes for only one amino acid, but in sequence2 it codes for two. I would like this tool to scale to thousands of sequences. I've been thinking about how to get this done, but the only approach I came up with is: replace all nucleotides other than the pattern, translate what remains, and summarise the coded amino acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using the 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so I imagine the sequences are already in frame (+1) and there is no need to get amino acids coded in a frame other than +1.
I would like to know for which amino-acid the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
my $pos = 0;
my %counts;
for (split /($pattern)/ => $seq->seq) {
if ($_ eq $pattern) {
my $dist = $pos % 3;
unless ($dist == 0) {
my $num = 3 - $dist;
s/.{$num}//;
chop until length () % 3 == 0;
}
my $table = Bio::Tools::CodonTable->new();
$counts{$_}++ for split (//, $table->translate($_));
}
$pos += length;
}
print $seq->display_id() . ":\n";
map {
print "$_ => $counts{$_}\n"
}
sort {
$counts{$a} <=> $counts{$b}
}
keys %counts;
print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
I will stick to the first version of what you wanted, because the addendum only confused me more. (Frame?)
I only found ATAGTA once in sequence2, but I assume you want the mirror image (the reversed sequence) as well, which would be ATGATA in this case. My script doesn't do that, so you would have to list it twice in the input sequences file, but that should be no problem.
I work with a file like yours, which I call "dna.txt", and an input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurrences in the dna.txt file (including overlapping matches, but it can be set to non-overlapping as explained in the awk code).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
The code is awk calling another awk (but one of them is simple). You have to run "./match_patterns.awk input_seq.txt" to get the results file generated.
match_patterns.awk:
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{string=$0
do
{
where = match(string, subval)
# code is for overlapping matches (i.e ATA matches twice in ATATAC)
# for non-overlapping replace +1 by +RLENGTH in following line
if (RSTART!=0){count++; string=substr(string,RSTART+1)}
}
while (RSTART != 0)
}
END{print subval","count >> "results.txt"}
Files have to be all in the same directory.
Good luck!
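Since the question is about Perl, the overlapping count that test.awk performs can also be sketched in Perl with a zero-width lookahead. This is only an illustration; the function name count_overlapping is made up, and the pattern is treated as a literal string via \Q...\E.

```perl
use strict;
use warnings;

# Count occurrences of $pattern in $sequence, overlapping matches included.
# The lookahead matches at every start position without consuming characters.
sub count_overlapping {
    my ($sequence, $pattern) = @_;
    my $count = () = $sequence =~ /(?=\Q$pattern\E)/g;
    return $count;
}

my @sequences = ('ATGGCGCATAGTAATGC', 'ATGATAGTAATGCGCGC');
for my $pattern (qw(GC ATA ATAGTA ATGATA)) {
    my $total = 0;
    $total += count_overlapping($_, $pattern) for @sequences;
    print "$pattern,$total\n";
}
```

With the two sample sequences this prints the same counts as results.txt above.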

Error in regex hyphen usage inside brackets utilizing perl

I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
@cyt = <CYT>;
@nuc = <NUC>;
$cyt = join ('',@cyt);
$cyt =~ /\[([^\]]+)\]/g;
@shared = '';
foreach $nuc (@nuc) {
if ($cyt =~ $nuc) {
push @shared, $nuc;
}
}
print @shared;
What I am trying to achieve with this code is to compare two different lists loaded into the arrays @cyt and @nuc. I then compare the name between the [] of each element in one list to the name in [] of the other. All those finds are then pushed into @shared. Hope that clarifies it a bit.
Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (@union, @intersection, @difference);
my %count = ();
foreach my $element (@array1, @array2) { $count{$element}++ }
foreach my $element (keys %count) {
push @union, $element;
push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that every [name] will be entirely contained within a given line rather than crossing a newline boundary, and that each line of input will contain at most one [name].
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
my($fh) = @_;
my %name;
while (<$fh>) {
++$name{$1} if /\[ ([^\]]+) \]/x;
}
%name;
}
Your question uses old-fashioned typeglob filehandles. Note that the parameter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set intersection, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). For truly shared keys (i.e., those whose values are greater than one), the code prints them and deletes the rest.
for (sort keys %shared) {
if ($shared{$_} > 1) {
print $_, "\n";
}
else {
delete $shared{$_};
}
}
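To try the approach without the data files, here is a compact sketch that works on in-memory lists. The helper names bracket_name and shared_names and the sample lines are made up for this example. Because the names are compared as hash keys and never used as patterns, hyphens inside the brackets cannot trigger the "Invalid [] range" error.

```perl
use strict;
use warnings;

# Return the text between the first [ ... ] pair on a line, or undef.
sub bracket_name {
    my ($line) = @_;
    return $line =~ /\[ ([^\]]+) \]/x ? $1 : undef;
}

# Return the sorted names that appear in brackets in both lists.
sub shared_names {
    my ($list_a, $list_b) = @_;
    my %in_a;
    for my $line (@$list_a) {
        my $name = bracket_name($line);
        $in_a{$name} = 1 if defined $name;
    }
    my %shared;
    for my $line (@$list_b) {
        my $name = bracket_name($line);
        $shared{$name} = 1 if defined $name && $in_a{$name};
    }
    return sort keys %shared;
}

my @cyt = (
    '>seq1 leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-36-700-3]',
    '>seq2 hypothetical protein [Example organism A]',
);
my @nuc = (
    '>seq3 some other protein [Puccinia graminis f. sp. tritici CRL 75-36-700-3]',
    '>seq4 hypothetical protein [Example organism B]',
);
print "$_\n" for shared_names(\@cyt, \@nuc);
```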

Compare date stored as a string

I have 2 timestamps (date and time) written in the following format:
Mon Aug 12 17:32:39 PDT 2013
This is my local time; the other time is stored in a hash. I need to compare this time with the time stored in the hash, using all the possible comparisons of date as well as time.
This is what I have got
#!/usr/bin/perl
use DateTime;
open(FH,'log.txt');
my %stat;
my ($qbsid, $exittime, $exittimeval);
while ($line = <FH>) {
if ($line =~ /Exit time/) {
($exittime, $exittimeval) = split(': ',$line);
$stat{$qbsid} = {
time => $exittimeval
};
}
}
my $local_time = localtime time;
foreach my $qbsid (keys %stat){
my $cmd = $stat{$qbsid}->{time};
my $cmp = DateTime->compare($cmd,$datetime);
print "$cmp\n";
}
Please suggest a way to do it.
The hash prints date in the same format as above:
Mon Aug 10 14:31:49 PDT 2013
Thank you for your time.
You should take a look at Date::Calc. That module provides utilities for comparing dates. Regex is the wrong approach here.
I'm going to recommend Time::Piece. This doesn't seem to be too many people's favorite time module, but it comes with Perl which means you don't have to install it. I suspect it will become the standard time module and you might as well learn how to use it. Earlier versions of Time::Piece had issues with Julian dates, but the latest version solved that (and will be in the next Perl release).
Without knowing what your log looks like, and what you are trying to do, it's a bit difficult to help you with your coding. For example, you have a hash of hashes keyed by $qbsid, which is never set. You need to put use strict; and use warnings; in your program. This would have caught that error. It would also have caught the bug where you declare $local_time before your second loop, but use $datetime in that loop.
Also, split does not take a string, but a regular expression pattern as the first argument.
You also need to do a chomp $line right after your while statement. Otherwise, your $line will end with an invisible NL character that will probably cause all sorts of issues.
Now, what exactly are you trying to store? It looks like your inner hash is stored as the current time as the key and your $exittimeval as the value.
I would use the epoch time (which is the number of seconds since January 1, 1970) as the key, and a Time::Piece object as your data:
($exittime, $exittimeval) = split /:\s+/, $line;
my $exit_time_obj = Time::Piece->strptime( $exittimeval, "%a %b %d %H:%M:%S %Z %Y");
my $time_obj = Time::Piece->new(time);
$stat{$qbsid} = {$time_obj->epoch => $exit_time_obj};
Now, you can do your comparisons by changing the key into a Time::Piece object and doing what ever comparisons you want.
By the way, another issue: time keeps a changin'! If you're keying your sub-hash by time, it will be different by the time you get to your second loop. You need to use a second loop to get those actual keys.
my $local_time = Time::Piece->new(time);
foreach my $qbsid (keys %stat) {
for my $key_time ( keys %{ $stat{$qbsid} } ) {
my $cmd = $stat{$qbsid}->{$key_time}; # Already a Time::Piece object
if ( $local_time->year eq $cmd->year ) {
print "Years match: " . $cmd->year . "\n";
}
if ( $local_time->month eq $cmd->month ) {
print "Months match: " . $cmd->month . "\n";
}
...
}
}
Unfortunately, this is very incomplete because I have no idea what you really want to do.
What does your log.txt look like?
What are the different fields in log.txt?
What are you trying to do with log.txt?
What do you want as a result of your program?
It is impossible for me to know how to guide you.
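In the meantime, here is a minimal sketch of comparing two timestamps in the format shown in the question with Time::Piece. It rests on an assumption: both stamps are in the same time zone, so the zone abbreviation (PDT) is simply stripped before parsing instead of being interpreted. The helper name parse_stamp is made up for this example.

```perl
use strict;
use warnings;
use Time::Piece;

# Parse a stamp such as "Mon Aug 12 17:32:39 PDT 2013".
# Assumption: all stamps share one zone, so the abbreviation is dropped.
sub parse_stamp {
    my ($stamp) = @_;
    (my $clean = $stamp) =~ s/\s+[A-Z]{2,5}(?=\s+\d{4}\s*$)//;
    $clean =~ s/^\s+|\s+$//g;
    return Time::Piece->strptime($clean, '%a %b %d %H:%M:%S %Y');
}

my $exit_time  = parse_stamp('Mon Aug 10 14:31:49 PDT 2013');
my $local_time = parse_stamp('Mon Aug 12 17:32:39 PDT 2013');

# Compare via epoch seconds: -1, 0 or 1, just like <=> on plain numbers.
my $cmp = $exit_time->epoch <=> $local_time->epoch;
print "compare: $cmp\n";
print "seconds apart: ", $local_time->epoch - $exit_time->epoch, "\n";
print "same year\n" if $exit_time->year == $local_time->year;
```

If the stamps can come from different zones, the abbreviation would have to be mapped to an offset explicitly, which this sketch does not attempt.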