Compare date stored as a string - regex

I have two dates (each including both date and time) in the following format:
Mon Aug 12 17:32:39 PDT 2013
This is my local time, and the other time is stored in a hash. I need to compare my local time with the time stored in the hash, using all the possible comparisons of both date and time.
This is what I have so far:
#!/usr/bin/perl
use DateTime;
open(FH,'log.txt');
my %stat;
my ($qbsid, $exittime, $exittimeval);
while ($line = <FH>) {
    if ($line =~ /Exit time/) {
        ($exittime, $exittimeval) = split(': ', $line);
        $stat{$qbsid} = {
            time => $exittimeval
        };
    }
}
my $local_time = localtime time;
foreach my $qbsid (keys %stat) {
    my $cmd = $stat{$qbsid}->{time};
    my $cmp = DateTime->compare($cmd, $datetime);
    print "$cmp\n";
}
Please suggest a way to do this.
The date stored in the hash is in the same format as above:
Mon Aug 10 14:31:49 PDT 2013
Thank you for your time.

You should take a look at Date::Calc. That module provides utilities for comparing dates. Regex is the wrong approach here.
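For example, a minimal sketch of that approach could look like the following. The to_epoch helper is made up for this sketch, the parsing assumes the exact "Mon Aug 12 17:32:39 PDT 2013" layout, and the time zone field is simply ignored, so treat it as an illustration rather than a drop-in solution:
use strict;
use warnings;
use Date::Calc qw(Decode_Month Mktime);
# Turn "Mon Aug 12 17:32:39 PDT 2013" into epoch seconds (time zone ignored).
sub to_epoch {
    my ($str) = @_;
    my (undef, $mon, $day, $hms, undef, $year) = split ' ', $str;
    my ($h, $m, $s) = split /:/, $hms;
    return Mktime($year, Decode_Month($mon), $day, $h, $m, $s);
}
my $t1 = to_epoch('Mon Aug 12 17:32:39 PDT 2013');
my $t2 = to_epoch('Mon Aug 10 14:31:49 PDT 2013');
print $t1 <=> $t2, "\n";    # -1, 0 or 1, much like DateTime->compare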

I'm going to recommend Time::Piece. This doesn't seem to be too many people's favorite time module, but it comes with Perl which means you don't have to install it. I suspect it will become the standard time module and you might as well learn how to use it. Earlier versions of Time::Piece had issues with Julian dates, but the latest version solved that (and will be in the next Perl release).
Without knowing what your log looks like and what you are trying to do, it's a bit difficult to help you with your coding. For example, you have a hash of hashes keyed by $qbsid, which is never set. You need to use strict; and use warnings; in your program; that would have caught this error. It would also have caught the bug where you declare $local_time before your second loop but use $datetime inside that loop.
Also, split takes a regular expression pattern, not a plain string, as its first argument.
You also need to do a chomp $line right after your while statement. Otherwise, your $line will end with an invisible newline character that will probably cause all sorts of issues.
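In other words, something along these lines (just a placement sketch, not your full program):
open(FH, '<', 'log.txt') or die "Can't open log.txt: $!";
while (my $line = <FH>) {
    chomp $line;          # strip the trailing newline before any matching or splitting
    print "[$line]\n";    # the brackets make a leftover newline easy to spot
}
close FH;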
Now, what exactly are you trying to store? It looks like you want your inner hash to use the current time as the key and your $exittimeval as the value.
I would use the epoch time (which is the number of seconds since January 1, 1970) as the key, and a Time::Piece object as your data:
($exittime, $exittimeval) = split /:\s+/, $line;
my $exit_time_obj = Time::Piece->strptime( $exittimeval, "%a %b %d %H:%M:%S %Z %Y");
my $time_obj = Time::Piece->new(time);
$stat{$qbsid} = {$time_obj->epoch => $exit_time_obj};
Now, you can do your comparisons by changing the key into a Time::Piece object and doing what ever comparisons you want.
By the way, another issue: time keeps a changin'! If you're keying your sub-hash by the time, it will be different by the time you get to your second loop, so you can't simply recompute it; you need an inner loop to pull out the actual keys.
my $local_time = Time::Piece->new(time);
foreach my $qbsid (keys %stat) {
    for my $key_time ( keys %{ $stat{$qbsid} } ) {
        my $cmd = $stat{$qbsid}->{$key_time};    # Already a Time::Piece object
        if ( $local_time->year eq $cmd->year ) {
            print "Years match: " . $cmd->year . "\n";
        }
        if ( $local_time->month eq $cmd->month ) {
            print "Months match: " . $cmd->month . "\n";
        }
        ...
    }
}
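If you would rather compare the stored key against the exit time directly, one option (a sketch meant to sit inside the inner loop above, assuming the epoch-keyed layout) is to turn the key back into a Time::Piece object and use the module's overloaded comparison operators:
my $key_obj = Time::Piece->new($key_time);    # epoch seconds back into an object
if ( $key_obj > $cmd ) {
    print "Exit time is earlier than the stored key time\n";
}
elsif ( $key_obj < $cmd ) {
    print "Exit time is later than the stored key time\n";
}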
Unfortunately, this is very incomplete because I have no idea what you really want to do.
What does your log.txt look like?
What are the different fields in log.txt?
What are you trying to do with log.txt?
What do you want as a result of your program?
It is impossible for me to know how to guide you.

Related

cts:value-match on xs:dateTime() type in Marklogic

I have a variable $yearMonth := "2015-02"
I have to search for this date on an element Date of type xs:dateTime.
I want to use a regex to find all files/documents having this date "2015-02-??".
I have path-range-index enabled on ModifiedInfo/Date
I am using the following code but am getting an Invalid cast error:
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime("2015-02-??T??:??:??.????"))
I have also used the following code and get the same error:
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime(xs:date("2015-02-??"),xs:time("??:??:??.????")))
Kindly help :)
It seems you are trying to use a wildcard search on a path range index that has the data type xs:dateTime.
Currently, MarkLogic doesn't support this. There are multiple ways to handle this scenario:
You may create a field index.
You may change it to a string index, which supports wildcard search.
You may use this workaround with your existing setup:
for $x in cts:values(cts:path-reference("ModifiedInfo/Date"))
return if(starts-with(xs:string($x), '2015-02')) then $x else ()
This query will fetch values from the lexicon, and then you may filter for your desired date.
You can solve this by combining a couple of cts:element-range-query constructors inside an and-query:
let $target := "2015-02"
let $low := xs:date($target || "-01")
let $high := $low + xs:yearMonthDuration("P1M")
return
  cts:search(
    fn:doc(),
    cts:and-query((
      cts:element-range-query(xs:QName("Date"), ">=", $low),
      cts:element-range-query(xs:QName("Date"), "<", $high)
    ))
  )
From the cts:element-range-query documentation:
If you want to constrain on a range of values, you can combine multiple cts:element-range-query constructors together with cts:and-query or any of the other composable cts:query constructors, as in the last part of the example below.
You could also consider doing a cts:values call with a cts:query parameter that searches for values between, for instance, 2015-02-01 and 2015-03-01. Mind you, if multiple dates occur within one document, you will still need to post-filter manually (like in option 3 of Navin's answer), but it could potentially speed up post-filtering a lot.
HTH!

Perl String Regular Expression

I need some help with Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to keep from $jsonString only the itemID value that appears just above the itemName matching $itemName.
I would really appreciate your help. I am a real amateur at regular expressions.
Thank you!
Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my @match = grep { $_->{'-itemName'} eq $itemName } @{$itemRef};
# extract the item ID
if (@match) {
    my $itemID = $match[0]->{'-itemID'};
    print $itemID;
}
Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
    open( my $json_in, "<", 'test.json' ) or die "Can't open test.json: $!";
    local $/;
    $json_string = <$json_in>;
}
my $json = decode_json($json_string);
print Dumper \$json;
foreach my $item ( @{ $json->{'ITEM'} } ) {
    print $item->{'-itemID'}, "\n";
    print $item->{'-itemName'}, "\n";
}
But you have to fix your JSON first. (There's a trailing comma that shouldn't be there.)
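If you can't get the producer fixed, the JSON module's relaxed mode will tolerate that trailing comma. A small sketch (relaxed mode also accepts things like # comments, so only use it on input you trust):
use JSON;
my $codec = JSON->new->relaxed(1);    # accept trailing commas (and # comments)
my $json  = $codec->decode($json_string);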
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitely the wrong way to do things.
It might work fine for a good long time, but if your source program changes a little and its output changes, whilst still sticking to the JSON standard, your code will break unexpectedly, and you may not realise it. That can set off a domino effect of breakages, making a whole system or site just crash and burn. Worse yet, the source of the crash and burn will be hidden away in some script that hasn't been touched in years, so it will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Pig capture matching string with regex

I am trying to capture image URLs from inside tweets.
REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar';
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar';
--Load Json
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map []);
B = FOREACH loadJson GENERATE flatten(json#'tweets') as (m:map[]);
tweetText = FOREACH B GENERATE FLATTEN(m#'text') as (str:chararray);
The intermediate data looks like this:
(#somenameontwitter your nan makes me laugh with some of the things she comes out with like http://somepics.com/my.jpg)
Then I try the following to get only the image URL back:
x = foreach tweetText generate REGEX_EXTRACT_ALL(str, '((http)(.*)(.jpg|.bmp|.png))');
dump x;
but that doesn't seem to work. I have also been trying with filter to no avail.
Even when trying the above with just .*, it returns empty results () or (()).
I'm not good with regex and pretty new to Pig so it could be that I'm missing something simple here that I'm just not seeing.
update
example input data
{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"#Beace_ your nan makes me laugh with some of the things she comes out with blabla http://t.co/b7hjMWNg is an url, but not a valid one http://www.something.com/this.jpg should be a valid url","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
Try this and let me know if it works:
x = foreach tweetText generate REGEX_EXTRACT(str,'.*(http://.*.[jpg|bmp|png])',1);
DUMP x;
I managed to get it working (though I doubt it is totally optimal)
x = foreach tweetText generate REGEX_EXTRACT(str,'(http://.*(.jpg|.bmp|.png))',1) as image;
filtered = FILTER x BY $0 is not null;
dump filtered;
so the initial problem was just the regex (and my lack of knowledge on the subject).
Thanks for the assistance sivasakthi jayaraman!

Perl regex.. match words exactly 2 times...Input is a JSON file

I am a beginner at any sort of regex. I need your help/pointers in resolving an issue. I have a JSON file which looks like the one below.
JSON format
{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548041555,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041554,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046151,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041667,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548042626,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548035666,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548035635,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548042255,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041715,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046161,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548023422,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041617,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046676,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548045675,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046172,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548034534,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548012345,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548025232,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548023423,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548252352,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
I need to extract "port":"ab-0/0/44" and the "time" associated with that port. I am trying to calculate the time difference between any two such occurrences, i.e. 1st occurrence -> "time":1389309548046925 "port":"ab-0/0/44", 2nd occurrence -> "time":1389309548041555 "port":"ab-0/0/44". The calculated time difference must be stored in a variable. I tried a regular expression like /\"time\":\\d+\.*\"port\":\".b-0\/0\/44\"/. Any help is appreciated. Thanks in advance!
Use the JSON module. It's rather simple.
use strict;
use warnings;
use JSON;
while (<>) {
    /\S/ or next;
    my $data = decode_json($_);
    print "port -> $data->{port}\n";
    print "time -> $data->{time}\n";
}
With your data, I get output like this:
port -> ab-0/0/44
time -> 1389309548046925
port -> ab-0/0/45
time -> 1389309548046925
... etc
I'm not sure how you want to calculate your time, but I assume that doing arithmetic is something you can figure out best on your own.
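For what it's worth, here is a minimal sketch of one way to do that arithmetic, assuming you want the difference between the first two timestamps seen for a single port (the port name below is just the one from your example):
use strict;
use warnings;
use JSON;
my $wanted_port = 'ab-0/0/44';    # the port you are interested in
my @times;
while (<>) {
    /\S/ or next;
    my $data = decode_json($_);
    push @times, $data->{time} if $data->{port} eq $wanted_port;
}
if (@times >= 2) {
    my $diff = $times[0] - $times[1];    # difference between the first two occurrences
    print "time difference for $wanted_port: $diff\n";
}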

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is a certain pattern in a DNA sequence. For example, the pattern could be ATAGTA. So, given:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table listing, for each amino-acid, the number of times it is coded by the pattern. Here, in sequence1 the pattern codes for only one amino-acid, but in sequence2 it codes for two. I would like this tool to scale to thousands of sequences. I've been thinking about how to get this done, but the only approach I came up with is: replace all nucleotides that are not part of the pattern, translate what remains, and summarise the coded amino-acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input FASTA file is a *.ffn file derived from a GenBank file using the 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so I imagine the sequences are already in frame (+1) and there is no need to get amino-acids coded in a frame other than +1.
I would like to know which amino-acids the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The only strings I want coding amino-acids for are repeats of exactly three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq) {
    my $pos = 0;
    my %counts;
    for (split /($pattern)/ => $seq->seq) {
        my $len = length;    # remember the chunk's original length
        if ($_ eq $pattern) {
            my $dist = $pos % 3;
            unless ($dist == 0) {
                my $num = 3 - $dist;
                s/.{$num}//;    # shift the match into the +1 reading frame
                chop until length() % 3 == 0;
            }
            my $table = Bio::Tools::CodonTable->new();
            $counts{$_}++ for split(//, $table->translate($_));
        }
        $pos += $len;    # advance by the original length, not the trimmed one
    }
    print $seq->display_id() . ":\n";
    map {
        print "$_ => $counts{$_}\n"
    }
    sort {
        $counts{$a} <=> $counts{$b}
    }
    keys %counts;
    print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
I will stick to the first version of what you wanted, because the addendum only confused me even more. (Frame?)
I only found ATAGTA once in sequence2, but I assume you want the mirror image/reversed sequence as well, which would be ATGATA in this case. My script doesn't do that automatically, so you would have to list it separately in the input sequences file, but that should be no problem, I would think.
I work with a file like yours, which I call "dna.txt", and an input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurrences in the dna.txt file (including overlapping matches, but it can be set to non-overlapping as explained in the awk).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
The code is one awk script calling another (but one of them is simple). You have to run
"./match_patterns.awk input_seq.txt" to get the results file generated:
match_patterns.awk:
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{
    string = $0
    do {
        where = match(string, subval)
        # code is for overlapping matches (i.e. ATA matches twice in ATATAC)
        # for non-overlapping matches, replace +1 with +RLENGTH in the following line
        if (RSTART != 0) { count++; string = substr(string, RSTART+1) }
    } while (RSTART != 0)
}
END { print subval "," count >> "results.txt" }
Files have to be all in the same directory.
Good luck!