Apache Pig Date time - invalid format error - casting

I have a problem with the Pig datetime datatype. I am trying to use it, but the format is not working properly and I don't understand what the error is. The code I used is below:
records = LOAD '/tmp/project/sample.csv' USING PigStorage(',') AS (CUSTOMER_ID:int,READING_DATETIME:chararray,CALENDER_KEY:int,EVENT_KEY:int,GENERAL_SUPPLY_KWH:float,CONTROLLED_LOAD_KWH:float,GROSS_GENERATION_KWH:float,NET_GENERATION_KWH:float,OTHER_KWH:float);
test = FOREACH records GENERATE CUSTOMER_ID,READING_DATETIME;
dates= FOREACH test GENERATE CUSTOMER_ID,ToDate(READING_DATETIME,'dd-MM-yyyy HH:mm') AS READING_DATETIME;
Sample data from sample.csv is below (first two columns only pasted here):
CUSTOMER_ID READING_DATETIME
10017574 31-05-2013 18:30
10017574 10-06-2013 05:30
10017574 29-06-2013 04:30
10017574 04-07-2013 20:30
10017574 05-07-2013 17:00
10017574 12-07-2013 10:30
10017574 13-07-2013 20:00
10017574 16-07-2013 13:00
10017574 19-07-2013 20:00
The above commands execute properly. Also, when I use DESCRIBE on 'dates', it returns:
grunt> DESCRIBE dates
dates: {CUSTOMER_ID: int,READING_DATETIME: datetime}
Now when I use
toPrint = LIMIT dates 5;
DUMP toPrint;
2016-09-15 05:43:39,000 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2016-09-15 05:43:39,013 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias toPrint. Backend error : Invalid format: "READING_DATETIME"
I have verified the format string several times. I even checked Oracle's SimpleDateFormat, whose pattern syntax is used by the Joda class that Pig uses for datetimes.
I have tried several combinations for the same data and tried browsing online, but the issue is still not resolved. It seems like a very silly thing, but I couldn't solve it.

From the sample data you have attached, the data does not look like it is comma separated. In your load statement you are using ',' as the delimiter.
In order to fix this you have 2 options:
Convert the input file to a comma-separated input file, or
Use the correct delimiter to load the data.
I have used tab as the delimiter and it works fine. See below.
Data
10017574 31-05-2013 18:30
10017574 10-06-2013 05:30
10017574 29-06-2013 04:30
10017574 04-07-2013 20:30
10017574 05-07-2013 17:00
10017574 12-07-2013 10:30
10017574 13-07-2013 20:00
10017574 16-07-2013 13:00
10017574 19-07-2013 20:00
Script
records = LOAD 'test12.txt' USING PigStorage('\t') AS (CUSTOMER_ID:int,READING_DATETIME:chararray);
test = FOREACH records GENERATE CUSTOMER_ID,READING_DATETIME;
dates= FOREACH test GENERATE CUSTOMER_ID,ToDate(READING_DATETIME,'dd-MM-yyyy HH:mm') AS READING_DATETIME;
DUMP dates;
Output
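The DUMP output is not reproduced here, but if you want to sanity-check the parsed values, a quick follow-up is to format the datetime back into a chararray (a sketch; the display pattern passed to ToString is just an assumption about how you want the dates shown):
checked = FOREACH dates GENERATE CUSTOMER_ID, ToString(READING_DATETIME, 'yyyy-MM-dd HH:mm') AS READING_DATETIME_STR;
DUMP checked;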

Related

Getting records in the last 1 hour using PROC SQL (Teradata)

I am using SAS to connect to Teradata. Given the below dataset (it's a transaction table that updates records regularly), I need to be able to select records from the past hour (at least 3). So for example, if I am running the query at 6pm, I should get txn_id 5678, 1985, 2985 (refer to below dataset). Can you please help? This needs to be done in proc sql (connecting to teradata) or even just a SQL query running in Teradata SQL Assistant.
Dataset:
TXN_ID Date Time
1234 20200608 4:00 PM
5678 20200608 5:00 PM
1985 20200608 5:30 PM
2985 20200608 5:45 PM
2365 20200608 2:30 PM
Expected Output:
TXN_ID Date Time
5678 20200608 5:00 PM
1985 20200608 5:30 PM
2985 20200608 5:45 PM
Try the outobs option:
proc sql outobs=3;
select * from sashelp.class order by Age, Name;
quit;
This option is used to limit the number of rows in the output.
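Applied to the transaction table in the question, it might look like the sketch below; txn_table is a placeholder name, and Date and Time are assumed to be stored as SAS date and time values so that ordering by them descending puts the most recent rows first:
proc sql outobs=3;
  select TXN_ID, Date, Time
  from txn_table
  order by Date desc, Time desc;
quit;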

regex to find the data

EDIT - I added the last 50 texts I saw that were sent from various people; unfortunately, it's not an automatic email...
The list of all the texts is HERE.
I'm struggling to find a matching pattern that will identify the needed items (date, start time, time zone) from this text:
1 April 20 16:00-16:30 Israel Time
Tomorrow, Wed Feb 12, 08:00-9:00 AM IST(IL)
Tomorrow, Wed Jan 22, 09:30-10:00 PM PST
11-May-20 19:00-20:30 Israel Time
The start time is an easy one: (\d+:\d+)- but I'm not sure what to do with the other words and digits.
Based on the data you provided, something like this would do it, with 3 captures as requested:
(\d+[-\s]\w+[-\s]\d+|\w+ \d+),?\s(\d+\:\d+)\-\d+\:\d+\s(?:AM\s|PM\s)?(.*)
Online reference

Redshift - Adding timezone offset (Varchar) to timestamp column

As part of an ETL to Redshift, one of the source tables has 2 columns:
original_timestamp - TIMESTAMP: the local time when the record was inserted, in whichever region
original_timezone_offset - VARCHAR: the offset from UTC
The data looks something like this:
original_timestamp original_timezone_offset
2011-06-22 11:00:00.000000 -0700
2014-11-29 17:00:00.000000 -0800
2014-12-02 22:00:00.000000 +0900
2011-06-03 09:23:00.000000 -0700
2011-07-28 03:00:00.000000 -0700
2011-05-01 01:30:00.000000 -0700
In my target table, I need to convert this to UTC (using the offset). How do I do it?
So far I have tried multiple things, but dateadd() seems to be the closest solution. The problem with dateadd() is that when I say:
SELECT original_timestamp, original_timezone_offset
,dateadd(H, original_timezone_offset, original_timestamp) as original_utc_time
it is adding/subtracting '700'/'800' hours instead of 7/8 hrs to the original timestamp because the offset is a VARCHAR and the values are like: -0700 etc.
Did anyone see this issue before? Appreciate any help/inputs. Thanks.
Just take the 'hours' part of the offset:
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
DATEADD(hour, SUBSTRING(original_timezone_offset, 1, 3)::INT, original_timestamp)
FROM t
2011-06-22 11:00:00 -0700 2011-06-22 04:00:00
2014-11-29 17:00:00 -0800 2014-11-29 09:00:00
2014-12-02 22:00:00 +0900 2014-12-03 07:00:00
You'll need some additional fancy code if you have non-full-hour offsets (eg +0730).
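For instance, one way to cover those (a sketch along the lines of the query above, not an authoritative solution; the '+0730' test row is just for illustration) is to turn the whole offset into minutes and use DATEADD with minute instead of hour:
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0730'
)
SELECT
original_timestamp,
original_timezone_offset,
-- hours part * 60, plus the minutes part carrying the same sign as the hours
DATEADD(minute,
        SUBSTRING(original_timezone_offset, 1, 3)::INT * 60
        + CASE WHEN LEFT(original_timezone_offset, 1) = '-' THEN -1 ELSE 1 END
          * SUBSTRING(original_timezone_offset, 4, 2)::INT,
        original_timestamp) AS shifted_timestamp
FROM t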
First, recognize that if your timestamps are already in local time of the given offset, then you need to subtract that offset to convert back to UTC. In that first example you gave, 2011-06-22 11:00:00 -0700 is equivalent to 2011-06-22 18:00:00 UTC.
However, rather than try to add or subtract these values yourself, you should let the AT TIME ZONE function do the work for you. It will create a timestamptz that is in your supplied offset, then you can use it again to convert to UTC.
(Note that you could use the CONVERT_TIMEZONE function instead, but that one is only understood by Redshift, where AT TIME ZONE works on regular PostgreSQL also.)
However, the problem is that the time zone offsets you have aren't in a format understood by these functions. See the time zone usage notes. So, before we try to convert, let's translate your offset strings into an understood format.
We will want -0700 to become +07:00. The colon is required, and the sign must be flipped because it will be interpreted with the POSIX-style time zone format. In that format, positive values lie west of GMT instead of the usual conventions specified in ISO 8601.
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))
Then we will use that with AT TIME ZONE to do the conversion:
(original_timestamp AT TIME ZONE <the above mess>) AT TIME ZONE 'UTC' AS utc_timestamp
Putting it all together...
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2)) as modified_timezone_offset,
(original_timestamp AT TIME ZONE concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))) AT TIME ZONE 'UTC' AS utc_timestamptz
FROM t
Output:
2011-06-22 11:00:00 -0700 +07:00 2011-06-22 18:00:00
2014-11-29 17:00:00 -0800 +08:00 2014-11-30 01:00:00
2014-12-02 22:00:00 +0900 -09:00 2014-12-02 13:00:00
SQL Fiddle here.

How to filter logs easily with awk?

Suppose I have a log file mylog like this:
[01/Oct/2015:16:12:56 +0200] error number 1
[01/Oct/2015:17:12:56 +0200] error number 2
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
[01/Nov/2015:01:02:00 +0200] error number 9
[01/Jan/2016:01:02:00 +0200] error number 10
And I want to find those lines that occur between 1 Oct at 18.00 and 1 Nov at 1.00. That is, the expected output would be:
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
I have managed to convert the times to timestamp by using match() and then mktime(). First one finds the specified pattern, that is stored in the array a[] so it can be accessed (interesting to see glenn jackman's answer to access captured group from line pattern for a good example). Since mktime requires a format YYYY MM DD HH MM SS[ DST], I also have to convert the month in the form Xxx into a digit, for which I use an answer by Ed Morton to "convert month from Aaa to xx": awk '{printf "%02d\n",(match("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'.
All together, finally I have the timestamp in the variable mytimestamp:
awk 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
day=a[1]; month=a[2]; year=a[3];
hour=a[4]; min=a[5]; sec=a[6]; utc=a[7];
month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3);
mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc);
mytimestamp=mktime(mydate)
print mytimestamp
}' mylog
Returns:
1443708776
1443712376
1443715676
etc.
So now I am ready to compare against the given dates. Since handling such a format in awk takes a lot of work, I prefer to provide them through an external shell variable, using date -d"my date" +"%s" to print the timestamp:
start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")"
All together, this works:
awk -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" -v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {day=a[1]; month=a[2]; year=a[3]; hour=a[4]; min=a[5]; sec=a[6]; utc=a[7]; month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3); mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc); mytimestamp=mktime(mydate); if (start<=mytimestamp && mytimestamp<=end) print}' mylog
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
However, this seems like quite a bit of work for something that should be more straightforward. Nonetheless, the introduction of the "Time functions" section in man gawk says:
Since one of the primary uses of AWK programs is processing log files
that contain time stamp information, gawk provides the following
functions for obtaining time stamps and formatting them.
So I wonder: is there any better way to do this? For example, what if the format, instead of dd/Mmm/YYYY:HH:MM:ss, were something like dd Mmm YYYY HH:MM:ss? Wouldn't it be possible to provide the match pattern externally instead of having to change it every time this happens? Do I really have to use match() and then process that output to feed mktime()? Doesn't gawk provide a simpler way to do this?
Use ISO 8601 time format!
However, this seems to be quite a bit of work for something that should be more straight forward.
Yes, this should be straightforward, and the reason it is not is that the logs do not use ISO 8601. Application logs should use the ISO format and UTC to display times; other settings should be considered broken and fixed.
Your request should be split into two parts. The first part canonicalises the logs, converting dates to the ISO format; the second performs the search:
awk '
match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
day=a[1]
month=a[2];
year=a[3]
hour=a[4]
min=a[5]
sec=a[6]
utc=a[7];
month=sprintf("%02d", (match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3);
myisodate=sprintf("%4d-%02d-%02dT%02d:%02d:%02d%s", year,month,day,hour,min,sec,utc);
$1 = myisodate
print
}' mylog
The nice thing about ISO 8601 dates – besides them being a standard – is that chronological order coincides with lexicographic order, so you can use the /…/,/…/ operator to extract the dates you are interested in. For instance, to find what happened between 1 Oct 2015 18:00 +0200 and 1 Nov 2015 01:00 +0200, append the following filter to the previous, standardising filter:
awk '/2015-10-01T18:00:00\+0200/,/2015-11-01T01:00:00\+0200/'
Without getting into the time format (assuming all records are formatted the same), you can use a sort | awk combination to achieve the same with ease.
This assumes the logs are not ordered. Based on your format, it uses sort's month option (M) to sort the month field and awk to pick the range of interest. The sorting is based on year, month, and day, in that order.
$ sort -k1.9,1.12 -k1.5,1.7M -k1.2,1.3 log | awk '/01\/Oct\/2015/,/01\/Nov\/2015/'
You can easily extend to include time as well and drop the sort if the file is already sorted.
The following has the time constraint as well
awk -F: '/01\/Oct\/2015/ && $2>=18{p=1}
/01\/Nov\/2015/ && $2>=1 {p=0} p'
I would use the date command inside awk to achieve this, though I have no idea how this would perform with large log files.
awk -F "[][]" -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" \
-v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" '{
gsub(/\//,"-",$2);sub(/:/," ",$2);
cmd="date -d\""$2"\" +%s" ;
cmd|getline mytimestamp;
close(cmd);
if (start<=mytimestamp && mytimestamp<=end) print
}' mylog

Need to filter log to search for the lines from the last 5 minutes

2011-04-13 00:09:07,731 INFO [STDOUT] 04/13 00:09:07 Information...
Hi everyone. I would post some of my code, but I don't even think it's worthy of posting. What I'm trying to do: I've got a log file with lines like the one above. I need to take the last line's timestamp and keep all the lines from the last 5 minutes (rather than the last 200 lines or whatever, which would be easier). Could anyone help? I've searched the web and found some decent tips, but still nothing works and I'm frustrated as hell. Thanks!
Here's a simple Perl script that iterates over the file and prints every line whose timestamp is within 5 minutes of the time at the start of execution. For more efficiency, and assuming that the lines are in timestamp order, you could modify this to set a boolean flag when it encounters the first printable line and skip the testing from that point forwards.
#!/usr/bin/perl
use POSIX qw(mktime);
$now = time();
while(<>)
{
($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
$t = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
print "$_" if ($t >= $now-300);
}
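The flag-based variant mentioned above could look roughly like this (a sketch that assumes the log lines really are in timestamp order):
#!/usr/bin/perl
use POSIX qw(mktime);
$now = time();
$found = 0;
while(<>)
{
    if ($found) { print; next; }   # already past the cut-off, skip the parsing
    ($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
    next unless defined $yy;
    $t = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
    if ($t >= $now-300) { $found = 1; print; }
}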
I take it from your latest comment that you are interested in finding the last timestamp in your log, and then the entries from the 5 minutes before it.
I think Jim Garrison's solution could be patched to replace this:
$now = time();
with this:
open F, "<server.log" or die $!;
seek F,-1000,2; # set pos to last 1000 bytes
my @f = <F>;
$_ = $f[$#f];
($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
$now = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
$now should now contain the last timestamp in the log.
I approximated "-1000" to be long enough to go past at least one line in the log. You could set it much higher if you expect to have long lines in the log, but from what I saw, the last log entry "should" be fairly short.
If you have a huge log file and want to increase performance in the following search, you can use an estimation and perform a seek to find the last, say, 1000000 bytes in the file with:
seek F, -1000000, 2;
Good luck!
Iterate over all the lines, using a regexp to grab the time (e.g. 00:09:07), and check it against the current time (localtime, etc.).
If the file contains entries from different dates, then also grab the dates using a regexp and again compare them against the output of localtime.
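A minimal sketch of that approach (it assumes all the lines in the file are from today; otherwise the date has to be compared as well):
#!/usr/bin/perl
# Grab HH:MM:SS with a regexp and compare it against the current local time of day.
my ($sec, $min, $hour) = localtime();
my $now = $hour*3600 + $min*60 + $sec;
while (<>) {
    next unless /(\d{2}):(\d{2}):(\d{2})/;
    my $t = $1*3600 + $2*60 + $3;
    print if $t >= $now - 300;   # keep lines from the last 5 minutes
}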
How would I modify your script to make it work with the logs below?
Dec 18 09:41:18 sd
Dec 18 09:46:29 sds
Dec 18 09:48:39 sds
Dec 18 09:48:54 sds
Dec 18 09:54:47 sds
Dec 18 09:55:33 sds
Dec 18 09:55:38 sds
Dec 18 09:57:58 sds
Dec 18 09:58:10 sds
Dec 18 10:00:50 sdsd
Dec 18 10:03:43 sds
Dec 18 10:03:50 sdsd
Dec 18 10:04:06 sdsd
Dec 18 10:04:15 sdsd
Dec 18 10:14:50 wdad
Dec 18 10:19:16 sdadsa
Dec 18 10:19:23 dsds
Dec 18 10:21:03 sadsd
Dec 18 10:22:54 adas
Dec 18 10:27:32 qadad