How to get the timestamp of HDFS files at millisecond precision - hdfs

Is there a way to get the timestamps of files in HDFS at millisecond precision?
For example:
In Linux we can get the full timestamp like this:
$ ls --full-time
total 4
-rw-r--r--. 1 bigdatauser hadoop 0 2017-09-15 01:09:25.068425282 -0400 newfile1.txt
-rwxrwxrwx. 1 bigdatauser hadoop 106 2017-09-15 01:08:16.791844270 -0400 test.sh

If you use hdfs dfs -stat '%Y' you can see the time in milliseconds.
$ hdfs dfs -touchz /tmp/test_file
$ hdfs dfs -stat "%Y" /tmp/test_file
1506621031648
From http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat:
Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
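If you want that millisecond value rendered as a human-readable timestamp, here is a small sketch (assuming GNU date and bash; /tmp/test_file is the file created above):
$ ts=$(hdfs dfs -stat '%Y' /tmp/test_file)    # e.g. 1506621031648
$ TZ=UTC date -d "@${ts%???}.${ts: -3}" '+%Y-%m-%d %H:%M:%S.%3N'
2017-09-28 17:50:31.648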

Related

RRD Tool - confusing start time

I'm setting up an rrd database to store sensor data for 3 days in 12h intervals (43200s) = 6 rows in the RRA.
rrdtool create test.rrd --step 43200 --start 1562429286 DS:temp:GAUGE:86400:U:U RRA:AVERAGE:0:1:6
The database's start time is 1562429286 (06.07.2019 18:08:06).
When I dump the database:
rrdtool dump test.rrd
it says (output trimmed for clarity):
2019-07-04 02:00:00 CEST / 1562198400 NaN
2019-07-04 14:00:00 CEST / 1562241600 NaN
2019-07-05 02:00:00 CEST / 1562284800 NaN
2019-07-05 14:00:00 CEST / 1562328000 NaN
2019-07-06 02:00:00 CEST / 1562371200 NaN
2019-07-06 14:00:00 CEST / 1562414400 NaN
I expected rrdtool to give the next nearest timestamp (6.7.19 18:00) as the last entry ("starting point") instead. So why is it at 14:00?
At first this explanation (How to create a rrd file with a specific time?) made perfect sense to me for a small interval of 5m. But in my case I cannot follow the logic when the interval is bigger (12h).
This is because the RRA buckets are always normalised to be aligned to the GMT (UTC) timezone. It is not visible if you are using a CDP (consolidated data point) width of an hour or less; but in your case, your CDPs are 12 hours wide. Your timezone means that these are offset by 2 hours from UTC zero, resulting in apparent boundaries of 02 and 14 local time (if you were in London then you'd be seeing 0 and 12 as expected).
This effect is much more noticeable when you are using 1-day rollups and are located somewhere like New Zealand, where you'll see the CDP boundary appearing at noon rather than at midnight.
It is not currently possible to specify a different timezone to use as a base for the RRA buckets (this would make the data non-portable), though I believe it has been on the RRDTool feature request list for a number of years.
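You can reproduce the boundary rrdtool picked by rounding the start time down to a whole multiple of the step, counted from the UTC epoch, e.g. in a shell:
$ step=43200; start=1562429286
$ echo $(( start - start % step ))
1562414400
That is 2019-07-06 12:00 UTC, i.e. 14:00 CEST: exactly the last bucket boundary in the dump above.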

Redshift - Adding timezone offset (Varchar) to timestamp column

As part of an ETL to Redshift, one of the source tables has 2 columns:
original_timestamp - TIMESTAMP: the local time when the record was inserted, in whichever region
original_timezone_offset - VARCHAR: the offset from UTC
The data looks something like this:
original_timestamp            original_timezone_offset
2011-06-22 11:00:00.000000    -0700
2014-11-29 17:00:00.000000    -0800
2014-12-02 22:00:00.000000    +0900
2011-06-03 09:23:00.000000    -0700
2011-07-28 03:00:00.000000    -0700
2011-05-01 01:30:00.000000    -0700
In my target table, I need to convert this to UTC (using the offset). How do I do it?
So far I have tried multiple things, and dateadd() seems to be the closest solution. But the problem with dateadd() is that when I say:
SELECT original_timestamp, original_timezone_offset
,dateadd(H, original_timezone_offset, original_timestamp) as original_utc_time
it adds/subtracts 700/800 hours instead of 7/8 hours to the original timestamp, because the offset is a VARCHAR and the values look like -0700, etc.
Did anyone see this issue before? Appreciate any help/inputs. Thanks.
Just take the 'hours' part of the offset:
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
DATEADD(hour, SUBSTRING(original_timezone_offset, 1, 3)::INT, original_timestamp)
FROM t
2011-06-22 11:00:00 -0700 2011-06-22 04:00:00
2014-11-29 17:00:00 -0800 2014-11-29 09:00:00
2014-12-02 22:00:00 +0900 2014-12-03 07:00:00
You'll need some additional fancy code if you have non-full-hour offsets (eg +0730).
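A sketch of that fancier code (untested; it assumes every offset is exactly five characters like -0700 or +0730, and it mishandles rare -00xx offsets, where the hour part carries no sign): apply the hour and minute parts in two DATEADD steps, reusing the sign of the hour part for the minutes.
SELECT
original_timestamp,
original_timezone_offset,
DATEADD(minute,
        SIGN(SUBSTRING(original_timezone_offset, 1, 3)::INT)
          * SUBSTRING(original_timezone_offset, 4, 2)::INT,   -- minutes share the hour's sign
        DATEADD(hour, SUBSTRING(original_timezone_offset, 1, 3)::INT, original_timestamp))
FROM t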
First, recognize that if your timestamps are already in local time of the given offset, then you need to subtract that offset to convert back to UTC. In that first example you gave, 2011-06-22 11:00:00 -0700 is equivalent to 2011-06-22 18:00:00 UTC.
However, rather than try to add or subtract these values yourself, you should let the AT TIME ZONE function do the work for you. It will create a timestamptz that is in your supplied offset, then you can use it again to convert to UTC.
(Note that you could use the CONVERT_TIMEZONE function instead, but that one is only understood by Redshift, whereas AT TIME ZONE also works on regular PostgreSQL.)
However, the problem you have is that your time zone offsets aren't in a format understood by these functions. See the time zone usage notes. So, before we try to convert, let's translate your offset strings into an understood format.
We will want -0700 to become +07:00. The colon is required, and the sign must be flipped because it will be interpreted with the POSIX-style time zone format. In that format, positive values lie west of GMT instead of the usual conventions specified in ISO 8601.
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))
Then we will use that with AT TIME ZONE to do the conversion:
(original_timestamp AT TIME ZONE <the above mess>) AT TIME ZONE 'UTC' AS utc_timestamp
Putting it all together...
WITH t as (
SELECT '2011-06-22 11:00:00.000000'::timestamp as original_timestamp, '-0700' as original_timezone_offset
UNION ALL
SELECT '2014-11-29 17:00:00.000000'::timestamp,'-0800'
UNION ALL
SELECT '2014-12-02 22:00:00.000000'::timestamp,'+0900'
)
SELECT
original_timestamp,
original_timezone_offset,
concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2)) as modified_timezone_offset,
(original_timestamp AT TIME ZONE concat(translate(substring(original_timezone_offset, 1, 3), '-+', '+-'),':',substring(original_timezone_offset, 4, 2))) AT TIME ZONE 'UTC' AS utc_timestamptz
FROM t
Output:
2011-06-22 11:00:00 -0700 +07:00 2011-06-22 18:00:00
2014-11-29 17:00:00 -0800 +08:00 2014-11-30 01:00:00
2014-12-02 22:00:00 +0900 -09:00 2014-12-02 13:00:00

Why is my Python timestamp to datetime conversion wrong?

The portal epochconverter.com converts the timestamp 1531423084013 to the correct date of Thursday, July 12, 2018 3:18:04.013 PM GMT-04:00 DST. But in Python 2.7.12 I got the output below, which is wrong:
>>> import time
>>> timestamp=1531423084013
>>> time.ctime(timestamp).rsplit(' ', 1)[0]
'Wed Nov 12 00:06:53'
How do I make it correct?
1531423084013 is in milliseconds, not in seconds.
As you can see from epochconverter.com, the time is 3:18:04.013, so the seconds part is 4.013. That site handles time both in seconds and in milliseconds (it seems to assume milliseconds when the input has 13 digits instead of 10, for times around the present day).
But Python's time.ctime() handles only time in seconds, and this is why you get a wrong answer when you pass a time in milliseconds (on my system it throws an out-of-range error).
So you must divide your time in milliseconds by 1000:
>>> time.ctime(1531423084)
'Thu Jul 12 21:18:04 2018'
(My time zone is UTC+0200)
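If you want to keep the milliseconds, a small sketch using datetime (also works on Python 2.7; splitting off the milliseconds avoids float rounding):
>>> from datetime import datetime, timedelta
>>> ts_ms = 1531423084013
>>> datetime.utcfromtimestamp(ts_ms // 1000) + timedelta(milliseconds=ts_ms % 1000)
datetime.datetime(2018, 7, 12, 19, 18, 4, 13000)
That is 2018-07-12 19:18:04.013 UTC, matching epochconverter.com's GMT-04:00 result.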

How to parse elapsed time into seconds in Linux

I want to parse elapsed time into seconds. The time formats are given below:
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec
I'm getting the values from systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}' and storing the result in the variable A, like:
A=$(systemctl status cassandra | awk '/(Active: active)/{print $9, $10,$11}')
Now A has input like 3 day 18h or 3 day, etc. More examples:
A=3 day 18h or 3 day or 3h 15min or 3h or 15min 10sec or 15min or 10sec
Now, for the different values of A, parse them into seconds.
What you want to achieve could be done directly in awk using the following line:
$ systemctl status cassandra | awk '/(Active: active)/{s=$6" "$7;gsub(/-|:/," ",s); print systime() - mktime(s)}'
This will give you the running time directly based on the start-time and not on the approximated running time printed by systemctl.
If this approach does not work, then I suggest using the date command to do all the parsing. If you can change the h to hour in your examples, then you can do the following:
$ date -d "1970-01-01 + 3day 18hour 15min 16sec" +%s
324916
If you cannot, then I suggest the following. If the duration is stored in the variable $duration, then you do:
$ date -d "1970-01-01 + ${duration/h/hour}" +%s
Having spaces between the numbers and the strings day, h, min, or sec does not matter.
The idea is that you ask date to compute everything for you, as %s returns the Unix time since 1970-01-01 in seconds.
man date:
%s seconds since 1970-01-01 00:00:00 UTC
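Wrapped up as a small function (a sketch assuming GNU date; pinning the base date to UTC keeps the result independent of your timezone):
to_seconds() {
    local d=${1/h/hour}                     # date(1) does not understand a bare "h"
    date -d "1970-01-01 UTC + $d" +%s
}
to_seconds "3 day 18h"                      # prints 324000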
The given value of A is*:
A="3day 3day/3h 15min/3h/15min 10sec/15min/10sec"
To compute A in seconds you can use bash's parameter expansion:
A=${A//day/*86400}   # day -> *86400 (seconds per day)
A=${A//h/*3600}      # h   -> *3600 (seconds per hour)
A=${A//min/*60}      # min -> *60
A=${A//sec/*1}       # sec -> *1
A=${A//\//+}         # "/" separators -> +
A=${A// /+}          # spaces -> +
echo "A = $A"
echo $A | bc         # evaluate the resulting arithmetic expression
Output:
A = 3*86400+3*86400+3*3600+15*60+3*3600+15*60+10*1+15*60+10*1
542720
* Note: here I changed the original value of A as provided by the OP, from
3 day/3 day/3h...
to
3day 3day/3h... # the rest is the same as the OP's.
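As a minor variation, if bc is not available, bash's own arithmetic expansion can evaluate the generated expression directly:
echo $(( A ))   # 542720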
Using awk to s/h/hours/ and to launch date +"%s" -d "1970-01-01 GMT +" to parse the time strings and count the seconds:
$ awk '{
sub(/h/,"hours")                                            # date does not understand a bare "h"
$1=""                                                       # drop the leading "1)" numbering
"date +\"%s\" -d \"1970-01-01 GMT + " $0 "\"" | getline s   # let date do the math
print s
}' file
324000
259200
11700
10800
910
900
10
for the data:
$ cat file
1) 3 day 18h
2) 3 day
3) 3h 15min
4) 3h
5) 15min 10sec
6) 15min
7) 10sec

rrd graph configuration query

I am updating my RRD file with some counts...
For example:
time: value:
12:00 120
12:05 135
12:10 154
12:20 144
12:25 0
12:30 23
13:35 36
Here my RRD is updated with the logic below:
((current value)-(previous value))/((current time)-(previous time))
e.g. ((135-120))/5 = 15
But my problem is that when a 0 comes, the reading will be negative:
((0-144))/5
Here the "0" value only appears on a system failure (at the source the data is fetched from). The graph must not display this reading.
How can I configure it so that when a 0 comes it does not update the RRD graph (i.e. skip the reading ((0-144)/5)), and next time it takes a reading like ((23-0)/5), not ((23-144)/10)?
When specifying the data sources at RRD creation time, you can specify which range of values is acceptable.
DS:data_source:GAUGE:10:1:U will only accept values of 1 or above.
So if you get a 0 during an update, RRD will replace it with unknown, and I assume it can then find a way to discard it.
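For example (a sketch; the DS name, heartbeat, and RRA parameters are made up to match the 5-minute data above):
$ rrdtool create counts.rrd --step 300 \
    DS:data_source:GAUGE:600:1:U \
    RRA:AVERAGE:0.5:1:864
With the minimum set to 1, an update of 0 is stored as unknown (NaN), so the graph shows a gap for that interval instead of plotting the bogus reading.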