How to query the time in unix epoch timestamp in AWS Athena

I have a simple table that contains node, message, starttime, and endtime columns, where starttime and endtime are unix timestamps. The query I am running is:
select node, message, (select from_unixtime(starttime)), (select from_unixtime(endtime)) from table1 WHERE try(select from_unixtime(starttime)) > to_iso8601(current_timestamp - interval '24' hour) limit 100
The query is not working and throws a syntax error.
I am trying to fetch the following information from the table:
Query the table by starttime and endtime for the past 'n' hours or 'n' days, and get starttime and endtime in human-readable format.
Query the table using a specific date and time in human-readable format.

You don't need the "extra" selects, and you don't need to_iso8601 in the WHERE clause:
WITH dataset AS (
    SELECT * FROM (VALUES
        (1627409073, 1627409074),
        (1627225824, 1627225826)
    ) AS t (starttime, endtime)
)
SELECT from_unixtime(starttime), from_unixtime(endtime)
FROM dataset
WHERE from_unixtime(starttime) > (current_timestamp - interval '24' hour)
LIMIT 100
Output:

_col0                   _col1
2021-07-27 18:04:33.000 2021-07-27 18:04:34.000

To search the last week you can use:
WHERE your_date >= to_unixtime(CAST(now() - interval '7' day AS timestamp))
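
The second goal (querying with a specific date and time in human-readable form) can be handled by going the other way: convert the literal once with to_unixtime and compare against the raw column. A minimal sketch against table1 from the question (the boundary dates are illustrative):

SELECT node, message,
       from_unixtime(starttime) AS start_ts,
       from_unixtime(endtime) AS end_ts
FROM table1
WHERE starttime >= to_unixtime(timestamp '2021-07-27 00:00:00')
  AND starttime < to_unixtime(timestamp '2021-07-28 00:00:00')
LIMIT 100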

Related

Timestamp of yesterday at a specific hour (BigQuery)

I need to schedule a query on BigQuery that will retrieve some data from 24 hours ago. The problem is that if it's scheduled for 5am, for example, the query needs time to execute (let's say 2.248 seconds), so yesterday's data from 5:00:00.000 to 5:00:02.248 will not be retrieved. I'm using a timestamp field, and I do something like this:
SELECT *
FROM my_table
WHERE my_timestamp >= TIMESTAMP_SUB(current_timestamp(), INTERVAL 24 hour)
I would like to know if there is a way to get yesterday's date at a specific hour, so that even if there is a little gap due to the execution time, it will still retrieve data from that specific hour.
Edit: I found something:
SELECT *
FROM my_table
WHERE my_timestamp >= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 1 DAY) )
AND my_timestamp < TIMESTAMP(current_date)
But this retrieves yesterday from 00:00:00.000 to 23:59:59.999. It is okay, but is there a way to choose the hour?
Consider the less verbose approach below:
select *
from my_table
where my_timestamp >= timestamp(current_date - 1 + interval 5 hour)
and my_timestamp < timestamp(current_date + interval 5 hour)
Okay, I found it on my own lol; maybe it will help someone else looking for this.
SELECT *
FROM my_table
WHERE my_timestamp >= TIMESTAMP_ADD(TIMESTAMP(DATE_SUB(current_date, INTERVAL 1 DAY)), INTERVAL 5 HOUR)
AND my_timestamp < TIMESTAMP_ADD(TIMESTAMP(current_date), INTERVAL 5 HOUR)
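
If the cutoff hour needs to vary, it can live in a scripting variable rather than a literal. A small sketch of the same query using BigQuery scripting (cutoff_hour is a hypothetical name; 5 is the hour from the question):

DECLARE cutoff_hour INT64 DEFAULT 5;

SELECT *
FROM my_table
WHERE my_timestamp >= TIMESTAMP_ADD(TIMESTAMP(DATE_SUB(current_date, INTERVAL 1 DAY)), INTERVAL cutoff_hour HOUR)
  AND my_timestamp < TIMESTAMP_ADD(TIMESTAMP(current_date), INTERVAL cutoff_hour HOUR);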

Athena query performance between similar queries differs significantly

I noticed the other day that there are some significant differences in query performance when running two nearly identical queries.
QUERY 1:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '11' and day = '1')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 40.751 sec
Data scanned: 94.06 KB
QUERY 2:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '10' and day = '30')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 1.78 sec
Data scanned: 216.86 KB
The only major difference between the two is that one query looks at 11/1 & 10/31 and the other looks at 10/31 & 10/30. So there is an additional month partition being looked at in QUERY 1.
When running both queries with EXPLAIN, I noticed that QUERY 2 uses a TableScan while QUERY 1 uses a ScanFilter.
Anyone know why this might be the case between these two queries?
Additional Details:
Time in queue for both queries was under 1 second.
In s3, the data is structured as follows:
badge=%s/year=%s/month=%s/day=%s/hour=%s
badge, year, month, day & hour are all partitions defined via Partition Projection.
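For reference, partition projection for this layout is configured through table properties along these lines (a sketch only; the bucket name and ranges shown here are illustrative, not the actual table's):

TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.badge.type' = 'enum',
    'projection.badge.values' = 'xyz,abc',
    'projection.year.type' = 'integer',
    'projection.year.range' = '2020,2022',
    'projection.month.type' = 'integer',
    'projection.month.range' = '1,12',
    'projection.day.type' = 'integer',
    'projection.day.range' = '1,31',
    'projection.hour.type' = 'integer',
    'projection.hour.range' = '0,23',
    'storage.location.template' = 's3://my-bucket/badge=${badge}/year=${year}/month=${month}/day=${day}/hour=${hour}'
)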

How to find missing dates in BigQuery table using sql

How do you get a list of missing dates from a BigQuery table? For example, a table (test_table) is populated every day by some job, but on a few days the job fails and data isn't written into the table.
Use Case:
We have a table (test_table) which is populated every day by some job (a scheduled query or cloud function). Sometimes those jobs fail and data isn't available for those particular dates in the table.
How can we find those dates rather than scrolling through thousands of rows?
The below query will return a list of dates and ad_ids where data wasn't uploaded (NULL).
Note: I have used MAX(DATE) as I knew dates were missing in between my boundary dates. To be on the safe side, you can also specify the starting_date and ending_date explicitly, in case data hasn't been populated at all in the last few days.
WITH Date_Range AS (
    -- anchor for date range
    SELECT MIN(DATE) AS starting_date,
           MAX(DATE) AS ending_date
    FROM `project_name.dataset_name.test_table`
),
day_series AS (
    -- anchor to get all the dates within the range
    SELECT *
    FROM Date_Range,
         UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
    -- other options depending on your date type (mine was timestamp):
    -- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
    day_series.days,
    original_table.ad_id
FROM day_series
-- do a left join on the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table
    ON (original_table.date) = day_series.days
-- I only want the records where data is not available, or in other words empty/missing
WHERE original_table.ad_id IS NULL
GROUP BY 1, 2
ORDER BY 1
The final output will be the list of missing days (the ad_id column will be NULL for each).
As an alternate solution, you can try the following query to get the desired output:
with t as (
    select 1 as id, cast('2020-12-25' as timestamp) as Days union all
    select 1 as id, cast('2020-12-26' as timestamp) union all
    select 1 as id, cast('2020-12-27' as timestamp) union all
    select 1 as id, cast('2020-12-31' as timestamp) union all
    select 1 as id, cast('2021-01-01' as timestamp) union all
    select 1 as id, cast('2021-01-04' as timestamp)
)
SELECT *
FROM (
    select TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days,
           TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days
    from (
        select t.Days,
               (case when lag(Days) over (partition by id order by Days) = Days
                     then null
                     when lag(Days) over (partition by id order by Days) is null
                     then null
                     else lead(Days) over (partition by id order by Days)
                end) as next_days
        from t
    )
    where next_days is not null
      and Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)
),
UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
For the sample data, the generated days column will contain the missing dates: 2020-12-28, 2020-12-29, 2020-12-30, 2021-01-02 and 2021-01-03.
I used the code above but had to restructure it for BigQuery:
-- anchor for date range - this will select dates from the source table (i.e. the table your query runs off of)
WITH day_series AS(
SELECT *
FROM (
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM --enter source table here--
---OPTIONAL: filter for a specific date range
WHERE DATE BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
),UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) as days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
output_table.date
FROM day_series
-- do a left join on the output table (i.e. the table you are searching the missing dates for)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
ON (output_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE output_table.date IS NULL
GROUP BY 1,2
ORDER BY 1
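
For comparison, the same gap check can be written more compactly with GENERATE_DATE_ARRAY and an anti-join. A minimal sketch against hard-coded sample dates (swap in your source table and boundaries for real data):

WITH t AS (
    SELECT DATE '2020-12-25' AS day UNION ALL
    SELECT DATE '2020-12-26' UNION ALL
    SELECT DATE '2020-12-27' UNION ALL
    SELECT DATE '2020-12-31' UNION ALL
    SELECT DATE '2021-01-01' UNION ALL
    SELECT DATE '2021-01-04'
)
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-12-25', DATE '2021-01-04')) AS day
WHERE day NOT IN (SELECT day FROM t)
ORDER BY day

For this sample data it returns the same five missing dates as the query above.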

How to calculate the gap between 2 timestamps (edited for AWS Athena)

I have many IoT devices that send data to my Amazon Athena server. I created a table to store the data; it contains 2 columns: LocalTime indicates the time the IoT device captured its status, and ServerTime indicates the time the data arrived at the server (sometimes the IoT device doesn't have a network connection).
I would like to count the "gaps" in blocks of hours (let's say 1 hour) in order to know the deviation of the data arriving. The result that I would like to get is a count of entries per hour bucket.
To calculate the result, I want to compute how many hours passed between ServerTime and LocalTime; for example, the first entry (1.1.2019 12:15 - 1.1.2019 10:25) falls in the 1-2 hours bucket.
Thanks
If MSSQL Server is your database, you can try the below script to get your desired output:
SELECT
    CAST(DATEDIFF(HH, localTime, serverTime) - 1 AS VARCHAR) + '-' +
    CAST(DATEDIFF(HH, localTime, serverTime) AS VARCHAR) AS [Hours],
    COUNT(*) AS [Count]
FROM your_table
GROUP BY CAST(DATEDIFF(HH, localTime, serverTime) - 1 AS VARCHAR) + '-' +
         CAST(DATEDIFF(HH, localTime, serverTime) AS VARCHAR)
Oracle
If you are using an Oracle database, you can use this statement:
select CONCAT(CONCAT(diff_hours, '-'), diff_hours + 1) as Hours, count(diff_hours) as Count
from (
    -- whole hours elapsed between capture (LocalTime) and arrival (ServerTime)
    select floor(24 * (to_date(ServerTime, 'YYYY-MM-DD hh24:mi') - to_date(LocalTime, 'YYYY-MM-DD hh24:mi'))) as diff_hours
    from T_TIMETABLE
)
group by diff_hours
order by diff_hours;
Note: This will not display the empty intervals.
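
Since the question was edited for AWS Athena: in Athena's Presto/Trino dialect the same bucketing can be done with date_diff. A hedged sketch, assuming LocalTime and ServerTime are stored as timestamp columns and your_table is a placeholder name:

SELECT CAST(diff_hours AS varchar) || '-' || CAST(diff_hours + 1 AS varchar) AS hours,
       COUNT(*) AS cnt
FROM (
    -- truncated whole hours between capture and arrival
    SELECT date_diff('hour', LocalTime, ServerTime) AS diff_hours
    FROM your_table
) d
GROUP BY diff_hours
ORDER BY diff_hours

date_diff truncates, so the 1 hour 50 minute example from the question lands in the 1-2 bucket. As with the Oracle version, empty intervals are not displayed.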

SHOW PARTITIONS with order by in Amazon Athena

I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
This gives the list of partitions per table. The partition field for this table is dt, which is a date column. I want to see the partitions ordered.
The documentation doesn't explain how to do it:
https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html
I tried to add order by:
SHOW PARTITIONS tablename order by dt;
But it gives:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
AWS currently (as of Nov 2020) supports two versions of the Athena engine. How one selects and orders partitions depends on which version is used.
Version 1:
Use the information_schema table. Assuming you have year, month as partitions (with one partition key, this is of course simpler):
WITH
a as (
SELECT partition_number as pn, partition_key as key, partition_value as val
FROM information_schema.__internal_partitions__
WHERE table_schema = 'my_database'
AND table_name = 'my_table'
)
SELECT
year, month
FROM (
SELECT val as year, pn FROM a WHERE key = 'year'
) y
JOIN (
SELECT val as month, pn FROM a WHERE key = 'month'
) m ON m.pn = y.pn
ORDER BY year, month
which outputs:

year  month
2018  10
2018  11
2018  12
2019  01
...
Version 2:
Use the built-in $partitions functionality, where the partitions are explicitly available as columns and the syntax is much simpler:
SELECT year, month FROM my_database."my_table$partitions" ORDER BY year, month
year  month
2018  10
2018  11
2018  12
2019  01
...
For more information, see:
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
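Because "my_table$partitions" behaves like a regular relation, it can also be filtered, e.g. to check whether a specific partition exists. A small sketch (partition values assumed to be strings, matching the output above):

SELECT 1 FROM my_database."my_table$partitions" WHERE year = '2019' AND month = '01' LIMIT 1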
From your comment it sounds like you're looking to sort the partitions as a way to figure out whether or not a specific partition exists. For this purpose I suggest you use the Glue API instead of querying Athena. Run aws glue get-partition help or check your preferred SDK's documentation for how it works.
There is also a variant to list all partitions of a table, run aws glue get-partitions help to read more about that. I don't think it returns the partitions in alphabetical order, but it has operators for filtering.
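For illustration, hedged sketches of those CLI calls against the table from the question (the database name is a placeholder):

aws glue get-partition --database-name my_database --table-name tablename --partition-values "2018-01-12"

aws glue get-partitions --database-name my_database --table-name tablename --expression "dt = '2018-01-12'"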
The SHOW PARTITIONS command will not allow you to order the result, since this command does not produce a resultset to sort. This command only produces a string output.
You can on the other hand query the partition column and then order the result by value.
select distinct dt from tablename order by dt asc;