I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
This gives the list of partitions for the table. The partition field for this table is dt, which is a date column. I want to see the partitions ordered.
The documentation doesn't explain how to do it:
https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html
I tried to add order by:
SHOW PARTITIONS tablename order by dt;
But it gives:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
AWS currently (as of Nov 2020) supports two versions of the Athena engine. How you select and order partitions depends on which version is used.
Version 1:
Use the information_schema.__internal_partitions__ table. Assuming you have year and month as partition keys (with a single partition key this is of course simpler):
WITH
a as (
SELECT partition_number as pn, partition_key as key, partition_value as val
FROM information_schema.__internal_partitions__
WHERE table_schema = 'my_database'
AND table_name = 'my_table'
)
SELECT
year, month
FROM (
SELECT val as year, pn FROM a WHERE key = 'year'
) y
JOIN (
SELECT val as month, pn FROM a WHERE key = 'month'
) m ON m.pn = y.pn
ORDER BY year, month
which outputs:
year  month
2018  10
2018  11
2018  12
2019  01
...
Version 2:
Use the built-in $partitions functionality, where the partitions are explicitly available as columns and the syntax is much simpler:
SELECT year, month FROM my_database."my_table$partitions" ORDER BY year, month
year  month
2018  10
2018  11
2018  12
2019  01
...
For more information, see:
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
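For the single dt partition key from the original question, a minimal sketch of the same approach (assuming the table lives in a database called my_database) would be:
SELECT dt FROM my_database."tablename$partitions" ORDER BY dt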
From your comment it sounds like you're looking to sort the partitions as a way to figure out whether or not a specific partition exists. For this purpose I suggest you use the Glue API instead of querying Athena. Run aws glue get-partition help or check your preferred SDK's documentation for how it works.
There is also a variant to list all partitions of a table, run aws glue get-partitions help to read more about that. I don't think it returns the partitions in alphabetical order, but it has operators for filtering.
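As a rough sketch (the database and table names here are assumptions), checking for one specific partition and listing partitions with a filter expression could look like:
# check whether the dt=2018-01-12 partition exists (the command errors if it doesn't)
aws glue get-partition --database-name my_database --table-name tablename --partition-values 2018-01-12
# list partitions matching a filter expression
aws glue get-partitions --database-name my_database --table-name tablename --expression "dt >= '2018-05-01'"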
The SHOW PARTITIONS command will not allow you to order the result, since it does not produce a result set to sort; it only produces string output.
You can, on the other hand, query the partition column and order the result by its value:
select distinct dt from tablename order by dt asc;
I have a sample table:
id   start_dt    end_dt
100  06/07/2021  30/09/2021
I would like to get the following output:
id   start_dt    end_dt
100  06/07/2021  31/07/2021
100  01/08/2021  30/08/2021
100  01/09/2021  30/09/2021
I have tried using GENERATE_SERIES() in Amazon Redshift, but that does not give the required result.
The existing table is quite large so I could use temp tables then join back to another table at a later stage.
I have trawled through other posts, but the proposed solutions don't quite give the desired results or don't work at all on Amazon Redshift. Any help in solving this would be appreciated.
The traditional method would be:
Create a Calendar table that contains one row per month, with start_date and end_date columns
Join your table to the Calendar table, where table.start_dt <= calendar.end_dt AND table.end_dt >= calendar.start_dt
The two output date columns would then be:
GREATEST(table.start_dt, calendar.start_dt)
LEAST(table.end_dt, calendar.end_dt)
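A minimal sketch of that approach, assuming a calendar table named calendar with start_date and end_date columns (one row per month) and a source table named my_table:
SELECT t.id,
       GREATEST(t.start_dt, c.start_date) AS start_dt,
       LEAST(t.end_dt, c.end_date) AS end_dt
FROM my_table t
JOIN calendar c
  ON t.start_dt <= c.end_date
 AND t.end_dt >= c.start_date
ORDER BY t.id, c.start_date;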
How can I get a list of missing dates from a BigQuery table? For example, a table (test_table) is populated every day by some job, but on a few days the job fails and data isn't written into the table.
Use Case:
We have a table (test_table) which is populated every day by a job (a scheduled query or Cloud Function). Sometimes the job fails and data isn't available for those particular dates in the table.
How can I find those dates rather than scrolling through thousands of rows?
The query below will return a list of dates and ad_ids where data wasn't uploaded (NULL).
Note: I have used MIN(DATE) and MAX(DATE) as the boundary dates because I knew dates were missing in between them. To be safe, you can also specify the starting_date and ending_date explicitly in case data hasn't been populated in the last few days at all.
WITH Date_Range AS (
  -- anchor for date range
  SELECT MIN(DATE) AS starting_date,
         MAX(DATE) AS ending_date
  FROM `project_name.dataset_name.test_table`
),
day_series AS (
  -- anchor to get all the dates within the range
  SELECT *
  FROM Date_Range,
       UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
  -- other options depending on your date type (mine was TIMESTAMP):
  -- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
  day_series.days,
  original_table.ad_id
FROM day_series
-- do a left join on the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table
  ON original_table.date = day_series.days
-- I only want the records where data is not available, in other words empty/missing
WHERE original_table.ad_id IS NULL
GROUP BY 1, 2
ORDER BY 1
The final output will be the list of missing dates, each with a NULL ad_id.
Alternatively, you can try the following query to get the desired output:
with t as (
  select 1 as id, cast('2020-12-25' as timestamp) Days union all
  select 1 as id, cast('2020-12-26' as timestamp) Days union all
  select 1 as id, cast('2020-12-27' as timestamp) Days union all
  select 1 as id, cast('2020-12-31' as timestamp) Days union all
  select 1 as id, cast('2021-01-01' as timestamp) Days union all
  select 1 as id, cast('2021-01-04' as timestamp) Days
)
SELECT *
FROM (
  select TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days,
         TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days
  from (
    select t.Days,
           (case when lag(Days) over (partition by id order by Days) = Days then NULL
                 when lag(Days) over (partition by id order by Days) is null then NULL
                 else lead(Days) over (partition by id order by Days)
            end) as next_days
    from t
  )
  where next_days is not null
    and Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)
),
UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
The output will contain one generated days row for each missing date (2020-12-28 through 2020-12-30 and 2021-01-02 through 2021-01-03).
I used the code above but had to restructure it for BigQuery:
-- anchor for date range - this will select dates from the source table (i.e. the table your query runs off of)
WITH day_series AS(
SELECT *
FROM (
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM --enter source table here--
---OPTIONAL: filter for a specific date range
WHERE DATE BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
),UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) as days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
output_table.date
FROM day_series
-- do a left join on the output table (i.e. the table you are searching the missing dates for)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
ON (output_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE output_table.date IS NULL
GROUP BY 1,2
ORDER BY 1
My Athena table is partitioned in the same way as the S3 folder path:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the where condition for a select query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-31 ?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out I really suggest to avoid this kind of partitioning scheme. If you can't avoid it you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want: 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure with one distinct prefix per date will work more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
What you do then, instead of running MSCK REPAIR TABLE, is use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location; this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
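With that layout, the date range from the question becomes a simple predicate (a sketch against the assumed table above; note the range ends on 2020-04-30, since April only has 30 days):
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
  AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30';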
Here is how you can use the year, month, and day values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)),'-',
CAST(month AS VARCHAR(2)),'-',
CAST(day AS VARCHAR(2))
), '%Y-%m-%d') as DATE)
BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter statements as needed.
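A sketch of that, using the same assumed table and columns as above: adding a coarse predicate on the raw year partition column lets Athena prune partitions before the parsed date is evaluated.
SELECT col1, col2
FROM my_table
WHERE year BETWEEN 2019 AND 2020
  AND CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST(day AS VARCHAR(2))), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'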
I have many IoT devices that send data to my Amazon Athena server. I created a table to store the data, and the table contains two columns: LocalTime indicates the time the IoT device captured its status, and ServerTime indicates the time the data arrived at the server (sometimes the IoT device doesn't have a network connection).
I would like to count the "gaps" in blocks of hours (let's say 1 hour) in order to know the deviation of the arriving data. The result I would like to get is a count of entries per hour bucket.
In order to calculate the result I want to calculate how many hours passed between ServerTime and LocalTime,
so the first entry (1.1.2019 12:15 - 1.1.2019 10:25) falls in the 1-2 hour bucket.
Thanks
If MS SQL Server is your database, you can try the script below to get your desired output:
SELECT
CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR) [Hours],
COUNT(*) [Count]
FROM your_table
GROUP BY CAST(DATEDIFF(HH,localTime,serverTime)-1 AS VARCHAR) +'-'+
CAST(DATEDIFF(HH,localTime,serverTime) AS VARCHAR)
Oracle
If you are using an Oracle database, you can use this statement:
select CONCAT(CONCAT(diff_hours, '-'), diff_hours + 1) as Hours, count(diff_hours) as Count
from (select floor(24 * (to_date(ServerTime, 'YYYY-MM-DD hh24:mi') - to_date(LocalTime, 'YYYY-MM-DD hh24:mi'))) diff_hours from T_TIMETABLE)
group by diff_hours
order by diff_hours;
Note: This will not display the empty intervals.
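Since the question mentions Athena, a rough Presto/Athena sketch of the same idea (assuming a table named your_table with LocalTime and ServerTime timestamp columns) could be:
SELECT CAST(hrs AS VARCHAR) || '-' || CAST(hrs + 1 AS VARCHAR) AS hours,
       COUNT(*) AS cnt
FROM (
  -- date_diff returns the number of whole hours between the two timestamps
  SELECT date_diff('hour', LocalTime, ServerTime) AS hrs
  FROM your_table
) AS d
GROUP BY hrs
ORDER BY hrs;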
I have a table like below.
item_id  bill_start_date          bill_end_date          usage_amount  user_project
635212   2019-02-01 00:00:00.000  3/1/2019 00:00:00.000  13.345        IBM
I am trying to find the usage_amount by each month and each project. The Amazon Athena query engine is based on Presto 0.172. Due to Athena's limitations, it doesn't recognize queries like select sysdate from dual;.
I tried to convert bill_start_date and bill_end_date from timestamp to date but failed. Even current_date() didn't work in my case. I am able to calculate the total cost by hard-coding the values, but my end goal is to perform the calculation on the columns.
SELECT (FLOOR(SUM(usage_amount)*100)/100) AS total,
user_project
FROM test_table
WHERE bill_start_date
BETWEEN date '2019-02-01'
AND date '2019-03-01'
GROUP BY user_project;
In Presto, current_timestamp is a SQL standard function which does not use parentheses.
To group by month, I'd use date_trunc('month', bill_start_date).
All of these functions are documented in the Presto documentation.
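Putting that together with the query from the question, a sketch that groups usage by month and project (assuming the same test_table columns) might look like:
SELECT date_trunc('month', bill_start_date) AS billing_month,
       user_project,
       FLOOR(SUM(usage_amount) * 100) / 100 AS total
FROM test_table
GROUP BY date_trunc('month', bill_start_date), user_project
ORDER BY billing_month, user_project;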