Is there any limit on ALTER TABLE ADD PARTITION on Athena? - amazon-web-services

I am running a query similar to this:
ALTER TABLE test_table ADD IF NOT EXISTS
PARTITION (date = 'a', hour = '00')
PARTITION (date = 'b', hour = '01')
PARTITION (date = 'c', hour = '02')
PARTITION (date = 'd', hour = '03')
...
-- around 1,000 partitions in total
PARTITION (date = 'aa', hour = '05')
PARTITION (date = 'bb', hour = '06')
PARTITION (date = 'cc', hour = '07')
PARTITION (date = 'dd', hour = '08')
The query does not throw any error, but it also does not load the partitions into the Athena table. When I break the query into batches of around 500 partitions, it seems to work. Is there any limit on the number of partitions in a single ADD PARTITION command? I went with MSCK REPAIR TABLE instead, but I am curious why the original query didn't work; I couldn't find any such limit in the Athena documentation.
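Since batches of around 500 partitions did load correctly, one workaround (just a sketch reusing the example partitions above; the real statements would list the remaining partitions explicitly) is to split the statement into several smaller ALTER TABLE ... ADD IF NOT EXISTS batches and run them one after another:
-- First batch: up to a few hundred PARTITION clauses
ALTER TABLE test_table ADD IF NOT EXISTS
PARTITION (date = 'a', hour = '00')
PARTITION (date = 'b', hour = '01');
-- Second batch: the remaining partitions
ALTER TABLE test_table ADD IF NOT EXISTS
PARTITION (date = 'cc', hour = '07')
PARTITION (date = 'dd', hour = '08');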

Related

Retrieving the row with the greatest timestamp in questDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
1. select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table);
2. select * from table inner join (select max(timestamp) mm from table) on timestamp >= mm
3. select * from table where timestamp = max(timestamp)
4. select * from table where timestamp = (select max(timestamp) from table)
Query 1 is correct but runs in ~5s; query 2 is correct and runs in ~500ms but looks unnecessarily verbose for a query; query 3 compiles but returns an empty table; and query 4 is invalid syntax, although that's how SQL usually does it.
select * from table limit -1 works. QuestDB returns rows sorted by timestamp by default, and limit -1 takes the last row, which happens to be the row with the greatest timestamp. To be explicit about ordering by timestamp, select * from table order by timestamp limit -1 can be used instead. This query runs in around 300-400ms on the same table.
As a side note, the third query, using timestamp = max(timestamp), doesn't work because QuestDB does not yet support subqueries in WHERE (as of QuestDB 6.1.2).
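For reference, here are the two working queries from the answer above as standalone statements; my_table is a placeholder table name:
-- Relies on QuestDB storing rows in designated-timestamp order; limit -1 keeps only the last row.
select * from my_table limit -1;
-- Same result, but explicit about the ordering.
select * from my_table order by timestamp limit -1;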

Athena query performance between similar queries differs significantly

Noticed the other day that there are some significant differences in query performance when running two nearly identical queries.
QUERY 1:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '11' and day = '1')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 40.751 sec
Data scanned: 94.06 KB
QUERY 2:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '10' and day = '30')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 1.78 sec
Data scanned: 216.86 KB
The only major difference between the two is that one query looks at 11/1 & 10/31 and the other looks at 10/31 & 10/30. So there is an additional month partition being looked at in QUERY 1.
When running both queries with EXPLAIN, I noticed that QUERY 2 uses a TableScan while QUERY 1 uses a ScanFilter.
Anyone know why this might be the case between these two queries?
Additional Details:
Time in queue for both queries was sub 1 second.
In s3, the data is structured as follows:
badge=%s/year=%s/month=%s/day=%s/hour=%s
badge,year,month,day & hour are all partitions defined via Partition Projection.
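For context, a Partition Projection setup for the layout described above generally looks something like the sketch below; the table name, bucket, data columns, and value ranges here are assumptions for illustration, not the actual table definition:
-- Hypothetical table matching the badge/year/month/day/hour layout described above.
CREATE EXTERNAL TABLE example_table (
  `timestamp` string
)
PARTITIONED BY (badge string, year string, month string, day string, hour string)
STORED AS PARQUET
LOCATION 's3://example-bucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.badge.type' = 'enum',
  'projection.badge.values' = 'xyz,abc',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'storage.location.template' = 's3://example-bucket/badge=${badge}/year=${year}/month=${month}/day=${day}/hour=${hour}'
);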

I am looking to get the date diff across two or more rows, as the first row's service-end date minus the second row's service-start date

my data looks like this
userid                                completedat                serviceperiodfrom          serviceperiodto
00002cd9-94eb-4c06-a2c4-75253fd541b9  2020-11-25T14:20:04.293Z   2020-11-25T14:20:04.200Z   2021-02-25T14:20:04.200Z
00002cd9-94eb-4c06-a2c4-75253fd541b9  2021-03-21T10:27:34.842Z   2021-03-21T10:27:34.800Z   2022-03-21T10:27:34.800Z
00002cd9-94eb-4c06-a2c4-75253fd541b9  2020-07-24T11:22:12.410Z   2020-07-24T11:22:12.300Z   2020-10-24T11:22:12.300Z
I need the date diff between the serviceperiodto date of the first row and the serviceperiodfrom date of the second row, and so on for as many rows as exist for each userid.
I tried joining the table to itself using subqueries and tried to create a pivot table, but none of those approaches seem to work for me. Please help.
You can use lag/lead to access previous/next item:
WITH dataset
AS (SELECT *
FROM
(
VALUES
(1, from_iso8601_timestamp('2020-11-25T14:20:04.200Z'), from_iso8601_timestamp('2021-02-25T14:20:04.200Z')),
(1, from_iso8601_timestamp('2021-03-21T10:27:34.800Z'), from_iso8601_timestamp('2022-03-21T10:27:34.800Z')),
(1, from_iso8601_timestamp('2020-07-24T11:22:12.300Z'), from_iso8601_timestamp('2020-10-24T11:22:12.300Z'))
) AS t (userid, serviceperiodfrom, serviceperiodto)
)
SELECT date_diff(
'hour',
serviceperiodto,
lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom))
FROM dataset
Output:
_col0
770
572
(null)
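If the gap is needed in whole days rather than hours, the same lead window function works with the 'day' unit. A self-contained variant of the query above (same hypothetical inline dataset; userid and serviceperiodto are selected only to make the output easier to read):
WITH dataset
AS (SELECT *
FROM
(
VALUES
(1, from_iso8601_timestamp('2020-11-25T14:20:04.200Z'), from_iso8601_timestamp('2021-02-25T14:20:04.200Z')),
(1, from_iso8601_timestamp('2021-03-21T10:27:34.800Z'), from_iso8601_timestamp('2022-03-21T10:27:34.800Z')),
(1, from_iso8601_timestamp('2020-07-24T11:22:12.300Z'), from_iso8601_timestamp('2020-10-24T11:22:12.300Z'))
) AS t (userid, serviceperiodfrom, serviceperiodto)
)
SELECT userid,
       serviceperiodto,
       -- gap between this row's service end and the next row's service start, in days
       date_diff(
         'day',
         serviceperiodto,
         lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom)) AS gap_in_days
FROM dataset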

How to select data from an AWS Athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

The Athena table is partitioned the same way as the S3 folder paths:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the where condition for a select query to get data for parent = "9ab4fcca-65d8-11ea-bc55-0242ac130003" from 2019-06-01 to 2020-04-31 ?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes it unnecessarily difficult to query tables. If you're starting out, I really suggest avoiding this kind of partitioning scheme. If you can't avoid it, you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load partitions. This works, but it's far from the best way to use Athena. MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, almost any structure works: 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other layout with one distinct prefix per date will do more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
What you do then, instead of running MSCK REPAIR TABLE, is use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location; this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
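With that layout in place, the date-range filter from the question becomes a plain BETWEEN on the day partition column. A sketch against the table above (note that April only has 30 days, so the upper bound is 2020-04-30):
-- Simple range filter thanks to the single DATE-typed partition column
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
  AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'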
Here is how you can use the year, month and date values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST("date" AS VARCHAR(2))), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter conditions as needed.

Left join not honored when data in 2 different S3 directories

I have 2 tables: table1 and table2.
Table1 is pointing to s3://bucket/Dicrectory1/year/month/day/hour/file (25 records) and
Table2 is pointing to s3://bucket/Dicrectory2/year/month/day/hour/file (2 records)
My query looks like below
SELECT table1.column1,
table2.column1
FROM table1
LEFT JOIN table2
ON table1.column1 = table2.column1
WHERE table1.year = '2018'
AND table1.month = '10'
AND table1.day = '31'
AND table1.hour = '00'
and table2.year = '2018'
AND table2.month = '10'
AND table2.day = '31'
AND table2.hour = '00'
Even though I am doing a left join, I am only getting inner join results (the 2 records common to both tables).
Am I not doing the left join correctly for Athena?
If there is no corresponding record in table2, all the table2.* columns will be NULL, and those rows then get thrown away by the WHERE clause (answer stolen from Left Outer Join Not Working?).
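A common fix, sketched below against the query above, is to move the table2 partition filters from the WHERE clause into the ON clause, so they are applied as part of the join instead of discarding the NULL-extended rows afterwards:
-- Keep the table1 partition filters in WHERE, but move the table2 filters into ON
-- so that table1 rows without a match are preserved by the LEFT JOIN.
SELECT table1.column1,
       table2.column1
FROM table1
LEFT JOIN table2
  ON table1.column1 = table2.column1
  AND table2.year = '2018'
  AND table2.month = '10'
  AND table2.day = '31'
  AND table2.hour = '00'
WHERE table1.year = '2018'
  AND table1.month = '10'
  AND table1.day = '31'
  AND table1.hour = '00'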