Athena query performance between similar queries differs significantly

I noticed the other day that two nearly identical queries differ significantly in performance.
QUERY 1:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '11' and day = '1')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 40.751 sec
Data scanned: 94.06 KB
QUERY 2:
SELECT * FROM "table"
WHERE (badge = 'xyz' or badge = 'abc')
and ((year = '2021' and month = '10' and day = '30')
or (year = '2021' and month = '10' and day = '31'))
ORDER BY timestamp
Runtime: 1.78 sec
Data scanned: 216.86 KB
The only major difference between the two is that one query looks at 11/1 & 10/31 and the other looks at 10/31 & 10/30. So there is an additional month partition being looked at in QUERY 1.
When running both queries with EXPLAIN, I noticed that QUERY 2 uses a TableScan while QUERY 1 uses a ScanFilter.
Anyone know why this might be the case between these two queries?
Additional Details:
Time in queue for both queries was sub 1 second.
In S3, the data is structured as follows:
badge=%s/year=%s/month=%s/day=%s/hour=%s
badge, year, month, day and hour are all partitions defined via Partition Projection.
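To make the layout concrete, here is a minimal Python sketch that lists the partition prefixes each query's filter matches. It only illustrates the S3 layout quoted above and says nothing about how Athena plans either query. The `prefixes` helper is hypothetical, and the hour values (0 to 23, unpadded) are an assumption, since the question does not show them.

```python
from itertools import product

# Layout quoted in the question: badge=%s/year=%s/month=%s/day=%s/hour=%s
LAYOUT = "badge={badge}/year={year}/month={month}/day={day}/hour={hour}"


def prefixes(badges, dates, hours=range(24)):
    """List every partition prefix matched by the badges and (year, month, day) tuples.

    Assumption: hour values are written as 0..23 without zero padding.
    """
    return [
        LAYOUT.format(badge=b, year=y, month=m, day=d, hour=h)
        for b, (y, m, d), h in product(badges, dates, hours)
    ]


badges = ["xyz", "abc"]
query1 = prefixes(badges, [("2021", "11", "1"), ("2021", "10", "31")])
query2 = prefixes(badges, [("2021", "10", "30"), ("2021", "10", "31")])

print(len(query1), len(query2))  # both filters match the same number of prefixes
print(query1[0])
```

Under these assumptions both filters cover the same number of prefixes, so the count of matching partitions alone would not account for the runtime gap.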

Related

PowerBi DAX measure to sum duration of timespans filtered by current slicer

I need a DAX measure that gives me the sum of durations for multiple categories restricted by a date slicer.
In this simplified example there are 2 categories with 3 subcategories each. A DateTime Slicer on the dashboard is set to the timespan of 2nd of January 2021 noon to 6th of January midnight. I need the summed up duration of all categories in this timespan.
Input data:
A table containing multiple rows for each category with a start date and an end date.
The complicated part is that there are pauses between the timestamps.
Desired output:
A table on the dashboard containing the category and a calculated measure for the summed up duration during the sliced timespan.
When the slicer changes, the measure should change as well.
My current solution is an M formula that creates a list of all days in each timespan and unpivots all the lists. In the dashboard, the row count then gives the number of days in the selected timespan. However, this solution requires a much larger input table and does not work if you want to be exact to the second, only to the day.
I tried to solve this with a measure but didn't make any progress worth showing here.
All datetime values are in the format dd.mm.yyyy hh:mm:ss (24h system).

I found a way to do it by using 2 measures.
I use one Date table, consisting of all available dates, as the input for the slicer, and a data table called "Data".
The first measure calculates, for each element, the time that falls within the selected timespan:
duration_in_timespan_single =
VAR MinTs = MIN ('Date'[Date])
VAR MaxTs = MAX ('Date'[Date])
VAR MinUtcMin = MIN ('Data'[Date_Start])
VAR MaxUtcMax = MAX ('Data'[Date_End])
RETURN
    IF(
        AND(MinUtcMin >= MinTs, MinUtcMin <= MaxTs),
        IF(
            MaxUtcMax <= MaxTs,
            CONVERT((MaxUtcMax-MinUtcMin),DOUBLE),
            CONVERT((MaxTs-MinUtcMin),DOUBLE)),
        IF(
            MinUtcMin < MinTs,
            IF(
                MaxUtcMax > MinTs,
                IF(
                    MaxUtcMax <= MaxTs,
                    CONVERT((MaxUtcMax-MinTs),DOUBLE),
                    CONVERT((MaxTs-MinTs),DOUBLE)
                ),
                0
            ),
            0
        )
    )
The second measure just sums up the first for each category:
duration_in_timespan = SUMX('Data',[duration_in_timespan_single])
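The nested IFs in the first measure amount to clipping each row's interval to the slicer window and treating a negative result as zero. Below is a minimal Python sketch of that idea, not a replacement for the DAX. The `overlap_days` helper and the sample rows are hypothetical, and the window (2 Jan 2021 12:00 to 6 Jan 2021 00:00) is one reading of the question's example.

```python
from datetime import datetime


def overlap_days(start, end, window_start, window_end):
    """Duration in days of [start, end] that falls inside the slicer window."""
    clipped_start = max(start, window_start)
    clipped_end = min(end, window_end)
    seconds = (clipped_end - clipped_start).total_seconds()
    # A negative value means the interval lies entirely outside the window
    return max(seconds, 0.0) / 86400


# Assumed slicer window, mirroring the example in the question
window = (datetime(2021, 1, 2, 12), datetime(2021, 1, 6, 0))

# Hypothetical rows for one category: (Date_Start, Date_End)
rows = [
    (datetime(2021, 1, 1, 0), datetime(2021, 1, 3, 0)),   # starts before the window
    (datetime(2021, 1, 4, 0), datetime(2021, 1, 5, 12)),  # fully inside
    (datetime(2021, 1, 5, 18), datetime(2021, 1, 8, 0)),  # ends after the window
    (datetime(2021, 1, 7, 0), datetime(2021, 1, 9, 0)),   # no overlap
]

total = sum(overlap_days(s, e, *window) for s, e in rows)
print(total)  # 0.5 + 1.5 + 0.25 + 0 = 2.25 days
```

The per-row function plays the role of duration_in_timespan_single, and the sum plays the role of the SUMX measure.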

Create New Column containing value based on conditions on date and hour

I have a table in my power BI with the following fields :
Preview of the data:
The column "platform" has 3 possible values : application, shop, website
"day" is of type Date
"hour" is of type "Date/Time" (same information as "day" + has the hour)
I added a measure to calculate the conversion_rate (orders/visits):
conversion_rate = DIVIDE(SUM(Table[orders]), SUM(Table[visits]))
Then I calculated for every day the conversion_rate from 7 days ago (to be able to compare them):
conversion_rate_7_j = CALCULATE(Table[conversion_rate],
DATEADD(Table[day],-7,DAY)
)
Now my data looks like this:
What I want to do is calculate the conversion rate from 7 days ago but for the same hour.
However, I couldn't find a function that subtracts from a field of type Date/Time while taking the hour into consideration.
A solution I thought of is to calculate orders and visits -7 days same hour separately and then divide them to have the conversion rate -7 days same hour:
orders_7_j_hourly =
VAR h = Table[hour] - 7
VAR p = Table[platform]
Return CALCULATE(
MAX(Table[orders]),
Table,
Table[hour] = h,
Table[platform] = p
)
Since my data is grouped by hour (Date/Time) and platform, and since for a certain hour I sometimes have values for platform = "application" but not "shop", my function did not work. Because I am using MAX, it associated the number of orders with the wrong platform.
Can you please help?
Sample data : https://ufile.io/y1blqgqn

Datetime values are stored in units of days. Thus you can simply shift hour by 7 in your measure.
conversion_rate prev_week =
VAR CurrHour = SELECTEDVALUE ( Table1[hour] )
RETURN
CALCULATE (
[conversion_rate],
ALL ( Table1[day] ),
Table1[hour] = CurrHour - 7
)
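As a rough illustration of why subtracting 7 keeps the time of day, here is a small Python sketch of a day-based serial datetime. The day-zero value of 1899-12-30 and the helper names `to_serial` and `from_serial` are assumptions made for this example and are not part of the measure above.

```python
from datetime import datetime, timedelta

# Assumption: day zero of the serial date system is 1899-12-30
EPOCH = datetime(1899, 12, 30)


def to_serial(dt):
    """Datetime -> number of days (with a fractional part) since EPOCH."""
    return (dt - EPOCH).total_seconds() / 86400


def from_serial(days):
    """Serial day number -> datetime, rounded to the nearest second."""
    return EPOCH + timedelta(seconds=round(days * 86400))


current_hour = datetime(2021, 11, 8, 15, 0)
# Subtracting 7 changes only the whole-day part; the fraction (the hour) is untouched
previous_week = from_serial(to_serial(current_hour) - 7)
print(previous_week)  # 2021-11-01 15:00:00
```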

Did you try using the HOUR function?
conversion_rate_7_hour = CALCULATE( [conversion_rate],
FILTER( ALL(Table),
SELECTEDVALUE(Table[day]) - 7 = Table[day]
&& HOUR(SELECTEDVALUE(Table[hour]) - 7) = HOUR( Table[hour])
))
When we put Table[hour] into the visualization, it should work.
P.S. Best practice: if you refer to measures in your calculations, do not include the table prefix.

You can create an additional column called hour in your dataset.
Once you have that and bring the hours into the viz, the following measure can give you what you want:
convRate-7 = CALCULATE([convRate],DATEADD('Table'[day],-7,DAY))

How to query the time in unix epoch timestamp in aws athena

I have a simple table that contains node, message, starttime and endtime details, where starttime and endtime are Unix timestamps. The query I am running is:
select node, message, (select from_unixtime(starttime)), (select from_unixtime(endtime)) from table1 WHERE try(select from_unixtime(starttime)) > to_iso8601(current_timestamp - interval '24' hour) limit 100
The query is not working and throws a syntax error.
I am trying to fetch the following information from the table:
query the table using start time and end time for past 'n' hours or 'n' days and get the output of starttime and endtime in human readable format
query the table using a specific date and time in human readable format

You don't need "extra" selects and you don't need to_iso8601 in the where clause:
WITH dataset AS (
SELECT * FROM (VALUES
(1627409073, 1627409074),
(1627225824, 1627225826)
) AS t (starttime, endtime))
SELECT from_unixtime(starttime), from_unixtime(endtime)
FROM
dataset
WHERE from_unixtime(starttime) > (current_timestamp - interval '24' hour) limit 100
Output:
_col0                    _col1
2021-07-27 18:04:33.000  2021-07-27 18:04:34.000
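The same conversion and filter can be checked outside Athena. Here is a minimal Python sketch, assuming the timestamps are epoch seconds in UTC. The helpers `readable` and `started_within` and the fixed "current time" are hypothetical, chosen so that only the first sample row counts as recent.

```python
from datetime import datetime, timedelta, timezone


def readable(epoch_seconds):
    """Epoch seconds -> timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def started_within(rows, now, hours=24):
    """Keep (starttime, endtime) rows whose start falls in the last `hours` hours."""
    cutoff = now - timedelta(hours=hours)
    return [(readable(s), readable(e)) for s, e in rows if readable(s) > cutoff]


# Sample values from the answer's dataset
rows = [(1627409073, 1627409074), (1627225824, 1627225826)]

# Hypothetical "current time" used in place of current_timestamp
now = datetime(2021, 7, 28, 0, 0, tzinfo=timezone.utc)

recent = started_within(rows, now)
for start, end in recent:
    print(start.strftime("%Y-%m-%d %H:%M:%S"), end.strftime("%Y-%m-%d %H:%M:%S"))
```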

To search the last week, you can use:
WHERE your_date >= to_unixtime(CAST(now() - interval '7' day AS timestamp))
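Comparing against the raw epoch column means the cutoff has to be an epoch value too. The Python sketch below shows how such bounds could be computed, both for "n days ago" and for a specific human-readable date and time. The helper names, the fixed "current time" and the assumption that inputs are UTC are all illustrative.

```python
from datetime import datetime, timedelta, timezone


def cutoff_epoch(now, days):
    """Epoch seconds for `now` minus `days` days."""
    return int((now - timedelta(days=days)).timestamp())


def parse_epoch(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) into epoch seconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


now = datetime(2021, 7, 28, tzinfo=timezone.utc)  # hypothetical current time
print(cutoff_epoch(now, 7))                 # 1626825600
print(parse_epoch("2021-07-27 18:04:33"))   # 1627409073
```

Either value can then be compared directly with the stored starttime or endtime.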

Is there any limit on ALTER TABLE ADD PARTITION on Athena?

I am running a query similar to this:
ALTER TABLE test_table ADD IF NOT EXISTS
PARTITION (date = 'a', hour = '00')
PARTITION (date = 'b', hour = '01')
PARTITION (date = 'c', hour = '02')
PARTITION (date = 'd', hour = '03')
... // around 1000 partitions
PARTITION (date = 'aa', hour = '05')
PARTITION (date = 'bb', hour = '06')
PARTITION (date = 'cc', hour = '07')
PARTITION (date = 'dd', hour = '08')
The query does not throw any error, but it does not load the partitions into the Athena table. When I break the query into batches of 500 partitions, it seems to work. Is there any limit on the number of partitions in the ADD PARTITION command? I went with MSCK REPAIR TABLE instead. I'm just curious why the query didn't run; I couldn't find any limit in the Athena documentation.
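The batching workaround described above can be scripted. Below is a minimal Python sketch that splits a partition list into several ALTER TABLE statements. The batch size of 500 is simply the value that worked in the question, not a documented limit, and the helper name and sample partition values are hypothetical.

```python
def add_partition_statements(table, partitions, batch_size=500):
    """Yield one ALTER TABLE ... ADD statement per batch of (date, hour) partitions.

    Values are interpolated as-is, so they are assumed to be trusted strings.
    """
    for i in range(0, len(partitions), batch_size):
        clauses = "\n".join(
            f"PARTITION (date = '{date}', hour = '{hour}')"
            for date, hour in partitions[i:i + batch_size]
        )
        yield f"ALTER TABLE {table} ADD IF NOT EXISTS\n{clauses}"


# Hypothetical partition values, used only to exercise the batching
partitions = [(f"d{n:04d}", f"{n % 24:02d}") for n in range(1200)]
statements = list(add_partition_statements("test_table", partitions))

print(len(statements))  # 3 statements: 500 + 500 + 200 partitions
```

Each generated statement would then be submitted as its own query.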

Need to compare current month with average of last 3 months in power bi

I have data with a Dept name and the corresponding Amount for each Dept for each month, like below:
Table1 :
Dept name   Amount   Period
XXX         20       Jan,2018
XXX         30       Feb,2018
XXX         50       Mar,2018
XXX         70       April,2018
....
YYYY        20       Jan,2018
YYYY        30       Feb,2018
YYYY        50       Mar,2018
YYYY        70       April,2018
....
I need to calculate the average of the last 3 months (e.g. for Dept XXX, if I select April, it needs to calculate the average Amount of Jan, Feb and Mar: (20+30+50)/3 = 33.33) and compare it with the current (April) month (70).
I've created a calculated column for the last-3-month average as below (I have also created a Calender table in Power BI):
AVG3mth =
CALCULATE(SUM('Table1'[Amount]),DATESINPERIOD(Calender[Date],LASTDATE('Table1'[Period]),-3,MONTH))/3
(But it is just dividing the current month by 3, not the last 3 months.)
When comparing, if the average of the last 3 months is greater than the current month, I should flag it as "YES", since the Amount has dropped compared to the last 3 months. I have added another column, "Dropped?", for this:
Dropped? = IF(VALUES('Table1'[Amount])<[AVG3mth], "Yes", "No")
Also, if I choose a particular month (Period) in the slicer, I need to get only that month, its Amount, the last 3 months' average and Dropped YES/NO in my report.
I've attached a screenshot of my current report (you will get a clear idea if you look at it).
Report Screenshot

To do this, you will need 1 Calculated Column and 3 Measures.
First, I created a new calculated column called MonthDiff:
MonthDiff = DATEDIFF(MIN(Table1[Period]),Table1[Period],MONTH)
Afterwards, I created the measure for the average of the last 3 months:
Average Last 3 Months =
Var selectedmonth = SELECTEDVALUE(Table1[MonthDiff])
Var startingMonth = (selectedmonth - 4)
Var selecteddepartment = SELECTEDVALUE(Table1[Dept name])
Return CALCULATE(AVERAGE(Table1[Amount]), FILTER(ALL(Table1), Table1[MonthDiff] > startingMonth && Table1[MonthDiff] < selectedmonth),FILTER(ALL(Table1),Table1[Dept name] = selecteddepartment))
Then you can create the measure for the currently selected value:
SelectedAmount = SELECTEDVALUE(Table1[Amount])
Then you can create the Drop measure:
Drop = var currentvalue = SELECTEDVALUE(Table1[Amount])
Var selectedmonth = SELECTEDVALUE(Table1[MonthDiff])
Var startingMonth = (selectedmonth - 4)
Var selectedDepartment = SELECTEDVALUE(Table1[Dept name])
Var averagevalue = CALCULATE(AVERAGE(Table1[Amount]), FILTER(ALL(Table1), Table1[MonthDiff] > startingMonth && Table1[MonthDiff] < selectedmonth), FILTER(All(Table1),Table1[Dept name] = selectedDepartment))
Return if(averagevalue > currentvalue, "Yes", "No")
This is my final output,
Do let me know, if this helps or not.
My Best Practice
When you are dealing with measures that involve multiple filters,
it's best to declare them using Var and test it by returning the
output on the card visual as you develop the measure.
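To sanity-check the numbers the measures should return, here is a minimal Python sketch of the same calculation on the question's sample data for Dept XXX. The helper names are hypothetical. Like AVERAGE over a filtered table, it averages whichever of the three preceding months are present.

```python
from statistics import mean


def month_index(year, month):
    """Put months on one running scale so consecutive months differ by 1."""
    return year * 12 + month


def last3_average(amounts, year, month):
    """Average Amount over the three months before (year, month); None if none exist."""
    current = month_index(year, month)
    window = [amounts[current - k] for k in (1, 2, 3) if current - k in amounts]
    return mean(window) if window else None


# Dept XXX from the question, keyed by month index
xxx = {month_index(2018, m): a for m, a in [(1, 20), (2, 30), (3, 50), (4, 70)]}

avg = last3_average(xxx, 2018, 4)
current_amount = xxx[month_index(2018, 4)]
dropped = "Yes" if avg > current_amount else "No"

print(round(avg, 2), dropped)  # 33.33 No
```

For April the average of Jan, Feb and Mar is 33.33, which is below the current 70, so the row is not flagged as dropped.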
