Redshift - SQL - Cumulative average for grouped results

I currently have a table like so:
Date    customer_id    sales
1/1     1              1
1/1     1              1
1/1     1              1
1/1     2              1
1/2     2              3
1/2     2              1
1/2     1              2
1/2     1              1
1/3     1              2
1/3     2              2
1/3     2              3
1/3     2              3
This eventually gets aggregated by the customer_id to get total_sales like so:
customer_id    total_sales
1              8
2              13
I then calculate one metric based on this table, average_sales, which is defined as:
sum(total_sales) / count(distinct customer_id)
This would result in average_sales of 10.5 based on the information above.
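In SQL that overall figure is simply this (a minimal sketch, assuming the raw rows sit in a table called example with columns date_id, cust_id and sales, the names used in the answer below; the cast to numeric avoids integer division):
SELECT
    SUM(sales)::numeric / COUNT(DISTINCT cust_id) AS average_sales
FROM
    example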
However, I need to find a way to calculate this average but for each day on a cumulative basis like so:
Date 1/1 would be sum(total sales) for 1/1 / count(distinct customer_ids) for 1/1
Date 1/2 would be sum(total sales) for 1/1-1/2 / count(distinct customer_ids) for 1/1-1/2
Date 1/3 would be sum(total sales) for 1/1-1/3 / count(distinct customer_ids) for 1/1-1/3
The final day (1/3) should be equal to the overall average metric of 10.5.
Final table should look like this:
Date    average_sales
1/1     2 (4/2)
1/2     5.5 (11/2)
1/3     10.5 (21/2)
I've tried multiple things thus far with grouping/window functions but can't seem to get the right numbers. Any help would be greatly appreciated!

The main problem is that you can't use COUNT(DISTINCT) with a window.
But, there's a hacky way to calculate it anyway.
Work out the first date each customer_id appears
Rank the customers in order of when they appeared
MAX(cust_rank) is then the number of customers seen to date
This gives...
WITH
check_first_date AS
(
    SELECT
        *,
        MIN(date_id) OVER (PARTITION BY cust_id) AS cust_id_first_date
    FROM
        example
),
rank_customers_by_time AS
(
    SELECT
        *,
        DENSE_RANK() OVER (ORDER BY cust_id_first_date, cust_id) AS cust_rank
    FROM
        check_first_date
)
SELECT
    date_id,
    MAX(MAX(cust_rank)) OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS customers_to_date,
    SUM(SUM(sales))     OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS sales_to_date
FROM
    rank_customers_by_time
GROUP BY
    date_id
ORDER BY
    date_id
Then you can divide one by the other.
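For example, reusing the two CTEs above and wrapping the final SELECT in a derived table (a sketch only; the cumulative alias is arbitrary and the cast to numeric avoids integer division):
SELECT
    date_id,
    sales_to_date::numeric / customers_to_date AS average_sales
FROM
    (
        SELECT
            date_id,
            MAX(MAX(cust_rank)) OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS customers_to_date,
            SUM(SUM(sales))     OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS sales_to_date
        FROM
            rank_customers_by_time
        GROUP BY
            date_id
    ) AS cumulative
ORDER BY
    date_id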
There are other ways to do the count-distinct over time, such as using correlated sub-queries. I suspect (though I haven't tested it) that this is even slower.
SELECT
    date_id,
    (
        SELECT COUNT(DISTINCT lookup.cust_id)
        FROM example AS lookup
        WHERE lookup.date_id <= example.date_id
    ) AS customers_to_date,
    SUM(SUM(sales)) OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS sales_to_date
FROM
    example
GROUP BY
    date_id
ORDER BY
    date_id
Here is a demo (using PostgreSQL, as the closest approximation to Redshift) with slightly different data to show that it works even when customer ids appear 'out of order'.
https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=a5a37f3337e42123424c5cf1dbfe0152
EDIT: An even shorter (faster?) version with windows
For each customer_id, identify which is their first row (this implicitly requires the rows to have a unique id).
Then sum up the number of first rows that have occurred to date...
WITH
check_first_occurrence AS
(
    SELECT
        *,
        MIN(id) OVER (PARTITION BY cust_id) AS cust_id_first_id
    FROM
        example
)
SELECT
    date_id,
    SUM(SUM(CASE WHEN id = cust_id_first_id THEN 1 ELSE 0 END)) OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS customers_to_date,
    SUM(SUM(sales))                                             OVER (ORDER BY date_id ROWS UNBOUNDED PRECEDING) AS sales_to_date
FROM
    check_first_occurrence
GROUP BY
    date_id
ORDER BY
    date_id
https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=94e5fb624a89170aaf819e2b3ccd01d6
This version should be significantly more friendly to Redshift's horizontal scaling, assuming, for example, that you distribute by customer and sort by date.
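For instance, a table definition along those lines might look like this (illustrative only; the column types and the IDENTITY column are assumptions rather than anything given in the question, but a unique id per row is exactly what the shorter query above relies on):
CREATE TABLE example
(
    id      BIGINT IDENTITY(1, 1),
    date_id DATE,
    cust_id INTEGER,
    sales   INTEGER
)
DISTKEY (cust_id)
SORTKEY (date_id);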

Related

Is there a way to filter a table based on criteria from another table in Power BI using DAX?

So, I have two tables, Scores and Accounts.
ID    Score
1     120
2     150
3     100

ID    Account
1     Account 1
2     Account 2
3     Account 3
I also have 4 measures that calculate the quartile percentile for all of the scores in the Scores table. I was wondering if it was possible to have a measure that concatenates the accounts into one line if their score is, for example, greater than the Quartile 1 measure. For example, if quartile 1 is 110, then I want a measure that would give me "Account 1, Account 2". Is this possible?
I managed to get your result by implementing the following measure, assuming you have a relationship between the two tables.
Accounts GT Q1 =
CONCATENATEX(
    FILTER(
        Scores,
        Scores[Score] > [Quartile 1]
    ),
    RELATED( Accounts[Account] ),
    ", "
)
Output
There may be a simpler way. Let me know if that worked.
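For comparison, the same filter-and-concatenate idea in SQL would look roughly like this (a sketch only, hard-coding 110 as the Quartile 1 value from the example; in Redshift, LISTAGG plays the role of CONCATENATEX):
SELECT
    LISTAGG(a.Account, ', ') WITHIN GROUP (ORDER BY a.ID) AS accounts_gt_q1
FROM
    Scores AS s
    JOIN Accounts AS a
        ON a.ID = s.ID
WHERE
    s.Score > 110;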

Visualization Issues using Running Total on an Appended Query

Using Power BI linked to two separate Access databases.
I have two datasets containing cost estimates. The cost estimates in Dataset 1 run through 2054; the cost estimates in Dataset 2 run through 2074. I used the Append function to join the two tables together and used the Quick Measure for Running Total to create values for cumulative cost by year. I charted this measure and noticed a significant decrease between 2054 and 2055 and was able to determine that the decrease is the cumulative value for Dataset 1. Does anybody know any ways to fix this?
Roughly explained:
Dataset 1 through 2054 totals to 4.5M.
Dataset 2 through 2054 totals to 3M
Dataset 2 through 2055 totals to 3.25M
Appended Dataset through 2054 totals to 7.5M
Appended Dataset through 2055 totals 3.25M instead of the expected 7.75M
I think the issue might be caused by Dataset 1 not having a value for 2055 or after, but I'm not sure how to resolve this issue.
The measure I'm using is:
Cumulative Cost =
CALCULATE(
    SUM( 'AppendedQuery'[Value] ),
    FILTER(
        ALLSELECTED( 'AppendedQuery'[Year] ),
        ISONORAFTER( 'AppendedQuery'[Year], MAX( 'AppendedQuery'[Year] ), DESC )
    )
)
ETA: Picture to explain
Here is your Dataset 1-
Here is your Dataset 2-
Here is your final Dataset after appending Dataset 1 & 2
And finally, here is the output when you add the Year and Cumulative Cost columns to a table visual. As standard Power BI behavior, this just groups the data by the Year column and applies SUM to the Cumulative Cost column.
The calculations are simple-
2051 > 1 + 1 = 2
2052 > 2 + 2 = 4
2053 > 3 + 3 = 6
2054 > 4 + 4 = 8
2055 > 5 = 5
2056 > 6 = 6
=========================
Solution for your case:
As I said in the comments, with the current data the solution will not be a standard one; it assumes a fixed $1 per year per department. If you are happy with that static assumption, you can apply the following steps to achieve your required output.
Step 1: Create a custom column as below (adjust the table name to match yours):
this_year_spent = IF('Dataset 3'[Cumulative Cost] = BLANK(),0,1)
Step 2: Create the following measure:
cumulative =
VAR current_year = MIN( 'Dataset 3'[Year] )
RETURN
    CALCULATE(
        SUM( 'Dataset 3'[this_year_spent] ),
        FILTER(
            ALL( 'Dataset 3' ),
            'Dataset 3'[Year] <= current_year
        )
    )
Here is the final output-
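In SQL terms, the cumulative measure above is doing roughly this (a sketch only; "Dataset 3", Year and this_year_spent are the names used in the steps above):
SELECT
    Year,
    SUM(SUM(this_year_spent)) OVER (ORDER BY Year ROWS UNBOUNDED PRECEDING) AS cumulative
FROM
    "Dataset 3"
GROUP BY
    Year
ORDER BY
    Year;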

Calculate monthly value between 2 tables without an explicit relationship in Power BI model

I am trying to create a measure that calculates (a/qty)*100 for each month,
where qty comes from the Delivery table (generated with an R script):
month qty
<date> <dbl>
1 2019-02-01 1
2 2019-03-01 162
3 2019-04-01 2142
4 2019-05-01 719
And a comes from a table TABLE_A, created within Power BI, that looks like this:
Client    Date          a
x         2019-03-07    3
x         2019-04-14    7
y         2019-03-12    2
So far, I managed to calculate that value overall with the following measure formula:
MEASURE = CALCULATE( (Sum(TABLE_A[a])/sum(Delivery[qty]))*100)
The issue I have is that I would need this measure monthly (i.e. join the tables on month) without explicitly defining a link between the tables in the Power BI model.
For each row in TABLE_A you need to look up the corresponding qty in Delivery, so try something along these lines:
MEASURE =
DIVIDE(
    SUM( TABLE_A[a] ),
    SUMX(
        TABLE_A,
        LOOKUPVALUE(
            Delivery[qty],
            Delivery[month], EOMONTH( TABLE_A[Date], -1 ) + 1
        )
    )
) * 100
The formula EOMONTH( TABLE_A[Date], -1 ) returns the end of the previous month relative to that date, and adding 1 day to that gives the start of the current month for that date.
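In SQL the same month normalisation is just a DATE_TRUNC; the monthly (a/qty)*100 the question asks for would look roughly like this (a sketch only, using the table and column names from the question):
SELECT
    d.month,
    SUM(a.a) * 100.0 / d.qty AS measure_value
FROM
    TABLE_A AS a
    JOIN Delivery AS d
        ON d.month = DATE_TRUNC('month', a."Date")::date
GROUP BY
    d.month,
    d.qty
ORDER BY
    d.month;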

Power BI - Cumulative SUM filtered by current row values

I have a table with three columns - list_id, id, daily_return. id is basically a number sequence increment, reset for every list_id.
Example:
list_id    id    daily_return
1          1     0.2
1          2     0.18
1          3     0.35
2          1     0.15
2          2     0.18
2          3     0.23
I need to create a calculated measure on the chart I am creating such that it creates a running total of daily_return for the same list_id, ordered by the id column.
I am creating a measure in the chart, since I want the rows to be filtered by the user and the calculation itself will be more complex.
How do I get the current list_id and id, so I may use it in my formula?
This is what I have so far. I tried using EARLIER/EARLIEST without success.
cumulative_return = CALCULATE(SUM('CMC Daily Return'[daily_return]), 'CMC Daily Return'[list_id]=EARLIER([list_id]), 'CMC Daily Return'[id]<=EARLIER([id]))
I came up with the following formula that works, but if there is a better one, then I am all ears.
cumulative_return =
CALCULATE(
    SUM( 'CMC Daily Return'[daily_return] ),
    FILTER(
        ALLSELECTED( 'CMC Daily Return' ),
        'CMC Daily Return'[list_id] = MAX( 'CMC Daily Return'[list_id] )
    ),
    FILTER(
        ALLSELECTED( 'CMC Daily Return' ),
        'CMC Daily Return'[id] <= MAX( 'CMC Daily Return'[id] )
    )
)
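For reference, the equivalent running total in plain SQL is a single window function (a sketch only, assuming the underlying table is called cmc_daily_return with the columns shown in the question):
SELECT
    list_id,
    id,
    daily_return,
    SUM(daily_return) OVER (
        PARTITION BY list_id
        ORDER BY id
        ROWS UNBOUNDED PRECEDING
    ) AS cumulative_return
FROM
    cmc_daily_return
ORDER BY
    list_id,
    id;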

DAX measure to count IDs satisfying a threshold condition

SalePersonId    Month     Qty
1               Jan-18    5
2               Jan-18    7
1               Feb-18    1
2               Feb-18    8
3               Feb-18    12
I need to create a measure which gives me a count of salespersons whose total sales quantity is more than 10 for the year 2018.
The result should be 2 (salespersons 2 & 3).
I can achieve this in T-SQL with the following query:
SELECT COUNT(*)
FROM
    (
        SELECT SalePersonId
        FROM T1
        GROUP BY SalePersonId
        HAVING SUM(Qty) > 10
    ) AS t
How can I do the same in DAX?
Here's one possible approach:
= COUNTROWS(
    FILTER(
        SUMMARIZECOLUMNS(
            T1[SalePersonId],
            "Total", SUM( T1[Qty] )
        ),
        [Total] > 10
    )
)
The SUMMARIZECOLUMNS part is essentially
SELECT SalePersonId, SUM(Qty) AS 'Total' FROM T1 GROUP BY SalePersonId
The FILTER part is equivalent to the HAVING clause and then you just count the rows in the resulting table.