Explode a table with a monthly increment in Amazon Redshift

I have a sample table:
id | start_dt | end_dt
100 | 06/07/2021 | 30/09/2021
I would like to get the following output:
id | start_dt | end_dt
100 | 06/07/2021 | 31/07/2021
100 | 01/08/2021 | 31/08/2021
100 | 01/09/2021 | 30/09/2021
I have tried using GENERATE_SERIES() in Amazon Redshift, but it does not give the required result.
The existing table is quite large, so I could use temp tables and then join back to another table at a later stage.
I have trawled through other posts, but the proposed solutions either don't quite give the desired results or don't work at all on Amazon Redshift. Any help in solving this would be appreciated.

The traditional method would be:
Create a Calendar table that contains one row per month, with start_date and end_date columns.
Join your table to the Calendar table, where table.start_dt <= calendar.end_dt AND table.end_dt >= calendar.start_dt.
The two output date columns would then be:
GREATEST(table.start_dt, calendar.start_dt)
LEAST(table.end_dt, calendar.end_dt)
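A minimal sketch of that approach in Redshift SQL, assuming your table is named my_table and a calendar table with one row per month already exists (both names are illustrative):
-- calendar(start_date, end_date) holds one row per month,
-- e.g. ('2021-07-01', '2021-07-31'), ('2021-08-01', '2021-08-31'), ...
SELECT t.id,
       GREATEST(t.start_dt, c.start_date) AS start_dt,
       LEAST(t.end_dt, c.end_date)        AS end_dt
FROM my_table t
JOIN calendar c
  ON t.start_dt <= c.end_date
 AND t.end_dt >= c.start_date
ORDER BY t.id, start_dt;
Each source row joins to every calendar month it overlaps, and GREATEST/LEAST clip the month boundaries to the row's own date range, producing exactly one output row per overlapped month.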

Related

How can I aggregate one column in one table after applying a filter based on a column in another table in Power BI?

I am trying to do something in Power BI which is trivial in SQL. Suppose I have two tables like this:
Table Accounts
Account | Closing
A | 01-01-2021
B | 01-02-2021
Table Payments
Account | Date | Amount
A | 01-01-2020 | 10
A | 01-03-2021 | 20
A | 01-04-2021 | 30
B | 01-01-2005 | 10
B | 01-03-2021 | 20
B | 01-04-2021 | 30
I would like to create a measure that aggregates the transactions of the accounts if they happened after the closing date. In SQL it would be like this:
select a.account, sum(b.amount)
from accounts a
inner join payments b on a.account = b.account
where b.date > a.closing
group by a.account
In Power BI, I can easily build the aggregate without the where clause just by creating a measure in Accounts with the value SUM(Payments[Amount]). I tried to add the filter command in the sum, but Accounts[Closing] does not show up as an available column when I write the filter.
I have already spent quite some time looking for solutions and didn't find anything satisfactory. Of course I could solve this by doing the aggregation at the SQL level, but there are other things that wouldn't work very well in my original model.
Thanks very much in advance and kind regards.
If both your tables are properly related using the Account column, you can use the measure below for your purpose:
total =
CALCULATE(
    SUM(Payments[Amount]),
    FILTER(
        Payments,  -- iterate over the payment rows in the current filter context
        Payments[Date] > MIN(Accounts[Closing])  -- keep only payments after the account's closing date
    )
)

AWS Athena date sql query

Below is the data in a CSV file in an S3 bucket, which I have used to build the Athena database.
John,Wright,cricket,25
Steve,Adams,football,30
I am able to run queries and get the data.
Now I am trying to derive the date of birth from the age column. Is it possible to generate the date of birth as current date minus the age column, and print only the date of birth?
I tried the query below, but I am not sure whether it is the correct way:
select (current_date - interval age day) from table_name;
Please help me with this.
You can use the date_add function, like this:
SELECT date_add('year', -age, current_date) FROM table_name
That is, it subtracts age number of 'year'(s) from the current date.
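Note that an age in years only pins the birth date down to the year, so the result is an approximation. A quick sketch with hypothetical column names (first_name, age):
SELECT first_name,
       date_add('year', -age, current_date) AS approx_date_of_birth
FROM table_name;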

Power BI - Select Slicer Date Between 2 Columns

Hopefully a quick explanation of what I am hoping to accomplish, followed by the approach we've been working on for over a year.
Desired Result
I have a table of SCD values with two columns, SCD_Valid_From and SCD_Valid_To. Is there a way to join a date table in my model (or simply use a slicer without a join) in order to be able to choose a specific date that is in between the two SCD columns and have that row of data returned?
Original Table
ID | SCD_Valid_From | SCD_Valid_To | Cost
1 | 2020-08-01 | 2020-08-03 | 5.00
The slicer date chosen is 2020-08-02; I would like this ID=1 record to be returned.
What We've Attempted So Far
We had a consultant come in and help us get Power BI launched last year. His solution was to create an expansion table that would contain a row for every ID/Date combination.
Expanded Original Table
ID | SCD_Valid_Date | Cost
1 | 2020-08-01 | 5.00
1 | 2020-08-02 | 5.00
1 | 2020-08-03 | 5.00
This was happening originally on the Power BI side, and we would use incremental refresh to control how much of this table was getting pushed each day. Long story short, this was extremely inefficient and made the refresh too slow to be effective - for 5 years' worth of data, we would need over 2000 rows per ID just to be able to select a dimensional record.
Is there a way to use a slicer where Power BI can select the records where that selected date falls between dates in two columns of a table?
Let me explain a workaround; I hope this will help you solve your issue. I assume you have the two tables below:
"Dates" table with a column "Date", from which you are generating the date slicer.
"your_main_table" with the columns "scd_valid_from" and "scd_valid_to".
Step 1: If you do not have a relationship between the tables "Dates" and "your_main_table", that is fine; otherwise you have to create a new table such as "Dates2". For this workaround, you cannot have a relationship between those tables.
If a relationship is already established between those tables, create a new calculated table with the code below:
Dates2 =
SELECTCOLUMNS(
    Dates,
    "Date", Dates[Date]
)
From here on, I will use "Dates2" as the source of your date slicer. If your "Dates" table has no relationship with the table "your_main_table", just use "Dates" in place of "Dates2" in the measures below. Now create the following four measures in the table "your_main_table":
1.
date_from_current_row = MAX(your_main_table[SCD_Valid_From])
2.
date_to_current_row = MAX(your_main_table[SCD_Valid_To])
3.
date_selected_in_slicer = SELECTEDVALUE(Dates2[Date])
4.
show_hide_row =
IF(
    [date_selected_in_slicer] >= [date_from_current_row]
        && [date_selected_in_slicer] <= [date_to_current_row],
    1,
    0
)
Now you have all the instruments ready. Create your visual using columns from the table "your_main_table".
Final step: add a visual-level filter with the measure "show_hide_row" and set it to show items only when "show_hide_row" is 1.

How to select data from aws athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

The Athena table is partitioned the same way as the S3 folder path:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now how should I form the WHERE condition of a SELECT query to get data for parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' from 2019-06-01 to 2020-04-30?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
  AND year >= 2019 AND year <= 2020
  AND month >= 04 AND month <= 06
  AND date >= 01 AND date <= 31;
But this isn't correct. Please help.
Partitioning on year, month, and day separately makes tables unnecessarily difficult to query. If you're starting out, I really suggest avoiding this kind of partitioning scheme. If you can't avoid it, you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load the partitions. This works, but it's far from the best way to use Athena: MSCK REPAIR TABLE has atrocious performance, and partitioning like that is far from ideal.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want: 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure where there is one distinct prefix per date will work more or less equally well.
I also suggest you create tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
Then, instead of running MSCK REPAIR TABLE, you use ALTER TABLE … ADD PARTITION or the Glue APIs directly to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location, this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
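With the partitions registered that way, the date-range query from the question becomes straightforward. A sketch, assuming the parent/day schema above:
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
  AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30';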
Alternatively, here is how you can use the year, month, and day values that come from the partitions to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST("date" AS VARCHAR(2))), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter conditions as needed.

Query to calculate cost by month using AWS Athena querying

I have a table like below:
item_id | bill_start_date | bill_end_date | usage_amount | user_project
635212 | 2019-02-01 00:00:00.000 | 3/1/2019 00:00:00.000 | 13.345 | IBM
I am trying to find the usage_amount by each month and each project. The Amazon Athena query engine is based on Presto 0.172. Due to limitations in Athena, it does not recognize queries like select sysdate from dual;.
I tried to convert bill_start_date and bill_end_date from timestamp to date but failed; even current_date() didn't work in my case. I am able to calculate the total cost by hard-coding the values, but my end goal is to perform the calculation on the columns.
SELECT (FLOOR(SUM(usage_amount)*100)/100) AS total,
user_project
FROM test_table
WHERE bill_start_date
BETWEEN date '2019-02-01'
AND date '2019-03-01'
GROUP BY user_project;
In Presto, current_timestamp is a SQL-standard function and does not use parentheses; the same is true of current_date, which is why current_date() fails.
To group by month, I'd use date_trunc('month', bill_start_date).
All of these functions are documented in the Presto documentation.
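A minimal sketch putting those pieces together, using the test_table schema from the question:
SELECT date_trunc('month', bill_start_date) AS bill_month,
       user_project,
       FLOOR(SUM(usage_amount) * 100) / 100 AS total
FROM test_table
GROUP BY date_trunc('month', bill_start_date), user_project
ORDER BY bill_month, user_project;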