I have a table that looks like this:
Account Value Last_Day_in_Month
ABC 7 2018-06-30
ABC 12 2018-06-30
ABC 3 2018-08-31
FGH 57 2019-01-31
FGH 13 2019-03-31
FGH 127 2019-03-31
For each account, I need to fill in the missing month-end dates so that the resulting table carries the value forward from the most recent prior month (you'll notice two additional rows below):
Account Value Last_Day_in_Month
ABC 7 2018-06-30
ABC 12 2018-06-30
ABC 12 2018-07-31
ABC 3 2018-08-31
FGH 57 2019-01-31
FGH 57 2019-02-28
FGH 13 2019-03-31
FGH 127 2019-03-31
I have many accounts, each with different start and stop times (Last_Day_in_Month), so I only need to fill in the missing months between the min and max month for each account. Because an account may have multiple values for one single month-end date, my current solution uses a lead() with a case statement that adds a single day, plus a date table that contains only the last day of each month, and a cross join. But I think it's messy and I'm sure there's a better way that I'm not aware of. Here is my current solution...
with next_dates as (
    select
        *,
        lead(Last_Day_in_Month, 1) over (
            partition by Account
            order by Last_Day_in_Month
        ) as intermed2
    from my_table  -- your source table ("table" itself is a reserved word)
),
bounds as (
    select
        *,
        case
            when intermed2 = Last_Day_in_Month
                then dateadd(day, 1, intermed2)
            else intermed2
        end as next_last_day
    from next_dates
)
select *
from bounds
cross join dates
where dates.date_actual >= bounds.Last_Day_in_Month
  -- coalesce keeps each account's final month, whose lead() is null
  and dates.date_actual < coalesce(bounds.next_last_day, dateadd(day, 1, bounds.Last_Day_in_Month))
Any suggestions are appreciated.
What you are doing is fine for reasonable numbers of rows. One thing I'd recommend for clarity is changing from a cross join to a right join with an ON clause. The query planner should see right through what you have and plan an efficient query either way, so this is just a nit.
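As a rough sketch of that suggestion (reusing the placeholder CTE and table names from the query above; whether you keep it an inner join or make it the right join mentioned above depends on whether you want calendar dates with no matching account row kept), the final select with the range conditions moved into an ON clause might read something like:

select bounds.Account, bounds.Value, dates.date_actual as Last_Day_in_Month
from bounds
join dates
  on dates.date_actual >= bounds.Last_Day_in_Month
 and dates.date_actual < coalesce(bounds.next_last_day, dateadd(day, 1, bounds.Last_Day_in_Month))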
There are a number of other ways to do this; you can find examples by searching for "gaps and islands" on Stack Overflow. The biggest feedback I have is about creating additional rows. What you are doing is making new rows for the missing months, which is fine for reasonably small tables because they don't get super large when you add rows. But, for example, if you have a table with 100 billion rows and an average gap size of 2, you will be creating a result with 300 billion rows. Making this much data will never be fast or efficient. So when you say you have "many accounts", how many is many?
If the amount of data can fit in memory, or you are doing this operation only once in a while, then creating rows will work OK. If this is being done as part of ongoing queries and the data created will be large, then I'd rethink why you need to create data to get your queries to execute. In general, Redshift stores very large sets of data, and multiplying out (cross join) these rows by some other factor (dates) will result in very slow queries. If the intent is to pare this data down to some smaller result, you will want to find a way to create that result without making such a large intermediate dataset.
I have the following measure:
test = SWITCH(TRUE(),
MAX(test[month])>=9&&MAX(test[month])<=12,"fall",
MAX(test[month])>=1&&MAX(test[month])<=3,"winter",
MAX(test[month])>=4&&MAX(test[month])<=6,"spring",
MAX(test[month])>=7&&MAX(test[month])<=8,"summer")
Currently it looks at the month number (i.e. "3" for March) and outputs "winter". What I'd like, however, is for it to output a count per season, to show the distribution of the seasons in the dataset.
For example, my desired output would be:
Month Number  Count of occurrences of each season
fall          5
winter        7
spring        11
summer        2
I can't have a calculated column here either, as I will want to make this measure dynamic later on with the use of a slicer. Can someone tell me if this is possible?
The issue here is that you want to define your categories within the measure. Measures are not dynamic without some filter-context.
Take this for example:
Notice that the output of the calculation is identical between seasons.
There is no filter context to help the measure discern between the different seasons because these seasons are not defined in the model. (At least, I don't know how to make this work)
SWITCH returns the first true result, so if you have values like those in your sample, start with the smallest boundary, then bigger ones, with the largest at the end.
test =
SWITCH (
    TRUE (),
    MAX ( test[month] ) < 4, "winter",   -- test < 4
    MAX ( test[month] ) < 7, "spring",   -- 3 < test < 7
    MAX ( test[month] ) < 9, "summer",   -- 6 < test < 9  (is it OK that summer has 2 months and fall has 4?)
    "fall"                               -- 8 < test
)
If you use MAX(test[month])<4,"winter" instead of MAX(test[month])<=3,"winter" then you avoid one calculation step and the code will be faster.
Then you need to use the result to find the month numbers and get the dates from the selected months, and then calculate your table filtered by those months' dates. If this answer is not enough to solve the case, then give more information about your table and its columns, and explain what you mean by 'Count of occurrences of each season': what exactly does 'occurrences' mean? Is it a count of certain rows, or of some unique values?
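If 'occurrences' simply means the number of rows whose month falls in a given season, then, as a rough sketch (only the test table and test[month] column come from your example; the measure name is made up), one measure per season could look like this:

Winter Count =
CALCULATE (
    COUNTROWS ( test ),                                    -- count rows for that season
    FILTER ( test, test[month] >= 1 && test[month] <= 3 )
)

with analogous measures for spring, summer and fall. To get all four seasons as rows of a single visual, though, you would still need the seasons defined somewhere in the model (for example, a small season table), as noted above.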
I'm building a data model for a report that allows users to analyze building valuations over time, and details about buildings and their current leases.
I have a fact table that contains leasing information, and a dimension table that contains building information.
I have a third table that holds building valuations recorded every other quarter. I want to know whether it should be considered a slowly changing dimension table or another fact table.
The structure of the tables is as follows.
Fact_Leases
Lease ID  Building ID  Floor  Customer    Start Date  End Date    Occupied Area (sq ft)  Yearly Rent
101       1            1      Customer X  1/1/2000    1/1/2020    60                     $10
102       1            2      Customer X  1/1/2010    31/12/2030  40                     $25
103       2            3      Customer X  6/1/2015    5/8/2032    15                     $17
104       2            1      Customer Y  5/6/2016    6/9/2028    5                      $12
105       3            1      Customer Z  4/3/2017    12/2/2020   50                     $19
Dim_Buildings
Building ID  Building Name  Sq Ft  Units
1            Building 1     100    10
2            Building 2     150    20
3            Building 3     125    50
?_Valuations
Building ID  Quarter  Valuation
1            Q2       $50
1            Q4       $55
2            Q2       $40
2            Q4       $35
3            Q2       $32
3            Q4       $44
At first, I thought the Valuations table was a dimension table because it relates to information about the building dimension. I considered joining the valuation data to the building dimension table but this would result in needlessly repeated rows, so I left it as a separate table.
However, the valuation table will not be used to filter the leases table, and the valuation column would be considered a measurement, which makes me think it is actually another fact table.
Can anybody clear this up for me?
Short answer: Yes. It is a fact table.
You have two fact tables that differ only in terms of granularity. Your Fact_Leases table, for example, is a fact table at the granularity of a lease. I can assume this quite safely because it appears the Lease ID column is a primary key. Each row of that table represents a lease.
On the other hand, your ?_Valuations table is a fact table at the granularity of quarter-time-building. That is, each row represents not only a building but also a quarter time period. And one way you can sort of know that this is a fact table is by understanding that if you had a date-dimension table, you could relate the two on their Quarter columns (although it would be a many-to-many relationship). Therefore, your date-DIMENSION table would be explaining the facts of your valuations. (I'd recommend, however, replacing your Quarter column with actual dates and allowing the date-dimension table to inform the quarters. That's an aside, though.)
Now, the problem of repeating valuation metrics occurs because you are trying to combine two fact tables at different levels of granularity. When you try to apply the valuations to the Fact_Leases table, which is at the granularity of lease, Power BI (or any BI tool, for that matter) can't understand how to apportion the valuation at the BUILDING level down to the LEASE level of granularity. So it just repeats. And it's important to keep this in mind when developing your reporting. No visualizations built at the context level of lease will be able to include a valuation metric because valuations exist only at a higher level of granularity.
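To make that concrete with a small sketch (assuming the valuations table is loaded simply as Valuations; the measure name is made up), a valuation measure would just be:

Total Valuation = SUM ( Valuations[Valuation] )   -- aggregates at building/quarter grain

It aggregates fine in visuals built at the building or quarter level, but in a visual at the lease level there is nothing to apportion it by, which is exactly the repeating behavior described above.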
I have an explore like the following -
Timestamp Rate Count
July 1 $2.00 15
July 2 $2.00 12
July 3 $3.00 20
July 4 $3.00 25
July 5 $2.00 10
I want to get the below results -
Rate Number of days Count
$2.00 3 37
$3.00 2 45
How can I calculate the Number of days column in the table calculation? I don't want the timestamp to be included in the final table.
First of all, is Rate a dimension? If so, and you have LookML access, you could create a "Count Days" measure that's just a simple count, and then return Rate, Count Days, and Count. That would be really simple.
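For what it's worth, a minimal LookML sketch of that idea (the field names here are assumptions, not taken from your model) might be:

measure: count_days {
  type: count            # plain row count; gives days if the explore has one row per day
}

# or, if the same day can show up on more than one row:
measure: count_of_days {
  type: count_distinct
  sql: ${created_date} ;;    # hypothetical date dimension
}

Then the query would just be Rate, Count Days (or Count of Days), and Count.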
If you can't do that, this is hard to do with just a table calculation, since what you're asking for is to change the grouping of the data. Generally, that's something that's only possible in SQL or LookML, where you can actually alter the grouping and aggregation of the data.
With table calculations, you can perform operations on the data that's been returned by the query, but you can't change its grouping or aggregation. So the issue becomes that it's quite difficult to take 3 rows and then use a table calculation to represent them as 1 row.
I'd recommend taking this to the LookML or SQL if you have developer access or can ask someone who does. If you can't do that, then I'd suggest you look at this thread: https://discourse.looker.com/t/creating-a-window-function-inside-a-table-calculation-custom-measure/16973 which explains how to do these kinds of functions in table calculations. It's a bit complex, though.
Once you've done the calculation, you'd want to use the Hide No's from Visualization feature to remove the rows you aren't interested in.
I have a couple of different tables in my report; for demonstration purposes let's say that I have one data source that is actual invoice amounts and another table that is forecasted amounts. Each table has several dimensions that are the same between them, let's say Country, Region, Product Classification and Product.
What I want is to be able to display a table/matrix that pulls information from both of these data sources like this
Description              Invoice  Forecast  vs Forecast
USA                      300      325       92%
  East                   150      175       86%
    Product Grouping 1   125      125       100%
      Product 1          50       75        67%
      Product 2          75       50        150%
    Product Grouping 3   25       50        50%
      Product 3          25       50        50%
  West                   150      150       100%
    Product Grouping 1   75       100       75%
      Product 1          25       50        50%
      Product 2          50       50        100%
    Product Grouping 3   75       50        150%
      Product 3          75       50        150%
I have not been able to figure out a way to combine the information from the multiple data sources into a single matrix table, so any help would be appreciated. The one thing that I did find was that somebody hard-coded the structure of the rows into a separate data source and then used DAX expressions to pull the pieces of information into the columns, but I don't like this solution because the structure of the rows is not constant.
What you're asking about is a common part of the star schema: combining facts from different fact tables together into a single visual or report.
What Not To Do (That You Might Be Tempted To)
What you don't want to do is combine the 2 fact tables into a single table in your Power BI data model. That's a lot of work and there's absolutely no need. Especially since there are likely dimensions that the 2 fact tables do not have in common (e.g. actual amounts might be associated with a customer dimension, but forecast amounts wouldn't be).
What you also don't want to do is relate the 2 fact tables to each other in any way. Again, that's a lot of work. (Especially since there's no natural way to relate them at the row level.)
What To Do
Generally, how you handle 2 fact tables is the same as you handle a single fact table. First, you have your dimensions (country, region, classification, product, date, customer). Then you load your fact tables, and join them to the dimensions. You do not join your fact tables to each other. You then create measures (i.e. DAX expressions).
When you want to combine measures from the two facts together in a single matrix, you only use rows/columns that are meaningful to both fact tables. For example, actual amounts might be associated with a customer, but forecast amounts aren't. So you can't include customer information in the matrix. Another possibility is that actual amounts are recorded each day, whereas forecasts were done for the whole month. In this situation, you could put month in your matrix (since that's meaningful to both), but you wouldn't want to use date because Power BI wouldn't know how to divide up forecasts to individual dates.
As long as you're only using dimensions & attributes that are meaningful to both fact tables, you can easily create a matrix as you envision above. Simply drag on the attributes you want, then add the measures (i.e. DAX expressions).
The Invoice & Forecast columns would both be measures. The two measures from different fact tables can then be combined into a 3rd measure for the vs Forecast column. Everything will work as long as you're just using dimensions/attributes that mean something to both fact tables.
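As a sketch only (the fact table and column names below are assumptions, since your actual model isn't shown), those three measures might look like:

Invoice Amount = SUM ( FactInvoice[Amount] )        -- hypothetical invoice fact table
Forecast Amount = SUM ( FactForecast[Amount] )      -- hypothetical forecast fact table
% vs Forecast = DIVIDE ( [Invoice Amount], [Forecast Amount] )

Drag the shared attributes (Country, Region, Product Classification, Product) onto the matrix rows and these three measures onto values, and you get a layout like the one in the question.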
I don't see anything in your proposed pivot table that strikes me as problematic.
Other Situations
If you have a situation where forecasts are at a month level and actual is at a date level, then you may be wondering how you'd relate them both to the same date dimension. This situation is called having different granularities, and there's a good article here I'd recommend reading that has advice: https://www.daxpatterns.com/handling-different-granularities/. Indeed, there's a whole section on comparing budget with revenue that you might find useful.
Finally, you mention that someone hard-coded the structure of the rows and used DAX expressions to build everything. This does, admittedly, sound like overkill. The goal with Power BI is flexibility. Once you have your facts, measures & dimensions, you can combine them in any way that makes sense. Hard-coding the rows eliminates that flexibility, and is a good clue that something isn't right. (Another good clue that something isn't right is when DAX expressions seem really complicated for something that should be easy.)
I hope my answer helps. It's a general answer since your question is general. If you have specific questions about your specific situation, definitely post additional questions. (Sample data, a description of the model, the problem you're seeing, and what you want to see is helpful to get a good answer.)
If you're brand new to Power BI, data models, and the star schema, Alberto Ferrari and Marco Russo have an excellent book that I'd recommend reading to get a crash course: https://www.sqlbi.com/books/analyzing-data-with-microsoft-power-bi-and-power-pivot-for-excel/
I'm using Crystal Reports again after not touching it for about 8 years.
I'm having this situation...
I have 1 data table, and 1 table with just day numbers from 1 to 31.
The two tables aren't really linked to each other in any way.
In my report I let the user select a reference date.
From that date I grab the maximum days of the month.
The report lists a row per day of that month, but there are no actual database fields in there: just the first 2 letters of the day name, the day number, and another formula-based field showing 'yes/no' or '' depending on a main record value.
So far so good.
In the group header I was adding the fields from the main data table, which all went fine until I added fields that, in the query on the SQL server, rely on some CASE expressions; Crystal Reports just reads it out as one single record row with everything in it.
For some reason the report generation goes from 1-2 seconds to 30-40 seconds once I add that field, which just outputs 'X' or '' (it represents things assigned to that user).
Other reports where I'm using the same data still generate in 2 seconds.
To get this working right and to eliminate double date records, I'm stuck with 3 groups.
I think this isn't optimal and is the reason for the slowdown, although it wasn't there at the start.
So I was wondering:
Should I go for a sub report for the day listing?
Can I feed the subreport with my date parameter?
Or is there some kind of scripted way to list a row x times without all the grouping requirements?
Synchro was right, the problem was in the actual query/view.
For some reason the view takes half a minute longer if you just add an ORDER BY on a specific field.
The "where id between 211 and 265 or id=67" has been moved from a joined view to the actual query.
Thanks for the hint, Synchro.