Is it possible to replace the SCD2 transformation in SAS DI Studio?
I use it to track changes to dimension data, but with huge tables (about 10 million rows/observations) it is very time consuming. What is the best alternative loading method to SCD2 that would save time?
Is SCD2 the only transformation that tracks historical changes?
Thanks,
I'm new to Power BI and am working on a large database. I am attempting to prepare the data in the Power Query Editor.
I would like to code as many steps as possible, as analysing each column manually is extremely time consuming.
My coding goals (in order of priority):
For each query, I would like to get its column quality.
Ideally, I would like to export the header names together with the column quality, so that I can determine which columns are relevant. I can also use the column names to determine which column relationships might be relevant. The database is huge, so simply importing all the data and trying to work with it from there is not feasible; in fact, Power BI reports an error that I don't have enough free memory.
I have VBA and some SQL experience.
I know I have a lot to learn about Power BI, and I am working on it, but I need some guidance and direction, including on what is possible and feasible.
Any constructive hints, advice, or feedback would be appreciated. Thank you!
Use Table.Profile() on each table and load to the data model.
https://learn.microsoft.com/en-us/powerquery-m/table-profile
Should we use the Group By function in Power Query and create a new table, or is it better to create as many measures as we need (one measure for each column)?
Which one is more powerful?
Thank you!
It depends on your purpose. If you have a granular fact table that you want to aggregate before building the data model, you can do that in Power Query before feeding the model. Even then, if the source is a SQL table, I would recommend doing it server-side, so that a native SQL GROUP BY does the work rather than Power Query syntax alone. Power Query can lag: internally, each step is evaluated starting from the first step, and it requires a full refresh of the table.
However, if you only need the grouping for an analysis, it is a good idea to use DAX measures and refrain from using PQ. You also can't rely on PQ to cover different analysis scenarios. DAX is built for those scenarios and is extremely powerful; measures are the most powerful concept in Power BI. They are also evaluated in filter context, i.e. they respond to the values selected in slicers and to whatever is on the axis of the visual.
There is plenty of support for DAX measure optimization, such as SQLBI, Stack Overflow, and the Power BI community. If optimized correctly, DAX measures greatly improve report performance without introducing any lag in the report.
When you create a new table in Power Query, the results are precalculated, so there is some performance gain in report usage, but the data model gets larger. A measure, by contrast, is calculated on the fly: the model size stays the same, but the presentation layer gets somewhat slower. As far as I know, there is no single answer to your question, because it depends on many other things, such as:
Your data size
How many measures you want to create
How complex the logic inside your measures is
How often you need to reload your data
and so on...
This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the code requires manual updates each time it is run to ensure it picks up the correct dates, so I would have something such as:
Data Quarters;
Set XYZ_201803
XYZ_201806
...
...
XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've considered a few different ideas and had a few sent my way. One of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so they can be read with a simple filter on the date, as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient computationally? For datasets that are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you usually need only small sections of it. Separate datasets are common where people mostly analyse individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then having separate is easier in some ways than having to manage the change between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, indexing appropriately, are all important solutions. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
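As a rough sketch of that macro approach, assuming the datasets really are named XYZ_YYYYMM at three-month intervals as in the question: the macro name `quarter_list` and its parameters are made up for illustration, and the "latest quarter" rule at the bottom is an assumption about how the data arrives.

```sas
/* Hypothetical helper: emits XYZ_YYYYMM names, one per quarter,  */
/* from &start to &end (both given as YYYYMM).                    */
%macro quarter_list(start=, end=, prefix=XYZ_);
  %local d stop;
  %let d    = %sysfunc(inputn(&start.01, yymmdd8.));
  %let stop = %sysfunc(inputn(&end.01, yymmdd8.));
  %do %while (&d <= &stop);
    &prefix.%sysfunc(putn(&d, yymmn6.))
    %let d = %sysfunc(intnx(month, &d, 3, b));
  %end;
%mend quarter_list;

/* Assumption: the newest dataset is the last month of the most   */
/* recently completed quarter.                                    */
%let latest = %sysfunc(intnx(qtr, %sysfunc(today()), -1, e), yymmn6.);

data Quarters;
  set %quarter_list(start=201803, end=&latest);
run;
```

The macro only generates the list of names for the SET statement, so the data step itself stays the same from one quarter to the next.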
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set, created by appending the quarterly data sets is more efficient.
From a resource standpoint
You have to make sure you have a disk large enough to hold the single large table.
Keep the original pieces on separate storage; there is no need to clutter the primary data disk with all of them.
A 2TB SSD is very fast, remarkably cheap, and low power and can contain a table comprised of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements.
"... simple filter ..." is part of "Keep it Simple S****" (KISS)
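For illustration, a simple index on the filter variable from the question could be created as below. This is a sketch that assumes the stacked table is WORK.AppendedData and that Date is the variable most often used in WHERE; whether SAS actually uses the index depends on how selective the filter is.

```sas
/* Create a simple index on Date without rewriting the table. */
proc datasets library=work nolist;
  modify AppendedData;
  index create Date;
quit;

/* MSGLEVEL=I writes a note to the log saying whether an index */
/* was used to optimize the WHERE clause.                      */
options msglevel=i;

data Quarters;
  set AppendedData;
  where Date > 201812;
run;
```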
The problem I'm solving has many simple solutions, but what I need is a way to reduce the time and memory the process takes.
On one side I have a table with a few hundred IDs, and on the other 40 monthly tables and counting.
Each of the tables has between 500,000 and 1 million records, one per unique ID. Each table has a few thousand variables, but I only need 10-20 of them.
I need to look through the tables to find the latest one in which a particular ID from the base table occurs, and get the values of the variables I need.
The newest monthly table is recalculated every day, so many IDs from previous months may occur again; I therefore cannot just create an indexed dictionary (last.id and variables) once. I also can't afford to build a new dictionary from all tables every day.
Visual description
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables, keeping only the variables needed, sort by ascending ID and month, and select last.id in a data step. Then join or merge with the base table.
Problem: too much memory is needed to set all the tables.
Alternatively, I used proc append in a loop. Unfortunately, that is not very time or memory efficient.
Inner join with each of the tables separately in a loop:
Low memory use, but very time consuming.
Create a dictionary based on all months except the latest and update it every day.
Problem: large dictionary table.
Now I'm looking for smart concepts for solving this kind of problem. Maybe hash objects, but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
If someone was to write some code to generate some dummy data based on your specs they may be able to provide a more specific answer to your question. But without sample data it's hard to know the best way without trial and error.
Instead I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (i.e. use a where clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups.
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows beyond 10 GB. No additional SAS modules are required; this is enabled as part of Base SAS.
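To make the hash table suggestion concrete for the question above, here is a minimal sketch. All names are hypothetical (WORK.BASE_IDS, the WORK.MONTH_YYYYMM tables, the key `id`, and `var1`/`var2` standing in for the 10-20 variables needed), and it assumes each monthly table holds at most one row per ID. The small base table is loaded into a hash object, the monthly tables are read newest first with a keep= list, and an ID is removed from the hash as soon as it is found, so older months cannot overwrite it.

```sas
data work.latest_values;
  length source_table $41;

  /* Load the few hundred base IDs into memory once. */
  if _n_ = 1 then do;
    declare hash ids(dataset: "work.base_ids");
    ids.defineKey("id");
    ids.defineDone();
  end;

  /* Newest table first; keep= limits I/O to the needed columns. */
  set work.month_202403 (keep=id var1 var2)
      work.month_202402 (keep=id var1 var2)
      work.month_202401 (keep=id var1 var2)
      indsname=current_table;

  /* check() returns 0 when the ID is still waiting for a match. */
  if ids.check() = 0 then do;
    source_table = current_table;      /* record which month supplied the row */
    output;
    rc = ids.remove();                 /* ignore this ID in older months */
    if ids.num_items = 0 then stop;    /* every ID resolved: stop reading */
  end;

  drop rc;
run;
```

Because only the few hundred IDs are held in memory and each monthly table is read at most once, memory use stays small, and the step stops early once every ID has been found. The list of tables in the SET statement could be generated by a macro rather than typed by hand.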
I have a fact table, a measure table, and dimension tables connected to them. It is just a slight modification of a star schema. But now, as the number of joins has increased with the introduction of the measure table, query processing time has increased. Can anyone suggest an approach to improve the efficiency of query processing, such as adding bridges between dimensions?
Thanks
Store an aggregated version of your data by the frequently used dimensions, so that you can leave the detail tables out of the joins that need to run faster.
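As one possible shape for such an aggregate, sketched in PROC SQL with made-up names (WORK.FACT_SALES, `month_key`, `region_key`, and `amount` are all assumptions, not part of the question):

```sas
/* Pre-aggregate the fact table by the dimensions most often used. */
proc sql;
  create table work.sales_by_month_region as
  select f.month_key,
         f.region_key,
         sum(f.amount) as total_amount,
         count(*)      as row_count
  from work.fact_sales as f
  group by f.month_key, f.region_key;
quit;
```

Queries that only need month and region totals can then read this smaller table and join it to just those two dimensions, leaving the measure table and the remaining dimensions out.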