SAS Data Mart file structure suggestions

I've been working on a SAS ETL project in which, at the beginning of every month, we extract the previous month's data from a Teradata warehouse and then take it further for processing.
This is done via extraction scripts for each table, and the data is then stored in a monthly folder structure (yyyymm). After working this way for several months, we've now begun getting requests to produce daily, weekly, etc. extracts.
The current data storage folder structure is:
Library/Data/YYYYMM, all within one library.
I have to change the structure (with minimal impact on the current structure) to accommodate different timeframe requests such as daily, weekly, fortnightly, quarterly, etc.
I thought of two options.
Option 1: In the current structure (monthly folders), add Daily, Weekly and Monthly subfolders:
Library / YYYYMM / Monthly
Library / YYYYMM / Daily
Library / YYYYMM / Weekly
Option 2:
Under the data library, create folders like
Monthly
Daily
Weekly
Fortnightly
Quarterly
and under each of these, individual folders named for the current date/month/quarter.
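For illustration, option 2 would map to librefs something like these (the physical root path and libref names are just placeholders):

libname mthly "/sasdata/mart/Monthly/202403";     /* monthly extract   */
libname daily "/sasdata/mart/Daily/20240315";     /* daily extract     */
libname wkly  "/sasdata/mart/Weekly/2024W11";     /* weekly extract    */
libname qtrly "/sasdata/mart/Quarterly/2024Q1";   /* quarterly extract */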
Can anyone suggest any other, more practical design approaches?

Maybe SAS generation data sets would be an option: http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000934566.htm
Here is an extract from the SAS documentation:
A generation data set is an archived version of a SAS data set that is stored as part of a generation group. A generation data set is created each time the file is replaced. Each generation data set in a generation group has the same root member name, but each has a different version number. The most recent version of the generation data set is called the base version.
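A minimal sketch of how that could look for one of the mart tables (the library path, member name and staging table are just placeholders):

libname mart "/sasdata/mart";                  /* placeholder data library    */

data mart.sales (genmax=12);                   /* keep the 12 newest versions */
    set work.monthly_extract;                  /* placeholder staging table   */
run;

/* Re-running the step archives the previous version; older generations      */
/* are read back with the GENNUM= option, e.g. the immediately preceding one: */
proc print data=mart.sales (gennum=-1);
run;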

Related

Power BI - Automatically identify new folders to get data and combine them

A new year is coming and the path where we save our data is changing.
The problem is: the data from each asset is saved in folders with the year as their name, so data from 2022 is saved in a folder called 2022, and so on.
I want to make a query that will automatically identify which years (folders) we have data for and combine them.
The data is saved in this path:
C:\Users\Projects\3. Assets\Type A\Asset Name\Control\YEAR\Data\Dataset\excel.xlsx
This asset, for instance, has 3 years of data: 2020, 2021 and 2022.
By next week we will already have a 2023 folder with new data. Usually I manually add a Table.Combine to the query, but we have a large number of assets and it can be tricky.
Does anyone know an efficient way to automatically identify all the folders named with a year and combine the Excel data inside them?
This is the way I usually do it:
Table.Combine ({Sharepoint("...2020/Data/Dataset"),Sharepoint("...2021/Data/Dataset"),Sharepoint("...2022/Data/Dataset")})
Sharepoint is a function that returns folder content from SharePoint.
Best Regards
Can you pull all the directory names at a higher level, similar to
https://exceltown.com/en/tutorials/power-bi/powerbi-com-and-power-bi-desktop/power-bi-data-sources/connect-power-query-whole-sharepoint-folder/
and then filter them?
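Something along these lines might work; this is only a sketch, and the site URL, asset path and workbook handling are assumptions you would need to adapt:

let
    // Connect at the site level and let SharePoint.Files list every file below it.
    Source    = SharePoint.Files("https://contoso.sharepoint.com/sites/Projects", [ApiVersion = 15]),
    // Keep only files that live under this asset's Control folder.
    AssetOnly = Table.SelectRows(Source, each Text.Contains([Folder Path], "/Asset Name/Control/")),
    // Keep only paths that contain a four-digit year folder and a /Data/Dataset/ segment.
    YearData  = Table.SelectRows(AssetOnly, each
                    Text.Contains([Folder Path], "/Data/Dataset/")
                    and List.MatchesAny(Text.Split([Folder Path], "/"),
                        (part) => Text.Length(part) = 4 and (try Number.From(part) otherwise null) <> null)),
    // Parse each workbook and take its first sheet/table.
    Sheets    = Table.AddColumn(YearData, "Sheet", each Excel.Workbook([Content]){0}[Data]),
    Combined  = Table.Combine(Sheets[Sheet])
in
    Combined

New year folders (2023, 2024, ...) should then be picked up automatically on refresh, because nothing in the query names a specific year.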

Power BI incremental refresh from Azure Blob

If I have a list of many blobs in a container, can I set my RangeStart and RangeEnd parameters to be based on the modified timestamp of the CSV files? My blobs are partitioned based on the created date, but the rows can be updated historically. I need to make sure that Power BI has the latest version of each row (based on the updated_at timestamp). My plan is:
1- filter the blobs I want based on the blob prefix (virtual directory)
2- filter the blobs based on the Date modified attribute and set up a parameter for RangeStart and RangeEnd (this limits the number of blobs which need to be looked at by a great deal)
3- sort the data and drop duplicates as a final step
Would this pattern work, and does it seem efficient? My problem with using the 'updated_at' timestamp as the incremental column is that files which were created weeks or months ago might get updated (it is purely driven by customer activity). It seems like PBI would need to scan a lot of blobs just to know which rows have been updated.
I tested this out and it works on PBI Desktop, but I am not seeing the parameters show up in PBI online, which has me worried (it has been running for ~4 hours so far).
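For reference, a rough M sketch of the pattern I'm describing (the storage account, container, prefix and column names are placeholders):

let
    // Connect to the storage account and open the container.
    Source   = AzureStorage.Blobs("https://mystorageaccount.blob.core.windows.net"),
    Files    = Source{[Name = "mycontainer"]}[Data],
    // 1 - keep only blobs under the virtual directory (prefix) of interest.
    Prefixed = Table.SelectRows(Files, each Text.StartsWith([Name], "my/virtual/directory/")),
    // 2 - incremental refresh window on the blob's modified timestamp.
    InRange  = Table.SelectRows(Prefixed, each [Date modified] >= RangeStart and [Date modified] < RangeEnd),
    // Parse each CSV and promote its own header row before combining.
    Parsed   = Table.AddColumn(InRange, "Rows", each
                   Table.PromoteHeaders(Csv.Document([Content]), [PromoteAllScalars = true])),
    Combined = Table.Combine(Parsed[Rows]),
    // 3 - keep only the newest version of each row (assumes id and updated_at columns).
    Sorted   = Table.Buffer(Table.Sort(Combined, {{"updated_at", Order.Descending}})),
    Deduped  = Table.Distinct(Sorted, {"id"})
in
    Deduped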

Union Data Power BI

I'm new to Power BI and have just built a dashboard with some finance data with the following columns:
Date|Transaction ID|Transaction Amount|Item Description|Item Key
Every month I receive a new CSV file with data for the previous month. Rather than manually adding the new data to a master file each month, is there a way to simply drop the new CSV file into a folder each month and then refresh the dashboard so it automatically includes the new data (minus the headers)? If possible, I'd also like to add a column which holds the date the new file was loaded, so each new month's file is date stamped each time it's added.
Many thanks
What you can do is use a folder as a source instead of a single CSV.
That folder should contain all the CSV files.
When all your files are loaded, you only have to select the combine option.
After that you will have all the data from all the CSV files in one giant table.
Unfortunately it's not possible to add a date column with the load date.
The only way to do that is if the CSV files themselves contain the date column.
Hope it helps you.
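A minimal sketch of that folder-as-source pattern (the folder path is a placeholder):

let
    // Point at the folder that will receive each month's CSV drop.
    Source   = Folder.Files("C:\Finance\MonthlyExtracts"),
    // Ignore anything that is not a CSV.
    CsvOnly  = Table.SelectRows(Source, each Text.Lower([Extension]) = ".csv"),
    // Parse each file and promote its own header row, so headers are not
    // repeated as data rows after the combine.
    Parsed   = Table.AddColumn(CsvOnly, "Rows", each
                   Table.PromoteHeaders(Csv.Document([Content]), [PromoteAllScalars = true])),
    Combined = Table.Combine(Parsed[Rows])
in
    Combined

Dropping next month's file into the folder and refreshing is then enough; no query change is needed.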

Most efficient way to filter BigQuery rows by latest date

I am currently working on an ETL pipeline that uses BigQuery to store staging data, and then uses Dataprep to transform the data and store it in new BigQuery tables for production.
We have been experiencing issues finding the most cost-effective way to apply these transforms to a small selection of the data, typically only the last X days from the current max date in the staging data table. For example, we need to calculate the max available date in the staging data, and then retrieve all rows within the past 3 days of this date. Unfortunately we can't rely on the 'max date' in the staging data always being up to date (this data is brought in from third-party APIs of varying quality and reliability).
At first I tried applying these transforms directly in Dataprep by getting the max date, creating a comparison column using DATEDIFF, and then discarding rows more than 3 days older than this 'max date'. This proved to be very time-consuming and inefficient in terms of cost.
The next thing we tried was to filter down the data in BigQuery views, which would then be used as the initial datasets for the Dataprep flows (the data would be pre-filtered before Dataprep applies any transforms). We first tried doing this dynamically in BigQuery, like so:
WITH latest_partitiontime AS (
  SELECT _PARTITIONTIME AS pt
  FROM `{project}.{dataset}.{table}`
  GROUP BY _PARTITIONTIME
  ORDER BY _PARTITIONTIME DESC
  LIMIT 1
)
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= (SELECT pt FROM latest_partitiontime)
But upon preview of the GB/estimated cost of the query, it seems very inefficient and expensive.
The next thing we tried was hard-coding the date, which for some reason is a lot cheaper/quicker:
SELECT {columns}
FROM `{project}.{dataset}.{table}`
WHERE _PARTITIONTIME >= '2018-08-08'
So our current plan is to maintain a view for each table and update the hard-coded date in the view SQL via the Python SDK each time the staging data load successfully completes (https://cloud.google.com/bigquery/docs/managing-views).
It feels like we are potentially missing a much easier/more efficient solution to this problem, so I wanted to ask:
Is it more cost-effective to carry out this initial filtering by date in Dataprep or in BigQuery?
What is the most cost-effective way of filtering the data in the chosen product?
Are you familiar with the MERGE statement in standard SQL and the recently released clustering feature? That could actually merge your data, and you can further customize it to read only some partitions.
Example from manual:
MERGE dataset.DetailedInventory T
USING dataset.Inventory S
ON T.product = S.product
WHEN NOT MATCHED AND quantity < 20 THEN
INSERT(product, quantity, supply_constrained, comments)
VALUES(product, quantity, true, ARRAY<STRUCT<created DATE, comment STRING>>[(DATE('2016-01-01'), 'comment1')])
WHEN NOT MATCHED THEN
INSERT(product, quantity, supply_constrained)
VALUES(product, quantity, false)
Hint: you can partition by null and leverage only the 'clustering level'.
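For illustration, a rough sketch of that hint (table and column names here are placeholders, not yours): create the staging table partitioned on a DATE column that is always NULL and clustered on the business date column, so downstream filters on that column scan less data.

-- Every row lands in the NULL partition; only clustering is used for pruning.
CREATE TABLE `{project}.{dataset}.staging_clustered`
PARTITION BY fake_partition_date
CLUSTER BY event_date AS
SELECT CAST(NULL AS DATE) AS fake_partition_date, *
FROM `{project}.{dataset}.{table}`;

-- Downstream reads then filter on the clustered column:
SELECT {columns}
FROM `{project}.{dataset}.staging_clustered`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY);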

Creating a macro to subset a monthly data into smaller datasets based on date in SAS

I'm working on a large dataset which has data for a whole month and around 42 variables, and I want to create separate datasets for every day of the month.
How can I create a macro which will do this properly? The date variable is trans_date and the month is March.
Typically in SAS, you would not create separate datasets for each day. Rather, you would perform your analysis BY the date variable. That has the same effect - i.e., whatever analysis you do will be done separately for each different value in the date variable.
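For example, a minimal sketch of that BY-group approach (the dataset name and the analysis variable are placeholders; only trans_date comes from the question):

proc sort data=work.march_data;     /* placeholder dataset name      */
    by trans_date;
run;

proc means data=work.march_data;
    by trans_date;                  /* one set of output per day     */
    var amount;                     /* placeholder analysis variable */
run;

If physically separate daily datasets really are required, a macro %DO loop over the days of March could still create them, but the BY approach usually makes that unnecessary.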