SAS and Snowflake

We are moving to Snowflake and I find that running SAS code takes a lot longer than usual. This is because the character variables are coming through with very large lengths (32,767). How do I limit these variable lengths? Within Greenplum, I was able to limit the lengths using the VARCHAR() option, but this doesn't seem to work with Snowflake. Any alternatives?
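For what it's worth, a hedged sketch of two ways this is often handled on the SAS side. The libref, connection options, and column names below are illustrative, and DBMAX_TEXT= support should be confirmed for your release of SAS/ACCESS Interface to Snowflake:

libname snow snowflake server="myaccount.snowflakecomputing.com"
        database=mydb schema=public user=myuser password="xxxxxxxx"
        dbmax_text=200;          /* cap long VARCHAR columns at 200 bytes on the SAS side */

data work.customers;
    length cust_name $100 cust_city $50;          /* explicit SAS lengths before the SET */
    set snow.customers (keep=cust_name cust_city);
run;

The LENGTH-statement route truncates any value longer than the declared length, so it is only safe when you know the real data is short.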

Related

How to circumvent SPICE limitations (500 M rows) to create a QuickSight dashboard for a big data set?

My goal is to quickly & dynamically visualize a big data set (> 500 M rows) using QuickSight. To achieve quick query times, it's necessary to load all of the data into SPICE. However, AWS currently has a hard limit for the maximum number of rows that can be imported into SPICE for a single data set, which is 500 M rows. I currently don't see any option that could be used to visualize all of the data. Here are things that I already considered:
Splitting the full data set into individual QS datasets: the problem with this approach is that QuickSight requires that each visual has a single dataset as an input, so values from multiple datasets cannot be shown in the same visual. I'm aware that multiple datasets can be used within one dashboard but that would not suit the use-case of having a single plot visualizing the data.
Pivoting the table: the input table has a lot of rows, so changing the format from a long to a wide table would circumvent the SPICE row limitations. However, QuickSight doesn't seem to support using an array of columns as y-values to be plotted.
Creating a dataset per visualization: Certain visualizations can theoretically be defined using fewer values than in the original data set. For example, to create a box plot over a set of groups, we mainly need the quartile values for each of the groups to be plotted, rather than the full data set, which would allow us to stay below the SPICE limitation. However, QuickSight doesn't allow creating custom plots, such as a box plot built from pre-computed quartiles.
Currently, the only viable approach I see is to create a dashboard per user, since most users would only be interested in a subset of rows from the full data set.
Irrespective of the approach taken, unfortunately, this limitation forces us to make some compromises.
Depending on the number of users, creating a dataset per user might become a headache to manage. So, I would suggest that, if possible, you use datasets that capture groups of users (for example, by user group or by user's country).
Pivoting the table might make it harder to build some visuals. As you said, if you pivot multiple values from different rows into an array field, then you would not be able to extract these easily in analyses (you could use string functions to extract them that way, but there are limitations around this approach too).
Also creating a dataset per visualisation has maintenance overhead in that you would need to update and re-ingest the dataset most times when changing visualisations.
Some other approaches you might consider:
Aggregate multiple rows together. For example, if your dataset has multiple rows for each user within the same minute, you could aggregate all of these into one row, summing up the values within that minute (a sketch of this follows after this list). The aggregation period should be as large as possible, but keep in mind that this will affect the time granularity in your analyses/dashboards.
Prune old data. If you are more interested in recent data, then you could add a filter to only keep, say, 1 month of activity. You could then have other non-SPICE (Direct Query) datasets that do not have this restriction, but reports would be slower on older data.
Cache in an external database. You could load your data into a data warehousing database (such as AWS Redshift) and then not use SPICE in QuickSight. Of course, this will probably get more expensive.
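The per-minute aggregation above is plain GROUP BY SQL; since the rest of this page uses SAS, here is a minimal PROC SQL sketch of the idea. The activity table and its columns are hypothetical, and in practice you would run the equivalent SQL in whatever source feeds SPICE:

proc sql;
    create table activity_by_minute as
    select user_id,
           intnx('minute', event_ts, 0) as minute_ts format=datetime20.,  /* truncate the datetime to the minute */
           sum(value) as total_value,
           count(*)   as event_count
    from activity
    group by user_id, calculated minute_ts;
quit;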

How would one go about creating a due-by attribute in Redshift

I am currently trying to calculate due-by dates in a table by adding the SLA time to the time the request was created. From what I am able to understand, the way to go about this is to create a table with the work days and hours and query that table to find the due date. However, Redshift does not allow one to declare variables. I was wondering how I would go about creating a work hour table in Redshift and, if that is not possible, how I would calculate the due date by other means. Thanks!
It appears that you would like to provide a timestamp and then calculate the timestamp that is 'n work hours later', most probably taking into account certain rules such as:
Weekdays: 9am-5pm
Weekends: No Hours
Holidays: Occasional weekdays with No Hours
This could be done by creating a scalar Python UDF (see "Creating a scalar Python UDF - Amazon Redshift") that would be passed a 'start' timestamp and a number of hours, and would return the 'end' timestamp.
Please note that Scalar UDFs cannot access tables or 'call outside' of Redshift, so it would need to be self-contained.
There is code on the web that shows how to do this; see "How to find the number of hours between two dates excluding weekends and certain holidays in Python? (BusinessHours package)" on Stack Overflow. You would need to modify such code to specify the duration rather than finding the duration.
The alternate method of "creating a work hour table" would work well when trying to find the number of work hours between two timestamps, but would be a bit harder when trying to add work hours to a timestamp.

SAS EG - Individual Datasets split by date vs Single appended dataset containing all dates

This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the code requires manual updates each time it's run to ensure it's picking up the correct dates, so I would have something such as:
Data Quarters;
Set XYZ_201803
XYZ_201806
...
...
XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've approached a few different ideas and had a few sent my way and one of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so they can be read from with a simple filter on the date as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? On datasets which are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you usually only work with small sections of it. Separate datasets are common where people typically analyse individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then keeping them separate is easier in some ways than having to manage the changes between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, indexing appropriately, are all important solutions. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
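For what it's worth, a minimal sketch of that macro approach, assuming the datasets keep the XYZ_YYYYMM quarterly naming shown above; the macro name and its start/end parameters are illustrative:

%macro quarter_list(start=201803, end=202006);
    %* Emit the names XYZ_YYYYMM for every quarter from &start to &end. ;
    %local d;
    %let d = %sysfunc(inputn(&start.01, yymmdd8.));
    %do %while (%sysfunc(putn(&d, yymmn6.)) <= &end);
        XYZ_%sysfunc(putn(&d, yymmn6.))
        %let d = %sysfunc(intnx(month, &d, 3, b));
    %end;
%mend quarter_list;

data Quarters;
    set %quarter_list(start=201803, end=202006);   /* expands to XYZ_201803 ... XYZ_202006 */
run;

Each quarter you would then only change the end= value (or derive it from today's date), rather than editing the SET list by hand.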
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set created by appending the quarterly data sets is more efficient.
From a resource standpoint
You have to make sure you have a large enough disk to hold the single large table.
Have additional offline storage to hold the original pieces -- no need to clutter up the primary data disk with all the pieces.
A 2TB SSD is very fast, remarkably cheap, and low power, and can hold a table made up of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements (see the sketch below).
"... simple filter ..." is part of "Keep it Simple S****" (KISS)

Filtering data by time of the day in SAS

I am a beginner in SAS and I have a data set of traffic incidents to analyse. I want to filter the data by time of day - all incidents before 18:00:00, or incidents between 9:00:00 and 18:00:00.
I have tried to find a suitable code, but have not had any success. Could anybody help out with this? I'm using standard SAS, not Enterprise Guide.
Is it done with a WHERE statement? If so, how do I input the time?
I assume from your description you have a data set with a time variable and want to subset it using a hard-coded time of day. For this, it's easiest to use a time literal with standard WHERE processing. A time literal is a time specified in quotes followed by the T character.
For example, you can create something similar to the following that will subset the times data set but only with observations where time is earlier than 18:00:
data times_before_6pm;
set times;
where time < '18:00't; /* restrict to times of day earlier than 6pm */
run;
This assumes your times are time values and not datetime values. If they are datetime values, you'll need to extract the time portion from it (using the TIMEPART() function, which you can do in the WHERE statement).
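A minimal sketch of the other two cases mentioned above - the 9:00-18:00 range, and a datetime variable handled with TIMEPART(). The dataset and variable names (times, incidents, time, dt) are illustrative:

data times_9_to_6;
    set times;
    where time between '09:00't and '18:00't;           /* time values, 9am to 6pm */
run;

data incidents_9_to_6;
    set incidents;
    where timepart(dt) between '09:00't and '18:00't;   /* datetime values: take the time part first */
run;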
Hope this helps.

Lookup primary keys in multiple tables

The problem I'm solving has many simple solutions, but what I need is to find a way to reduce the time and memory needed for the process.
On the one side I have a table with a few hundred IDs, and on the other 40 monthly tables and counting.
Each of the monthly tables has between 500,000 and 1 million records, each with a unique ID. Each table has a few thousand variables, but I only need 10-20 of them.
I need to look up the tables to find the latest table in which a particular ID from the base table occurs, and get the variable values that I need.
The newest month's table is recalculated every day, so many IDs from previous months may occur again, which means I cannot just create an indexed dictionary (last.id and variables) once. I also can't afford to create a new dictionary based on all tables every day.
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables with the variables needed, sort ascending by ID and month, and select last.id using a data step. Then use a join or merge with the base table.
Problem: too much memory is needed to SET all the tables.
Alternatively, I used proc append in a loop. Unfortunately, this is not very time or memory efficient.
Inner join with each of the tables separately in a loop:
Low memory use, but very time consuming.
Create a dictionary based on all months besides the latest, and update it every day.
Problem: a large dictionary table.
Now I'm looking for smart concepts for how to solve this kind of problem. Maybe hash objects... but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
If someone were to write some code to generate some dummy data based on your specs, they might be able to provide a more specific answer to your question. But without sample data it's hard to know the best way without trial and error.
Instead I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (i.e. use a where clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups (see the sketch after this list).
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows > 10G. No additional SAS modules are required - this is enabled as part of Base SAS.
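On the hash-lookup point above, a minimal sketch of replacing a join with a hash object. The dataset and variable names are hypothetical: base holds the few hundred IDs, monthly is one large monthly table, and month/var1/var2 stand in for the 10-20 variables you actually need:

data matched;
    if _n_ = 1 then do;
        declare hash h(dataset: 'base');     /* load the small ID table into memory once */
        h.defineKey('id');
        h.defineDone();
    end;
    set monthly(keep=id month var1 var2);    /* keep= limits I/O to the columns you need */
    if h.check() = 0;                        /* keep only rows whose id exists in base */
run;

Because the base table only has a few hundred IDs, the hash stays tiny, and each monthly table is read just once with no sort required.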