Great Expectation profiling on SparkDF takes a long time when there are many columns - great-expectations

I need to profile data coming from snowflake in Databricks. The data is just a sample of 100 rows but containing 3k+ columns, and will eventually have more rows. When I reduce the number of columns, the profiling is done very fast but the more columns there are, the longer it gets. I tried profiling the sample and after more than 10h and I had to cancel the job.
Here is the code I use
df = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}')
df_ge = ge.dataset.SparkDFDataset(df_sf)
BasicDatasetProfiler.profile(df_ge)
You can test this with any data having a lot of columns. Is this normal or am I doing something wrong?

Basically, GE computes metrics for each columns individually, hence, it make an action (probably a collect) for each column and each metric it computes. collects are the most expensive operations you can have on spark so that is almost normal that, the more columns you have, the longer it takes.

Related

Best way to select random rows in redshift without order by

i have to select a set of rows (like 200 unique rows) from 200 million rows at once without order by and it must be efficient.
As you are experiencing sorting 200M rows can take a while and if all you want is 200 rows then this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want to select 200 rows that are random. Otherwise the sort order of the base tables and the order of reply from the Redshift slices will meaningfully skew you sample.
You can get around this by sampling down (through a random process) so a much more manageable number of rows, then sort by the random value and pick your final 200 rows. While this does need to sort rows it does it upon a significantly smaller number which will speed things up considerably.
select a, b from (
select a, b, random() as ranno
from test_table)
where ranno < .005
order by ranno
limit 200;
You start with 200M rows. Select .5% of them in the WHERE clause. Then order these 10,000 rows before selecting 200. This should speed things up and maintain the randomness of the selection.
Sampling down your data to a reasonable percentage like 10%,5%,1%,.. etc should bring your volume to a manageable size. Then you can order by the sample and choose the count of rows you need.
select * from (select *, random() as sample
from "table")
where sample < .01
order by sample limit 200
The following is an expansion on the question which I found useful for me that others might find helpful as well. In my case, I had a huge table which I could split by a key field value into smaller subsets, but even after splitting it the volume per individual subset would stay very large (10s of millions of rows) and I still needed to sample it anyway. I was initially concerned that the sampling won't work on the subset I created using With statement, but it turned out this is not the case. I compared the distribution of the sample across all different meaningful keys afterwards between the full subset (20 million) and the sample (30K) and I got almost the exact distribution which worked great. Sample code below:
With subset as (select * from "table" Where Key_field='XYZ')
select * from (select *, random() as sample
from subset) s
where s.sample < .01
order by s.sample limit 200

How to circumvent SPICE limitations (500 M rows) to create a QuickSight dashboard for a big data set?

My goal is to quickly & dynamically visualize a big data set (> 500 M rows) using QuickSight. To achieve quick query times, it's necessary to load all of the data into SPICE. However, AWS currently has a hard limit for the maximum number of rows that can be imported into SPICE for a single data set, which is 500 M rows. I currently don't see any option that could be used to visualize all of the data. Here are things that I already considered:
Splitting the full data set into individual QS datasets: the problem with this approach is that QuickSight requires that each visual has a single dataset as an input, so values from multiple datasets cannot be shown in the same visual. I'm aware that multiple datasets can be used within one dashboard but that would not suit the use-case of having a single plot visualizing the data.
Pivoting the table: the input table has a lot of rows, so changing the format from long to wide table would circumvent the SPICE row limitations. However, QuickSight doesn't seem to support using an array of columns a y-values to be plotted.
Creating a dataset per visualization: Certain visualizations can theoretically be defined using fewer values than in the original data set. For example, to create a box plot over a set of groups, we mainly need the quartile values for each of the groups to be plotted, rather than the full data set, which would allow us to be below the SPICE limitation. However, QuickSight doesn't allow creating custom plots such as creation of a box plot where quartiles are already pre-processed.
Currently, the only viable approach I see is to create a dashboard per user, since most users would only be interested in a subset of rows from the full data set.
Irrespective of the approach taken, unfortunately, this limitation forces us to do some compromises.
Depending on the number of users, creating a dataset per user might become a headache to manage. So, I would suggest that if possible you use datasets that capture groups of users (example by user group, or user's country).
Pivoting the table might make it harder to build some visuals. As you said, if you pivot multiple values from different rows into an array field, then you would not be able to extract these easily in analyses (you could use string functions and to to extract them that way but there are limitations around this approach too).
Also creating a dataset per visualisation has maintenance overhead in that you would need to update and re-ingest the dataset most times when changing visualisations.
Some other approaches you might consider:
Aggregate multiple rows together Example if your dataset has multiple rows for each user within the same minute, you could aggregate all these into 1 row and summing up values within that minute. The aggregation period should be as large as possible but keep in mind that this will affect the time granularity in your analyses/dashboards
Prune old data If you are more interested in recent data, then you could add a filter to only keep say 1 month of activity. You could then have other non-SPICE (Direct Query) datasets that do not have this restriction but reports would be slower on older data.
Cache in an external database You could load your data into some data warehousing database (such as AWS Redshift) and then not use SPICE in QuickSight. Of course, this will probably get more expensive.

SAS EG - Individual Datasets split by date vs Single appended dataset containing all dates

This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the codes require manual updates each time they're run to ensure they're picking up the correct dates, so I would have something such as:
Data Quarters;
Set XYZ_201803
XYZ_201806
...
...
XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've approached a few different ideas and had a few sent my way and one of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so they can be read from with a simple filter on the date as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? On datasets which are typically a couple of gb in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset in creating that dataset, and in using that dataset, if you use only small sections usually. Typically, separate datasets are common where people usually do analysis of individual quarters, and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats could change, if the fields can change), then having separate is easier in some ways than having to manage the change between the different periods.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, indexing appropriately, are all important solutions. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, to allow users to easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set, created by appending the quarterly data sets is more efficient.
From a resource standpoint
Have to make sure you have large enough disk to hold the single large table
Have additional off storage to hold the original pieces -- no need to clutter up the primary data disk with all the pieces.
A 2TB SSD is very fast, remarkably cheap, and low power and can contain a table comprised of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements.
"... simple filter ..." is part of "Keep it Simple S****" (KISS)

Lookup primary keys in multiple tables

the problem I'm solving has many simple solutions but what I need is to find the way to reduce the time and memory needed for the process.
On the one side I have a table with a few hundred ID's and on the other 40 monthly tables and counting.
Each of the tables has between 500 000 to 1 mln records each for unique id. Each table has few thoustand variables but i only need 10-20 of them.
I need to lookup the tables to find the latest table when particular id from base table occur and get variable values that I need.
The newest month table is being calculated every day so many id's from previous months may occur again so I cannot just create indexed dictionary (last.id and variables) once. Also I can't afford creating new dictionary based on all tables every day.
Visual description
I came up with some ideas but I need your help to find the most efficient concept:
Concatenate all monthly tables with variables needed, sort ascending ID and month, select last.id using data step. Use join or merge with base table.
Problem: too much memory needed to set all tables.
Alternatively I used proc append in loop. Unfortunately not very time and memory efficient.
Inner join with all of the tables separately in loop:
Low memory use but very time consuming.
Create dictionary based on all months besides the latest and update it every day.
Problem: Large dictionary table.
Now I'm looking for smart concepts how to solve this kind of problem. Maybe hash objects.. but how?
I would greatly appreciate it if you give me some feedback on this case.
Thank you!
If someone was to write some code to generate some dummy data based on your specs they may be able to provide a more specific answer to your question. But without sample data it's hard to know the best way without trial and error.
Instead I've paraphrased some of my old answers into a more comprehensive list of things you can check.
Below are some ways to boost performance (roughly in order of performance improvement, YMMV):
Index the fields in each table that you will be joining on or using in a where clause. Not all fields are good candidates for indexes so do a little research on how to determine this before indexing.
Reduce the number of rows as early in the process as possible (ie. use a where clause to get rid of anything you don't care about).
If the joins are still time consuming, consider replacing them with hash table lookups.
Compression. When you build the datasets make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk resulting in less disk I/O (the slowest part of querying).
If the steps are IO intensive, consider using views rather than creating temporary tables.
Make sure you are using proc append to append datasets together to reduce IO (sounds like you are, just adding this for completeness). Append the smaller dataset to the larger dataset. Alternatively use a view to 'append' them without duplicating overhead.
Limit the columns you are processing by using a keep statement (reduces IO).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20 etc...
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows > 10G. No additional SAS modules are requires - this is enabled as part of Base SAS.

Best order of joins and append for performance

I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
there are several similarly structured source tables which I first apply a date filter to (to reduce the number of rows) and sort on two fields, say a and b, then I left join each table to a table with account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do it for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.
It occurs to me that I could first append, and then do just one join, but I have no notion of which approach is faster, or if the answer is that it depends I have no notion of what it depends on (though I'd guess the size of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables here is 7, and the size of these tables ranges from 1500 to 5.2 million. 10 000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields is larger than absolutely necessary, but if so I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude
Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time consuming, consider replacing them with hash table lookups
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.
As per your comments it seems that
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results
In SAS DI studio, use a Lookup to perform the above much faster
1. Connect the 7 Input tables to a Lookup Transform (lets call them SRC 1-7)
2. The table with 5000 records is the tables on which lookup is performed on keys A and B (lets call this LKUP-1)
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster and you don't have to perform JOINs in this case as I suspect you are doing a Many-Many join which is degrading the performance in SAS DIS.