Best way to select random rows in redshift without order by - amazon-web-services

I have to select a set of rows (around 200 unique rows) from 200 million rows at once, without using ORDER BY, and it must be efficient.

As you are experiencing, sorting 200M rows can take a while, and if all you want is 200 rows this is an expense you shouldn't need to pay. However, you do need to sort on a random value if you want to select 200 rows that are truly random; otherwise the sort order of the base tables and the order of replies from the Redshift slices will meaningfully skew your sample.
You can get around this by sampling down (through a random process) to a much more manageable number of rows, then sorting by the random value and picking your final 200 rows. This still sorts rows, but on a significantly smaller set, which speeds things up considerably.
select a, b from (
    select a, b, random() as ranno
    from test_table)
where ranno < .005
order by ranno
limit 200;
You start with 200M rows, select 0.5% of them in the WHERE clause, then sort just those roughly 1,000,000 rows before selecting 200. (Use a smaller threshold if you want the intermediate set smaller still.) This should speed things up while maintaining the randomness of the selection.
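The filter-then-sort pattern generalizes beyond Redshift; here is a minimal sketch of the same two steps in plain Python (the table is simulated as a range of row ids, and all sizes are scaled down):

```python
import random

random.seed(42)       # deterministic for the sake of the example
n_total = 200_000     # stand-in for the 200M-row table

# Step 1: a cheap random filter keeps ~0.5% of rows; no sort involved.
sample = [r for r in range(n_total) if random.random() < 0.005]

# Step 2: sort only the small sample by a fresh random key, then take 200.
final = sorted(sample, key=lambda r: random.random())[:200]

print(len(final))       # 200
print(len(set(final)))  # 200 distinct rows
```

The expensive sort runs over roughly a thousand rows instead of the full table, which is exactly what the SQL above achieves with the subquery and the `ranno < .005` filter.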

Sampling your data down to a reasonable percentage (10%, 5%, 1%, etc.) should bring the volume to a manageable size. Then you can order by the sample value and select the count of rows you need.
select * from (
    select *, random() as sample
    from "table")
where sample < .01
order by sample
limit 200
The following is an expansion on the question which others might find helpful as well. In my case, I had a huge table which I could split by a key field value into smaller subsets, but even after splitting, the volume per individual subset stayed very large (tens of millions of rows) and I still needed to sample it. I was initially concerned that the sampling wouldn't work on a subset created with a WITH statement, but it turned out this is not the case. I compared the distribution of the sample across all meaningful keys between the full subset (20 million rows) and the sample (30K) and got almost exactly the same distribution, which worked great. Sample code below:
with subset as (
    select * from "table" where Key_field = 'XYZ')
select * from (
    select *, random() as sample
    from subset) s
where s.sample < .01
order by s.sample
limit 200

Related

What does this EXPLAIN query plan output mean for my Redshift query?

I ran this:
EXPLAIN select id, birth_date, ROW_NUMBER() OVER (ORDER BY 1) AS load_id from user_profile;
and I see this:
WindowAgg (cost=0.00..133833424.40 rows=30901176 width=36)
-> Seq Scan on user_profile (cost=0.00..133369906.76 rows=30901176 width=28)
What does this query plan mean?
The query plan is the execution plan that the query planner (Amazon Redshift is based on PostgreSQL) has generated for your SQL statement.
The first node is a window aggregation (WindowAgg) over the data as you're using the OVER window function to calculate a row number.
The second node is a sequential scan (Seq Scan) on the user_profile table, as you're doing a full select of the table without any filtering.
A sequential scan reads the entire table as stored on disk, since your query requires a full traversal. Even if there were a multi-column index on id & birth_date, the query engine would almost always choose a sequential scan here because you need every row (subject to the random_page_cost & enable_seqscan parameters in PostgreSQL).
The cost number is in arbitrary units, but conventionally represents the number of disk page fetches; it is split into two values with ".." as the delimiter.
The first value shows the startup cost - this is the estimated cost to return the first row. The second value shows the total cost - this is the estimated cost to return all rows.
For example, for the Seq Scan, the startup cost is 0 and the total cost is estimated to be 133369906.76.
For sequential scans, the startup cost is usually 0; there's nothing to do other than return data, so it can start returning data right away. The total cost for a node includes the cost of all its child nodes as well: here, the final total cost of 133833424.40 is the scan's cost plus the aggregation's own cost.
The rows value shows the estimated number of rows that will be returned. In this case both operations have the same value, since the aggregation applies to all rows and no filtering reduces the number of final rows.
The width value shows the estimated size in bytes of each returned row, i.e. each row will most likely be 28 bytes in length before the aggregation and 36 bytes after it.
Putting that all together, you could read the query plan as follows:
Sequential Scan on table user_profile
- will most likely start returning rows immediately
- estimated disk page fetch count of 133369906.76
- estimated 30,901,176 rows to be returned
- estimated row size of 28 bytes
Window Aggregation on data from the above operation
- will most likely start returning rows immediately
- estimated disk page fetch count of 133833424.40
- estimated 30,901,176 rows to be returned
- estimated row size of 36 bytes
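The (cost=startup..total rows=N width=W) annotation has a fixed shape, so it can be pulled apart mechanically. A small illustrative sketch (the regex is my own, not part of Redshift or PostgreSQL):

```python
import re

PLAN_LINE = "WindowAgg  (cost=0.00..133833424.40 rows=30901176 width=36)"

# cost=<startup>..<total> rows=<estimated rows> width=<bytes per row>
pattern = re.compile(
    r"cost=(?P<startup>[\d.]+)\.\.(?P<total>[\d.]+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)"
)

m = pattern.search(PLAN_LINE)
startup, total = float(m["startup"]), float(m["total"])
rows, width = int(m["rows"]), int(m["width"])

print(startup, total)  # 0.0 133833424.4
print(rows, width)     # 30901176 36
```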

Great Expectation profiling on SparkDF takes a long time when there are many columns

I need to profile data coming from Snowflake in Databricks. The data is just a sample of 100 rows but contains 3k+ columns, and will eventually have more rows. When I reduce the number of columns, the profiling finishes very fast, but the more columns there are, the longer it takes. I tried profiling the sample and after more than 10h I had to cancel the job.
Here is the code I use
df_sf = (spark.read.format('snowflake')
         .options(**sfOptions)
         .option('query', f'select * from {db_name}')
         .load())  # .load() was missing; the reader alone is not a DataFrame
df_ge = ge.dataset.SparkDFDataset(df_sf)
BasicDatasetProfiler.profile(df_ge)
You can test this with any data having a lot of columns. Is this normal or am I doing something wrong?
Basically, GE computes metrics for each column individually; hence it performs an action (probably a collect) for each column and each metric it computes. Collects are among the most expensive operations you can run on Spark, so it is almost expected that the more columns you have, the longer it takes.

Convert indexes to sortkeys Redshift

Do zonemaps exists only in memory? Or its populated in memory from disk where its stored persistently? Is it stored along with the 1MB block, or in a separate place?
We are migrating from Oracle to Redshift, and there are a bunch of indexes to cater to reporting needs. The nearest equivalent of an index in Redshift is a sortkey. For a bunch of tables, the total number of columns across all the indexes is between 15 and 20 (some are composite indexes, some are single-column). An interleaved key seems the best fit, but an interleaved sortkey cannot have more than 8 columns. A compound sortkey, on the other hand, won't be effective since the queries might not include the prefix columns.
What's the general advice in such cases: which type of sort key to use? How do you convert many indexes from an RDBMS to sort keys in Redshift?
Are high-cardinality columns such as identity columns, dates, and timestamps a poor fit for interleaved keys? Would it be the same with compound sortkeys? Any disadvantages of interleaved sortkeys to keep in consideration?
You are asking the right questions, so let's take these one at a time. First, zonemaps are located on the leader node and stored on disk, while the table data is stored on the compute nodes; they live separately from each other. The zonemaps store the min and max values of every column for every 1MB block in a table. Whether or not a column is in your sortkey list, there will be zonemap data for its blocks. When a column shows up in a WHERE clause, Redshift first compares against the zonemap data to decide if the block is needed for the query. If a block is not needed it won't be read from disk, resulting in significant performance improvements for very large tables. I call this "block rejection". A key point: this really only makes a difference on tables with 10s of millions of rows and when there are selective WHERE predicates.
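The block-rejection idea can be sketched in a few lines of Python; this is a toy model (a 4-element block stands in for Redshift's 1MB blocks, and all names are invented):

```python
# Toy model of zonemaps: per-block min/max values let the engine skip
# ("reject") blocks whose range cannot contain the predicate value.
BLOCK_SIZE = 4  # stands in for Redshift's 1MB blocks

data = list(range(32))  # a fully sorted single-column "table"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# The zonemap holds (min, max) per block.
zonemap = [(min(b), max(b)) for b in blocks]

def blocks_to_read(value):
    """Return only the blocks whose [min, max] range could hold `value`."""
    return [b for b, (lo, hi) in zip(blocks, zonemap) if lo <= value <= hi]

# A selective predicate (WHERE col = 13) touches 1 of 8 blocks.
print(len(blocks))         # 8
print(blocks_to_read(13))  # [[12, 13, 14, 15]]
```

If the column were unsorted, each block's (min, max) range would be wide, almost every block would survive the check, and the zonemap would reject nothing: that is why sortkey choice drives block rejection.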
So you have a number of reports, each of which looks at the data by different aspects - common. You want all of these to work well, right? The first thing to note is that each table can have its own sortkeys; they aren't linked. What matters is how the choice of sortkeys affects the min and max values in the zonemaps for the columns you use in WHERE clauses. With compound sortkeys you have to think about what impact later keys have on the composition of the block - not much after the 3rd or 4th key. This is greatly impacted by the ordinality of the data, but you get the idea. The good news is that sorting on one column impacts the zonemaps of all the columns, so you don't always have to have a column in the sortkey list to get the benefit.
The question of compound vs interleaved sortkeys is a complicated one, but remember you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates it can be tricky to find a mix of sortkeys that makes this happen. In general, compound sortkeys are easier to understand and have fewer table maintenance implications. You can inspect the zonemaps, see what impact your sortkey choices are having, and make informed decisions on how to adjust. If there are columns with low ordinality, put those first so that the next sortkeys can affect the overall row order and therefore make blocks with different value ranges for these later keys. For these reasons I prefer compound keys over interleaved, but there are cases where things improve with interleaved keys. When all the columns have high ordinality and are equally important, interleaved may be the right answer. I usually learn so much about the data while trying to optimize compound keys that even if I end up with interleaved keys I can make smart choices about which columns I want in the sortkeys.
Just some metrics to help in your choice. Redshift can store 200,000 row elements in a single block, and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster, so you need a lot of rows to fill enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys, you are into the weeds. (Yes, sorting can impact other aspects of the query, like joining, but these are sub-second improvements, not make-or-break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore how many rows are represented by an entry in the zonemap. This can increase block rejection, but it also increases the data read when scanning the entire table - a tradeoff you want to make sure you are winning (1 query getting faster while 10 get slower is likely not a good tradeoff).
Your question about ordinality is a good one. If I sort by a high-ordinality column first in a compound sortkey list, this sets the overall order of the rows, potentially making all other sortkeys impotent. However, if I sort by a low-ordinality column first, there is a lot of power left for the other sortkeys to change the order of the rows and therefore the zonemap contents. For example, say I have Col_A with only 100 unique values and Col_B, a timestamp with 1-microsecond resolution. If I sort by Col_B first, all the rows are likely ordered just by sorting on this column. But if I sort by Col_A first, there are lots of rows with the same value, and the later sortkey (Col_B) can order those rows. Interleaved works the same way, except that which column is "first" changes by region of the table. If I interleave-sort based on the same Col_A and Col_B (just 2 sortkeys), then half the table is sorted by Col_A first and half by Col_B first. In this example Col_A will be useless half of the time - not the best answer. Interleaved sorting just changes which column is used as the first sortkey (and second, and third, if more keys are used) throughout the table. High ordinality in a sortkey makes later sortkeys less powerful, and this is independent of sort style - interleaving just changes which columns are early and which are late by region of the table.
Because the ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates, it is common to add derived columns to tables to hold lower-ordinality versions of other columns. In the example above I might add Col_B2 to the table and have it hold just the year and month (month-truncated date) of Col_B. I would use Col_B2 in my sortkey list, but my queries would still reference Col_B. Col_B2 "roughly" sorts based on Col_B, so that Col_A can have some sorting power if it comes later in the sortkey list. This is a common reason for making data model changes when moving to Redshift.
It is also critical that "block rejecting" WHERE clauses are written against the fact table column, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute, on the leader node - it can't see through joins. Another common data model change is to denormalize some key information into the fact tables so these common WHERE predicates can be applied to the fact table and zonemaps are back in play.
Sorry for the tome, but this is a deep topic which I've spent years optimizing. I hope this is of use to you; reach out if anything isn't clear (and I hope you have the DISTKEYs sorted out already :) ).

Working with large offsets in BigQuery

I am trying to emulate pagination in BigQuery by grabbing a certain row number using an offset. It looks like the time to retrieve results steadily degrades as the offset increases until it hits ResourcesExceeded error. Here are a few example queries:
Is there a better way to use the equivalent of an "offset" with BigQuery without seeing performance degradation? I know this might be asking for a magic bullet that doesn't exist, but was wondering if there are workarounds to achieve the above. If not, if someone could suggest an alternative approach to getting the above (such as kinetica or cassandra or whatever other approach), that would be greatly appreciated.
OFFSET in systems like BigQuery works by reading and discarding all results up to the offset.
You'll need to use a column as a lower limit so the engine can start directly from that part of the key range; the engine can't efficiently seek to an arbitrary midpoint of a result set.
For example, let's say you want to view taxi trips by rate code, pickup, and drop off time:
SELECT *
FROM [nyc-tlc:green.trips_2014]
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
LIMIT 100
If you did this via OFFSET 100000, it takes 4s and the first row is:
pickup_datetime: 2014-01-06 04:11:34.000 UTC
dropoff_datetime: 2014-01-06 04:15:54.000 UTC
rate_code: 1
If, instead of the offset, I use those date and rate values, the query takes only 2.9s:
SELECT *
FROM [nyc-tlc:green.trips_2014]
WHERE rate_code >= 1
AND pickup_datetime >= "2014-01-06 04:11:34.000 UTC"
AND dropoff_datetime >= "2014-01-06 04:15:54.000 UTC"
ORDER BY rate_code ASC, pickup_datetime ASC, dropoff_datetime ASC
limit 100
So what does this mean? Rather than letting the user specify result-number ranges (e.g., show new rows starting at 100,000), have them specify the range in a more natural form (e.g., show rides that started on January 6th, 2014).
If you want to get fancy and REALLY need to let the user specify actual row numbers, you can make it a lot more efficient by calculating row ranges in advance: query everything once and remember which row number falls at the start of each hour of each day (8,760 values), or even each minute (525,600 values). You could then use this to pick an efficient starting point: look up the closest day/minute for a given row range (e.g. in Cloud Datastore), then convert the user's query into the more efficient version above.
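The keyset approach can be demonstrated with sqlite3 (the table, column names, and data are made up; BigQuery's engine differs, but the access pattern is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (rate_code INTEGER, pickup_datetime TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1 + i % 3, f"2014-01-06 00:00:{i:05d}") for i in range(1000)],
)

# OFFSET pagination: the engine sorts, then produces and discards 500 rows.
offset_page = conn.execute("""
    SELECT rate_code, pickup_datetime FROM trips
    ORDER BY rate_code, pickup_datetime
    LIMIT 10 OFFSET 500
""").fetchall()

# Keyset pagination: restart directly from a known key, nothing discarded.
# (Seeded with the first row of the page above to show equivalence; in
# practice you'd carry over the last row of the previous page and use >.)
first_rate, first_pickup = offset_page[0]
keyset_page = conn.execute("""
    SELECT rate_code, pickup_datetime FROM trips
    WHERE (rate_code, pickup_datetime) >= (?, ?)
    ORDER BY rate_code, pickup_datetime
    LIMIT 10
""", (first_rate, first_pickup)).fetchall()

print(keyset_page == offset_page)  # True
```

Note that keyset pagination only works cleanly when the ORDER BY key is unique (here the timestamps are distinct); with duplicate keys you need a tiebreaker column.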
As already mentioned by Dan, you need to introduce a row number, but ROW_NUMBER() OVER () across the whole table exceeds resources. This basically means you have to split up the work of counting rows:
decide on a few partitions, as evenly distributed as possible
count the rows of each partition
take a cumulative sum of the partition sizes so you know where each partition's row numbering starts
save the new table with the row count column for later use
As partitions I used EXTRACT(month FROM pickup_datetime) as it distributes nicely
WITH
temp AS (
SELECT
*,
-- cumulative sum of partition sizes so we know when to start counting rows here
SUM(COALESCE(lagged,0)) OVER (ORDER BY month RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) cumulative
FROM (
-- lag partition sizes to next partition
SELECT
*,
LAG(qty) OVER (ORDER BY month) lagged
FROM (
-- get partition sizes
SELECT
EXTRACT(month FROM pickup_datetime) month,
COUNT(1) qty
FROM
`nyc-tlc.green.trips_2014`
GROUP BY
1)) )
SELECT
-- cumulative sum = last row of former partition, add to new row count
cumulative + ROW_NUMBER() OVER (PARTITION BY EXTRACT(month FROM pickup_datetime)) row,
*
FROM
`nyc-tlc.green.trips_2014`
-- import cumulative row counts
LEFT JOIN
temp
ON
(month= EXTRACT(month FROM pickup_datetime))
Once you have saved it as a new table, you can use the new row column to query without losing performance:
SELECT
*
FROM
`project.dataset.your_new_table`
WHERE
row BETWEEN 10000001
AND 10000100
Quite a hassle, but it does the trick.
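The bookkeeping in that query is easy to misread, so here are the same steps in plain Python with made-up partition sizes:

```python
from itertools import accumulate

# Step 1: partition sizes, e.g. row counts per pickup month.
sizes = {1: 400, 2: 350, 3: 250}  # month -> qty (illustrative numbers)
months = sorted(sizes)

# Steps 2-3: lag the sizes by one partition and take a cumulative sum,
# so each partition knows how many rows precede it.
lagged = [0] + [sizes[m] for m in months[:-1]]
cumulative = dict(zip(months, accumulate(lagged)))

# Step 4: a row's global number is the count of rows before its
# partition plus its ROW_NUMBER() within the partition.
def global_row(month, row_number_in_partition):
    return cumulative[month] + row_number_in_partition

print(cumulative)        # {1: 0, 2: 400, 3: 750}
print(global_row(3, 1))  # 751
```

This mirrors the LAG/SUM OVER pair in the SQL: month 3 starts at global row 751 because 400 + 350 rows precede it.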
Why not export the resulting table into GCS?
It will automatically split tables into files if you use wildcards, and this export only has to be done one time, instead of querying every single time and paying for all the processing.
Then, instead of serving the result of the call to the BQ API, you simply serve the exported files.

Best order of joins and append for performance

I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
There are several similarly structured source tables. I first apply a date filter to each (to reduce the number of rows) and sort on two fields, say a and b; then I left join each table to an account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do that for temporary work tables in SAS DI). After each of these joins completes, I append the resulting tables into one dataset.
It occurs to me that I could append first and then do just one join, but I have no notion of which approach is faster, or, if the answer is "it depends", what it depends on (though I'd guess the sizes of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables is 7, and their sizes range from 1,500 to 5.2 million rows; 10,000 rows is typical. The number of columns is 25. Each of these tables is joined with the same table, which has about 5,000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields are larger than absolutely necessary, but if so I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude
Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time consuming, consider replacing them with hash table lookups
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.
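The hash-table-lookup suggestion replaces sorting and merging with a single pass over each source table; a sketch of the idea in Python (all names and data are invented):

```python
# The ~5,000-row account table fits in memory, so build a hash table
# keyed on the join fields (a, b) once, then probe it per source row.
accounts = [
    {"a": 1, "b": "x", "acct_name": "alpha"},
    {"a": 2, "b": "y", "acct_name": "beta"},
]
lookup = {(r["a"], r["b"]): r["acct_name"] for r in accounts}

def left_join(source_rows):
    """Left-join a source table against the lookup; no sorting required."""
    out = []
    for row in source_rows:
        enriched = dict(row)
        enriched["acct_name"] = lookup.get((row["a"], row["b"]))  # None if no match
        out.append(enriched)
    return out

source = [{"a": 1, "b": "x", "amt": 10}, {"a": 3, "b": "z", "amt": 5}]
joined = left_join(source)
print(joined[0]["acct_name"])  # alpha
print(joined[1]["acct_name"])  # None
```

Whether you join each of the 7 tables and then append, or append first and join once, the lookup is probed once per row either way; the saving comes from skipping the sorts entirely.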
As per your comments, it seems that:
1. There are 7 input source tables
2. These 7 source tables are joined to 1 table
3. The results are appended
In SAS DI Studio, use a Lookup to perform the above much faster:
1. Connect the 7 input tables to a Lookup transform (let's call them SRC 1-7)
2. The table with 5,000 records is the one the lookup is performed against on keys A and B (let's call it LKUP-1)
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster, and you don't have to perform joins at all; I suspect you are doing a many-to-many join, which is what is degrading performance in SAS DIS.