I am new to RedShift and just experimenting at this stage to help with table design.
We have a very simple table with about 6 million rows and 2 integer fields.
Both integer fields are in the sort key but the plan has a warning - "very selective query filter".
The STL_Alert_Event_Log entry is:
'Very selective query filter:ratio=rows(61)/rows_pre_user_filter(524170)=0.000116'
The query we are running is:
select count(*)
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891 and universeid = 300901
Our Table DDL is:
CREATE TABLE public.LargeNumberofRowswithUniKey
(
benchmarkid INTEGER NOT NULL DISTKEY,
UniverseID INTEGER NOT NULL
)
SORTKEY
(
benchmarkid,UniverseID
);
We have also run the following commands on the table:
Vacuum full public.LargeNumberofRowswithUniKey;
Analyze public.LargeNumberofRowswithUniKey;
The screenshot of the plan is here: [Query Plan Image][1]
My expectation was that the multi-column sort key (benchmarkid, UniverseID), and the fact that both columns are part of the filter predicate, would ensure that the design was optimal for the sample query. This does not seem to be the case, hence the red warning symbol in the attached image. Can anyone shed light on this?
Thanks
George
Update 2017/09/07
I have some more information that may help:
If I run a much simpler query which just filters on the first column of the sort key:
select r.benchmarkid
from LargeNumberofRowswithUniKey r
where r.benchmarkid = 291891
This results in 524,170 rows being scanned, according to the actual query plan from the console. When I look at the blocks using STV_BLOCKLIST (a query sketch is shown below the block listing), the relevant blocks that might be required to satisfy my query are:
|slice|col|tbl |blocknum|num_values|minvalue|maxvalue|
| 1| 0|346457| 4| 262085| 291881| 383881|
| 3| 0|346457| 4| 262085| 291883| 344174|
| 0| 0|346457| 5| 262085| 291891| 344122|
So shouldn't there be 786,255 rows scanned (3 x 262,085) instead of 524,170 (2 x 262,085) as listed in the plan?
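A query along these lines against STV_BLOCKLIST produces a listing like the one above (346457 is the table id from the output; the min/max filter on the benchmarkid column is illustrative):
select slice, col, tbl, blocknum, num_values, minvalue, maxvalue
from stv_blocklist
where tbl = 346457      -- table id of LargeNumberofRowswithUniKey
  and col = 0           -- benchmarkid, the first sort key column
  and minvalue <= 291891
  and maxvalue >= 291891
order by slice, blocknum;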
The "very selective filter" warning is returned when the rows selected vs rows scanned ratio is less than 0.05 i.e. a relatively large number of rows are scanned compared to the number of rows actually returned. This can be caused by having a large number of unsorted rows in a table, which can be resolved by running a vacuum. However, as you're already doing that I think this is happening because your query is actually very selective (you're selecting a single combination of benchmarkid and universeid) and so you can probably ignore this warning.
Side-observation: If you are always selecting values by using both benchmarkid and UniverseID, you should probably use DISTSTYLE EVEN.
The reason for this is that a benchmarkid DISTKEY would distribute the data between slices based on benchmarkid. All the values for a given benchmarkid would be on the same slice. If your query always provides a benchmarkid in the query, then the query only utilizes one slice.
On the other hand, with DISTSTYLE EVEN every slice can participate in the query, making it more efficient (for queries with WHERE benchmarkid = xxx).
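As a sketch, the same table could be declared with DISTSTYLE EVEN while keeping the compound sort key (column definitions copied from the DDL above):
CREATE TABLE public.LargeNumberofRowswithUniKey
(
    benchmarkid INTEGER NOT NULL,
    UniverseID INTEGER NOT NULL
)
DISTSTYLE EVEN
COMPOUND SORTKEY (benchmarkid, UniverseID);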
A general rule of thumb is:
Use DISTKEY for fields commonly used in JOIN or GROUP BY
Use SORTKEY for fields commonly used in WHERE
Related
I have a table with a lot of partitions (something that we're working on reducing)
When I query:
SELECT * FROM mytable LIMIT 10
I get:
"HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mytable' can potentially read more than 1000000 partitions"
Why isn't the "LIMIT 10" part of the query sufficient for Athena to return a result without reading more that 1 or 3 partitions ?
ANSWER:
During the query planning phase, Athena attempts to list all partitions potentially needed to answer the query.
Since Athena doesn't know which partitions actually contain data (i.e. which ones are not empty), it adds all partitions to the list.
Athena plans a query and then executes it. During planning it lists the partitions and all the files in those partitions. However, it does not know anything about the files, how many records they contain, etc.
When you say LIMIT 10 you're telling Athena you want at most 10 records in the result, and since you don't have any grouping or ordering you want 10 arbitrary records.
However, during the planning phase Athena can't know which partitions have files in them, or how many of those files it will need to read to find 10 records. Without listing the partition locations it can't know they're not all empty, and without reading the files it can't know those aren't all empty either.
Therefore Athena first has to get the list of partitions, then list each partition's location on S3, even if you say you only want 10 arbitrary records.
In this case there are so many partitions that Athena short-circuits and says that you probably didn't mean to run this kind of query. If the table had fewer partitions, Athena would execute the query and each worker would read as little as possible to return 10 records and then stop, but each worker would still produce 10 records, because a worker can't assume that other workers would return any records. Finally, the coordinator picks the 10 records out of all the results from all workers to return as the final result.
LIMIT works on the display operation only, if I am not wrong. So the query will still read everything but only display 10 records.
Try to limit the data using a WHERE condition; that should solve the issue.
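For example, assuming the table is partitioned on a column called dt (a placeholder for your real partition column), a query like this lets Athena prune down to a single partition:
select * from mytable
where dt = '2021-09-01'   -- dt is a placeholder for your partition column
limit 10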
I think Athena's workers try to read the maximum number of partitions (relative to the partition size of the table) to get that random chunk of data, and stop when the query is fulfilled (which, in your case, is the specification of the limit).
In your case, it doesn't even start executing that process because of the number of partitions involved. Therefore, if Athena is not planning your random data selection query, you have to explicitly plan it and hand it over to the execution engine.
Something like:
select * from mytable
where partition_column in (
    -- pick an arbitrary handful of partition values first;
    -- Athena requires a literal here, so use a fixed number rather than an expression
    select partition_column from mytable limit 10
)
limit 100
I have 2 tables to join, and it's a LEFT JOIN. Below are the two conditions and how my pipeline is working.
The job runs in batch mode, it is all user data, and we want to process it in Google Dataflow.
Day 1:
Table A: 5000000 Records. (Size 3TB)
Table B: 200 Records. (Size 1GB)
Both tables were joined through a side input, where Table B's data was taken as the side input, and it was working fine.
Day 2:
Table A: 5000010 Records. (Size 3.001TB)
Table B: 20000 Records. (Size 100GB)
On the second day my pipeline slowed down, because the side input uses a cache and my cache size got exhausted when the size of Table B increased.
So I tried using CoGroupByKey, but Day 1 data processing was pretty slow, with a log message about having 10,000+ values on a single key.
So is there any more performant way to perform the join when hot keys are introduced?
It is true that the performance can drop precipitously once table B no longer fits into cache, and there aren't many good solutions. The slowdown in using CoGroupByKey is not solely due to having many values on a single key, but also the fact that you're now shuffling (aka grouping) Table A at all (which was avoided when using a side input).
Depending on the distribution of your keys, one possible mitigation could be to process your hot keys into a path that does the side-input joining as before, and your long-tail keys into a CoGBK. This could be done by producing a truncated TableB' as a side input, and your ParDo would attempt to look up the key, emitting to one PCollection if it was found in TableB' and another if it was not [1]. One would then pass this second PCollection to a CoGroupByKey with all of TableB, and flatten the results.
[1] https://beam.apache.org/documentation/programming-guide/#additional-outputs
I want to design a phone book application in which each contact can have multiple numbers. There are two DB designs:
use contact foreign key in each number.
store numbers in an ArrayField inside each contact.
Which solution is more performant in production, and why?
Thanks in advance.
If you build a GIN index on the array column, and Django writes queries in such a way that they can use that index, then the read performance will be quite similar between the two.
It is very unlikely that the performance difference should be the driving factor behind this choice. For example, do you need more info attached to the phone number than just the number, such as when it was added, when it was last used, whether it is a mobile phone or something else, etc.?
The array column should be faster because it only has to consult one index and table, rather than two of each. Also, it will be more compact, and so more cacheable.
On the other hand, the statistical estimates for your array column will have a problem when estimating rare values, which you are likely to have here, as no phone number is likely to be shared by a large number of people. This misestimate could have devastating results on your query performance. For example, in a little test, overestimating the number of rows by many thousandfold caused the planner to launch parallel workers for a single-row query, making it about 20-fold slower than with parallelization turned off, and 10 times slower than the foreign-key representation, which doesn't suffer from the estimation problem.
For example:
create table contact as
    select md5(floor(random()*50000000)::text) as name,
           array_agg(floor(random()*100000000)::int) as phones
    from generate_series(1,100000000) f(x)
    group by name;
vacuum analyze contact;
create index on contact using gin (phones );
explain analyze select * from contact where phones #> ARRAY[123456];
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=3023.30..605045.19 rows=216167 width=63) (actual time=0.668..8.071 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on contact (cost=2023.30..582428.49 rows=90070 width=63) (actual time=0.106..0.110 rows=1 loops=3)
Recheck Cond: (phones #> '{123456}'::integer[])
Heap Blocks: exact=2
-> Bitmap Index Scan on contact_phones_idx (cost=0.00..1969.25 rows=216167 width=0) (actual time=0.252..0.252 rows=2 loops=1)
Index Cond: (phones #> '{123456}'::integer[])
Planning Time: 0.820 ms
Execution Time: 8.137 ms
You can see that it estimates there will be 216,167 rows, but in fact there are only 2. (For convenience, I used ints rather than the text field you would probably use for phone numbers, but this doesn't change anything fundamental.)
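For comparison, a foreign-key layout along these lines (a sketch, not the exact tables used in the test above) avoids the estimation problem, because the planner keeps ordinary per-column statistics on the phone column:
create table contact_fk (
    contact_id bigserial primary key,
    name       text
);
create table phone_number (
    contact_id bigint references contact_fk (contact_id),
    phone      int   -- int to mirror the array test; a text column in a real phone book
);
create index on phone_number (phone);

-- a lookup by number goes through the btree index on phone_number
select c.*
from contact_fk c
join phone_number p on p.contact_id = c.contact_id
where p.phone = 123456;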
If this is really vital to you, then you should do the test and see, using your own data and your own architecture. It will depend on what does and does not fit in memory, what kinds of queries you are doing (do you ever look up numbers in bulk? Join them to other tables besides the immediately discussed foreign key?), and maybe how your driver/library handle columns/parameters with array types.
My table is 500 GB with 8+ billion rows, INTERLEAVED SORTED by 4 keys.
One of the keys has a big skew (680+). On running a VACUUM REINDEX, it's taking very long, about 5 hours for every billion rows.
When I track the vacuum progress it says the following:
SELECT * FROM svv_vacuum_progress;
table_name | status | time_remaining_estimate
-----------------------------+--------------------------------------------------------------------------------------+-------------------------
my_table_name | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383) | 0m 0s
I am wondering how long it will be before it finishes, as it is not giving any time estimate either. It's currently processing partition 1761... Is it possible to know how many partitions there are in a certain table? Note that these seem to be lower-level storage partitions internal to Redshift.
These days, it is recommended that you should not use Interleaved Sorting.
The sort algorithm places a tremendous load on the VACUUM operation and the benefits of Interleaved Sorts are only applicable for very small use-cases.
I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.
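A sketch of that change via a deep copy (the column list and sort columns below are placeholders for your real ones):
CREATE TABLE my_table_name_compound (
    event_date DATE,       -- placeholder columns; copy your real definitions
    key_a      BIGINT,
    key_b      BIGINT
)
COMPOUND SORTKEY (event_date, key_a);

INSERT INTO my_table_name_compound SELECT * FROM my_table_name;
ALTER TABLE my_table_name RENAME TO my_table_name_interleaved;
ALTER TABLE my_table_name_compound RENAME TO my_table_name;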
The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.
Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.
If your vacuum is running slow, you probably don't have enough free space on the cluster. I suggest you temporarily double the number of nodes while you do the vacuum.
You might also want to think about changing how your schema is set up. It’s worth going through this list of redshift tips to see if you can change anything:
https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/
The way we recovered to the previous state was to drop the table and restore it from a backup snapshot taken before the VACUUM REINDEX.
3 options, on a table of events that are inserted with a timestamp.
Which query is faster/better?
Select a,b,c,d,e.. from tab1 order by timestamp desc limit 100
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc
Select top 100 a,b,c,d,e.. from tab1 order by timestamp desc limit 100
When you ask a question like that, EXPLAIN is helpful. Just add this keyword at the beginning of your query and you will see the query plan. In cases 1 and 2 the plans will be absolutely identical. These are just variations of SQL syntax; the internal SQL interpreter should produce the same query plan, according to which the requested operations are physically performed.
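For example, prefixing the first variant from the question:
explain
select a, b, c, d, e from tab1 order by timestamp desc limit 100;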
More about EXPLAIN command here: EXPLAIN in Redshift
You can get the result by running these queries on a sample dataset. Here are my observations:
Type 1: 5.54s, 2.42s, 1.77s, 1.76s, 1.76s, 1.75s
Type 2: 5s, 1.77s, 1s, 1.75s, 2s, 1.75s
Type 3: Is an invalid SQL statement, as it combines TOP with LIMIT (Redshift does not allow both in the same query)
As you can observe, the results are the same for both the queries as both undergo internal optimization by the query engine.
Apparently both TOP and LIMIT do a similar job, so you shouldn't be worrying about which one to use.
More important is the design of your underlying table, especially if you are using WHERE and JOIN clauses. In that case, you should carefully choose your SORTKEY and DISTKEY, which will have much more impact on the performance of Amazon Redshift than a simple syntactic difference like TOP/LIMIT.
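For example, a sketch of such a design for the events table (customer_id is an illustrative join column, and event_ts stands in for the timestamp column from the question):
CREATE TABLE tab1 (
    a           INTEGER,
    b           INTEGER,
    c           INTEGER,
    d           INTEGER,
    e           INTEGER,
    customer_id BIGINT,      -- illustrative column used in JOINs
    event_ts    TIMESTAMP    -- column used in WHERE / ORDER BY
)
DISTKEY (customer_id)
SORTKEY (event_ts);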