Amazon Redshift vacuum reindex - amazon-web-services

I am running a script that picks tables for VACUUM REINDEX based on the interleaved_skew value from svv_interleaved_columns, which represents the skew ratio of the interleaved columns (interleaved_skew > 1.4), as mentioned in the AWS guide.
An interleaved_skew value of 1.00 indicates that all the rows are in sorted order and no reindex is required.
Now that I have run a VACUUM REINDEX on a table with 8 GB of data, I expect the interleaved_skew value to go down, but it is behaving unexpectedly and sometimes even increases. Since my script picks tables for VACUUM REINDEX based on interleaved_skew, and the value is not going down to 1.00, the same tables keep being picked and reindexed over and over, which is eating up most of my time.
I expect that once a table has gone through VACUUM REINDEX, and no new data flows into it, that table should not be put through VACUUM REINDEX again, as there should be no skew left.
But the tables are being picked again.
Any explanation of the stv_interleaved_counts table, and of how and when the values in svv_interleaved_columns change, would help me greatly.
Thanks in advance.
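For reference, the check runs along the lines of the skew query suggested in the AWS guide; a minimal sketch (joining stv_tbl_perm only to resolve table names):

-- Tables whose interleaved columns exceed the skew threshold.
select tbl as table_id,
       stv_tbl_perm.name as table_name,
       col,
       interleaved_skew,
       last_reindex
from svv_interleaved_columns, stv_tbl_perm
where svv_interleaved_columns.tbl = stv_tbl_perm.id
  and interleaved_skew > 1.4;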

Please have a look at our "AnalyzeVacuumUtility" on GitHub. It may provide all the functionality you are looking for.
As far as interleaved sort keys go, I recommend this style of sort key only for large tables that are not regularly updated. Compound sort keys will perform better in most circumstances.
Please review our "Advanced Table Design Playbook: Compound and Interleaved Sort Keys" to help with choosing the right style.
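To make the two styles concrete, here is a minimal sketch (table and column names are hypothetical):

-- Compound: cheap, well-ordered scans when queries filter on the
-- leading column(s) of the key.
create table events_compound (
    brand_id   int,
    ts         timestamp,
    event_name varchar(64)
)
compound sortkey (brand_id, ts);

-- Interleaved: weights each key column equally, but costs far more
-- to maintain via VACUUM REINDEX.
create table events_interleaved (
    brand_id   int,
    ts         timestamp,
    event_name varchar(64)
)
interleaved sortkey (brand_id, ts);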

Related

Is it safe to truncate ACT_RU_METER_LOG table in Camunda BPM?

The ACT_RU_METER_LOG table contains 10 million rows. I want to upgrade Camunda from 7.10.0 to 7.17, and as part of the upgrade there are a few ALTER TABLE statements on the mentioned table. As expected, these ALTER TABLE statements take a very long time, hence I am wondering if I can truncate the table. I am aware that the metrics can be disabled, but the existing data has to be cleaned up explicitly.
Thanks in advance.

I would like to know about BigQuery output destination tables

I am new to Google Cloud Platform and am using BigQuery scheduled queries. I would like to know one point about writing the output of a SELECT statement into a separate destination table.
If the query ends with an ORDER BY, is the result written to the destination table in that order, as-is?
I would like the destination table of the scheduled query to be sorted by id (ORDER BY id). Is that possible?
Sorry for the rudimentary question, but thank you in advance.
BigQuery tables don't guarantee any particular order of their records. The exception is clustering, which physically sorts the data in the background, but even that does not guarantee a sorted output of a query. You have to use ORDER BY to get a guaranteed sorted output; that means you need a key by which you can sort.
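In practice that means sorting at read time rather than relying on the stored order of the destination table; a minimal sketch (dataset and table names are hypothetical):

-- The ORDER BY here is what guarantees the sorted result;
-- the physical order of the destination table does not.
select *
from my_dataset.scheduled_output
order by id;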

Redshift DISTSTYLE KEY. Deciding what's the best column to define as KEY

Well, I recently got into this area of Redshift, trying to optimize the disk usage and performance of my database, and having read lots of AWS material on the topic, I still have some doubts.
First of all, my database structure: per schema, I have 3 master tables with 3 different IDs. These are currently DISTSTYLE ALL tables, being small in size.
Each master table has a different number of IDs:
the date table --> largest one (#1 most joined)
the store table --> medium one (#3 most joined)
the item table --> smallest one (#2 most joined)
Then I have my core table, which holds the needed combinations of these IDs to display additional information about them. Based on my knowledge, this table should use DISTSTYLE KEY. So which of the 3 IDs should I select as my DISTKEY?
What are the criteria for this decision? I understand that for joins I need to look at the sort key; that is understood and has been defined on ID_date, because it is the most joined table. So now, what about the distribution of this table across nodes?
I'm sorry if I'm rambling; I don't want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!
You'll find the best advice in Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.
However, my rule of thumb is:
The DISTKEY should be the column most used in JOINs between tables
The SORTKEY should be the column most used in WHERE statements
Use DISTSTYLE ALL for small lookup tables
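Applied to your case, a minimal sketch might look like this (column names and types are hypothetical, and I'm assuming ID_date is both your most-joined and most-filtered column):

-- Core fact table: distribute and sort on the most-used ID.
create table core_table (
    id_date  int not null,
    id_store int not null,
    id_item  int not null,
    value    bigint
)
distkey (id_date)
sortkey (id_date);

-- Small lookup tables are cheap to replicate to every node.
create table store_master (
    id_store   int,
    store_name varchar(100)
)
diststyle all;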

Are Redshift system tables immutable and well ordered?

Redshift system tables only store a few days of logging data, so periodically backing up rows from these tables is a common practice to collect and maintain a proper history. To find new rows added to the system logs I need to check against my backup tables on either query (number) or execution time.
According to an answer on How do I keep more than 5 days' worth of query logs?, we can simply select all rows with query > (select max(query) from log). That answer is unreferenced and assumes that query is inserted sequentially.
My question, in two parts - hoping for references or code-as-proof - is:
are query (identifiers) expected to be inserted sequentially, and
are system tables, e.g. stl_query, immutable or unchanging?
Assuming that we can't verify or prove both of the above, what is the right strategy for backing up the system tables?
I am wary of this because I fully expect long-running queries to complete after many other queries have started and completed.
I know the query identifier is generated at query submit time, because I can monitor in-progress queries. Therefore it is completely expected that a long-running query=1 may complete after query=2. If the stl_query table is immutable, then query=1 will be inserted after query=2, and the max(query) logic is flawed.
Alternatively, if query=1 is inserted into stl_query at run time, then the row must be updated upon completion (with end time, duration, etc.). That would require me to do an upsert into the backup table.
I think the stl_query table is indeed immutable; it would seem that it's only written to after a query finishes.
Here is why I think that. First off, I ran this query on a cluster with running queries:
select count(*) from stl_query where endtime is null
This returns 0. My hunch is that you'll probably see the same thing on your side.
To be doubly sure, I also ran this query:
select count(*) from stv_inflight i
inner join stl_query q on q.query = i.query
This also returns zero (while I did have queries inflight), which seems to confirm that queries are only logged in stl_query when they have finished executing and are not updated.
That said, I would rewrite the query that inserts into your history table like this:
insert into admin.query_history (
select * from stl_query
where query not in (select query from admin.query_history)
)
That way, you'll always insert any records you don't have in the history table.

Redshift Query taking too much time

In Redshift, our queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift, and it is getting difficult to understand the query plan well enough to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
    date_trunc('day', ti) AS date,
    count(distinct deviceID) AS COUNT
FROM
    live_events
WHERE
    brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
    1
Primary key
brandID
Interleaved Sort Keys
We have set the following columns as interleaved sort keys:
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here are some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
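As a sketch, recreating the table with the suggested compound sort key might look like this (the column list and types are assumed, since the full DDL wasn't shown):

-- Hypothetical recreation with a compound sort key; queries that
-- filter on brandID and ti can then skip most disk blocks.
create table live_events_sorted (
    brandID    int,
    deviceID   varchar(64),
    event_name varchar(64),
    ti         timestamp
)
compound sortkey (brandID, ti);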
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.
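For example (a sketch with an assumed column list; EVEN is also what you get when no DISTKEY is declared):

-- DISTSTYLE EVEN spreads rows round-robin across slices, so every
-- slice participates in scans like the query above.
create table live_events_even (
    brandID  int,
    deviceID varchar(64),
    ti       timestamp
)
diststyle even;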
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
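For example:

-- Re-sorts the table and reclaims space from deleted rows.
-- For an interleaved sort key, VACUUM REINDEX is needed instead.
vacuum full live_events;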
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.
Sometimes the issue could be due to locks being acquired by other processes. You can refer to: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
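A quick way to check for this (STV_LOCKS is the system table that tracks current table locks):

-- Lists locks currently held; a long-held lock from another
-- session can block your query.
select * from stv_locks;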
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury, you should store the date in the format you need in an additional column.
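A sketch of what that could look like (the event_date column is hypothetical):

-- Backfill a pre-truncated date column once (for new data,
-- populate it during the load instead).
alter table live_events add column event_date date;
update live_events set event_date = trunc(ti);

-- Then group by it directly, avoiding the per-row date_trunc.
select event_date, count(distinct deviceID)
from live_events
group by event_date;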