In Redshift, is it possible to pause/resume a VACUUM FULL job?

I'm working on an exceptionally large table for which, due to a data issue, I have to re-insert data for a couple of historical dates. After the insertion, I wanted to run a manually triggered VACUUM FULL operation. Unfortunately, the VACUUM FULL operation on that table takes several days to complete. Since Redshift only allows one VACUUM operation at a time, that also means the other, smaller tables cannot get their daily VACUUM until the large table's VACUUM is done.
My question is: is there a way to pause a VACUUM operation on that large table to give some room for VACUUM-ing the smaller tables? Does terminating a VACUUM operation reset it, or will re-running the VACUUM command resume the operation from the last successful state?
Sorry, I'm trying to learn more about how the VACUUM process works in Redshift but I haven't been able to find much info on it. Some explanation/docs in your answer would be really appreciated as well.
Note: I did try to perform a deep copy as mentioned in the official docs. However, the table is too large to copy in one go, so that's not an option.
Thanks!

As far as I know there is no way to pause a (manual) vacuum. However, vacuum runs in "passes" where some vacuum work is done and partial results are committed. This allows ongoing work to progress while the vacuum is running. If you terminate the vacuum midway, the previously committed blocks preserve the partial work; some work will be lost for the current pass, and the restarted vacuum will need to scan the entire table to figure out where to start. Last I knew this works, but you will lose some progress with each termination.
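For reference, a minimal sketch of how one might locate and terminate a running vacuum session (svv_vacuum_progress, stv_recents, and pg_terminate_backend are standard Redshift system objects, but treat the exact filtering as an assumption):

-- Check what the vacuum is currently doing
SELECT * FROM svv_vacuum_progress;

-- Find the PID of the session running the VACUUM
SELECT pid, user_name, starttime, query
FROM stv_recents
WHERE status = 'Running' AND query ILIKE 'vacuum%';

-- Terminate that session; vacuum passes that already committed are kept
SELECT pg_terminate_backend(<pid>);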
If you manage the update/consistency issues, a deep copy can be a faster way to go. You don't have to do this in a single pass - you can do it in parts. You need the space to store the second version of the table, but you won't need the space to sort the whole table in one go. For example, if you have a table with, say, 10 years of data, you can insert the first year into the new table (sorted, of course), then the second, and so on (see the sketch below). There may be some partially empty blocks at the boundaries, but these are easy to fix up with a DELETE ONLY vacuum (or just wait for auto-vacuum to do it).
If you can do it, the deep copy method will be faster, as it doesn't need to maintain consistency or play nice with other workloads.
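As an illustration of the year-by-year deep copy described above, here is a minimal sketch (the table, column, and date values are made up; adapt the DDL and ranges to your own schema):

-- my_table / event_date are placeholders for your table and date/sort column
CREATE TABLE my_table_new (LIKE my_table);

-- Load one year at a time so each chunk is written in sorted order
INSERT INTO my_table_new
SELECT * FROM my_table
WHERE event_date >= '2014-01-01' AND event_date < '2015-01-01'
ORDER BY event_date;
-- ...repeat for each subsequent year...

-- Tidy up the partially filled blocks at the year boundaries
VACUUM DELETE ONLY my_table_new;

-- Swap the tables once all years are loaded
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_new RENAME TO my_table;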

Related

Q: AWS Redshift: ANALYZE COMPRESSION 'Table Name' - How to save the result set into a table / Join to other Table

I am looking for a way to save the result set of an ANALYZE COMPRESSION into a table / join it to another table, in order to automate compression scripts.
Is it possible, and how?
You can always run ANALYZE COMPRESSION from an external program (a bash script is my go-to), read the results, and store them back in Redshift with inserts. This is usually the easiest and fastest way when I run into this type of "no route from leader to compute node" issue on Redshift. These are often one-off scripts that don't need automation or support.
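For example, the external script essentially just issues something like the following two steps (the results table and its columns here are hypothetical; ANALYZE COMPRESSION returns one row per column with a suggested encoding and estimated reduction):

-- Run from the external script and capture the result set
ANALYZE COMPRESSION my_schema.my_table;

-- compression_results is a hypothetical table created beforehand
INSERT INTO my_schema.compression_results
    (table_name, column_name, suggested_encoding, est_reduction_pct)
VALUES ('my_table', 'my_column', 'zstd', 42.5);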
If I need something programmatic I'll usually write a Lambda function (or possibly a Python program on an EC2 instance). Fairly easy, and execution speed is high, but it does require an external tool, and some users are not happy running things outside of the database.
If it needs to be completely Redshift-internal, then I make a procedure that keeps the results of the leader-only query in a cursor and then loops on the cursor, inserting the data into a table. Basically the same as reading it out and then inserting it back in, but the data never leaves Redshift. This isn't too difficult, but it is slow to execute: looping on a cursor and inserting 1 row at a time is not efficient. The last one of these I did took 25 seconds for 1,000 rows. It was fast enough for the application, but if you need to do this on 100,000 rows you will be waiting a while. I've never done this with ANALYZE COMPRESSION before, so there could be some issue, but it's definitely worth a shot if this needs to be SQL-initiated.
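A rough sketch of that cursor-loop pattern, using pg_table_def as a stand-in leader-node query (as noted, I haven't verified that ANALYZE COMPRESSION specifically can be consumed this way; the target table is hypothetical):

CREATE OR REPLACE PROCEDURE copy_leader_results()
AS $$
DECLARE
    rec RECORD;
BEGIN
    -- The SELECT runs on the leader node; each INSERT is a separate compute-node statement
    FOR rec IN SELECT tablename, "column", type
               FROM pg_table_def
               WHERE schemaname = 'public'
    LOOP
        -- leader_query_results is a hypothetical pre-created target table
        INSERT INTO leader_query_results (table_name, column_name, column_type)
        VALUES (rec.tablename, rec."column", rec.type);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CALL copy_leader_results();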

First runs of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating this problem, query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight-lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).
One of the things we tried was increasing the number of nodes, although we didn't expect it to solve anything, seeing as query compilation is a single-node operation anyway. It did not solve anything, but it was a fun diversion for a bit.
As noted, this is a known issue; however, anywhere it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".
Is there anything we can do to speed up the compilation process or otherwise deal with this? So far the best solution that's been found is "pre-run every query you might expect to run in a given day on a schedule", which is... not great, especially given how little we know about how the query cache works.
There are three things to consider:
1. The first run of any query causes the query to be "compiled" by Redshift. This can take 2-20 seconds depending on how big the query is. Subsequent executions of the same query use the same compiled code; even if the WHERE clause parameters change, there is no re-compile.
2. Data is marked as "hot" when a query has been run against it, and is cached in Redshift memory. You cannot (reliably) clear this manually in any way EXCEPT a restart of the cluster.
3. Redshift has a result cache: depending on your Redshift parameters (it is enabled by default), Redshift will quickly return the same result for the exact same query if the underlying data has not changed. If your query includes current_timestamp or similar, this will stop it from caching. It can be turned off with SET enable_result_cache_for_session TO OFF;.
Considering your issue, you may need to run some example queries to pre-compile, or redesign your queries (I guess you have some dynamic query building going on that changes the shape of the query a lot).
In my experience, more nodes will increase the compile time. This process happens on the leader node, not the compute nodes, and is made more complex by having more compute nodes to consider.
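If you want to confirm that compilation, rather than data or result caching, is what the first run is paying for, the svl_compile system view shows per-segment compile times; a sketch (substitute the query id of your slow run):

-- compile = 1 means the segment had to be compiled; starttime/endtime show how long it took
SELECT query, segment, locus, starttime, endtime, compile
FROM svl_compile
WHERE query = <query_id>
ORDER BY segment;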
The query is probably not actually running a second time -- rather, Redshift is just returning the same result for the same query.
This can be tested by turning off the cache. Run this command:
SET enable_result_cache_for_session TO OFF;
Then, run the query twice. It should take the same time for each execution.
The result cache is great for repeated queries. Rather than being disappointed that the first execution is 'slow', be happy that subsequent cached queries are 'fast'!

Amazon Redshift large table VACUUM REINDEX issue

My table is 500 GB with 8+ billion rows, INTERLEAVED SORTED by 4 keys.
One of the keys has a big interleaved skew (680+). Running a VACUUM REINDEX is taking very long, about 5 hours for every billion rows.
When I track the vacuum progress, it says the following:
SELECT * FROM svv_vacuum_progress;
table_name | status | time_remaining_estimate
-----------------------------+--------------------------------------------------------------------------------------+-------------------------
my_table_name | Vacuum my_table_name sort (partition: 1761 remaining rows: 7330776383) | 0m 0s
I am wondering how long it will be before it finishes, as it is not giving any time estimate either. It's currently processing partition 1761... is it possible to know how many partitions there are in a certain table? Note these seem to be lower-level, storage-layer partitions within Redshift.
These days, it is recommended that you not use interleaved sorting.
The sort algorithm places a tremendous load on the VACUUM operation, and the benefits of interleaved sorts are only applicable for very small use-cases.
I would recommend you change to a compound sort on the fields most commonly used in WHERE clauses.
The most efficient sorts are those involving date fields that are always incrementing. For example, imagine a situation where rows are added to the table with a transaction date. All new rows have a date greater than the previous rows. In this situation, a VACUUM is not actually required because the data is already sorted according to the Date field.
Also, please realise that 500 GB is actually a LOT of data. Doing anything that rearranges that amount of data will take time.
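A minimal sketch of moving to a compound sort key via a deep copy (the table, columns, and key choices are made up for illustration):

-- New table with a compound sort key on the columns most used in WHERE clauses
CREATE TABLE my_table_compound (
    transaction_date DATE,
    customer_id      BIGINT,
    amount           DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (transaction_date, customer_id);

-- Deep copy (this can also be done in date-range chunks for very large tables)
INSERT INTO my_table_compound
SELECT transaction_date, customer_id, amount
FROM my_table
ORDER BY transaction_date, customer_id;

ALTER TABLE my_table RENAME TO my_table_interleaved_old;
ALTER TABLE my_table_compound RENAME TO my_table;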
If your vacuum is running slowly, you probably don't have enough space on the cluster. I suggest you temporarily double the number of nodes while you do the vacuum.
You might also want to think about changing how your schema is set up. It's worth going through this list of Redshift tips to see if you can change anything:
https://www.dativa.com/optimizing-amazon-redshift-predictive-data-analytics/
The way we recovered to the previous state was to drop the table and restore it from a backup snapshot taken before the VACUUM REINDEX.

Is there any possibility that deleted data can be recovered in SAS?

I am working in a production environment. Yesterday I accidentally made permanent changes to the Master dataset while trying to get a sample out of it into the work directory. Unfortunately there is no backup for this data.
I wanted to execute this:
Data work.facttable;
Set Master.facttable(obs=10);
run;
instead of this, accidentally I executed the following:
data Master.facttable;
set Master.facttable(obs=10);
run;
You can clearly see what sort of blunder it was!
The fact table had been built up over nearly 2 years; it was 250 GB with millions of rows. Now it has 10 rows and is 128 KB :(
I am very worried about how to recover the data. It is crucial for the business teams. I have no idea how to proceed to get it back.
I know that SAS doesn't support any rollback options or recovery process. We don't use the audit trail method either.
I am just wondering if there is any way we can still get the data back in spite of all this.
Details: the dataset is assigned to the SPDE engine. I checked the data files (.dpf) but they had all disappeared except yesterday's data file, which is 128 KB.
You appear to have exhausted most of the simple options already:
Restore from external/OS-level backup
Restore from previous generation via the gennum= data set option (only available if the genmax option was set to 1+ when creating the dataset).
Restore from SAS audit trail
I think that leaves you with just 2 options:
Rebuild the dataset from the underlying source(s), if you still have them.
Engage the services of a professional data recovery company, who might be able to recover some or all of the deleted files, depending on the complexity of your storage environment, and how much of the original 250GB has since been overwritten.
Either way, it sounds as though this may prove to have been an expensive mistake.

Why does vacuum reindex on an already sorted table take forever?

I have a stuck 'vacuum reindex' operation and am wondering what may be the cause of its taking such a long time.
I recently changed the schema of one of my Redshift tables by creating a new table with the revised schema and deep copying the data using 'select into' (see Performing a Deep Copy). My basic understanding was that after deep copying the table, the data should be sorted according to the table's sort keys. The table has an interleaved 4-column sort key. Just to make sure, after the deep copy I ran the 'interleaved skew' query (see Deciding When to Reindex), and the results were 1.0 for all columns, meaning no skew.
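(The skew check referred to above is along the lines of the following query against svv_interleaved_columns; 'my_table' is a placeholder for the actual table name.)

SELECT tbl, col, interleaved_skew, last_reindex
FROM svv_interleaved_columns
WHERE tbl = (SELECT table_id FROM svv_table_info WHERE "table" = 'my_table');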
I then ran 'vacuum reindex' on the table, which should be really quick since the data is already sorted. However, the vacuum is still running after 30 hours. During the vacuum I examined svv_vacuum_progress periodically to check the operation's status. The 'sort' phase finished after ~6 hours, but now the 'merge' phase has been stuck at 'increment 23' for >12 hours.
What could be the cause of the long vacuum operation, given that the data is supposed to already be sorted by the deep copy operation? Should I expect these times for future vacuum operations too? The table contains ~3.5 billion rows and its total size is ~200 GB.