Is it possible to run queries in parallel in Redshift? - amazon-web-services

I wanted to do an insert and an update at the same time in Redshift. For this I insert the data into a temporary table, delete the updated entries from the original table, and then insert all the new and updated entries. Because the statements run concurrently, entries are sometimes duplicated: the delete starts before the insert has finished. With a very long sleep between each operation this does not happen, but then the script is very slow. Is it possible to run queries in parallel in Redshift?
Hope someone can help me, thanks in advance!

You should read up on MVCC (multi-version concurrency control) and transactions. Redshift can only run one query at a time per session, but that is not the issue. You want to COMMIT both changes at the same time (COMMIT is the action that makes changes visible to others). You do this by wrapping your SQL statements in a transaction (BEGIN ... COMMIT) and executing them in the same session (it's not clear whether you are using multiple sessions). All changes made within the transaction are visible only to the session making them UNTIL COMMIT, at which point ALL the changes made by the transaction become visible to everyone at the same moment.
A few things to watch out for: if your connection is in AUTOCOMMIT mode, you may break out of your transaction early and COMMIT partial results. Also, while you are inside a transaction your view of the source tables is frozen, so you see consistent data for the duration of the transaction and that data isn't allowed to change underneath you. This means that if multiple sessions are changing table data, you need to be careful about the order in which they COMMIT so that each one is presented with the right version of the data.

begin transaction;
<run the queries>
end transaction;
In this specific case do this:
create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';
begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;
insert into target
select * from stage;
end transaction;
drop table stage;
See:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

Related

Which one is more performant in redshift - Truncate followed with Insert Into or Drop and Create Table As?

I have been working on AWS Redshift and kind of curious about which of the data loading (full reload) method is more performant.
Approach 1 (Using Truncate):
Truncate the existing table
Load the data using Insert Into Select statement
Approach 2 (Using Drop and Create):
Drop the existing table
Load the data using Create Table As Select statement
We have been using both in our ETL, but I am interested in understanding what's happening behind the scene on AWS side.
In my opinion - Drop and Create Table As statement should be more performant as it reduces the overhead of scanning/handling associated data blocks for table needed in Insert Into statement.
Moreover, truncate in AWS Redshift does not reseed identity columns - Redshift Truncate table and reset Identity?
Please share your thoughts.
Redshift operates on 1MB blocks as the base unit of storage and coherency. When changes are made to a table, it is these blocks that are "published" for all to see once the changes are committed. A table is just a list (data structure) of the block ids that compose it, and there can be many versions of a table in flight at any time if it is being changed while others are viewing it.
For the sake of this question let's assume that the table in question is large (contains a lot of data), which I expect is true. These two statements end up doing a common action - unlinking and freeing all the blocks in the table. The blocks are where all the data lives, so you'd think the speed of the two would be the same, and on idle systems they are close. Both automatically commit their results, so the command doesn't complete until the work is done. In this idle-system comparison I've seen DROP run faster, but then you need to CREATE the table again, which takes time to recreate the table's data structure - though that can be done in a transaction block, so do we need to include the COMMIT? The bottom line is that on an idle system these two approaches are quite close in runtime, and when I last measured them for a client the DROP approach was a bit faster. I would advise you to read on before making your decision.
However, in the real world Redshift clusters are rarely idle, and under load these two statements can behave quite differently. DROP requires exclusive control over the table since it does not run inside of a transaction block. All other uses of the table must be closed (committed or rolled back) before DROP can execute. So if you perform this DROP/recreate procedure on a table others are using, the DROP statement will be blocked until all those uses complete, which can take an indeterminate amount of time. For ETL processing on "hidden" or "unpublished" tables the DROP/recreate method can work, but you need to be really careful about which other sessions are accessing the table in question.
TRUNCATE does run inside a transaction but performs a commit upon completion, which means it won't be blocked by others working with the table. It's just that one version of the table is full (for those who were looking at it before TRUNCATE ran) and one version is completely empty. The data structure of the table has a version for each session that has it open, and each session sees the blocks (or lack of blocks) that correspond to its version. I suspect that managing these data structures and propagating the changes through the commit queue is the bookkeeping that slows TRUNCATE down slightly. The upside of this bookkeeping is that TRUNCATE will not be blocked by other sessions reading the table.
The deciding factor in choosing between these approaches is often not performance; it is which one has the locking and coherency behavior that will work in your solution.

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, 'datatable');
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, a trigger AFTER INSERT will fire after every single line of insert. If it inserts stuff into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow down things horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the SQLite databases in multi-threaded mode and handle the insert/update/query/delete operations on separate threads. I prefer to transfer the query results into an STL container for processing.
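Regarding question 1 above: a trigger stored in a storage database's own schema does persist in the file and fires when that database is later attached, provided the trigger only touches tables in the same database (SQLite does not let a trigger reach across attached databases). A small sketch with Python's sqlite3, using made-up names modeled on the tables above:

```python
import os
import sqlite3
import tempfile

store_path = os.path.join(tempfile.mkdtemp(), "store.db")

# Create the storage DB with its own registry trigger, then close it.
store = sqlite3.connect(store_path)
store.executescript("""
    CREATE TABLE datatable (idx INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE regtable (idx INTEGER, tablename TEXT);
    CREATE TRIGGER reg_insert AFTER INSERT ON datatable
    BEGIN
        INSERT INTO regtable VALUES (NEW.idx, 'datatable');
    END;
""")
store.close()

# Later, a manager connection attaches the file; the stored trigger still fires.
manager = sqlite3.connect(":memory:")
manager.execute("ATTACH DATABASE ? AS store", (store_path,))
manager.execute("INSERT INTO store.datatable VALUES (1, 'x')")
rows = manager.execute("SELECT idx, tablename FROM store.regtable").fetchall()
print(rows)  # [(1, 'datatable')]
```

On question 2: an AFTER INSERT trigger does fire once per inserted row, but its work happens inside the same transaction as the triggering statement, so it does not multiply the number of transactions; the per-row cost is the extra write itself.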

Django, Insertion during schema migration

Sometimes schemamigration takes long time, e.g several fields are added/removed/edited. What happens if you try to make an insertion to a table while running a schema migration to change the structure of this table?
I'm aware the changes are not persistent until the entire migration is done.
That behavior depends on the underlying database and on what the migration is actually doing. In PostgreSQL, for example, DDL operations are transactional, and an insert into the table will block until the DDL transaction completes. To see this, in one psql window do something like:
create table kvpair (id serial, key character varying (50), value character varying(100));
begin;
alter table kvpair add column rank integer;
At this point, do not commit the transaction. In another psql window, try:
insert into kvpair (key, value) values ('fruit', 'oranges');
You'll see it will block until the transaction in the other window is committed.
Admittedly, that's a contrived example - the granularity of what gets locked will depend on the operation (DDL changes, indexing, DML updates). In addition, statements submitted for execution may have assumed constraints that the migration changes. For example, change the alter table statement above to include not null: on commit, the insert fails.
In my experience, it's always a good idea to consider the "compatibility" of schema changes and to minimize changes that dramatically restructure large tables. Careful attention here can minimize downtime by keeping to schema changes that can run against a live system.

Database polling, prevent duplicate fetches

I have a system whereby a central MSSQL database keeps in a table a queue of jobs that need to be done.
For the reasons that processing requirements would not be that high, and that there would not be a particularly high frequency of requests (probably once every few seconds at most) we made the decision to have the applications that utilise the queue simply query the database whenever one is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query(ies) involved and returns a job ID. The client application then fetches the job information by querying by ID and sets the job as handled.
Performance is fine; the only snag we have felt is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs), two clients pick up the same job.
As a way of solving this problem, I was suggesting having the initial stored procedure that runs "tag" the record it pulls with the time and date. The stored procedure, when querying for records, will only pull records where this "tag" is a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second instance will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
declare @ReadID uniqueidentifier;
declare @BatchSize int = 20; -- make this a parameter of your procedure
set @ReadID = NEWID();

UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = @ReadID -- your UNIQUEIDENTIFIER field
FROM (
    SELECT TOP (@BatchSize) Id
    FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
    WHERE HasBeenRead IS NULL
    ORDER BY [Id]
) AS t1
WHERE tbl.Id = t1.Id;

SELECT Id, OtherCol, OtherCol2
FROM tbl WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE HasBeenRead = @ReadID;
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx
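The tag-with-a-unique-marker pattern is portable beyond T-SQL. Here is a minimal sketch of the same idea in Python with sqlite3 (table and column names are invented for the demo; uuid4 plays the role of NEWID(), and SQLite has no locking hints, so the UPDLOCK/READPAST parts have no equivalent here):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, batch_id TEXT)")
conn.executemany("INSERT INTO jobs (payload) VALUES (?)",
                 [("job%d" % i,) for i in range(5)])

def claim(conn, batch_size):
    """Atomically tag a batch of unclaimed rows, then fetch exactly that batch."""
    batch = str(uuid.uuid4())  # unique marker per fetch (the NEWID() analogue)
    conn.execute("""
        UPDATE jobs SET batch_id = ?
        WHERE id IN (SELECT id FROM jobs WHERE batch_id IS NULL
                     ORDER BY id LIMIT ?)
    """, (batch, batch_size))
    return conn.execute("SELECT id, payload FROM jobs WHERE batch_id = ?",
                        (batch,)).fetchall()

first = claim(conn, 2)
second = claim(conn, 2)
print(first)   # [(1, 'job0'), (2, 'job1')]
print(second)  # [(3, 'job2'), (4, 'job3')]
```

Because the tag is written before the rows are read back, two overlapping fetches can never return the same job: each row belongs to exactly one batch id.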

Can data be changed in my tables that will break batch updating while it modifies multiple rows?

I have an update query that is based on the result of a select, typically returning more than 1000 rows.
If some of these rows are updated by other queries before this update can touch them could that cause a problem with the records? For example could they get out of sync with the original query?
If so would it be better to select and update individual rows rather than in batch?
If it makes a difference, the query is being run on Microsoft SQL Server 2008 R2
Thanks.
No.
A table cannot be updated while something else is in the process of updating it.
Databases use concurrency control and have ACID properties to prevent exactly this type of problem.
I would recommend reading up on isolation levels. The default in SQL Server is READ COMMITTED, which means that other transactions cannot read data that has been updated but not committed by a given transaction.
This means that data returned by your select/update statement will be an accurate reflection of the database at a moment in time.
If you were to change your database to READ UNCOMMITTED, then you could get into a situation where the data from your select/update is out of sync.
If you're selecting first, then updating, you can use a transaction:
BEGIN TRAN
-- your select WITHOUT LOCKING HINT
-- your update based upon select
COMMIT TRAN
However, if you're updating directly from a select, then, no need to worry about it. A single transaction is implied.
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot
WHERE mytable.id = mot.id -- join on your key column
BUT... do NOT do the following, otherwise you may run into a deadlock:
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot WITH (NOLOCK)
WHERE mytable.id = mot.id -- join on your key column
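The select-then-update-in-one-transaction pattern from the answer looks like this when driven from client code. A sketch with Python's sqlite3 (names invented; SQL Server via pyodbc would be structurally the same, with BEGIN TRAN / COMMIT TRAN):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

conn.execute("BEGIN")  # BEGIN TRAN: the select and update see one consistent state
ids = [r[0] for r in
       conn.execute("SELECT id FROM mytable WHERE value >= 20").fetchall()]
conn.executemany("UPDATE mytable SET value = value + 1 WHERE id = ?",
                 [(i,) for i in ids])
conn.execute("COMMIT")  # COMMIT TRAN

result = conn.execute("SELECT value FROM mytable ORDER BY id").fetchall()
print(result)  # [(10,), (21,), (31,)]
```

Inside the transaction, no other session's committed change can slip between the select and the update from this session's point of view, which is the guarantee the answer describes.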