Can data be changed in my tables that will break batch updating while it modifies multiple rows? - concurrency

I have an update query that is based on the result of a select, typically returning more than 1000 rows.
If some of these rows are updated by other queries before this update can touch them could that cause a problem with the records? For example could they get out of sync with the original query?
If so would it be better to select and update individual rows rather than in batch?
If it makes a difference, the query is being run on Microsoft SQL Server 2008 R2
Thanks.

No.
A Table cannot be updated while something else is in the process of updating it.
Databases use concurrency control and have ACID properties to prevent exactly this type of problem.

I would recommend reading up on isolation levels. The default in SQL Server is READ COMMITTED, which means that other transactions cannot read data that has been updated but not committed by a given transaction.
This means that data returned by your select/update statement will be an accurate reflection of the database at a moment in time.
If you were to change your database to READ UNCOMMITTED then you could get into a situation where the data from your select/update is out of synch.

If you're selecting first, then updating, you can use a transaction
BEGIN TRAN
-- your select WITHOUT LOCKING HINT
-- your update based upon select
COMMIT TRAN
However, if you're updating directly from a select, then, no need to worry about it. A single transaction is implied.
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot
BUT... do NOT do the following, otherwise you'll run into a deadlock
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot WITH (NOLOCK)

Related

Snowflake update statements locks the entire table and queues other update statements

We have a use case where we need to execute multiple update statements(each updating different set of rows) on the same table in snowflake without any latency caused by queuing time of the update queries. Currently, a single update statements takes about 1 min to execute and all the other concurrent update statements are locked (about 40 concurrent update statements) and are queued, so the total time including the wait time is around 1 hour but the expected time is around 2 mins( assuming all update statements execute at the same time and the size of warehouse supports 40 queries at the same time without any queueing).
What is the best solution to avoid this lock time? We've considered the following two options :-
Make changes in the application code to batch all the update statements and execute as one query - Not possible for our use case.
Have a separate table for each client (each update statement, updates rows for different clients in the table) - This way, all the update statements will be executing in separate table and there won't be any locks.
Is the second approach the best way to go or is there any other workaround that would help us reduce the latency of the queueing time?
The scenario is expected to happen since Snowflake locks table during update.
Option 1 is ideal to scale in data model. But since you can't make it happen, you can go by option 2.
You can also put all the updates in one staging table and do upsert in bulk - Delete and Insert instead of update. Check if you can afford the delay.
But if you ask me, snowflake should not be used for atomic update. It has to be an upsert (delete and insert and that too in bulk). Atomic updates will have limitations. Try going for a row based store like Citus or MySQL if your usecase allows this.

Is it possible to run queries in parallel in Redshift?

I wanted to do an insert and update at the same time in Redshift. For this I am inserting the data into a temporary table, removing the updated entries from the original table and inserting all the new and updated entries. Since Redshift uses concurrency, sometimes entries are duplicated, because the delete started before the insert was finished. Using a very large sleep for each operation this does not happen, however the script is very slow. Is it possible to run queries in parallel in Redshift?
Hope someone can help me , thanks in advance!
You should read up on MVCC (multi-version coherency control) and transactions. Redshift can only only run one query at a time (for a session) but that is not the issue. You want to COMMIT both changes at the same time (COMMIT is the action that causes changes to be apparent to others). You do this by wrapping your SQL statement in a transaction (BEGIN ... COMMIT) and executed in the same session (not clear if you are using multiple sessions). All changes made within the transaction will only be visible to the session making the changes UNTIL COMMIT when ALL the changes made by the transaction will be visible to everyone at the same moment.
A few things to watch out for - if your connection is in AUTOCOMMIT mode then you may break out of your transaction early and COMMIT partial results. Also when you are working in transactions your source table information is unchanging (so you see consistent data during your transaction) and this information isn't allowed to change for you. This means that if you have multiple sessions changing table data you need to be careful about the order in which they COMMIT so the right version of data is presented to each other.
begin transaction;
<run the queries in parallel>
end transaction;
In this specific case do this:
create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';
begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;
insert into target
select * from stage;
end transaction;
drop table stage;
See:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

Informatica : taking very long time when doing insert

i have one mapping which just includes one source table and one target table. The source table has 100 columns and around 33xxxx records, i need to use this tool to insert to the target table and the logic is insert only. The version of informatica is 9.6.1 version and Database is SQL Server 2012.
After i run the workflow, it takes 5x/s to insert. the speed is too slow. I think it may be related to the number of columns
Can anyone help me how to increase the speed?
Thanks a lot
I think i know the reason why it happened. It is there are two fields which are ntext field in this table. That's why it takes very long time.
You can try the below options
1) Use bulk option for 'Target Load type' attribute in session if the target table doesn't have any indexes or keys on it
2) If there is any SQL override in the SOURCE QUALIFIER try to tune the query
3) Find for 'BUSY' in the session log and note down the busy percentages of each thread. Based on the thread percentages you will be able to identify the exact thread which is taking more time (Reader, Transformation, Writer)
4) Try to use informatica partitions through which you can achieve parallel processing.
Thanks and Regards,
Raj
Consider following points to increase the performance:
Increase the "commit interval" size in the session level properties.
Use the "bulk load" in session level properties.
You can also use the "partitioning" in session level, to do this you need partitioning license.
If your source is a database and you are doing sql override in source qualifier transformation , then you can also use the "Hints" for increasing the performan

Django, Insertion during schema migration

Sometimes schemamigration takes long time, e.g several fields are added/removed/edited. What happens if you try to make an insertion to a table while running a schema migration to change the structure of this table?
I'm aware the changes are not persistent until the entire migration is done.
That behavior depends on the underlying database and what the actual migration is doing. For example, PostgreSQL DDL operations are transactional; an insert to the table will block until a DDL transaction completes. For example, in one psql window, do something like this:
create table kvpair (id serial, key character varying (50), value character varying(100));
begin;
alter table kvpair add column rank integer;
At this point, do not commit the transaction. In another psql window, try:
insert into kvpair (key, value) values ('fruit', 'oranges');
You'll see it will block until the transaction in the other window is committed.
Admittedly, that's a contrived example - the granularity of what's locked will depend on the operation (DDL changes, indexing, DML updates). In addition, any statements that get submitted for execution may have assumed different constraints. For example, change the alter table statement above to include not null. On commit, the insert fails.
In my experience, it's always a good thing to consider the "compatibility" of schema changes, and minimize changes that will dramatically restructure large tables. Careful attention can help you minimize downtime, by performing schema changes that can occur on a running system.

Database polling, prevent duplicate fetches

I have a system whereby a central MSSQL database keeps in a table a queue of jobs that need to be done.
For the reasons that processing requirements would not be that high, and that there would not be a particularly high frequency of requests (probably once every few seconds at most) we made the decision to have the applications that utilise the queue simply query the database whenever one is needed; there is no message queue service at this time.
A single fetch is performed by having the client application run a stored procedure, which performs the query(ies) involved and returns a job ID. The client application then fetches the job information by querying by ID and sets the job as handled.
Performance is fine; the only snag we have felt is that, because the client application has to query for the details and perform a check before the job is marked as handled, on very rare occasions (once every few thousand jobs), two clients pick up the same job.
As a way of solving this problem, I was suggesting having the initial stored procedure that runs "tag" the record it pulls with the time and date. The stored procedure, when querying for records, will only pull records where this "tag" is a certain amount of time, say 5 seconds, in the past. That way, if the stored procedure runs twice within 5 seconds, the second instance will not pick up the same job.
Can anyone foresee any problems with fixing the problem this way or offer an alternative solution?
Use a UNIQUEIDENTIFIER field as your marker. When the stored procedure runs, lock the row you're reading and update the field with a NEWID(). You can mark your polling statement using something like WITH(READPAST) if you're worried about deadlocking issues.
The reason to use a GUID here is to have a unique identifier that will serve to mark a batch. Your NEWID() call is guaranteed to give you a unique value, which will be used to prevent you from accidentally picking up the same data twice. GETDATE() wouldn't work here because you could end up having two calls that resolve to the same time; BIT wouldn't work because it wouldn't uniquely mark off batches for picking up or reporting.
For example,
declare #ReadID uniqueidentifier
declare #BatchSize int = 20; -- make a parameter to your procedure
set #ReadID = NEWID();
UPDATE tbl WITH (ROWLOCK)
SET HasBeenRead = #ReadID -- your UNIQUEIDENTIFIER field
FROM (
SELECT TOP (#BatchSize) Id
FROM tbl WITH(UPDLOCK ROWLOCK READPAST )
WHERE HasBeenRead IS null ORDER BY [Id])
AS t1
WHERE ( tbl.Id = t1.Id)
SELECT Id, OtherCol, OtherCol2
FROM tbl WITH(UPDLOCK ROWLOCK READPAST )
WHERE HasBeenRead = #ReadID
And then you can use a polling statement like
SELECT COUNT(*) FROM tbl WITH(READPAST) WHERE HasBeenRead IS NULL
Adapted from here: https://msdn.microsoft.com/en-us/library/cc507804%28v=bts.10%29.aspx