What's the right pattern for unique data in columns? - concurrency

I have a table, [File], with the following schema:
CREATE TABLE [dbo].[File]
(
    [FileID] [int] IDENTITY(1,1) NOT NULL,
    [Name] [varchar](256) NOT NULL,
    CONSTRAINT [PK_File] PRIMARY KEY CLUSTERED
    (
        [FileID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The idea is that FileID is used as the key for the table and Name is the fully qualified path that represents a file.
What I've been trying to do is create a stored procedure that checks whether the Name is already in use; if so, it uses that record, otherwise it creates a new record.
But when I stress-test the code with many threads executing the stored procedure at once, I get different errors.
This version of the code will create a deadlock and throw a deadlock exception on the client.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION xact_File_Create
    SET XACT_ABORT ON
    SET NOCOUNT ON
    DECLARE @FileID int
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    IF @@ROWCOUNT = 0
    BEGIN
        INSERT INTO [dbo].[File]([Name])
        VALUES (@Name)
        SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    END
    SELECT * FROM [dbo].[File]
    WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
With this version of the code, I end up getting rows with the same data in the Name column.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
BEGIN TRANSACTION xact_File_Create
    SET NOCOUNT ON
    DECLARE @FileID int
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    IF @@ROWCOUNT = 0
    BEGIN
        INSERT INTO [dbo].[File]([Name])
        VALUES (@Name)
        SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    END
    SELECT * FROM [dbo].[File]
    WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
I'm wondering what the right way to do this type of action is. In general, this is a pattern I'd like to reuse wherever the data must be unique in a single column or across multiple columns while another column is used as the key.
Thanks

If you are searching heavily on the Name field, you will probably want it indexed (as unique, and maybe even clustered if this is the primary search field). As you don't use the @FileID from the first select, I would just SELECT COUNT(*) FROM [File] WHERE [Name] = @Name and see if it is greater than zero (this prevents SQL from retaining any locks on the table from the search phase, as no columns are selected).
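For instance, the probe itself could be as simple as this sketch (the race between the check and the insert still needs the handling discussed below):
-- existence check that reads no columns from the row itself
DECLARE @ct int
SELECT @ct = COUNT(*) FROM [dbo].[File] WHERE [Name] = @Name
IF @ct = 0
    INSERT INTO [dbo].[File]([Name]) VALUES (@Name)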
You are on the right course with the SERIALIZABLE level, as your action will impact subsequent queries' success or failure with the Name being present. The reason the version without that SET causes duplicates is that two selects ran concurrently and both found no record, so both went ahead with inserts (which creates the duplicates).
The deadlock with the prior version is most likely due to the lack of an index making the search process take a long time. When you load the server down in a SERIALIZABLE transaction, everything else has to wait for the operation to complete. The index should make the operation fast, but only testing will tell if it is fast enough. Note that you can respond to the failed transaction by resubmitting: in real-world situations, hopefully the load will be transient.
EDIT: By making your table indexed, but not using SERIALIZABLE, you end up with three cases:
Name is found; the ID is captured and used. (Common)
Name is not found; the insert proceeds as expected. (Common)
Name is not found, but the insert fails because another exact match was posted within milliseconds of the first. (Very rare)
I would expect this last case to be truly exceptional, so using an exception to capture it (sketched below) is preferable to engaging SERIALIZABLE, which has serious performance consequences.
If you really do expect it to be common for the same new name to be posted within milliseconds of another, then use a SERIALIZABLE transaction in conjunction with the index. It will be slower in the general case, but faster when these collisions are found.
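Putting that together, a minimal sketch of the index-plus-exception approach might look like this (the index name is illustrative, and THROW requires SQL Server 2012 or later; use RAISERROR on older versions):
CREATE UNIQUE NONCLUSTERED INDEX IX_File_Name ON [dbo].[File]([Name])
GO
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @@ROWCOUNT = 0
BEGIN
    BEGIN TRY
        INSERT INTO [dbo].[File]([Name]) VALUES (@Name)
        SET @FileID = SCOPE_IDENTITY()
    END TRY
    BEGIN CATCH
        -- 2601/2627 are the duplicate-key errors: another session inserted
        -- the same name first, so fall back to reading the row it created
        IF ERROR_NUMBER() NOT IN (2601, 2627) THROW;
        SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    END CATCH
END
SELECT * FROM [dbo].[File] WHERE [FileID] = @FileID
GO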

First, create a unique index on the Name column. Then, from your client code, check whether the Name exists by selecting the FileID with the Name in the WHERE clause; if it does, use that FileID. If not, insert a new record.

Using EXISTS might clean things up a little:
IF EXISTS (SELECT * FROM table_name WHERE column_name = @param)
BEGIN
    -- use existing file name
END
ELSE
BEGIN
    -- use new file name
END

Related

"Operation must use an updateable query" in MS Access when the Updated Table is the same as the source

The challenge is to update a table by scanning that same table for information. In this case, I want to find how many entries I received in an Upload dataset that have the same key (effectively duplicate instructions).
I tried the obvious code:
UPDATE Base AS TAR
INNER JOIN (select cdKey, count(*) as ct FROM Base GROUP BY cdKey) AS CHK
ON TAR.cdKey = CHK.cdKey
SET ctReferences = CHK.ct
This resulted in a non-updateable complaint. Some workarounds talked about adding DISTINCTROW, but that made no difference.
I tried creating a view (a query, in MS Access parlance); same failure.
Then I projected the set (SELECT cdKey, count(*) INTO TEMP FROM Base GROUP BY cdKey), and substituted TEMP for the INNER JOIN which worked.
Conclusion: reflexive updates are also non-updateable.
An initial thought was to embed a sub-select in the update, for example:
UPDATE Base TAR SET TAR.ctReferences = (select count(*) from Base CHK where CHK.cd = TAR.cd)
This also failed.
As this is part of a job I am calling, this SQL (like the other statements) is a string executed by CurrentDb.Execute. I thought maybe I could make this a DLookup, but as cd is a string, I ended up with a gaggle of double- and triple-quoted elements that was too messy to read (and maintain).
Best solution was to write a function so I could avoid having to do any sort of string manipulation. Hence, in a module there's a function:
Public Function PassHelperCtOccurs(ByRef cdX As String) As Long
    ' Count the rows in Base whose cd matches the passed key
    PassHelperCtOccurs = DLookup("count(*)", "Base", "cd='" & cdX & "'")
End Function
And the call is:
CurrentDb().Execute ("UPDATE Base SET ctOccursCd = PassHelperCtOccurs(cd)")

Redshift Pivot Function

I've got a table similar to the one below, which I'm trying to pivot in Redshift:

UUID   Key    Value
a123   Key1   Val1
b123   Key2   Val2
c123   Key3   Val3
Currently I'm using the following code to pivot it and it works fine. However, when I replace the IN part with a subquery it throws an error.
select *
from (select UUID, "Key", value from tbl) PIVOT (max(value) for "key" in (
    'Key1',
    'Key2',
    'Key3'
))
Question: What's the best way to replace the IN part with a subquery that takes the distinct values from the Key column?
What I am trying to achieve:
select *
from (select UUID, "Key", value from tbl) PIVOT (max(value) for "key" in (
    select distinct "key" from tbl
))
From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of 2 queries. You likely can do this in a stored procedure if you need it as a single command.
Updated with requested stored procedure with results to a cursor example:
In order to make this supportable by you, I'll add some background info and a description of how this works. First off, a stored procedure cannot produce results straight to your bench. It can either store the results in a (temp) table or in a named cursor. A cursor is just storing the results of a query on the leader node, where they wait to be fetched. The lifespan of the cursor is the current transaction, so a COMMIT or ROLLBACK will delete the cursor.
Here's what you want to happen as individual SQL statements, but first let's set up the test data:
create table test (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;
Then you want to take this string (with the test data above, something like 'Key1','Key2','Key3') and insert it into your PIVOT query, which should look like this:
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);
But doing this in the bench means taking the result of one query and copy/pasting it into a second query, and you want this to happen automatically. Unfortunately, Redshift does not allow sub-queries in the PIVOT statement, for the reason given above.
We can, however, take the result of one query and use it to construct and run another query in a stored procedure. Here's such a stored procedure:
CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE
    row record;
BEGIN
    select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
    OPEN curs1 for EXECUTE 'select *
        from (select UUID, "Key", value from test)
        PIVOT (max(value) for "key" in ( ' || row.keys || ' )
        );';
END;
$$ LANGUAGE plpgsql;
What this procedure does is define and populate a "record" (one row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query, which uses the IN list from the record "row". Done.
When executed (by running CALL), the procedure will produce a cursor on the leader node that contains the result of the PIVOT query. The name of the cursor to create is passed to the procedure as a string.
call pivot_on_all_keys('mycursor');
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
fetch all from mycursor;
I prototyped this on a single node Redshift cluster and "FETCH ALL" is not supported at this configuration so I had to use "FETCH 1000". So if you are also on a single node cluster you will need to use:
fetch 1000 from mycursor;
The last point to note is that the cursor "mycursor" now exists and if you tried to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor) or you could end the transaction (END, COMMIT, or ROLLBACK) or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you wanted this to be repeatable you could run this batch of commands:
call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;
Remember that the cursor has a lifespan of the current transaction, so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enabled in your bench, it will insert COMMITs, destroying the cursor (in many benches you can run the CALL and FETCH in a single batch to prevent this). Also, some commands perform an implicit COMMIT and will destroy the cursor as well (like TRUNCATE).
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session so is a little stickier but is a little less efficient as a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just need to pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE
    row record;
BEGIN
    select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
    EXECUTE 'drop table if exists test_stage;';
    EXECUTE 'create table test_stage AS select *
        from (select UUID, "Key", value from test)
        PIVOT (max(value) for "key" in ( ' || row.keys || ' )
        );';
END;
$$ LANGUAGE plpgsql;
call pivot_on_all_keys();
select * from test_stage;
If you want this new table to have keys for optimizing downstream queries, you will want to create the table in one statement and then insert into it (a sketch of that variant follows); the above is just the quick path.
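A rough sketch of that create-then-insert variant, still inside the procedure (the column names and types follow the sample data above, the DISTKEY/SORTKEY choices are placeholders, and the column order is assumed to match what the PIVOT produces):
EXECUTE 'create table test_stage (
    uuid varchar(16),
    key1 varchar(16),
    key2 varchar(16),
    key3 varchar(16)
) DISTKEY(uuid) SORTKEY(uuid);';
EXECUTE 'insert into test_stage
    select *
    from (select UUID, "Key", value from test)
    PIVOT (max(value) for "key" in ( ' || row.keys || ' )
    );';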
A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY were replaced by PIVOT BY, it would give the interpreter enough of a hint to transform rows into columns. For example:
SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;
could be written as:
SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;
Even multi-level pivoting could be handled with the same syntax:
SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;

Using cte to swap two columns of a table

I want to swap the 2nd and 3rd columns of a table using a CTE.
I'm working with the query below, which keeps throwing an error:
no such column: cte.comm1
Table [SalComm], columns: ID, Sal, Comm
with CTE as
(
    SELECT ID as id1, sal as sal1, comm as comm1 from SalComm
)
UPDATE SalComm SET sal = cte.comm1, comm = cte.sal1 WHERE ID = cte.id1
Could you please suggest to me the right query?
This answer assumes you are using SQL Server, or some other database that supports directly updating common table expressions. I don't see the point of the aliases inside your CTE at all. If you want to swap column values, just use the column names directly:
WITH cte AS (
SELECT ID, sal, comm
FROM SalComm
)
UPDATE cte
SET sal = comm, comm = sal;
-- no WHERE clause needed, if you really want to cover the entire table
That being said, you could just as easily do the above update on the original table, as shown below. Updatable CTEs are more useful when they generate some complex derived result which you intend to use as part of a later update. That does not appear to be the case here.
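For completeness, the CTE-free version is just:
UPDATE SalComm
SET sal = comm, comm = sal;
-- both right-hand sides read the pre-update values, so the swap is clean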

Update a table by using multiple joins

I am using Java DB (Java DB is Oracle's supported version of Apache Derby and contains the same binaries as Apache Derby. source: http://www.oracle.com/technetwork/java/javadb/overview/faqs-jsp-156714.html#1q2).
I am trying to update a column in one table, however I need to join that table with 2 other tables within the same database to get accurate results (not my design, nor my choice).
Below are my three tables, ADSID is a key linking Vehicles and Customers and ADDRESS and ZIP in Salesresp are used to link it to Customers. (Other fields left out for the sake of brevity.)
Salesresp(address, zip, prevsale)
Customers(adsid, address, zipcode)
Vehicles(adsid, selldate)
The goal is to find customers in the SalesResp table that have previously purchased a vehicle before the given date. They are identified by address and adsid in Customers and Vehicles respectively.
I have seen updates to a column with a single join and in fact asked a question about one of my own update/joins here (UPDATE with INNER JOIN). But now I need to take it that one step further and use both tables to get all the information.
I can get a multi-JOIN SELECT statement to work:
SELECT * FROM salesresp
INNER JOIN customers ON (SALESRESP.ZIP = customers.ZIPCODE) AND
(SALESRESP.ADDRESS = customers.ADDRESS)
INNER JOIN vehicles ON (Vehicles.ADSId =Customers.ADSId )
WHERE (VEHICLES.SELLDATE<'2013-09-24');
However I cannot get a multi-JOIN UPDATE statement to work.
I have attempted to try the update like this:
UPDATE salesresp SET PREVSALE = (SELECT SALESRESP.address FROM SALESRESP
WHERE SALESRESP.address IN (SELECT customers.address FROM customers
WHERE customers.adsid IN (SELECT vehicles.adsid FROM vehicles
WHERE vehicles.SELLDATE < '2013-09-24')));
And I am given this error: "Error code 30000, SQL state 21000: Scalar subquery is only allowed to return a single row".
But if I change that first "=" to an "IN", it gives me a syntax error for having encountered "IN" (Error code 30000, SQL state 42X01).
I also attempted more blatant inner joins, but upon attempting to execute this code I got the same error as above ("Error code 30000, SQL state 42X01"), with it complaining about my use of the "FROM" keyword.
update salesresp set prevsale = vehicles.selldate
from salesresp sr
inner join vehicles v
on sr.prevsale = v.selldate
inner join customers c
on v.adsid = c.adsid
where v.selldate < '2013-09-24';
And in a different configuration:
update salesresp
inner join customer on salesresp.address = customer.address
inner join vehicles on customer.adsid = vehicles.ADSID
set salesresp.SELLDATE = vehicles.selldate where vehicles.selldate < '2013-09-24';
Where it finds the "INNER" distasteful: Error code 30000, SQL state 42X01: Syntax error: Encountered "inner" at line 3, column 1.
What do I need to do to get this multi-join update query to work? Or is it simply not possible with this database?
Any advice is appreciated.
If I were you I would:
1) Turn off autocommit (if you haven't already)
2) Craft a select/join which returns a set of columns that identifies the record you want to update, e.g. SELECT c1, c2, ... FROM A JOIN B JOIN C ... WHERE ...
3) Issue the update, e.g. UPDATE salesresp SET cx = ? WHERE c1 = ? AND c2 = ? AND ... (having an index on c1, c2, ... will boost performance)
4) Commit.
That way you don't have to worry about mixing the update and the join, and doing it within a txn ensures that nothing can change the result of the join before your update goes through. A sketch of steps 2 and 3 is below.
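Here is what that sequence might look like against the question's tables (the value for prevsale isn't shown in the question, so the 1 below is a placeholder):
-- step 2: identify the rows to update (autocommit off)
SELECT sr.address, sr.zip
FROM salesresp sr
INNER JOIN customers c
    ON sr.zip = c.zipcode AND sr.address = c.address
INNER JOIN vehicles v
    ON v.adsid = c.adsid
WHERE v.selldate < '2013-09-24';

-- step 3: for each (address, zip) returned, issue a plain single-table update
UPDATE salesresp SET prevsale = 1 WHERE address = ? AND zip = ?;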

Django: Distinct foreign keys

class Log(models.Model):
    project = models.ForeignKey(Project)
    msg = models.CharField(...)
    date = models.DateField(...)
I want to select the four most recent Log entries, where each Log entry must have a unique project foreign key. I've tried the solutions a Google search turns up, but none of them work, and the Django documentation isn't very good on lookups.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But these either return nothing or Log entries of the same project.
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id', flat=True).distinct()[:4]
entries = Log.objects.filter(project_id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
    SELECT project_id, max(logs.date) as max_date
    FROM logs
    GROUP BY project_id
    ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
    SELECT logs.field1, logs.field2, logs.field3, logs.date,
           rank() over ( partition by project_id
                         order by "date" DESC ) as dateorder
    FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY "date" DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table, roughly as sketched below.
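For example (the projects table's name and its name column are assumptions, not from the question):
SELECT logs.*, projects.name
FROM logs
JOIN projects ON projects.id = logs.project_id
JOIN ( SELECT project_id, max(logs.date) as max_date
       FROM logs
       GROUP BY project_id
       ORDER BY max_date DESC LIMIT 4 ) as latest
    ON logs.project_id = latest.project_id
    AND logs.date = latest.max_date;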
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
from django.db.models import Max

latest_ids_per_project = Log.objects.values_list(
    'project').annotate(latest=Max('date')).order_by(
    '-latest').values_list('project')
log_objects = Log.objects.filter(
    id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg"
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC