Redshift: need help optimizing a WHERE ... NOT IN (SELECT ...) subquery


I have the following Redshift query:
SELECT contributor_user_id,
device_id_source,
device_os,
device_model,
device_design,
device_serial,
device_carrier,
device_os_version,
device_manufacturer,
device_current_app_build,
device_current_app_version
FROM all_values
WHERE all_values.device_id_source :: VARCHAR NOT IN (SELECT device_id_source FROM table WHERE device_id_source IS NOT NULL)
AND all_values.device_os :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_os IS NOT NULL)
AND all_values.device_model :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_model IS NOT NULL)
AND all_values.device_design :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_design IS NOT NULL)
AND all_values.device_serial :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_serial IS NOT NULL)
AND all_values.device_carrier :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_carrier IS NOT NULL)
AND all_values.device_os_version :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_os_version IS NOT NULL)
AND all_values.device_manufacturer :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_manufacturer IS NOT NULL)
AND all_values.device_current_app_build :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_current_app_build IS NOT NULL)
AND all_values.device_current_app_version :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_current_app_version IS NOT NULL)
As far as I know, WHERE IN (SELECT) is slower than a JOIN, and the subqueries repeat many nearly identical requests, which I don't think is good. But I'm new to SQL and I don't know how to rewrite the code above with JOINs. Could you help me?
Thanks!

"WHERE NOT IN (SELECT ..." can be very expensive, because the list can be very long and it can take a lot of comparisons to determine that a value is not in the list. A somewhat less expensive way to do this is "WHERE NOT EXISTS (SELECT ...", which is more of a JOIN structure internally but still may not be fast enough for your case.
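To make the two forms concrete, here is a minimal sketch that runs both against SQLite through Python's sqlite3 module, not against Redshift, so it says nothing about Redshift's actual plan or speed. The table name known_devices is a hypothetical stand-in for "table", the sample rows are made up, and only one of the ten columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_values (contributor_user_id INTEGER, device_os TEXT);
    -- known_devices is a hypothetical stand-in for "table" in the question
    CREATE TABLE known_devices (device_os TEXT);
    INSERT INTO all_values VALUES (1, 'ios'), (2, 'android'), (3, 'kaios');
    INSERT INTO known_devices VALUES ('ios'), ('android'), (NULL);
""")

# Original shape: NOT IN over a list with the NULLs filtered out.
not_in_rows = conn.execute("""
    SELECT contributor_user_id
    FROM all_values
    WHERE device_os NOT IN (
        SELECT device_os FROM known_devices WHERE device_os IS NOT NULL
    )
    ORDER BY contributor_user_id
""").fetchall()

# Anti-join shape: NOT EXISTS with a correlated equality.
not_exists_rows = conn.execute("""
    SELECT a.contributor_user_id
    FROM all_values AS a
    WHERE NOT EXISTS (
        SELECT 1 FROM known_devices AS k WHERE k.device_os = a.device_os
    )
    ORDER BY a.contributor_user_id
""").fetchall()

print(not_in_rows, not_exists_rows)
```

Both queries return the same rows here because no device_os in all_values is NULL. With a NULL on the outer side the two forms differ (NOT IN drops the row, NOT EXISTS keeps it), so check that case before swapping one for the other.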
Note these are just guesses based on your SQL and past experience. Given how simple the rest of the query looks it is a good bet. You may still want to look at the EXPLAIN plan for the query and see where the cost is increasing the most.
The best answer is to rethink this query and remove the negative logic. If I'm reading this right, you want to find all the rows in all_values where the corresponding column value in "table" is NULL for ANY of the listed columns. To do this you are performing a subtraction algorithm using "WHERE NOT IN". I don't know your data model, so I'm not sure whether this logic is correct.
The difficulty here is that I don't know your data and data model. The query will flag any row that has any column NULL in "table", but only if there are no repeats of device_os in "table". For example, a row in "table" that is NULL for device_model will not be flagged if another row with the same device_os value is not NULL for device_design. It all depends on what the legal patterns are in your data. Are multiple rows with the same device_os legal in your data?
A better way is to make this into an additive algorithm, which may greatly reduce the work needed to get the desired answer. Without understanding the data and the desired logic, it is impossible for me to propose a solution. Example data and expected results would help in proposing a different solution.

Related

Query for listing Datasets and Number of tables in Bigquery

I'd like to write a query that shows all the datasets in a project and the number of tables in each one. My problem is with the number of tables.
Here is what I'm stuck with:
SELECT
smt.catalog_name as `Project`,
smt.schema_name as `DataSet`,
( SELECT
COUNT(*)
FROM ***DataSet***.INFORMATION_SCHEMA.TABLES
) as `nbTable`,
smt.creation_time,
smt.location
FROM
INFORMATION_SCHEMA.SCHEMATA smt
ORDER BY DataSet
The view INFORMATION_SCHEMA.SCHEMATA lists all the datasets in the project where the query is executed, and the view INFORMATION_SCHEMA.TABLES lists all the tables in a given dataset.
The catch is that INFORMATION_SCHEMA.TABLES only returns table information when the dataset is specified, like this: dataset.INFORMATION_SCHEMA.TABLES
So what I need is to replace ***DataSet*** with the value I get from the query itself (smt.schema_name).
I am not sure whether I can do this with a subquery, and I don't really know how to go about it.
I hope I'm clear enough. Thanks in advance if you can help.

You can do this with BigQuery's procedural language, as follows:
CREATE TEMP TABLE table_counts (dataset_id STRING, table_count INT64);
FOR record IN
(
SELECT
catalog_name as project_id,
schema_name as dataset_id
FROM `elzagales.INFORMATION_SCHEMA.SCHEMATA`
)
DO
EXECUTE IMMEDIATE
CONCAT("INSERT table_counts (dataset_id, table_count) SELECT table_schema as dataset_id, count(table_name) from ", record.dataset_id,".INFORMATION_SCHEMA.TABLES GROUP BY dataset_id");
END FOR;
SELECT * FROM table_counts;
This returns one row per dataset that contains at least one table, with the dataset ID and its table count.
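A different route, offered only as a sketch, is to build one UNION ALL statement on the client once the dataset names are known (for example, from the SCHEMATA query). The helper name build_table_count_query and the dataset names "sales" and "marketing" are hypothetical. The code only produces the SQL text and does not call BigQuery.

```python
import re


def build_table_count_query(dataset_ids):
    """Return one SQL string that counts the tables in each dataset.

    Hypothetical helper: it only builds text and does not run anything.
    """
    if not dataset_ids:
        raise ValueError("need at least one dataset id")
    parts = []
    for ds in dataset_ids:
        # Dataset names are spliced into the SQL text, so accept only
        # letters, digits, and underscores.
        if not re.fullmatch(r"[A-Za-z0-9_]+", ds):
            raise ValueError(f"unexpected dataset id: {ds!r}")
        parts.append(
            f"SELECT '{ds}' AS dataset_id, COUNT(*) AS table_count "
            f"FROM {ds}.INFORMATION_SCHEMA.TABLES"
        )
    return "\nUNION ALL\n".join(parts)


query = build_table_count_query(["sales", "marketing"])
print(query)
```

Because each branch is a plain COUNT(*) with no GROUP BY, a dataset with no tables should still show up with a count of 0, which the loop with GROUP BY would skip.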

stored procedure in BigQuery had error "Correlated subqueries that reference other tables are not supported...."

I have created a stored procedure in BigQuery for querying data.
The idea is to pass a parameter that is used in a WHERE condition to get a list of values, then use that list in the WHERE clause on another table. It fails with the error "Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN."
This is my query:
CALL mydatabase.stored_procedures.GetDateOutCS();
SELECT *, GetDateOutCS(ContainerHeaderID)
FROM mydatabase.instance.PRD_ContainerHeader
This is my stored procedure:
CREATE OR REPLACE PROCEDURE mydatabase.stored_procedures.GetDateOutCS()
BEGIN
CREATE TEMP FUNCTION GetDateOutCS (ContainerHeader_ID INT64)
AS
(
( SELECT GIDate FROM mydatabase.instance.RMM_ReceivedMaterialHU
WHERE HU IN
(SELECT CAST(HU AS INT64) FROM mydatabase.instance.PRD_ContainerLine
WHERE ContainerHeaderID=ContainerHeader_ID) LIMIT 1)
);
END;
I've tried changing the subquery to a join, but it doesn't work for me.
CREATE OR REPLACE PROCEDURE mydatabase.stored_procedures.GetDateOutCS_2()
BEGIN
CREATE TEMP FUNCTION GetDateOutCS (ContainerHID INT64)
AS
(
array( SELECT CAST(GIDate AS DATETIME)
FROM mydatabase.instance.RMM_ReceivedMaterialHU as h
INNER JOIN mydatabase.instance.PRD_ContainerLine as cl ON h.HU = CAST(cl.HU AS INT64)
WHERE cl.ContainerHeaderID=ContainerHID AND cl.HU IS NOT NULL LIMIT 1)
);
END;
What I'd like to know:
How can I fix this stored procedure?
Is this a limitation of BigQuery?

Using cte to swap two columns of a table

I want to swap the 2nd and 3rd columns of a table using a CTE.
I'm working with the query below, which keeps throwing an error:
no such column: cte.comm1
Table - [SalComm] column: ID, Sal, Comm
with CTE as
(
SELECT ID as id1, sal as sal1, comm as comm1 from SalComm
) UPDATE SalComm SET sal=cte.comm1, comm=cte.sal1 where ID= cte.id1
Could you please suggest to me the right query?

This answer assumes you are using SQL Server, or some other database that supports directly updating common table expressions. I don't see the point of the aliases inside your CTE. If you want to swap column values, just use the column names directly:
WITH cte AS (
SELECT ID, sal, comm
FROM SalComm
)
UPDATE cte
SET sal = comm, comm = sal;
-- no WHERE clause needed, if you really want to cover the entire table
That being said, you could just as easily do the above update on the original table. Updatable CTEs are more useful when they generate some complex derived results which you intend to use as part of a later update. That does not appear to be the case here.
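The direct update on the original table can be tried out with the sketch below. It uses SQLite through Python's sqlite3 module, on the assumption that the "no such column" wording in the question comes from SQLite. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalComm (ID INTEGER PRIMARY KEY, sal INTEGER, comm INTEGER);
    INSERT INTO SalComm VALUES (1, 1000, 50), (2, 2000, 75);
""")

# Both right-hand sides are evaluated against the row as it was before
# the update, so the two columns swap without a temporary column.
conn.execute("UPDATE SalComm SET sal = comm, comm = sal")

rows = conn.execute("SELECT ID, sal, comm FROM SalComm ORDER BY ID").fetchall()
print(rows)
```

Not every database evaluates the assignments this way (MySQL, for one, applies them left to right), so test on the target engine before relying on it.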

why use 'NA' = with the possibility of returning a group of values in SAS?

I have a quick question about the following piece of code. Why can we use 'NA' = for the subquery ? I mean, the subquery might return a group of values, not a single one, right? Could anyone tell me the reason? Many thanks for your time and attention.
proc sql;
select lastname, firstname
from sasuser.staffmaster
where 'NA' =
(select jobcategory
from sasuser.supervisors
where staffmaster.empid = supervisors.empid);
quit;
Thanks again.

Assuming EMPID is a unique ID for an employee (I hope it is?), and each employee has only one supervisor, that query should resolve to a single row every time. (A single row for each row returned from the outer query, of course, which is important. Think of it like a join - that's basically what that is, a slightly oddly phrased join, which often will be turned into an actual join by the SQL parser.)
In general, however, sure, it could resolve to multiple rows. SAS will let you do the query, and if it returns just one row it works; if it returns 2+ rows, it fails. As Quentin pointed out in comments, this is a correlated subquery.
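The "think of it like a join" point can be checked with a small sketch. It runs in SQLite through Python's sqlite3 module, not in SAS, and the sample rows are made up. EMPID is declared as a primary key in both tables, matching the assumption that each employee has at most one supervisor row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staffmaster (empid INTEGER PRIMARY KEY, lastname TEXT, firstname TEXT);
    CREATE TABLE supervisors (empid INTEGER PRIMARY KEY, jobcategory TEXT);
    INSERT INTO staffmaster VALUES (1, 'Adams', 'Ann'), (2, 'Baker', 'Bob'), (3, 'Clark', 'Cy');
    INSERT INTO supervisors VALUES (1, 'NA'), (2, 'FA');
""")

# Correlated scalar subquery, the same shape as the PROC SQL in the question.
subquery_rows = conn.execute("""
    SELECT lastname, firstname
    FROM staffmaster
    WHERE 'NA' = (SELECT jobcategory
                  FROM supervisors
                  WHERE staffmaster.empid = supervisors.empid)
    ORDER BY lastname
""").fetchall()

# The same question phrased as an explicit join.
join_rows = conn.execute("""
    SELECT m.lastname, m.firstname
    FROM staffmaster AS m
    JOIN supervisors AS s ON s.empid = m.empid
    WHERE s.jobcategory = 'NA'
    ORDER BY m.lastname
""").fetchall()

print(subquery_rows, join_rows)
```

One difference to keep in mind: when a scalar subquery yields several rows, SQLite quietly uses the first one, whereas SAS, as described above, fails. The sketch therefore only illustrates the single-row case.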

What's the right pattern for unique data in columns?

I've a table [File] that has the following schema
CREATE TABLE [dbo].[File]
(
[FileID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](256) NOT NULL,
CONSTRAINT [PK_File] PRIMARY KEY CLUSTERED
(
[FileID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The idea is that the FileID is used as the key for the table and the Name is the fully qualified path that represents a file.
What I've been trying to do is create a stored procedure that checks whether the Name is already in use. If it is, the procedure uses that record; otherwise it creates a new one.
But when I stress test the code with many threads executing the stored procedure at once I get different errors.
This version of the code will create a deadlock and throw a deadlock exception on the client.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION xact_File_Create
SET XACT_ABORT ON
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @@ROWCOUNT=0
BEGIN
    INSERT INTO [dbo].[File]([Name])
    VALUES (@Name)
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
END
SELECT * FROM [dbo].[File]
WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
With this version of the code I end up with rows that have the same data in the Name column.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
BEGIN TRANSACTION xact_File_Create
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @@ROWCOUNT=0
BEGIN
    INSERT INTO [dbo].[File]([Name])
    VALUES (@Name)
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
END
SELECT * FROM [dbo].[File]
WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
I'm wondering what the right way to do this type of action is? In general this is a pattern I'd like to use where the column data is unique in either a single column or multiple columns and another column is used as the key.
Thanks

If you are searching heavily on the Name field, you will probably want it indexed (as unique, and maybe even clustered if this is the primary search field). As you don't use the @FileID from the first select, I would just select count(*) from file where Name = @Name and see if it is greater than zero (this will prevent SQL from retaining any locks on the table from the search phase, as no columns are selected).
You are on the right course with the SERIALIZABLE level, as your action affects whether subsequent queries succeed or fail depending on the Name being present. The version without it causes duplicates because two selects ran concurrently and both found no record, so both went ahead with the insert (which creates the duplicate).
The deadlock with the prior version is most likely due to the lack of an index making the search process take a long time. When you load the server down in a SERIALIZABLE transaction, everything else will have to wait for the operation to complete. The index should make the operation fast, but only testing will indicate if it is fast enough. Note that you can respond to the failed transaction by resubmitting: in real world situations hopefully the load will be transient.
EDIT: By making your table indexed, but not using SERIALIZABLE, you end up with three cases:
1. Name is found, ID is captured and used. Common.
2. Name is not found, inserts as expected. Common.
3. Name is not found, insert fails because another exact match was posted within milliseconds of the first. Very rare.
I would expect this last case to be truly exceptional, so using an exception to capture this very rare case would be preferable to engaging SERIALIZABLE, which has serious performance consequences.
If you do really have an expectation that it will be common to have posts within milliseconds of one another of the same new name, then use a SERIALIZABLE transaction in conjunction with the index. It will be slower in the general case, but faster when these posts are found.
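The index-plus-exception approach can be sketched as follows. This uses SQLite through Python's sqlite3 module instead of SQL Server, so the locking behaviour discussed above is not reproduced. The function name get_or_create_file_id and the index name UX_File_Name are hypothetical.

```python
import sqlite3


def get_or_create_file_id(conn, name):
    """Return the FileID for name, inserting a row if none exists yet."""
    select_sql = "SELECT FileID FROM File WHERE Name = ?"
    row = conn.execute(select_sql, (name,)).fetchone()
    if row is not None:
        return row[0]  # case 1: the name is already there
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute("INSERT INTO File (Name) VALUES (?)", (name,))
            return cur.lastrowid  # case 2: plain insert
    except sqlite3.IntegrityError:
        # Case 3: another writer inserted the same name between our
        # SELECT and INSERT; the unique index rejected ours, so read theirs.
        return conn.execute(select_sql, (name,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE File (FileID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT NOT NULL)"
)
conn.execute("CREATE UNIQUE INDEX UX_File_Name ON File (Name)")

first = get_or_create_file_id(conn, "C:/data/a.txt")
second = get_or_create_file_id(conn, "C:/data/a.txt")
other = get_or_create_file_id(conn, "C:/data/b.txt")
print(first, second, other)
```

The unique index is what guarantees there are no duplicates. The initial SELECT is only a shortcut for the common case, and the exception handler covers the rare race.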

First, create a unique index on the Name column. Then from your client code first check if the Name exists by selecting the FileID and putting the Name in the where clause - if it does, use the FileID. If not, insert a new one.

Using the Exists function might clean things up a little.
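IF EXISTS (SELECT * FROM table_name WHERE column_name = @param)
BEGIN
    -- use existing file name (PRINT is only a placeholder statement)
    PRINT 'use existing file name'
END
ELSE
BEGIN
    -- use new file name (PRINT is only a placeholder statement)
    PRINT 'use new file name'
END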
