Redshift Pivot Function - amazon-web-services

I've got a table similar to the one below which I'm trying to pivot in Redshift:
UUID | Key  | Value
---- | ---- | -----
a123 | Key1 | Val1
b123 | Key2 | Val2
c123 | Key3 | Val3
Currently I'm using the following code to pivot it, and it works fine. However, when I replace the IN list with a subquery it throws an error.
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
'Key1',
'Key2',
'Key3'
))
Question: What's the best way to replace the IN list with a subquery that takes the distinct values from the Key column?
What I am trying to achieve:
select *
from (select UUID ,"Key", value from tbl) PIVOT (max(value) for "key" in (
select distinct "key" from tbl
))

From the Redshift documentation - "The PIVOT IN list values cannot be column references or sub-queries. Each value must be type compatible with the FOR column reference." See: https://docs.aws.amazon.com/redshift/latest/dg/r_FROM_clause-pivot-unpivot-examples.html
So I think this will need to be done as a sequence of two queries. You can likely do this in a stored procedure if you need it as a single command.
Update: here is the requested stored procedure example that returns its results through a cursor.
In order to make this supportable by you I'll add some background info and a description of how this works. First off, a stored procedure cannot return results straight to your bench. It can either store the results in a (temp) table or in a named cursor. A cursor just stores the results of a query on the leader node, where they wait to be fetched. The lifespan of the cursor is the current transaction, so a COMMIT or ROLLBACK will delete the cursor.
Here's what you want to happen as individual SQL statements, but first let's set up the test data:
create table test (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
The actions you want to perform are first to create a string for the PIVOT clause IN list like so:
select '\'' || listagg(distinct "key",'\',\'') || '\'' from test;
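On the test data this produces a string like 'Key1','Key2','Key3' (element order is not guaranteed without WITHIN GROUP), which is exactly the text the IN list needs.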
Then you want to take this string and insert it into your PIVOT query which should look like this:
select *
from (select UUID, "Key", value from test)
PIVOT (max(value) for "key" in ( 'Key1', 'Key2', 'Key3')
);
But doing this in the bench means taking the result of one query and copy/pasting it into a second query, and you want this to happen automatically. Unfortunately, Redshift does not allow sub-queries in the PIVOT statement, for the reason given above.
We can take the result of one query and use it to construct and run another query in a stored procedure. Here's such a stored procedure:
CREATE OR REPLACE procedure pivot_on_all_keys(curs1 INOUT refcursor)
AS
$$
DECLARE
    row record;
BEGIN
    select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
    OPEN curs1 for EXECUTE 'select *
        from (select UUID, "Key", value from test)
        PIVOT (max(value) for "key" in ( ' || row.keys || ' )
        );';
END;
$$ LANGUAGE plpgsql;
What this procedure does is define and populate a "record" (one row of data) called "row" with the result of the query that produces the IN list. Next it opens a cursor, whose name is provided by the calling command, with the contents of the PIVOT query, which uses the IN list from the record "row". Done.
When executed (by running CALL) this procedure will produce a cursor on the leader node that contains the result of the PIVOT query. The name of the cursor to create is passed to the procedure as a string.
call pivot_on_all_keys('mycursor');
All that needs to be done at this point is to "fetch" the data from the named cursor. This is done with the FETCH command.
fetch all from mycursor;
I prototyped this on a single-node Redshift cluster where "FETCH ALL" is not supported, so I had to use "FETCH 1000". So if you are also on a single-node cluster you will need to use:
fetch 1000 from mycursor;
The last point to note is that the cursor "mycursor" now exists, and if you try to rerun the stored procedure it will fail. You could pass a different name to the procedure (making another cursor), or you could end the transaction (END, COMMIT, or ROLLBACK), or you could close the cursor using CLOSE. Once the cursor is destroyed you can use the same name for a new cursor. If you want this to be repeatable you could run this batch of commands:
call pivot_on_all_keys('mycursor'); fetch all from mycursor; close mycursor;
Remember that the cursor has a lifespan of the current transaction, so any action that ends the transaction will destroy the cursor. If you have AUTOCOMMIT enabled in your bench, it will insert COMMITs, destroying the cursor (you can run the CALL and FETCH in a batch to prevent this in many benches). Also, some commands (like TRUNCATE) perform an implicit COMMIT and will also destroy the cursor.
For these reasons, and depending on what else you need to do around the PIVOT query, you may want to have the stored procedure write to a temp table instead of a cursor. Then the temp table can be queried for the results. A temp table has a lifespan of the session, so it is a little stickier, but it is a little less efficient: a table needs to be created, the result of the PIVOT query needs to be written to the compute nodes, and then the results have to be sent to the leader node to produce the desired output. Just pick the right tool for the job.
===================================
To populate a table within a stored procedure you can just execute the commands. The whole thing will look like:
CREATE OR REPLACE procedure pivot_on_all_keys()
AS
$$
DECLARE
    row record;
BEGIN
    select into row '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from test;
    EXECUTE 'drop table if exists test_stage;';
    EXECUTE 'create table test_stage AS select *
        from (select UUID, "Key", value from test)
        PIVOT (max(value) for "key" in ( ' || row.keys || ' )
        );';
END;
$$ LANGUAGE plpgsql;
call pivot_on_all_keys();
select * from test_stage;
If you want this new table to have distribution and sort keys for optimizing downstream queries, you will want to create the table in one statement and then insert into it, but this is the quick path.
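A minimal sketch of that create-then-insert path, assuming the pivoted key columns are known up front and assuming uuid is a sensible distribution and sort key for your workload (both are assumptions, adjust as needed):
CREATE TABLE test_stage (
    uuid varchar(16),
    key1 varchar(16),
    key2 varchar(16),
    key3 varchar(16)
)
DISTKEY (uuid)
SORTKEY (uuid);

-- populate it with the pivoted result
INSERT INTO test_stage
SELECT *
FROM (SELECT uuid, "key", value FROM test)
PIVOT (max(value) FOR "key" IN ('Key1', 'Key2', 'Key3'));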

A little off-topic, but I wonder why Amazon couldn't introduce a simpler syntax for pivot. IMO, if GROUP BY is replaced by PIVOT BY, it gives the interpreter enough of a hint to transform rows into columns. For example:
SELECT partname, avg(price) as avg_price FROM Part GROUP BY partname;
can be written as:
SELECT partname, avg(price) as avg_price FROM Part PIVOT BY partname;
Even multi-level pivoting could be handled with the same syntax.
SELECT year, partname, avg(price) as avg_price FROM Part PIVOT BY year, partname;
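For comparison, here is roughly what the first example requires in today's actual Redshift PIVOT syntax; the partname values are assumed for illustration:
SELECT *
FROM (SELECT partname, price FROM Part)
PIVOT (avg(price) FOR partname IN ('prop', 'rudder', 'wing'));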

Related

How to use page item value inside a trigger query in Oracle?

I want the following trigger to run correctly, but it raises an error: bad bind variable 'P23_ID'.
The trigger query is:
Create or replace trigger "newTRG"
Before
Insert on "my_table"
For each row
Begin
If :new."ID" is null then
Insert into my_table (ID) values (:P23_ID);
end if;
End;
Use the v() syntax:
create or replace trigger "newTRG" before
insert on "my_table"
for each row
begin
    if :new."ID" is null then
        insert into my_table ( id ) values (v('P23_ID'));
    end if;
end;
On a side note, if this is a primary key value it is a lot easier to use identity columns (the new way) or a sequence (the old way) to populate your column. Doing this from a page item is error prone.
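A quick sketch of both options, assuming Oracle 12c or later for the identity column and using hypothetical object names:
-- the new way: an identity column populates the ID automatically
create table my_table (
    id number generated by default as identity primary key
);

-- the old way: a sequence feeding a before-insert trigger
create sequence my_table_seq;
create or replace trigger my_table_bi
before insert on my_table
for each row
begin
    if :new.id is null then
        :new.id := my_table_seq.nextval;
    end if;
end;
/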

redshift - Not able to apply listagg function

I am getting an error when trying to use the listagg function.
Query
select
a.user_name,
listagg(a.group_name::text)
within group (order by a.group_name) as group_name
from (
SELECT
usename as user_name,
groname as group_name
FROM
pg_user
join
pg_group
on
pg_user.usesysid = ANY(pg_group.grolist) AND
pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group)
)a
group by user_name
Error
[Code: 500310, SQL State: XX000] Amazon Invalid operation: One or more of the used functions must be applied on at least one user created tables. Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc;
None of the values are null.
Just like there are some functions that can only be run on the leader node, there are some that can only be run on the compute nodes - listagg() is one of these. If you need to run listagg() on leader-node data there are a few approaches you can use. (Sorry, I'm not on a cluster now so cannot test these directly - I saw your question was aging and thought I'd get you started. Grain of salt, as I also cannot directly observe your issue, but I think I know what is going on.)
1. You can use a cursor to save the data from the leader node and use this as the source for listagg(). A stored procedure can streamline this. There are examples of this on Stack Overflow.
2. You can make a temp table out of the leader node data and use this in listagg(), but I expect you will need to exit (UNLOAD) and reenter (COPY) the cluster to do this.
There just isn't a direct path from leader-node-only results to the compute nodes without some sort of push-up like this. It's a consequence of the large networked cluster architecture of Redshift.
UPDATE
I got some cluster time and there are several unexpected issues with this one. grolist being an array type that isn't generally supported cluster-wide, and the need to use pg_group as the source, are the key ones. So this is going to require #1 AND #2 from above.
The process goes like this:
1. Define a cursor to hold the result of the pg_user / pg_group join select statement
2. Move the cursor results to a temp table
3. Use the temp table as the source for the outer (listagg()) select
A stored procedure can be written to do #1 and #2 which streamlines things. So you end up with the following SQL:
CREATE OR REPLACE procedure make_user_group()
AS
$$
DECLARE
    row record;
BEGIN
    create temp table user_group (user_name varchar(256), group_name varchar(256));
    for row in
        SELECT
            usename::text as user_name,
            groname::text as group_name
        FROM pg_user
        join pg_group
        on pg_user.usesysid = ANY(pg_group.grolist) AND
           pg_group.groname in (SELECT DISTINCT pg_group.groname from pg_group)
    LOOP
        INSERT INTO user_group(user_name, group_name) VALUES (row.user_name, row.group_name);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
call make_user_group();
select
user_name,
listagg(group_name::text, ', ')
within group (order by group_name) as group_name
from user_group
group by user_name;
Clearly the stored procedure only needs to be created once but called every time the temp table needs to be created.
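Note that since the procedure does a plain CREATE TEMP TABLE, calling it a second time in the same session will fail because user_group already exists; drop the temp table first (or add a DROP TABLE IF EXISTS to the top of the procedure) if you need to rerun it.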

BigQuery MERGE statement billing more bytes than editor shows

I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.
The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched and IdParent in unnest(parentList) then
insert (<all the fields>)
values (<all the fields>)
;
So I:
1. Pull the IdParent list from the staging table to know which partitions to prune
2. Limit the partitions of the target table in the join predicate
3. Also limit the partitions of the target table in the matched/not-matched conditions
The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) conditions, it shows ~250GB to bill in the editor (as expected, since there's no partition pruning). If I add the IdParent in unnest(parentList) back in so the script is exactly like you see it above, i.e. attempting to partition prune, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB.
The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.
Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.
That's very interesting. What you did seems totally correct.
It seems the BQ query planner can interpret your SQL correctly and knows that partition pruning is requested, but fails to apply it when the query executes.
Try removing t.IdParent in unnest(parentList) from both the WHEN MATCHED and WHEN NOT MATCHED clauses to see if the issue still happens, that is:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched then
insert (<all the fields>)
values (<all the fields>)
;
It would be a good idea to submit a bug to BigQuery if it can't be resolved.

How to show other values in a select list in APEX

I am using Oracle APEX to build an interactive report. There is a field in my database called method which should contain either A or B. On the edit page, I want to show a list containing A and B so that users can choose from those two.
I set the type of the item to Select List and, since I need to add the other value to the list, in the List of Values area I set the type to PL/SQL Function Body returning SQL Query. The code is as follows:
Begin
select TEST_METHOD into method from table_test
where ROWID = :P2_ROWID;
IF ('Live' = method) THEN
return select 'Screenshots' from dual;
END IF;
return select 'Live' from dual;
End;
However, I got the following error:
ORA-06550: line 5, column 10: PLS-00103: Encountered the symbol "SELECT" when expecting one of the following: ( - + ; case mod new not null continue avg count current exists max min prior sql stddev sum variance execute forall merge time timestamp interval date pipe
I am new to PL/SQL and APEX. I know the code looks weird, but I don't know what's wrong. I am also wondering if there is any other way to achieve my goal? Thanks!
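For what it's worth, the PLS-00103 error is raised because PL/SQL does not accept a bare SELECT as a RETURN expression; a function body of the "returning SQL Query" type has to return the query text as a string. A sketch of that shape, reusing the question's names and assuming the usual display/return column convention for an APEX LOV query:
DECLARE
    method table_test.TEST_METHOD%TYPE;
BEGIN
    SELECT TEST_METHOD INTO method FROM table_test
    WHERE ROWID = :P2_ROWID;
    IF method = 'Live' THEN
        RETURN 'select ''Screenshots'' as d, ''Screenshots'' as r from dual';
    END IF;
    RETURN 'select ''Live'' as d, ''Live'' as r from dual';
END;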

What's the right pattern for unique data in columns?

I've a table [File] that has the following schema:
CREATE TABLE [dbo].[File]
(
[FileID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](256) NOT NULL,
CONSTRAINT [PK_File] PRIMARY KEY CLUSTERED
(
[FileID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The idea is that the FileID is used as the key for the table and the Name is the fully qualified path that represents a file.
What I've been trying to do is create a stored procedure that checks whether the Name is already in use; if so, it uses that record, else it creates a new record.
But when I stress test the code with many threads executing the stored procedure at once I get different errors.
This version of the code will create a deadlock and throw a deadlock exception on the client.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRANSACTION xact_File_Create
SET XACT_ABORT ON
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @@ROWCOUNT = 0
BEGIN
    INSERT INTO [dbo].[File]([Name])
    VALUES (@Name)
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
END
SELECT * FROM [dbo].[File]
WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
With this version of the code I end up getting rows with the same data in the Name column.
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
BEGIN TRANSACTION xact_File_Create
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @@ROWCOUNT = 0
BEGIN
    INSERT INTO [dbo].[File]([Name])
    VALUES (@Name)
    SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
END
SELECT * FROM [dbo].[File]
WHERE [FileID] = @FileID
COMMIT TRANSACTION xact_File_Create
GO
I'm wondering what the right way to do this type of action is. In general, this is a pattern I'd like to use where the column data is unique in either a single column or multiple columns, and another column is used as the key.
Thanks
If you are searching heavily on the Name field, you will probably want it indexed (as unique, and maybe even clustered if this is the primary search field). As you don't use the @FileID from the first select, I would just select count(*) from file where Name = @Name and see if it is greater than zero (this will prevent SQL from retaining any locks on the table from the search phase, as no columns are selected).
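A minimal sketch of that check, inside the same stored procedure where @Name is declared:
DECLARE @cnt int
SELECT @cnt = COUNT(*) FROM [dbo].[File] WHERE [Name] = @Name
IF @cnt = 0
BEGIN
    -- name not present yet, safe to insert (subject to the race discussed below)
    INSERT INTO [dbo].[File]([Name]) VALUES (@Name)
END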
You are on the right course with the SERIALIZABLE level, as your action will impact subsequent queries success or failure with the Name being present. The reason the version without that set causes duplicates is that two selects ran concurrently and found there was no record, so both went ahead with the inserts (which creates the duplicate).
The deadlock with the prior version is most likely due to the lack of an index making the search process take a long time. When you load the server down in a SERIALIZABLE transaction, everything else will have to wait for the operation to complete. The index should make the operation fast, but only testing will indicate if it is fast enough. Note that you can respond to the failed transaction by resubmitting: in real world situations hopefully the load will be transient.
EDIT: By making your table indexed, but not using SERIALIZABLE, you end up with three cases:
1. Name is found, ID is captured and used. (Common)
2. Name is not found, insert proceeds as expected. (Common)
3. Name is not found, but the insert fails because another exact match was posted within milliseconds of the first. (Very rare)
I would expect this last case to be truly exceptional, so using an exception to capture this very rare case would be preferable to engaging SERIALIZABLE, which has serious performance consequences.
If you do really have an expectation that it will be common to have posts within milliseconds of one another of the same new name, then use a SERIALIZABLE transaction in conjunction with the index. It will be slower in the general case, but faster when these posts are found.
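A sketch of that exception-based approach, assuming a unique index on Name and a TRY/CATCH around the insert (the index name is illustrative):
CREATE UNIQUE INDEX IX_File_Name ON [dbo].[File]([Name]);
GO
CREATE PROCEDURE [dbo].[File_Create]
    @Name varchar(256)
AS
SET NOCOUNT ON
DECLARE @FileID int
SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
IF @FileID IS NULL
BEGIN
    BEGIN TRY
        INSERT INTO [dbo].[File]([Name]) VALUES (@Name)
        SET @FileID = SCOPE_IDENTITY()
    END TRY
    BEGIN CATCH
        -- very rare: another session inserted the same Name first,
        -- so the unique index rejected ours; re-read the existing row
        SELECT @FileID = [FileID] FROM [dbo].[File] WHERE [Name] = @Name
    END CATCH
END
SELECT * FROM [dbo].[File] WHERE [FileID] = @FileID
GO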
First, create a unique index on the Name column. Then, from your client code, check if the Name exists by selecting the FileID with the Name in the where clause - if it does, use that FileID. If not, insert a new one.
Using the EXISTS function might clean things up a little:
IF EXISTS (SELECT * FROM table_name WHERE column_name = @param)
BEGIN
    -- use existing file name
END
ELSE
BEGIN
    -- use new file name
END