Get the table name from another table and query on it in Redshift - amazon-web-services

I want to select all data from a table whose name has to be extracted from another table each time.
select * from (select max(table_name) from tableA)
I am trying to write a stored procedure for it.
Below is the syntax.
CREATE OR REPLACE PROCEDURE easybuy.stproced() AS $$
DECLARE
    tablename VARCHAR(200);
BEGIN
    SELECT max(table_name) INTO tablename FROM easybuy.current_portfolio_table_name;
    EXECUTE 'select * from ' || tablename || ';';
    RETURN;
END;
$$ LANGUAGE plpgsql;
CALL easybuy.stproced();
The above code executes fine, but it does not print the records that should come from the EXECUTE statement.

This cannot be done in plain SQL like your example, because the query is compiled and the target table needs to be known at compile time.
This means you need two SQL statements to perform this action. As you mention, a stored procedure can run a number of SQL statements, so it is a reasonable approach. In many cases this will work, except that a stored procedure cannot send result data out over the JDBC/ODBC connection (AFAIK). A stored procedure can fill a table with the results (a rough sketch of that route follows below) or fill a cursor, but in both cases you will need to SELECT or FETCH to see the results in your bench. So again you are back to needing two statements - calling the stored procedure and grabbing the results (SELECT or FETCH).
You could set up a wrapper around Redshift that takes some "special" command, maps it to the two SQL statements, and otherwise just passes SQL through. This can work, and there are tools available that do exactly this.
Some benches let you configure macros that you could map to the two statements in question. This could be a route worth looking into.
If you explain the overarching problem you are trying to solve there may be other routes to achieve this goal.
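For completeness, here is a rough sketch of the "fill a table" route mentioned above; the procedure name and the scratch table stproced_result are made up for illustration, and the cursor route is shown in the worked example below.
CREATE OR REPLACE PROCEDURE easybuy.stproced_to_table() AS $$
DECLARE
    tablename VARCHAR(200);
BEGIN
    -- look up the name of the table to read
    SELECT max(table_name) INTO tablename FROM easybuy.current_portfolio_table_name;
    -- materialize the rows into a scratch table the caller can query afterwards
    EXECUTE 'DROP TABLE IF EXISTS stproced_result';
    EXECUTE 'CREATE TABLE stproced_result AS SELECT * FROM ' || tablename;
END;
$$ LANGUAGE plpgsql;
-- the two statements the client then runs:
CALL easybuy.stproced_to_table();
SELECT * FROM stproced_result;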
==============================================================
Adding a stored procedure example that will perform the desired operation.
First let's set up some dummy tables:
create table test1 (tname varchar(16));
insert into test1 values
('test2'),
('b123'),
('c123');
create table test2 (UUID varchar(16), Key varchar(16), Value varchar(16));
insert into test2 values
('a123', 'Key1', 'Val1'),
('b123', 'Key2', 'Val2'),
('c123', 'Key3', 'Val3');
Next we create the stored procedure:
CREATE OR REPLACE PROCEDURE indirect_table(curs1 INOUT refcursor)
AS $$
DECLARE
    row record;
BEGIN
    select into row max(tname) as tname from test1;
    OPEN curs1 for EXECUTE 'select * from ' || row.tname || ';';
END;
$$ LANGUAGE plpgsql;
A quick explainer - the procedure takes the name of a cursor as an argument and declares a record for storing the result of a query. The maximum table name from test1 is stored in this record, and its value is then used to construct a dynamic query with that table in the FROM clause. The constructed query is opened into the cursor, where the results wait for a fetch request.
So the last step is to call the procedure and fetch the results. These are the only steps that will be needed in your script once the procedure is saved (committed).
call indirect_table('mycursor');
fetch all mycursor;
This will produce the desired output in the user's bench. (Note that "fetch all" is not supported on a single-node cluster; a fixed count such as "fetch 1000" is needed in that case.)
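On a single-node cluster the flow would look roughly like this, using the FORWARD count form of FETCH and repeating the fetch until no more rows come back:
call indirect_table('mycursor');
fetch forward 1000 from mycursor;  -- repeat until the fetch returns no rows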

Related

PROC SQL: Warning variable already exists on multiple dataset join

I have this data integrity check code for an oncology study I'm working on. It is meant to confirm that TU, TR and RS are consistent.
proc sql;
  create table tu_tr_rs as
  select tu.*, tr.*, rs.*
  from trans.tu as tu
  left join trans.tr as tr
    on tu.usubjid = tr.usubjid and tu.TULNKID = tr.TRLNKID and tu.tudtc = tr.trdtc
  left join trans.rs as rs
    on tr.usubjid = rs.usubjid and tr.trdtc = rs.rsdtc
  ;
quit;
However, when I run this code I get the warning
"Variable XXXX already exists on file WORK.TU_TR_RS."
When I add the feedback option to PROC SQL to get a more granular look, I can see the fully expanded query (screenshot not included here).
I know that if it were just one variable triggering this warning, I could use a RENAME/DROP combination to work around it. But in this case, do I just have to state the variables for each dataset explicitly in the select statement, or is there something fundamentally wrong with the code?
Yes, if you want to select columns with the same name from 2 (or more) data sets, you simply need to select them explicitly and give them distinct names. Something like this:
create table tu_tr_rs as
select
tu.ColA as tu_ColA
,tu.ColB as tu_ColB
/* etc */
,tr.ColA as tr_ColA
,tr.ColB as tr_ColB
/* etc */
,rs.ColA as rs_ColA
,rs.ColB as rs_ColB
/* etc */
from trans.tu as tu
/* etc */

Return a table from a user defined function in Redshift

I have a complex query which returns multiple rows for a given pair of dates: a start date and an end date.
Now I want to create a function so that I can return multiple rows for different combinations of dates.
CREATE FUNCTION submit_cohort(DATE, DATE)
RETURNS TABLE(Month VARCHAR(10), Name1 VARCHAR(20), Name2 VARCHAR(20), x INTEGER)
STABLE
AS $$
SELECT
to_char((date + interval '330 minutes')::date, 'YYYY/MM') "Month",
Name1,
Name2,
count(*) "x"
FROM xyz
WHERE date > $1
AND date < $2
GROUP BY 1,2,3
ORDER BY 1,2,3
END
$$ LANGUAGE sql;
I ran this query. It says:
Amazon Invalid operation: syntax error at or near "TABLE"
In Redshift you can define only scalar functions, i.e. those which return a single value. Set-based functions (those which return tables) are unfortunately not supported in Redshift.
A possible reason is that Redshift is a distributed database and functions run on the compute nodes in parallel, independently of each other. A set-based function would need to read data from the database, but some of that data may sit on one node while another portion sits on another node, so such a function cannot run on a single compute node independently. It would have to run on the leader node only, which defeats the whole point of parallelism.
Try to express the same logic in a SQL query. From your code it seems like it can work as a regular query/subquery.
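For illustration, here is the body of that function expressed as a plain query; the two date literals are just placeholders standing in for the $1/$2 parameters:
SELECT
    to_char((date + interval '330 minutes')::date, 'YYYY/MM') AS "Month",
    Name1,
    Name2,
    count(*) AS x
FROM xyz
WHERE date > '2021-01-01'   -- start date (placeholder for $1)
  AND date < '2021-02-01'   -- end date (placeholder for $2)
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;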

Variable in a Power BI query

I have a SQL query to get the data into Power BI. For example:
select a,b,c,d from table1
where a in ('1111','2222','3333' etc.)
However, the list of variables ('1111','2222','3333' etc.) will change every day so I would like the SQL statement to be updated before refreshing the data. Is this possible?
Ideally, I would like to keep a spreadsheet with a list of a values (in this example) so before refresh, it will feed those parameters into this script.
Another problem is that the list will have a different number of parameters each time, so the last value needs to be without a trailing comma.
Another option I was considering is to run the script without the where a in ('1111','2222','3333' etc.) clause, then load the spreadsheet with the list of those a values and filter the report down based on that list; however, this would be a lot of data to import into Power BI.
It's my first post ever, although I have been sourcing help from Stack Overflow for years, so hopefully it's all clear.
I would create a new Query to read the "a values" from your spreadsheet. I would set the Load To / Import Data option to Only Create Connection (to avoid duplicating the data).
Then in your SQL query I would remove the where clause. With that gone you actually don't need to write custom SQL at all - just select the table/view from the Navigation UI.
Then, from the "table1" query, I would add a Merge Queries step, connecting to the "a values" query on the "a" column, using the Join Type: Inner. The resulting rows will be only those with a matching "a" column value (similar to your current SQL where clause).
Power Query won't be able to send this to your SQL Server as a single query, so it will first select all the rows from table1, but it is still fairly quick and efficient.

Google Spanner - How do you copy data to another table?

Since Spanner does not have a DML feature like
insert into dest as (select * from source_table)
How do we select a subset of a table and copy those rows into another table?
I am trying to write data to a temporary table and then move the data to an archive table at the end of the day. The only solution I could find so far is to select rows from the source table and write them to the new table. This is done using the Java API, which does not have a ResultSet-to-Mutation converter, so I need to map every column of the table to the new table even when they are exactly the same.
Another issue is updating just one column, since there is no way of doing "update table_name set column = column - 1".
Again, to do that I need to read the row and map every field into an update Mutation, which is not practical if I have many tables because I would need to code it for all of them; a ResultSet -> Mutation converter would be nice here too.
Is there any generic Mutation cloner and/or any other way to copy data between tables?
As of version 0.15 this open source JDBC Driver supports bulk INSERT-statements that can be used to copy data from one table to another. The INSERT-syntax can also be used to perform bulk UPDATEs on data.
Bulk insert example:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT C1, C2, C3
FROM OTHER_TABLE
WHERE C1>1000
Bulk update is done using an INSERT-statement with the addition of ON DUPLICATE KEY UPDATE. You have to include the value of the primary key in your insert statement in order to 'force' a key violation which in turn will ensure that the existing rows will be updated:
INSERT INTO TABLE
(COL1, COL2, COL3)
SELECT COL1, COL2+1, COL3+COL2
FROM TABLE
WHERE COL2<1000
ON DUPLICATE KEY UPDATE
You can use the JDBC driver with for example SQuirreL to test it, or to do ad-hoc data manipulation.
Please note that the underlying limitations of Cloud Spanner still apply, meaning a maximum of 20,000 mutations in one transaction. The JDBC Driver can work around this limit by specifying the value AllowExtendedMode=true in your connection string or in the connection properties. When this mode is allowed, and you issue a bulk INSERT- or UPDATE-statement that will exceed the limits of one transaction, the driver will automatically open an extra connection and perform the bulk operation in batches on the new connection. This means that the bulk operation will NOT be performed atomically, and will be committed automatically after each successful batch, but at least it will be done automatically for you.
Have a look here for some more examples: http://www.googlecloudspanner.com/2018/02/data-manipulation-language-with-google.html
Another approach to performing the bulk copy is to use LIMIT & OFFSET (see the ordering note after the example):
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 1000);
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table LIMIT 1000 OFFSET 2000);
...
and so on, until all the required rows have been copied.
PS: this is more of a trick, but it will definitely save you time.
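Note that LIMIT/OFFSET only slices the data consistently across statements when the rows come back in a stable order, so it is safer to add an ORDER BY on a key column. A sketch, with c1 standing in for the primary key:
insert into dest(c1,c2,c3)
(select c1,c2,c3 from source_table ORDER BY c1 LIMIT 1000 OFFSET 1000);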
Spanner supports expressions in the SET section of an UPDATE statement, which can be used to supply a subquery that fetches data from another table, like this:
UPDATE target_table
SET target_field = (
-- use subquery as an expression (must return a single row)
SELECT source_table.source_field
FROM source_table
WHERE my_condition IS TRUE
) WHERE my_other_condition IS TRUE;
The generic syntax is:
UPDATE table SET column_name = { expression | DEFAULT } WHERE condition

SAS Data Integration - Create a physical table from metadata structure

I need to use an Append object after a series of joins that have a conditional run... so a join step may not execute if its condition is not met, and its physical work dataset will not be created.
The problem is that the Append step throws an error if one or more of its input physical datasets have not been created.
Is there a smart way to create an empty physical table from the metadata structure of the joins' work tables, or to use the Append with some datasets that were never created?
Creating the table with the full list of fields is not a real solution, because I would have to replicate it for 8 different joins and then replicate the job 10 times...
Thanks to all
Roberto
Thank you for your comments.
What you should do:
Amend your conditional node so that, on the positive condition, it creates a global macro variable with the value MAX, and on the negative condition it creates the same variable with the value 0.
Replace the offending SQL step with a "Create Table" node.
In the options for "Create Table", specify the macro variable for "Maximum output rows (OUTOBS)".
Now, when your condition is not met, you will always end up with an empty table; when the condition is met, the step executes normally.
I must say my version of DI Studio is a bit old. In my version the SQL node doesn't allow passing macro variables to SQL options; only integers can be typed in. Check whether your version allows it, because if it does you can amend the existing SQL step and avoid replacing it with another node.
One more thing: you will get a warning when the OUTOBS option is less than the number of rows the dataset would otherwise have.
Let me know if you have any questions.
In the end I created another step that extracts 0 rows from the source table via the condition 1=0 in the Where tab. This way I have an empty table that I can use with a DATA/SET step in the post-SQL of the conditional run when the work table of the join does not exist.
This is not a proper solution, but it is a valid workaround.
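For reference, a minimal PROC SQL sketch of that 1=0 trick; the library and table names are placeholders:
proc sql;
  /* same columns as the join's source table, zero rows */
  create table work.join1_empty as
  select *
  from work.join1_source
  where 1 = 0;
quit;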