I want to delete all external tables present in a schema in a particular sandbox in Redshift - amazon-web-services

I want to delete all external tables present in a schema in a particular sandbox in Redshift, but I get the following error:
ERROR Exception: DROP EXTERNAL TABLE cannot be executed from a function or procedure
CREATE OR REPLACE PROCEDURE "workspace"."qw"()
AS
$$
DECLARE
    t_sql VARCHAR(32000);
    t_script_name VARCHAR(100) := 'load_sample_dcdr$table_cleanup';
    t_table_name VARCHAR(100);
    t_start_runtime TIMESTAMP;
    t_row_count BIGINT := 0;
    t_current_db_YYYYMM VARCHAR;
    t_current_db VARCHAR(100);
    cur_loop REFCURSOR;
BEGIN
    -- IF UPPER(t_current_db) = UPPER(current_database()) THEN
    t_sql := 'select distinct(table_name) from svv_all_columns where schema_name=''deleted'' and database_name=''singh_sandbox''';
    OPEN cur_loop FOR EXECUTE t_sql;
    LOOP
        FETCH cur_loop INTO t_table_name;
        EXIT WHEN NOT FOUND;
        EXECUTE 'DROP TABLE IF EXISTS deleted.' || t_table_name || ' CASCADE';
        t_row_count := t_row_count + 1;
    END LOOP;
    CLOSE cur_loop;
    -- END IF;
END;
$$ LANGUAGE plpgsql;

This is because DROP EXTERNAL TABLE cannot be run inside a transaction block (BEGIN ... END). This makes running this command inside a stored procedure impossible, since the procedure is its own transaction block. [There are reasons for this but that isn't important for the question.]
So to do this you will need some code running outside of Redshift that can issue the DROP statements outside of a transaction block. Some SQL benches can perform actions like this, but most don't. You could write a Lambda function, or even a bash script can do this if you have the aws and psql CLIs installed. Or you can issue the statements manually, but that could be a pain if the list of tables is large.
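For example, a minimal bash sketch of that approach, assuming psql can reach the cluster, that you connect to the singh_sandbox database, and that REDSHIFT_URL is a placeholder connection string you supply. Each psql -c call runs as its own implicit transaction, so no BEGIN ... END block is involved:
#!/bin/bash
# Generate one DROP statement per external table in the "deleted" schema,
# then run each statement in its own psql call (one implicit transaction per DROP).
psql "$REDSHIFT_URL" -At -c "
  select 'drop table if exists deleted.' || tablename || ';'
    from svv_external_tables
   where schemaname = 'deleted'
" | while read -r stmt; do
  echo "Running: $stmt"
  psql "$REDSHIFT_URL" -c "$stmt"
done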

Related

Is it possible to run queries in parallel in Redshift?

I wanted to do an insert and an update at the same time in Redshift. For this I am inserting the data into a temporary table, removing the updated entries from the original table, and then inserting all the new and updated entries. Since Redshift runs the operations concurrently, entries are sometimes duplicated, because the delete starts before the insert has finished. With a very large sleep between operations this does not happen, but then the script is very slow. Is it possible to run queries in parallel in Redshift?
Hope someone can help me, thanks in advance!
You should read up on MVCC (multi-version concurrency control) and transactions. Redshift can only run one query at a time (per session), but that is not the issue. You want to COMMIT both changes at the same time (COMMIT is the action that makes changes visible to others). You do this by wrapping your SQL statements in a transaction (BEGIN ... COMMIT) executed in the same session (it isn't clear whether you are using multiple sessions). All changes made within the transaction are only visible to the session making them UNTIL COMMIT, at which point ALL the changes made by the transaction become visible to everyone at the same moment.
A few things to watch out for: if your connection is in AUTOCOMMIT mode you may break out of your transaction early and COMMIT partial results. Also, while you are working in a transaction your view of the source tables is frozen (so you see consistent data throughout the transaction) and it isn't allowed to change underneath you. This means that if you have multiple sessions changing table data, you need to be careful about the order in which they COMMIT so that the right version of the data is presented to each session.
begin transaction;
<run the queries in parallel>
end transaction;
In this specific case do this:
create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';
begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;
insert into target
select * from stage;
end transaction;
drop table stage;
See:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

Google Bigquery: Join of two external tables fails if one of them is empty

I have 2 external tables in BiqQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table, the second is errors data - and it might or might not be empty.
I can query each table separately just fine, even an empty one (a plain SELECT on the empty table simply returns zero rows).
I'm also able to left join them if both of them are non-empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to this versioning issue - Reading BigQuery federated table as source in Dataflow throws an error
I'd rather avoid converting either of these tables to native, since they are used in just one step of the ETL process and the data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain select works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326
It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));

-- do things differently when is_external_table_empty is true
IF is_external_table_empty = true THEN
  ...
ELSE
  ...
END IF;
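For instance, a hypothetical sketch of how that branch might be used to skip the join when the errors table is empty (fact_table, errors_table, the id join key and the error_message column are all placeholder names):
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM errors_table));

IF is_external_table_empty = true THEN
  -- only the fact table is scanned; pad the missing column with NULL
  SELECT f.*, CAST(NULL AS STRING) AS error_message
  FROM fact_table AS f;
ELSE
  SELECT f.*, e.error_message
  FROM fact_table AS f
  LEFT JOIN errors_table AS e ON f.id = e.id;
END IF;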

moving heroku db to RDS

I'm trying to move a Postgres db (version 9.6.15) we have in Heroku public space over to AWS RDS. The db is about 2.2 TB and it's on the Heroku premium 7 tier.
Since it's pretty much impossible to do live replication from Heroku, because they limit the ability to set up replication (from my understanding), I'm looking to see how fast I can dump the data and load it into RDS.
I was looking at ways to optimize pg_dump, and what I've done so far is run it from an EC2 instance (m5.2xlarge) in AWS to see how fast it would run.
When running it on the EC2 instance, I've only been able to get about 10 MB/sec, which is crazy slow; at that rate it would take about 84 hours to finish.
I ran pg_dump as follows:
sudo pg_dump "postgres://user:pass@url:5432/db" \
--jobs=24 \
--format=directory \
--file=/monodb/pgdump2 \
--verbose
What else can I do to speed up the export, or is this just a limitation of the egress bandwidth from Heroku?
1. Dump your schema separately from the data
The reason for this will become clearer as you read on, but simply put, dumping the schema separately allows us to have multiple concurrent dump and restore operations going at once.
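A rough sketch of what that separation might look like, with the connection string and output paths as placeholders:
# Schema only: small and fast; restore it on RDS first
pg_dump "postgres://user:pass@host:5432/db" --schema-only --file=schema.sql

# Data only: directory format so --jobs can dump several tables in parallel
pg_dump "postgres://user:pass@host:5432/db" --data-only --format=directory \
        --jobs=24 --file=/monodb/pgdump_data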
2. Disable foreign keys prior to beginning a restore
A Postgres dump, typically, is a sequence of insert statements. If the table you are inserting into has foreign keys, then the rules of the keys must be evaluated for every insert.
create table if not exists dropped_foreign_keys (
  seq bigserial primary key,
  sql text
);

do $$ declare t record;
begin
  for t in select conrelid::regclass::varchar table_name, conname constraint_name,
                  pg_catalog.pg_get_constraintdef(r.oid, true) constraint_definition
             from pg_catalog.pg_constraint r
            where r.contype = 'f'
              -- current schema only:
              and r.connamespace = (select n.oid from pg_namespace n where n.nspname = current_schema())
  loop
    insert into dropped_foreign_keys (sql) values (
      format('alter table %s add constraint %s %s',
        quote_ident(t.table_name), quote_ident(t.constraint_name), t.constraint_definition));
    execute format('alter table %s drop constraint %s', quote_ident(t.table_name), quote_ident(t.constraint_name));
  end loop;
end $$;
After you are done restoring data, you can re-enable the foreign keys by running the following command:
do $$ declare t record;
begin
  -- order by seq for easier troubleshooting when data does not satisfy FKs
  for t in select * from dropped_foreign_keys order by seq loop
    execute t.sql;
    delete from dropped_foreign_keys where seq = t.seq;
  end loop;
end $$;
3. You can use additional pg_dump flags alongside your current ones, such as:
--data-only: dumps only the data from the tables, not the schema information.
--no-synchronized-snapshots: prior to Postgres 9.2 this was a requirement for running jobs in parallel.
--schema=public: instructs pg_dump to dump only the public schema; for most cases this is all you need.
4. Dump large tables by themselves and group small tables together
Doing this allows you to run multiple dumps and restores at the same time. You can specify the tables you want to group together by using the --table flag in pg_dump.
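For example, a hedged sketch of that grouping, where big_table and SRC_URL are placeholders:
# Dump the largest table by itself and everything else together, concurrently
pg_dump "$SRC_URL" --data-only --format=directory --jobs=4 \
        --table=big_table --file=dump_big_table &
pg_dump "$SRC_URL" --data-only --format=directory --jobs=4 \
        --exclude-table=big_table --file=dump_everything_else &
wait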

Transaction and Locking with multiple threads

Hi all, following is the problem scenario:
I am using MySQL (InnoDB engine), and one of my applications (C++/MySQL C API) is doing the following operations:
START TRANSACTION
truncate my_table
load Data infile into table my_table.
if both of the above commands [truncate and load] are successful then COMMIT
else ROLLBACK
Now another application (C++/MySQL C API) reads this table every second with the following command:
select * from my_table
ERROR: in this read attempt it sometimes gets 0 rows. What could be the reason for this?
CREATE TABLE new LIKE real;
-- load `new` by whatever means
-- if something went wrong, don't do the next two steps
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
This avoids the problem you mentioned, plus lots of other problems. In particular, it needs no special transaction handling; the RENAME is atomic and very fast.
You're seeing an empty table because TRUNCATE TABLE causes an implicit commit. If you need to replace the entire table's contents within a transaction, you can use DELETE then INSERT, or try the rename solution presented in the answer above.
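A minimal sketch of the delete-then-insert variant (the file path is a placeholder); unlike TRUNCATE, both DELETE and LOAD DATA are transactional in InnoDB, so a concurrent SELECT keeps seeing the old rows until the COMMIT:
START TRANSACTION;
DELETE FROM my_table;                                      -- transactional, unlike TRUNCATE
LOAD DATA INFILE '/path/to/data.csv' INTO TABLE my_table;  -- no implicit commit here either
COMMIT;                                                    -- readers see the new rows only from this point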

Informatica Target Table Keyword

How do you use Informatica to load data into a target table whose name is a SQL reserved keyword?
I have a situation where I am trying to use Informatica to populate a table called Union which is failing with the following error:
SQL Server Message: Incorrect syntax near the keyword 'Union'
Database driver error...
Function Name : Execute Multiple
SQL Stmt : INSERT INTO UNION (UnionCode, UnionName, etc )
I have been told that changing the database properties to use quoted identifier would solve this problem; however, I have tried that and it only appears to work for sources, not targets.
And before anyone states the obvious - I cannot change the name of the target table.
Can you please try overriding the table name in the session properties as "Union", with the quotes?
Load your data into a table with a valid name (e.g. XUnion) having the same structure as Union.
Then, in the Post SQL of that target, you can rename the table to whatever name is required.
Ex.
Click on the target (XUnion), go to Post SQL and put a statement like the one below (on SQL Server the rename is done with sp_rename rather than RENAME):
EXEC sp_rename 'XUnion', 'Union';
If any table name or column name contains a database reserved word, such as MONTH or YEAR, the session fails with database errors when the Integration Service executes SQL against the database. You can create and maintain a reserved words file, reswords.txt, in the server/bin directory. When the Integration Service initializes a session, it searches for reswords.txt. If the file exists, the Integration Service places quotes around matching reserved words when it executes SQL against the database.
Use the following rules and guidelines when working with reserved words.
The Integration Service searches the reserved words file when it generates SQL to connect to source, target, and lookup databases.
If you override the SQL for a source, target, or lookup, you must enclose any reserved word in quotes.
You may need to enable some databases, such as Microsoft SQL Server and Sybase, to use SQL-92 standards regarding quoted identifiers. Use connection environment SQL to issue the command. For example, use the following command with Microsoft SQL Server:
SET QUOTED_IDENTIFIER ON
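For illustration, once quoted identifiers are enabled the generated insert effectively needs to look like the following (the column names come from the error message above; the values are made up):
SET QUOTED_IDENTIFIER ON;

INSERT INTO "Union" (UnionCode, UnionName)
VALUES ('U001', 'Example Union');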
Sample reswords.txt File
To use a reserved words file, create a file named reswords.txt and place it in the server/bin directory. Create a section for each database that you need to store reserved words for. Add reserved words used in any table or column name. You do not need to store all reserved words for a database in this file. Database names and reserved words in reswords.txt are not case sensitive.
Following is a sample reswords.txt file:
[Teradata]
MONTH
DATE
INTERVAL
[Oracle]
OPTION
START
[DB2]
[SQL Server]
CURRENT
[Informix]
[ODBC]
MONTH
[Sybase]