I am trying to drop all the partitions on an external table in a Redshift cluster, and I can't find an easy way to do it. Currently I run a dynamic query that selects the partition values from the table and concatenates them into DROP PARTITION statements, then I take the result set and run it separately, like this:
select 'ALTER TABLE procore_iad_ext.active_histories DROP PARTITION (values='''||rtrim(ltrim(values, '["'),'"]') ||''');' from svv_external_partitions
where tablename = 'xyz';
The values column looks like this: ["2009-03-10"]
Looking for a simpler direct solution. Thanks.
The easiest way to do this would be to drop the table itself. As long as you have the DDL to recreate the table and don't mind dropping all partitions, just DROP TABLE <schemaname>.<tablename>; then recreate the table. The new table will not have any partitions.
Please check out the Glue catalog. It provides a UI to easily delete the tables/partitions etc.
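If the table has to stay in place and the statements must be generated anyway, the string building from the question can be moved out of SQL. Below is a minimal Python sketch, assuming the rows of the values column have already been fetched as JSON array strings. The function name build_drop_statements and the partition column name created_date are hypothetical; the real partition column names have to be supplied by the caller.

```python
import json


def build_drop_statements(schema, table, partition_columns, values_rows):
    """Build one ALTER TABLE ... DROP PARTITION statement per partition.

    values_rows holds strings shaped like the `values` column shown in the
    question, e.g. '["2009-03-10"]'. partition_columns is supplied by the
    caller (assumption: same order as the values in each row).
    """
    statements = []
    for raw in values_rows:
        values = json.loads(raw)  # '["2009-03-10"]' -> ['2009-03-10']
        if len(values) != len(partition_columns):
            raise ValueError(f"unexpected partition values: {raw}")
        # Double any single quotes so the value stays a valid SQL literal.
        spec = ", ".join(
            "{}='{}'".format(col, str(val).replace("'", "''"))
            for col, val in zip(partition_columns, values)
        )
        statements.append(f"ALTER TABLE {schema}.{table} DROP PARTITION ({spec});")
    return statements


# Hypothetical usage: created_date stands in for the real partition column.
for stmt in build_drop_statements(
    "procore_iad_ext", "active_histories", ["created_date"], ['["2009-03-10"]']
):
    print(stmt)
```

The statements still have to be executed one by one, so this only tidies up the generation step; it is not a single-command way to drop everything.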
In SQL Server, we can create an index like this. How do we create an index after the table already exists? What is the syntax to create a clustered index in BigQuery?
CREATE INDEX abcd ON `abcd.xxx.xxx`(columnname )
In BigQuery, we can create a table like the one below. But how do we add partitioning and clustering to an existing table?
CREATE TABLE rep_sales.orders_tmp PARTITION BY DATE(created_at) CLUSTER BY created_at AS SELECT * FROM rep_sales.orders
As @Sergey Geron mentioned in the comments, BigQuery doesn't support indexes. For more information, please refer to this doc.
An existing table cannot be partitioned but you can create a new partitioned table and then load the data into it from the unpartitioned table.
As for clustering of tables, BigQuery supports changing an existing non-clustered table to a clustered table and vice versa. You can also update the set of clustered columns of a clustered table. This method of updating the clustering column set is useful for tables that use continuous streaming inserts because those tables cannot be easily swapped by other methods.
You can change the clustering specification in the following ways:
Call the tables.update or tables.patch API method.
Call the bq command-line tool's bq update command with the --clustering_fields flag.
Note: When a table is converted from non-clustered to clustered or the clustered column set is changed, automatic re-clustering only works from that time onward. For example, a non-clustered 1 PB table that is converted to a clustered table using tables.update still has 1 PB of non-clustered data. Automatic re-clustering only applies to any new data committed to the table after the update.
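For the API route, here is a minimal Python sketch, assuming the google-cloud-bigquery client library is installed and credentials are configured. The helper name set_clustering_fields and the project ID my-project are hypothetical; the client is passed in so the function itself has no import-time dependency.

```python
def set_clustering_fields(client, table_id, fields):
    """Change the clustering columns of an existing table.

    `client` is expected to be a google.cloud.bigquery.Client (assumption).
    Only the clustering_fields property is sent in the update request.
    """
    table = client.get_table(table_id)       # fetch current table metadata
    table.clustering_fields = list(fields)   # new clustering column set
    return client.update_table(table, ["clustering_fields"])


# Hypothetical usage (my-project is a placeholder project ID):
# from google.cloud import bigquery
# set_clustering_fields(bigquery.Client(), "my-project.rep_sales.orders", ["created_at"])
```

As the note above says, this only affects data written after the change; existing data is not re-clustered by the update itself.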
My partitions look like these
There are 1500 partitions like this. How do I drop all of them at once?
The easiest and quickest way is to drop the table and recreate it. You can get the DDL with SHOW CREATE TABLE table_name if you don't have it.
If you really need to drop partitions and not the table the most efficient way is to use the Glue Data Catalog APIs to first list all partitions and then delete partitions in batches of 25.
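The list-then-delete approach can be sketched in Python roughly as follows, assuming a boto3 Glue client is passed in. The function name delete_all_partitions and the database and table names in the usage comment are hypothetical.

```python
def delete_all_partitions(glue, database, table, batch_size=25):
    """Delete every partition of a Glue table in batches.

    `glue` is expected to be boto3.client("glue") (assumption). Partitions are
    listed in full first, so nothing is deleted while pages are still being read.
    """
    paginator = glue.get_paginator("get_partitions")
    to_delete = []
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        for partition in page["Partitions"]:
            to_delete.append({"Values": partition["Values"]})

    deleted = 0
    for start in range(0, len(to_delete), batch_size):
        batch = to_delete[start:start + batch_size]
        response = glue.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=batch,
        )
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(f"failed to delete some partitions: {errors}")
        deleted += len(batch)
    return deleted


# Hypothetical usage:
# import boto3
# delete_all_partitions(boto3.client("glue"), "my_database", "my_table")
```

With 1500 partitions this makes 60 delete calls. Removing partitions from the catalog does not delete the underlying files on S3.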
It's not documented anywhere, but sometimes in Athena you can drop all partitions with
ALTER TABLE table_name DROP PARTITION (not_a_column=NULL)
This appears to be a side effect of being able to only specify one partition if you have a table partitioned on multiple dimensions.
If the above doesn't work, I fall back to the awswrangler Python library: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.catalog.delete_partitions.html
I need to insert data into AWS Redshift on a daily basis.
The requirement is to analyze only the daily batch inserted into Redshift. The Redshift cluster is used by BI tools for analytics.
What are the best practices to "renew" the data set on a daily basis?
My concern is that this is quite a heavy operation and performance will be poor. At the same time, it is quite a common situation, and I believe multiple organizations have done it before.
If the data is on S3, why not create an EXTERNAL TABLE over it? Then, if queries over the external table are not fast enough, you can load it with a CREATE TABLE AS SELECT statement into a temporary table and, once loaded, rename that table to your usual table name.
Sketched SQL:
CREATE EXTERNAL TABLE external_daily_batch_20190422 (
<schema ...>
)
<if anything to partition on>
ROW FORMAT SERDE <data format>
LOCATION 's3://my-s3-location/2019-04-22';
CREATE TABLE internal_daily_batch_temp AS
SELECT * from external_daily_batch_20190422;
DROP TABLE IF EXISTS internal_daily_batch__backup CASCADE;
ALTER TABLE internal_daily_batch rename to internal_daily_batch__backup;
ALTER TABLE internal_daily_batch_temp rename to internal_daily_batch;
Incremental load not possible?
By the way, is all of your 10TB of data mutable? Isn't incremental update possible?
I am trying to change a column name in an AWS Athena table.
From old_name to new_name.
Normal DDL commands do not affect the table (they cannot be executed).
Is it possible to change a column name without deleting and re-creating the table from scratch?
I was mistaken: Athena uses Hive DDL syntax, so the correct command is:
ALTER TABLE %%table-name%% CHANGE %%old-column-name%% %%new-column-name%% <string>;
I based my answer on a Hive-related question.
You can find more about supported and unsupported DDLs here
I'm trying to duplicate a Redshift table including modifiers.
I've tried using a CTAS statement, but for some reason it fails to copy modifiers like NOT NULL:
create table public.my_table as (select * from public.my_old_table limit 1);
There also doesn't seem to be a way to alter the table to add modifiers after creating the table which leads me to believe that there isn't a way to duplicate a Redshift table schema except by running the original create table statement vs the CTAS statement.
According to the docs you can do
CREATE TABLE my_table(LIKE my_old_table);