Has anyone tried adding compression while using the CTAS command in Redshift?
I did not find anything about this in the documentation.
https://docs.aws.amazon.com/redshift/latest/dg/r_CTAS_usage_notes.html
Thanks
You can't, as per the CTAS Usage Notes - Amazon Redshift:
CREATE TABLE AS (CTAS) tables don't inherit constraints, identity columns, default column values, or the primary key from the table that they were created from.
You can't specify column compression encodings for CTAS tables.
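A common workaround is to skip CTAS entirely: create the target table yourself with explicit encodings, then populate it with INSERT INTO ... SELECT. A minimal sketch, assuming a hypothetical sales table (the table, column names, and encoding choices here are all assumptions):
-- Hypothetical workaround: define the encodings up front,
-- then load the table with INSERT ... SELECT instead of CTAS.
CREATE TABLE sales_copy (
    sale_id   BIGINT        ENCODE az64,
    sale_date DATE          ENCODE az64,
    amount    DECIMAL(10,2) ENCODE az64,
    region    VARCHAR(32)   ENCODE lzo
)
DISTKEY (sale_id)
SORTKEY (sale_date);

-- Load the data that CTAS would have selected.
INSERT INTO sales_copy
SELECT sale_id, sale_date, amount, region
FROM sales;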
Related
I have seen here https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/ that Redshift Spectrum has support for Hudi and Delta.
We're using Iceberg right now as a file format, and we have the requirement to read some tables externally in redshift spectrum for the BI Team.
I have created an external schema and an external table, but when I try to read the table, Redshift Spectrum gives me more data than it should.
We are upserting data based on a primary key, so Redshift Spectrum, the way I tried it, returns every record for the same id instead of only the latest version (like a partition by id). Has anyone managed to integrate Iceberg with AWS Redshift Spectrum successfully?
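For now the only stopgap I have is deduplicating in the query itself. A sketch, assuming the table exposes a column that orders the versions (spectrum_schema.my_iceberg_table and updated_at are stand-ins for our actual names):
-- Hypothetical workaround: keep only the newest row per id,
-- assuming updated_at orders the upserted versions.
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM spectrum_schema.my_iceberg_table t
) dedup
WHERE rn = 1;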
I am a newbie to the AWS ecosystem. I am creating an application which queries data using AWS Athena. Data is transformed from JSON into Parquet using AWS Glue and stored in S3.
Now the use case is to update that Parquet data using SQL.
Can we update the underlying Parquet data using an AWS Athena SQL command?
No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
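For example, a minimal CTAS sketch (the bucket, table, and column names here are assumptions) that rewrites a table as Snappy-compressed Parquet with partitioning:
-- Hypothetical CTAS: convert old_table to Snappy-compressed Parquet.
-- Partition columns must come last in the SELECT list.
CREATE TABLE new_table
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/new_table/',
    partitioned_by = ARRAY['year']
) AS
SELECT id, value, year
FROM old_table;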
Depending on how the data is stored, you can update it in Athena using SQL UPDATE statements. See Updating Iceberg table data and Using governed tables.
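For an Iceberg table this is ordinary SQL. A sketch, with assumed table and column names (note this works only on Iceberg tables, not plain Parquet tables):
-- Hypothetical UPDATE against an Athena Iceberg table.
UPDATE my_iceberg_table
SET status = 'processed'
WHERE id = 42;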
We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly.
The thing is that we have a column named 'foo' that comes from the Avro schema, and we also have something like 'foo=XXXX' in the S3 bucket path, to have Hive partitions.
What we did not know is that the crawler would then create a table with two columns of the same name, hence our issue while querying the table:
HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns
Is there a way to tell Glue to map the partition 'foo' to another column name like 'bar'?
That way we would avoid having to reprocess our data by specifying a new partition name in the S3 bucket path.
Or any other suggestions?
Glue Crawlers are pretty terrible; this is just one of the many ways they create unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, then write a script that lists your files on S3 and makes the Glue API calls (BatchCreatePartition), or execute ALTER TABLE … ADD PARTITION … calls in Athena.
Whenever new data is added on S3, just add the new partitions with the API call or an Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notifications to run Lambda functions that make the Glue API calls instead. Almost all solutions are better than Glue Crawlers.
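As a sketch (the bar column name and the bucket path are assumptions), this also sidesteps the duplicate-column problem, because the table can declare the partition column as bar while the LOCATION still points at the existing foo=XXXX prefix:
-- Hypothetical Athena DDL: register a partition under a different
-- column name than the one encoded in the S3 path.
ALTER TABLE mytable ADD IF NOT EXISTS
PARTITION (bar = 'XXXX')
LOCATION 's3://my-bucket/data/foo=XXXX/';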
The beauty of Athena and Glue Catalog is that it's all just metadata; it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3, you just need a different table and a different mechanism to add partitions to it.
You can fix this by updating the schema of the Glue table and renaming the duplicate column:
Open the AWS Glue console.
Choose the table name from the list, and then choose Edit schema.
Choose the column name foo (not the partitioned column foo), enter a new name, and then choose Save.
Reference:
Resolve HIVE_INVALID_METADATA error
I need to store Amazon Athena query results into New Amazon Athena Table.
Updates:
Athena now supports Create Table as Select Queries (CTAS). Examples are available here.
They have implemented several nice features, namely the ability to apply compression to outputs (GZIP, SNAPPY) and to choose the output format.
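A minimal sketch of storing a query's results as a new compressed table (the table names here are assumptions):
-- Hypothetical CTAS storing query results in a new Athena table.
CREATE TABLE my_query_results
WITH (
    format = 'PARQUET',
    parquet_compression = 'GZIP'
) AS
SELECT id, COUNT(*) AS cnt
FROM source_table
GROUP BY id;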
Recently they have added Create Table As Select (CTAS) support to Athena.
I am wondering if there is a table or a query to list all the tables created by a certain user. I looked around the Redshift system tables but couldn't find one.
PostgreSQL 8 queries will usually work in Redshift. You can do this:
select * from pg_catalog.pg_tables where tableowner='certainuser';
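One caveat: pg_tables reports the current owner, not necessarily the creator, so this breaks down if ownership has been transferred. A variant sketch that also returns the schema name ('certainuser' is a placeholder):
-- Join the catalog tables directly to get schema, table, and owner.
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       u.usename AS owner
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN pg_user u ON u.usesysid = c.relowner
WHERE c.relkind = 'r'
  AND u.usename = 'certainuser';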