Apache Iceberg on Redshift Spectrum, is it possible?

I have seen here https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/ that Redshift Spectrum has support for Hudi and Delta.
We're using Iceberg right now as a file format, and we have the requirement to read some tables externally in redshift spectrum for the BI Team.
I have created an external schema and an external table, but when I try to read the table, Redshift Spectrum gives me more data than it should.
We are upserting data based on a primary key, but the way I tried it, Redshift Spectrum returns all records for the same id instead of only the latest version (like a partition by id). Has anyone had success integrating Iceberg with AWS Redshift Spectrum?
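For context, this is roughly the setup I tried (the schema, role, and bucket names below are placeholders):

CREATE EXTERNAL SCHEMA iceberg_ext
FROM DATA CATALOG
DATABASE 'analytics'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';

CREATE EXTERNAL TABLE iceberg_ext.my_table (
  id         BIGINT,
  updated_at TIMESTAMP,
  payload    VARCHAR(256)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/warehouse/my_table/data/';

My understanding is that an external table defined this way simply scans every Parquet file under the location and ignores Iceberg's snapshot/manifest metadata, which would explain why all row versions come back.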

Related

AWS Athena - UPDATE table rows using SQL

I am a newbie to the AWS ecosystem. I am creating an application which queries data using AWS Athena. Data is transformed from JSON into Parquet using AWS Glue and stored in S3.
Now the use case is to update that Parquet data using SQL.
Can we update the underlying Parquet data using an AWS Athena SQL command?
No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
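For example, a minimal CTAS sketch (all table, column, and bucket names here are invented) that rewrites data as Snappy-compressed, partitioned Parquet:

CREATE TABLE sales_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/sales_parquet/',
  partitioned_by = ARRAY['sale_date']
)
AS
SELECT id,
       amount,
       sale_date        -- partition columns must come last in the SELECT
FROM sales_raw
WHERE amount IS NOT NULL;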
Depending on how the data is stored, you can update it in Athena using SQL UPDATE statements. See Updating Iceberg table data and Using governed tables.
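As a sketch (names invented): if the table is created as an Apache Iceberg table, Athena supports row-level UPDATE against it:

CREATE TABLE events_iceberg (
  id     BIGINT,
  status STRING
)
LOCATION 's3://my-bucket/events_iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- Row-level DML works on Iceberg tables (Athena engine version 3).
UPDATE events_iceberg SET status = 'closed' WHERE id = 42;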

How to control increasing data volume in Redshift?

I have a data warehouse maintained in AWS Redshift. The data volume and velocity have both increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire data as usual (maybe with a compromise in querying time) but with little or no additional cost?
One option would be to use external tables and query data directly from S3, but the tools used for achieving this, like Athena and Glue, have their own cost, and on a per-query basis at that.
Easy options:
Ensure all tables have compression; check with SELECT * FROM svv_table_info. (A sketch of the first three options follows this list.)
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables (under roughly 50k rows, depending on the table) to DISTSTYLE ALL (yes, this saves space!).
Switch from SSD-based nodes (dc2) to HDD nodes (ds2), which have 8x more storage space.
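A sketch of those first three options, with made-up table names:

-- 1. Find tables with little or no compression (the 'encoded' column).
SELECT "table", encoded, diststyle, size
FROM svv_table_info;

-- 2. Recreate (or deep-copy) large tables with ZSTD column encoding.
CREATE TABLE sales_zstd (
  sale_id BIGINT        ENCODE zstd,
  sale_ts TIMESTAMP     ENCODE zstd,
  amount  DECIMAL(12,2) ENCODE zstd
);

-- 3. Replicate small dimension tables to every node.
CREATE TABLE dim_country (
  country_code CHAR(2),
  country_name VARCHAR(64)
)
DISTSTYLE ALL;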
Less easy options:
UNLOAD older data from Redshift to S3 and query it using Redshift Spectrum (see the sketch after this list).
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.
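Note that newer Redshift releases can also UNLOAD directly to partitioned Parquet, which collapses these two steps into one. A sketch, with placeholder names:

UNLOAD ('SELECT sale_id, amount, sale_year FROM sales WHERE sale_year < 2017')
TO 's3://my-archive-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (sale_year);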
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is less frequently accessed, you could export (UNLOAD) it into Amazon S3, preferably as compressed, partitioned data; storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to query external data in Amazon S3. You can even join external data with Redshift data, so you could query historical information and current information in one query.
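A sketch of such a combined query (schema and table names are invented):

SELECT c.customer_name,
       SUM(h.amount) AS historical_spend
FROM spectrum_archive.sales_history AS h   -- external table over S3
JOIN customers AS c                        -- regular Redshift table
  ON c.customer_id = h.customer_id
GROUP BY c.customer_name;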
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data read from disk. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.

Store Amazon Athena Query Results into new Table

I need to store Amazon Athena query results into a new Amazon Athena table.
Update:
Athena now supports Create Table As Select queries (CTAS). Examples are available here.
They have implemented several nice features, namely the ability to apply compression to outputs (GZIP, SNAPPY) and to specify the output format.
Recently they have added Create Table As Select (CTAS) support to Athena.

Amazon Redshift table to external table in S3 every hour

I would like to export data from an Amazon Redshift table into an external table stored in Amazon S3. Every hour, I want to export rows from the Redshift source into the external table target.
What kind of options exist in AWS to achieve this?
I know that there is the UNLOAD command that allows me to export data to S3, but I think it would not work to store the data into an external table (which is partitioned too). Or is Amazon EMR probably the only method to get this working?
Amazon Redshift Spectrum external tables are read-only. You cannot update them from Redshift (eg via INSERT commands).
Therefore, you would need a method to create the files directly in S3.
UNLOAD can certainly do this, but it cannot save the data in a partition structure.
Amazon EMR would, indeed, be a good option. These days it is charged per-second, so it would only need to run long enough to export the data. You could use your preferred tool (eg Hive or Spark) to export the data from Redshift, then write it into a partitioned external table.
For example, see: Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning | AWS Big Data Blog
Another option might be AWS Glue. I'm not too familiar with it, but it can output into partitions, so this might be an even easier method to accomplish your goal!
See: Managing Partitions for ETL Output in AWS Glue - AWS Glue
It's now possible to INSERT into an external table, since June 2020 I think:
https://aws.amazon.com/about-aws/whats-new/2020/06/amazon-redshift-now-supports-writing-to-external-tables-in-amazon-s3/
And here's the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_INSERT_external_table.html
Basically there are 2 ways:
INSERT INTO external_schema.table_name { select_statement }
Or
CREATE EXTERNAL TABLE AS { SELECT }
Typically, when you create your Redshift external schema (e.g. my_stg), you specify the Glue database name, so any external table you create inside that external schema is already known to the Glue catalog database.
That's good news, since the OP's question is from 2018 👍
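Putting this together for the hourly export in the question, a sketch (the my_stg schema is from the example above; every other name and the bucket are placeholders):

CREATE EXTERNAL TABLE my_stg.hourly_events (
  event_id BIGINT,
  payload  VARCHAR(1000)
)
PARTITIONED BY (event_hour VARCHAR(13))
STORED AS PARQUET
LOCATION 's3://my-bucket/hourly_events/';

-- Run every hour; the partition column must come last in the SELECT list.
INSERT INTO my_stg.hourly_events
SELECT event_id,
       payload,
       TO_CHAR(event_ts, 'YYYY-MM-DD-HH24') AS event_hour
FROM events
WHERE event_ts >= DATEADD(hour, -1, GETDATE());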

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface.
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour when using a database created using an S3 file. Maybe this is related to the error. Does anybody have an idea?
I had the same problem. This is the answer that I have got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database, but the tables are not visible in Athena.
Athena service is designed to query tables that point to S3 as data-source. It cannot read data from non-S3 resources as of today.
So, unfortunately not possible at the moment.