Is there not a way to automatically create an internal table in Redshift and then move data into it with COPY? Can I not use the metadata stored on AWS Glue Data Catalog to create it?
Right now as I understand it, one has to manually write SQL to create a table and then run COPY to move data into the table.
Edit: My problem is the table creation. For me to run COPY, a table must already exist. Can I create this table using the existing Glue metadata, or can I only write the SQL by hand?
Automatic table creation can be achieved with the workaround below.
Create the table definition by running a Glue Crawler.
Create an external schema in Redshift pointing to the Glue database:
CREATE EXTERNAL SCHEMA my_ext_db FROM DATA CATALOG DATABASE 'my_glue_db' IAM_ROLE 'arn:aws:iam:::role/my-redshift-role';
Now create a Redshift table from the external one. Add a false predicate so the new table inherits the column definitions but stays empty:
CREATE TABLE my_internal_db.my_table AS SELECT * FROM my_ext_db.my_table WHERE 1 = 2;
You can now run COPY against this empty table, my_internal_db.my_table.
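Putting the steps together, a minimal sketch (the role ARN, bucket, and schema/table names are placeholders, and the false predicate keeps the CTAS result empty so COPY has an empty target):

```sql
-- 1. External schema pointing at the Glue database
CREATE EXTERNAL SCHEMA my_ext_db
FROM DATA CATALOG DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';

-- 2. Empty internal table that inherits the Glue column definitions
CREATE TABLE my_internal_db.my_table AS
SELECT * FROM my_ext_db.my_table WHERE 1 = 2;

-- 3. Bulk-load the data with COPY
COPY my_internal_db.my_table
FROM 's3://my-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;
```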
If you want to automate the data movement from the Glue catalog to a Redshift table, you can use an AWS Glue job to achieve it. Please refer to this for more information.
Once you have your Glue jobs ready, you can schedule them to run at a fixed time every day.
Related
I am a newbie to the AWS ecosystem. I am creating an application that queries data using AWS Athena. Data is transformed from JSON into Parquet using AWS Glue and stored in S3.
Now the use case is to update that Parquet data using SQL.
Can we update the underlying Parquet data using an AWS Athena SQL command?
No, it is not possible to use UPDATE in Amazon Athena.
Amazon Athena is a query engine, not a database. It performs queries on data that is stored in Amazon S3. It reads those files, but it does not modify or update those files. Therefore, it cannot 'update' a table.
The closest capability is using CREATE TABLE AS to create a new table. You can provide a SELECT query that uses data from other tables, so you could effectively modify information and store it in a new table, and tell it to use Parquet for that new table. In fact, this is an excellent way to convert data from other formats into Snappy-compressed Parquet files (with partitioning, if you wish).
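As a sketch of that approach (the table names, columns, and S3 location below are made up), a CTAS query that "modifies" data on the way into a new Snappy-compressed Parquet table might look like:

```sql
CREATE TABLE new_table
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/new-table/'
) AS
SELECT id,
       UPPER(name) AS name   -- the "update" happens in the SELECT
FROM old_table;
```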
Depending on how the data is stored in Athena, you can update it using SQL UPDATE statements. See Updating Iceberg table data and Using governed tables.
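For example, if the table is created as an Iceberg table, Athena does support UPDATE directly (a sketch; the database, table, columns, and location are placeholders):

```sql
CREATE TABLE my_db.events (
  id bigint,
  status string
)
LOCATION 's3://my-bucket/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

UPDATE my_db.events
SET status = 'archived'
WHERE id < 1000;
```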
When I create a table in Athena with CTAS syntax (example below), the table is registered to Glue in such a way that when I read it on an EMR cluster with (py)spark, every partition is read twice; when I read it with Athena, it is fine. When I create a table through Spark with the write.saveAsTable syntax, it is registered to Glue properly and is read properly both with Spark and with Athena.
I didn't find anything in the Spark/Athena/Glue documentation about this. After some trial and error I found out that there is a Glue table property that is set by Spark but not by Athena: spark.sql.sources.provider='parquet'. When I set this manually on tables created via Athena, Spark reads them properly. But this feels like an ugly workaround, and I would like to understand what's happening in the background; I couldn't find anything about this table property.
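If you want to script the workaround instead of editing the table in the console, the property can be set through the Glue UpdateTable API. The sketch below is an assumption-laden illustration (database/table names are placeholders); the first helper only builds the updated table definition, which keeps it testable without AWS:

```python
def with_spark_provider(table: dict, provider: str = "parquet") -> dict:
    """Return a Glue TableInput with spark.sql.sources.provider set.

    `table` is the 'Table' dict returned by get_table(). UpdateTable
    rejects read-only fields (CreatedBy, CreateTime, ...), so only the
    writable ones are copied into the new TableInput.
    """
    writable = {
        k: v
        for k, v in table.items()
        if k in {"Name", "Description", "Retention", "StorageDescriptor",
                 "PartitionKeys", "TableType", "Parameters"}
    }
    params = dict(writable.get("Parameters", {}))
    params["spark.sql.sources.provider"] = provider
    writable["Parameters"] = params
    return writable


def patch_table(database: str, table_name: str) -> None:
    """Fetch the table from Glue and write it back with the property set."""
    import boto3  # deferred so the helper above works without AWS installed
    glue = boto3.client("glue")
    current = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=with_spark_provider(current))
```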
Athena create table syntax:
CREATE TABLE {database}.{table}
WITH (format = 'Parquet',
parquet_compression = 'SNAPPY',
external_location = '{s3path}')
AS SELECT
I am trying to create an external table in Amazon Redshift using the statement
mentioned at this link.
In my case I want the location to be parameterized instead of a static value.
I am using DBeaver for Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
You can also try partition projection if you don't use Hive-compatible partitions; there you define the structure of the file locations in relation to the partition columns.
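A sketch of what partition projection looks like (the table name, partition column dt, and bucket are placeholders):

```sql
ALTER TABLE my_db.my_table SET TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.dt.type'        = 'date',
  'projection.dt.range'       = '2020-01-01,NOW',
  'projection.dt.format'      = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-bucket/data/${dt}/'
);
```

With projection enabled, Athena computes the partition locations from these properties instead of looking them up in the catalog, so no partitions ever need to be added.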
If those don't work with you, you can use AWS Glue Crawlers which supposedly automatically detect partitions: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work for you either, then your problem is very specific. I suggest rolling up your sleeves and writing some code to deploy on Lambda or an AWS Glue Python Shell job. Here are some examples where other people tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/@alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
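As a sketch of the "write some code" route (all names below are made up), a small helper that turns Hive-style S3 keys into a single Athena ALTER TABLE statement you could then execute:

```python
def partitions_from_keys(keys):
    """Extract Hive-style partition specs (e.g. dt=2021-01-01) from S3 keys."""
    parts = set()
    for key in keys:
        spec = tuple(seg for seg in key.split("/") if "=" in seg)
        if spec:
            parts.add(spec)
    return sorted(parts)


def alter_table_statement(table, partitions, bucket, prefix):
    """Build one ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement."""
    clauses = []
    for spec in partitions:
        pairs = ", ".join("{} = '{}'".format(*s.split("=", 1)) for s in spec)
        location = "s3://{}/{}/{}/".format(bucket, prefix, "/".join(spec))
        clauses.append("PARTITION ({}) LOCATION '{}'".format(pairs, location))
    return "ALTER TABLE {} ADD IF NOT EXISTS {}".format(table, " ".join(clauses))
```

The resulting statement could be submitted from a Lambda function (for example via the Athena StartQueryExecution API) whenever new objects land.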
We have a Glue crawler that reads Avro files in S3 and creates a table in the Glue catalog accordingly.
The thing is that we have a column named 'foo' that comes from the Avro schema, and we also have something like 'foo=XXXX' in the S3 bucket path, to have Hive partitions.
What we did not know is that the crawler will then create a table with two columns of the same name, hence our issue while querying the table:
HIVE_INVALID_METADATA: Hive metadata for table mytable is invalid: Table descriptor contains duplicate columns
Is there a way to tell Glue to map the partition 'foo' to another column name like 'bar'?
That way we would avoid having to reprocess our data by specifying a new partition name in the S3 bucket path.
Or any other suggestions?
Glue Crawlers are pretty terrible; this is just one of the many ways they create unusable tables. I think you're better off just creating the tables and partitions with a simple script. Create the table without the foo column, and then write a script that lists your files on S3 and does the Glue API calls (BatchCreatePartition), or executes ALTER TABLE … ADD PARTITION … statements in Athena.
Whenever new data is added on S3, just add the new partitions with the API call or an Athena query. There is no need to do all the work that Glue Crawlers do if you know when and how data is added. If you don't, you can use S3 notifications to run Lambda functions that do the Glue API calls instead. Almost all solutions are better than Glue Crawlers.
The beauty of Athena and Glue Catalog is that it's all just metadata, it's very cheap to throw it all away and recreate it. You can also create as many tables as you want that use the same location, to try out different schemas. In your case there is no need to move any objects on S3, you just need a different table and a different mechanism to add partitions to it.
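A sketch of what that hand-written table could look like (columns, location, and the Avro storage details are placeholders): the partition column is declared as bar even though the S3 paths contain foo=..., which works because each partition's location is given explicitly.

```sql
CREATE EXTERNAL TABLE my_db.mytable (
  id  bigint,
  foo string              -- the data column from the Avro schema
)
PARTITIONED BY (bar string)  -- renamed partition column
STORED AS AVRO
LOCATION 's3://my-bucket/data/';

ALTER TABLE my_db.mytable ADD IF NOT EXISTS
  PARTITION (bar = 'XXXX') LOCATION 's3://my-bucket/data/foo=XXXX/';
```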
You can fix this by updating the schema of the Glue table and renaming the duplicate column:
Open the AWS Glue console.
Choose the table name from the list, and then choose Edit schema.
Choose the column name foo (not the partitioned column foo), enter a new name, and then choose Save.
Reference:
Resolve HIVE_INVALID_METADATA error
I'm running a pyspark job which creates a dataframe and stores it to S3 as below:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC file without a problem just by using spark.read.orc(s3_path), so there is schema information in the file, as expected.
However, I'd really like to view the dataframe contents using Athena. Clearly, if I wrote to my Hive metastore I could call Hive and do a show create table ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?
One of the approaches would be to set up a Glue crawler for your S3 path, which would create a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition via the Glue API.
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
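For the API route, a sketch of building a Glue table definition for an ORC dataset (the table name, location, and columns are placeholders; the first helper is pure so it can be exercised without AWS):

```python
def orc_table_input(name, s3_path, columns):
    """Build a Glue TableInput dict for an external ORC table.

    `columns` is a list of (name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "orc", "EXTERNAL": "TRUE"},
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns],
            "Location": s3_path,
            "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
            },
        },
    }


def register_table(database, name, s3_path, columns):
    """Create the table in the Glue Data Catalog (visible to Athena)."""
    import boto3  # deferred so orc_table_input works without AWS installed
    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database,
                      TableInput=orc_table_input(name, s3_path, columns))
```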