Unable to use upsert while creating a Glue job - amazon-web-services

When I create a Glue job with the upsert option for a Redshift target, AWS Glue says "Unable to derive physical location of the redshift table". With the insert option the same job works fine; it is only upsert that fails with this error.
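While this error is open, a common workaround pattern is to skip the built-in upsert option and emulate the merge yourself: write the incoming data to a staging table and run SQL pre/post actions around the write. Below is a minimal sketch of that pattern from a Glue job; the catalog connection name, database, table names, and key column are all hypothetical, not taken from the question.

```python
# Minimal sketch of an upsert-style write to Redshift from a Glue job by
# staging the data and merging via SQL pre/post actions. All names
# (connection, database, tables, key column, S3 paths) are hypothetical.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# DynamicFrame holding the rows to upsert into public.target_table.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_source_table")

pre_actions = (
    "CREATE TABLE IF NOT EXISTS public.stage_table (LIKE public.target_table);"
)
post_actions = """
    BEGIN;
    DELETE FROM public.target_table
      USING public.stage_table
      WHERE public.target_table.id = public.stage_table.id;
    INSERT INTO public.target_table SELECT * FROM public.stage_table;
    DROP TABLE public.stage_table;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-conn",      # Glue connection to the cluster
    connection_options={
        "dbtable": "public.stage_table",     # land the data in the staging table
        "database": "dev",
        "preactions": pre_actions,
        "postactions": post_actions,
    },
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",
)
```

The post actions run inside Redshift after the staged load, so the delete-then-insert behaves like an upsert keyed on the chosen column.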

Related

AWS Glue job (PySpark) to AWS Glue Data Catalog

As I understand it, the usual procedure for getting data from a PySpark script (AWS Glue job) into the AWS Glue Data Catalog is to write it to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may specify the table manually. The crawler only discovers the schema; if you set the schema manually, you should be able to read your data when you run the AWS Glue job.
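For illustration, here is a minimal sketch of registering such a table in the Data Catalog with boto3 instead of a crawler. The database name, table name, columns, delimiter, and S3 path are hypothetical placeholders.

```python
# Minimal sketch: register a CSV table in the Glue Data Catalog directly
# with boto3, so no crawler is needed. Database, table, columns, and the
# S3 location are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_events",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_id", "Type": "string"},
                {"Name": "event_date", "Type": "date"},
                {"Name": "payload", "Type": "string"},
            ],
            "Location": "s3://my-bucket/events/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```

Once the table exists, the Glue job can write matching files to the S3 location and the data is immediately queryable through the catalog.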
We had this same problem for one of our customers, who had millions of small files in AWS S3. The crawler would practically stall and keep running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries at Amazon Athena.
2. The Python Shell job lists the contents of the folder s3:///event_date=<Put the Date Here from #2.1>.
3. The query fired:
alter table add partition (event_date='<event_date from above>', eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
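A minimal sketch of that Python Shell job using AWS Wrangler (awswrangler) follows; the bucket, database, table, and partition layout are hypothetical stand-ins for the real ones.

```python
# Minimal sketch of the Python Shell approach described above: list S3 to
# derive the event names, then fire ALTER TABLE ADD PARTITION via Athena
# with AWS Wrangler. Bucket, database, table, and values are hypothetical.
import awswrangler as wr

event_date = "2021-01-15"

# List the objects under the given event_date prefix to derive event names.
keys = wr.s3.list_objects(f"s3://my-bucket/events/event_date={event_date}/")
event_names = sorted({
    key.split("/eventname=")[1].split("/")[0]
    for key in keys if "/eventname=" in key
})

# Add one partition per derived event name.
for event_name in event_names:
    wr.athena.start_query_execution(
        sql=(
            "ALTER TABLE my_events ADD IF NOT EXISTS PARTITION "
            f"(event_date='{event_date}', eventname='{event_name}')"
        ),
        database="my_database",
        wait=True,
    )
```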
If you are not expecting the schema to change, create the Glue database and tables manually and then use the Glue job directly.

UINT_64 causing errors in Athena

Here's my issue with unsigned integers in our source database (MySQL RDS):
I use AWS DMS to do an initial load of the source table; the target is S3 (Zone1 of our data lake), saved as Parquet. I can then crawl it with Glue and query the table in Athena. All good here.
I then created a Glue job to read that Zone1 Data Catalog table and output to another S3 bucket (our Zone2). However, the Glue job fails with: Parquet type not supported: INT64 (UINT_64)
Does anyone have a workaround that I can put into the Glue job to 'cast' this data type to something else?
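I can't promise this gets past the Parquet reader error itself, but if the Zone1 frame can be loaded at all, one thing to try is an explicit cast inside the Glue job before writing to Zone2; if the frame cannot even be read, the cast has to happen upstream (for example with a DMS data-type transformation rule). The sketch below is a rough illustration of the in-job cast, with hypothetical database, table, and column names.

```python
# Rough sketch: explicitly cast the unsigned column inside the Glue job.
# Only applicable if the Zone1 DynamicFrame can be loaded at all; database,
# table, column names, and the target type are hypothetical.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="zone1_db", table_name="my_table")

# Cast the UINT_64-backed column to decimal(20,0), which covers the full
# unsigned 64-bit range, then convert back to a DynamicFrame for the write.
df = dyf.toDF().withColumn("id", col("id").cast("decimal(20,0)"))
casted = DynamicFrame.fromDF(df, glue_context, "casted")

glue_context.write_dynamic_frame.from_options(
    frame=casted,
    connection_type="s3",
    connection_options={"path": "s3://zone2-bucket/my_table/"},
    format="parquet",
)
```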

Add location dynamically in Amazon Redshift create table statement

I am trying to create an external table in Amazon Redshift using the statement
mentioned at this link.
In my case I want the LOCATION to be parameterized instead of a static value.
I am using DBeaver for Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
You can also try partition projection if you don't use Hive-compatible partitions; there you define the structure of the file locations in relation to the partitions and parameters.
If those don't work for you, you can use AWS Glue crawlers, which supposedly detect partitions automatically: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work either, then your problem is very specific. I suggest rolling up your sleeves and writing some code, deployed on Lambda or an AWS Glue Python Shell job (a sketch follows the links below). Here are a few examples where other people tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/#alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
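As a minimal sketch of that Lambda / Python Shell route, the snippet below adds a partition programmatically through Athena with boto3. The table, database, S3 locations, and partition values are hypothetical.

```python
# Minimal sketch: add a partition programmatically via Athena with boto3,
# e.g. from a Lambda function or a Glue Python Shell job. All names and
# S3 paths are hypothetical.
import boto3

athena = boto3.client("athena")

event_date = "2021-01-15"

response = athena.start_query_execution(
    QueryString=(
        "ALTER TABLE my_external_table ADD IF NOT EXISTS "
        f"PARTITION (event_date='{event_date}') "
        f"LOCATION 's3://my-bucket/data/event_date={event_date}/'"
    ),
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```

Because the date is just a Python variable, the LOCATION is effectively parameterized each run, which is what the static CREATE EXTERNAL TABLE statement cannot do on its own.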

HIVE_UNKNOWN_ERROR when running AWS Athena query on Glue table (RDS)

I'm getting an error when running an Athena query against a Glue table created from an RDS database:
HIVE_UNKNOWN_ERROR: Unable to create input format
The tables are created using a crawler. The tables show up correctly in the Glue interface:
However, they do not show up in the Athena interface under the database. It says: "The selected database has no tables"
I do not see this behaviour with a database created from an S3 file. Maybe that is related to the error. Does anybody have an idea?
I had the same problem. This is the answer that I have got from AWS Support:
I understand that you set up a Glue crawler to crawl your RDS PostgreSQL database, but the tables are not visible in Athena.
The Athena service is designed to query tables that point to S3 as the data source. It cannot read data from non-S3 resources as of today.
So, unfortunately not possible at the moment.

How to delete from a table in Redshift using PySpark?

I'm trying to write a PySpark script that inserts some data into a Redshift table using Databricks's spark-redshift library and deletes some older data to make room for it. Is there a way to delete data from Redshift directly using Spark? Like maybe executing a Spark SQL statement that updates the Redshift table directly?
I know Redshift uses Postgres and Spark uses Hive, but I need to run this query every day and, with the new AWS Glue coming (which supports PySpark), I was wondering if there's a way to do it in PySpark.
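One hedged option with the Databricks spark-redshift connector the question mentions is its preactions/postactions options, which run SQL in Redshift immediately before or after the write, so the DELETE can ride along with the insert. A minimal sketch is below; the JDBC URL, credentials, table, temp bucket, and DELETE predicate are hypothetical.

```python
# Minimal sketch: delete older rows in Redshift as part of a spark-redshift
# write by running SQL through the connector's "preactions" option.
# JDBC URL, credentials, table, bucket, and the predicate are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "2017-06-01")], ["id", "event_date"])

(df.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://my-cluster:5439/dev?user=admin&password=secret")
   .option("dbtable", "public.my_table")
   .option("tempdir", "s3a://my-temp-bucket/redshift/")
   # SQL executed in Redshift before the new data is appended.
   .option("preactions", "DELETE FROM public.my_table WHERE event_date < '2017-01-01';")
   .mode("append")
   .save())
```

Scheduled daily (for example as a Glue PySpark job), this keeps the delete and the insert in the same run without a separate Postgres client step.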