How to delete from a table in Redshift using PySpark?

I'm trying to write a PySpark script that inserts some data into a Redshift table using Databricks's spark-redshift library and deletes some older data to make room for it. Is there a way to delete data from Redshift directly through Spark, for example by executing a Spark SQL statement that updates the Redshift table in place?
I know Redshift is based on PostgreSQL and Spark SQL uses Hive, but I need to run this query every day, and with the new AWS Glue coming out, which supports PySpark, I was wondering whether there's a way to do it in PySpark.
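If you are writing through the spark-redshift connector anyway, one option is its preactions/postactions write options, which take SQL statements that the connector runs directly on Redshift before/after loading the data. The sketch below assumes that route; the format name, JDBC URL, temp dir, and the DELETE predicate are all placeholders for your own values:

# df is the DataFrame holding the new rows to insert.
(df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster:5439/mydb?user=me&password=secret")
    .option("dbtable", "my_schema.my_table")
    .option("tempdir", "s3a://my-temp-bucket/redshift-staging/")
    # SQL executed on Redshift before the new rows are appended.
    .option("preactions",
            "DELETE FROM my_schema.my_table WHERE event_date < dateadd(day, -30, current_date)")
    .mode("append")
    .save())

Alternatively, since Redshift speaks the PostgreSQL wire protocol, you could run the DELETE from the driver over a plain JDBC or psycopg2 connection before kicking off the Spark write.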

Related

AWS Glue tables created by Athena are read twice by EMR Spark

When I create a table in Athena with CTAS syntax (example below), the table is registered in Glue in such a way that when I read it on an EMR cluster with (py)spark, every partition is read twice, but when I read it with Athena, it is fine. When I create a table through Spark with the write.saveAsTable syntax, it's registered in Glue properly and is read correctly both with Spark and with Athena.
I didn't find anything in the Spark/Athena/Glue documentation about this. After some trial and error I found that there is a Glue table property that is set by Spark but not by Athena: spark.sql.sources.provider='parquet'. When I set this manually on tables created via Athena, Spark reads them properly (a sketch of setting it programmatically follows the CTAS example below). But this feels like an ugly workaround and I would like to understand what's happening in the background; I couldn't find anything about this table property.
Athena create table syntax:
CREATE TABLE {database}.{table}
WITH (format = 'Parquet',
parquet_compression = 'SNAPPY',
external_location = '{s3path}')
AS SELECT
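For reference, the table-property workaround described above could be scripted with boto3 along these lines (a sketch: database/table names are placeholders, and update_table only accepts a subset of the fields that get_table returns, hence the filtering):

import boto3

glue = boto3.client("glue")
db, tbl = "my_database", "my_table"

table = glue.get_table(DatabaseName=db, Name=tbl)["Table"]

# TableInput accepts only some of the fields returned by get_table,
# so copy over the allowed ones before adding the property.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
table_input = {k: v for k, v in table.items() if k in allowed}
table_input.setdefault("Parameters", {})["spark.sql.sources.provider"] = "parquet"

glue.update_table(DatabaseName=db, TableInput=table_input)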

Add location dynamically in Amazon Redshift create table statement

I am trying to create an external table in Amazon Redshift using the statement
mentioned at this link.
In my case I want the location to be parameterized instead of a static value.
I am using DBeaver for Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
If you don't use Hive-compatible partitions, you can also try partition projection, where you define the structure of the file locations in relation to the partitions and parameters.
If those don't work for you, you can use AWS Glue Crawlers, which supposedly detect partitions automatically: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work for you either, then your problem is very specific. I suggest rolling up your sleeves and writing some code, deployed as a Lambda or an AWS Glue Python Shell job (see the sketch after the links below). Here are some examples where other people tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/@alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
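For the "write some code" route, a minimal Lambda-style sketch that simply asks Athena to re-scan for Hive-compatible partitions might look like this (database, table, and result location are placeholders):

import boto3

athena = boto3.client("athena")

def handler(event, context):
    # MSCK REPAIR TABLE scans the table location and registers any
    # hive-style (col=value) partition folders it finds.
    resp = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]

Scheduling this with an EventBridge (CloudWatch Events) rule gives you a periodic partition refresh without running a crawler.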

Redshift COPY and create table automatically?

Is there not a way to automatically create an internal table in Redshift and then move data into it with COPY? Can I not use the metadata stored in the AWS Glue Data Catalog to create it?
Right now, as I understand it, one has to write the CREATE TABLE SQL by hand and then run COPY to move data into the table.
Edit: My problem is the table creation. For me to run COPY, a table must already exist. Can I create this table using the existing Glue metadata, or can I only write the SQL by hand?
Automatic table creation can be achieved with the workaround below.
Create the table definition by running a Glue Crawler.
Create an external schema in Redshift pointing to the Glue database:
CREATE EXTERNAL SCHEMA my_ext_db FROM DATA CATALOG DATABASE 'my_glue_db' IAM_ROLE 'arn:aws:iam:::role/my-redshift-role';
Now create a Redshift table from the external one with CTAS (add a WHERE 1=2 or LIMIT 0 clause if you want the table created empty, with just the column definitions):
CREATE TABLE my_internal_db.my_table AS SELECT * FROM my_ext_db.my_table;
You can now run the COPY command into this table, my_internal_db.my_table.
If you want to automate the data movement from the Glue catalog to a Redshift table, you can use AWS Glue to achieve it; a rough sketch of such a job follows. Please refer to this for more information.
Once you have your Glue jobs ready, you can schedule them to run at some point every day.
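A rough sketch of such a Glue (PySpark) job, assuming a Glue connection to your Redshift cluster already exists; all names here are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table straight from the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_glue_db", table_name="my_table")

# Write it into Redshift through the Glue connection, staging via S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "my_internal_db.my_table",
                        "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift/")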

How to import hive tables from Hadoop datalake to AWS RDS?

I need suggestions on importing data from a Hadoop data lake (Kerberos-authenticated) to AWS. All the Hive tables should land in S3 and then need to be loaded into AWS RDS.
I have considered the following options:
1) AWS Glue?
2) Spark connecting to the Hive metastore?
3) Connecting to Impala from AWS?
There are around 50 tables to be imported. How can I maintain the schema? Is it better to import the data and then create a separate schema in RDS?
Personally, I would dump a list of all the tables that need to be moved.
From that, run SHOW CREATE TABLE over them all and save the queries (a sketch of this step follows the answer).
Run distcp, or however else you want to move the data into S3 / EBS.
Edit each CREATE TABLE command to specify a Hive table location that's in the cloud data store. I believe you'll need to make all of these external tables, since you cannot put data directly into the Hive warehouse directory and have the metastore immediately know of it.
Run all the commands against the AWS Hive connection.
I have coworkers who have used Circus Train.
Impala and Spark are meant for processing. You're going to need to deal mostly with the Hive metastore here.
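A rough sketch of the SHOW CREATE TABLE step, using Spark with Hive support on the source cluster (database name and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

tables = [r.tableName for r in spark.sql("SHOW TABLES IN my_db").collect()]

with open("/tmp/create_tables.sql", "w") as f:
    for t in tables:
        ddl = spark.sql(f"SHOW CREATE TABLE my_db.{t}").collect()[0][0]
        # Edit the LOCATION clause to point at S3 before replaying these
        # statements against the AWS-side Hive/Glue metastore.
        f.write(ddl + ";\n\n")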

Exporting Spark Dataframe to Athena

I'm running a pyspark job which creates a dataframe and stores it to S3 as below:
df.write.saveAsTable(table_name, format="orc", mode="overwrite", path=s3_path)
I can read the ORC files without a problem just by using spark.read.orc(s3_path), so there's schema information in the ORC files, as expected.
However, I'd really like to view the dataframe contents using Athena. Clearly, if I had written to my Hive metastore, I could call Hive and do a SHOW CREATE TABLE ${table_name}, but that's a lot of work when all I want is a simple schema.
Is there another way?
One approach would be to set up a Glue crawler for your S3 path, which would create a table in the AWS Glue Data Catalog. Alternatively, you could create the Glue table definition via the Glue API (a sketch follows the link below).
The AWS Glue Data Catalog is fully integrated with Athena, so you would see your Glue table in Athena, and be able to query it directly:
http://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
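If you go the Glue API route instead of a crawler, a sketch of registering the ORC output might look like this (database, table, columns, and location are placeholders you would derive from your DataFrame's schema, e.g. via df.schema):

import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "orc"},
        "StorageDescriptor": {
            # Column names/types here would come from the DataFrame's schema.
            "Columns": [{"Name": "id", "Type": "bigint"},
                        {"Name": "value", "Type": "string"}],
            "Location": "s3://my-bucket/path/to/table/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde"
            },
        },
    },
)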