Batch loading into AWS RDS (postgres) from PySpark - amazon-web-services

I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter.
I have this working for RedShift as follows:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
where df is defined above to read in a file. What is the best approach I could take to do this against RDS instead of Redshift?

If in RDS you only need APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the below:
postgres_url = "jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url, table="actor1", mode="append")     # for append
df.write.jdbc(postgres_url, table="actor1", mode="overwrite")  # for overwrite
If it involves UPSERTs, then you can probably use a database driver library (such as psycopg2) as an external Python library and perform INSERT ... ON CONFLICT yourself (the Postgres equivalent of MySQL's INSERT ... ON DUPLICATE KEY).
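For example, a minimal upsert sketch for a Postgres target, assuming psycopg2 is available to the job and that a hypothetical actor1 table has actor_id as its primary key (the endpoint and credentials are placeholders):
import psycopg2

# placeholder connection details; adjust for your RDS instance
conn = psycopg2.connect(
    host="your-rds-endpoint.amazonaws.com",
    dbname="sakila",
    user="<user>",
    password="<pwd>",
)

# e.g. a small batch collected from the DataFrame
rows = [(1, "PENELOPE"), (2, "NICK")]

# "with conn" commits the transaction on success
with conn, conn.cursor() as cur:
    cur.executemany(
        """
        INSERT INTO actor1 (actor_id, first_name)
        VALUES (%s, %s)
        ON CONFLICT (actor_id)
        DO UPDATE SET first_name = EXCLUDED.first_name
        """,
        rows,
    )
conn.close()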
Please refer to this question: How to use JDBC source to write and read data in (Py)Spark?
regards
Yuva

I learned that this can only be done through JDBC, e.g.:
df.write.format("jdbc") \
.option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
.option("dbtable", REDSHIFT_TABLE_NAME) \
.option("tempdir", args["TempDir"]) \
.option("forward_spark_s3_credentials", "true") \
.mode("overwrite") \
.save()
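For a Postgres RDS target specifically, the tempdir and forward_spark_s3_credentials options only matter to the Redshift connector and can be dropped. A minimal sketch, assuming the PostgreSQL JDBC driver is available to the job and using placeholder connection details:
# placeholder endpoint, database, table and credentials
rds_url = "jdbc:postgresql://your-rds-endpoint.amazonaws.com:5432/mydatabase"

df.write \
    .format("jdbc") \
    .option("url", rds_url) \
    .option("dbtable", "my_table") \
    .option("user", "myuser") \
    .option("password", "mypassword") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()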

Related

How to store JDBC locally to run pyspark in EMR cluster

I'm trying to run some code locally against AWS, but I'm not sure where to store the JDBC drivers. The goal is to have my PySpark application read an RDS database to do an ELT process from a cluster.
I'm getting two sets of errors:
First: Cannot locate jar files
Second: Error: Missing Additional resource
Here is what my code looks like:
import os
from pyspark.sql import SparkSession

# note: when set from a plain Python process, PYSPARK_SUBMIT_ARGS only takes effect if it ends with 'pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/driver/jars --jars /file/path/to/jars pyspark-shell'

spark = SparkSession.builder.getOrCreate()

# use either "dbtable" or "query" with the JDBC source, not both
post_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://url-to-rds-amazonaws.com") \
    .option("user", "myuser") \
    .option("password", 'password') \
    .option("query", "select * from mytable") \
    .load()

post_df.createOrReplaceTempView("post_fin_v")
transformed_df = spark.sql('''
perform more aggregation here
''')

transformed_df.write.format("jdbc").mode("append") \
    .option("url", "jdbc:sqlserver://url-to-rds-amazonaws.com") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", 'password') \
    .save()

How to query AWS RedShift from AWS Glue PySpark Job

I have a Redshift cluster which is not publicly accessible. I want to query the database in the cluster from a Glue job using PySpark. I have tried this snippet, but I'm getting a timeout error.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Glue to RedShift") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://redshift-cluster-***************redshift.amazonaws.com:5439/dev") \
    .option("user", "******") \
    .option("password", "************") \
    .option("query", "Select * from category limit 10") \
    .option("tempdir", "s3a://e-commerce-website-templates/ahmad") \
    .option("aws_iam_role", "arn:aws:iam::337618512328:role/glue_s3_redshift") \
    .load()
df.show()
Any help would be appreciated. Thanks in advance.
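Since the cluster is not publicly accessible, the timeout is most likely a networking problem rather than a code problem: the Glue job needs a connection (and therefore an elastic network interface) in the Redshift cluster's VPC/subnet, with a security group that allows traffic on port 5439. Once connectivity is in place, the read can mirror the spark-redshift write shown at the top of this page; a sketch with placeholder values:
# placeholders for the endpoint, credentials, temp dir and IAM role
redshift_url = "jdbc:redshift://your-cluster.redshift.amazonaws.com:5439/dev?user=<user>&password=<pwd>"

df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("query", "select * from category limit 10") \
    .option("tempdir", "s3a://your-bucket/temp/") \
    .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<redshift-s3-role>") \
    .load()

df.show()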

How to input shell parameters in AWS CLI

I've got two shell parameters
AID="subnet-00000"
BID="subnet-11111"
And I can't execute the statement below.
aws rds create-db-subnet-group \
    --db-subnet-group-name dbsubnet-$service_name \
    --db-subnet-group-description "dbsubnet-$service_name" \
    --subnet-ids '[$AID, $BID]'
The error message says:
Expecting value: line 1 column 2 (char 1)
How can I put my parameters into aws cli statement?
Since you've used single quotes, the variables won't be expanded. Also, you can skip the square brackets:
aws rds create-db-subnet-group \
    --db-subnet-group-name dbsubnet-$service_name \
    --db-subnet-group-description "dbsubnet-$service_name" \
    --subnet-ids $AID $BID

How to get the full results of a query to CSV file using AWS/Athena from CLI?

I need to download the full content of a table that I have in my AWS Glue catalog using AWS Athena. At the moment I run a select * from my_table from the console and save the result locally as CSV, also from the console. Is there a way to get the same result using the AWS CLI?
From the documentation I can see https://docs.aws.amazon.com/cli/latest/reference/athena/get-query-results.html, but it is not quite what I need.
You can run an Athena query with AWS CLI using the aws athena start-query-execution API call. You will then need to poll with aws athena get-query-execution until the query is finished. When that is the case the result of that call will also contain the location of the query result on S3, which you can then download with aws s3 cp.
Here's an example script:
#!/usr/bin/env bash
region=us-east-1 # change this to the region you are using
query='SELECT NOW()' # change this to your query
output_location='s3://example/location' # change this to a writable location
query_execution_id=$(aws athena start-query-execution \
    --region "$region" \
    --query-string "$query" \
    --result-configuration "OutputLocation=$output_location" \
    --query QueryExecutionId \
    --output text)

while true; do
    status=$(aws athena get-query-execution \
        --region "$region" \
        --query-execution-id "$query_execution_id" \
        --query QueryExecution.Status.State \
        --output text)
    if [[ $status != 'RUNNING' ]]; then
        break
    else
        sleep 5
    fi
done

if [[ $status = 'SUCCEEDED' ]]; then
    result_location=$(aws athena get-query-execution \
        --region "$region" \
        --query-execution-id "$query_execution_id" \
        --query QueryExecution.ResultConfiguration.OutputLocation \
        --output text)
    exec aws s3 cp "$result_location" -
else
    reason=$(aws athena get-query-execution \
        --region "$region" \
        --query-execution-id "$query_execution_id" \
        --query QueryExecution.Status.StateChangeReason \
        --output text)
    echo "Query $query_execution_id failed: $reason" 1>&2
    exit 1
fi
If your primary workgroup has an output location, or you want to use a different workgroup which also has a defined output location, you can modify the start-query-execution call accordingly. Otherwise you probably have an S3 bucket called aws-athena-query-results-NNNNNNN-XX-XXXX-N that was created by Athena at some point and is used for outputs when you use the UI.
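If you would rather do this from Python than from the shell, a minimal boto3 sketch of the same start / poll / download flow (the query, region and output location are placeholders):
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # change to your region

execution_id = athena.start_query_execution(
    QueryString="SELECT * FROM my_table",                               # placeholder query
    ResultConfiguration={"OutputLocation": "s3://example/location/"},   # placeholder writable location
)["QueryExecutionId"]

# poll until the query reaches a terminal state
while True:
    execution = athena.get_query_execution(QueryExecutionId=execution_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(5)

if state == "SUCCEEDED":
    # get_query_execution returns the full S3 path of the result CSV
    result_location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
    bucket, key = result_location.replace("s3://", "").split("/", 1)
    boto3.client("s3").download_file(bucket, key, "results.csv")
else:
    status = execution["QueryExecution"]["Status"]
    raise RuntimeError(f"Query {execution_id} failed: {status.get('StateChangeReason', state)}")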
You cannot save results from the AWS CLI, but you can Specify a Query Result Location and Amazon Athena will automatically save a copy of the query results in an Amazon S3 location that you specify.
You could then use the AWS CLI to download that results file.

sqoop export to Teradata gives com.teradata.connector.common.exception.ConnectorException: Malformed \uxxxx encoding

I am trying to export data from HDFS to Teradata using Sqoop. I have created a table in Teradata and tried to export a sample text file with some sample data. Here is my sqoop export command:
sqoop export --connect jdbc:teradata://xxx.xxx.xxx.xx/Database=XXXXXXX,CHARSET=UTF8 \
    --username User_name \
    --password pwd \
    --export-dir /user/User/test_td_export/ \
    --table HDP_TD_EXPORT_TEST \
    --input-fields-terminated-by ',' \
    --input-escaped-by '\' \
    --input-enclosed-by '\"' \
    --input-optionally-enclosed-by '\"' \
    --mapreduce-job-name td_export_test
I am able to run a sqoop eval against the same table and get the row count successfully, but while exporting data I am getting this exception:
19/01/04 20:48:26 ERROR tool.ExportTool: Encountered IOException running export job:
com.teradata.connector.common.exception.ConnectorException: Malformed \uxxxx encoding
This is the first time I have tried to export to Teradata. I have exported data to Oracle and didn't see any such issues. Any help is greatly appreciated. Thanks.
I found that the --input-escaped-by '\' option was causing the above exception, because it added escape characters during the export. I removed that parameter and the export job worked as expected.