We have an SQL file that we would like to run on our Redshift cluster. We're already aware that this is possible via psql, as described in this Stack Overflow answer and this Stack Overflow answer. However, we were wondering whether this is possible using the Redshift Data API?
We looked through the documentation but were unable to find anything apart from batch-execute-statement, which takes a space-delimited list of SQL statements. We're happy to resort to this, but would prefer a way to run a file directly against the cluster.
We'd also like to parameterise the file. Can this be done?
Our Current Attempt
This is what we've tried so far:
PARAMETERS="[\
{\"name\": \"param1\", \"value\": \"${PARAM1}\"}, \
{\"name\": \"param2\", \"value\": \"${PARAM2}\"}, \
{\"name\": \"param3\", \"value\": \"${PARAM3}\"}, \
{\"name\": \"param4\", \"value\": \"${PARAM4}\"}, \
{\"name\": \"param5\", \"value\": \"${PARAM5}\"}, \
{\"name\": \"param6\", \"value\": \"${PARAM6}\"}\
]"
SCRIPT_SQL=$(tr -d '\n' <./sql/script.sql)
AWS_RESPONSE=$(aws redshift-data execute-statement \
--region $AWS_REGION \
--cluster-identifier $CLUSTER_IDENTIFIER \
--sql "$SCRIPT_SQL" \
--parameters "$PARAMETERS" \
--database public \
--secret-arn $CREDENTIALS_ARN)
Where all undeclared variables are variables set earlier in the script.
I am a bit confused. The Redshift Data API is a REST API: you send it a request and it executes the query against your cluster (or serverless workgroup). Typical usage might be a Lambda function that connects to your Redshift environment and executes queries from there. You can load your file into the Lambda, decompose it, and send the commands one by one if you like, and of course you can parameterise anything within that Lambda. But for the Redshift Data API to work, it needs to be a request-response type of operation.
Also, please note that it is an asynchronous API.
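To make that concrete, here is a rough Python (boto3) sketch of the "decompose and send" approach, for example from a Lambda function. The file path, region, cluster identifier, database, and secret ARN below are placeholders based on the question; the naive split on ';' only suits simple scripts, and as far as I know batch_execute_statement does not take named parameters, so for the parameterised statements you would call execute_statement (which accepts Parameters) per statement instead:
import time
import boto3

client = boto3.client("redshift-data", region_name="eu-west-1")  # placeholder region

# Read the file and split it into individual statements (naive split on ';').
with open("./sql/script.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

resp = client.batch_execute_statement(
    ClusterIdentifier="my-cluster",                    # placeholder
    Database="dev",                                    # placeholder
    SecretArn="arn:aws:secretsmanager:...:my-secret",  # placeholder
    Sqls=statements,
)

# The Data API is asynchronous: poll describe_statement until the batch finishes.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

print(desc["Status"], desc.get("Error", ""))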
I have added a new column to an existing DynamoDB table in AWS. Now I want a one-time script to populate values for the newly created column for all existing records. I have tried using a cursor, as shown below, from the PartiQL editor in AWS:
DECLARE cursor CURSOR FOR SELECT CRMCustomerGuid FROM "Customer";
OPEN cursor;
WHILE NEXT cursor DO
UPDATE "Customer"
SET "TimeToLive" = 1671860761
WHERE "CustomerGuid" = cursor.CRMCustomerGuid;
END WHILE
CLOSE cursor;
But I am getting this error: ValidationException: Statement wasn't well formed, can't be processed: unexpected keyword
Any help is appreciated.
DynamoDB is not a relational database and PartiQL is not a full SQL implementation.
Here are the docs on the language. Cursors aren't in there:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.html
My own advice would be to use the plain non-SQL interface first, because with it the calls you can make map directly to the things the database can do.
Once you understand that, you may, in some contexts, leverage PartiQL.
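For illustration, here is a minimal Python (boto3) sketch of what those plain, non-SQL calls could look like for the scan-then-update task in the question. The table name, key attribute, and TTL value are taken from the question; treat it as a sketch rather than a finished script (it issues one UpdateItem per record and assumes CustomerGuid is the partition key):
import boto3

dynamodb = boto3.client("dynamodb")

# Page through the whole table, fetching only the key attribute we need.
paginator = dynamodb.get_paginator("scan")
pages = paginator.paginate(
    TableName="Customer",
    ProjectionExpression="CustomerGuid",
)

for page in pages:
    for item in page["Items"]:
        guid = item["CustomerGuid"]["S"]
        # One UpdateItem per record: set the new TimeToLive attribute.
        dynamodb.update_item(
            TableName="Customer",
            Key={"CustomerGuid": {"S": guid}},
            UpdateExpression="SET #ttl = :ttl",
            ExpressionAttributeNames={"#ttl": "TimeToLive"},
            ExpressionAttributeValues={":ttl": {"N": "1671860761"}},
        )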
From #hunterhacker's comment we know that cursors are not supported in PartiQL. Similarly, the PartiQL web editor cannot run multiple statement types in a single execution, so we cannot do a SELECT followed by an UPDATE there.
However, this is quite easily achieved using the CLI or an SDK. Below is a simple bash script which will update all of the items in your table with a TTL value; execute it from any Linux/Unix based shell:
for pk in `aws dynamodb scan --table-name Customer --projection-expression 'CustomerGuid' --query 'Items[*].CustomerGuid.S' --output text`; do
aws dynamodb update-item \
--table-name Customer \
--key '{"CustomerGuid": {"S": "'$pk'"}}' \
--update-expression "SET #ttl = :ttl" \
--expression-attribute-names '{"#ttl":"TimeToLive"}' \
--expression-attribute-values '{":ttl":{"N": "1671860762"}}'
done
My cluster is configured to use ROLE_B, which gives me access to BUCKET_IN_ACCOUNT_B but not to BUCKET_IN_ACCOUNT_A. So I assume XACCOUNT_ROLE to access BUCKET_IN_ACCOUNT_A. The following code works just fine:
sc._jsc.hadoopConfiguration().set("fs.s3a.credentialsType", "AssumeRole")
sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.arn", XACCOUNT_ROLE)
sc._jsc.hadoopConfiguration().set("fs.s3a.acl.default", "BucketOwnerFullControl")
df = spark.read.option("multiline","true").json(BUCKET_IN_ACCOUNT_A)
But when I try to write this dataframe back to BUCKET_IN_ACCOUNT_B, as shown below, I get a java.nio.file.AccessDeniedException:
df.write \
.format("delta") \
.mode("append") \
.save(BUCKET_IN_ACCOUNT_B)
I assume this is because my Spark cluster is still configured to use XACCOUNT_ROLE. My question is: how do I switch back to ROLE_B?
sc._jsc.hadoopConfiguration().set("fs.s3a.credentialsType", "AssumeRole")
sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.arn", ROLE_B)
sc._jsc.hadoopConfiguration().set("fs.s3a.acl.default", "BucketOwnerFullControl")
did not work.
I never ended up discovering a way to switch the cluster role back. But I did figure out an alternative way of reading from one S3 bucket and writing to another: I mounted the cross-account bucket (BUCKET_IN_ACCOUNT_A) via XACCOUNT_ROLE, and was then able to write to BUCKET_IN_ACCOUNT_B without having to interfere with the cluster's role.
dbutils.fs.unmount("/mnt/MOUNT_LOCATION")
dbutils.fs.mount(BUCKET_IN_ACCOUNT_A, "/mnt/MOUNT_LOCATION",
extra_configs = {
"fs.s3a.credentialsType": "AssumeRole",
"fs.s3a.stsAssumeRole.arn": XACCOUNT_ROLE,
"fs.s3a.acl.default": "BucketOwnerFullControl"
}
)
df = spark.read.option("multiline","true").json("/mnt/MOUNT_LOCATION")
df.write \
.format("delta") \
.mode("append") \
.save(BUCKET_IN_ACCOUNT_B)
I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement), but I cannot seem to connect.
The following code just times out:
df = spark.read \
.format('jdbc') \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
.option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
.option("tempdir", "s3n://test") \
.option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
.load()
What could be the reason?
I checked the URL, user, and password, and also tried different IAM roles, but it just hangs every time.
I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there), and it also hangs/times out:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
.option("dbtable", "public.test") \
.option("user", "myuser") \
.option("password", "mypass") \
.load()
Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine. There is just something wrong with the connection to Redshift, but I am not sure what.
Select the last option while creating the Glue job. On the next screen, it will ask you to select a Glue connection.
You seem to be on the correct path. I connect to and query Redshift from a Glue PySpark job the same way, except for the minor change of using
.format("com.databricks.spark.redshift")
I have also successfully used
.option("forward_spark_s3_credentials", "true")
instead of
.option("iam_role", "my_iam_role")
I am making SES Templates using the AWS CLI and having issues with single quotes converting to special characters when the emails are sent.
This also happens when doing a DynamoDB put item operation using the CLI when a string contains a single quote within it.
I've tried backslashes, wrapping the quote in double quotes and then escaping it, etc.
aws ses send-bulk-templated-email --cli-input-json file://test.json
aws dynamodb put-item --table-name TABLE --item file://item.json
Item/Test Example (snippets of the json):
test: "SubjectPart":"Happy birthday! Get more involved in managing your healthcare now that you're 18"
item:
"S": "Now that you're 18"
Output:
Happy birthday! Get more involved in managing your healthcare now that you’re 18
and
Now that you’re 18
Expected:
Happy birthday! Get more involved in managing your healthcare now that you're 18
and
Now that you're 18
Assuming that you're using Linux or Mac, with the bash shell ...
Here is an example of how to escape quote characters when using the awscli:
aws dynamodb put-item \
--table-name mytable \
--item '{"id":{"S":"1"}, "name":{"S":"Fred'\''s Garage"}}'
Here is a second way:
aws dynamodb put-item \
--table-name mytable \
--item $'{"id":{"S":"1"}, "name":{"S":"Fred\'s Garage"}}'
In the latter example, words of the form $'string' are treated specially and allow you to quote certain characters.
Welp, after much trial and error, this is what worked:
you\u2019re
I have no idea why, but it did. I'm posting this answer in case others experience this as well.
Example:
"SubjectPart":"Happy birthday! Get more involved in managing your healthcare now that you\u2019re 18"
This will give you the expected output.
I am using the AWS CLI to create an EMR cluster and add a step. My create-cluster command looks like this:
aws emr create-cluster --release-label emr-5.0.0 --applications Name=Spark --ec2-attributes KeyName=*****,SubnetId=subnet-**** --use-default-roles --bootstrap-action Path=$S3_BOOTSTRAP_PATH --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=$instanceCount,InstanceType=m4.4xlarge --steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,$inputLoc,$outputLoc] --auto-terminate
$JAR is my Spark jar, which takes two params: input and output.
$inputLoc is basically a comma-separated list of input files, like s3://myBucket/input1.txt,s3://myBucket/input2.txt
However, the AWS CLI treats the comma-separated values as separate arguments, so my second input file is treated as the second parameter and $outputLoc here becomes s3://myBucket/input2.txt.
Is there any way to escape the comma and treat this whole argument as a single value in the CLI command, so that Spark can handle reading multiple files as input?
It seems there is no way to escape the commas in the input file list.
After trying quite a few approaches, I finally resorted to a hack: passing a different delimiter to separate the input files and handling it in the code. In my case, I added % as my delimiter, and in the driver code I am doing:
if (inputLoc.contains("%")) {
    // The input paths were joined with '%' in the step args, so restore the commas here
    inputLoc = inputLoc.replaceAll("%", ",");
}