How to connect to Redshift from AWS Glue (PySpark)?

I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement) but cannot seem to connect.
The following code just times out:
df = spark.read \
.format('jdbc') \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
.option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
.option("tempdir", "s3n://test") \
.option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
.load()
What could be the reason?
I checked the URL, user, and password, and also tried different IAM roles, but it just hangs every time.
I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there) and it also hangs/times out:
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
.option("dbtable", "public.test") \
.option("user", "myuser") \
.option("password", "mypass") \
.load()
Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine; there is just something wrong with the connection to Redshift, but I'm not sure what.

Select the last option while creating the Glue job. On the next screen, it will ask you to select a Glue connection.

You seem to be on the correct path. I connect to and query Redshift from a Glue PySpark job the same way, except for the minor change of using
.format("com.databricks.spark.redshift")
I have also successfully used
.option("forward_spark_s3_credentials", "true")
instead of
.option("iam_role", "my_iam_role")
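Putting those two tweaks together, a minimal sketch of the full read might look like the following. This is an assumption based on the spark-redshift connector's documented options rather than code from the answer above; the cluster endpoint, credentials, and temp bucket are placeholders, and it assumes the connector JARs are available to the job and the job can reach the cluster over the network.

# Hedged sketch: read from Redshift via the spark-redshift connector
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
    .option("user", "myuser") \
    .option("password", "mypass") \
    .option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'") \
    .option("tempdir", "s3://my-temp-bucket/redshift-staging/") \
    .option("forward_spark_s3_credentials", "true") \
    .load()

df.show()

With forward_spark_s3_credentials the connector forwards Spark's own S3 credentials to Redshift for the COPY/UNLOAD that goes through tempdir, so no aws_iam_role option is needed in that case.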

Related

How to fetch and update DynamoDB table items by PartiQL in aws?

I have added a new column to an existing DynamoDB table in AWS. Now I want a one-time script to populate values for the newly created column on all existing records. I have tried a cursor, as shown below, from the PartiQL editor in AWS:
DECLARE cursor CURSOR FOR SELECT CRMCustomerGuid FROM "Customer";
OPEN cursor;
WHILE NEXT cursor DO
UPDATE "Customer"
SET "TimeToLive" = 1671860761
WHERE "CustomerGuid" = cursor.CRMCustomerGuid;
END WHILE
CLOSE cursor;
But I am getting an error message saying ValidationException: Statement wasn't well formed, can't be processed: unexpected keyword
Any help is appreciated
DynamoDB is not a relational database and PartiQL is not a full SQL implementation.
Here are the docs on the language. Cursors aren't in there.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.html
My own advice would be to use the plain non-SQL interface first, because its calls map directly to the things the database can actually do.
Once you understand that, you may, in some contexts, leverage PartiQL.
From #hunterhacker's comment we know that cursors are not possible with PartiQL. Likewise, the PartiQL web editor cannot run multiple statement types in a single execution, so we cannot do a SELECT and then an UPDATE there.
However, this is quite easily achieved using the CLI or an SDK. Below is a simple bash script which updates all of the items in your table with a TTL value; execute it from any Linux/Unix-based shell:
# Scan out every CustomerGuid, then set a TimeToLive attribute on each item
for pk in $(aws dynamodb scan --table-name Customer --projection-expression 'CustomerGuid' --query 'Items[*].CustomerGuid.S' --output text); do
    aws dynamodb update-item \
        --table-name Customer \
        --key '{"CustomerGuid": {"S": "'$pk'"}}' \
        --update-expression "SET #ttl = :ttl" \
        --expression-attribute-names '{"#ttl":"TimeToLive"}' \
        --expression-attribute-values '{":ttl":{"N": "1671860762"}}'
done
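If you would rather use an SDK than the CLI, a rough boto3 equivalent of the same scan-then-update loop is sketched below. It is an illustration under the same assumptions (table Customer, key attribute CustomerGuid), not a tested drop-in script.

import boto3

table = boto3.resource("dynamodb").Table("Customer")

scan_kwargs = {"ProjectionExpression": "CustomerGuid"}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        # Same update expression as the CLI version: set a TimeToLive attribute
        table.update_item(
            Key={"CustomerGuid": item["CustomerGuid"]},
            UpdateExpression="SET #ttl = :ttl",
            ExpressionAttributeNames={"#ttl": "TimeToLive"},
            ExpressionAttributeValues={":ttl": 1671860762},
        )
    # Keep paginating until the scan is exhausted
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]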

Executing an SQL file on Redshift via the CLI

We have an SQL file that we would like to run on our Redshift cluster. We're already aware that this is possible via psql, as described in this Stack Overflow answer and this Stack Overflow answer. However, we were wondering whether this is possible using the Redshift Data API?
We looked through the documentation but were unable to find anything apart from batch-execute-statement, which takes a space-delimited list of SQL statements. We're happy to resort to this but would prefer a way to run a file directly against the cluster.
Also, we'd like to parameterise the file; can this be done?
Our Current Attempt
This is what we've tried so far:
PARAMETERS="[\
{\"name\": \"param1\", \"value\": \"${PARAM1}\"}, \
{\"name\": \"param2\", \"value\": \"${PARAM2}\"}, \
{\"name\": \"param3\", \"value\": \"${PARAM3}\"}, \
{\"name\": \"param4\", \"value\": \"${PARAM4}\"}, \
{\"name\": \"param5\", \"value\": \"${PARAM5}\"}, \
{\"name\": \"param6\", \"value\": \"${PARAM6}\"}\
]"
SCRIPT_SQL=$(tr -d '\n' <./sql/script.sql)
AWS_RESPONSE=$(aws redshift-data execute-statement \
--region $AWS_REGION \
--cluster-identifier $CLUSTER_IDENTIFIER \
--sql "$SCRIPT_SQL" \
--parameters "$PARAMETERS" \
--database public \
--secret $CREDENTIALS_ARN)
Where all undeclared variables are variables set earlier in the script.
I am a bit confused. The Redshift Data API is a REST API: you send it a request and it executes the query against your cluster (or serverless workgroup). Typical usage might be a Lambda function that connects to your Redshift environment and executes queries from there. You can load your file in the Lambda, decompose it, and send the commands one by one if you like. And of course, you can parametrise anything within that Lambda. But for the Redshift Data API to work, it has to be a request-response type of operation.
And please note it is an asynchronous API.
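To make that concrete, here is a hedged sketch of the "decompose the file and send the statements one by one" idea using boto3's redshift-data client. The cluster identifier, secret ARN, and file path are placeholders, and the naive split on ';' assumes a simple script with no stored procedures or quoted semicolons.

import time
import boto3

CLUSTER_IDENTIFIER = "my-cluster"  # placeholder, as in the shell script above
CREDENTIALS_ARN = "arn:aws:secretsmanager:<region>:<account>:secret:<name>"  # placeholder

client = boto3.client("redshift-data")

with open("./sql/script.sql") as f:
    # crude split on ';' -- adequate for simple scripts only
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

for sql in statements:
    # execute_statement also accepts Parameters=[{"name": ..., "value": ...}]
    # for statements that use named :placeholders
    resp = client.execute_statement(
        ClusterIdentifier=CLUSTER_IDENTIFIER,
        Database="public",
        SecretArn=CREDENTIALS_ARN,
        Sql=sql,
    )
    # The Data API is asynchronous, so poll until each statement finishes
    while True:
        status = client.describe_statement(Id=resp["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)
    print(status, "-", sql[:60])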

kafkaconnect backup to s3 - the region 'us-east-1' is wrong; expecting 'eu-north-1'

I am trying to back up my Kafka topic to S3, following this guide.
I have filled in all the blanks for the configuration and specified aws.region as eu-north-1.
aws kafkaconnect create-connector \
--capacity "autoScaling={maxWorkerCount=2,mcuCount=1,minWorkerCount=1,scaleInPolicy={cpuUtilizationPercentage=10},scaleOutPolicy={cpuUtilizationPercentage=80}}" \
--connector-configuration \
"connector.class=io.lenses.streamreactor.connect.aws.s3.sink.S3SinkConnector, \
key.converter.schemas.enable=false, \
connect.s3.kcql=INSERT INTO <<S3 Bucket Name>>:my_workload SELECT * FROM source_topic PARTITIONBY _header.year\,_header.month\,_header.day\,_header.hour STOREAS \`JSON\` WITHPARTITIONER=KeysAndValues WITH_FLUSH_COUNT = 5, \
aws.region=us-east-1, \ <----------- changed to eu-north-1
tasks.max=2, \
topics=source_topic, \
schema.enable=false, \
errors.log.enable=true, \
value.converter=org.apache.kafka.connect.storage.StringConverter, \
key.converter=org.apache.kafka.connect.storage.StringConverter " \
--connector-name "backup-msk-to-s3-v1" \
--kafka-cluster '{"apacheKafkaCluster": {"bootstrapServers": "<<MSK broker list>>","vpc": {"securityGroups": [ <<Security Group>> ],"subnets": [ <<Subnet List>> ]}}}' \
--kafka-cluster-client-authentication "authenticationType=NONE" \
--kafka-cluster-encryption-in-transit "encryptionType=PLAINTEXT" \
--kafka-connect-version "2.7.1" \
The bucket is created like this:
aws s3api create-bucket --bucket my-msk-backup --create-bucket-configuration LocationConstraint=eu-north-1
The connector fails due to this error message:
[Worker-026a2ebf1106735e1] [2022-08-30 13:11:15,695] ERROR [test03|task-0] WorkerSinkTask{id=test03-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:191)
[Worker-026a2ebf1106735e1] org.jclouds.aws.AWSResponseException: request GET https://my-msk-backup.s3.amazonaws.com/?prefix=my_workload&max-keys=1000 HTTP/1.1 failed with code 400, error: AWSError{requestId='redacted', requestToken='/redacted/redacted+/redacted+s18=', code='AuthorizationHeaderMalformed', message='The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-north-1'', context='{Region=eu-north-1, HostId=/redacted+redacted/redacted+/redacted+s18=}'}
How can I troubleshoot this further?
This issue occurs when you have recently moved your bucket from one region to another (by deleting it in one region and recreating it in another). Because bucket configurations have an eventual consistency model, your bucket might still appear in the old region for some amount of time.
Normally you won't be able to create a bucket with the same name right after deleting it in another region (names are globally unique), but if you manage to create it anyway, this error might still appear for a while (because the record has not yet been updated/propagated to the other regions), even if you point to the region where you recreated the bucket.
I haven't found anything regarding this apart from this and this.
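As an extra troubleshooting step, you could ask S3 which region it currently reports for the bucket and compare that with what the connector is using. A small boto3 check (bucket name taken from the question) might look like this:

import boto3

s3 = boto3.client("s3")
# LocationConstraint is None for us-east-1, otherwise the region name
location = s3.get_bucket_location(Bucket="my-msk-backup")["LocationConstraint"]
print(location or "us-east-1")

If this prints eu-north-1 while the error still mentions us-east-1, the stale-region explanation above is the likely cause, and it should clear up once the bucket metadata has propagated.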

How to read from one s3 account and write to another in Databricks

My cluster is configured to use ROLE_B, which gives me access to BUCKET_IN_ACCOUNT_B but not to BUCKET_IN_ACCOUNT_A, so I assume XACCOUNT_ROLE to access BUCKET_IN_ACCOUNT_A. The following code works just fine:
sc._jsc.hadoopConfiguration().set("fs.s3a.credentialsType", "AssumeRole")
sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.arn", XACCOUNT_ROLE)
sc._jsc.hadoopConfiguration().set("fs.s3a.acl.default", "BucketOwnerFullControl")
df = spark.read.option("multiline","true").json(BUCKET_IN_ACCOUNT_A)
But when I try to write this dataframe back to BUCKET_IN_ACCOUNT_B as shown below, I get a java.nio.file.AccessDeniedException:
df.write \
.format("delta") \
.mode("append") \
.save(BUCKET_IN_ACCOUNT_B)
I assume this is the case because my spark cluster is still configured to use XACCOUNT_ROLE. My question is, how do I switch back to ROLE_B?
sc._jsc.hadoopConfiguration().set("fs.s3a.credentialsType", "AssumeRole")
sc._jsc.hadoopConfiguration().set("fs.s3a.stsAssumeRole.arn", ROLE_B)
sc._jsc.hadoopConfiguration().set("fs.s3a.acl.default", "BucketOwnerFullControl")
did not work.
I never ended up discovering a way to switch the cluster role back. But I did figure out an alternative way to read from one S3 bucket and write to another. Basically, I mounted BUCKET_IN_ACCOUNT_A via XACCOUNT_ROLE and was then able to write to BUCKET_IN_ACCOUNT_B without having to touch the cluster's role.
dbutils.fs.unmount("/mnt/MOUNT_LOCATION")
dbutils.fs.mount(BUCKET_IN_ACCOUNT_A, "/mnt/MOUNT_LOCATION",
    extra_configs = {
        "fs.s3a.credentialsType": "AssumeRole",
        "fs.s3a.stsAssumeRole.arn": XACCOUNT_ROLE,
        "fs.s3a.acl.default": "BucketOwnerFullControl"
    }
)
df = spark.read.option("multiline","true").json("/mnt/MOUNT_LOCATION")
df.write \
.format("delta") \
.mode("append") \
.save(BUCKET_IN_ACCOUNT_B)

How to schedule BigQuery DataTransfer Service using bq command

I am trying to create a Data Transfer Service using BigQuery. I used the bq command to create the DTS and am able to create it successfully.
Now I need to specify a custom time for the schedule using the bq command. Is it possible to set a custom schedule while creating the Data Transfer Service? Refer to the sample bq command:
bq mk --transfer_config \
--project_id='My project' \
--target_dataset='My Dataset' \
--display_name='test_bqdts' \
--params='{"data_path":<data_path>,
"destination_table_name_template":<destination_table_name>,
"file_format":<>,
"ignore_unknown_values":"true",
"access_key_id": "access_key_id",
"secret_access_key": "secret_access_key"
}' \
--data_source=data_source_id
NOTE: When you create an Amazon S3 transfer using the command-line tool, the transfer configuration is set up using the default value for Schedule (every 24 hours).
You can use the flag --schedule as you can see here
Option 2: Use the bq mk command.
Scheduled queries are a kind of transfer. To schedule a query, you can
use the BigQuery Data Transfer Service CLI to make a transfer
configuration.
Queries must be in StandardSQL dialect to be scheduled.
Enter the bq mk command and supply the transfer creation flag
--transfer_config. The following flags are also required:
--data_source
--target_dataset (Optional for DDL/DML queries.)
--display_name
--params
Optional flags:
--project_id is your project ID. If --project_id isn't specified, the default project is used.
--schedule is how often you want the query to run. If --schedule isn't specified, the default is 'every 24 hours' based on creation
time.
For DDL/DML queries, you can also supply the --location flag to specify a particular region for processing. If --location isn't
specified, the global Google Cloud location is used.
--service_account_name is for authenticating your scheduled query with a service account instead of your individual user account. Note:
Using service accounts with scheduled queries is in beta.
bq mk \
--transfer_config \
--project_id=project_id \
--target_dataset=dataset \
--display_name=name \
--params='parameters' \
--data_source=data_source
If you want to set a 24-hour schedule, for example, you should use --schedule='every 24 hours'.
You can find the complete reference for the time syntax here
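If you prefer a client library over the bq tool, the same transfer can also be created from Python with an explicit schedule. The sketch below is an assumption based on the google-cloud-bigquery-datatransfer package rather than anything in this answer; the project, dataset, and S3 parameters are placeholders.

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",   # placeholder target dataset
    display_name="test_bqdts",
    data_source_id="amazon_s3",            # assumed data source id for S3 transfers
    params={
        "data_path": "s3://my-bucket/path/*",
        "destination_table_name_template": "my_table",
        "file_format": "CSV",
        "access_key_id": "ACCESS_KEY_ID",
        "secret_access_key": "SECRET_ACCESS_KEY",
    },
    schedule="every day 02:30",            # custom time instead of the 24-hour default
)

created = client.create_transfer_config(
    parent="projects/my-project",          # placeholder project
    transfer_config=transfer_config,
)
print("Created transfer config:", created.name)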
I hope it helps