How to query AWS Redshift from an AWS Glue PySpark job

I have a Redshift cluster which is not publicly accessible. I want to query the database in the cluster from a Glue job using PySpark. I have tried this snippet, but I'm getting a timed-out error.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Glue to RedShift") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://redshift-cluster-***************redshift.amazonaws.com:5439/dev") \
    .option("user", "******") \
    .option("password", "************") \
    .option("query", "Select * from category limit 10") \
    .option("tempdir", "s3a://e-commerce-website-templates/ahmad") \
    .option("aws_iam_role", "arn:aws:iam::337618512328:role/glue_s3_redshift") \
    .load()

df.show()
Any help would be appreciated. Thanks in advance.
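A connection timeout to a cluster that is not publicly accessible usually points to networking rather than to the read itself: the Glue job generally needs a Glue network connection attached so that it runs in the cluster's VPC and subnet, with security group rules that allow traffic to the cluster on port 5439. Once connectivity is in place, one option (assuming the spark-redshift style connector used elsewhere on this page is available to the job) is a read along these lines; the cluster endpoint and credentials below are placeholders, not real values:
# Minimal sketch, assuming VPC connectivity and the spark-redshift connector.
# <cluster-endpoint>, <user> and <password> are placeholders.
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/dev?user=<user>&password=<password>") \
    .option("query", "select * from category limit 10") \
    .option("tempdir", "s3a://e-commerce-website-templates/ahmad") \
    .option("aws_iam_role", "arn:aws:iam::337618512328:role/glue_s3_redshift") \
    .load()
df.show()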

Related

What is the difference between Nami wallet signing and payment.skey signing?

I built a transaction in Node.js by using cardano-cli.
cardano-cli transaction build \
--alonzo-era \
--testnet-magic 1 \
--tx-in dbf7f56f844cc4b85daccb62bedf4eeff0a84cb060f0f79b206c7f087b3f0ba1#0 \
--tx-in dbf7f56f844cc4b85daccb62bedf4eeff0a84cb060f0f79b206c7f087b3f0ba1#1 \
--tx-in 61b88efd41ccbb0e71c48aca2cbe63728078ec7fb20ec9c27acfe33d0647248d#0 \
--tx-in-script-file /cardano/plutus/direct-sale.plutus \
--tx-in-datum-file /cardano/temp/testnet/datums/list.json \
--tx-in-redeemer-file /cardano/temp/testnet/redeemers/buy.json \
--required-signer-hash 2be4a303e36f628e2a06d977e16f77ce2b9046b8c56576bb5286d1be \
--tx-in-collateral dbf7f56f844cc4b85daccb62bedf4eeff0a84cb060f0f79b206c7f087b3f0ba1#1 \
... some txout ...
--change-address addr_test1qq47fgcrudhk9r32qmvh0ct0wl8zhyzxhrzk2a4m22rdr0sqcga2xfzv6crryyt0sfphksfr947jjddy3t4u0qwfmmfq2h0pj8 \
--protocol-params-file /cardano/testnet/protocol-parameters.json \
--mint "2 20edea925974af2102c63adddbb6a6e789f8d3a16500b15bd1e1c32b.4143544956495459" \
--mint-script-file /cardano/plutus/activity-minter.plutus \
--mint-redeemer-file /cardano/redeemers/mint.json \
--invalid-before 19059345 \
--invalid-hereafter 19059495 \
--out-file ./tx.raw
After running this command, I got a cborHex, which I used on the frontend.
When I sign using the Nami wallet, I get this error:
"transaction submit error ShelleyTxValidationError ShelleyBasedEraBabbage (ApplyTxError [UtxowFailure (FromAlonzoUtxowFail (WrappedShelleyEraFailure (InvalidWitnessesUTXOW [VKey (VerKeyEd25519DSIGN \"cf949f966b426f25db11b6062edc31312001e3cd0ced4c6c7db3da7b5ac9766b\")])))])"
But when I sign with payment.skey in Node.js, it works.
I discussed this with Alexd1985 on the Cardano forum:
https://forum.cardano.org/t/how-to-resolve-shelleytxvalidationerror-shelleybasederababbage-applytxerror-utxowfailure-fromalonzoutxowfail-wrappedshelleyerafailure-invalidwitnessesutxow-error/113555/4
What is the solution to this problem?

How to store JDBC drivers locally to run PySpark in an EMR cluster

I'm trying to run some code locally before moving it to AWS, but I'm not sure where to store the JDBC drivers. The goal is to have my PySpark application read an RDS database to do an ELT process from a cluster.
I'm getting two sets of errors:
First: cannot locate jar files
Second: Error: Missing Additional resource
Here is what my code looks like:
import os

from pyspark.sql import SparkSession

# Point PySpark at the locally stored JDBC driver jars (paths are placeholders).
# The trailing 'pyspark-shell' is needed for these args to take effect when the
# script is launched with plain python.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/driver/jars --jars /file/path/to/jars pyspark-shell'

spark = SparkSession.builder.appName("rds-elt").getOrCreate()

# Placeholder RDS endpoint; the URL also needs the port and database name.
post_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://url-to-rds-amazonaws.com:5432/mydb") \
    .option("query", "select * from mytable") \
    .option("user", "myuser") \
    .option("password", "password") \
    .load()

post_df.createOrReplaceTempView("post_fin_v")

transformed_df = spark.sql('''
    perform more aggregation here
''')

transformed_df.write.format("jdbc").mode("append") \
    .option("url", "jdbc:sqlserver://url-to-rds-amazonaws.com") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "password") \
    .save()

Unable to create a gcloud alert policy on the command line with multiple conditions

I am trying to create a single alert policy for the Cloud SQL instance_state metric through gcloud with multiple conditions.
If the instance is in the "RUNNABLE" OR "FAILED" state for more than 5 minutes, then an alert should be triggered. I was able to create that in the console, and below is the screenshot:
Now I try the same using the command line with this gcloud command:
gcloud alpha monitoring policies create \
--display-name='Test Database State Alert ('$PROJECTID')' \
--condition-display-name='Instance is not running for 5 minutes' \
--notification-channels="x23234dfdfffffff" \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "RUNNABLE")'
OR 'metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--combiner='OR' \
--documentation='The rule "${condition.display_name}" has generated this alert for the "${metric.display_name}".' \
--project="$PROJECTID" \
--enabled
I am getting the error below for the OR part of the condition:
ERROR: (gcloud.alpha.monitoring.policies.create) unrecognized arguments:
OR
metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")
Even if I put parentheses around the condition it still fails, and the || operator fails as well.
Can anyone please tell me the correct gcloud command for this? I also want the structure of the alert policy to be similar to the one created in the Cloud Console, as shown above.
Thanks
I was able to use gcloud alpha monitoring policies conditions create to append additional conditions.
gcloud alpha monitoring policies create \
--notification-channels=projects/qwiklabs-gcp-04-d822dd6cd419/notificationChannels/2510735656842641871 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_MEAN"}' \
--condition-display-name='CPU Utilization >0.95 for 1m' \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" resource.type="gce_instance"' \
--duration='1m' \
--if='> 0.95' \
--display-name=' alert on spikes or consistantly high cpu' \
--combiner='OR'
gcloud alpha monitoring policies list --format='value(name,displayName)'
gcloud alpha monitoring policies conditions create \
projects/qwiklabs-gcp-04-d822dd6cd419/alertPolicies/1712202834227136574 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_MEAN"}' \
--condition-display-name='CPU Utilization >0.80 for 10m' \
--condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" resource.type="gce_instance"' \
--duration='10m' \
--if='> 0.80'
Duplicate --condition-filter clauses did not work for me. YMMV.
From the docs for gcloud alpha monitoring policies create, it appears that you can specify repeated (!) occurrences of:
[--aggregation=AGGREGATION --condition-display-name=CONDITION_DISPLAY_NAME --condition-filter=CONDITION_FILTER --duration=DURATION --if=IF_VALUE --trigger-count=TRIGGER_COUNT | --trigger-percent=TRIGGER_PERCENT]
So I think you need to duplicate your --condition-filter with the --combiner="OR", i.e.
gcloud alpha monitoring policies create \
--display-name='Test Database State Alert ('$PROJECTID')' \
--notification-channels="x23234dfdfffffff" \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-display-name='RUNNABLE' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "RUNNABLE")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--aggregation='{"alignmentPeriod": "60s","perSeriesAligner": "ALIGN_COUNT_TRUE"}' \
--condition-display-name='FAILED' \
--condition-filter='metric.type="cloudsql.googleapis.com/database/instance_state" AND resource.type="cloudsql_database" AND (metric.labels.state = "FAILED")' \
--duration='300s' \
--if='> 0.0' \
--trigger-count=1 \
--combiner='OR' \
--documentation='The rule "${condition.display_name}" has generated this alert for the "${metric.display_name}".' \
--project="$PROJECTID" \
--enabled

Batch loading into AWS RDS (postgres) from PySpark

I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter.
I have this working for Redshift as follows:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
Here df is defined above to read in a file. What is the best approach I could take to do this against RDS instead of Redshift?
If in RDS you only need APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the following:
postgres_url="jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url,table="actor1",mode="append") #for append
df.write.jdbc(postgres_url,table="actor1",mode="overwrite") #for overwrite
If it involves UPSERTs, then you can probably use a MySQL library as an external Python library and perform INSERT INTO ... ON DUPLICATE KEY UPDATE.
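Since the target in the question is a PostgreSQL RDS instance, the Postgres equivalent of that MySQL clause is INSERT ... ON CONFLICT ... DO UPDATE. A rough sketch using psycopg2 as an external Python library is below; the endpoint, credentials, table and column names are hypothetical, and collecting rows to the driver only makes sense for small batches:
# Hypothetical upsert sketch for a Postgres target using psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="url-to-rds-amazonaws.com",  # placeholder endpoint
    dbname="mydb", user="myuser", password="password",
)
upsert_sql = """
    INSERT INTO actor1 (actor_id, first_name, last_name)
    VALUES (%s, %s, %s)
    ON CONFLICT (actor_id) DO UPDATE
    SET first_name = EXCLUDED.first_name,
        last_name  = EXCLUDED.last_name
"""
with conn, conn.cursor() as cur:
    # Collect a small DataFrame to the driver and upsert row by row.
    for row in df.select("actor_id", "first_name", "last_name").collect():
        cur.execute(upsert_sql, (row["actor_id"], row["first_name"], row["last_name"]))
conn.close()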
Please refer to this question: How to use JDBC source to write and read data in (Py)Spark?
Regards,
Yuva
I learned that this can only be done through JDBC, e.g.:
df.write.format("jdbc") \
.option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
.option("dbtable", REDSHIFT_TABLE_NAME) \
.option("tempdir", args["TempDir"]) \
.option("forward_spark_s3_credentials", "true") \
.mode("overwrite") \
.save()

runtime_version versus runtime-version in cloudml-samples/flowers/sample.sh

In Google's sample code, found at cloudml-samples/flowers/sample.sh between lines 52 and 64, there is the argument "runtime_version":
# Training on CloudML is quick after preprocessing. If you ran the above
# commands asynchronously, make sure they have completed before calling this one.
gcloud ml-engine jobs submit training "$JOB_ID" \
--stream-logs \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
--runtime_version=1.0 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*"
Shouldn't "runtime_version" be replaced with "runtime-version" to avoid an error?
Yes. I've submitted a PR (in the future, never hesitate to do so yourself).