How to store JDBC drivers locally to run PySpark on an EMR cluster - amazon-web-services

I'm trying to run some code locally against AWS, but I'm not sure where to store the JDBC drivers. The goal is to have my PySpark application read an RDS database to do an ELT process from a cluster.
I'm getting two sets of errors:
First: Cannot locate jar files
Second: Error: Missing Additional resource
Here is what my code looks like:
import pyspark
from pyspark.sql import SparkSession
import os

# Point spark-submit at the locally stored JDBC driver jars
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/driver/jars --jars /file/path/to/jars'

spark = SparkSession.builder.appName("rds-elt").getOrCreate()

# "query" and "dbtable" are mutually exclusive options for the jdbc source
post_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://url-to-rds-amazonaws.com") \
    .option("query", "select * from mytable") \
    .option("user", "myuser") \
    .option("password", "password") \
    .load()

post_df.createOrReplaceTempView("post_fin_v")

transformed_df = spark.sql('''
    perform more aggregation here
''')

transformed_df.write.format("jdbc").mode("append") \
    .option("url", "jdbc:sqlserver://url-to-rds-amazonaws.com") \
    .option("dbtable", "mytable") \
    .option("user", "myuser") \
    .option("password", "password") \
    .save()

Related

Google Cloud Storage File System, Python Package Error: AttributeError: 'GCSFile' object has no attribute 'gcsfs'

I am trying to run Python code which downloads and streams chunks of data from a source URL to a destination cloud storage blob.
It works fine on a standalone PC, as a local function, and so on.
But when I try the same thing with GCP Cloud Run, it throws a weird error:
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
Complete error:
Traceback (most recent call last):
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1683, in __del__
self.close()
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1661, in close
self.flush(force=True)
File "/home/<user>/.local/lib/python3.9/site-packages/fsspec/spec.py", line 1527, in flush
self._initiate_upload()
File "/home/<user>/.local/lib/python3.9/site-packages/gcsfs/core.py", line 1443, in _initiate_upload
self.gcsfs.loop,
AttributeError: 'GCSFile' object has no attribute 'gcsfs'
This consumed my week; any help or direction is highly appreciated, thanks in advance.
The actual code that was used:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

@app.route('/urltogcs')
def urltogcs():
    try:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "secret.json"
        gcp_file_system = gcsfs.GCSFileSystem(project='<project_id>')
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
Your code (with the proposed changes) works for me:
main.py:
from flask import Flask, request
import os
import gcsfs
import requests

app = Flask(__name__)

project = os.getenv("PROJECT")
port = os.getenv("PORT", 8080)

@app.route('/urltogcs')
def urltogcs():
    try:
        gcp_file_system = gcsfs.GCSFileSystem(project=project)
        session = requests.Session()
        url = request.args.get('source', 'temp')
        blob_path = request.args.get('destination', 'temp')
        with session.get(url, stream=True) as r:
            r.raise_for_status()
            with gcp_file_system.open(blob_path, 'wb') as f_obj:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f_obj.write(chunk)
        return f'Successfully downloaded from {url} to {blob_path} :)'
    except Exception as e:
        print("Failure")
        print(e)
        return f'download failed for {url} :('

if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(port))
Note: The code requires project from the environment, which isn't ideal. It would be better if gcsfs.GCSFileSystem didn't require project. Alternatively, project could be obtained from Google's metadata service (a sketch of that follows below). For convenience (!), I'm setting it using the environment.
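As a rough illustration of that metadata-server option (this sketch is not part of the original answer, and the PROJECT fallback for local runs is an assumption):

import os
import requests

# GCE/Cloud Run metadata endpoint that returns the current project ID
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/project/project-id"

def resolve_project() -> str:
    """Look up the project ID from the metadata server; fall back to the
    PROJECT environment variable when running outside GCP."""
    try:
        resp = requests.get(METADATA_URL, headers={"Metadata-Flavor": "Google"}, timeout=2)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return os.environ["PROJECT"]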
requirements.txt:
Flask==2.2.2
gcsfs==2022.7.1
gunicorn==20.1.0
Dockerfile:
FROM python:3.10-slim
ENV PYTHONUNBUFFERED True
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
RUN pip install --no-cache-dir -r requirements.txt
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
Bash script:
BILLING="[YOUR-BILLING]"
PROJECT="[YOUR-PROJECT]"
REGION="[YOUR-REGION]"
BUCKET="[YOUR-BUCKET]"
# Create Project
gcloud projects create ${PROJECT}
# Associate with Billing Account
gcloud beta billing projects link ${PROJECT} \
--billing-account=${BILLING}
# Enabled services
SERVICES=(
"artifactregistry"
"cloudbuild"
"run"
)
for SERVICE in ${SERVICES[@]}
do
gcloud services enable ${SERVICE}.googleapis.com \
--project=${PROJECT}
done
# Create Bucket
gsutil mb -p ${PROJECT} gs://${BUCKET}
# Service Account
ACCOUNT=tester
EMAIL=${ACCOUNT}@${PROJECT}.iam.gserviceaccount.com
# Create Service Account
gcloud iam service-accounts create ${ACCOUNT} \
--project=${PROJECT}
# Create Service Account key
gcloud iam service-accounts keys create ${PWD}/${ACCOUNT}.json \
--iam-account=${EMAIL} \
--project=${PROJECT}
# Ensure Service Account can write to storage
gcloud projects add-iam-policy-binding ${PROJECT} \
--role=roles/storage.admin \
--member=serviceAccount:${EMAIL}
# Only needed for local testing
export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/${ACCOUNT}.json
# Deploy Cloud Run service
# Run service as Service Account
NAME="urltogcs"
gcloud run deploy ${NAME} \
--source=${PWD} \
--set-env-vars=PROJECT=${PROJECT} \
--no-allow-unauthenticated \
--service-account=${EMAIL} \
--region=${REGION} \
--project=${PROJECT}
# Grab the Cloud Run service's endpoint
ENDPOINT=$(gcloud run services describe ${NAME} \
--region=${REGION} \
--project=${PROJECT} \
--format="value(status.url)")
# Cloud Run service requires auth
TOKEN=$(gcloud auth print-identity-token)
# This page
SRC="https://stackoverflow.com/questions/73393808/"
# Generate a GCS Object name by epoch
DST="gs://${BUCKET}/$(date +%s)"
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--data-urlencode "source=${SRC}" \
--data-urlencode "destination=${DST}" \
--write-out '%{response_code}' \
--output /dev/null \
${ENDPOINT}/urltogcs
Yields OK:
200
And:
gsutil ls gs://${BUCKET}
gs://${BUCKET}/1660780270

How to query AWS RedShift from AWS Glue PySpark Job

I have a Redshift cluster which is not publicly accessible. I want to query a database in the cluster from a Glue job using PySpark. I have tried this snippet, but I'm getting a timed-out error.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Glue to RedShift") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://redshift-cluster-***************redshift.amazonaws.com:5439/dev") \
    .option("user", "******") \
    .option("password", "************") \
    .option("query", "Select * from category limit 10") \
    .option("tempdir", "s3a://e-commerce-website-templates/ahmad") \
    .option("aws_iam_role", "arn:aws:iam::337618512328:role/glue_s3_redshift") \
    .load()

df.show()
Any help would be appreciated. Thanks in advance.

Use `capture_tpu_profile` in AI Platform

We are trying to capture TPU profiling data while running our training task on AI Platform, following this tutorial. All the needed information, such as the TPU name, comes from our model's output.
config.yaml:
trainingInput:
  scaleTier: BASIC_TPU
  runtimeVersion: '1.15' # also tried '2.1'
Task submission command:
export DATE=$(date '+%Y%m%d_%H%M%S') && \
gcloud ai-platform jobs submit training "imaterialist_image_classification_model_${DATE}" \
--region=us-central1 \
--staging-bucket='gs://${BUCKET}' \
--module-name='efficientnet.main' \
--config=config.yaml \
--package-path="${PWD}/efficientnet" \
-- \
--data_dir='gs://${BUCKET}/tfrecords/' \
--train_batch_size=8 \
--train_steps=5 \
--model_dir="gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/${DATE}" \
--model_name='efficientnet-b4' \
--skip_host_call=true \
--gcp_project=${GCP_PROJECT_ID} \
--mode=train
When we tried to run capture_tpu_profile with the name that our model got from the master:
capture_tpu_profile --gcp_project="${GCP_PROJECT_ID}" --logdir='gs://${BUCKET}/algorithms_training/imaterialist_image_classification_model/20200318_005446' --tpu_zone='us-central1-b' --tpu='<tpu_IP_address>'
we got this error:
File "/home/kovtuh/.local/lib/python3.7/site-packages/tensorflow_core/python/distribute/cluster_resolver/tpu_cluster_resolver.py", line 480, in _fetch_cloud_tpu_metadata
"constructor. Exception: %s" % (self._tpu, e))
ValueError: Could not lookup TPU metadata from name 'b'<tpu_IP_address>''. Please doublecheck the tpu argument in the TPUClusterResolver constructor. Exception: <HttpError 404 when requesting https://tpu.googleapis.com/v1/projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>?alt=json returned "Resource 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>' was not found". Details: "[{'#type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/<GCP_PROJECT_ID>/locations/us-central1-b/nodes/<tpu_IP_address>'}]">
It seems like the TPU device isn't connected to our project when it is provisioned through AI Platform, but which project is it connected to, and can we get access to such TPUs to capture their profiles?

sqoop export to Teradata gives com.teradata.connector.common.exception.ConnectorException: Malformed \uxxxx encoding

I am trying to export data from HDFS to Teradata using Sqoop. I have created a table in Teradata and tried to load a sample text file with some sample data. Here is my sqoop export command:
sqoop export --connect jdbc:teradata://xxx.xxx.xxx.xx/Database=XXXXXXX,CHARSET=UTF8 \
--username User_name \
--password pwd \
--export-dir /user/User/test_td_export/ \
--table HDP_TD_EXPORT_TEST \
--input-fields-terminated-by ',' \
--input-escaped-by '\' \
--input-enclosed-by '\"' \
--input-optionally-enclosed-by '\"' \
--mapreduce-job-name td_export_test
I am able to run a sqoop eval against the same table and get the count successfully, but while exporting data I get this exception:
19/01/04 20:48:26 ERROR tool.ExportTool: Encountered IOException running export job:
com.teradata.connector.common.exception.ConnectorException: Malformed \uxxxx encoding
This is the first time I have tried to export to Teradata. I have exported data to Oracle and didn't see any such issues. Any help is greatly appreciated. Thanks.
I found that the use of --input-escaped-by '\' was causing the above exception, as it was adding escape characters while exporting. I removed that parameter and the export job worked as expected.
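For reference, the working command is the one from the question with that single flag removed (the connection details remain the question's placeholders):

sqoop export --connect jdbc:teradata://xxx.xxx.xxx.xx/Database=XXXXXXX,CHARSET=UTF8 \
--username User_name \
--password pwd \
--export-dir /user/User/test_td_export/ \
--table HDP_TD_EXPORT_TEST \
--input-fields-terminated-by ',' \
--input-enclosed-by '\"' \
--input-optionally-enclosed-by '\"' \
--mapreduce-job-name td_export_test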

Batch loading into AWS RDS (postgres) from PySpark

I am looking for a batch loader for a Glue job to load into RDS using a PySpark script with the DataFrameWriter.
I have this working for RedShift as follows:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", jdbcconf.get("url") + '/' + DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
    .option("dbtable", TABLE_NAME) \
    .option("tempdir", args["TempDir"]) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("overwrite") \
    .save()
Here df is defined above to read in a file. What is the best approach I could take to do this for RDS instead of Redshift?
If in RDS you only need APPEND / OVERWRITE, you can create an RDS JDBC connection and use something like the below:
postgres_url = "jdbc:postgresql://localhost:portnum/sakila?user=<user>&password=<pwd>"
df.write.jdbc(postgres_url, table="actor1", mode="append")     # for append
df.write.jdbc(postgres_url, table="actor1", mode="overwrite")  # for overwrite
If it involves UPSERTs, then you can probably use a MySQL library as an external Python library and perform INSERT INTO ..... ON DUPLICATE KEY (a rough sketch follows after this answer).
Please refer to this URL: How to use JDBC source to write and read data in (Py)Spark?
Regards,
Yuva
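A sketch of that upsert idea, assuming a hypothetical actor1 table with a primary key and pymysql shipped as the external Python library; the connection details, columns, and sample rows are illustrative only:

import pymysql

# Hypothetical RDS connection details and table layout, for illustration only
conn = pymysql.connect(host="<rds-endpoint>", user="<user>",
                       password="<pwd>", database="sakila")

upsert_sql = """
    INSERT INTO actor1 (actor_id, first_name) VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE first_name = VALUES(first_name)
"""

# In practice the rows would come from the DataFrame,
# e.g. [(r.actor_id, r.first_name) for r in df.collect()] for small result sets
rows = [(1, "PENELOPE"), (2, "NICK")]

with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)
conn.commit()
conn.close()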
I learned that this can only be done through JDBC, e.g.:
df.write.format("jdbc") \
.option("url", jdbcconf.get("url") + '/' + REDSHIFT_DATABASE + '?user=' + jdbcconf.get('user') + '&password=' + jdbcconf.get('password')) \
.option("dbtable", REDSHIFT_TABLE_NAME) \
.option("tempdir", args["TempDir"]) \
.option("forward_spark_s3_credentials", "true") \
.mode("overwrite") \
.save()