pyspark.pandas.read_sql_query from postgresql on AWS Glue - amazon-web-services

jdbcUrl = "jdbc:postgresql://"+domain+":"+str(port)+"/"+database+"?user="+user+"&password="+password+""
import pyspark.pandas as pd
pd.read_sql_query(select * from table, jdbcurl)
This throws an error on AWS Glue. Any suggestions on how to solve it?
Official documentation:
https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_sql_query.html
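Not from the thread, but one workaround sometimes used on Glue is to go through the Spark JDBC data source and then switch to the pandas-on-Spark API. A rough sketch, assuming the PostgreSQL JDBC driver jar is attached to the job, Spark >= 3.2, the jdbcUrl variable from above, and that "table" is a placeholder name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read through the Spark JDBC source; "query" pushes the SQL down to PostgreSQL
sdf = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)                     # URL built as in the question
    .option("query", "select * from table")     # "table" is a placeholder name
    .option("driver", "org.postgresql.Driver")  # needs the PostgreSQL JDBC jar on the job
    .load()
)

# Convert to a pandas-on-Spark DataFrame if the pandas API is what is needed
psdf = sdf.pandas_api()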

Related

Google cloud functions + google cloud sql + google cloud scheduler timeout

I am trying to deploy Python code with Google Cloud Functions and Cloud Scheduler that writes a simple table to a Google Cloud SQL PostgreSQL database.
I created a PostgreSQL database
Added the Cloud SQL Client role to the App Engine default service account
Created a Cloud Pub/Sub topic
Created a scheduler job
So far so good. Then I created the following function main.py:
import pandas as pd
import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # Connect to the Cloud SQL instance through the connector using pg8000
    conn = connector.connect(
        INSTANCE_CONNECTION_NAME,
        "pg8000",
        user=DB_USER,
        password=DB_PASS,
        db=DB_NAME
    )
    return conn

# Connection pool whose connections are created by the Cloud SQL connector
pool = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)

def testdf(event, context):
    df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                       "b": [1, 2, 3, 4, 5]})
    df.to_sql("test",
              con=pool,
              if_exists="replace",
              schema='myschema')
And the requirements.txt contains:
pandas
sqlalchemy
pg8000
cloud-sql-python-connector[pg8000]
When I test the function, it always times out. There is no error, just these logs:
I can't figure out why. I have tried several code snippets from:
https://colab.research.google.com/github/GoogleCloudPlatform/cloud-sql-python-connector/blob/main/samples/notebooks/postgres_python_connector.ipynb#scrollTo=UzHaM-6TXO8h
and from
https://codelabs.developers.google.com/codelabs/connecting-to-cloud-sql-with-cloud-functions#2
I think the permission and role setup is causing the timeout. Any ideas?
Thanks
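Purely as a debugging aid, not from the thread: a minimal sketch (reusing the placeholder names from the question) that moves connection setup into the handler and logs each step, so the Cloud Functions logs show whether opening the Cloud SQL connection is where the function hangs:

import logging

import sqlalchemy
from google.cloud.sql.connector import Connector

def testdf(event, context):
    logging.info("creating Cloud SQL connector")
    connector = Connector()

    def getconn():
        # INSTANCE_CONNECTION_NAME, DB_USER, DB_PASS, DB_NAME as in the question
        return connector.connect(
            INSTANCE_CONNECTION_NAME,
            "pg8000",
            user=DB_USER,
            password=DB_PASS,
            db=DB_NAME,
        )

    pool = sqlalchemy.create_engine("postgresql+pg8000://", creator=getconn)

    logging.info("opening connection")
    with pool.connect() as conn:
        conn.execute(sqlalchemy.text("SELECT 1"))
    logging.info("connection OK")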

AWS Data Wrangler - wr.athena.read_sql_query doesn't work

I started using the AWS Data Wrangler library
(https://aws-data-wrangler.readthedocs.io/en/stable/what.html)
to execute queries on AWS Athena and use their results in my AWS Glue Python shell job.
I see that wr.athena.read_sql_query exists to obtain what I need.
This is my code:
import sys
import os
import awswrangler as wr
os.environ['AWS_DEFAULT_REGION'] = 'eu-west-1'
databases = wr.catalog.databases()
print(databases)
query='select count(*) from staging_dim_channel'
print(query)
df_res = wr.athena.read_sql_query(sql=query, database="lsk2-target")
print(df_res)
print(f'DataScannedInBytes: {df_res.query_metadata["Statistics"]["DataScannedInBytes"]}')
print(f'TotalExecutionTimeInMillis: {df_res.query_metadata["Statistics"]["TotalExecutionTimeInMillis"]}')
print(f'QueryQueueTimeInMillis: {df_res.query_metadata["Statistics"]["QueryQueueTimeInMillis"]}')
print(f'QueryPlanningTimeInMillis: {df_res.query_metadata["Statistics"]["QueryPlanningTimeInMillis"]}')
print(f'ServiceProcessingTimeInMillis: {df_res.query_metadata["Statistics"]["ServiceProcessingTimeInMillis"]}')
I retrieve the list of databases (including lsk2-target) without any problem, but read_sql_query fails and I receive:
WaiterError: Waiter BucketExists failed: Max attempts exceeded
Please, can you help me understand where I am going wrong?
Thanks!
I fixed a similar issue; the resolution is to ensure that the IAM role used has the Athena permissions needed to create tables, since this API defaults to running with ctas_approach=True.
Ref. documentation
Also, once that is resolved, ensure that the IAM role also has access to delete the files created in S3.
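Not part of the original answer, just an illustration: ctas_approach is an ordinary parameter of wr.athena.read_sql_query, so if granting the CTAS and S3-delete permissions is not an option, one alternative (slower for large result sets) is to turn it off:

import awswrangler as wr

# Assumption: with ctas_approach=False the call reads the regular query-results
# output instead of creating a temporary table, so the broader Glue/S3
# permissions mentioned above are not needed
df_res = wr.athena.read_sql_query(
    sql="select count(*) from staging_dim_channel",
    database="lsk2-target",
    ctas_approach=False,
)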
Do you have the right IAM permissions to execute the query and read the results? I bet it is an IAM issue.
Also, I assume you have set up your credentials:
[default]
aws_access_key_id = your_access_key_id
aws_secret_access_key = your_secret_access_key

Spark. Problem when writing a large file on aws s3a storage

I have an unexplained problem uploading large files to s3a. I am using an EC2 instance with spark-2.4.4-bin-hadoop2.7, writing a Spark DataFrame to s3a with V4 signing, and authenticating to S3 with an access key and secret key.
The procedure is as follows:
1) read a CSV file from s3a into a Spark DataFrame;
2) process the data;
3) write the DataFrame to s3a in Parquet format.
With a 400 MB CSV file there is no problem, everything works fine. But when I do the same with a 12 GB CSV file, an error appears while writing the Parquet file to s3a:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CA5F6E85BC36E8D, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I use the following settings:
import pyspark
from pyspark import SparkContext
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
accesskey = input()
secretkey = input()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")
hadoopConf.set("fs.s3a.fast.upload", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.access.key", accesskey)
hadoopConf.set("fs.s3a.secret.key", secretkey)
also tried to add these settings:
hadoopConf.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
hadoopConf.set('spark.speculation', "false")
hadoopConf.set('spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
hadoopConf.set('spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
but it didn’t help.
Again, the problem appears only with the large file.
I would appreciate any help. Thank you.
Try setting fs.s3a.fast.upload to true.
Otherwise, the multipart upload code was only ever experimental in Hadoop 2.7; you may have hit a corner case. Upgrade to the hadoop-2.8 versions or later and it should go away.
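Not from the answers above, but for reference, a sketch of how these options are commonly wired up once a matching hadoop-aws 2.8+ build is on the classpath. Note that V4 signing is a JVM system property, so it has to be passed as extraJavaOptions before the context starts (setting it on the Hadoop configuration afterwards has no effect), and fs.s3a.fast.upload takes a boolean, not an endpoint:

import os
from pyspark.sql import SparkSession

# Assumption: the hadoop-aws version must match the Hadoop build Spark runs on
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages=org.apache.hadoop:hadoop-aws:2.8.5 pyspark-shell"
)

spark = (
    SparkSession.builder
    # V4 signing is enabled through a JVM property on driver and executors
    .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .config("spark.hadoop.fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")  # boolean, not the endpoint
    .getOrCreate()
)
# Access key and secret key can then be set via fs.s3a.access.key /
# fs.s3a.secret.key as in the question.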
Updated hadoop from 2.7.3 to 2.8.5 and now everything works without errors.
I had the same issue. I created a Spark cluster on EMR (5.27.0) configured with Spark 2.4.4 on Hadoop 2.8.5, uploaded my code to a notebook I made in EMR JupyterLab, ran it, and it worked perfectly!

AWS Athena ODI JDBC connection

Has anyone tried connecting to AWS Athena from Oracle Data Integrator (ODI)?
I have been trying this for a long time but am not able to find the appropriate JDBC connection string.
Steps I have followed, from
https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html#jdbc-url-format
Downloaded AthenaJDBC42_2.0.7.jar driver from AWS
Copied the same into the userlib directory of ODI
Created new technology in ODI
Trying to add a data server; not able to form the JDBC URL.
Sample JDBC string format (which isn't working):
jdbc:awsathena://AwsRegion=[Region];User=[AccessKey];Password=[SecretKey];S3OutputLocation=[Output];
Please can anyone help? Thanks.
This is a shorter version of the JDBC code I implemented for Athena. It was just a POC, and we want to go with the AWS SDK rather than JDBC, though that is less important here.
package com.poc.aws.athena;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class AthenaJDBC {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Load the Simba Athena JDBC driver
        Class.forName("com.simba.athena.jdbc.Driver");
        Connection connection = DriverManager.getConnection(
                "jdbc:awsathena://AwsRegion=us-east-1;User=EXAMPLEKEY;"
                + "Password=EXAMPLESECRETKEY;S3OutputLocation=s3://example-bucket-name-us-east-1;");
        Statement statement = connection.createStatement();
        ResultSet queryResults = statement.executeQuery(ExampleConstants.ATHENA_SAMPLE_QUERY);
        System.out.println(queryResults.next());
    }
}
The only important point here is the URL:
jdbc:awsathena://AwsRegion=us-east-1;User=EXAMPLEKEY;Password=EXAMPLESECRETKEY;S3OutputLocation=s3://example-bucket-name-us-east-1;
us-east-1 must be replaced with your actual region, e.g. us-west-1.
EXAMPLEKEY must be replaced with your AWS access key that has Athena access.
EXAMPLESECRETKEY must be replaced with your AWS secret key that has Athena access.
example-bucket-name-us-east-1 must be replaced with an S3 bucket to which the above keys have write access.
There are other keys the Simba driver supports, but they are less important here.
I hope this helps.
Sorry, I missed posting the answer to this.
It all worked fine after configuring an Athena JDBC connection in ODI as below and providing the four key values when connecting.
JDBC URL: jdbc:awsathena://athena.eu-west-2.amazonaws.com:443;AWSCredentialsProviderArguments=ACCESSKEYID,SECRETACCESSKEY,SESSIONTOKEN

Referencing a Hive view from within an AWS Glue job

I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, and I have additional views like create view hive_view as select col from external_table where col = x
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table like this
import boto3

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={'Name': 'hive_view',
                'TableType': 'VIRTUAL_VIEW',
                'ViewExpandedText': 'select .... from ...'})
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.
You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "parquet",
  options = JsonOptions(Map(
    "paths" -> Array("s3://bucket/external/folder")
  ))).getDynamicFrame()

// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")

val df = spark.sql("select * from hive_view")
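Not from the answer, but the same approach translated to Python, since Glue jobs are often written in PySpark; a rough sketch in which the S3 path and the filter predicate are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the Parquet data that backs the external table
data_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/external/folder"]},
    format="parquet",
)

# Apply the view's filter on a Spark DataFrame and register it as a temp view
data_view_df = data_dyf.toDF().where("col = 'x'")  # placeholder predicate
data_view_df.createOrReplaceTempView("hive_view")

df = spark.sql("select * from hive_view")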