Executing a Redshift procedure through AWS Glue - amazon-web-services

I have created stored procedures on Redshift and need to orchestrate them. The procedures contain the DML statements for SCD creation and are limited to Redshift.
Is there a way on AWS to run these stored procedures on Redshift through Glue or any other AWS service?
As we do not have triggers on Redshift, I am exploring other options. Help is highly appreciated.

I think you can try making use of preactions/postactions. Preactions/postactions allow you to execute SQL commands before/after your dynamic frame processes data. You provide a semicolon-delimited list of commands, i.e. normal SQL statements, so you could try calling the procedures using the same approach:
datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=prod_dynamicframe,
    catalog_connection="my_rdshft",
    connection_options={
        "preactions": "delete from dw.product_dim where sku in ('xxxxx','bbbb');",
        "dbtable": "dw.product_dim",
        "database": "DWBI",
        "postactions": "truncate table ld_stg.ld_product_tbl;"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink5")
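To call your stored procedure specifically, a postaction with a CALL statement should work the same way. A minimal sketch, assuming a hypothetical procedure dw.my_scd_proc (replace with your own):
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=prod_dynamicframe,
    catalog_connection="my_rdshft",
    connection_options={
        "dbtable": "dw.product_dim",
        "database": "DWBI",
        # hypothetical procedure name; runs on Redshift after the write completes
        "postactions": "CALL dw.my_scd_proc();"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink")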
This might also be helpful.

One approach you can try is preactions and postactions as mentioned by @Eman; I haven't tried it myself.
But I used psycopg2 to trigger the stored procedure on Redshift.
Just zip the package and pass it to Glue.
Establish a connection to the Redshift cluster
and use the callproc() function to call the stored procedure.
Its usage is documented at https://www.psycopg.org/docs/usage.html
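A minimal sketch of that approach, assuming the psycopg2 package is zipped and attached to the Glue job; the endpoint, credentials, and procedure name (dw.my_scd_proc) are placeholders:
import psycopg2

# All connection details and the procedure name below are hypothetical.
conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="DWBI",
    user="glue_user",
    password="...")
cur = conn.cursor()
# callproc() as suggested above; cur.execute("CALL dw.my_scd_proc();") is an
# alternative if your procedure must be invoked with an explicit CALL.
cur.callproc("dw.my_scd_proc")
conn.commit()
cur.close()
conn.close()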

Related

AWS Glue enableUpdateCatalog not creating new partitions after successful job run

I am having a problem where I have set enableUpdateCatalog=True and also updateBehavior=LOG to update my Glue table, which has 1 partition key. After the job runs, there are no new partitions added to my Glue catalog table, but the data in S3 is separated by the partition key I have used. How do I get the job to automatically partition my Glue catalog table?
Currently I have to manually run boto3 create_partition to create partitions on my Glue catalog table. I want my job to automatically create partitions as it discovers them in the S3 path, separated by partition keys.
Code:
additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]
my_df = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database=<dst_db_name>,
    table_name=<dst_tbl_name>,
    transformation_ctx="DataSink1",
    additional_options=additionalOptions)
job.commit()
PS: I am currently using the PARQUET format.
Am I missing any permissions that have to be added to my job so that it can create partitions from the job itself?
I got it to work by adding useGlueParquetWriter: 'true' to the catalog table properties. I have also added
format_options = {
    'useGlueParquetWriter': True
}
in the write_dynamic_frame.from_catalog calls.
These steps got it to start working :)

How to insert the data into snowflake using s3 stage integration

Create a storage integration between AWS and Snowflake.
Create an external stage on S3 that uses the storage integration.
USE DATABASE Database_name;
USE SCHEMA schema_name;

TRUNCATE TABLE table_name_tmp;

COPY INTO table_name_tmp FROM
(
    SELECT $1:emp_id::INTEGER, $1:col2::DATE, $1:col3::STRING
    FROM @stage_name.schema_name.{0}/{1}
)
ON_ERROR = 'continue'
FILE_FORMAT = (TYPE = PARQUET, NULL_IF = ('NULL'), TRIM_SPACE = TRUE);

MERGE INTO table_name dst USING table_name_tmp srs
ON (dst.emp_id = srs.emp_id)
WHEN MATCHED THEN
    UPDATE SET
        dst.emp_id = srs.emp_id, dst.col2 = srs.col2, dst.col3 = srs.col3
WHEN NOT MATCHED THEN
    INSERT (emp_id, col2, col3)
    VALUES (srs.emp_id, srs.col2, srs.col3);

TRUNCATE TABLE table_name_tmp;
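If you are orchestrating this from Python (for example from a Glue job or a scheduled script), a hedged sketch using the snowflake-connector-python package could look like this; the account, credentials, stage, and table names are all placeholders:
import snowflake.connector

# All identifiers and credentials below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="Database_name",
    schema="schema_name")
cur = conn.cursor()
copy_sql = """
COPY INTO table_name_tmp FROM
(
    SELECT $1:emp_id::INTEGER, $1:col2::DATE, $1:col3::STRING
    FROM @stage_name/path/
)
ON_ERROR = 'continue'
FILE_FORMAT = (TYPE = PARQUET, NULL_IF = ('NULL'), TRIM_SPACE = TRUE)
"""
# The connector executes one statement per call, so run them in sequence.
for stmt in ("TRUNCATE TABLE table_name_tmp", copy_sql):
    cur.execute(stmt)
# ... run the MERGE and the final TRUNCATE the same way ...
cur.close()
conn.close()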
I think the question could be worded better; an S3 storage integration is a METHOD for connecting Snowflake to an external stage. You don't extract from an integration; you still use the external stage to COPY data into Snowflake. The alternative method is to use secrets and keys, although Snowflake recommends the storage integration because it is a one-time activity and it means you don't have to manage those keys yourself.
S3, by the way, is AWS's blob store. A step-by-step guide is in the docs: https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html

Referencing a Hive view from within an AWS Glue job

I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, and I have additional views like create view hive_view as select col from external_table where col = x
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table like this:
glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...'})
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.
You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "parquet",
  options = JsonOptions(Map(
    "paths" -> Array("s3://bucket/external/folder")
  ))).getDynamicFrame()
// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")
val df = spark.sql("select * from hive_view")
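If your Glue job is written in Python, a hedged sketch of the same idea (the S3 path and the filter expression are placeholders):
# Read the Parquet data, register a temp view, then query it with Spark SQL.
data_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/external/folder"]},
    format="parquet")
data_view_df = data_dyf.toDF().where("col = 'x'")  # placeholder filter
data_view_df.createOrReplaceTempView("hive_view")
df = spark.sql("select * from hive_view")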

Run Redshift Queries Periodically

I have started researching Redshift. It is defined as a "Database" service in AWS. From what I have learnt so far, we can create tables and ingest data from S3 or from external sources like Hive into a Redshift database (cluster). Also, we can use a JDBC connection to query these tables.
My questions are -
Is there a place within the Redshift cluster where we can store our queries and run them periodically (e.g. daily)?
Can we store our query in an S3 location and use that to create output to another S3 location?
Can we load a DB2 table unload file with a mixture of binary and string fields into Redshift directly, or do we need an intermediate process to convert the data into something like a CSV?
I have done some Googling about this. If you have link to resources, that will be very helpful. Thank you.
I used the cursor method from the psycopg2 package in Python. Sample code is given below. You have to set all the Redshift credentials in the env_vars file.
You can set your queries using cursor.execute. Here I mention one UPDATE query, so you can set your query in this place (you can set multiple queries). After that, you have to add this Python file to crontab or any other scheduler to run your queries periodically.
import psycopg2
import sys
import env_vars

conn_string = "dbname=%s port=%s user=%s password=%s host=%s " % (
    env_vars.RedshiftVariables.REDSHIFT_DW,
    env_vars.RedshiftVariables.REDSHIFT_PORT,
    env_vars.RedshiftVariables.REDSHIFT_USERNAME,
    env_vars.RedshiftVariables.REDSHIFT_PASSWORD,
    env_vars.RedshiftVariables.REDSHIFT_HOST)

conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("""UPDATE database.demo_table SET Device_id = '123' where Device = 'IPHONE' or Device = 'Apple';""")
conn.commit()
conn.close()

Automate aws Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or using an AWS Data pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3
def lambda_handler(event, context):
bucket_name = 'some_bucket'
client = boto3.client('athena')
config = {
'OutputLocation': 's3://' + bucket_name + '/',
'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
}
# Query Execution Parameters
sql = 'MSCK REPAIR TABLE some_database.some_table'
context = {'Database': 'some_database'}
client.start_query_execution(QueryString = sql,
QueryExecutionContext = context,
ResultConfiguration = config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
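Setting up that trigger can be done in the console; a hedged boto3 sketch is below (the bucket name and Lambda ARN are placeholders, and the Lambda also needs a resource policy allowing S3 to invoke it):
import boto3

s3 = boto3.client('s3')
# Fire the Lambda whenever an object is created under the DATA/ prefix.
s3.put_bucket_notification_configuration(
    Bucket='some_bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:repair-partitions',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'prefix', 'Value': 'DATA/'}]}}
        }]})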
Ultimately, explicitly rebuilding the partitions after you run your Spark Job using a job scheduler has the advantage of being self documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
which adds the newly created partition from your S3 location.
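A hedged sketch of doing the same from Python with boto3, registering just the partition your batch job wrote; the database, table, bucket, and result location are placeholders:
import boto3

athena = boto3.client('athena')
# Register one newly written partition explicitly (values come from the current run).
year, month, date = 2021, 5, 17
sql = (
    "ALTER TABLE some_database.some_table ADD IF NOT EXISTS "
    f"PARTITION (YEAR={year}, MONTH={month}, DATE={date}) "
    f"LOCATION 's3://some_bucket/DATA/YEAR={year}/MONTH={month}/DATE={date}/'")
athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': 'some_database'},
    ResultConfiguration={'OutputLocation': 's3://some_bucket/athena-results/'})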
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
There are multiple ways to solve the issue and get the table updated:
Call MSCK REPAIR TABLE. This will scan ALL data. It's costly, as every file is read in full (at least it's fully charged by AWS). It's also painfully slow. In short: don't do it!
Create the partitions yourself by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as yours is a nicely organised S3 key pattern). There are also downsides to this approach: A) it's hard to maintain, and B) all partitions have to be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they need to be read out and passed to Athena's and EMR's Hadoop infrastructure.
Use partition projection. There are two different styles you might want to evaluate. Here's the variant that creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, and thus large numbers of partitions can be handled more quickly. The downside is you might 'hit' some partitions that don't exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
    ...
)
PARTITIONED BY (
    `YEAR` int,
    `MONTH` int,
    `DATE` int)
...
LOCATION
    's3://DATA/'
TBLPROPERTIES (
    "projection.enabled" = "true",
    "projection.YEAR.type" = "integer",
    "projection.YEAR.range" = "2020,2025",
    "projection.MONTH.type" = "integer",
    "projection.MONTH.range" = "1,12",
    "projection.DATE.type" = "integer",
    "projection.DATE.range" = "1,31",
    "storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
Just to list all options: you can also use Glue crawlers. But that doesn't seem to be a favourable approach, as it's not as flexible as advertised.
You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts
that do the preparation work to set up your table.
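For that last option, a hedged sketch with boto3 (database, table, and S3 layout are placeholders; the StorageDescriptor is usually copied from the table definition):
import boto3

glue = boto3.client('glue')
# Reuse the table's storage descriptor and point it at the new partition prefix.
table = glue.get_table(DatabaseName='some_database', Name='some_table')
sd = table['Table']['StorageDescriptor'].copy()
sd['Location'] = 's3://some_bucket/DATA/YEAR=2021/MONTH=5/DATE=17/'
glue.create_partition(
    DatabaseName='some_database',
    TableName='some_table',
    PartitionInput={
        'Values': ['2021', '5', '17'],  # one value per partition key: YEAR, MONTH, DATE
        'StorageDescriptor': sd})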
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection
If you have many partitions, use partition projection
If you have a few partitions or partitions do not have a generic pattern, use approach #2
If you're script heavy and scripts do most of the work anyway and are easier to handle for you, consider approach #5
If you're confused and have no clue where to start - try partition projection first! It should fit 95% of the use cases.