Manually setting AWS Glue ETL Bookmark

My project is undergoing a transition to a new AWS account, and we are trying to find a way to persist our AWS Glue ETL bookmarks. We have a vast amount of processed data that we are replicating to the new account, and would like to avoid reprocessing.
It is my understanding that Glue bookmarks are just timestamps on the backend, and ideally we'd be able to get the old bookmark(s), and then manually set the bookmarks for the matching jobs in the new AWS account.
It looks like I could get my existing bookmarks via the AWS CLI using:
aws glue get-job-bookmark --job-name <value>
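The same lookup can also be scripted; here is a minimal boto3 sketch (the job name is a placeholder and default AWS credentials are assumed):
import boto3

# Hypothetical job name; substitute the real one from the old account.
glue = boto3.client("glue")
response = glue.get_job_bookmark(JobName="my-etl-job")
# The entry holds the state Glue tracks internally for the bookmark
# (run/attempt counters plus a serialized bookmark payload).
print(response["JobBookmarkEntry"])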
However, I have been unable to find any way to set the bookmarks in the new account.
As far as workarounds go, my best bets seem to be:
1. Add exclude patterns for all of our S3 data sources on our Glue crawler(s). However, this would no longer allow us to track any of our existing unprocessed data via the Glue catalog (which we currently use to track record and file counts). This is looking like the best bet so far.
2. Run the Glue ETL jobs prior to crawling our old (replicated) data in the new account, so the bookmark is set past the created-time of our replicated S3 objects. Then, once we crawl the replicated data, the ETL jobs would consider it older than the current bookmark time and not process it on the next run. However, this hack doesn't appear to work: I ended up processing all of the data when testing it.
Really at a loss here and the AWS Glue forums are a ghost town and have not been helpful in the past.

I was not able to manually set a bookmark or get a bookmark to manually progress and skip data using the methods in the question above.
However, I was able to get the Glue ETL job to skip data and progress its bookmark using the following steps:
1. Ensure any Glue ETL schedule is disabled.
2. Add the files you'd like to skip to S3.
3. Crawl the S3 data.
4. Comment out the processing steps of your Glue ETL job's Spark code. I just commented out all of the dynamic_frame steps after the initial dynamic frame creation, up until job.commit():
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Create dynamic frame from raw glue table
# (GLUE_DATABASE_NAME and JOB_TABLE are defined elsewhere in the job script)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=GLUE_DATABASE_NAME,
    table_name=JOB_TABLE, transformation_ctx="datasource0")
# ~~ COMMENT OUT ADDITIONAL STEPS ~~ #
job.commit()
5. Run the Glue ETL job with bookmarks enabled as usual (see the sketch after these steps for a scripted way to kick it off).
6. Revert the Glue ETL Spark code back to normal.
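If you prefer to trigger step 5 programmatically rather than from the console, a hedged boto3 sketch might look like the following (the job name is a placeholder; --job-bookmark-option is Glue's documented special parameter for controlling bookmarks on a run):
import boto3

# Start the (temporarily gutted) job with bookmarks explicitly enabled for this run.
glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-etl-job",  # hypothetical job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)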
Now, the Glue ETL job's bookmark has been progressed and any data that would have been processed on that job run in step 5 will have been skipped. Next time a file is added to S3 and crawled, it will be processed normally by the Glue ETL job.
This can be useful if you know you will be getting some data that you don't want processed, or if you are transitioning to a new AWS account and are replicating over all your old data like I did. It would be nice if there was a way to manually set bookmark times in Glue so this was not necessary.

Related

AWS Glue ETL Job: Bookmark or Overwrite - Best practice?

I have a JDBC connection to an RDS instance and a crawler set up to populate the Data Catalog.
What's the best practice when setting up scheduled runs in order to avoid duplicates and still make the run as efficient as possible? The ETL job's output destination is S3. The data will then be visualized in QuickSight using Athena or possibly a direct S3 connection; I'm not sure which one is preferable. In the ETL job script (PySpark), different tables are joined and new columns are calculated before storing the final data frame/dynamic frame in S3.
First job run: the data looks something like this (in real life, with a lot more columns and rows). [screenshot: first job run]
Second job run: after some time, when the job is scheduled to run again, the data has changed too (the changes were marked with red boxes in the original screenshot). [screenshot: second job run]
Upcoming job runs: after some more time has passed, the job is scheduled to run again, more changes appear, and so on.
What is the recommended setup for an ETL job like this?
Bookmarks: as far as I understand, this will produce multiple files in S3, which in turn creates duplicates that would have to be resolved with another script.
Overwrite: Using the 'overwrite' option for the data frame
df.repartition(1).write.mode('overwrite').parquet("s3a://target/name")
Today I've been using the overwrite method, but it gave me some issues: at some point, when I needed to change the ETL job script and the update changed the data stored in S3 too much, my QuickSight dashboards crashed and could not be switched over to the new data set (built on the new data frame stored in S3), which meant I had to rebuild the dashboard all over again.
Please give me your best tips and tricks for smoothly performing ETL jobs on randomly updating tables in AWS Glue!

Automate loading data from S3 to Redshift

I want to load data from S3 into Redshift. The data arrives in S3 at roughly 5 MB (approximate size) per second.
I need to automate the loading of data from S3 to Redshift.
The data is dumped into S3 by a Kafka stream consumer application.
The S3 data is organized in a folder structure.
Example folder :
bucketName/abc-event/2020/9/15/10
files in this folder :
abc-event-2020-9-15-10-00-01-abxwdhf. 5MB
abc-event-2020-9-15-10-00-02-aasdljc. 5MB
abc-event-2020-9-15-10-00-03-thntsfv. 5MB
The files in S3 contain JSON objects separated by newlines.
This data needs to be loaded into the abc-event table in Redshift.
I know of a few options like AWS Data Pipeline, AWS Glue, and the AWS Lambda Redshift loader (https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/).
What would be the best way to do it?
I would really appreciate it if someone could guide me.
Thank you.
=============================================
Thanks Prabhakar for the answer. I need some help in continuation of this.
I created a table in the Data Catalog with a crawler, and running an ETL job in Glue then loads the data from S3 to Redshift.
I am using approach 1, predicate pushdown.
New files get loaded into S3 under a different partition (say, a new hour has started).
I am adding the new partition using an AWS Glue Python script job, which adds the partition to the table through the Athena API (using ALTER TABLE ADD PARTITION).
I have checked in the console that the Python script job adds the new partition to the Data Catalog table.
However, when I run the ETL job with a pushdown predicate naming the same partition that the Python script job added, the job does not load the new files from S3 in that partition into Redshift.
I can't figure out what I am doing wrong.
In your use case you can leverage AWS Glue to load the data periodically into Redshift. You can schedule your Glue job using a trigger to run every 60 minutes, which at roughly 5 MB per second works out to around 18 GB per run in your case.
This interval can be changed according to your needs, depending on how much data you want to process each run.
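As a rough sketch of that scheduling (the trigger name, job name, and cron expression below are placeholders, not values from the question), a boto3 call could look like:
import boto3

glue = boto3.client("glue")
# Hypothetical scheduled trigger that starts the load job at the top of every hour.
glue.create_trigger(
    Name="abc-event-hourly-load",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "abc-event-to-redshift"}],
    StartOnCreation=True,
)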
There are a couple of approaches you can follow to read this data:
Predicate pushdown:
This will only load the partitions mentioned in the job. You can calculate the partition values on the fly for every run and pass them to the filter. For this you need to run the Glue crawler on each run so that the table partitions are updated in the table metadata.
If you don't want to use a crawler, you can either use boto3 create_partition or an Athena ADD PARTITION statement, which is a free operation; a sketch of the boto3 route follows.
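A minimal boto3 sketch of registering a new hourly partition, assuming the catalog table already exists (the database, table, bucket, and partition-key layout below are placeholders modelled on the folder structure in the question):
import boto3

DATABASE = "abc_event_db"   # hypothetical catalog database
TABLE = "abc_event"         # hypothetical catalog table
glue = boto3.client("glue")

# Reuse the table's storage descriptor so the new partition inherits the same
# format/SerDe settings, pointing only at the new hour's prefix.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
sd = dict(table["StorageDescriptor"])
sd["Location"] = "s3://bucketName/abc-event/2020/9/15/10/"

glue.create_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionInput={
        "Values": ["2020", "9", "15", "10"],  # must match the table's partition keys
        "StorageDescriptor": sd,
    },
)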
Job bookmark:
This will load only the latest S3 data that has accumulated since your Glue job completed its previous run. This approach might not be effective if no data is generated in S3 on some runs.
Once you have worked out the data that is to be read, you can simply write it to the Redshift table on every run.
In your case you have files present in subdirectories, for which you need to enable recurse as shown in the statement below.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=<name>, table_name=<name>,
    push_down_predicate="(year=='<2019>' and month=='<06>')",
    transformation_ctx="datasource0",
    additional_options={"recurse": True})
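As a hedged example of computing the predicate on the fly each run (this assumes the crawler exposed year/month/day/hour partition columns matching the folder structure above; adjust the names if yours differ):
from datetime import datetime, timezone

# Build the pushdown predicate for the current UTC hour instead of hardcoding it,
# then pass it as push_down_predicate to create_dynamic_frame.from_catalog above.
now = datetime.now(timezone.utc)
push_down_predicate = (
    f"year=='{now.year}' and month=='{now.month}' "
    f"and day=='{now.day}' and hour=='{now.hour}'"
)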

Create tables in Glue Data Catalog for data in S3 and unknown schema

My current use case is this: in an ETL-based service (NOTE: the ETL service does not use Glue ETL; it is an independent service), I am getting some data from AWS Redshift clusters into S3. The data in S3 is then fed into the T and L jobs. I want to populate the metadata into the Glue Catalog. The most basic solution for this is to use the Glue Crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution that I came across is to use the Glue APIs; however, I am facing the issue of data type definition with them.
Is there any way I can create/update the Glue Catalog tables when I have data in S3 and the data types are known only during the extraction process?
Also, when the T and L jobs are run, the data types should be readily available in the catalog.
In order to create, update the data catalog during your ETL process, you can make use of the following:
Update:
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]
sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
                                                    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
                                                    additional_options=additionalOptions)
job.commit()
The above can be used to update the schema. You also have the option to set the updateBehavior choosing between LOG or UPDATE_IN_DATABASE (default).
Create
To create new tables in the data catalog during your ETL you can follow this example:
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
You can specify the database and new table name using setCatalogInfo.
You also have the option to update the partitions in the data catalog using the enableUpdateCatalog argument then specifying the partitionKeys.
A more detailed explanation on the functionality can be found here.
I found a solution to the problem: I ended up utilising the Glue Catalog APIs to make it seamless and fast.
I created an interface which interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I fire a query to get the schema from the source, and then the interface does its work.
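For illustration only, a minimal boto3 sketch of registering a table whose schema was discovered at extraction time (all names, column types, formats, and locations below are placeholders, not the actual service's values):
import boto3

glue = boto3.client("glue")

# Hypothetical schema discovered during the extraction step.
columns = [{"Name": "event_id", "Type": "string"},
           {"Name": "event_ts", "Type": "timestamp"}]

glue.create_table(
    DatabaseName="etl_catalog_db",   # placeholder database
    TableInput={
        "Name": "extracted_events",  # placeholder table
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://my-bucket/extracted/events/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    },
)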

Should I run Glue crawler everytime to fetch latest data?

I have an S3 bucket named Employee. Every three hours I will be getting a file in the bucket with a timestamp attached to it. I will be using a Glue job to move the file from S3 to Redshift with some transformations. My input file in the S3 bucket will have a fixed structure. My Glue job will use the table created in the Data Catalog via the crawler as its input.
First run:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "employee_623215", transformation_ctx = "datasource0")
After three hours, if I get one more file for employee, should I crawl it again?
Is there a way to have a single table in the Data Catalog, such as employee, and update that table with the latest S3 file so it can be used by the Glue job for processing? Or should I run the crawler every time to get the latest data? The issue with that is that more and more tables will be created in my Data Catalog.
Please let me know if this is possible.
You only need to run the AWS Glue Crawler again if the schema changes. As long as the schema remains unchanged, you can just add files to Amazon S3 without having to re-run the Crawler.
Update: @Eman's comment below is correct.
If you are reading from the catalog, this suggestion will not work: partitions will not be added to the catalog table if you do not re-crawl. Running the crawler maps those new partitions to the table and allows you to process the next day's partitions.
An alternative approach is, instead of reading from the catalog, to read directly from S3 and process the data in the Glue job.
This way you do not need to run the crawler again.
Use
from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")
Documented here
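As a hedged sketch of that direct-from-S3 read (the bucket path, format, and options below are placeholders; it assumes a CSV layout with a header row):
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://employee-bucket/incoming/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="datasource0",
)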

Snowflake to/from S3 Pipeline Recommendations for ETL architecture

I am trying to build a pipeline which is sending data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on Sagemaker). I am new to Data Engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:
I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?
For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.
All monthly data pulls should be stored in their own S3 subfolders, like query_01012020, query_01022020, query_01032020, etc.
The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in Sagemaker.
I want to monitor the performance of the ML model in production over time, to catch whether the model's accuracy is degrading (some calibration-like graph, perhaps).
I want to get any error notifications in real-time when issues in the pipeline occur.
I hope you are able to guide me on relevant documentation/tutorials for this effort. I would truly appreciate the guidance.
Thank you very much.
Snowflake does not have any orchestration tools like Airflow or Oozie, so you need to use (or consider using) one of the Snowflake partner ecosystem tools like Matillion. Alternatively, you can build your own end-to-end flow using Spark, Python, or any other programming language that can connect to Snowflake via the JDBC/ODBC/Python connectors.
To feed the data to Snowflake from S3 in near real time, you can use the AWS SNS service to invoke a Snowpipe that feeds the data into a Snowflake stage environment and then take it forward via the ETL process for consumption.
Answers to each of your questions:
I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?
Ans: It is not possible in Snowflake; you have to do it via AWS or some other tool.
For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.
Ans: You can pull any size of data, and you can also have some scripting to support that via SF, but the invocation needs to be programmed.
All monthly data pulls should be stored in own S3 subfolder like this query_01012020,query_01022020,query_01032020 etc.
Ans: Feeding data into Snowflake is possible via AWS SNS (or the REST API) + Snowpipe, but vice versa is not possible.
The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in Sagemaker.
Ans: This is possible via AWS SNS + Snowpipe.
I want to monitor the performance of the ML model in production overtime to catch if the model is decreasing its accuracy (some calibration-like graph perhaps).
Ans: Not possible via Snowflake.
I would approach the problem like this:
Hold 12 months of data in a temporary table (I am sure you know all the required queries; since you asked for tutorials, I am including them in case they are helpful for you as well as others).
-- Initial pull: hold 12 months of data ...
Drop table if exists <TABLE_NAME>;
Create Temporary Table <TABLE_NAME> as (
    Select *
    From <ORIGINAL_TABLE>
    Where date_field between current_date - 365 and current_date
);
-- Export data to S3 ...
copy into 's3://path/to/export/directory'
from DB_NAME.SCHEMA_NAME.TABLE_NAME
file_format = (type = csv field_delimiter = '|' skip_header = 0)
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key');
Once your ML stuff is done, import the data back into Snowflake like this:
-- Import from S3 ...
copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
from 's3://path/to/your/csv_file_name.csv'
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
file_format = (type = csv field_delimiter = '|' skip_header = 1);
I am not sure whether Snowflake has released any ML features, or how you will do the ML on your end, etc.
For the scheduling I would suggest one of the following:
Place your code in a shell script or a Python script and schedule it to run once a month.
Use Snowflake tasks as follows:
CREATE TASK monthly_task_1
  WAREHOUSE = <warehouse_name>
  SCHEDULE = 'USING CRON 0 0 1 * * America/Chicago'
AS
  <insert your create temporary table query here>;
CREATE TASK monthly_task_2
  WAREHOUSE = <warehouse_name>
  AFTER monthly_task_1
AS
  <insert your S3 export query here>;
You can read more about snowflake tasks here: https://docs.snowflake.com/en/sql-reference/sql/create-task.html
For importing results back into Snowflake from S3 after the ML stuff is done, you can add a few lines to your ML code (presumably in Python) to execute the COPY INTO statement for the import from S3 written above.
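For example, a minimal sketch using the Snowflake Python connector (the credentials and object names are placeholders carried over from the statements above):
import snowflake.connector

# Connect with placeholder credentials; in practice pull these from a secrets store.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="your_warehouse", database="DB_NAME", schema="SCHEMA_NAME",
)
try:
    # Run the same COPY INTO shown above once Sagemaker scoring has finished.
    conn.cursor().execute("""
        copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
        from 's3://path/to/your/csv_file_name.csv'
        credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
        file_format = (type = csv field_delimiter = '|' skip_header = 1)
    """)
finally:
    conn.close()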