I have a JDBC connection to an RDS instance and a crawler set up to populate the Data Catalog.
What's the best practice when setting up scheduled runs in order to avoid duplicates while still keeping each run as efficient as possible? The ETL job writes its output to S3. The data will then be visualized in QuickSight using Athena or possibly a direct S3 connection; I'm not sure which is preferable. In the ETL job script (PySpark), different tables are joined and new columns are calculated before the final data frame/dynamic frame is stored in S3.
First job run: The data looks something like this (in real life, with a lot more columns and rows). [screenshot: first job run]
Second job run: After some time, when the job is scheduled to run again, the data has changed as well (the changes were highlighted with red boxes in the screenshot). [screenshot: second job run]
Upcoming job runs: After some more time has passed the job runs again, more changes appear, and so on.
What is the recommended setup for an ETL job like this?
Bookmarks: As I understand it, this will produce multiple files in S3, which in turn creates duplicates that would have to be resolved with another script.
Overwrite: Using the 'overwrite' option for the data frame
df.repartition(1).write.mode('overwrite').parquet("s3a://target/name")
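For context, this write sits at the end of the Glue script, roughly like the sketch below (assuming the job's final result is a DynamicFrame; final_dyf is a placeholder name):
# Assuming the final result of the joins/calculations is a DynamicFrame named final_dyf (placeholder)
final_df = final_dyf.toDF()          # DynamicFrame -> Spark DataFrame
(final_df
    .repartition(1)                  # single output file per run
    .write
    .mode('overwrite')               # replaces the previous objects under the prefix
    .parquet("s3a://target/name"))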
So far I have been using the overwrite method, but it has given me some issues: at one point I needed to change the ETL job script, and the update changed the data stored in S3 so much that my QuickSight dashboards crashed and could not be repointed at the new data set (built on the new data frame stored in S3), which meant I had to rebuild the dashboards all over again.
Please give me your best tips and tricks for smoothly performing ETL jobs on randomly updating tables in AWS Glue!
Related
I have a trigger in Oracle. Can anyone please help me with how it can be replicated in Redshift? Something like DynamoDB's managed stream functionality would also work.
Redshift does not support triggers because it is a data warehousing system designed to import large amounts of data in a limited time, so if every row insert could fire a trigger, the performance of batch inserts would suffer. This is probably why the Redshift developers didn't bother to support them, and I agree with that decision. Trigger-type behavior should be part of the business application logic that runs in the OLTP environment, not of the data warehousing logic. If you want to run some code in the DW after inserting or updating data, you have to do it as another step of your data pipeline.
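As a rough illustration of that extra pipeline step (connection details, table names, and SQL are placeholders; psycopg2 is just one way to run SQL against Redshift):
import psycopg2

# Hypothetical connection details and SQL; the point is that the "trigger" logic
# runs as an explicit step after the bulk load, not on every row insert.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***"
)
with conn, conn.cursor() as cur:
    # Step 1: the bulk load (COPY is the fast path on Redshift)
    cur.execute("""
        COPY staging.orders
        FROM 's3://my-bucket/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto';
    """)
    # Step 2: the logic you would otherwise have put in a trigger
    cur.execute("""
        INSERT INTO marts.order_totals (order_date, total)
        SELECT order_date, SUM(amount) FROM staging.orders GROUP BY order_date;
    """)
conn.close()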
Would like some suggestions on loading data to Redshift.
Currently we have an EMR cluster where raw data is ingested regularly. We have a transformation job which runs daily and creates the final modeled object. However, we are following a truncate-and-load strategy in EMR. Due to business reasons there is no way to figure out which data has changed.
We are planning to store this modeled object in Redshift.
Now my question is: if we follow the same truncate-and-load strategy in Redshift as well, will that work?
I was only able to find articles that say to use COPY for bulk loads and INSERT for small updates, but nothing on whether we can and should be using Redshift when the data is overwritten daily.
My project is undergoing a transition to a new AWS account, and we are trying to find a way to persist our AWS Glue ETL bookmarks. We have a vast amount of processed data that we are replicating to the new account, and would like to avoid reprocessing.
It is my understanding that Glue bookmarks are just timestamps on the backend, and ideally we'd be able to get the old bookmark(s), and then manually set the bookmarks for the matching jobs in the new AWS account.
It looks like I could get my existing bookmarks via the AWS CLI using:
aws glue get-job-bookmark --job-name <value>
(Source)
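The same call is available through boto3's Glue client; a minimal sketch (region and job name are placeholders):
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Same call as the CLI above; returns the bookmark entry for the job
resp = glue.get_job_bookmark(JobName="my-etl-job")
print(resp["JobBookmarkEntry"])

# The only write operation the bookmark API exposes is a reset back to the beginning;
# there is no call that sets a bookmark to an arbitrary point:
# glue.reset_job_bookmark(JobName="my-etl-job")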
However, I have been unable to find any method of setting the bookmarks in the new account.
As far as workarounds, my best bets seem to be:
Add exclude patterns for all of our S3 data sources on our Glue crawler(s), though this would no longer allow us to track any of our existing unprocessed data via the Glue catalog (which we currently use to track record and file counts). This is looking like the best bet so far...
Attempt to run the Glue ETL jobs prior to crawling our old (replicated) data in the new account, setting the bookmark past the created-time of our replicated S3 objects. Then once we crawl the replicated data, the ETL jobs will consider them older than the current bookmark time and not process them on the next run. However, it appears this hack doesn't work as I ended up processing all data when testing this.
Really at a loss here and the AWS Glue forums are a ghost town and have not been helpful in the past.
I was not able to manually set a bookmark or get a bookmark to manually progress and skip data using the methods in the question above.
However, I was able to get the Glue ETL job to skip data and progress its bookmark using the following steps:
Ensure any Glue ETL schedule is disabled
Add the files you'd like to skip to S3
Crawl S3 data
Comment out the processing steps of your Glue ETL job's Spark code. I just commented out all of the dynamic_frame steps after the initial dynamic frame creation, up until job.commit().
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create dynamic frame from raw glue table (this keeps the bookmark moving)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=GLUE_DATABASE_NAME, table_name=JOB_TABLE,
    transformation_ctx="datasource0")

# ~~ COMMENT OUT ADDITIONAL STEPS ~~ #

job.commit()
Run the Glue ETL job with its bookmark enabled, as usual
Revert the Glue ETL Spark code back to normal
Now, the Glue ETL job's bookmark has been progressed and any data that would have been processed on that job run in step 5 will have been skipped. Next time a file is added to S3 and crawled, it will be processed normally by the Glue ETL job.
This can be useful if you know you will be getting some data that you don't want processed, or if you are transitioning to a new AWS account and are replicating over all your old data like I did. It would be nice if there was a way to manually set bookmark times in Glue so this was not necessary.
I want to load data from S3 to Redshift. The data is arriving in S3 at around 5 MB (approximate size) per second.
I need to automate the loading of data from S3 to Redshift.
The data is dumped to S3 by the Kafka-stream consumer application.
The S3 data is organized in a folder structure.
Example folder:
bucketName/abc-event/2020/9/15/10
Files in this folder:
abc-event-2020-9-15-10-00-01-abxwdhf. 5MB
abc-event-2020-9-15-10-00-02-aasdljc. 5MB
abc-event-2020-9-15-10-00-03-thntsfv. 5MB
The files in S3 contain JSON objects separated by newlines.
This data needs to be loaded into the abc-event table in Redshift.
I know of a few options, like AWS Data Pipeline, AWS Glue, and the AWS Lambda Redshift loader (https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/).
What would be the best way to do it?
I'd really appreciate it if someone could guide me.
Thank you
=============================================
Thanks Prabhakar for the answer. I need some more help continuing from this.
I created a table in the Data Catalog with a crawler, and then running an ETL job in Glue does the job of loading the data from S3 to Redshift.
I am using approach 1 (predicate pushdown).
New files get loaded into S3 in a different partition (say, a new hour started).
I am adding the new partition using an AWS Glue Python script job.
The job adds the new partition to the table via the Athena API (using ALTER TABLE ADD PARTITION); this step is sketched at the end of this update.
I have checked in the console that the new partition gets added by the Python script job, i.e. the new partition shows up in the Data Catalog table.
When I then run the same ETL job with a pushdown predicate pointing at the same partition that the Python script job added, the job does not load the new files from S3 in that partition into Redshift.
I can't figure out what I am doing wrong.
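For reference, the partition-adding step looks roughly like this (the table, database, and result-location names are placeholders):
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Register the new hourly partition in the Data Catalog via Athena DDL.
# The partition keys and values must match exactly what the ETL job later
# passes in its push_down_predicate.
athena.start_query_execution(
    QueryString="""
        ALTER TABLE abc_event ADD IF NOT EXISTS
        PARTITION (year='2020', month='9', day='15', hour='10')
        LOCATION 's3://bucketName/abc-event/2020/9/15/10/'
    """,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://bucketName/athena-results/"}
)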
In your use case you can leverage AWS Glue to load the data into Redshift periodically. You can schedule your Glue job with a trigger to run every 60 minutes, which at ~5 MB per second works out to roughly 18 GB per run in your case.
This interval can be changed according to your needs, depending on how much data you want to process in each run.
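Such a scheduled trigger can be created from the console or, as a rough sketch with boto3 (trigger name, job name, and region are placeholders):
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Run the load job at the top of every hour
glue.create_trigger(
    Name="abc-event-hourly-load",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",     # every 60 minutes
    Actions=[{"JobName": "abc-event-to-redshift"}],
    StartOnCreation=True,
)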
There are a couple of approaches you can follow in reading this data:
Predicate pushdown:
This will only load the partitions that are mentioned in the job. You can calculate the partition values on the fly in every run and pass them to the filter. For this you need to run the Glue crawler on each run so that the table partitions are updated in the table metadata.
If you don't want to use a crawler, then you can either use boto3 create_partition or Athena's add partition, which is a free operation (see the sketch after the read statement below).
Job bookmark:
This will load only the latest S3 data accumulated since your Glue job completed its previous run. This approach might not be effective if no data is generated in S3 between some runs.
Once you have calculated the data that is to be read, you can simply write it to the Redshift table on every run.
In your case you have files present in subdirectories, for which you need to enable recurse as shown in the statement below.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =<name>, table_name = <name>, push_down_predicate = "(year=='<2019>' and month=='<06>')", transformation_ctx = "datasource0", additional_options = {"recurse": True})
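Putting the pieces together, an hourly run using predicate pushdown might look roughly like the sketch below (database, table, bucket, and the Redshift catalog connection are placeholder names; boto3 create_partition is used here instead of a crawler, and glueContext/job are assumed to be initialised as usual):
import boto3
from datetime import datetime, timezone

# Work out the partition for the current run (hourly folders as in the question);
# a real job might target the previous, already-complete hour instead.
now = datetime.now(timezone.utc)
year, month, day, hour = str(now.year), str(now.month), str(now.day), str(now.hour)

# Register the partition without a crawler (a free Glue API call)
glue_client = boto3.client("glue")
try:
    glue_client.create_partition(
        DatabaseName="my_database",
        TableName="abc_event",
        PartitionInput={
            "Values": [year, month, day, hour],
            "StorageDescriptor": {
                "Location": f"s3://bucketName/abc-event/{year}/{month}/{day}/{hour}/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
            },
        },
    )
except glue_client.exceptions.AlreadyExistsException:
    pass  # the partition was already registered on an earlier run

# Read only that partition, then write it to Redshift through a catalog connection
predicate = f"(year=='{year}' and month=='{month}' and day=='{day}' and hour=='{hour}')"
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="abc_event",
    push_down_predicate=predicate, transformation_ctx="datasource0",
    additional_options={"recurse": True})
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0, catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "abc_event", "database": "dev"},
    redshift_tmp_dir="s3://bucketName/glue-temp/")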
I want to retrieve data from BigQuery that arrives every hour, do some processing, and write the newly calculated variables to a new BigQuery table. The thing is that I've never worked with GCP before and I have to for my job now.
I already have my code in Python to process the data, but it only works with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query follow the steps:
In BigQuery Console, write your query
After writing the correct query, click on Schedule query and then on Create new scheduled query.
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
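If you'd rather create the scheduled query programmatically instead of through the console, the BigQuery Data Transfer Service Python client can set up the same thing; a rough sketch (project, dataset, table, and query are placeholders):
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder project, destination dataset/table, and transformation query
parent = client.common_project_path("my-project")
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",
    display_name="hourly-transform",
    data_source_id="scheduled_query",
    schedule="every 2 hours",          # same syntax as the console's custom schedule
    params={
        "query": "SELECT *, amount * 1.1 AS adjusted_amount FROM `my-project.raw.events`",
        "destination_table_name_template": "events_transformed",
        "write_disposition": "WRITE_APPEND",
    },
)
client.create_transfer_config(parent=parent, transfer_config=transfer_config)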
In line with Google's recommendation, when your data are already in BigQuery and you want the transformed results to land in BigQuery as well, it's always quicker and cheaper to do the processing in BigQuery itself if you can express it in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, use SQL directly, you can create User Defined Functions (UDFs) in BigQuery in JavaScript.
EDIT
If you have no information about when the data are loaded into BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if the data arrive through Pub/Sub; otherwise it's not magic!
Because you don't have the information of when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.