Optimize Glue job and comparison between Visual and Script mode, JDBC connection parameters - amazon-web-services

I am working on a Glue job to read data from an Oracle database and write it into Redshift. I have crawled the tables from my Oracle source and Redshift target. When I use the Glue visual editor, with an Oracle source and a write-to-Redshift component, the job completes in around 7 minutes with G.1X and 5 workers. I tried other combinations and concluded this is the best combination I can use.
Now I want to optimize this further and am trying to write a PySpark script from scratch. I used a simple JDBC read and write, but it is taking more than 30 minutes to complete. I have 3M records in the source. I have tried numPartitions 10 and fetch size 30000 (roughly as in the sketch after the questions below). My questions are:
What are the default configurations used by the Glue visual job, since it finishes so much faster?
Is the fetch size already configured on the source side when reading through a JDBC connection? If the Glue visual job uses a fetch size larger than the one I specified, could that be the reason for the faster execution?
Please let me know if you need any further details.
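For context, here is a minimal sketch of a parallel JDBC read and write in PySpark using the parameters mentioned above; the connection URLs, credentials, table names, partition column, and bounds are placeholders, not the actual job configuration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-redshift").getOrCreate()

# Parallel JDBC read: numPartitions/partitionColumn/bounds split the read into
# concurrent tasks; fetchsize controls how many rows are fetched per round trip.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # placeholder
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("dbtable", "SRC_SCHEMA.SRC_TABLE")                      # placeholder
    .option("user", "user")
    .option("password", "password")
    .option("fetchsize", 30000)
    .option("numPartitions", 10)
    .option("partitionColumn", "ID")      # numeric column to split on (placeholder)
    .option("lowerBound", 1)
    .option("upperBound", 3000000)
    .load()
)

# Plain JDBC write to Redshift; placeholder URL and table
(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://redshift-host:5439/dev")        # placeholder
    .option("dbtable", "TGT_SCHEMA.TGT_TABLE")
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save()
)

One thing worth noting: a plain JDBC write sends inserts to Redshift over the wire, whereas the Glue Redshift target typically stages data in S3 and issues a COPY, which is usually much faster for millions of rows.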

Related

A record is entered into a Redshift table, and a Databricks notebook should be triggered [duplicate]

I have a trigger in Oracle. Can anyone please help me with how it can be replicated to Redshift? DynamoDB managed stream kind of functionality will also work.
Redshift does not support triggers because it's a data warehousing system designed to import large amounts of data in a limited time. If every row insert could fire a trigger, the performance of batch inserts would suffer. This is probably why the Redshift developers didn't bother to support this, and I agree with them. Trigger-type behavior should be part of the business application logic that runs in the OLTP environment, not the data warehousing logic. If you want to run some code in the DW after inserting or updating data, you have to do it as another step of your data pipeline.
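For example, a minimal sketch of such a post-load pipeline step, assuming a psycopg2 connection to Redshift; the host, table names, and SQL are placeholders.

import psycopg2

# Run the follow-up logic as an explicit step after the batch load finishes,
# instead of relying on row-level triggers.
conn = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="user",
    password="password",
)
with conn, conn.cursor() as cur:
    # whatever the Oracle trigger used to do, expressed as set-based SQL
    cur.execute("""
        INSERT INTO audit.load_log (table_name, loaded_at)
        VALUES ('public.orders', GETDATE());
    """)
conn.close()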

AWS Glue ETL Job: Bookmark or Overwrite - Best practice?

I have a JDBC connection to an RDS instance and a crawler set up to populate the Data Catalog.
What's the best practice when setting up scheduled runs in order to avoid duplicates and still make the run as efficient as possible? The ETL job output is written to S3. The data will then be visualized in QuickSight using Athena or possibly a direct S3 connection; I'm not sure which one is preferable. In the ETL job script (PySpark), different tables are joined and new columns are calculated before the final data frame/dynamic frame is stored in S3.
First job run: the data looks something like this (in real life, with a lot more columns and rows): [screenshot: first job run]
Second job run: after some time, when the job is scheduled to run again, the data has changed too (notice the changes marked with the red boxes): [screenshot: second job run]
Upcoming job runs: after some more time has passed, the job is scheduled to run again, some more changes can be seen, and so on.
What is the recommended setup for an ETL job like this?
Bookmarks: As far as I understand, bookmarks will produce multiple files in S3, which in turn creates duplicates that would have to be resolved with another script (a bookmark-enabled script is sketched after this question).
Overwrite: Using the 'overwrite' option for the data frame
df.repartition(1).write.mode('overwrite').parquet("s3a://target/name")
Today I've been using the overwrite method, but it gave me some issues: at some point, when I needed to change the ETL job script and the update changed the data stored in S3 too much, my QuickSight dashboards crashed and could not be repointed to the new data set (built on the new data frame stored in S3), which meant I had to rebuild the dashboard all over again.
Please give me your best tips and tricks for smoothly performing ETL jobs on randomly updating tables in AWS Glue!
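For reference, using job bookmarks (besides enabling them in the job properties) requires transformation_ctx values on the bookmarked sources/sinks and an explicit job.commit(); here is a minimal sketch, where the catalog database, table name, and S3 path are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the key the bookmark state is tracked under
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_rds_db",        # placeholder
    table_name="my_table",       # placeholder
    transformation_ctx="source",
)

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://target/name"},  # placeholder
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # persists the bookmark state for the next scheduled run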

Migrate and Transform MySQL rows from one table to another

I need to migrate all records (3 billion) from one MySQL Aurora table to 5 different tables in the same cluster.
A transformation of 2 of the columns also has to happen.
When we migrate, we need to convert XML to JSON, and the JSON will be stored in one of the destination tables.
We are looking for the best way to migrate this data from one MySQL table to another. We are on AWS, so we have the flexibility to use any service that can help us achieve this.
So far this is what we have planned:
MySQL table ---- DMS ----> S3 ---- Lambda to convert XML to JSON and create 5 types of files ----> Lambda on file creation runs LOAD DATA LOCAL into 5 different MySQL tables.
One thing we would like to know is how we can handle a LOAD DATA LOCAL failure partway through. Lambda will submit the LOAD DATA LOCAL query from S3 to MySQL, but how can we track in Lambda whether LOAD DATA LOCAL succeeded or failed?
We cannot use any direct approach because we need to transform the data in between.
Is there a better way we can use here?
Can we use Data Pipeline in place of a Lambda function for LOAD DATA LOCAL?
Or can we use DMS to load the files from S3 to MySQL?
Please suggest the best approach that has the capability to handle failure scenarios.
What you are basically doing is an ETL process. I would advise you to look into either AWS EMR or AWS Glue. Since you don't seem to have that much experience, I would use Glue.
With Glue you could basically read from MySQL, do the transformation, and write back directly to MySQL. Also, since Glue runs Spark in the background, you can leverage its distributed computing, which will speed up your process compared to a single-threaded Lambda function.
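A rough sketch of that approach in a Glue/PySpark script, assuming the xmltodict package is made available to the job; the connection URLs, table names, and the payload_xml/payload_json columns are placeholders.

import json
import xmltodict  # assumed to be supplied to the job as an extra Python library
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("mysql-migrate").getOrCreate()

def xml_to_json(xml_str):
    # hypothetical transformation of one of the two columns
    return json.dumps(xmltodict.parse(xml_str)) if xml_str else None

xml_to_json_udf = udf(xml_to_json, StringType())

# Read the source table over JDBC
src = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://aurora-host:3306/sourcedb")  # placeholder
    .option("dbtable", "source_table")
    .option("user", "user")
    .option("password", "password")
    .load()
)

transformed = src.withColumn("payload_json", xml_to_json_udf("payload_xml"))

# Write back over JDBC; repeat (with the relevant subset of columns) per target table
(
    transformed.write.format("jdbc")
    .option("url", "jdbc:mysql://aurora-host:3306/targetdb")  # placeholder
    .option("dbtable", "target_table_1")
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save()
)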

AWS DMS Binary Reader + Oracle REDO logs vs Binary Reader + Archived Logs

I am planning a migration from an on-premises Oracle 18c (1.5TB of data, 400TPS) to AWS-hosted databases using AWS Database Migration Service.
According to the official DMS documentation, DMS Binary Reader seems to be the only choice because our database is a PDB instance, and it can handle the REDO logs or the archived logs as the source for Change Data Capture.
I am assuming the archived logs would be a better choice in terms of CDC performance because they are smaller than the online REDO logs, but I'm not really sure of the other benefits of choosing the archived logs as the CDC source over the REDO logs. Does anyone know?
Oracle mining will mine the online redo logs until it falls behind; then it will mine the archive logs. You have two options for CDC: Oracle LogMiner or Oracle Binary Reader.
In general, use Oracle LogMiner for migrating your Oracle database unless you have one of the following situations:
You need to run several migration tasks on the source Oracle database.
The volume of changes or the redo log volume on the source Oracle database is high. When using Oracle LogMiner, the 32 KB buffer limit within LogMiner impacts the performance of change data capture on databases with a high volume of change. For example, a change rate of 10 GB per hour on the source database can exceed DMS change data capture capabilities when using LogMiner.
Your workload includes UPDATE statements that update only LOB columns. In this case, use Binary Reader. These UPDATE statements aren't supported by Oracle LogMiner.
Your source is Oracle version 11 and you perform UPDATE statements on XMLTYPE and LOB columns. In this case, you must use Binary Reader. These statements aren't supported by Oracle LogMiner.
You are migrating LOB columns from Oracle 12c. For Oracle 12c, LogMiner doesn't support LOB columns, so in this case use Binary Reader.
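For reference, switching a DMS Oracle source endpoint from LogMiner to Binary Reader is done through extra connection attributes on the endpoint. A hedged boto3 sketch follows; the endpoint details are placeholders, and the attribute string should be verified against the current DMS documentation.

import boto3

dms = boto3.client("dms", region_name="us-east-1")

dms.create_endpoint(
    EndpointIdentifier="oracle-source-bfile",
    EndpointType="source",
    EngineName="oracle",
    ServerName="onprem-oracle.example.com",  # placeholder
    Port=1521,
    DatabaseName="ORCLPDB",                  # placeholder
    Username="dms_user",
    Password="********",
    # commonly documented attributes to select Binary Reader over LogMiner
    ExtraConnectionAttributes="useLogminerReader=N;useBfile=Y",
)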

Streaming Insert/Update using Google Cloud Functions

This is regarding streaming data insert/update using Google Cloud Functions. I am using Salesforce as the source database and want to do streaming inserts/updates to Google BigQuery tables. The insert part is working fine, but how can I do an update, since streaming data first lands in a streaming buffer, which doesn't allow DML operations for a period of around 30 minutes? Any help on this will be really appreciated.
Got a reply from Google Support like below
"It is true that modifying recent data for the last 30 minutes (with an active streaming buffer) is not possible as one of the limitations of BigQuery DML operations"
One workaround we can try is to copy the data from the streaming table to a new table and perform any operations on that copy (sketched below). This helped me.
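A minimal sketch of that workaround with the BigQuery Python client; the project, dataset, table IDs, and the status column are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

source_table = "my-project.my_dataset.streaming_table"  # placeholder
dest_table = "my-project.my_dataset.working_table"      # placeholder

# Copy the current contents of the streaming table to a regular table
copy_job = client.copy_table(source_table, dest_table)
copy_job.result()  # wait for the copy to finish

# The copy is not subject to the streaming-buffer restriction, so DML works
client.query(
    f"UPDATE `{dest_table}` SET status = 'processed' WHERE status IS NULL"
).result()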