AWS DMS Binary Reader + Oracle REDO logs vs Binary Reader + Archived Logs

I am planning a migration from an on-premises Oracle 18c (1.5TB of data, 400TPS) to AWS-hosted databases using AWS Database Migration Service.
According to the official DMS documentation, DMS Binary Reader seems to be the only choice because our database is a PDB instance, and it can handle the REDO logs or the archived logs as the source for Change Data Capture.
I am assuming the archived logs would be a better choice in terms of CDC performance because they are smaller than the online REDO logs, but I am not really sure of the other benefits of choosing the archived logs as the CDC source over the REDO logs. Does anyone know?

Oracle mining reads the online redo logs until it falls behind, and then it mines the archived logs. You have two options for CDC: Oracle LogMiner or Oracle Binary Reader.
In general, use Oracle LogMiner for migrating your Oracle database unless you have one of the following situations:
You need to run several migration tasks on the source Oracle database.
The volume of changes or the redo log volume on the source Oracle database is high. When Oracle LogMiner is used for CDC, its 32 KB buffer limit impacts change data capture performance on databases with a high volume of change. For example, a change rate of around 10 GB per hour can exceed the CDC capabilities of DMS when LogMiner is the source.
Your workload includes UPDATE statements that update only LOB columns. In this case, use Binary Reader. These UPDATE statements aren't supported by Oracle LogMiner.
Your source is Oracle version 11 and you perform UPDATE statements on XMLTYPE and LOB columns. In this case, you must use Binary Reader. These statements aren't supported by Oracle LogMiner.
You are migrating LOB columns from Oracle 12c. For Oracle 12c, LogMiner doesn't support LOB columns, so in this case use Binary Reader.
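For reference, Binary Reader and an archived-logs-only CDC source are chosen through the Oracle endpoint settings. Below is a minimal sketch using boto3; the host, credentials, and identifiers are placeholders, and the setting names should be verified against the current OracleSettings API before use:

import boto3

dms = boto3.client("dms")

# Hypothetical source endpoint for an Oracle PDB using Binary Reader.
# UseLogminerReader=False + UseBFile=True selects Binary Reader;
# ArchivedLogsOnly=True tells CDC to read only the archived redo logs.
dms.create_endpoint(
    EndpointIdentifier="oracle-source-binary-reader",   # placeholder name
    EndpointType="source",
    EngineName="oracle",
    ServerName="onprem-oracle.example.com",              # placeholder host
    Port=1521,
    DatabaseName="MYPDB",                                # placeholder service name
    Username="dms_user",                                 # placeholder credentials
    Password="********",
    OracleSettings={
        "UseLogminerReader": False,
        "UseBFile": True,
        "ArchivedLogsOnly": True,
        "ArchivedLogDestId": 1,
    },
)

One trade-off to keep in mind: with ArchivedLogsOnly, DMS can only capture a change after the redo log containing it has been archived, so CDC latency increases, while contention on the online redo logs goes down, which is often the motivation for choosing it.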

Related

Optimize Glue job and comparison between Visual and Script mode, JDBC Connection parameters

I am working on a Glue job to read data from an Oracle database and write it into Redshift. I have crawled the tables from my Oracle source and Redshift target. When I use the Glue visual editor, with an Oracle source and a write-to-Redshift component, it completes in around 7 minutes with G.1X and 5 workers. I tried other combinations and concluded this is the best combination I can use.
Now I want to optimize this further and am trying to write a PySpark script from scratch. I used a simple JDBC read and write, but it takes more than 30 minutes to complete. I have 3M records in the source. I have tried numPartitions 10 and fetchsize 30000. My questions are:
What are the default configs used by the Glue visual job, since it finishes so much faster?
Is the fetch size already configured on the source side when reading through a JDBC connection? If the Glue visual job uses a larger value than the one I specified, could that be the reason for the faster execution?
Please let me know if you need any further details.
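For comparison, here is roughly how those options are passed in a hand-written PySpark JDBC read (a minimal sketch; the URL, table, and partition column are placeholders, and numPartitions only parallelizes the read when partitionColumn, lowerBound, and upperBound are also supplied):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-to-redshift").getOrCreate()

# Placeholder connection details; without partitionColumn/lowerBound/upperBound
# Spark reads through a single connection regardless of numPartitions.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # placeholder
    .option("dbtable", "MYSCHEMA.MYTABLE")                           # placeholder
    .option("user", "app_user")
    .option("password", "********")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchsize", "30000")
    .option("numPartitions", "10")
    .option("partitionColumn", "ID")   # must be numeric or date; placeholder
    .option("lowerBound", "1")
    .option("upperBound", "3000000")
    .load()
)

Note that the Glue visual job reads through GlueContext/DynamicFrames, which have their own JDBC options (for example hashfield/hashpartitions for parallel reads), so it is worth inspecting the script Glue generates to see what it actually sets.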

Row level changes captured via AWS DMS

I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn't track changes, and you would need audit columns to do this at the user level. You may be able to deduce this from Redshift query history and saved data-input files, but this will be solution dependent. Query history can be obtained in a couple of ways, but both require some action. The first is to review the query logs, but these are only retained for a few days; if you need to look back further than that, you need a process to save these tables so the information isn't lost. The other is to turn on Redshift audit logging to S3, but this needs to be enabled before you run queries on Redshift. There may be some logging from DMS that could be helpful, but I think the bottom-line answer is that row-level change tracking is not something that is on in Redshift by default.
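If you go the query-history route, a small sketch like the following can pull recent statements that touched a given table from STL_QUERY (assuming the redshift_connector package and placeholder connection details; remember STL_QUERY is only retained for a few days):

import redshift_connector

# Placeholder connection details.
conn = redshift_connector.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)
cur = conn.cursor()

# Pull recent statements whose text mentions the target table.
cur.execute(
    """
    SELECT starttime, TRIM(querytxt) AS query_text
    FROM stl_query
    WHERE querytxt ILIKE '%my_target_table%'   -- placeholder table name
    ORDER BY starttime DESC
    LIMIT 50
    """
)
for starttime, query_text in cur.fetchall():
    print(starttime, query_text[:120])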

What does "interactive query" or "interactive analytics" mean in Presto SQL?

The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."
An interactive query system is basically a user interface that translates the input from the user into SQL queries. These are then sent to Presto, which processes the queries and gets the data and sends it back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The user expects to be able to, for example, change one criterion and get the updated chart or visualization in near real time, just as they would expect from any typical application, even if producing that analysis involves processing lots of data.
Presto can do that because it can query massive distributed storage systems such as HDFS, cloud object stores, and RDBMSs, among others. It can also be set up with a large cluster of workers that query the sources in parallel, processing massive amounts of data for analysis while still being fast enough to meet user expectations.
A typical application to use for the visualization is Apache Superset. You just hook up Presto to it via the JDBC driver. Presto has to be configured to point at the underlying data sources and you are ready to go.
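To make "interactive" concrete, here is a minimal sketch of a client issuing an ad-hoc query against Presto and getting results back immediately (assuming the presto-python-client package and placeholder connection details; a UI like Superset does essentially the same thing through its driver):

import prestodb

# Placeholder coordinator host, catalog, and schema.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# An ad-hoc aggregation a user might tweak repeatedly from a dashboard.
cur.execute(
    "SELECT country, count(*) AS orders "
    "FROM orders "                                  # placeholder table
    "WHERE order_date >= date '2023-01-01' "
    "GROUP BY country ORDER BY orders DESC LIMIT 20"
)
for row in cur.fetchall():
    print(row)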

How AWS DMS works internally

In AWS DMS, how does the migration happen internally? Is it like exporting the entire data from the source table and importing it into the destination table? Or is it migrating table records one by one to the destination table? I am new to AWS DMS and don't have much idea of how things work there.
AWS publishes how DMS works in its documentation and blog posts. This is the list I wish I had when I started with DMS:
For a high level understanding see: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.html
A task can consist of three major phases:
The full load of existing data
The application of cached changes
Ongoing replication
During a full load migration, where existing data from the source is moved to the target, AWS DMS loads data from tables on the source data store to tables on the target data store. While the full load is in progress, any changes made to the tables being loaded are cached on the replication server; these are the cached changes.
...
When the full load for a given table is complete, AWS DMS immediately begins to apply the cached changes for that table. When all tables have been loaded, AWS DMS begins to collect changes as transactions for the ongoing replication phase. After AWS DMS applies all cached changes, tables are transactionally consistent. At this point, AWS DMS moves to the ongoing replication phase, applying changes as transactions.
From: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Look at the headings:
Replication Tasks
Ongoing replication, or change data capture (CDC)
To gain a detailed understanding of how DMS works internally, read through the following blogs from AWS:
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 3)
Finally, work through the blogs particular to your source and target databases at https://aws.amazon.com/blogs/database/category/migration/aws-database-migration-service-migration/
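Those three phases correspond to the task's migration type. As a minimal sketch (boto3, with placeholder ARNs and names), a full-load-and-cdc task with a simple selection rule looks roughly like this:

import json
import boto3

dms = boto3.client("dms")

# Selection rule: include every table in one schema (placeholder schema name).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-myschema",
            "object-locator": {"schema-name": "MYSCHEMA", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# full-load-and-cdc = full load of existing data, apply cached changes,
# then ongoing replication, as described above.
dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-target-full-load-cdc",   # placeholder
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",          # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",          # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",        # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)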
When I first used DMS I had the same question, so I enabled CloudWatch logs and created a migration task from Oracle to Aurora PostgreSQL.
First, the DMS task runs on the replication instance (RI), which connects to the source and target databases.
The RI then connects to the source database and, based on the selection rules, identifies table and column details (it has a lot of privileged access on the source and target DBs).
After that it starts reading the source table(s) in parallel and builds a SELECT col1, col2, col3 ... FROM ... style query to fetch data from the source.
It then writes files to a temp location on the RI, one file per table, with roughly 10,000 rows per commit.
While all this is happening, another process creates a connection to the target DB and checks whether the tables already exist; if they do, it checks which option was selected (Do Nothing, Truncate Table, etc.) and acts accordingly.
At this point we have the source table data in files on the RI, plus connections and tables created on the target DB. The RI then reads the records from its temp files and generates INSERT queries.
Once the last commit succeeds, it deletes the temp file from the RI.
Once the source and target table counts match, it closes the connections (in the case of a one-time load).
For ongoing changes it keeps the connections alive and reads the redo logs (or other logs) on the source DB, then follows the same process described above for CDC.
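Several of the behaviors described above map to documented task settings, such as the commit batch size, the number of tables loaded in parallel, and the table-preparation mode. A hedged sketch of the relevant part of the task settings JSON, applied via modify_replication_task (the ARN is a placeholder and the values shown are the commonly cited defaults):

import json
import boto3

dms = boto3.client("dms")

# Full-load settings corresponding to the behavior described above.
task_settings = {
    "FullLoadSettings": {
        "TargetTablePrepMode": "DO_NOTHING",  # or "TRUNCATE_BEFORE_LOAD", "DROP_AND_CREATE"
        "MaxFullLoadSubTasks": 8,             # tables loaded in parallel
        "CommitRate": 10000,                  # rows per commit during full load
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:...:task:MYTASK",   # placeholder
    ReplicationTaskSettings=json.dumps(task_settings),
)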
Here's a doc that provides some more information on how DMS Ongoing Replication works internally: https://aws.amazon.com/blogs/database/introducing-ongoing-replication-from-amazon-rds-for-sql-server-using-aws-database-migration-service/
The short of it is:
(following some initial steps) AWS DMS does not use any replication artifacts. When all the required information is available in the transaction log or transaction log backup, AWS DMS uses the fn_dblog() and fn_dump_dblog() functions to read changes directly from the transaction logs or transaction log backups using the log sequence number (LSN).
In addition to the above answers, DMS uses Attunity technology underneath, and there are public documents describing how the latter works in detail.

Bulk Update a million rows

Suppose I have a million rows in a table. I want to flip a flag in a column from true to false. How do I do that in Spanner with a single statement?
That is, I want to achieve the following DML statement.
Update mytable set myflag=true where 1=1;
Cloud Spanner doesn't currently support DML, but we are working on a Dataflow connector (Apache Beam) that would allow you to do bulk mutations.
You can use this open source JDBC driver in combination with a standard JDBC tool like for example SQuirreL or SQL Workbench. Have a look here for a short tutorial on how to use the driver with these tools: http://www.googlecloudspanner.com/2017/10/using-standard-database-tools-with.html
The JDBC driver supports both DML- and DDL-statements, so this statement should work out-of-the-box:
Update mytable set myflag=true
DML-statements operating on a large number of rows are supported, but the underlying transaction quotas of Cloud Spanner continue to apply (max 20,000 mutations in one transaction). You can bypass this by setting the AllowExtendedMode=true connection property (see the Wiki-pages of the driver). This breaks a large update into several smaller updates and executes each of these in its own transaction. You can also do this batching yourself by dividing your update statement into several different parts.
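The answer above describes batching through the JDBC driver; as an alternative illustration, here is a rough sketch of doing that batching yourself with the google-cloud-spanner Python client (the instance, database, table, key column, and chunk size are placeholders, and each updated cell counts toward the per-transaction mutation limit mentioned above):

from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")      # placeholder instance ID
database = instance.database("my-database")    # placeholder database ID

# Collect the primary keys of the rows to change.
with database.snapshot() as snapshot:
    keys = [row[0] for row in snapshot.execute_sql(
        "SELECT id FROM mytable WHERE myflag = false"   # placeholder key column
    )]

# Apply the updates in chunks; with two columns written per row, 5,000 rows
# stays comfortably under a 20,000-mutation-per-commit limit.
CHUNK = 5000
for i in range(0, len(keys), CHUNK):
    with database.batch() as batch:
        batch.update(
            table="mytable",
            columns=("id", "myflag"),
            values=[(key, True) for key in keys[i:i + CHUNK]],
        )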