How AWS DMS works internally - amazon-web-services

How does migration happen internally in AWS DMS? Does it export the entire data set from the source table and import it into the destination table, or does it migrate table records one by one? I am new to AWS DMS and don't have much idea of how things work there.

AWS publishes how DMS works in its documentation and blog posts. This is the list I wish I had when I started with DMS:
For a high level understanding see: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.html
A task can consist of three major phases:
The full load of existing data
The application of cached changes
Ongoing replication
During a full load migration, where existing data from the source is moved to the target, AWS DMS loads data from tables on the source data store to tables on the target data store. While the full load is in progress, any changes made to the tables being loaded are cached on the replication server; these are the cached changes.
...
When the full load for a given table is complete, AWS DMS immediately begins to apply the cached changes for that table. When all tables have been loaded, AWS DMS begins to collect changes as transactions for the ongoing replication phase. After AWS DMS applies all cached changes, tables are transactionally consistent. At this point, AWS DMS moves to the ongoing replication phase, applying changes as transactions.
From: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Look at the headings:
Replication Tasks
Ongoing replication, or change data capture (CDC)
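The three phases quoted above can be illustrated with a toy model. This is only a sketch of the ordering (load, then cached changes, then ongoing changes) for a single table held in memory; the function names, the dict-based rows and the (op, row) change tuples are all assumptions made for illustration and say nothing about how DMS is implemented.

```python
from collections import deque


def run_task(source_rows, changes_during_load, changes_after_load):
    """Toy model of a full-load-plus-CDC task for a single table.

    Rows are dicts keyed by "id"; a change is an (op, row) tuple where
    op is "insert", "update" or "delete".
    """
    target = {}
    cached = deque()

    # Phase 1: full load. Changes arriving meanwhile are cached, not applied.
    for row in source_rows:
        target[row["id"]] = dict(row)
    cached.extend(changes_during_load)

    # Phase 2: apply the cached changes in the order they were captured.
    while cached:
        apply_change(target, cached.popleft())

    # Phase 3: ongoing replication applies each new change as it arrives.
    for change in changes_after_load:
        apply_change(target, change)

    return target


def apply_change(target, change):
    op, row = change
    if op == "delete":
        target.pop(row["id"], None)
    else:  # insert and update both overwrite by primary key
        target[row["id"]] = dict(row)
```

Calling run_task with two loaded rows, an update and an insert cached during the load, and a delete afterwards leaves the target with the updated row and the inserted row only.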
To gain a detailed understanding of how DMS works internally, read through the following blogs from AWS:
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong? (Part 3)
Finally, work through the blogs particular to your source and target databases at https://aws.amazon.com/blogs/database/category/migration/aws-database-migration-service-migration/

When I first used DMS I had the same question, so I enabled CloudWatch logs and created a migration task from Oracle to Aurora PostgreSQL.
First, the DMS task runs on the replication instance (RI), which connects to the source and target databases.
The RI connects to the source database and, based on the selection rules, identifies the tables and their column details; it can do this because it has broad access to both the source and target databases.
It then reads the source tables in parallel, issuing queries of the form SELECT col1, col2, col3, ... FROM <table> to fetch the data.
It writes the fetched rows to files in a temporary location on the RI, one file per table, with approximately 10,000 rows per commit.
While all this is happening, another process connects to the target database and checks whether the tables already exist. If they do, it acts according to the option selected for the task (Do nothing, Truncate table, and so on).
At this point the source data is in files on the RI, and the connection and tables exist on the target database. The RI now reads the records from its temporary files and builds INSERT statements.
Once the last commit succeeds, it deletes the temporary file from the RI.
For a one-time load, it closes the connections once the source and target row counts match.
For ongoing changes, it keeps the connections alive and reads the redo logs (or the equivalent logs) on the source database, then follows the same process for CDC.
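The unload, spool and batch-load steps described above can be sketched roughly as follows. This is a minimal illustration using two SQLite connections in place of Oracle and Aurora; the function name full_load, the CSV spool format and the way batches are committed are assumptions for the sketch, not what the replication instance actually runs.

```python
import csv
import os
import sqlite3
import tempfile

BATCH_SIZE = 10000  # rows per commit, mirroring the figure quoted above


def full_load(source, target, table, columns, batch_size=BATCH_SIZE):
    """Copy one table from source to target via a temp file; return commit count."""
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)

    # Unload: SELECT the mapped columns and spool them to one temp file.
    fd, path = tempfile.mkstemp(suffix=".csv", prefix=table + "_")
    with os.fdopen(fd, "w", newline="") as spool:
        writer = csv.writer(spool)
        for row in source.execute(f"SELECT {col_list} FROM {table}"):
            writer.writerow(row)

    # Load: read the file back and insert in batches, one commit per batch.
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
    commits = 0
    with open(path, newline="") as spool:
        batch = []
        for row in csv.reader(spool):
            batch.append(row)
            if len(batch) == batch_size:
                target.executemany(insert, batch)
                target.commit()
                commits += 1
                batch = []
        if batch:  # final partial batch
            target.executemany(insert, batch)
            target.commit()
            commits += 1

    os.remove(path)  # the temp file goes away once the last commit succeeded
    return commits
```

With 25 source rows and batch_size=10 the function commits three times (10, 10 and 5 rows). Table and column names are interpolated directly into the SQL here, so this sketch is only suitable for trusted identifiers.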

Here's a doc that provides some more information on how DMS Ongoing Replication works internally: https://aws.amazon.com/blogs/database/introducing-ongoing-replication-from-amazon-rds-for-sql-server-using-aws-database-migration-service/
The short of it is:
(following some initial steps) AWS DMS does not use any replication artifacts. When all the required information is available in the transaction log or transaction log backup, AWS DMS uses the fn_dblog() and fn_dump_dblog() functions to read changes directly from the transaction logs or transaction log backups using the log sequence number (LSN).

In addition to the above answers: DMS uses Attunity underneath. There are public documents on how the latter works in detail.

Related

Getting incremental data from Amazon Aurora to Redshift via DMS using CDC

My company wants to build a data warehouse in Redshift. We have an OLTP database running in Amazon Aurora and we are thinking of using DMS (AWS Database Migration Service). I am trying to get my head around the capabilities of CDC (change data capture). CDC over DMS replicates and stores changes (in our case in Redshift), and I was wondering whether it is possible to select the specific columns I want to store (this should be possible with table mapping - include) and also the columns whose changes should cause a row to be stored. As far as I understand it, if any column of a row is updated, replication is triggered, which could mean a useless replication (e.g. if somebody updates a column that I do not want to follow).
E.g. I have a table of leads with some 30 columns. For DW purposes I am interested in only 5 of them, and I want a new row in the Redshift table only if one of those 5 columns is updated: if the stage of a lead changes, I get a new row. On the other hand, I am not interested in the column 'Salesmans_comment', so if the salesman updates a comment, I do not want a new row, because I am not interested in it. Cheers!
I have gone through most of the available YouTube tutorials and read through the documentation, but I haven't found a clear answer.
Thanks
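For the column-selection half of the question, a table mapping can be generated along these lines. This is a hedged sketch: the rule keys (rule-type, rule-action, object-locator, remove-column) follow the table-mapping JSON format as I understand it from the DMS documentation and should be checked against the current docs; the helper name build_table_mappings and the schema name "crm" are made up for illustration. It does not answer whether an update that touches only a removed column still produces a change on the target.

```python
import json


def build_table_mappings(schema, table, unwanted_columns):
    """Return a table-mapping JSON string: include one table, drop some columns."""
    rules = [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-" + table,
        "object-locator": {"schema-name": schema, "table-name": table},
        "rule-action": "include",
    }]
    # One transformation rule per column that should not reach the target.
    for rule_id, column in enumerate(unwanted_columns, start=2):
        rules.append({
            "rule-type": "transformation",
            "rule-id": str(rule_id),
            "rule-name": "drop-" + column,
            "rule-target": "column",
            "object-locator": {
                "schema-name": schema,
                "table-name": table,
                "column-name": column,
            },
            "rule-action": "remove-column",
        })
    return json.dumps({"rules": rules}, indent=2)
```

For example, build_table_mappings("crm", "leads", ["Salesmans_comment"]) yields one selection rule for the leads table and one transformation rule that removes the comment column.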

Row level changes captured via AWS DMS

I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn’t track changes and you would need to have audit columns to do this at the user level. You may be able to deduce this from Redshift query history and save data input files but this will be solution dependent. Query history can be achieved in a couple of ways but both require some action. The first is to review the query logs but these are only saved for a few days. If you need to look back further than this you need a process to save these tables so the information isn’t lost. The other is to turn on Redshift logging to S3 but this would need to be turned on before you run queries on Redshift. There may be some logging from DMS that could be helpful but I think the bottom line answer is that row level change tracking is not something that is on in Redshift by default.

Data Fusion replication pipeline is not syncing data in Google Bigquery

Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We adopted the method described in the Google docs, using a Data Fusion replication pipeline:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Brief of what we are doing:
Enabling the binary log in MySQL for CDC (change data capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We successfully created the MySQL database on Compute Engine, enabled the binary log for CDC, and granted the MySQL user all the permissions the replication pipeline needs.
We successfully created a Data Fusion instance and a replication pipeline.
The replication pipeline is able to fetch our MySQL database details, and the BigQuery target is also set.
On starting, the pipeline tracks the changes (insert, update and delete) successfully, and the table schema is created automatically in BigQuery.
The problem is that no data is transferred to the BigQuery table. What I see in the log is "loading batch of 1 event into staging bucket".
Sharing the screenshots also:
able to fetch every change from MySQL, but data is not transferring to BigQuery
table schema was created, but data is not transferred
"loading batch of 1 event into staging bucket"; we are using developer mode and waited for more than 90 minutes
The issue might be caused by a schema/data type mismatch between the columns of the BigQuery table and those of the source MySQL table.
For example, a column may be of INT64 type with a length of 19 in BigQuery while it is of Integer type with a length of 10 in the source table; in that case you need to update the column length to fit your data size.

How can i see metadata, lineage of data stored in AWS redshift?

I am using solutions like Cloudera Navigator, Atlas and WhereHows to get Hadoop, HDFS, Hive, Sqoop and MapReduce metadata and lineage.
We now have a data warehouse in AWS Redshift as well. Is there a way to extract metadata, lineage, or both out of Redshift?
So far I have not found anything on this.
Is there a way to integrate it into WhereHows as a crawled source?
I found only one post, which describes how to get some information from Redshift on the assumption that it is similar to PostgreSQL. I am sure someone has written an open source solution to this problem.
Or is it just a matter of writing a simple script to extract this information?
I am looking for an enterprise-level solution. I hope someone will point me in the right direction.
AWS Glue Data Catalog is a fully managed metadata management service. It has the AWS Glue crawler, which automatically crawls your source (for you, Redshift) and creates a centralized metadata repository that can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But to get true data lineage, you need to understand how queries relate to your workflows, e.g. an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, rather than looking at each query individually.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars
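The query tagging mentioned in the answer above can be done by hand with a leading SQL comment. The sketch below is one possible convention, not the commercial product's format: the function name tag_query and the dag=/task= layout are invented for illustration, and it assumes the comment is kept in the query text you later read back from the system tables.

```python
def tag_query(sql, dag_id, task_id):
    """Prefix a SQL statement with a comment naming the workflow that issued it."""
    # Strip comment terminators so a value cannot close the tag early.
    safe = {name: str(value).replace("*/", "")
            for name, value in (("dag", dag_id), ("task", task_id))}
    tag = "/* dag={dag} task={task} */".format(**safe)
    return tag + "\n" + sql.lstrip()
```

A workflow step would then run tag_query("SELECT ...", "daily_load", "stage_leads") instead of the bare statement, so the stored query text can be grouped by DAG and task later on.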

Loading data (incrementally) into Amazon Redshift, S3 vs DynamoDB vs Insert

I have a web app that needs to send reports on its usage, and I want to use Amazon Redshift as a data warehouse for that purpose.
How should I collect the data?
Every time a user interacts with my app, I want to report that. So when should I write the files to S3, and how many?
What I mean is:
- If I do not send the info immediately, I might lose it because of a lost connection, or because of some bug in my system while it is being collected and prepared for sending to S3.
- If I do write files to S3 on each user interaction, I will end up with hundreds of files (each holding minimal data) that need to be managed, sorted, and deleted after being copied to Redshift. That does not seem like a good solution.
What am I missing? Should I use DynamoDB instead? Should I use simple INSERTs into Redshift instead?
If I do need to write the data to DynamoDB, should I delete the holding table after it has been copied? What are the best practices?
In any case, what are the best practices to avoid data duplication in Redshift?
Appreciate the help!
It is preferred to aggregate event logs before ingesting them into Amazon Redshift.
The benefits are:
You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.
You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This will also improve your load performance and reduce the need to VACUUM your tables.
You can accumulate your events in several places before aggregating and loading them into Redshift:
Local file to S3 - the most common way is to aggregate your logs on the client/server and upload them to S3 every x MB or y minutes. Many log appenders support this functionality (for example, FluentD or Log4J), so you don't need to modify your code; it can be done with container configuration only. The downside is that you risk losing some logs, because these local log files can be deleted before the upload.
DynamoDB - as @Swami described, DynamoDB is a very good way to accumulate the events.
Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load them into Redshift later, pre-sorted. The events are stored in Kinesis for 24 hours, and you can schedule reading from Kinesis and loading into Redshift every hour, for example, for better performance.
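The "every x MB or y minutes" rule from the first option can be sketched as a small buffer that hands a batch to an upload callback when either limit is reached. The class name EventBuffer, its parameters and the default limits are assumptions for illustration; the flush callback stands in for whatever actually writes the batch to S3.

```python
import time


class EventBuffer:
    """Collect log lines and flush them as one batch by size or by age."""

    def __init__(self, flush, max_bytes=5 * 1024 * 1024, max_age_seconds=300,
                 clock=time.monotonic):
        self._flush = flush            # callable that receives the batch text
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the age rule can be tested
        self._lines = []
        self._size = 0
        self._started = None

    def add(self, line):
        if self._started is None:
            self._started = self._clock()
        self._lines.append(line)
        self._size += len(line.encode("utf-8")) + 1  # +1 for the newline
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._lines:
            self._flush("\n".join(self._lines) + "\n")
        self._lines, self._size, self._started = [], 0, None
```

Note that the age check only runs when a new line arrives, so a quiet period needs an explicit flush() call (for example on shutdown or from a timer) to avoid holding events indefinitely.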
Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).
See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490
Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.
Regarding the de-duplication requirement: in most of the options above you can do that in the aggregation phase. For example, when you are reading from a Kinesis stream, you can check that there are no duplicates in your events by analysing a large buffer of events before putting them into the data store.
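A buffer-level check of this kind can be as small as the following sketch. It assumes every event is a dict carrying a unique identifier under the key "event_id"; that key name is hypothetical, and duplicates that fall into different buffers are not caught.

```python
def deduplicate(events, key="event_id"):
    """Return events in their original order, keeping the first of each key."""
    seen = set()
    unique = []
    for event in events:
        identity = event[key]
        if identity not in seen:
            seen.add(identity)
            unique.append(event)
    return unique
```

Keeping the first occurrence preserves the insertion order of the stream, which matters if the batch is later loaded pre-sorted.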
However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging table and then SELECT INTO a well organized and sorted table.
Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, if the majority of your queries run on a single day (the last day, for example), you can create a set of tables with a similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... into each of these tables. When you want to query the data from multiple days, you can use UNION ALL.
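Querying across several daily tables then comes down to generating the table names and joining the SELECTs. The sketch below assumes the suffix in the example names is month, day, year (MMDDYYYY) and that all daily tables share the same columns; the helper names are made up for illustration.

```python
from datetime import date, timedelta


def daily_table(day, prefix="events"):
    """Name of the table holding one day of events, e.g. events_01022014."""
    return f"{prefix}_{day:%m%d%Y}"


def union_all_query(start, end, columns="*", prefix="events"):
    """Build one SELECT per day from start to end (inclusive), joined by UNION ALL."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (start + timedelta(days=n) for n in range((end - start).days + 1))
    selects = [f"SELECT {columns} FROM {daily_table(d, prefix)}" for d in days]
    return "\nUNION ALL\n".join(selects)
```

For instance, union_all_query(date(2014, 1, 1), date(2014, 1, 3)) produces three SELECT statements over events_01012014, events_01022014 and events_01032014.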
One option to consider is to create time series tables in DynamoDB where you create a table every day or week in DynamoDB to write every user interaction. At the end of the time period (day, hour or week), you can copy the logs on to Redshift.
For more details, on DynamoDB time series table see this pattern: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
and this blog:
http://aws.typepad.com/aws/2012/09/optimizing-provisioned-throughput-in-amazon-dynamodb.html
For Redshift DynamoDB copy: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html
Hope this helps.
Though there is already an accepted answer here, AWS has since launched a new service called Kinesis Firehose, which handles the aggregation according to user-defined intervals, a temporary upload to S3 and the load (COPY) into Redshift, retries and error handling, throughput management, etc.
This is probably the easiest and most reliable way to do so.
You can write data to a CSV file on local disk and then run a Python/boto/psycopg2 script to load the data into Amazon Redshift.
In my CSV_Loader_For_Redshift I do just that:
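Compress and load data to S3 using boto Python module and multipart upload.
    import boto
    from boto.s3.key import Key

    # Connect with explicit credentials and upload the file to the target key.
    conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    bucket = conn.get_bucket(bucket_name)
    k = Key(bucket)
    k.key = s3_key_name
    # cb/num_cb report upload progress; reduced_redundancy picks the storage class.
    k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                             reduced_redundancy=use_rr)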
Use psycopg2 COPY command to append data to Redshift table.
    sql = """
    copy %s from '%s'
    CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
    DELIMITER '%s'
    FORMAT CSV %s
    %s
    %s
    %s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
              opt.delim, quote, gzip, timeformat, ignoreheader)
Just being a little selfish here and describing exactly what Snowplow, an event analytics platform, does. They use this awesome, unique way of collecting event logs from the client and aggregating them on S3.
They use CloudFront for this. What you can do is host a pixel in an S3 bucket and put that bucket behind a CloudFront distribution as an origin. Enable logging to an S3 bucket for that same CloudFront distribution.
You can send logs as URL parameters whenever you call that pixel from your client (similar to Google Analytics). These logs can then be enriched and added to a Redshift database using COPY.
This takes care of aggregating the logs; the setup handles all of that for you.
You can also look into Piwik, an open source analytics service, and see if you can adapt it to your needs.
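Turning the pixel's URL parameters back into an event is mostly a matter of parsing the query string recorded in each log line. The sketch below assumes the query string has already been extracted from the log line, and the parameter names in the example (e, page, uid) are invented; it is not Snowplow's enrichment code.

```python
from urllib.parse import parse_qs


def event_from_query(query_string):
    """Decode a pixel request's query string into a flat event dict."""
    params = parse_qs(query_string, keep_blank_values=True)
    # Keep the last value when a parameter is repeated.
    return {name: values[-1] for name, values in params.items()}
```

For example, event_from_query("e=pv&page=Home%20page&uid=42") returns a dict with the decoded page title and the other two parameters as strings, ready to be written out as a row for COPY.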