AWS DMS Ongoing Replication Falling Behind?

We are using AWS DMS for ongoing replication of specific tables from one Oracle RDS database instance to another Oracle RDS database (both 11g).
Intermittently, the replication seems to fall behind or get out of sync. There are no errors in the log and everything is reported as successful, but data is missing.
We can kick off a full refresh and the data will show up, but this isn't a viable option on a regular basis: this is a production system and a full refresh takes upwards of 14 hours.
We would like to monitor whether the destination database is at least mostly up to date, meaning no more than 2-3 hours behind.
I've found that you can get the current SCN from the source database using "SELECT current_scn FROM V$DATABASE", and that the corresponding value on the target should be available in the "awsdms_txn_state" table.
However, that table doesn't exist on the target, and I don't see any option to enable TaskRecoveryTableEnabled when creating or modifying a task.
Is there an existing feature that will automatically monitor these values? Can it be done through Lambda?
If DMS is reporting success, then we have no way of knowing that our data is hours or days behind until someone calls us complaining.
I do see an option in the DMS task to "Enable validation", but intuition tells me that's going to create a significant amount of unwanted overhead.
Thanks in advance.

There are two questions here:
Task Monitoring of CDC Latency
How to set TaskRecoveryTableEnabled
For the first, Task Monitoring provides a number of CloudWatch metrics (see all the CDC* metrics).
These metrics make it possible to see when the target is out of sync with the source, and where in the replication instance's pipeline those changes are. The detailed AWS blog explaining these Task Monitoring metrics is worth reading.
One option is to put a CloudWatch Alarm on the CDCLatencySource.
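For example, a minimal boto3 sketch of such an alarm; the instance/task identifiers, threshold, and periods below are assumptions you would replace with your own:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if source capture latency stays above ~2 hours (the question's 2-3 hour target).
cloudwatch.put_metric_alarm(
    AlarmName="dms-cdc-latency-source-high",
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "my-replication-instance"},
        {"Name": "ReplicationTaskIdentifier", "Value": "my-replication-task"},
    ],
    Statistic="Maximum",
    Period=300,              # evaluate 5-minute datapoints
    EvaluationPeriods=3,     # ... breaching for 15 minutes in a row
    Threshold=7200,          # CDCLatencySource is reported in seconds
    ComparisonOperator="GreaterThanThreshold",
)

The same shape of alarm on CDCLatencyTarget covers the apply side as well.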
Alternatively, you can create your own Lambda on a CloudWatch schedule to run your SCN queries against the source and target and publish a custom CloudWatch metric using PutMetricData. You can then create a CloudWatch Alarm on that metric for when the two fall too far out of sync.
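A rough sketch of that Lambda in Python, assuming an Oracle client (cx_Oracle / python-oracledb) is packaged with the function; the connection settings, the target-side query, and the metric name are placeholders, not values from the question:

import os
import boto3
import cx_Oracle  # assumed to be bundled with the Lambda (python-oracledb also works)

cloudwatch = boto3.client("cloudwatch")

def fetch_single_value(dsn, user, password, query):
    # Run a one-row, one-column query and return its value.
    with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchone()[0]

def handler(event, context):
    # Source SCN, exactly as in the question.
    source_scn = fetch_single_value(
        os.environ["SOURCE_DSN"], os.environ["SOURCE_USER"], os.environ["SOURCE_PASS"],
        "SELECT current_scn FROM V$DATABASE")

    # Target-side SCN: whatever query reads the last applied source SCN for your task
    # (e.g. from awsdms_txn_state once the recovery table exists); kept as a placeholder.
    target_scn = fetch_single_value(
        os.environ["TARGET_DSN"], os.environ["TARGET_USER"], os.environ["TARGET_PASS"],
        os.environ["TARGET_SCN_QUERY"])

    # Publish the gap as a custom metric; a CloudWatch Alarm on it closes the loop.
    cloudwatch.put_metric_data(
        Namespace="Custom/DMS",
        MetricData=[{
            "MetricName": "ScnGap",
            "Value": float(source_scn - target_scn),
            "Unit": "None",
        }],
    )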
For the second question, to set TaskRecoveryTableEnabled via the console, tick the option "Create recovery table on target DB".
After ticking this you can confirm that TaskRecoveryTableEnabled is set to Yes by looking at the Overview tab of the task. At the bottom there is the Task Settings JSON, which will contain something like:
"TargetMetadata": {
"TargetSchema": "",
"SupportLobs": true,
"FullLobMode": false,
"LobChunkSize": 0,
"LimitedSizeLobMode": true,
"LobMaxSize": 32,
"InlineLobMaxSize": 0,
"LoadMaxFileSize": 0,
"ParallelLoadThreads": 0,
"ParallelLoadBufferSize": 0,
"BatchApplyEnabled": false,
"TaskRecoveryTableEnabled": true
}

Related

AWS ASG replacement strategy

I have successfully created an ASG with a rolling update, which seems to work. I have, however, a rather unique use case. I would like to have an update strategy where I run both in parallel (EC2_old and EC2_new). Meaning, I want to make sure the new one is up and running during a test session of 15-30 min. During these 15-30 min I also want the deployment process to continue and not get stuck waiting for this transition to complete. In a way I'm looking for a blue/green deployment strategy, and I don't know if it is even possible.
I did some reading and came across the WillReplace update policy. This could do the trick, but the cfn options seem rather limited. Has anyone implemented an update strategy of this complexity?
Current policy looks like this:
updatePolicy = {
    autoScalingRollingUpdate: {
        maxBatchSize: 1,
        minInstancesInService: 1,
        pauseTime: "PT1H",
        waitOnResourceSignals: true,
        suspendProcesses: [
            "HealthCheck",
            "ReplaceUnhealthy",
            "AZRebalance",
            "ScheduledActions",
            "AlarmNotification"
        ]
    }
};
WillReplace won't give you a blue/green strategy. It does create a new ASG, but it swaps the target group over to the new ASG's instances as soon as they are all healthy. If you search for "AWS blue/green deployment" you should find a Quick Start that goes over how to set up what you are looking for.
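For reference, a minimal sketch of what the replacing variant looks like, with the caveat above that it is still a swap (shown as a plain dict in the same shape as the rolling policy in the question; how you attach it depends on your tooling):

# Hypothetical sketch: with willReplace, CloudFormation stands up a brand-new ASG,
# waits for it to become healthy, then retires the old one. It is a cut-over,
# not a long-lived blue/green pair you can keep testing against.
update_policy = {
    "autoScalingReplacingUpdate": {
        "willReplace": True
    }
}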

Created a user-defined metric that generates an alert every time a firewall is created/modified/deleted, but the alert automatically recovers

I have created a user-defined metric that generates an alert every time a firewall is created/modified/deleted, and I have configured the alert accordingly.
The alert gets triggered and an incident is generated, but after some time the alert is automatically cleared with an "Alert recovered" email. I don't want the alert to be cleared automatically; it should stay open for the ops team to investigate and acknowledge.
What is missing in my configuration?
You can aggregate events over a period of time, e.g. a day, but then the alert will only trigger on the first occurrence.
The best solution, I think, is to edit the alerting policy and uncheck "Notify on incident resolution".
Incidents will be resolved, but the ops team can still check them via the link in the e-mail.
If you think that function should be available, you can file a Feature Request at Google Public Issue Tracker.

Google Cloud alerting condition on missing infrequent event

I'm trying to create an alert condition that triggers if an infrequent event (e.g. a cron job running once a week) does not occur.
The metric is log-based. I've had success with smaller windows by using the alignment period, but there is a limitation where the alignment period cannot be longer than 1 day:
Alignment periods longer than 86400 seconds are not supported.
(Not working) sample of what I'm trying to do:
- conditionThreshold:
    aggregations:
      - alignmentPeriod: 604800s # 1 week NOT possible
        perSeriesAligner: ALIGN_SUM
    comparison: COMPARISON_LT
    thresholdValue: 1.0
    duration: 0s
    filter: metric.type="logging.googleapis.com/user/my_infrequent_event_count"
    trigger:
      count: 1
  displayName: Infrequent event did not occur
Any ideas on how this could be done?
Currently this is not possible to accomplish, as the duration can't exceed 24 hours.
As a workaround, you might find Cloud Monitoring metric export useful for long-term metrics analysis. Please also refer to this doc.
I found this public thread, which might be helpful too.

Google Dataflow and Pubsub - can not achieve exactly-once delivery

I'm trying to achieve exactly-once delivery using Google Dataflow and PubSub using Apache Beam SDK 2.6.0.
Use case is quite simple:
'Generator' dataflow job sends 1M messages to PubSub topic.
GenerateSequence
    .from(0)
    .to(1000000)
    .withRate(100000, Duration.standardSeconds(1L));
'Archive' dataflow job reads messages from PubSub subscription and saves to Google Cloud Storage.
pipeline
    .apply("Read events",
        PubsubIO.readMessagesWithAttributes()
            // this is to achieve exactly-once delivery
            .withIdAttribute(ATTRIBUTE_ID)
            .fromSubscription("subscription")
            .withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
    .apply("Window events",
        Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .withAllowedLateness(Duration.standardMinutes(15))
            .discardingFiredPanes())
    .apply("Events count metric", ParDo.of(new CountMessagesMetric()))
    .apply("Write files to archive",
        FileIO.<String, Dto>writeDynamic()
            .by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
            .via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
            .to(archiveDir)
            .withTempDirectory(archiveDir)
            .withNumShards(options.getNumShards())
            .withNaming(dataSource ->
                new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
            ));
I added 'withIdAttribute' to both PubsubIO.Write ('Generator' job) and PubsubIO.Read ('Archive' job) and expect that it will guarantee exactly-once semantics.
I would like to test the 'negative' scenario:
'Generator' dataflow job sends 1M messages to PubSub topic.
'Archive' dataflow job starts to work, but I stop it in the middle of processing by clicking 'Stop job' -> 'Drain'. Some portion of the messages has been processed and saved to Cloud Storage, let's say 400K messages.
I start the 'Archive' job again and expect that it will pick up the unprocessed messages (600K) and that eventually I will see exactly 1M messages saved to Storage.
What I got in fact: all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates, somewhere in the neighborhood of 30-50K per 1M messages.
Is there any solution to achieve exactly-once delivery?
Dataflow does not enable you to persist state across runs. If you use Java you can update a running pipeline in a way that does not cause it to lose the existing state, allowing you to deduplicate across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way where they are keyed by ATTRIBUTE_ID, e.g., Spanner or GCS using this as the file name.
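For example, a small sketch of the GCS variant (the bucket name and prefix are made up): if the object name is derived from ATTRIBUTE_ID, a redelivered duplicate simply overwrites the object it already wrote.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-archive-bucket")  # assumed bucket name

def archive_message(attribute_id: str, payload: bytes) -> None:
    # Naming the object after the message id makes the write idempotent:
    # a duplicate delivery overwrites the same object instead of adding a new one.
    bucket.blob(f"archive/{attribute_id}").upload_from_string(payload)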
So, I've never done it myself, but reasoning about your problem, this is how I would approach it...
My solution is a bit convoluted, but I failed to identify any other way to achieve this without involving other external services. So, here goes nothing.
You could have your pipeline reading both from Pub/Sub and GCS and then combine them to de-duplicate the data. The tricky part here is that one would be a bounded PCollection (GCS) and the other an unbounded one (Pub/Sub). You can add timestamps to the bounded collection and then window the data. During this stage you could potentially drop GCS data older than ~15 minutes (the duration of the window in your previous implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough not to create duplicates) are by far the trickiest parts.
Once this has been solved, flatten the two PCollections together and then use a GroupByKey on an id that is common to both sets of data. This will yield a PCollection<KV<Long, Iterable<YOUR_DATUM_TYPE>>>. Then you can use an additional DoFn that drops all but the first element in the resulting Iterable and also removes the KV<> boxing. From there on you can simply continue processing the data as you normally would.
Finally, this additional work should be necessary only for the first Pub/Sub window when restarting the pipeline. After that you should re-assign the GCS PCollection to an empty PCollection so the group by key doesn't do too much additional work.
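Roughly, the flatten / group-by-id / keep-first step could look like this; it is sketched with the Beam Python SDK (the pipeline above is Java, but the shape is the same), and the collection names are illustrative:

import apache_beam as beam

def keep_first(element):
    # element is (id, iterable of events): keep one event per id and drop the KV boxing.
    _, events = element
    return next(iter(events))

# pubsub_events and gcs_events are assumed to be PCollections of (id, event) pairs,
# timestamped and windowed consistently as described above.
deduped = (
    (pubsub_events, gcs_events)
    | "Flatten sources" >> beam.Flatten()
    | "Group by id" >> beam.GroupByKey()
    | "Keep first per id" >> beam.Map(keep_first)
)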
Let me know what you think and if this could work. Also, if you decide to pursue this strategy, please post your mileage :).
In the meantime, Pub/Sub has added support for exactly-once delivery.
It is currently in a pre-GA launch state, so unfortunately it is not ready for production use yet.

Why playing with AWS DynamoDb "Hello world" produces read/write alarms?

I've started to play with DynamoDB and I've created a "dynamo-test" table with a hash PK on userid and a couple more attributes (age, name). Read and write capacity is set to 5. I use Lambda and API Gateway with Node.js. Then I manually performed several API calls through API Gateway using a payload similar to:
{
    "userId" : "222",
    "name" : "Test",
    "age" : 34
}
I've tried to insert the same item a couple of times (which didn't produce an error but silently succeeded). Also, I used the DynamoDB console and browsed the inserted items several times (currently there are only 2). I haven't tracked exactly how many times I did those actions, but it was all done manually. And then, after an hour, I noticed 2 alarms in CloudWatch:
INSUFFICIENT_DATA
dynamo-test-ReadCapacityUnitsLimit-BasicAlarm
ConsumedReadCapacityUnits >= 240 for 12 minutes
No notifications
And there is a similar alarm with "...WriteCapacityLimit...". The write capacity alarm became OK after 2 minutes, but then went back again after 10 minutes. Anyway, I'm still reading and learning how to plan and monitor these capacities, but this hello world example scared me a bit, as if I'd exceeded my table's capacity :) Please point me in the right direction if I'm missing some fundamental part!
It's just an "INSUFFICIENT_DATA" message. It means that your table hasn't had any reads or writes in a while, so there is insufficient data available for the CloudWatch metric. This happens with the CloudWatch alarms for any DynamoDB table that isn't used very often. Nothing to worry about.
EDIT: You can now change a setting in CloudWatch alarms to ignore missing data, which will leave the alarm at its previous state instead of changing it to the "INSUFFICIENT_DATA" state.
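For example, with boto3 the setting is the TreatMissingData parameter when you (re)create the alarm; the statistic and period below are assumptions, while the rest mirrors the alarm from the question:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Recreate the read-capacity alarm so that missing data keeps the previous state
# instead of flipping the alarm to INSUFFICIENT_DATA.
cloudwatch.put_metric_alarm(
    AlarmName="dynamo-test-ReadCapacityUnitsLimit-BasicAlarm",
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "dynamo-test"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=12,     # "for 12 minutes", as in the original alarm
    Threshold=240.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="ignore",   # keep the last known state when no data arrives
)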