Note: self-answered question, because Google didn't shed any light on the problem.
I have configured a Managed Streaming for Kafka target for AWS Data Migration Service, but the migration job fails. Looking at the logs, I see this:
2021-11-17T18:45:21 kafka_send_record (kafka_records.c:88)
2021-11-17T18:50:21 Message delivery failed with Error:[Local: Message timed out] [1026800] (kafka_records.c:16)
I have verified the following:
Both DMS replication instance and MSK cluster use the same security group, with a "self ingress" rule that allows all traffic, and an egress rule that allows all traffic.
The endpoint connection test succeeds.
I can send a message to the MSK topic using the Kafka console producer from an EC2 instance in the same VPC (and receive this message with the console consumer).
The DMS job succeeds if I change the endpoint to use a self-managed Kafka cluster, running on an EC2 instance in the same VPC.
It turned out that the problem was that I pre-created the topic, with a replication factor of 1, but the default MSK configuration specifies min.insync.replicas of 2, which is applied to all created topics.
When DMS sends a message, it requires acks from all in-sync replicas (I'm inferring this, as it's not open-source). This will never succeed if the minimum number of in-sync replicas exceeds the number of actual replicas.
The Kafka console producer, however, defaults to a single ack. This means that it's not a great verification for MSK cluster usability.
Semi-related: the MSK default default.replication.factor value is 3, which means that you over-replicate for a 2-node MSK cluster.
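For reference, here is a minimal sketch of the fix plus a more faithful verification, using the kafka-python package (my choice for illustration; any client would do) with a placeholder broker address and topic name:

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "b-1.example.kafka.eu-west-1.amazonaws.com:9092"  # placeholder MSK broker address
# (TLS / IAM auth options omitted for brevity)

# Pre-create the topic with a replication factor that satisfies min.insync.replicas=2
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="dms-target-topic", num_partitions=1, replication_factor=3)])

# Produce with acks="all", which (unlike the console producer's default of a single ack)
# fails the same way DMS appears to when min.insync.replicas cannot be satisfied
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, acks="all")
producer.send("dms-target-topic", b"connectivity test").get(timeout=30)
producer.close()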
Related
We have a Kafka cluster in Amazon MSK that has 3 brokers in different availability zones of the same region. I want to set up a Kafka Connect connector that backs up all data from our Kafka brokers to Amazon S3, and I'm trying to do it with MSK Connect.
I set up Confluent's S3 Sink Connector on MSK Connect and it works - everything is uploaded to S3 as expected. The problem is that it costs a fortune in data transfer charges - our AWS bills for MSK nearly double whenever the connector is running, with EU-DataTransfer-Regional-Bytes accounting for the entire increase.
It seems that the connector is pulling messages from all three of our brokers, i.e. from three different AZs, and so we're getting billed for inter-AZ data transfer. This makes sense because by default it will read a partition from that partition's leader, which could be any of the three brokers. But if we were creating a normal consumer, not a connector, it would be possible to restrict the consumer to read all partitions from a specific broker:
"client.rack" : "euw1-az3"
☝️ For a consumer in the euw1-az3 AZ, this setting makes the consumer read all partitions from the local broker, regardless of the partitions' leader - which avoids the need for inter-AZ data transfer and brings the bills down massively.
My question is, is it possible to do something similar for a Kafka Connector? What config setting do I have to pass to the connector, or the worker, to make it only read from one specific broker/AZ? Is this possible with MSK Connect?
Maybe I am missing something about your question. I think you want to have a look at this:
https://docs.confluent.io/platform/current/tutorials/examples/multiregion/docs/multiregion.html
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
I thought it was general knowledge; it applies to any on-premises or cloud deployment.
AWS confirmed to me on a call with support that MSK Connect doesn't currently support rack awareness. I was able to solve my problem by deploying the connector in an EC2 instance (not on MSK Connect) with the connect worker config consumer.client.rack set to the same availability zone that the EC2 instance is running in.
At my firm we have a production MSK cluster running Kafka v2.5.1 (created last September), where the available CloudWatch metrics categories are:
by Broker ID, Cluster Name, Topic
by Broker ID, Cluster Name
by Cluster Name
Now, I created a Dev MSK cluster running, again, 2.5.1, and I see this ADDITIONAL metrics category:
by Cluster Name, Consumer Group, Topic
This new category contains the all-important OFFSET LAG metric.
Why is there such a disparity between the two clusters? Both have the "Basic" monitoring option enabled. Is there a way to "upgrade" the older cluster so that it will include that crucial metric?
Thanks for any clues
I have a Kubernetes cluster that relies on AWS EC2 spot requests.
I sometimes have this failure message from the aws auto-scaling group:
Could not launch Spot Instances. InsufficientInstanceCapacity - There is no Spot capacity available that matches your request. Launching EC2 instance failed.
I know the drawbacks of using spot requests, and that's not why I am here.
I'd like to track this kind of failed activity from my auto-scaling group, but I did not find anything for it inside CloudWatch.
Is there any "legit" way of doing this?
The final aim is to have an alert for when AWS does not have capacity for my instance request(s), so I can act appropriately.
I came across this question when I was looking for the same thing, and now I have found an answer!
You can detect this event by creating a CloudTrail trail that logs management events for your account, then looking for events where EventName = RunInstances and the ErrorCode field is populated.
I have seen this particular event come through as ErrorCode: Server.InsufficientInstanceCapacity.
There are a variety of ways to consume and alert on the CloudTrail logs, including CloudWatch.
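For reference, a minimal boto3 sketch of that lookup (the region is an assumption; point it at wherever your Auto Scaling group launches instances):

import json
import boto3

# Assumed region for illustration
cloudtrail = boto3.client("cloudtrail", region_name="eu-west-1")

# Pull recent RunInstances management events and keep only the ones that failed
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "RunInstances"}],
    MaxResults=50,
)["Events"]

for event in events:
    detail = json.loads(event["CloudTrailEvent"])
    if detail.get("errorCode"):  # e.g. Server.InsufficientInstanceCapacity
        print(detail["eventTime"], detail["errorCode"], detail.get("errorMessage"))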
I have an application running in EC2 that listens on many ports; some external devices connect to those ports to send data to my application. This is fine, but my client has a requirement that I must monitor those ports, and if one of them stops listening, the instance must be terminated and a new one started.
I was reading about CloudWatch, but I didn't find an alarm that I can customize like this (making requests to ports). Is it possible to do this using CloudWatch? I'm looking for a direction to create this monitoring, either using built-in AWS services or by developing a new solution (maybe a shell script).
thanks!
I'm not aware of any AWS provided EC2 healthcheck monitoring system for custom checks.
You could write an AWS Lambda function which sends requests to the ports on the EC2 instance you require. You can then schedule that Lambda to run periodically, at whatever frequency you want, with CloudWatch Events. The Lambda function could publish the result as a custom CloudWatch metric, which you can then use in an alarm and so take action (such as spinning up a replacement instance) when it crosses whatever threshold you deem reasonable.
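To make that concrete, here is a minimal sketch of such a Lambda handler; the host, ports, namespace, and metric name below are all made-up examples:

import socket
import boto3

cloudwatch = boto3.client("cloudwatch")

INSTANCE_HOST = "10.0.0.12"    # hypothetical private IP of the monitored instance
PORTS = [5000, 5001, 5002]     # hypothetical ports the application should be listening on

def handler(event, context):
    for port in PORTS:
        try:
            # Attempt a TCP connection; success means something is listening
            with socket.create_connection((INSTANCE_HOST, port), timeout=3):
                value = 1
        except OSError:
            value = 0
        # Publish 1/0 per port so a CloudWatch alarm can react when a port goes down
        cloudwatch.put_metric_data(
            Namespace="Custom/PortHealth",
            MetricData=[{
                "MetricName": "PortListening",
                "Dimensions": [
                    {"Name": "Host", "Value": INSTANCE_HOST},
                    {"Name": "Port", "Value": str(port)},
                ],
                "Value": value,
                "Unit": "Count",
            }],
        )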
One part of AWS that does have basically what you are looking for built in, though, is ECS. Instead of an EC2 instance, you'd have a Docker container (running on an EC2 instance or Fargate), which can have health checks defined.
There are many ways to do what you are asking for.
Simplest solution: write a boto3/shell script to monitor the port and call the TerminateInstance API (or use the AWS CLI) to terminate the current instance; a rough sketch appears after these options. Needless to say, you need to pass AWS credentials or attach an instance profile with sufficient privileges to terminate the instance.
Using CloudWatch: have a script check the port status and send 1 or 0 (unit: Count) to CloudWatch. Set a threshold in CloudWatch so that consecutive 0s, or NoData, terminates the instance. Alternatively, send nothing to CloudWatch when the port is not available and let the NoData state trigger TerminateInstance. See: CloudWatch - AddingTerminateActions
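For the first option, a rough sketch that would run on the instance itself (the port number is an example, and I'm using the IMDSv1 metadata endpoint for brevity; your instance may require IMDSv2):

import socket
import urllib.request
import boto3

PORT = 5000  # example port to watch

def port_is_listening(port):
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=3):
            return True
    except OSError:
        return False

if not port_is_listening(PORT):
    # Look up our own instance ID from the instance metadata service
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    # Requires credentials (e.g. an instance profile) allowing ec2:TerminateInstances
    boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])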
I'm just looking for a way to start/stop an AWS EC2 instance when CPU utilization increases or decreases on another EC2 instance. I know there is an Auto Scaling service in AWS, but I have a scenario where I can't take advantage of it.
So I'm just asking whether this is possible, or whether anyone can help me with it.
To detail the scenario: suppose I have 2 EC2 instances on my AWS account, named EC21 and EC22. By default, the EC22 instance is stopped.
Now I need to set up CloudWatch (or any other service) so that if CPU utilization on the EC21 instance rises above 70%, the EC22 server is started, and similarly, if it drops below 30%, the EC22 server is stopped.
Please advise!
When your CloudWatch alarm is triggered, it will notify an SNS topic. You can have that SNS topic then invoke a Lambda function, which can then start your EC2 instance.
Create an AWS Lambda function that starts your EC2 instance.
Configure your SNS topic to invoke your Lambda function when it receives messages. You can read about that here: Invoking Lambda functions using Amazon SNS notifications
Finally, ensure your CloudWatch alert sends messages to the SNS topic.
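A minimal sketch of step 1, assuming a hypothetical instance ID and that the function's execution role allows ec2:StartInstances:

import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical ID of the EC22 instance

def handler(event, context):
    # Invoked via SNS when the CloudWatch alarm fires; just start the standby instance
    ec2.start_instances(InstanceIds=[INSTANCE_ID])

The stop side is the mirror image: a second alarm on the low-CPU threshold notifying a topic whose Lambda calls stop_instances instead.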
Yes, this is possible for certain types of EC2 instances. Check this detailed guide, which shows how to set up alarm actions on your EC2 instances based on AWS CloudWatch metrics.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/UsingAlarmActions.html
I think your problem might fit the scenario I'm also trying to solve now: I have some functionality that cannot be handled by Lambdas because of their limited execution time, so I need a relatively short-lived EC2 instance to accomplish the task.
The solution is similar to the one described by Matt, but without SNS, using AWS triggers to launch a Lambda function that starts the instance. An added benefit is that the Lambda function can itself verify whether starting the EC2 instance is really needed.
How do I stop and start Amazon EC2 instances at regular intervals using AWS Lambda?
Issue
I want to reduce my Amazon Elastic Compute Cloud (Amazon EC2) usage by stopping and starting instances at predefined times or utilization thresholds. Can I configure AWS Lambda and Amazon CloudWatch to help me do that automatically?
Short Description
You can use a CloudWatch Event to trigger a Lambda function to start and stop your EC2 instances at scheduled intervals.
Source: AWS Knowledge Center
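For completeness, a rough sketch of the Lambda side, assuming hypothetical instance IDs; each handler would be wired to its own CloudWatch Events (EventBridge) schedule rule, e.g. cron(0 18 ? * MON-FRI *) for the evening stop and a matching morning rule for the start:

import boto3

ec2 = boto3.client("ec2")
INSTANCE_IDS = ["i-0123456789abcdef0"]  # hypothetical instances to manage

def stop_handler(event, context):
    # Triggered by the evening schedule rule
    ec2.stop_instances(InstanceIds=INSTANCE_IDS)

def start_handler(event, context):
    # Triggered by the morning schedule rule
    ec2.start_instances(InstanceIds=INSTANCE_IDS)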