AWS MSK (Kafka) metrics issue

At my firm we have a production MSK cluster running Kafka v2.5.1 (created last September), where the available CloudWatch metrics categories are:
by Broker ID, Cluster Name, Topic
by Broker ID, Cluster Name
by Cluster Name
Now, I created a Dev MSK cluster running, again, 2.5.1, and I see this ADDITIONAL metrics category:
by Cluster Name, Consumer Group, Topic
This new category contains the all-important OFFSET LAG metric.
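For reference, the available dimension sets can be listed per cluster with a minimal boto3 sketch (the region and cluster names below are placeholders); each distinct set of dimension names corresponds to one "metrics category" in the CloudWatch console:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region

for cluster in ("prod-cluster", "dev-cluster"):  # placeholder cluster names
    dimension_sets = set()
    for page in cloudwatch.get_paginator("list_metrics").paginate(
        Namespace="AWS/Kafka",
        Dimensions=[{"Name": "Cluster Name", "Value": cluster}],
    ):
        for metric in page["Metrics"]:
            dimension_sets.add(tuple(sorted(d["Name"] for d in metric["Dimensions"])))
    print(cluster, sorted(dimension_sets))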
Why is there such a disparity between the two clusters? Both have the "Basic" monitoring option enabled. Is there a way to "upgrade" the older cluster so that it will include that crucial metric?
Thanks for any clues

Related

AWS Resource Usage Data - CPU, Memory and Disk

I am trying to build an analytics dashboard using the below metrics/KPIs for all our EC2 instances:
Total CPU vs CPUUtilized
Total RAM vs RAMUtilized
Total EBS Volume vs EBSUtilized.
For example, if I have launched an EC2 instance with 4 vCPUs, 16 GiB RAM and a 50 GB SSD, I would like to see the above KPIs as a time-series trend. I have no clue where to get this data from EC2. I tried the EC2 instance metrics through CloudWatch using the boto3 client, but did not get the above metrics. I would like to know:
Where can I find the data for the above metrics?
I need the above metrics data in S3 on a daily basis.
Similarly, is there a way to get the same metrics for AWS RDS and AWS EKS clusters?
Thanks!
The Amazon EC2 service collects information about the virtual machine (instance) and sends it to Amazon CloudWatch.
See: List the available CloudWatch metrics for your instances - Amazon Elastic Compute Cloud
Note that it only collects metrics that can be observed from the virtual machine itself -- CPU Utilization, network traffic and Amazon EBS traffic. The EC2 service cannot see what is happening 'inside' the instance, since it is the Operating System that controls memory and manages the contents of the disks.
If you wish to collect metrics from the Operating System, then you would need to Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch agent - Amazon CloudWatch. This agent runs in the instance and sends metrics out to CloudWatch.
You can write code that calls the CloudWatch Metrics APIs to retrieve metrics. Note that the metrics returned are calculated over a time period (e.g. average CPU utilization over a 5-minute period); it is not possible to retrieve the actual raw datapoints.
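To illustrate, a minimal boto3 sketch (the instance ID is a placeholder) that retrieves average CPUUtilization in 5-minute buckets; the resulting time series could then be written to S3 on a schedule:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Aggregated datapoints only: each value is the average over a 300-second
# period, not a raw sample. Memory metrics published by the CloudWatch
# agent live in the "CWAgent" namespace and can be fetched the same way.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))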
See also:
Monitoring Amazon RDS metrics with Amazon CloudWatch - Amazon Relational Database Service
Amazon EKS and Kubernetes Container Insights metrics - Amazon CloudWatch

How to Autoscale a GCP Managed Instance Group using a RabbitMQ VM outside the group

I am using GCP and I have a specific problem to solve where I want to use the metrics from a RabbitMQ instance to control the autoscaling requirements of a Managed Instance Group. Do note that this RabbitMQ instance is outside this group and is used only to maintain the messages in the queue.
I want to scale up the number of instances in the group when the number of current messages in the queue exceeds the number of available consumers. I had implemented the same in AWS, using Amazon MQ integrated with RabbitMQ to autoscale an ECS cluster of instances.
I have installed the Ops Agent on the RabbitMQ instance so that I can monitor the queue stats in a dashboard, but I am not sure how these metrics can be used to scale the group, as nothing on the MIG configuration page seems to expose them.
My question is: is it possible to scale the instances in a MIG using the metrics of an external instance, as in my case? I ask because the GCP documentation seems to say that autoscaling can only use metrics from the instances within the group.
If not, I would like to understand other ways I can implement the same by perhaps monitoring a consumer-based metric.
Custom metrics can be used to trigger autoscaling. This document clearly outlines how to configure custom metrics to trigger autoscaling of MIG instances. The configuration involves three simple steps:
1. Create a custom metric for the RabbitMQ queue in Cloud Monitoring (a sketch of this step follows below).
2. Create a service account and grant it sufficient permissions to perform autoscaling actions.
3. Create a trigger that uses these custom metrics to scale the managed instance group.
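For step 1, a minimal sketch of publishing the queue depth as a custom metric (assumes the google-cloud-monitoring and pika libraries; the project ID, queue name, metric type, and instance labels are all placeholders):

import time

import pika
from google.cloud import monitoring_v3

PROJECT_ID = "my-gcp-project"  # placeholder
QUEUE_NAME = "work-queue"      # placeholder

def queue_depth() -> int:
    # Passive declare returns the queue's current message count
    # without modifying the queue.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    try:
        q = conn.channel().queue_declare(queue=QUEUE_NAME, passive=True)
        return q.method.message_count
    finally:
        conn.close()

client = monitoring_v3.MetricServiceClient()
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/rabbitmq/queue_depth"  # placeholder metric
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = "1234567890"  # the RabbitMQ VM (placeholder)
series.resource.labels["zone"] = "europe-west1-b"     # placeholder

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
series.points = [
    monitoring_v3.Point({"interval": interval, "value": {"int64_value": queue_depth()}})
]
client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])

Run this on a schedule (e.g. every minute); the MIG autoscaler can then target the custom metric for its scaling decisions.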

Can Kafka Connect be made rack aware so that my connector reads all partitions from one broker?

We have a Kafka cluster in Amazon MSK that has 3 brokers in different availability zones of the same region. I want to set up a Kafka Connect connector that backs up all data from our Kafka brokers to Amazon S3, and I'm trying to do it with MSK Connect.
I set up Confluent's S3 Sink Connector on MSK Connect and it works - everything is uploaded to S3 as expected. The problem is that it costs a fortune in data transfer charges - our AWS bills for MSK nearly double whenever the connector is running, with EU-DataTransfer-Regional-Bytes accounting for the entire increase.
It seems that the connector is pulling messages from all three of our brokers, i.e. from three different AZs, and so we're getting billed for inter-AZ data transfer. This makes sense because by default it will read a partition from that partition's leader, which could be any of the three brokers. But if we were creating a normal consumer, not a connector, it would be possible to restrict the consumer to read all partitions from a specific broker:
"client.rack" : "euw1-az3"
☝️ For a consumer in the euw1-az3 AZ, this setting makes the consumer read all partitions from the local broker, regardless of the partitions' leader - which avoids the need for inter-AZ data transfer and brings the bills down massively.
My question is, is it possible to do something similar for a Kafka Connector? What config setting do I have to pass to the connector, or the worker, to make it only read from one specific broker/AZ? Is this possible with MSK Connect?
Maybe I am missing something about your question. I think you want to have a look at this:
https://docs.confluent.io/platform/current/tutorials/examples/multiregion/docs/multiregion.html
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
I thought it was general knowledge; it applies to any on-premises or cloud deployment.
AWS confirmed to me on a call with support that MSK Connect doesn't currently support rack awareness. I was able to solve my problem by deploying the connector in an EC2 instance (not on MSK Connect) with the connect worker config consumer.client.rack set to the same availability zone that the EC2 instance is running in.
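For anyone going the same route, a minimal sketch of the idea using the confluent-kafka Python client (brokers, group ID, topic, and AZ ID are placeholders); on a self-managed Connect worker the equivalent is the consumer.client.rack property mentioned above:

from confluent_kafka import Consumer

# client.rack makes the consumer fetch all partitions from a replica in
# its own "rack" (here, AZ), provided the brokers use a rack-aware
# replica selector such as RackAwareReplicaSelector.
consumer = Consumer({
    "bootstrap.servers": "b-1.mycluster.kafka.eu-west-1.amazonaws.com:9092",  # placeholder
    "group.id": "s3-backup",    # placeholder
    "client.rack": "euw1-az3",  # the AZ this client runs in
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my-topic"])  # placeholder

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print("consumer error:", msg.error())
        continue
    print(msg.topic(), msg.partition(), msg.offset())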

AWS DMS unable to write to MSK target

Note: self-answered question, because Google didn't shed any light on the problem.
I have configured a Managed Streaming for Kafka target for AWS Data Migration Service, but the migration job fails. Looking at the logs, I see this:
2021-11-17T18:45:21 kafka_send_record (kafka_records.c:88)
2021-11-17T18:50:21 Message delivery failed with Error:[Local: Message timed out] [1026800] (kafka_records.c:16)
I have verified the following:
Both DMS replication instance and MSK cluster use the same security group, with a "self ingress" rule that allows all traffic, and an egress rule that allows all traffic.
The endpoint connection test succeeds.
I can send a message to the MSK topic using the Kafka console producer from an EC2 instance in the same VPC (and receive this message with the console consumer).
The DMS job succeeds if I change the endpoint to use a self-managed Kafka cluster, running on an EC2 instance in the same VPC.
It turned out that the problem was that I pre-created the topic, with a replication factor of 1, but the default MSK configuration specifies min.insync.replicas of 2, which is applied to all created topics.
When DMS sends a message, it requires acks from all in-sync replicas (I'm inferring this, as it's not open-source). This will never succeed if the minimum number of in-sync replicas exceeds the number of actual replicas.
The Kafka console producer, however, defaults to a single ack. This means that it's not a great verification for MSK cluster usability.
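A better smoke test is to produce with acks=all, which matches what DMS appears to require. A minimal sketch with the confluent-kafka Python client (broker address and topic name are placeholders):

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "b-1.mycluster.kafka.eu-west-1.amazonaws.com:9092",  # placeholder
    "acks": "all",  # wait for all in-sync replicas, unlike the console producer
})

def on_delivery(err, msg):
    # With replication factor 1 and min.insync.replicas=2, this reports
    # NOT_ENOUGH_REPLICAS instead of silently succeeding.
    print("delivery failed:" if err else "delivered to:", err or msg.topic())

producer.produce("my-dms-topic", b"smoke test", callback=on_delivery)  # placeholder topic
producer.flush(10)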
Semi-related: the default value of MSK's default.replication.factor setting is 3, which means that you over-replicate on a 2-node MSK cluster.
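As for the fix itself, one option is to recreate the topic with a replication factor that satisfies min.insync.replicas. A minimal sketch with confluent-kafka's admin client (broker address, topic name, and partition count are placeholders):

from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({
    "bootstrap.servers": "b-1.mycluster.kafka.eu-west-1.amazonaws.com:9092",  # placeholder
})

# Replication factor 3 satisfies the cluster-wide min.insync.replicas=2.
futures = admin.create_topics(
    [NewTopic("my-dms-topic", num_partitions=3, replication_factor=3)]  # placeholder
)
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print("created", topic)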

Mesos - dynamic cluster size

Is it possible to have a dynamic cluster size in Mesos, with total cluster CPU and RAM quotas set?
The idea: Mesos knows my AWS credentials and spawns new EC2 instances only if there is a new job that cannot fit into the existing resources (AWS or another cloud provider). Similarly, when the job is finished, it could kill the EC2 instance.
It can be a Mesos plugin/framework or some external tool; any help is appreciated.
Thanks
What we do is use the Mesos monitoring tools and HTTP endpoints (http://mesos.apache.org/documentation/latest/endpoints/) to monitor the cluster.
We have our own framework that gets all the relevant information from the master and slave nodes, and our algorithm uses that information to scale the cluster.
For example, if the cluster CPU utilization is > 0.90, we bring up a new instance and register that slave with the master.
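As an illustration of that loop, a sketch in Python (the master URL, AMI, instance type, and 0.90 threshold are all placeholders) that polls the master's /metrics/snapshot endpoint and boots one more EC2 agent when CPU allocation crosses 90%:

import time

import boto3
import requests

MESOS_MASTER = "http://mesos-master.example.com:5050"  # placeholder
ec2 = boto3.client("ec2")

while True:
    snapshot = requests.get(f"{MESOS_MASTER}/metrics/snapshot", timeout=5).json()
    cpus_used = snapshot["master/cpus_used"]
    cpus_total = snapshot["master/cpus_total"]
    if cpus_total and cpus_used / cpus_total > 0.90:
        # Boot one more agent from a prebuilt AMI; its startup config must
        # point it at the master so it registers itself (details omitted).
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder agent AMI
            InstanceType="m5.large",          # placeholder
            MinCount=1,
            MaxCount=1,
        )
    time.sleep(60)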
If I understand you correctly you are looking for a solution to autoscale your Mesos cluster?
What some people will do on AWS for example is to create an autoscaling group allowing them to scale up and down the number of agents/slave nodes depending on their needs.
Note that the triggers for when to scale up/down are usually application-dependent (e.g., it could be OK for one app to be at 100% utilization, while for others 80% should already trigger a scale-up action).
For an example of using the AWS auto scaling groups you could have a look at Mesosphere DCOS Community edition (note as mentioned above you will still have to write the trigger code for scaling your scaling group).
AFAIK, Mesos cannot autoscale itself; it needs someone to start Mesos agents for the cluster. One option is to build a script, managed by Marathon, that starts/stops agents after comparing the pending tasks in your framework with the Mesos cluster's capacity.