I am new to Kafka. My use case: I have provisioned a 3-node Kafka cluster, and when I produce a message to node1 it is automatically synced to node2 and node3 (meaning I can consume the message from node2 and node3). Now I want all of those messages on another AWS EC2 machine. How can I do that?
You can use Apache Kafka's MirrorMaker, which facilitates multi-datacenter replication; you can use it to copy data between two Kafka clusters.
Data is read from topics in the origin cluster and written to a topic
with the same name in the destination cluster. You can run many such
mirroring processes to increase throughput and for fault-tolerance (if
one process dies, the others will take over the additional load).
The origin and destination clusters are completely independent
entities: they can have different numbers of partitions and the
offsets will not be the same. For this reason the mirror cluster is
not really intended as a fault-tolerance mechanism (as the consumer
position will be different). The MirrorMaker process will, however,
retain and use the message key for partitioning so order is preserved
on a per-key basis.
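For reference, a minimal invocation of the classic MirrorMaker tool that ships with Kafka might look like the sketch below, assuming consumer.properties points bootstrap.servers at your existing 3-node cluster and producer.properties at the destination cluster; the whitelist is a topic regex (".*" here mirrors everything):

```
bin/kafka-mirror-maker.sh \
    --consumer.config consumer.properties \
    --producer.config producer.properties \
    --whitelist ".*"
```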
Another option (which requires licensing) is Confluent Replicator, which also handles topic configuration.
The Confluent Replicator allows you to easily and reliably replicate
topics from one Kafka cluster to another. In addition to copying the
messages, this connector will create topics as needed preserving the
topic configuration in the source cluster. This includes preserving
the number of partitions, the replication factor, and any
configuration overrides specified for individual topics.
Here's a quickstart tutorial that will help you to get started with Confluent Kafka Replicator.
If I understand correctly, the new machine is not a Kafka broker, so mirroring data to it wouldn't work.
it's automatically syncing in both node2 and node3
Only if the replication factor is 3 or more
mean I am consuming the msg in node2 and node3
Only if you have 3 or more partitions would you be consuming from all three nodes, since there is only one leader per partition, and all consume requests are served by the leader
If you just run any consumer process on this new machine, you will get all messages from the existing cluster. If you planned on storing those messages for any particular reason, I would suggest looking into the Kafka Connect S3 connector; then you can query the S3 bucket using Athena, for example
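For the plain-consumer route, here is a minimal sketch with the kafka-python client; the topic name, group id, and broker addresses below are placeholders:

```python
from kafka import KafkaConsumer

# Connect to the existing cluster (placeholder addresses).
consumer = KafkaConsumer(
    "my-topic",                           # hypothetical topic name
    bootstrap_servers=["node1:9092", "node2:9092", "node3:9092"],
    group_id="ec2-copy",                  # a fresh group id has no committed offsets
    auto_offset_reset="earliest",         # so start from the beginning of the topic
)

for message in consumer:
    # Store or forward each record as needed on this machine.
    print(message.topic, message.partition, message.offset, message.value)
```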
Related
We have a system that uses Redis pub/sub to communicate between different parts of the system. To keep it simple, we used pub/sub channels to implement several different features. On both ends (producer and consumer) we have servers containing code that I see no way to convert into Lambda functions.
We are migrating to AWS and, among other changes, we are trying to replace the use of Redis with a managed pub/sub solution. The required solution is fairly simple: a managed broker that allows publishing a message from one node and subscribing for its reception from 0 or more other nodes.
It seems impossible to achieve this with any of the available solutions:
Kinesis - It is a streaming solution for data ingestion (similar to Apache Pulsar)
SNS - From the documentation, it looks like exactly what we need, until we realize that there is no way to connect from a server (not a Lambda) other than via a custom HTTP endpoint.
EventBridge - Same issue as with SNS
SQS - It is a queue, not a pub/sub.
Amazon MQ / RabbitMQ - It is a queue, not pub/sub. It is also not a SaaS solution but rather an installation on a dedicated node.
I see no reason to remove a feature such as subscribing from a server, which is why I was sure it would be present in one or more of the available solutions. But we went through the documentation and attempted to consume from SNS and EventBridge without success. Are we missing something? How can we achieve what we need?
Example
Assume we have an API server layer, deployed on ECS with a load balancer in front. The API has 2 endpoints: a PUT to update a document, and an SSE endpoint to listen for updates on documents.
Assuming a simple round-robin load balancer, an update for document1 may arrive at node1 while a client has an ongoing SSE request for the same document on node2. This can be done with a Redis backbone: node1 publishes on the document1 topic and node2 is subscribed to the same topic. This solution is fast and efficient (in this case at-most-once delivery is perfectly acceptable).
Since this is just an example, we will not consider the WebSocket pub/sub API or other ready-made solutions for this specific use case.
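For context, the pattern being replaced looks roughly like this sketch with redis-py (host, channel, and payload are illustrative):

```python
import redis

r = redis.Redis(host="redis-backbone", port=6379)  # placeholder host

# node1, handling PUT on document1, announces the update:
r.publish("document1", '{"rev": 42}')

# node2, inside its SSE handler, listens for updates on the same channel:
p = r.pubsub()
p.subscribe("document1")
for event in p.listen():                  # blocking generator
    if event["type"] == "message":
        print(event["data"])              # forward this to the SSE client
```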
Lambda
The subscriber side cannot be a Lambda. With two distinct contexts involved (the SSE HTTP request and the SNS event), two distinct Lambdas would fire, with no way to 'stitch' them together.
SNS + SQS
We are hesitant about SQS in conjunction with SNS, as that solution would add a lot of unneeded complexity:
The number of nodes is not known in advance and changes as the service scales, requiring an automated system to increase/reduce the number of SQS queues.
Persistence is not required
Additional latency is introduced
Additional infrastructure cost
HTTP Endpoint
This is the closest thing to a programmatic subscription, but it suffers from issues similar to the SNS-SQS solution:
The number of nodes is unknown, requiring endpoint subscriptions to be added automatically.
Either we expose one endpoint for each node, or we need a particular configuration on the Load Balancer to route each message to the appropriate node.
Additional API endpoints must be exposed, maintained, and secured.
I have a Redis cluster with cluster mode enabled and 3 nodes (1 master and 2 replicas). I have noticed that the CPU percentage of one of the replicas is similar to the master node while that of the other replica remains quite low. So, what is the replication logic at play here? Is it like only one replica is used to replicate data proactively and the other one is used only after the first one fails?
PFA screenshot of the CPU percentage usage over a week
PS: The application connects to the cluster using Configuration Endpoint
As it is stated here,
For Redis (cluster mode disabled) clusters, use the Primary Endpoint for all write operations. Use the Reader Endpoint to evenly split incoming connections to the endpoint between all read replicas. Use the individual Node Endpoints for read operations (in the API/CLI these are referred to as Read Endpoints).
If you use the reader endpoint it will split the load evenly.
As it is stated here
Each read replica maintains a copy of the data from the cluster's primary node. Asynchronous replication mechanisms are used to keep the read replicas synchronized with the primary.
My best guess is that, instead of the reader endpoint, your application reads directly from a single replica. Maybe that replica's endpoint (the one with higher CPU) is hardcoded in the application.
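One way to verify is to point reads explicitly at the reader endpoint and writes at the primary endpoint; here is a sketch with redis-py, where both hostnames are placeholders:

```python
import redis

# Writes go to the primary endpoint (placeholder hostname).
primary = redis.Redis(host="my-cluster.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)

# Reads go to the reader endpoint, which spreads connections across the replicas.
reader = redis.Redis(host="my-cluster-ro.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)

primary.set("key", "value")
print(reader.get("key"))
```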
I have to set up JBoss on AWS EC2 Windows servers, and this will scale up as per requirements. We are using ELK for infrastructure monitoring, for which we will be installing Beats on these servers; the Beats send the data to an on-prem Logstash, where we onboard the servers with their hostname and IP.
Now the problem is: in the case of autoscaling, how can we achieve this onboarding?
Please advise.
You could create one EC2 instance, configure it, and create an AMI of it to use as the base for autoscaling; this way the config can be part of the image.
If by onboarding you mean adding the server to an allowed list, you could use Direct Connect or a VPC with a custom CIDR block defined, and add that subnet to the allowed list in advance.
AFAIK you need to change the Logstash config file on disk to add new hosts, and it should notice the updated config automatically and "just work".
I would suggest a local script that can read/write the config file and that polls an SQS queue "listening" for autoscaling events. You can have your ASG send SNS messages when it scales and then subscribe an SQS queue to receive them. Messages will be retained for up to 14 days and there are options to add delays if required. The message you receive from SQS will indicate the region, instance-id and operation (launched or terminated), from which you can look up the IP address/hostname to add to or remove from the config file (and the message should be deleted from the queue when processed successfully). Editing the config file is just simple string operations to locate the right line and insert the new one. This approach only requires outbound HTTPS access for your local script to work and some IAM permissions, but there is a (probably trivial) cost implication.
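A rough sketch of that local script with boto3, assuming the ASG notification arrives as an SNS envelope inside the SQS message body; the queue URL, region, and the two config-editing helpers are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")   # adjust region
ec2 = boto3.client("ec2", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/asg-events"  # placeholder

def add_host(ip):              # placeholder: insert a line into the Logstash config
    print("add", ip)

def remove_host(instance_id):  # placeholder: remove the matching line
    print("remove", instance_id)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)   # long polling
    for msg in resp.get("Messages", []):
        envelope = json.loads(msg["Body"])           # SNS envelope
        event = json.loads(envelope["Message"])      # ASG notification payload
        if event.get("Event") == "autoscaling:EC2_INSTANCE_LAUNCH":
            res = ec2.describe_instances(InstanceIds=[event["EC2InstanceId"]])
            add_host(res["Reservations"][0]["Instances"][0]["PrivateIpAddress"])
        elif event.get("Event") == "autoscaling:EC2_INSTANCE_TERMINATE":
            remove_host(event["EC2InstanceId"])
        # Delete only after successful processing.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```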
Another option is a UserData script that's executed on each instance at startup (part of the Launch Template of your Auto Scaling group). Exactly how it might communicate with your on-prem system depends on your architecture/capabilities - anything's possible. You could write a simple webservice to manage the config file and have the instances call it, but that's a lot more effort and somewhat risky in my opinion.
FYI - if you use SQS, look at long polling if you're checking the queue frequently/want the message to propagate as quickly as possible (TLDR - faster & cheaper than polling any more than twice a minute). It's good practice to use a dead-letter queue with SQS - messages that get retrieved but not removed from the queue end up there. Set up alarms on the queue and dead-letter queue to alert you via email if there are messages failing to be processed or not getting picked up in a sensible time (i.e. your script has crashed, etc.).
I have a multi-region ECS Fargate setup, running 2 tasks in 1 cluster per region. In total I have 4 tasks: 2 in us-east-1 and 2 in us-west-1.
The purpose of the ECS consumer tasks is to process messages as and when messages are available in SQS.
SQS will be configured in just a single region. The SQS arn will be configured in the container running the tasks.
With this setup, when there are messages in SQS, how does the traffic get distributed across all available ECS tasks in both regions? Is it random? Could someone please clarify?
I am not configuring load balancers for the ECS tasks since there are no external calls; the source is always messages from SQS.
With this setup, when there are messages in SQS, how does the traffic get distributed across all available ECS tasks in both regions? Is it random? Could someone please clarify?
It's not random, but it is arbitrary. Here is what the docs say:
Standard queues provide best-effort ordering which ensures that messages are generally delivered in the same order as they're sent.
The reason it's arbitrary is that SQS queues are distributed across multiple nodes and you have no idea how many nodes there are. So if SQS decides that you need 20 nodes to handle the rate at which messages are added to the queue, and you retrieve 10 messages at a time (the limit), clearly you're going to get messages from some subset of those nodes.
Going into the realm of complete speculation, long polling might improve your chances of getting messages in the order that they were sent, because it is documented to "quer[y] all of the servers for messages." Of course, that could only apply when you can't fill your response from a single server. I would expect it to grab all messages that it can from each server and return as soon as it hits the maximum number of messages, even if it hasn't actually queried all servers.
SQS will be configured in just a single region. The SQS arn will be configured in the container running the tasks.
Beware that you need the queue URL, not its ARN, in order to retrieve messages.
Beware also that -- at least with the Python SDK -- you need to configure your SQS client's region to match the region where the queue exists (even though you pass the URL, which contains the region).
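To make both points concrete, here is a small boto3 sketch (queue name and region are placeholders); the URL is resolved from the queue name, and the client's region must match the queue's region:

```python
import boto3

# The client region must match the region where the queue lives.
sqs = boto3.client("sqs", region_name="us-east-1")

# receive_message needs the queue URL, not the ARN.
queue_url = sqs.get_queue_url(QueueName="my-queue")["QueueUrl"]

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                           WaitTimeSeconds=20)   # long polling
for msg in resp.get("Messages", []):
    print(msg["Body"])
```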
My company has a messaging system which sends real-time messages in JSON format, and it's not built on AWS.
Our team is trying to use AWS SQS to receive these messages, and then have DynamoDB store them.
I'm thinking of using EC2 to read the messages and then save them.
Is there a better solution, or how should I do it? I don't have much experience with this.
First of all, EC2 is cloud infrastructure; it is similar to a physical machine with an OS in a local setup. If you want to create an application that fetches the data (JSON messages) from Amazon SQS and pushes it into DynamoDB (a NoSQL database), your design is correct, as both SQS and DynamoDB have thorough JSON support. Once your application is ready, you deploy it on an EC2 machine.
To achieve this, your application needs an async buffered SQS consumer that will consume the messages. Note that the SQS message size limit is 256 KB, so whichever application is publishing, the message size needs to be less than 256 KB.
Please refer to the link below for an SQS consumer:
is putting sqs-consumer to detect receiveMessage event in sqs scalable
Once you have consumed the message from the SQS queue, you need to save it in DynamoDB, which you can easily do using a CRUD repository. With a repository you can directly save the JSON in a DynamoDB table, but be sure to configure the provisioned write capacity based on expected requests, because the higher the provisioned capacity, the higher the cost. Please refer to the link below for configuring the write capacity of a table.
Dynamodb reading and writing units
In general, you'll have a setup something like this:
The EC2 instances (one or more) will read your queue every few seconds to see if there is anything there. If so, they will write this data to DynamoDB.
Based on what you're saying, you'll have fewer than 1,000,000 reads from SQS in a month, so you can start out on the free tier for that. You can have a single EC2 instance initially, and it can be a very small instance - a t2.micro should be more than sufficient. And you don't need more than a few writes per second on DynamoDB.
The advantage of SQS is that if for some reason your EC2 instance is temporarily unavailable the messages continue to queue up and you won't lose any of them.
From a coding perspective, you don't mention your development environment but there are AWS libraries available for a pretty wide variety of environments. I develop in Java and the code to do this would be maybe 100 lines. I would guess that other languages would be similar. Make sure you look at long polling in the language you're using - it can help to speed up the processing and save you money.
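As a rough sketch of that loop, here in Python with boto3 rather than Java; the queue name, table name, and partition key are assumptions:

```python
import json
from decimal import Decimal

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # adjust region
table = boto3.resource("dynamodb", region_name="us-east-1").Table("messages")  # placeholder table

queue_url = sqs.get_queue_url(QueueName="incoming-messages")["QueueUrl"]  # placeholder queue

while True:
    # Long polling: waits up to 20 seconds for messages, cheaper than tight loops.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # Assumes the body is a JSON object; DynamoDB needs Decimal, not float.
        item = json.loads(msg["Body"], parse_float=Decimal)
        item["id"] = msg["MessageId"]               # assumed partition key
        table.put_item(Item=item)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```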