Scaling ActiveMQ on AWS

First of all, my knowledge of ActiveMQ, AMQPS and AWS auto scaling is fairly limited, and I have been handed a task where I need to create a scalable broker architecture for messaging over AMQPS using ActiveMQ.
In my current architecture I have a single-machine ActiveMQ broker where messaging happens over AMQP + SSL, and because the product requires it there is publisher/subscriber authentication (mutual TLS) to ensure the right parties are talking to each other. That part is working fine.
Now the problem is that I need to scale the broker on AWS with auto-scaling in mind. Without auto-scaling, I assume I can create a master/slave architecture using EC2 instances, but then adding more slaves would be a manual process rather than an automatic one.
I want to understand whether the two options below can solve the problem:
ELB + ActiveMQ nodes being auto scaled
Something like a Bitnami-powered ActiveMQ AMI running with auto scaling enabled.
In the first case, where an ELB sits in front, I understand that the ELB terminates SSL, which will break my mutual authentication. I am also not sure whether my pub/sub model will still work when the ActiveMQ instances run independently with no shared DB as such. If it can, a pointer or some reference material would help, as I have not been able to find any myself.
In the second case, my concern is how multiple ActiveMQ instances will coordinate with each other and ensure that each of them has access to the data held in the queues.
The questions may be basic, but any pointers would be helpful.
AJ

Related

Load balance Postfix within AWS without SES?

I'm working on a project for a client who does message pre- and post-processing at very high volumes. I'm trying to figure out a reliable configuration between two or three API servers that will push outgoing email messages to any of two or more instances of Postfix. This is for outbound only, and it's preferred not to have a single point of failure.
I am a bit lost within the AWS ecosystem. All I know is that we cannot use SES, and the client is set up for high-volume SMTP with Amazon, so throttling is not an issue.
I've looked into ELB, HAProxy, and a few other things but the whole thing has now gotten muddy and I'm not sure if I'm just overthinking it now.
Any quick thoughts would be appreciated.
Thanks

Difference between worker and service in the context of infrastructure primitives

I am building infrastructure primitives to support workers and http services.
workers are standalone
http services have a web server and a load balancer
The way I understand it, a worker generally pulls from an external resource to consume tasks, while a service handles inbound requests and talks to upstream services.
Celery is an obvious worker and a web app is an obvious service. The lines can get blurry though and I'm not sure what the best approach is:
Is the worker/service primitive a good idea?
What if there's a service that consumes tasks like a worker but also handles some http requests to add tasks? Is this a worker or a service?
What about services that don't go through nginx, does that mean a third "network" primitive with an NLB is the way to go?
What about instances of a stateful service that a master service connects to? The master has to know the individual agent instances, so we cannot hide them behind a LB. How would you go about representing that?
Is the worker/service primitive a good idea?
IMO, the main difference between a service and a worker is that a worker should be designated to only one kind of task, while a service can perform multiple tasks. A service can utilize a worker, or a chain of workers, to process a user request.
What if there's a service that consumes tasks like a worker but also handles some http requests to add tasks?
Services can take different forms, such as a web service, FTP service, SNMP service or processing service. Writing the processing logic inside the service may not be a good idea unless it takes the form of a worker.
What about services that don't go through nginx, does that mean a third "network" primitive with an NLB is the way to go?
I believe you are assuming a service to be HTTP-based only, but as I mentioned above, services can be of different types. Yes, you may write a TCP service for a particular protocol implementation and attach it behind an NLB.
What about instances of a stateful service that a hub service connects to? The hub has to know the individual instances so we cannot hide them behind a LB.
Not sure what you mean by hub here, but good practice for a scalable architecture is to keep the servers/services behind the load balancer stateless. Session state should not be stored in the service's memory but serialized to a data store like DynamoDB.
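As a minimal sketch of what "serialize the session state to a data store like DynamoDB" could look like with boto3 (the table name, key schema and field names below are illustrative assumptions, not something prescribed by the question):

```python
import boto3

# Hypothetical table with partition key "session_id" (illustrative names).
dynamodb = boto3.resource("dynamodb")
sessions = dynamodb.Table("app-sessions")

def save_session(session_id: str, state: dict) -> None:
    # Persist session state outside the service process so any instance
    # behind the load balancer can serve the next request.
    sessions.put_item(Item={"session_id": session_id, **state})

def load_session(session_id: str) -> dict:
    resp = sessions.get_item(Key={"session_id": session_id})
    return resp.get("Item", {})
```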
One way to see the difference is to look at their names: workers do what their name says - they perform (typically) heavy tasks (work) - something you do not want your service to be bothered with, especially if it is a microservice. For this particular reason, you will rarely see 32-core machines with hundreds of GB of RAM running services, but you will very often see them running workers. Finally, they complement each other well: services off-load heavy tasks to the workers. This aligns with the well-known UNIX philosophy of "do one thing, and do it well".
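To make the "services off-load heavy tasks to the workers" split concrete, here is a hedged sketch using Celery (mentioned in the question) together with Flask; the broker URL and endpoint name are assumptions:

```python
# Minimal sketch: the heavy work is a Celery task (run by worker processes),
# while the Flask service only enqueues it and returns immediately.
from celery import Celery
from flask import Flask, request

celery_app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker URL

@celery_app.task
def process_payload(payload: dict) -> None:
    """Heavy processing; executed on worker machines, not in the web service."""
    ...

api = Flask(__name__)

@api.post("/tasks")
def submit_task():
    # The service stays thin: accept, enqueue, respond.
    process_payload.delay(request.get_json())
    return {"status": "queued"}, 202
```

In practice the worker and the web service run as separate processes (a `celery worker` and a WSGI server), often on differently sized machines, but the division of labour is the same.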

Launching and shutting down instances: is this suited to AWS ECS or Kubernetes?

I am trying to create a certain kind of networking infrastructure and have been looking at Amazon ECS and Kubernetes. However, I am not quite sure if these systems do what I am actually seeking, or if I am contorting them into something else. If I describe my task at hand, could someone please verify whether Amazon ECS or Kubernetes will actually aid me in this effort, and whether this is the right way to think about it?
What I am trying to do is on-demand, single-task processing on an AWS instance. What I mean by this is that I have a resource-heavy application which I want to run in the cloud to process a chunk of data submitted by a user. I want to submit this data to the application, have an EC2 instance spin up, process the data, upload the results to S3, and then shut down the EC2 instance.
I have already put together a functioning solution for this using Simple Queue Service, EC2 and Lambda. But I am wondering: would ECS or Kubernetes make this simpler? I have been going through the ECS documentation and it seems like it is not very concerned with starting up and shutting down instances. It seems like it wants to have an instance that is constantly running, and Docker images are then fed to it as tasks to run. Can Amazon ECS be configured so that if there are no tasks running it automatically shuts down all instances?
Also, I am not understanding exactly how I would submit a specific chunk of data to be processed. It seems like "Tasks" as defined in Amazon ECS really correspond to a single Docker container, not so much to what data that Docker container will process. Is that correct? So would I still need to feed the data-to-be-processed into the instances via Simple Queue Service or something similar, and then use Lambda to poll those queues to see if tasks should be submitted to ECS?
This is my naive understanding of this right now, if anyone could help me understand the things I've described better, or point me to better ways of thinking about this it would be appreciated.
This is a complex subject and many details for a good answer depend on the exact requirements of your domain / system. So the following information is based on the very high level description you gave.
A lot of the features of ECS, kubernetes etc. are geared towards allowing a distributed application that acts as a single service and is horizontally scalable, upgradeable and maintainable. This means it helps with unifying service interfacing, load balancing, service reliability, zero-downtime maintenance, scaling the number of worker nodes up/down based on demand (or other metrics), etc.
The following describes a high level idea for a solution for your use case with kubernetes (which is a bit more versatile than AWS ECS).
So for your use case you could set up a kubernetes cluster that runs a distributed event queue, for example an Apache Pulsar cluster, as well as an application cluster that is being sent queue events for processing. Your application cluster size could scale automatically with the number of unprocessed events in the queue (custom pod autoscaler). The cluster infrastructure would be configured to scale automatically based on the number of scheduled pods (pods reserve capacity on the infrastructure).
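For illustration only, here is a very rough sketch of the "custom pod autoscaler" idea: poll the Pulsar backlog and resize the application Deployment accordingly. The admin URL, topic, subscription and deployment names are assumptions, and a real setup would more likely use a purpose-built autoscaler (e.g. a custom-metrics adapter) rather than a hand-rolled loop like this:

```python
import time
import requests
from kubernetes import client, config

PULSAR_ADMIN = "http://pulsar-broker:8080"            # assumed admin endpoint
TOPIC_STATS = f"{PULSAR_ADMIN}/admin/v2/persistent/public/default/jobs/stats"
DEPLOYMENT, NAMESPACE = "processor", "default"        # assumed names

config.load_incluster_config()                        # run as a pod in the cluster
apps = client.AppsV1Api()

while True:
    stats = requests.get(TOPIC_STATS, timeout=5).json()
    # Backlog = events published but not yet acknowledged by our subscription.
    backlog = stats["subscriptions"]["processor-sub"]["msgBacklog"]
    replicas = min(10, max(1, backlog // 100))        # ~1 pod per 100 pending events
    apps.patch_namespaced_deployment_scale(
        DEPLOYMENT, NAMESPACE, {"spec": {"replicas": replicas}}
    )
    time.sleep(30)
```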
You would have to make sure your application can run in a stateless form in a container.
The main benefit I see over your current solution would be cloud provider independence as well as some general benefits from running a containerized system: 1. not having to worry about the exact setup of your EC2-Instances in terms of operating system dependencies of your workload. 2. being able to address the processing application as a single service. 3. Potentially increased reliability, for example in case of errors.
Regarding your exact questions:
Can Amazon ECS be configured so that if there are no tasks running it automatically shuts down all instances?
The keyword here is autoscaling. Note that there are two levels of scaling: 1. infrastructure scaling (the number of EC2 instances) and 2. application service scaling (the number of application containers/tasks deployed). ECS infrastructure scaling works based on EC2 Auto Scaling groups; for more info see this link. For application service scaling and serverless ECS (Fargate) see this link.
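As a hedged sketch of the application-level scaling (the desired task count) using Application Auto Scaling via boto3; the cluster, service and capacity numbers are placeholders, and infrastructure scaling itself would still be handled by the EC2 Auto Scaling group (or sidestepped entirely with Fargate):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/my-cluster/my-processing-service"   # placeholder names

# Allow the ECS service to run anywhere between 0 and 10 tasks.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=0,
    MaxCapacity=10,
)

# Target tracking: ECS adds/removes tasks to keep average CPU near 60%.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```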
Also, I am not understanding exactly how I would submit a specific chunk of data to be processed. It seems like "Tasks" as defined in Amazon ECS really correspond to a single Docker container, not so much to what data that Docker container will process. Is that correct?
A "Task Definition" in ECS is describing how one or multiple docker containers can be deployed for a purpose and what its environment / limits should be. A task is a single instance that is run in a "Service" which itself can deploy a single or multiple tasks. Similar concepts are Pod and Service/Deployment in kubernetes.
So would I still need to feed the data-to-be-processed into the instances via Simple Queue Service or something similar, and then use Lambda to poll those queues to see if tasks should be submitted to ECS?
A queue is always helpful for decoupling service requests from processing and for making sure you don't lose requests. It is not required if your application service cluster can offer a service interface and process incoming requests directly in a reliable fashion, but if your application cluster has to scale up/down frequently, that may impact its ability to process reliably.
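If you do keep a queue, the processing container can simply poll SQS itself, which is what decouples submission from processing. A minimal sketch, assuming a queue URL and message format of my own invention:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/processing-jobs"  # placeholder

def process(job: dict) -> None:
    ...  # your existing processing code

def worker_loop() -> None:
    """Run inside the ECS task: pull one job at a time from the queue."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,       # long polling to avoid busy-waiting
        )
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            # Delete only after successful processing so failures are retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```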

What are best practices for health checks?

We have a REST API. Right now our /health endpoint runs a smoke test against every dependency we have (a database and a couple of microservices) and returns 200 if there are no errors.
The problem is that not all dependencies are mandatory for our application to work. While a problem accessing the database is critical, problems accessing some of the microservices only affect a small portion of our app.
On top of that we have Amazon ELB. It doesn't seem right to mark our app as unhealthy only because one dependency is unhealthy; ELB should only try to recover the unhealthy dependency, and with that our app would be healthy again.
Which leads to the question: what should we actually check in our health check? Because it looks like we shouldn't be checking any dependency at all. On the other hand, it's actually really helpful to know the status of our app's access to all its dependencies (e.g. for troubleshooting problems), so is it common to use some other endpoint for that purpose (say /sanity or /diagnostics)?
Do not go overboard trying to check every service, every dependency, etc. in your health check. Think of your health check as a Go / No Go test so that the load balancer knows whether the service is running.
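A minimal sketch of that split, assuming a Flask app: /health stays a Go / No Go check for the ELB, while a separate /diagnostics endpoint reports per-dependency status for troubleshooting (the dependency names and helper functions are illustrative):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    ...  # e.g. run "SELECT 1" with a short timeout

def check_downstream_service() -> bool:
    ...  # e.g. GET the downstream /health with a short timeout

@app.get("/health")
def health():
    # Go / No Go: only report whether this process can serve traffic.
    return jsonify(status="ok"), 200

@app.get("/diagnostics")
def diagnostics():
    # Richer view for humans; never wired to the ELB health check.
    return jsonify({
        "database": check_database(),
        "downstream_service": check_downstream_service(),
    }), 200
```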
Load balancers will not recover failed instances; they will just take your service out of rotation. Auto Scaling Groups can recover failed instances by creating new instances and terminating failed ones. CloudWatch can monitor your instances, report problems and trigger events (e.g. rebooting).
You can implement more comprehensive tests that run internally on your server and that choose a reporting / recovery path. Examples might include sending an SNS notification to your email or cell phone, rebooting the server, etc.
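For example, a hedged boto3 sketch of the SNS notification path (the topic ARN is a placeholder):

```python
import boto3

sns = boto3.client("sns")
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # placeholder

def report_degraded(dependency: str, detail: str) -> None:
    # Internal monitoring path: notify operators instead of failing the ELB check.
    sns.publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject=f"[degraded] {dependency}",
        Message=detail,
    )
```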
Amazon has a number of services to help monitor, report and manage services. Look into CloudWatch for monitoring, SNS or SES for reporting, ASG for auto scaling, etc.
Think through what type of fault tolerance, high availability and recovery strategy you need for your service. Then implement an approach that is simple enough that the monitoring itself does not become a point of failure.

High-availability for Restcomm

I'm planning a high-availability set-up with autoscaling for RestComm and have some general doubts about the best way to plan it.
This is what I have now:
Restcomm instances running on Amazon ECS (Docker), so we can launch more instances very easily.
All of them share the Amazon RDS database.
Workspace is shared and persisted between instances.
To move to the next step, I have some questions:
Amazon's load balancer isn't an option because it doesn't support UDP, so I'm considering the Telestax LB. Is that correct? Is it possible to deploy it using Docker?
Move the Restcomm Media Server (MS) outside of the Restcomm Docker image so it can scale independently. Restcomm provides environment variables to specify the MS, so I would have a LB and several MSs behind it. Correct?
How much RAM does a Restcomm instance need, and how many concurrent sessions does it support? How can we find out how many concurrent sessions there are, in real time and programmatically?
Is there an "automatic scaling" mechanism implemented in RestComm? More info would be appreciated. Ubuntu Juju isn't an option for me.
We are considering Graylog2 or Logstash for log management. Any insight here? How do you install the agent in the Docker images?
The only documentation I found was this very good document: https://docs.google.com/document/d/13xlaioF065pDnQUoZgfIpi6Noh0qHfAZ7U6afcPd2Y0/edit
Is there any other documentation?
Thanks in advance!
Very good questions:
Yes. See, https://hub.docker.com/r/restcomm/load-balancer/
You would have one LB (better to have two in an active/passive pair to avoid a single point of failure) with X Restcomm instances behind it, speaking to Z Media Servers behind them.
It depends on the complexity of the application on top, but here are some numbers: https://github.com/RestComm/Restcomm-Connect/wiki/Load-Testing-on-Docker
Not yet. You can potentially use Mesos or Kubernetes if Juju is not an option. We have a set of open issues for Kubernetes right now, but Mesos should be working.
You can check https://hub.docker.com/r/restcomm/graylog-restcomm/; it contains a Docker image pre-loaded with everything needed to poll a Restcomm server for gathering metrics.