How to monitor whether a remote process has crashed? - amazon-web-services

I have a large number of instances across multiple cloud providers. Each of them is running a single Java program. I want to check that all of these Java programs are running and haven't crashed, and if/when one of them crashes, I want to be alerted about it.
At the moment I have a hacked-together solution that I run from my local computer, which will loop through an array of all the IP addresses, and send a command through SSH to each of them to check ps -ef and count the number of Java processes are running. If that number is zero then I will popup something on my screen to alert me.
Is there a better solution? Ideally I could use a Zabbix-style tool to handle it for me but I don't know if anything exists that serves this need.

If you have large number of applications running in the cloud then you may like to consider Cloud monitoring tools instead of re-inventing the wheel. I am sure you would like to monitor more than just the process up/down status. There are plenty of cloud monitoring tools, which allow you to monitor both the platform(machines) and the processes. Also, the different types of notifications can be configured depending on the need.
I would suggest that you look into Cloud monitoring solutions such as New Relic/Datadog/Pager Duty/etc. If it is commercially viable then I would highly recommend you to use them.

You can have all of your services write a status metric to CloudWatch Metrics and create an alarm when any do not report a status. This example shows using CloudWatch metrics to report on linux performance counters.

Related

Need help building an uptime dashboard for a distributed system

I have a product for which I would like to create a dashboard to show
its availability/uptime over time and display any outages.
Specifically I am looking for
ability to report historical information on service uptime
provide details on any service outages
The product is running on a fleet of linux servers and connects to a DB running
on a separate instance, also we have some dedicated instances that run nightly
batch jobs. My system also relies on some external services to provide
additional functionality for select customers. There is redis cache also for
caching data for multiple customers.
We replicate all the above setup (application servers, DB, jobs servers, redis
cache etc) into dedicated clusters for large customers. Small customers are put
on one of the shared clusters to keep costs low.
Currently we are running health checks on application servers only and providing
that information in a simple HTML page. This is a go to page for end-users/customers
and support teams.
Since the product is constructed using multiple systems/services our current HTML
page often times says that the system is up and running fine while can be experiencing
issues with some of its components or external services.
Current health check is using a simple HTTP request and looks for a 200
status code, this check runs every minute and we plot this data into a simple
chart to show last 30 days. We also show a list of outages with timestamp and
additional static information that is manually added.
We would like to build a more robust solution that monitors much more than the HTTP port
and where we have more details like what part
of the system is having issues and how those issues are impacting the system and
which customers are impacted.
Appreciate any guidance or help. We prefer to build the solution using
open source tools since we dont have much budget. Goal is to improve things for
my team members who are already overloaded.
I'm not sure if this will be overkill or not for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some components or at least some ideas from there:
What is the ELK Stack?
The Complete Guide to the ELK Stack

How do I handle instant spikes in traffic in my APIs

I am using aws for my cloud infrastructure. I use ecs fargate as my compute machine. I am currently maintaining 10-20 apis which interact with members who have my application downloaded on their phone. Obviously one or two of these apis are my "main" apis and these are the ones which are really personalised to my users and honestly, these are the only two apis which members really access (by navigating to those screens).
My business team wants to send push notifications to members to alert them on certain new events which lands them on a screen where these APIs need to be called. Due to this, my application has mini crashes during this time period.
I've thought of a couple of ideas for the same, but since this is obviously an issue across industries and a solved problem, I wanted the standard solutions.
The ideas I have:
Sending notifications in batches. This seems like the best solution though it requires a bit of effort though I'm not sure how much.
Have a serverless machine run my requests (aws lambda functions) for those APIs which need to scale instantly. I have a lot of other APIs which I keep in fargate because I don't want my lambda function to be too heavy and then take a while to start up.
Scale machines all the time to handle the load I get during push notifications. This seems suboptimal due to cost reasons.
Scale machines up just during those periods where I want to send push notifications and them scale them back down. This seems like a decent solution if I can automate the entire process. I can have a flow which I follow for each push notification which will cause the system to scale and then start sending the notifications.
Is there a better way to do this. This seems like a relatively straightforward problem for people to have, but I don't see too much information on this topic.
I like your second option best because it's by far the easiest to manage (because you don't have to manage it). After that I'd go with your last option. I would use step functions to manage this, where the first step is to scale up the number of instances in Fargate. Once that has reached the desired level you would send the notifications. Add autoscaling to your services in Fargate to have it handle coming down automatically.

Is there a way to tell who started an instance in Google Cloud Platform?

We run only a small handful of instances on Google Cloud Platform and we don't run them all the time. Generally we just fire one up, do what we need to do then shut it down... which is great, except when "we" forget to shut them down.
I've been able to track down the relevant REST APIs and the gcloud sdk but I don't see anything that says who started the instance. Actually it also doesn't have a timestamp on when it was started.
I did find this python app engine script that I might be able to rewrite to stop the instances after X amount of time, but I'd rather find a way to notify the user who started it and let them know the instance is still running.
Has anyone tried to do something similar or seen a way to get the "starter" of the instance in GCP?
You can look into the Audit Logs to determine who did what, where, and when. Further, you can use the Stackdriver Logging API method entries.list to retrieve audit log entries for your use case.
Also you can choose use the Activity Logs to know the details such as the authorized user who made the API request.
With the new API you have to filter on the following:
resource.type="gce_instance"
resource.labels.instance_id="ID"
protoPayload.methodName="v1.compute.instances.start"

Correct architecture for a reliant Windows EC2 background worker on AWS

Basically we save cached data on Redis and we want to dump it into MongoDB every X seconds.
We have a sorted set stored on Redis, saving each user's last activity as score, and we want to periodically dump user's final state after being inactive for a certain period of time, and we wish to make sure that:
We don't strain our API servers (That's why it has to run on a worker instance.
The data dump operation is very critical - We require these worker instances to be scalable and highly resilient to failure (and should handle failure gracefully).
We must ensure that if we have X machines, that the data would be spread across instances, and that every item we pull from Redis will be handled exactly once.
I was wondering what would be the best architectural approach to deploy EC2 Windows instances that periodically handle data.
I was thinking of using Elastic Beanstalk as it's easy to deploy, scale & monitor, but I was wondering if there was a better approach to this.
Thanks in advance!
Other than the application, for which Amazon Elastic Beanstalk is a fine choice, i'd recommend you take a look at Amazon Kinesis: http://aws.amazon.com/kinesis/
That is because you mentioned "scalable", "resilient", "handle failure gracefully" and "exactly once". Those attributes are quite hard to get right in a distributed system, and Kinesis Streams and Client Library can greatly help with that.

How to convert a WAMP stacked app running on a VPS to a scalable AWS app?

I have a web app running on php, mysql, apache on a virtual windows server. I want to redesign it so it is scalable (for fun so I can learn new things) on AWS.
I can see how to setup an EC2 and dump it all in there but I want to make it scalable and take advantage of all the cool features on AWS.
I've tried googling but just can't find a simple guide (note - I have no command line experience of Linux)
Can anyone direct me to detailed resources that can lead me through the steps and teach me? Or alternatively, summarise the steps in an answer so I can research based on what you say.
Thanks
AWS is growing and changing all the time, so there aren't a lot of books to help. Amazon offers training that's excellent. I took their three day class on Architecting with AWS that seems to be just what you're looking for.
Of course, not everyone can afford to spend the travel time and money to attend a class. The AWS re:Invent conference in November 2012 had a lot of sessions related to what you want, and most (maybe all) of the sessions have videos available online for free. Building Web Scale Applications With AWS is probably relevant (slides and video available), as is Dissecting an Internet-Scale Application (slides and video available).
A great way to understand these options better is by fiddling with your existing application on AWS. It will be easy to just move it to an EC2 instance in AWS, then start taking more advantage of what's available. The first thing I'd do is get rid of the MySql server on your own machine and use one offered with RDS. Once that's stable, create one or more read replicas in RDS, and change your application to read from them for most operations, reading from the main (writable) database only when you need completely current results.
Does your application keep any data on the web server, other than in the database? If so, get rid of all local storage by moving that data off the EC2 instance. Some of it might go to the database, some (like big files) might be suitable for S3. DynamoDB is a good place for things like session data.
All of the above reduces the load on the web server to just your application code, which helps with scalability. And now that you keep no state on the web server, you can use ELB and Auto-scaling to automatically run multiple web servers (and even automatically launch more as needed) to handle greater load.
Does the application have any long running, intensive operations that you now perform on demand from a web request? Consider not performing the operation when asked, but instead queueing the request using SQS, and just telling the user you'll get to it. Now have long running processes (or cron jobs or scheduled tasks) check the queue regularly, run the requested operation, and email the result (using SES) back to the user. To really scale up, you can move those jobs off your web server to dedicated machines, and again use auto-scaling if needed.
Do you need bigger machines, or perhaps can live with smaller ones? CloudWatch metrics can show you how much IO, memory, and CPU are used over time. You can use provisioned IOPS with EC2 or RDS instances to improve performance (at a cost) as needed, and use difference size instances for more memory or CPU.
All this AWS setup and configuration can be done with the AWS web console, or command-line tools, or SDKs available in many languages (Python's boto library is great). After learning the basics, look into CloudFormation to automate it better (I've written a couple of posts about that so far).
That's a bit of the 10,000 foot high view of one approach. You'll need to discover the details of each AWS service when you try to use them. AWS has good documentation about all of them.
Depending on how you look at it, this is more of a comment than it is an answer, but it was too long to write as a comment.
What you're asking for really can't be answered on SO--it's a huge, complex question. You're basically asking is "How to I design a highly-scalable, durable application that can be deployed on a cloud-based platform?" The answer depends largely on:
The specifics of your application--what does it do and how does it work?
Your tolerance for downtime balanced against your budget
Your present development and deployment workflow
The resources/skill sets you have on-staff to support the application
What your launch time frame looks like.
I run a software consulting company that specializes in consulting on Amazon Web Services architecture. About 80% of our business is investigating and answering these questions for our clients. It's a multi-week long project each time.
However, to get you pointed in the right direction, I'd recommend that you look at Elastic Beanstalk. It's a PaaS-like service that abstracts away the underlying AWS resources, making AWS easier to use for developers who don't have a lot of sysadmin experience. Think of it as "training wheels" for designing an autoscaling application on AWS.