We have instances set up in an Auto Scaling group on AWS, and we want to collect metrics in order to determine our scalability needs. As far as I know, collectd gathers stats on the same machine and writes them all to RRD files. However, in an autoscaled cluster, if another instance is spawned (assuming the AMI it was spawned from already has collectd installed), how are we supposed to gather the stats of that second instance in the group? It might stay up for only five or six minutes before going down, but we would need its metrics before it terminates. Is there any way to combine these logs for the whole cluster, or can collectd report them somewhere central?
Found the answer. This can be done by using the client-server architecture of collectd. More details can be found here
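For reference, here is a minimal sketch of that client-server setup using collectd's network plugin. The collector hostname is a placeholder; 25826 is collectd's default network port.

```
# On each autoscaled instance (client): forward metrics over the network
# instead of writing local RRD files. "collect.example.com" is a placeholder.
LoadPlugin network
<Plugin network>
  Server "collect.example.com" "25826"
</Plugin>

# On the central collector (server): listen for incoming metrics and
# persist them to RRD files that survive instance termination.
LoadPlugin network
LoadPlugin rrdtool
<Plugin network>
  Listen "0.0.0.0" "25826"
</Plugin>
```

Bake the client half into the AMI so every newly spawned instance starts reporting immediately, and its metrics remain on the collector after the instance goes down.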
I tried Auto Scaling groups and, alternatively, just a bunch of EC2 instances behind a load balancer. Both configurations work fine at first glance.
But when an EC2 instance is part of an Auto Scaling group, it sometimes goes down. In fact it happens very often, almost once a day, and the instances go down in a "hard reset" way: the EC2 monitoring graphs show CPU usage climbing to 100%, then the instance becomes unresponsive, and then it is terminated by the Auto Scaling group.
And it has nothing to do with my processes on these instances.
When an instance is not part of an Auto Scaling group, it can run without CPU usage spikes for years.
The "hard resets" on Auto Scaling group instances are breaking my cron jobs. As much as I like Auto Scaling groups, I cannot use them like this.
Is there a standard way to deal with these "hard resets"?
PS.
The cron jobs run PHP scripts on Ubuntu in my case. I managed to make only one instance run the job.
It sounds like you have a health check that fails while your cron is running, and as a result the instance is being taken out of service.
If you look at the ASG, there should be a reason listed for why the instance was taken out. This will usually be a health check failure, but there could be other reasons as well.
There are a couple things you can do to fix this.
First, determine why your cron is taking 100% of CPU, and how long it generally takes.
Review your health check settings. Are you using HTTP or TCP? What is the interval, and how many checks have to fail before it is taken out of service?
Between those two items, you should be able to adjust the health checks so that the instance isn't taken out of service while the cron runs. It is also possible that the instance itself is failing, typically because it runs out of memory. If that is the case, you may want to consider moving to a larger instance type and/or enabling swap.
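For example, assuming a classic ELB (the group and load balancer names below are placeholders), you can review why instances were removed and loosen the health check from the AWS CLI:

```
# See the reason the ASG recorded for each termination:
aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg

# Review the current health check (target, interval, failure threshold):
aws elb describe-load-balancers --load-balancer-names my-elb \
    --query 'LoadBalancerDescriptions[0].HealthCheck'

# Loosen the check so a busy cron window doesn't fail it; here the
# instance must fail 5 checks, 60 seconds apart, before removal:
aws elb configure-health-check --load-balancer-name my-elb \
    --health-check Target=HTTP:80/health,Interval=60,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2
```

With a 60-second interval and a threshold of 5, the cron job gets roughly a five-minute window of unresponsiveness before the instance is marked unhealthy.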
I once had a similar issue; in that case it was the system auto-update running. The system (a Windows server) was downloading a big update and took 100% of the CPU for hours. My suggestion is to monitor which services are running at that moment (even if the OS is Linux), and also to check for any scheduled tasks, since this looks like it is happening periodically. Other than that, try to keep the task list open during the event and see what is going on.
I am studying AWS, per the illustration in AWS here:
For the min/max=1 case, what does it imply? It seems like no scaling to me, since min = max.
Thank you kindly for enlightening me.
UPDATE:
so here is an example use case:
http://www.briefmenow.org/amazon/how-can-you-implement-the-order-fulfillment-process-while-making-sure-that-the-emails-are-delivered-reliably/
Your startup wants to implement an order fulfillment process for
selling a personalized gadget that needs an average of 3-4 days to
produce, with some orders taking up to 6 months. You expect 10 orders
per day on your first day, 1,000 orders per day after 6 months, and
10,000 orders per day after 12 months. Orders coming in are checked for
consistency, then dispatched to your manufacturing plant for production,
quality control, packaging, shipment, and payment processing. If the
product does not meet the quality standards at any stage of the
process, employees may force the process to repeat a step. Customers are
notified via email about order status and any critical issues with
their orders, such as payment failure. Your current architecture includes
AWS Elastic Beanstalk for your website with an RDS MySQL instance for
customer data and orders. How can you implement the order fulfillment
process while making sure that the emails are delivered reliably?
Options:
A.
Add a business process management application to your Elastic Beanstalk app servers and re-use the RDS
database for tracking order status. Use one of the Elastic Beanstalk instances to send emails to customers.
B.
Use SWF with an Auto Scaling group of activity workers and a decider instance in another Auto Scaling group
with min/max=1. Use the decider instance to send emails to customers.
C.
Use SWF with an Auto Scaling group of activity workers and a decider instance in another Auto Scaling group
with min/max=1. Use SES to send emails to customers.
D.
Use an SQS queue to manage all process tasks. Use an Auto Scaling group of EC2 instances that poll the tasks
and execute them. Use SES to send emails to customers.
The voted answer is C.
Can anyone kindly share the understanding? Thank you very much.
Correct, there will be no scaling out or in when min/max=1, or more generally whenever min = max. This configuration is generally used to keep a service available in case of failures.
Consider the alternative: you launch a standalone EC2 instance that's been bootstrapped with some user data script. If the instance has issues, you'll need to stop it and start another yourself.
Instead, you launch using an AutoScaling Group with a Launch Configuration that takes care of bootstrapping instances. If your application server begins to fail, you can just de-register it from the AutoScaling Group. AWS will take care of bringing up another instance while you triage the defective one.
Another situation you might consider is when you want the option to deploy a new version of an application with the same AutoScaling Group. In this case, create a new Launch Configuration and register it with the ASG. Increase max and desired by 1 temporarily. AWS will launch the instance for you and if it succeeds, you can then reduce Max and Desired back down to 1. By default, AWS will remove the oldest server but you can guarantee that the new one stays up by using termination protection.
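A hedged CLI sketch of that deployment procedure (the ASG name, launch configuration name, and instance ID below are placeholders). Note that scale-in protection is the mechanism that keeps the new instance up while the group shrinks back down:

```
# Point the ASG at the new launch configuration and temporarily grow it:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name web-asg \
    --launch-configuration-name web-lc-v2 --max-size 2 --desired-capacity 2

# Once the new instance is healthy, protect it from scale-in so the
# default termination policy removes the old instance instead:
aws autoscaling set-instance-protection --auto-scaling-group-name web-asg \
    --instance-ids i-0123456789abcdef0 --protected-from-scale-in

# Shrink back to a single protected, up-to-date instance:
aws autoscaling update-auto-scaling-group --auto-scaling-group-name web-asg \
    --max-size 1 --desired-capacity 1
```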
We have a Kubernetes cluster set up in an AWS VPC with 10+ nodes. We had an incident where one node was not accessible from the other nodes, and vice versa, for about 10 minutes. Finding this out took quite a lot of time.
Is there a tool for Kubernetes or AWS to detect these kinds of network problems? Maybe something like a DaemonSet where each pod pings the others in the network and logs when a ping fails.
If you are mostly interested in being alerted when such a problem happens, I would set up a monitoring system and hook it up with something like Alertmanager. For collecting metrics, you can look at an open-source project such as Prometheus. Once you have that set up, it is really easy to integrate with Grafana (for dashboards) and Alertmanager (for alerting based on rules you specify in Prometheus). They are all open-source projects.
https://prometheus.io/
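As a concrete starting point, here is a sketch of a Prometheus alerting rule that fires when a scrape target stops responding; the job name, duration, and labels are assumptions you would adapt to your own scrape config:

```yaml
# Prometheus rule-file syntax; "node-exporter" is an assumed job name.
groups:
  - name: node-availability
    rules:
      - alert: NodeUnreachable
        # `up` is 0 when Prometheus fails to scrape the target.
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
```

With node-exporter running on every node, a node that drops off the network stops being scraped and this rule pages you within a couple of minutes instead of leaving you to discover the outage later.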
I want to create a web app for my organization where users can schedule in advance at what times they'd like their EC2 instances to start and stop (like creating events in a calendar), and those instances will be automatically started or stopped at those times. I've come across four different options:
AWS Data Pipeline
Cron running on EC2 instance
Scheduled scaling of Auto Scaling Group
AWS Lambda scheduled events
It seems to me that I'll need a database to store the user's scheduled times for autostarting and autostopping an instance, and that I'll have to pull that data from the database regularly (to make sure that's the latest updated schedule). Which would be the best of the four above options for my use case?
Edit: Auto Scaling only seems to be for launching and terminating instances, so I can rule that out.
Simple!
Ask users to add a tag to their instance(s) indicating when they should start and stop (figure out some format so they can easily specify Mon-Fri or Every Day)
Create an AWS Lambda function that scans instances for their tags and starts/stops them based upon the tag content
Create an Amazon CloudWatch Event rule that triggers the Lambda function every 15 minutes (or whatever resolution you want)
You can probably find some sample code if you search for AWS Stopinator.
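As a sketch of that stopinator pattern: the `Schedule` tag format below is an assumption (not an AWS convention), and all names are placeholders. The schedule parser is pure Python; the boto3 reconciliation function would be the body of the Lambda.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def should_be_running(schedule: str, now: datetime) -> bool:
    """True if `now` falls inside a window like "Mon-Fri 09:00-18:00"
    or "Every Day 08:00-20:00" (a made-up tag format for illustration)."""
    days_part, hours_part = schedule.rsplit(" ", 1)
    if days_part == "Every Day":
        days = set(range(7))
    else:
        start_day, end_day = days_part.split("-")
        days = set(range(DAYS.index(start_day), DAYS.index(end_day) + 1))
    start_s, end_s = hours_part.split("-")
    start = datetime.strptime(start_s, "%H:%M").time()
    end = datetime.strptime(end_s, "%H:%M").time()
    return now.weekday() in days and start <= now.time() < end

def enforce_schedules(region: str = "us-east-1") -> None:
    """Scan instances carrying a Schedule tag and reconcile their state."""
    import boto3  # imported lazily so the parser above stays dependency-free
    ec2 = boto3.client("ec2", region_name=region)
    now = datetime.now()
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["Schedule"]}]
    )["Reservations"]
    for res in reservations:
        for inst in res["Instances"]:
            tag = next(t["Value"] for t in inst["Tags"] if t["Key"] == "Schedule")
            wanted = should_be_running(tag, now)
            state = inst["State"]["Name"]
            if wanted and state == "stopped":
                ec2.start_instances(InstanceIds=[inst["InstanceId"]])
            elif not wanted and state == "running":
                ec2.stop_instances(InstanceIds=[inst["InstanceId"]])
```

A CloudWatch Events rule would invoke `enforce_schedules` every 15 minutes; storing the schedule in tags rather than a database means there is nothing extra to keep in sync.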
Take a look at ParkMyCloud if you're looking for an external SaaS app that can help your users easily schedule (or override that schedule) your EC2, RDS, and ASG instances. It also connects to SSO, provides an API, and shows you all of your resources across regions/accounts/clouds. There's a free trial available if you want to test it out.
Disclosure: I work for ParkMyCloud.
So I'm running nginx on three EC2 servers in different locations (US, EU, Asia). I want to run a Perl script every day on the joined log files (each EC2 instance keeps an nginx log at /var/log/nginx/access.log).
Amazon's CloudWatch seems to have some similar abilities, but then again I keep reading about pushing each log to an S3 location. What is the easiest way to accomplish this?
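If you go the S3 route, one low-tech sketch (the bucket name and script path are assumptions) is a pair of cron entries: each server pushes its log daily, and one machine pulls everything down, joins it, and runs the Perl script. Note that `%` must be escaped as `\%` inside crontab entries:

```
# On each of the three servers: push the day's log to S3 at 00:05.
5 0 * * * aws s3 cp /var/log/nginx/access.log \
    "s3://my-log-bucket/nginx/$(hostname)/$(date +\%F).log"

# On one machine: pull all logs, concatenate, and run the script at 00:30.
30 0 * * * aws s3 sync s3://my-log-bucket/nginx/ /tmp/nginx-logs/ && \
    cat /tmp/nginx-logs/*/*.log | perl /opt/scripts/analyze.pl
```

For a more robust pipeline you would rotate the log before uploading so each day's file is complete, but the shape of the solution is the same.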
I have been amazed at the cost, performance, and search capabilities of log-aggregator services like Papertrail for these kinds of problems.
We have 30 instances of all types running Windows with NXLog configured on each. Any time we spin up an instance, its logs are immediately captured by the Papertrail syslog service. I cannot imagine running cloud services without some log aggregator.
The searching and archiving are great. Papertrail has a free plan: 100 MB/month, a 48-hour search window, and 7-day archiving.
Disclaimer: not affiliated with Papertrail, just a happy customer.