Restart EC2 instance on Website unavailability - amazon-web-services

I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website in unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?

This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to the instances. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer since metrics are calculated over a period rather than being "now" metrics).

John's approach is the correct one, but at its simplest:
Write a lambda function that can query your website and see if it is running or not and if not have that lambda function restart the instance.
Setup a cloudwatch event rule that runs on a frequency you determine to call the lambda function
I'll leave to you the work of writing the code that determines if the website is functional and restarting the server - but that is pretty straightforward. You can use python, java, node, go or .net core in your lambda function - I would think python would be the easiest in this case, but that is an opinion.

It is clear that this is not a best practice in AWS but can make some sense - e.g. you are running a small personal web server with low demand where availability is a less issue than costs.
At least that was my reason why I built automation for it.
diagram
lambda code
import json
import os
import boto3
import time
env_vars = [
'ALARM_NAME',
'REGION',
'INSTANCE_ID',
'OUTPUT_SNS_ARN'
]
ENV = {}
for env_var in env_vars:
ENV[env_var] = os.environ.get(env_var, None)
if not ENV[env_var]:
raise Exception(f"Environment variable {env_var} must be set!")
def reboot_instance(instanceID, regionName) -> "instanceID":
"""
InstanceID
instanceID - ID of instance
regionName - name of region
return InstanceID or False in case of exception
"""
ec2 = boto3.resource('ec2', region_name=regionName)
instance = ec2.Instance(instanceID)
try:
instance.stop()
time.sleep(30)
instance.stop(Force=True)
except:
pass
for i in range(180): # wait 3 minutes
instance = ec2.Instance(instanceID)
if instance.state['Code'] == 80:
break
time.sleep(1)
else:
raise Exception('Unable to stop instance')
instance.start()
return instanceID
def notify_about_reboot(instanceID, snsarn) -> True:
"""
Put SNS message about reboot to snsarn
"""
client = boto3.client('sns', region_name='us-east-1')
client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
return True
def lambda_handler(event, context) -> "status about reboot":
"""
event: see events/event.json
"""
print('EVENT:')
print(event)
for record in event.get('Records', None):
sns = record.get('Sns', None)
message = json.loads(sns.get('Message', None))
msgalarm = message.get('AlarmName', None)
msgstatus = message.get('NewStateValue', None)
if not all([sns,message,msgalarm,msgstatus]):
continue
if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
return 'rebooting'
else:
return 'nothing to do'
return 'no sns record found'
I have released whole tested automation with SAM template and installation instructions also on https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot

Related

Handle AWS Spot Instance Termination

I am using Spot Instances to run some batch jobs.
However lately we have been seeing a lot of spot instance terminations and want to use the 2-minute interruption notice that aws sends before an instance is terminated.
Sources:
https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
My approach here was to run a separate thread in my application that polls the instance meta-data url
http://169.254.169.254/latest/meta-data/spot/instance-action to check if a termination notice has been sent out and raise an exception (or re-trigger the current job)
My code
interruption_monitor.py
import requests
import log
from time import sleep
from threading import Thread
class InstanceTerminated(Exception):
"""Instance Terminated Exception class"""
class InterruptionMonitor(Thread):
"""Threaded Interruption monitor"""
def __init__(self, sleep_time=0.1, report_time=5):
super().__init__(daemon=True)
self.sleep_time = min(sleep_time, report_time)
self.count = 0
self.logger = log.get_logger(f"my_app.{__name__}")
def check_interruption_notice(self):
"""Check for interruption Notice"""
self.logger.info("CHECKING FOR INTERRUPTION NOTICE...")
url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
response = requests.get(url=url, timeout=5)
self.logger.info("RESPONSE:", resp=response)
if response.status_code != 404:
# print(response)
if response.action == 'stop' or response.action == 'terminate':
raise InstanceTerminated # Or retrigger the Job
def run(self):
"""Entry point for thread execution"""
while True:
self.check_interruption_notice()
sleep(self.sleep_time)
self.count += 1
There are 2 questions that i am looking an answer for:
Is the correct way of handling this? Is there any added cost or if this would effect my existing job performance in any way? If yes what? If No, please suggest a better approach to this?
I am not able to test the positive scenario as I have to wait for AWS to interrupt my spot instances to see if it works as I expected. Is there a way to manually cause the spot instance terminations so that I receive the interruption notice and verify that this works.
PS: I am a noob with AWS, so please bear with me
Would there be any added cost or if this would effect my existing Job Performance in any way?
There is no added cost for running another thread in your current ECS processes. Why would there be? Please take the time to understand how ECS bills you if you are concerned about that. ECS doesn't bill per thread.
There could definitely be a performance hit if you poll too often. Your default setting of 0.1 seconds polling is way too fast. I don't understand what you are doing with sleep_time and report_time values, but AWS recommends in the documentation you linked to poll every 5 seconds, not every 0.1 seconds.
Is there a way to manually cause the spot instance terminations so that I receive the interruption notice and verify that this works.
Unfortunately, there is no way to manually trigger that on ECS that I am aware of.

lambda ec2 create instance and assign an elastic IP

Creating an instance in lambda and assigning an Elastic IP. Create instance part works, code section below is meant to wait before assigning the elastic IP. (1) What state does the instance have to be in to assign? (assuming "running") (2) Below logic never proceeds to after the loop, although I can verify the instance goes into "running" state. I verified the instance ID is good and there's only 1 instance in this case.
print ('waiting')
newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
while (newresp['InstanceStatuses'][0]['InstanceState']['Name'] != 'running'):
newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
print ('New Instance Running')
The "timeout" setting is important in lambda. This affects loops and timers. If you have for/while loops, sleep, or other functions that take time, lambda should be configured.
When I created the function the default setting was 3s execution time. The statement after the loop was never reached because the function timed out.
You can set the timeout from the function "Configuration" tab under General Configuration.
You have to remember though, pricing is based on the amount of time your code runs. I set it to 30s, and checked status at 5s intervals in the loop. Here's the final code, starting from the loop section, including the associate command:
....
while (newresp['InstanceStatuses'][0]['InstanceState']['Name'] != 'running'):
time.sleep(5)
newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
print ('Associating Elastic IP')
try:
ec2_client.associate_address(AllocationId='eipalloc-0805514a57680aaf8',InstanceId=newins,AllowReassociation=True)
print ('New Instance Running')
except ClientError as e:
print(e)

Google App Engine (GAE) basic scaling backend instance serves one request and undeploys

I have deployed an application (frontend and backend) in App Engine. First of all, I am using the free tier and I chose the default F1 for the frontend and B2 for the backend. I don't exactly understand the difference between B and F instances but based on their names, I chose them for backend and frontend respectively.
My backend is a Flask application that reads some data from Firestore on #app.before_first_request and "pre-caches" it for all future requests. This takes about 20-30 seconds before the first request is served so I really don't want the backend instance to become undeployed all the time.
Right now, my backend successfully serves one request (that I am making from the browser) and then immediately gets undeployed (basically I see no active instances in App Engine dashboard after the request is served). This means that every request once again has the same long delay upon server start that I don't want. I am not sure why this is happening because I've set idle timeout to 5 minutes. I know it is not a problem with my Flask application because it does not crash after a request on a local machine and I've done its memory profiling which is in bounds of B2 limits. This is my app.yaml for the backend:
runtime: python38
service: api
env_variables:
PORT: 8080
instance_class: B2
basic_scaling:
max_instances: 1
idle_timeout: 5m
Any insight would be appreciated!
Based on the information and behavior that you are exposing, please allow me to explain to you that both Scaling models are behaving as they are designed to do so.
“Automatic Scaling: It creates instances based on request rate, response latencies, and other application metrics. You can specify thresholds for each of these metrics, and a minimum number instances to keep running always.
Basic Scaling: Basic scaling creates instances only when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.”
Use the following URL’s documentation as reference for those models and more of them How Instances are Managed.
Information added on 10/12/2021:
Hi,
I think the correct term is “shutdown” instead of “undeployed” Disabling your application. Looking at Instance States "an instance of a manual or basic scaled service can be either running or stopped. All instances of the same service and version share the same state." then looking at Scaling types "Basic scaling creates instances when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity." and the table's Startup and shutdown row for basic scaling "Instances are created on demand to handle requests and automatically shut down when idle, based on the idle_timeout configuration parameter. An instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated." and Scaling down "You can specify a minimum number of idle instances. Setting an appropriate number of idle instances for your application based on request volume allows your application to serve every request with little latency".
Could you please verify:
that the instance was not manually halted?
that instance is becoming idle?
that there are no background threads?
if functionality is the same when setting the max_instances to 2
that there are no logs showcasing an instance shutdown
that they are reaching the version with the updated the idle_timeout set

AWS Win Tasks aren't running after server restart via Lambda

I have coded a simple task scedule that turns the AWS Windows server off and a simple Lambda code to turn the stopped instaces on, see code:
import time
import json
import boto3
def lambda_handler(event, context):
# boto3 client
client = boto3.client('ec2')
ssm = boto3.client('ssm')
# getting instance information
describeInstance = client.describe_instances()
#print("eddie", describeInstance["Reservations"][0]["Instances"][0]["State"]["Name"])
InstanceId = []
# fetchin instance id of the running instances
for i in describeInstance['Reservations']:
for instance in i['Instances']:
if instance["State"]["Name"] == "stopped":
InstanceId.append(instance['InstanceId'])
client.start_instances(InstanceIds=InstanceId)
"""looping through instance ids
for instanceid in InstanceId:
# command to be executed on instance
client.start_instances(InstanceIds=InstanceId)
print(output)"""
return {
'statusCode': 200,
'body': json.dumps('Thanks from Srce Cde!')
}
I have several tasks on my task scheduler that run Python scripts but after the activation those tasks arent running even when the server in running.
Note, I have tried to set "run whether logged on or not" on all of them but recived (0x1) errors related to privileges on those tasks
Does anyone know how to solve this? Is there any other way to turn off and on EC2 AWS win server in night time to save on billing

Best practice to run Prefect flow serverless in Google Cloud

I have started using Prefect for various projects and now I need to decide which deployment strategy on GCP would work best. Preferably I would like to work serverless. Comparing Cloud Run, Cloud Functions and App Engine, I am inclined to go for the latter since this doesn't have a timeout limit, while the other two have of 9 resp. 15 minutes.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Alternatively, a more classic approach would be to deploy Prefect on Compute Engine and schedule this via Cloud Scheduler. But I feel this is somewhat outdated and doesn't do justice to the functionality of Prefect and flexibility for future development.
Am interested to hear how people have deployed Prefect flows serverlessly, such that Flows are scheduled/triggered for batch processing, whilst the agent is automatically scaled down when not used.
Prefect has a blog post on serverless deployment with AWS Lambda which is a good blueprint for doing the same with GCP. The challenge here is the agent scaling - agents work by polling the backend (whether a self deployment of Prefect Server or the hosted Prefect Cloud) on a regular basis (every ~10 secs). One possibility that comes to mind would be to use a Cloud Function to spin up an agent in-process, triggered by whatever batch processing/scheduling event you're thinking of. You can also use the -max-polls CLI argument or kwarg to spin up the agent to look for runs; it'll tear itself down if it doesn't find anything after however many polling attempts you specify. Details on that here or on any of the specific agent pages.
However, this could be inefficient for long-running flows and you might hit resource caps; it might be worthwhile to look at triggering an auto-scaling Dask cluster deployment if the workloads are high enough. Prefect supports that natively with Kubernetes, and has a Kubernetes agent to interact with your cluster. I think this would be the most elegant and scalable solution without having to go the classic Compute Engine route, which I agree is somewhat dated and doesn't provide great auto-scaling or first-class management.
Better support of serverless execution is on the roadmap, specifically a serverless agent is in the works but I don't have an ETA on when that'll be released.
Hopefully that helps! :)
Recently added to Prefect is the Vertex Agent which uses GCP Vertex, the inheritor to AIP. Vertex has a highly configurable serverless execution environment, and no timeouts.
The full explanation is here: https://jerryan.medium.com/hacking-ways-to-run-prefect-flow-serverless-in-google-cloud-function-bc6b249126e4.
Basically, there are two hacking ways to solve the problem.
Use google cloud storage to persistent task states automatically
Publish previous execution results of a cloud function to its subsequent execution.
Caching and Persisting Data
By default, the Prefect Core stored all data, results, and cached states in memory within the Python process running the flow. However, they can be persisted and retrieved from external locations if necessary hooks are configured.
The Prefect has a notion of “checkpointing” that ensures that every time a task is successfully run, its return value write to persistent storage based on the configuration in a result object and target for the task.
#task(result=LocalResult(dir="~/.prefect"), target="task.txt")
def func_task():
return 99
The complete code example is shown below. Here we write to and read from a Google Cloud Bucket by using GCSResult.
import os
os.environ["PREFECT__FLOWS__CHECKPOINTING"] = "true"
from prefect import task, Flow
from prefect.engine.results import LocalResult, GCSResult
#task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task1():
print("Task 1")
return "Task 1"
#task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task2():
print("Task 2")
return "Task 2"
#task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task3():
print("Task 3")
return "Task 3"
#task(target="{date:%Y-%m-%d}/{task_name}.txt")
def task4():
print("Task 4")
#task
def task5():
print("Task 5")
#task
def task6():
print("Task 6")
#task
def task7():
print("Task 7")
#task
def task8():
print("Task 8")
# with Flow("This is My First Flow",result=LocalResult(dir="~/prefect")) as flow:
with Flow("this is my first flow", result=GCSResult(bucket="prefect")) as flow:
t1, t2 = task1(), task2()
t3 = task3(upstream_tasks=[t1,t2])
t4 = task4(upstream_tasks=[t3])
t5 = task5(upstream_tasks=[t4])
t6 = task6(upstream_tasks=[t4])
t7 = task7(upstream_tasks=[t2,t6])
t8 = task8(upstream_tasks=[t2,t3])
# run the whole flow
flow_state = flow.run()
# visualize the flow
flow.visualize(flow_state)
# print the state of the flow
print(flow_state.result)
Publish Execution Results
Another hacking solution is to publish previous execution results of a google cloud function to its subsequent execution. Here, we assume there is no data input and output dependence between tasks.
Some modifications are needed to make it happen.
Change custom state handers for tasks
Manually change task state before publishing
Encode/Decode the task state
First, we know the flow.run function finishes after all the tasks are entered into the finish state, whether it is a success or failure. However, we don’t want all tasks run inside a single call of the google cloud function because the total run time may exceed 540 seconds.
So a custom state hander for the task is used. Every time a task finishes, we emit an ENDRUN signal to the prefect framework. Then it will set the state of the remaining tasks to Cancelled.
from prefect import task, Flow, Task
from prefect.engine.runner import ENDRUN
from prefect.engine.state import State, Cancelled
num_finished = 0
def my_state_handler(obj, old_state, new_state):
global num_finished
if num_finished >= 1:
raise ENDRUN(state=Cancelled("Flow run is cancelled"))
if new_state.is_finished():
num_finished += 1
return new_state
Second, to make tasks with canceled status execute correctly next time, we must manually change their status to pending.
def run(task_state_dict: Dict[Task, State]) -> Dict[Task, State]:
flow_state = flow.run(task_states=task_state_dict)
task_states = flow_state.result
# change task state before next publish
for t in task_states:
if isinstance(task_states[t], Cancelled):
task_states[t] = Pending("Mocked pending")
# TODO: reset global counter
global num_finished
num_finished = 0
# task state for next run
return task_states
Third, there are two essential functions: encoding_data and decode_data. The former serialize the task states to be ready be published, and the latter deserialize the task states into flow object.
# encoding: utf-8
from typing import List, Dict, Any
from prefect.engine.state import State
from prefect import Flow, Task
def decode_data(flow: Flow, data: List[Dict[str, Any]]) -> Dict[Task, State]:
# data as follows:
# [
# {
# "task": {
# "slug": "task1"
# }
# "state": {
# "type": "Success",
# "message": "Task run succeeded(manually set)"
# }
# }
# ]
task_states = {}
for d in data:
tasks_found = flow.get_tasks(d['task']['slug'])
if len(tasks_found) != 1: # 不唯一就不做处理了
continue
state = State.deserialize(
{"message": d['state']['message'],
"type": d['state']['type']
}
)
task_states[tasks_found[0]] = state
return task_states
def encode_data(task_states: Dict[Task, State]) -> List[Dict[str, Any]]:
data = []
for task, state in task_states.items():
data.append({
"task": task.serialize(),
"state": state.serialize()
})
return data
Last but not least, the orchestration connects all the parts above.
def main(data: List[Dict[str, Any]], *args, **kargs) -> List[Dict[str, Any]]:
task_states = decode_data(flow, data)
task_states = run(task_states)
return encode_data(task_states)
if __name__ == "__main__":
evt = []
while True:
data = main(evt)
states = defaultdict(set)
for task in data:
task_type, slug = task['state']['type'], task['task']['slug']
states[task_type].add(slug)
if len(states['Pending']) == 0:
sys.exit(0)
evt = data
# send pubsub message here
# GooglePubsub().publish(evt)
# sys.exit(0)