Handle AWS Spot Instance Termination - amazon-web-services

I am using Spot Instances to run some batch jobs.
However, lately we have been seeing a lot of spot instance terminations, and I want to use the 2-minute interruption notice that AWS sends before an instance is terminated.
Sources:
https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
My approach here was to run a separate thread in my application that polls the instance metadata URL
http://169.254.169.254/latest/meta-data/spot/instance-action to check whether a termination notice has been sent out, and to raise an exception (or re-trigger the current job).
My code
interruption_monitor.py
import requests
import log
from time import sleep
from threading import Thread


class InstanceTerminated(Exception):
    """Instance Terminated exception class"""


class InterruptionMonitor(Thread):
    """Threaded interruption monitor"""

    def __init__(self, sleep_time=0.1, report_time=5):
        super().__init__(daemon=True)
        self.sleep_time = min(sleep_time, report_time)
        self.count = 0
        self.logger = log.get_logger(f"my_app.{__name__}")

    def check_interruption_notice(self):
        """Check for an interruption notice"""
        self.logger.info("CHECKING FOR INTERRUPTION NOTICE...")
        url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
        response = requests.get(url=url, timeout=5)
        self.logger.info("RESPONSE:", resp=response)
        if response.status_code != 404:
            # Once a notice is issued the endpoint returns a JSON document
            # such as {"action": "terminate", "time": "..."}.
            action = response.json().get('action')
            if action in ('stop', 'terminate'):
                raise InstanceTerminated  # Or re-trigger the job

    def run(self):
        """Entry point for thread execution"""
        while True:
            self.check_interruption_notice()
            sleep(self.sleep_time)
            self.count += 1
There are two questions I am looking for an answer to:
Is this the correct way of handling this? Is there any added cost, or would this affect my existing job performance in any way? If yes, what? If not, please suggest a better approach.
I am not able to test the positive scenario, as I have to wait for AWS to interrupt my spot instances to see if it works as I expect. Is there a way to manually cause a spot instance termination so that I receive the interruption notice and can verify that this works?
PS: I am a noob with AWS, so please bear with me

Would there be any added cost, or would this affect my existing job performance in any way?
There is no added cost for running another thread in your current processes. Why would there be? Please take the time to understand how EC2 bills you if you are concerned about that: EC2 doesn't bill per thread.
There could definitely be a performance hit if you poll too often. Your default polling interval of 0.1 seconds is far too fast. I don't understand what you are doing with the sleep_time and report_time values, but the documentation you linked recommends polling every 5 seconds, not every 0.1 seconds.
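To make that concrete, here is a minimal sketch of a polling loop at the recommended 5-second interval (not the poster's code; the endpoint and JSON shape are taken from the linked documentation):

# Sketch: poll the spot instance-action endpoint every 5 seconds.
import requests
from time import sleep

METADATA_URL = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
POLL_INTERVAL = 5  # seconds, per the AWS documentation linked above

def wait_for_interruption():
    """Block until a stop/terminate notice appears, then return the action."""
    while True:
        response = requests.get(METADATA_URL, timeout=5)
        if response.status_code == 200:
            action = response.json().get('action')
            if action in ('stop', 'terminate'):
                return action
        sleep(POLL_INTERVAL)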
Is there a way to manually cause a spot instance termination so that I receive the interruption notice and verify that this works?
Unfortunately, there is no way I am aware of to manually trigger that on EC2.
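If the goal is only to verify the code path rather than to trigger a real interruption, one option (my suggestion, not an AWS feature) is to point the monitor at a local stub that mimics the metadata endpoint; this assumes the URL is made configurable, which the posted class does not currently allow:

# Hypothetical local test: serve a fake instance-action document and point
# the monitor at it instead of 169.254.169.254.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class FakeMetadataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"action": "terminate",
                           "time": "2024-01-01T00:00:00Z"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("127.0.0.1", 8099), FakeMetadataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
# Now run the monitor with url='http://127.0.0.1:8099/' (requires making the
# URL a parameter) and assert that InstanceTerminated is raised.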

Related

Keep retrying Celery task and only move on if task succeeds or max retries reached

I have a Celery task that retries on failure with exponential backoff. This task POSTs messages received by a Django application (call it "transport") to a second Django application (call it "base") for processing. These messages have to be processed in order, and so the order in which the tasks are queued must be maintained. However, when a task fails (because transport cannot connect to base for whatever reason), it is relegated to the back of the queue, which is obviously an issue.
Is it possible to "block" a Celery queue, i.e. to keep retrying the same task until it either succeeds or reaches the max retries threshold, and only then move to the next task in the queue? In my case I need the task at the head of the queue to keep trying that POST until it can't anymore, and under no circumstances should the order of tasks in the queue be changed, though I'm not sure if this can be done with Celery (and if so, how).
I've come across this previous question which seems to describe a very similar problem, but it's not fully relevant to my use case.
If your tasks are chained together, Celery won't process subsequent tasks until the first one finishes. So you could do:

# `task` here is your Celery app's task decorator (e.g. app.task / shared_task)
@task(bind=True, acks_late=True)
def retry_task(self, *args, **kwargs):
    try:
        ...  # do stuff
    except PossibleExceptions:
        self.retry()

@task
def next_task(*args, **kwargs):
    ...  # foo

task_signature = retry_task.si() | next_task.si()
task_signature.apply_async()
Please read the documentation on retrying. It has gained a lot of new features since I last looked at it.
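For reference, newer Celery releases can express much of this retry policy declaratively, using the same task decorator as above; a rough sketch, where the exception type and limits are just example values:

# Declarative retries (Celery 4+); the exception type and numbers are examples.
@task(bind=True,
      acks_late=True,
      autoretry_for=(ConnectionError,),
      retry_backoff=True,          # exponential backoff between attempts
      retry_backoff_max=600,       # cap the delay at 10 minutes
      retry_kwargs={'max_retries': 10})
def post_to_base(self, payload):
    ...  # POST the message to the "base" application here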

Django app with multiple instances - how to ensure daily email is only sent once?

I am building a Django app that uses APScheduler to send out a daily email at a scheduled time each day. Recently the decision was made to bump up the number of instances to two in order to always have something running in case one of the instances crashes. The problem I am now facing is how to prevent the daily email from being sent out by both instances. I've considered having it set some sort of flag on the database (Postgres) so the other instance knows not to send, but I think this method would create race conditions--the first instance wouldn't set the flag in time for the second instance to see or some similar scenario. Has anybody come up against this problem and how did you resolve it?
EDIT:
from apscheduler.schedulers.background import BackgroundScheduler

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, 'cron', hour=11)
    scheduler.start()
So this is run when my app initializes--this creates a background scheduler that runs the send_daily_emails function at 11am each morning. The send_daily_emails function is exactly that--all it does is send a couple of emails. My problem is that if there are two instances of the app running, two separate background schedulers will be created and thus the emails will be sent twice each day instead of once.
You can use your proposed database solution with select_for_update to avoid the race condition.
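A minimal sketch of that idea, assuming a hypothetical DailyEmailLog model with a unique date field and a sent flag (the model and function names are illustrative, not from the question):

# Hypothetical guard: only one instance gets to send the email for a given day.
from django.db import transaction
from django.utils import timezone

def send_daily_emails_once():
    today = timezone.localdate()
    # Ensure the row exists (unique constraint on `date` assumed).
    DailyEmailLog.objects.get_or_create(date=today)
    with transaction.atomic():
        # Lock the row; the second instance blocks here until the first commits.
        log = DailyEmailLog.objects.select_for_update().get(date=today)
        if log.sent:
            return  # already handled by the other instance
        send_daily_emails()
        log.sent = True
        log.save()

The unique constraint on the date column is what ultimately prevents duplicates; select_for_update just makes the check-and-set atomic.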
If you're using celery, why not use celery-beat + django-celery-beat?
You can use something like the following. Note the max_instances param.
from apscheduler.schedulers.background import BackgroundScheduler

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, trigger='cron', hour='23', max_instances=1)
    scheduler.start()

Restart EC2 instance on Website unavailability

I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website is unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?
This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to the instances. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check.
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer since metrics are calculated over a period rather than being "now" metrics).
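To give a feel for the health-check part of that design, the checks are configured on the load balancer's target group; a rough boto3 sketch, where the name, VPC ID and thresholds are placeholders:

# Placeholder names/IDs; the health check settings are the relevant part.
import boto3

elbv2 = boto3.client('elbv2', region_name='us-east-1')
elbv2.create_target_group(
    Name='my-web-tg',
    Protocol='HTTP',
    Port=80,
    VpcId='vpc-0123456789abcdef0',
    HealthCheckProtocol='HTTP',
    HealthCheckPath='/health',
    HealthCheckIntervalSeconds=10,   # check every few seconds
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3,
)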
John's approach is the correct one, but at its simplest:
Write a Lambda function that can query your website and see whether it is running or not; if not, have that Lambda function restart the instance.
Set up a CloudWatch Events rule that runs on a frequency you determine to call the Lambda function.
I'll leave the work of writing the code that determines whether the website is functional and restarts the server to you - but that is pretty straightforward. You can use Python, Java, Node, Go or .NET Core in your Lambda function - I would think Python would be the easiest in this case, but that is an opinion.
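A bare-bones sketch of such a function might look like the following; the URL, instance ID and region are placeholders, and treating any error as "unavailable" is a simplifying assumption:

# Hypothetical Lambda: reboot the instance when the site stops answering.
import boto3
import urllib.request

WEBSITE_URL = 'https://example.com/'        # placeholder
INSTANCE_ID = 'i-0123456789abcdef0'         # placeholder
REGION = 'us-east-1'

def lambda_handler(event, context):
    try:
        with urllib.request.urlopen(WEBSITE_URL, timeout=10) as resp:
            if resp.status < 500:
                return 'website healthy, nothing to do'
    except Exception:
        pass  # treat any error or timeout as "unavailable"
    ec2 = boto3.client('ec2', region_name=REGION)
    ec2.reboot_instances(InstanceIds=[INSTANCE_ID])
    return f'rebooted {INSTANCE_ID}'

urllib is used here rather than requests because the Lambda Python runtime does not bundle the requests library by default.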
It is clear that this is not a best practice in AWS, but it can make some sense - e.g. you are running a small personal web server with low demand, where availability is less of an issue than cost.
At least, that was my reason for building automation for it.
Diagram: Route 53 health check -> CloudWatch alarm -> SNS topic -> Lambda function that reboots the instance (image not reproduced here).
Lambda code:
import json
import os
import boto3
import time

env_vars = [
    'ALARM_NAME',
    'REGION',
    'INSTANCE_ID',
    'OUTPUT_SNS_ARN'
]

ENV = {}

for env_var in env_vars:
    ENV[env_var] = os.environ.get(env_var, None)
    if not ENV[env_var]:
        raise Exception(f"Environment variable {env_var} must be set!")


def reboot_instance(instanceID, regionName) -> "instanceID":
    """
    Stop and start (reboot) an instance.
    instanceID - ID of the instance
    regionName - name of the region
    return instanceID, or raise an exception if the instance cannot be stopped
    """
    ec2 = boto3.resource('ec2', region_name=regionName)
    instance = ec2.Instance(instanceID)
    try:
        instance.stop()
        time.sleep(30)
        instance.stop(Force=True)
    except:
        pass
    for i in range(180):  # wait up to 3 minutes for the instance to stop
        instance = ec2.Instance(instanceID)
        if instance.state['Code'] == 80:  # 80 == stopped
            break
        time.sleep(1)
    else:
        raise Exception('Unable to stop instance')
    instance.start()
    return instanceID


def notify_about_reboot(instanceID, snsarn) -> True:
    """
    Publish an SNS message about the reboot to snsarn
    """
    client = boto3.client('sns', region_name='us-east-1')
    client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
    return True


def lambda_handler(event, context) -> "status about reboot":
    """
    event: see events/event.json
    """
    print('EVENT:')
    print(event)
    for record in event.get('Records', None):
        sns = record.get('Sns', None)
        message = json.loads(sns.get('Message', None))
        msgalarm = message.get('AlarmName', None)
        msgstatus = message.get('NewStateValue', None)
        if not all([sns, message, msgalarm, msgstatus]):
            continue
        if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
            notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
            return 'rebooting'
        else:
            return 'nothing to do'
    return 'no sns record found'
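For a quick local test of the handler, an SNS-wrapped CloudWatch alarm event of the shape it expects can be passed in directly; the values below are made-up placeholders, and the module's environment variables must be set first:

# Hypothetical test invocation with a minimal SNS-wrapped alarm message.
# Note: this really calls reboot_instance, so only run it against a test instance.
import json

fake_event = {
    'Records': [{
        'Sns': {
            'Message': json.dumps({
                'AlarmName': 'my-website-healthcheck-alarm',  # must match ALARM_NAME
                'NewStateValue': 'ALARM',
            })
        }
    }]
}
print(lambda_handler(fake_event, None))  # expect 'rebooting'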
I have released the whole tested automation, with a SAM template and installation instructions, at https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot

In Amazon SWF, can I abuse a Decision task to actually perform the work

I need Amazon SWF to distribute some work, make sure it's done asynchronously, make sure it's stored in a reliable way, and that it's automatically restarted. However, the workflow logic I need is extremely simple: it's just to get a single task executed.
I have now implemented it the way it's supposed to be done:
Request workflow execution
Decider finds out about it and schedules an activity
Worker finds out about the activity request, performs the work and returns the results
Decider notices the result and copies it over into a workflow completion
It seems to me that I can just have the decider do the work – as it were – and complete the workflow execution immediately. That would do away with a lot of code. (The activity might also fail, time out, etc. - all things that I currently need to cater for.)
So back to my question: can I have a decider that performs the work itself and completes the 'workflow' immediately?
Yes. Actually, I think you came up with an interesting use case: using a minimal workflow as a centralized locking mechanism for one-off actions in a distributed system - such as cron jobs executed from a single host in a fleet of many (the hosts have to first undergo election, and whichever wins the lock gets to execute the action). The same could be achieved with Amazon SWF and a minimal amount of code:
A small Python example, using boto.swf (use step 1 from this post to set up the domain):
To code the decider:
# MyDecider.py
import boto.swf.layer2 as swf

class OneShotDecider(swf.Decider):

    domain = 'stackoverflow'
    task_list = 'default_tasks'
    version = '1.0'

    def run(self):
        history = self.poll()
        if 'events' in history:
            decisions = swf.Layer1Decisions()
            print 'got the decision task, doing the work'
            decisions.complete_workflow_execution()
            self.complete(decisions=decisions)
            return False
        return True
To start the decider:
$ ipython -i decider.py
In [1]: while OneShotDecider().run(): print 'polling SWF for decision tasks'
Finally, to start the workflow:
$ ipython
In [1]: wf_type = swf.WorkflowType(domain='stackoverflow', name='MyWorkflow', version='1.0', task_list='default_tasks')
In [2]: wf_type.start()
Out[2]: <WorkflowExecution 'MyWorkflow-1.0' at 0x32e2a10>
Back in the decider window, you'll see something like:
polling SWF for decision tasks
polling SWF for decision tasks
got the decision task, doing the work
If your workflow is likely to evolve its business logic or grow in the number of activities, it's probably best to stick to the standard way of having Deciders doing the business logic and Workers solving the tasks.
While yes, you can do this (as pointed out by the other answer), there are some things to consider before doing so:
Why are you using SWF to execute this task? Why bother setting it up as a workflow and paying for "StartWorkflow" executions if you can get the same benefit by just invoking your code more directly? If you need to track execution submissions and completions, you can just use an SQS queue for this and get the same results for cheaper (a short sketch follows after these points).
Your workflows might be extremely simple right now, but they often can and do evolve to be more complex over time. Designing it right from the start can save time in the long run. Do you want future developers working on your code thinking that they should just add more logic to the workflow? Will they know to lookup how to use activities, or just follow the existing pattern you've started with? (Hint - they'll be likely to copy your pattern - developers are lazy :))
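To illustrate the SQS alternative mentioned in the first point, a rough boto3 sketch, where the queue name and message body are placeholders:

# Hypothetical SQS-based submit/track flow for a one-off task.
import json
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.create_queue(QueueName='one-off-tasks')['QueueUrl']

# Submit the work item.
sqs.send_message(QueueUrl=queue_url,
                 MessageBody=json.dumps({'job': 'nightly-report'}))

# Worker side: receive, do the work, then delete to mark completion.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           WaitTimeSeconds=10)
for msg in resp.get('Messages', []):
    job = json.loads(msg['Body'])
    # ... do the work ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])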

Django-celery project, how to handle results from result-backend?

1) I am currently working on a web application that exposes a REST API and uses Django and Celery to handle requests and solve them. For a request to get solved, a set of Celery tasks has to be submitted to an AMQP queue so that they get executed on workers (situated on other machines). Each task is very CPU-intensive and takes very long (hours) to finish.
I have configured Celery to also use AMQP as the result backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks to the RabbitMQ queue - has access to this storage DB, so the results from the workers have to somehow return to this machine.
The question is: how can I process the results of the task executions afterwards? After a worker finishes, its result gets stored in the configured result backend (AMQP), but now I don't know what the best way would be to get the results from there and process them.
All I could find in the documentation is that you can either check on the result's status from time to time with:
result.state
which means that I basically need a dedicated piece of code that runs this command periodically, and therefore keeps a whole thread/process busy only with this, or block everything with:
result.get()
until a task finishes, which is not what I want.
The only solution I can think of is to have an extra thread on the "central node" that periodically runs a function checking the async_results returned by each task at its submission, and takes action once a task has a finished status.
Does anyone have any other suggestions?
Also, since the result-backend processing takes place on the "central node", my aim is to minimize the impact of this operation on that machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned by the workers and put in the result backend? (assuming that a result backend has been configured)
I'm not sure if I fully understand your question, but take into account that each task has a task id. If tasks are being sent by users, you can store the ids and then check for the results using JSON as follows:
# urls.py
from django.conf.urls import patterns, url
from djcelery.views import is_task_successful

urlpatterns += patterns('',
    url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
        name='celery-is_task_successful'),
)
Another related concept is that of signals: each finished task emits a signal, and a successfully finished task will emit a task_success signal. More can be found in the Celery docs on real-time processing.
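As a sketch of that real-time processing route for the asker's setup (results consumed on the central node rather than by the workers), an event receiver could look roughly like this; it assumes the workers are started with events enabled (the -E flag) and store_result is a hypothetical persistence helper:

# Central-node monitor: consume task events from the broker and persist results.
from celery import Celery

app = Celery(broker='amqp://guest@localhost//')  # placeholder broker URL

def on_task_succeeded(event):
    # 'result' in a task-succeeded event is the repr of the task's return value
    store_result(task_id=event['uuid'], result=event['result'])  # hypothetical helper

with app.connection() as connection:
    recv = app.events.Receiver(connection, handlers={
        'task-succeeded': on_task_succeeded,
        '*': lambda event: None,  # ignore everything else
    })
    recv.capture(limit=None, timeout=None, wakeup=True)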