Creating an instance in Lambda and assigning an Elastic IP. The create-instance part works; the code section below is meant to wait before assigning the Elastic IP. (1) What state does the instance have to be in before the address can be associated? (I am assuming "running".) (2) The logic below never proceeds past the loop, although I can verify that the instance goes into the "running" state. I verified that the instance ID is good, and there is only one instance in this case.
print ('waiting')
newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
while (newresp['InstanceStatuses'][0]['InstanceState']['Name'] != 'running'):
    newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
print ('New Instance Running')
The "timeout" setting is important in Lambda. It affects loops and timers: if you have for/while loops, sleep calls, or other functions that take time, the function's timeout must be configured to allow for them.
When I created the function the default setting was 3s execution time. The statement after the loop was never reached because the function timed out.
You can set the timeout from the function "Configuration" tab under General Configuration.
You have to remember though, pricing is based on the amount of time your code runs. I set it to 30s, and checked status at 5s intervals in the loop. Here's the final code, starting from the loop section, including the associate command:
....
while (newresp['InstanceStatuses'][0]['InstanceState']['Name'] != 'running'):
    time.sleep(5)
    newresp = ec2_client.describe_instance_status(InstanceIds=newins_list,IncludeAllInstances=True)
print ('Associating Elastic IP')
try:
    ec2_client.associate_address(AllocationId='eipalloc-0805514a57680aaf8',InstanceId=newins,AllowReassociation=True)
    print ('New Instance Running')
except ClientError as e:
    print(e)
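Because the loop has to finish inside the function timeout, one option is to bound the wait by the time the invocation has left. A minimal sketch, assuming the handler's `context` object is passed in (`get_remaining_time_in_millis()` is provided by the Lambda Python runtime); the helper name `wait_until_running` and the `safety_ms` margin are hypothetical:

```python
import time


def wait_until_running(ec2_client, instance_id, context,
                       poll_seconds=5, safety_ms=2000, sleep=time.sleep):
    """Poll until the instance is 'running' or the Lambda is nearly out of time.

    Returns True if the instance reached 'running', False if we gave up early.
    """
    # Only start another poll if there is room for one more sleep plus a margin.
    while context.get_remaining_time_in_millis() > safety_ms + poll_seconds * 1000:
        resp = ec2_client.describe_instance_status(
            InstanceIds=[instance_id], IncludeAllInstances=True)
        statuses = resp['InstanceStatuses']
        if statuses and statuses[0]['InstanceState']['Name'] == 'running':
            return True
        sleep(poll_seconds)
    return False
```

With this shape the function returns False instead of being killed by the timeout, so the caller can log the failure or retry on a later invocation.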
I am using Spot Instances to run some batch jobs.
However lately we have been seeing a lot of spot instance terminations and want to use the 2-minute interruption notice that aws sends before an instance is terminated.
Sources:
https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
My approach here was to run a separate thread in my application that polls the instance meta-data url
http://169.254.169.254/latest/meta-data/spot/instance-action to check if a termination notice has been sent out and raise an exception (or re-trigger the current job)
My code
interruption_monitor.py
import requests
import log
from time import sleep
from threading import Thread


class InstanceTerminated(Exception):
    """Instance Terminated Exception class"""


class InterruptionMonitor(Thread):
    """Threaded Interruption monitor"""

    def __init__(self, sleep_time=0.1, report_time=5):
        super().__init__(daemon=True)
        self.sleep_time = min(sleep_time, report_time)
        self.count = 0
        self.logger = log.get_logger(f"my_app.{__name__}")

    def check_interruption_notice(self):
        """Check for interruption Notice"""
        self.logger.info("CHECKING FOR INTERRUPTION NOTICE...")
        url = 'http://169.254.169.254/latest/meta-data/spot/instance-action'
        response = requests.get(url=url, timeout=5)
        self.logger.info("RESPONSE:", resp=response)
        if response.status_code != 404:
            # print(response)
            # The body is JSON with an "action" key; a requests Response
            # object has no `.action` attribute, so parse the body instead.
            action = response.json().get('action')
            if action == 'stop' or action == 'terminate':
                raise InstanceTerminated  # Or retrigger the Job

    def run(self):
        """Entry point for thread execution"""
        while True:
            self.check_interruption_notice()
            sleep(self.sleep_time)
            self.count += 1
There are 2 questions that I am looking for answers to:
Is this the correct way of handling this? Would there be any added cost, or would this affect my existing job performance in any way? If yes, how? If not, please suggest a better approach.
I am not able to test the positive scenario, as I have to wait for AWS to interrupt my spot instances to see if it works as I expected. Is there a way to manually cause a spot instance termination so that I receive the interruption notice and can verify that this works?
PS: I am a noob with AWS, so please bear with me
Would there be any added cost, or would this affect my existing job performance in any way?
There is no added cost for running another thread in your current ECS processes. Why would there be? Please take the time to understand how ECS bills you if you are concerned about that. ECS doesn't bill per thread.
There could definitely be a performance hit if you poll too often. Your default setting of polling every 0.1 seconds is way too fast. I don't understand what you are doing with the sleep_time and report_time values, but the AWS documentation you linked to recommends polling every 5 seconds, not every 0.1 seconds.
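As a rough illustration of polling at the 5-second interval, here is a minimal sketch. It signals the main thread through a `threading.Event` rather than raising, because an exception raised inside a thread's `run()` does not propagate to the thread that started it. The `fetch` callable is hypothetical: it is assumed to return a `(status_code, body)` pair, for example by wrapping `requests.get(url, timeout=5)`, and the metadata endpoint is assumed to be reachable without a session token.

```python
import json
import threading


def parse_interruption(status_code, body):
    """Return the pending action (e.g. 'stop' or 'terminate'), or None."""
    if status_code != 200:
        return None  # 404 means no interruption notice has been issued
    try:
        return json.loads(body).get('action')
    except (ValueError, AttributeError):
        return None  # body was not the expected JSON object


def monitor(fetch, interrupted, stop, poll_seconds=5):
    """Poll fetch() until an action is seen or the `stop` event is set."""
    while not stop.is_set():
        status_code, body = fetch()
        if parse_interruption(status_code, body):
            interrupted.set()  # the main thread checks this flag
            return
        stop.wait(poll_seconds)  # sleeps, but wakes early if stop is set
```

The job's main loop can then check `interrupted.is_set()` between units of work and checkpoint or re-queue the job when it becomes true.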
Is there a way to manually cause the spot instance terminations so that I receive the interruption notice and verify that this works.
Unfortunately, there is no way to manually trigger that on ECS that I am aware of.
I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website is unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?
This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to the instances. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check.
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer since metrics are calculated over a period rather than being "now" metrics).
John's approach is the correct one, but at its simplest:
Write a Lambda function that queries your website to see whether it is running and, if it is not, restarts the instance.
Set up a CloudWatch Events rule that calls the Lambda function on a schedule you choose.
I'll leave to you the work of writing the code that determines whether the website is functional and restarts the server, but that is pretty straightforward. You can use Python, Java, Node, Go or .NET Core in your Lambda function. I would think Python would be the easiest in this case, but that is an opinion.
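The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `site_is_up` and `check_and_reboot` are hypothetical, the EC2 client is passed in by the caller, and a single failed check triggers the reboot (a scheduled rule fires at most once a minute, so a "down for 60 seconds" rule would need extra state).

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures and timeouts.
        return False


def check_and_reboot(url, instance_id, ec2_client, is_up=site_is_up):
    """Reboot the instance if the site does not respond; report what happened."""
    if is_up(url):
        return 'healthy'
    ec2_client.reboot_instances(InstanceIds=[instance_id])
    return 'rebooted'
```

Inside a handler this would be called with something like `check_and_reboot(url, instance_id, boto3.client('ec2'))`, with the URL and instance ID supplied through environment variables of your choosing.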
It is clear that this is not a best practice in AWS, but it can make some sense, e.g. if you are running a small personal web server with low demand, where availability is less of an issue than cost.
At least that was my reason why I built automation for it.
lambda code
import json
import os
import boto3
import time

env_vars = [
    'ALARM_NAME',
    'REGION',
    'INSTANCE_ID',
    'OUTPUT_SNS_ARN'
]
ENV = {}
for env_var in env_vars:
    ENV[env_var] = os.environ.get(env_var, None)
    if not ENV[env_var]:
        raise Exception(f"Environment variable {env_var} must be set!")


def reboot_instance(instanceID, regionName) -> "instanceID":
    """
    InstanceID
    instanceID - ID of instance
    regionName - name of region
    return InstanceID; raises an Exception if the instance does not stop
    """
    ec2 = boto3.resource('ec2', region_name=regionName)
    instance = ec2.Instance(instanceID)
    try:
        # Ask for a normal stop first, then force it after 30 seconds.
        instance.stop()
        time.sleep(30)
        instance.stop(Force=True)
    except Exception:
        pass
    for i in range(180):  # wait 3 minutes
        instance = ec2.Instance(instanceID)  # re-create to read a fresh state
        if instance.state['Code'] == 80:  # 80 == stopped
            break
        time.sleep(1)
    else:
        # The loop finished without a break: the instance never stopped.
        raise Exception('Unable to stop instance')
    instance.start()
    return instanceID


def notify_about_reboot(instanceID, snsarn) -> True:
    """
    Put SNS message about reboot to snsarn
    """
    client = boto3.client('sns', region_name='us-east-1')
    client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
    return True


def lambda_handler(event, context) -> "status about reboot":
    """
    event: see events/event.json
    """
    print('EVENT:')
    print(event)
    # Default to empty values so malformed events are skipped, not crashed on.
    for record in event.get('Records', []):
        sns = record.get('Sns') or {}
        message = json.loads(sns.get('Message') or '{}')
        msgalarm = message.get('AlarmName', None)
        msgstatus = message.get('NewStateValue', None)
        if not all([sns, message, msgalarm, msgstatus]):
            continue
        if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
            notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
            return 'rebooting'
        else:
            return 'nothing to do'
    return 'no sns record found'
I have also released the whole tested automation, with a SAM template and installation instructions, at https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot
I have the following Python code to detect whether an EC2 instance has really started, but it completes as soon as the "instance state" shows running.
Which API function should I use to block until the EC2 "status check" shows "2/2 checks passed"?
ec2 = boto3.resource('ec2')
instance = ec2.Instance(instanceid)
instance.wait_until_running()
It is rare that you would need to wait for the status check to pass.
When an instance enters the running state, the machine boots, loads the operating system and generally "runs".
The EC2 Status Checks are an independent process that check attributes of the virtual machine. However, your machine is normally running, and you can login to it, well before the status checks show a positive response.
If you do wish to wait for the Status Check, there are two waiters that might do this, but the documentation is unclear:
InstanceStatusOk
SystemStatusOk
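In boto3 these two waiters are exposed on the EC2 client as `instance_status_ok` and `system_status_ok`. A minimal sketch that blocks on both; the helper name `wait_for_status_checks` is hypothetical, and the client is passed in by the caller:

```python
def wait_for_status_checks(ec2_client, instance_id):
    """Block until both the system and the instance status checks pass."""
    for waiter_name in ('system_status_ok', 'instance_status_ok'):
        # Each waiter polls describe_instance_status until the check is 'ok'
        # and raises botocore.exceptions.WaiterError if it gives up first.
        ec2_client.get_waiter(waiter_name).wait(InstanceIds=[instance_id])
```

It would be called as `wait_for_status_checks(boto3.client('ec2'), instanceid)` after `instance.wait_until_running()`.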
I'm looking for the ability to programmatically schedule a lambda function to run a single time with another lambda function. For example, I made a request to myFirstFunction with date and time parameters, and then at that date and time, have mySecondFunction execute. Is that possible only with stateless AWS services? I'm trying to avoid an always-on ec2 instance.
Most of the results I'm finding for scheduling a lambda functions have to do with cloudwatch and regularly scheduled events, not ad-hoc events.
This is a perfect use case for aws step functions.
Use Wait state with SecondsPath or TimestampPath to add the required delay before executing the Next State.
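A state machine along these lines might look like the sketch below, built here as a Python dict that serializes to the Amazon States Language JSON. The state names and the `run_at` input field are hypothetical; the input is assumed to carry an ISO 8601 timestamp such as `{"run_at": "2030-01-01T09:00:00Z"}`.

```python
import json


def build_definition(second_function_arn):
    """Return a two-state definition: wait until $.run_at, then run the task."""
    return {
        "StartAt": "WaitUntil",
        "States": {
            "WaitUntil": {
                "Type": "Wait",
                # Read the wake-up time from the execution input.
                "TimestampPath": "$.run_at",
                "Next": "RunSecondFunction",
            },
            "RunSecondFunction": {
                "Type": "Task",
                "Resource": second_function_arn,  # ARN of mySecondFunction
                "End": True,
            },
        },
    }


def definition_json(second_function_arn):
    """Serialize the definition for use when creating the state machine."""
    return json.dumps(build_definition(second_function_arn), indent=2)
```

myFirstFunction would then start one execution per request, passing the requested date and time as the execution input.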
What you're trying to do (schedule a Lambda from a Lambda) is not possible with the current AWS services.
So, in order to avoid an always-on ec2 instance, there are other options:
1) Use AWS default or custom metrics. You can use, for example, ApproximateNumberOfMessagesVisible or CPUUtilization (if your app drives CPU utilization up when it processes a request). You can also create a custom metric and publish it when your instance is idle (depending on the app that's running on your instance).
The problem with this option is that you'll waste minutes you have already paid for (AWS always charges for a full hour, even if you used your instance for only 15 minutes).
2) A better option, in my opinion, would be to run a Lambda function once per minute to check if your instances are idle and shut them down only if they are close to the full hour.
import boto3
from datetime import datetime, timezone


def lambda_handler(event, context):
    print('ManageInstances function executed.')
    environments = [['instance-id-1', 'SQS-queue-url-1'], ['instance-id-2', 'SQS-queue-url-2']]  # ...
    ec2_client = boto3.client('ec2')
    for environment in environments:
        instance_id = environment[0]
        queue_url = environment[1]
        print('Instance:', instance_id)
        print('Queue:', queue_url)
        rsp = ec2_client.describe_instances(InstanceIds=[instance_id])
        if rsp:
            status = rsp['Reservations'][0]['Instances'][0]
            if status['State']['Name'] == 'running':
                # LaunchTime is timezone-aware (UTC), so compare in UTC.
                current_time = datetime.now(timezone.utc)
                diff = current_time - status['LaunchTime']
                total_minutes = divmod(diff.total_seconds(), 60)[0]
                minutes_to_complete_hour = 60 - divmod(total_minutes, 60)[1]
                print('Started time:', status['LaunchTime'])
                print('Current time:', str(current_time))
                print('Minutes passed:', total_minutes)
                print('Minutes to reach a full hour:', minutes_to_complete_hour)
                # Only consider stopping when the paid hour is almost used up.
                if minutes_to_complete_hour <= 2:
                    sqs_client = boto3.client('sqs')
                    response = sqs_client.get_queue_attributes(QueueUrl=queue_url, AttributeNames=['All'])
                    messages_in_flight = int(response['Attributes']['ApproximateNumberOfMessagesNotVisible'])
                    messages_available = int(response['Attributes']['ApproximateNumberOfMessages'])
                    print('Messages in flight:', messages_in_flight)
                    print('Messages available:', messages_available)
                    # Stop only when the queue has no pending or in-flight work.
                    if messages_in_flight + messages_available == 0:
                        ec2_resource = boto3.resource('ec2')
                        instance = ec2_resource.Instance(instance_id)
                        instance.stop()
                        print('Stopping instance.')
            else:
                print('Status was not running. Nothing is done.')
        else:
            print('Problem while describing instance.')
UPDATE - I wouldn't recommend using this approach. The timing of TTL deletions has changed, and they no longer happen close to the TTL time. The only guarantee is that the item will be deleted some time after the TTL. Thanks @Mentor for highlighting this.
2 months ago AWS announced DynamoDB item TTL, which allows you to insert an item and mark when you wish for it to be deleted. It will be deleted automatically when the time comes.
You can use this feature in conjunction with DynamoDB Streams to achieve your goal: your first function inserts an item into a DynamoDB table. The record's TTL should be the time at which you want the second Lambda triggered. Set up a stream that triggers your second Lambda. In this Lambda you identify deletion events, and if the event is a delete, you run your logic.
Bonus point - you can use the table item as a mechanism for the first lambda to pass parameters to the second lambda.
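A minimal sketch of the second Lambda's filtering step, assuming the stream is configured to include old images. TTL deletions arrive as `REMOVE` records whose `userIdentity` names the DynamoDB service, which distinguishes them from deletes made by your own code; the helper name `ttl_expired_items` is hypothetical.

```python
def ttl_expired_items(event):
    """Return the old images of items that were removed by DynamoDB TTL."""
    expired = []
    for record in event.get('Records', []):
        identity = record.get('userIdentity') or {}
        if (record.get('eventName') == 'REMOVE'
                and identity.get('type') == 'Service'
                and identity.get('principalId') == 'dynamodb.amazonaws.com'):
            # OldImage holds the deleted item, i.e. the parameters that the
            # first Lambda stored for the second one.
            expired.append(record['dynamodb'].get('OldImage', {}))
    return expired
```

The handler would call this on the incoming stream event and run its logic once per returned item.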
About DynamoDB TTL:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
It does depend on your use case, but wanting to trigger something at a later date is a common pattern. The way I do it serverless is to have a React application trigger an action that stores a date in the future. I take a date string such as 24-12-2020 and convert it with Date(), after checking that the format is parsed correctly (so I might also try 12-24-2020 and see what I get!). When I am happy with the result, I convert it to a Unix timestamp. In JavaScript (React) I use this code:
new Date(action.data).getTime() / 1000
where action.data is the date and maybe the time for the action.
I run React in Amplify (serverless) and store the timestamp in DynamoDB (serverless). I then run a Lambda function (serverless) that checks DynamoDB for any dates and compares the stored Unix time with the current Unix time. Both are numbers, so the comparison is easy. This seems to me to be super easy and very reliable.
I just set the cron schedule on the Lambda to whatever frequency is required. In most cases running the Lambda every five minutes is pretty good, although if I were only operating this in a certain time zone for a business-weekday application I would restrict the schedule a little more. Lambda is free for the first 1 million requests per month, so running it every few minutes will cost nothing. Obviously things change, so you will need to look that up for your region.
You will never get perfect timing in this scenario. For the vast majority of use cases, however, it will be close enough, depending on how often the Lambda function is scheduled. You could set it up to check every minute or just once per day; it all depends on your application.
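The comparison step in the scheduled Lambda can be as small as the sketch below. The `run_at` attribute name is hypothetical, and `items` is assumed to be a list of records already read from the table, each holding a Unix timestamp in seconds.

```python
import time


def due_items(items, now=None):
    """Return the items whose stored Unix time has been reached."""
    now = time.time() if now is None else now
    return [item for item in items if item['run_at'] <= now]
```

Each invocation would process the returned items and then mark or delete them so they are not picked up again on the next run.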
Alternatively, if I wanted an instant reaction to an event I might use SNS, SQS, or Kinesis to stream a message instantly. Again, it all depends on your use case.
I'd opt for enqueuing deferred work to SQS using message timers in myFirstFunction.
Currently, you can't use SQS as a Lambda event source, but you can either periodically schedule mySecondFunction to check the queue via scheduled CloudWatch Events (somewhat of a variant of the other options you've found) or use a CloudWatch alarm on the ApproximateNumberOfMessagesVisible to fire an SNS message to a Lambda and avoid constant polling for queues that are frequently inactive for long periods.
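Enqueuing with a message timer might look like the following sketch. Note that SQS limits `DelaySeconds` to 900 (15 minutes), so this only covers short deferrals on its own. The helper name `enqueue_deferred` is hypothetical, and the SQS client is passed in by the caller.

```python
import json

MAX_DELAY_SECONDS = 900  # SQS message timers allow at most 15 minutes


def enqueue_deferred(sqs_client, queue_url, payload, delay_seconds):
    """Send a message that stays invisible to consumers for delay_seconds."""
    if not 0 <= delay_seconds <= MAX_DELAY_SECONDS:
        raise ValueError('delay_seconds must be between 0 and 900')
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(payload),
        DelaySeconds=delay_seconds,
    )
```

myFirstFunction would call this with `boto3.client('sqs')`, and mySecondFunction would read the queue once the delay has passed.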
I'm running a loop in a PHP script that's on an AWS instance. From my experience with AWS, as soon as the instance is stopped, all of the code that's in the process of being executed is stopped. What I have is this:
<?php
require("vendor/autoload.php");
use Aws\Ec2\Ec2Client;
$instance_id = 'instance_id';
$creds = array('key' => 'key',
               'secret' => 'secret',
               'region' => 'us-west-2');
$client = Ec2Client::factory($creds);
$instance = array('InstanceIds' => array($instance_id), 'DryRun' => false);
for($i=0;$i<10;$i++) {
    // Execute irrelevant code
    // .....
    $result = $client->stopInstances($instance);
    sleep(300);
    $result = $client->startInstances($instance);
}
?>
So, my question is this: Once the instance is stopped, everything that is written after that will not be executed since the instance will be stopped, right? The loop will not continue on to the next iteration, right? If so, then how could I get around that?
When you call the stopInstances API, EC2 will start shutting down your instance (and the OS inside the instance will kill running processes as part of that).
There's no guarantee exactly how long this will take, although in my experience you'll rarely get more than a couple of seconds, so that sleep(300) pretty much guarantees that the call to stopInstances is the last thing that your code will do.
There's nothing you can do about this other than not stopping the instance you are running on. To that end you can query the instance metadata service to find out what the id of the instance running your code is. You can get this data by making a request to http://169.254.169.254/latest/meta-data/instance-id
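A guard based on that metadata request could be sketched as follows (in Python rather than PHP, purely as an illustration). The helper names are hypothetical, and the sketch assumes the metadata endpoint is reachable without a session token.

```python
import urllib.request

METADATA_URL = 'http://169.254.169.254/latest/meta-data/instance-id'


def _http_get(url, timeout=2):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode('utf-8')


def is_own_instance(target_instance_id, fetch=_http_get):
    """Return True if the target is the instance this code is running on."""
    return fetch(METADATA_URL).strip() == target_instance_id
```

A control script would skip the stop call, or hand it to another machine, whenever `is_own_instance(...)` returns True.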
You cannot start an instance that is Stopped from the same instance. You can keep an additional (external) server either on EC2 or otherwise to control automatic shutdowns/startups.
To follow on from @TJ-'s answer...
You can check to see if the instance is stopped and then continue with your code
$client->waitUntil('InstanceStopped', array('InstanceIds' => array($instance_id)));
But you have to run this from a different instance than the one being stopped.