EBS volume stuck on 'creating' when using Boto API - amazon-web-services

I'm attempting to create and attach a new EBS volume to an existing instance using Boto. The Boto script is running on the instance itself.
The problem is that the status continuously returns 'creating' much of the time. (Frustratingly, not always!) The code snippet is:
volume = conn.create_volume(args.ebs_volume_size, instance.placement)
status = ''
while status != 'available':
    status = conn.get_all_volumes([volume.id])[0].status
    print "Volume status: %s" % status
    time.sleep(4)
Most of the time, it hangs on 'creating', even though the volume is created and available (it can be seen in the management console as ready to go). Sometimes, it works fine. I must be missing something obvious... but what?

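Right after you call the create_volume method, call update() on the newly created volume inside your polling loop. update() re-queries EC2 and refreshes the status stored on the volume object.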
volume = conn.create_volume(args.ebs_volume_size, instance.placement)
while volume.status != 'available':
    time.sleep(5)
    volume.update()
    print volume.status
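The loop above never gives up if the volume ends in an error state. A minimal sketch of the same polling idea with a timeout follows; `wait_for_volume` and its parameters are hypothetical names, and it only assumes an object with an `update()` method and a `status` attribute, as the boto volume in the answer has.

```python
import time


def wait_for_volume(volume, target="available", timeout=300, interval=5, sleep=time.sleep):
    """Poll volume.update() until volume.status equals target.

    `volume` is any object exposing update() and a `status` attribute
    (assumption: this matches the boto Volume used in the answer).
    Returns True when the target status is reached, False on timeout.
    """
    waited = 0
    while volume.status != target:
        if waited >= timeout:
            return False  # give up instead of looping forever
        sleep(interval)
        waited += interval
        volume.update()  # refresh the cached status
    return True
```

With the answer's variables this would be called as `wait_for_volume(volume)`, checking the return value before attaching the volume.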

Related

How to automatically stop Sagemaker notebook instances if it is idle?

I have been looking for a script to automatically stop SageMaker notebook instances that someone forgot to shut down or that are idle. The few scripts I found don't work very well (e.g. link: it only checks whether an .ipynb file is live, and I'm not using .ipynb files; others rely on the last-updated timestamp, which never changes until you shut down or open the instance).
Is there a resource or script you can recommend?
You can use the following script to find idle instances. You can modify the script to stop the instance if idle for more than 5 minutes or have a cron job to stop the instance.
import datetime
import boto3

idle_threshold_seconds = 5 * 60  # 5 minutes

sm_client = boto3.client('sagemaker')
response = sm_client.list_notebook_instances()
now = datetime.datetime.now(datetime.timezone.utc)
for item in response['NotebookInstances']:
    # LastModifiedTime is a timezone-aware datetime; measure the time elapsed since then
    idle_seconds = (now - item['LastModifiedTime']).total_seconds()
    print(idle_seconds / 60)
    if idle_seconds > idle_threshold_seconds:
        print('Notebook {0} has been idle for more than {1} minutes'.format(item['NotebookInstanceName'], idle_threshold_seconds // 60))
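The answer suggests modifying the script to stop idle instances. A rough sketch of that modification is below. `stop_idle_notebooks` is a hypothetical helper name; it takes the SageMaker client as a parameter, reads only the first page of `list_notebook_instances`, and treats `LastModifiedTime` as a proxy for idleness, which, as the question points out, may not reflect real activity.

```python
import datetime


def stop_idle_notebooks(sm_client, idle_threshold_seconds=300, now=None):
    """Stop InService notebook instances not modified within the threshold.

    sm_client is expected to behave like boto3.client('sagemaker').
    Returns the list of notebook instance names that were asked to stop.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    stopped = []
    response = sm_client.list_notebook_instances()
    for item in response["NotebookInstances"]:
        # Only running instances can be stopped
        if item["NotebookInstanceStatus"] != "InService":
            continue
        idle_seconds = (now - item["LastModifiedTime"]).total_seconds()
        if idle_seconds > idle_threshold_seconds:
            name = item["NotebookInstanceName"]
            sm_client.stop_notebook_instance(NotebookInstanceName=name)
            stopped.append(name)
    return stopped
```

Run from a cron job or scheduled Lambda, this would be called as `stop_idle_notebooks(boto3.client('sagemaker'))`.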

AWS Win Tasks aren't running after server restart via Lambda

I have set up a simple scheduled task that turns the AWS Windows server off, and a simple Lambda function that turns the stopped instances back on. See the code:
import time
import json
import boto3

def lambda_handler(event, context):
    # boto3 clients
    client = boto3.client('ec2')
    ssm = boto3.client('ssm')
    # getting instance information
    describeInstance = client.describe_instances()
    #print("eddie", describeInstance["Reservations"][0]["Instances"][0]["State"]["Name"])
    InstanceId = []
    # fetching the instance IDs of the stopped instances
    for i in describeInstance['Reservations']:
        for instance in i['Instances']:
            if instance["State"]["Name"] == "stopped":
                InstanceId.append(instance['InstanceId'])
    client.start_instances(InstanceIds=InstanceId)
    """looping through instance ids
    for instanceid in InstanceId:
        # command to be executed on instance
        client.start_instances(InstanceIds=InstanceId)
        print(output)"""
    return {
        'statusCode': 200,
        'body': json.dumps('Thanks from Srce Cde!')
    }
I have several tasks in Task Scheduler that run Python scripts, but after the instance is started those tasks aren't running, even though the server is running.
Note: I have tried setting "Run whether user is logged on or not" on all of them, but I received (0x1) errors related to privileges on those tasks.
Does anyone know how to solve this? Is there any other way to turn an EC2 Windows server off and on at night to save on billing?

Restart EC2 instance on Website unavailability

I have a website hosted on an EC2 server. I want to monitor the website endpoint and restart the EC2 instance if the website is unavailable for a certain time frame (say 60 seconds).
What tools do I use in AWS and how do I accomplish this?
This is not a recommended approach.
Firstly, if a website is unavailable, you would probably want to investigate the cause rather than just restarting the instance. Your goal should be to run a stable system by removing root causes of problems rather than just ignoring the problem by restarting all the time.
The recommended design would be to run in a Highly Available configuration with:
The application running on at least two servers across at least two Availability Zones (in case of failure of an AZ). This is not necessarily more expensive because each server can be smaller than a single, large server.
A load balancer in front of the instances, distributing the traffic to the instances. The load balancer also performs continuous health checks and stops sending requests to servers that fail the health check.
An Auto Scaling group that can terminate unhealthy instances and automatically launch replacement servers. This also works well if an Availability Zone should fail.
In this design, an unhealthy instance would be terminated (stopped and destroyed) and a new instance created with a pre-defined disk image and startup script. Alternatively, you might choose to move bad instances out of the Auto Scaling group for investigation of the problem, with a new instance being launched to take its place.
If your application requires a database, the database should be external to the instances so that all instances can connect to the database and replacing application instances does not cause any data loss.
As to the speed of noticing problems on a server, the load balancer can perform checks every few seconds. Amazon CloudWatch, on the other hand, would need at least a minute to detect problems (probably longer since metrics are calculated over a period rather than being "now" metrics).
John's approach is the correct one, but at its simplest:
Write a Lambda function that queries your website to see whether it is running and, if it is not, restarts the instance.
Set up a CloudWatch Events rule that calls the Lambda function at a frequency you choose.
I'll leave to you the work of writing the code that determines if the website is functional and restarting the server - but that is pretty straightforward. You can use python, java, node, go or .net core in your lambda function - I would think python would be the easiest in this case, but that is an opinion.
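As a starting point for the code left to the reader, here is a minimal sketch of such a Lambda function. The function names and the `SITE_URL` and `INSTANCE_ID` environment variables are assumptions made for illustration, and "functional" is reduced to "the URL answers with a 2xx or 3xx status".

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors and malformed URLs all count as "down"
        return False


def check_and_reboot(url, instance_id, ec2_client, is_up=site_is_up):
    """Reboot the instance when the site check fails; report what was done."""
    if is_up(url):
        return "healthy"
    ec2_client.reboot_instances(InstanceIds=[instance_id])
    return "rebooted"


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS libraries
    import os
    import boto3

    # SITE_URL and INSTANCE_ID are hypothetical environment variable names
    return check_and_reboot(
        os.environ["SITE_URL"],
        os.environ["INSTANCE_ID"],
        boto3.client("ec2"),
    )
```

A single failed check triggers a reboot here. To require the site to be down for a full time window, the function would need to keep state between invocations or rely on an alarm instead.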
It is clear that this is not a best practice in AWS, but it can make sense in some cases, e.g. if you are running a small personal web server with low demand, where availability is less of an issue than cost.
At least that was my reason why I built automation for it.
lambda code
import json
import os
import boto3
import time

env_vars = [
    'ALARM_NAME',
    'REGION',
    'INSTANCE_ID',
    'OUTPUT_SNS_ARN'
]
ENV = {}
for env_var in env_vars:
    ENV[env_var] = os.environ.get(env_var, None)
    if not ENV[env_var]:
        raise Exception(f"Environment variable {env_var} must be set!")
def reboot_instance(instanceID, regionName) -> "instanceID":
    """
    Stop the instance, wait until it is stopped, then start it again.
    instanceID - ID of instance
    regionName - name of region
    Returns instanceID; raises an Exception if the instance does not stop in time.
    """
    ec2 = boto3.resource('ec2', region_name=regionName)
    instance = ec2.Instance(instanceID)
    try:
        instance.stop()
        time.sleep(30)
        instance.stop(Force=True)
    except Exception:
        # Ignore stop errors; the loop below checks the actual state
        pass
    for i in range(180):  # wait about 3 minutes
        instance = ec2.Instance(instanceID)
        if instance.state['Code'] == 80:  # 80 = stopped
            break
        time.sleep(1)
    else:
        raise Exception('Unable to stop instance')
    instance.start()
    return instanceID
def notify_about_reboot(instanceID, snsarn) -> True:
    """
    Put SNS message about reboot to snsarn
    """
    client = boto3.client('sns', region_name='us-east-1')
    client.publish(TopicArn=snsarn, Message=f'EC2 instance {instanceID} was rebooted!')
    return True
def lambda_handler(event, context) -> "status about reboot":
    """
    event: see events/event.json
    """
    print('EVENT:')
    print(event)
    for record in event.get('Records', []):
        sns = record.get('Sns')
        if not sns:
            # Skip records that carry no SNS payload
            continue
        message = json.loads(sns.get('Message', '{}'))
        msgalarm = message.get('AlarmName')
        msgstatus = message.get('NewStateValue')
        if not all([message, msgalarm, msgstatus]):
            continue
        if (msgalarm == ENV['ALARM_NAME']) and (msgstatus == 'ALARM'):
            notify_about_reboot(reboot_instance(ENV['INSTANCE_ID'], ENV['REGION']), ENV['OUTPUT_SNS_ARN'])
            return 'rebooting'
        else:
            return 'nothing to do'
    return 'no sns record found'
I have also released the whole tested automation, with a SAM template and installation instructions, at https://github.com/koss822/misc/tree/master/Aws/route53-healthcheck-instance-reboot

How to block until EC2 status check is passed using Python Boto3?

I have the following Python code to detect whether an EC2 instance has really started, but it completes as soon as the "instance state" shows running.
Which API function should I use to block until the EC2 "status check" shows "2/2 checks passed"?
ec2 = boto3.resource('ec2')
instance = ec2.Instance(instanceid)
instance.wait_until_running()
It is rare that you would need to wait for the status check to pass.
When an instance enters the running state, the machine boots, loads the operating system and generally "runs".
The EC2 Status Checks are an independent process that checks attributes of the virtual machine. However, your machine is normally running, and you can log in to it, well before the status checks show a positive response.
If you do wish to wait for the Status Check, there are two waiters that might do this, but the documentation is unclear:
InstanceStatusOk
SystemStatusOk
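A sketch of how those two waiters could be used from the EC2 client is shown below. It assumes, as the answer hedges, that "2/2 checks passed" corresponds to both waiters succeeding; `wait_for_status_checks` is a hypothetical helper name, and the delay and attempt values are illustrative.

```python
def wait_for_status_checks(ec2_client, instance_id, delay=15, max_attempts=40):
    """Block until the system and instance status waiters both succeed.

    ec2_client is expected to behave like boto3.client('ec2').
    The waiter raises an exception if max_attempts is exhausted.
    """
    for waiter_name in ("system_status_ok", "instance_status_ok"):
        waiter = ec2_client.get_waiter(waiter_name)
        waiter.wait(
            InstanceIds=[instance_id],
            WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
        )
```

Called after `instance.wait_until_running()`, for example as `wait_for_status_checks(boto3.client('ec2'), instanceid)`.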

How to verify volume successfully created/attached in boto3?

I'm using boto3 client.create_volume and client.attach_volume APIs, but the return values are dictionaries, and the key State within the dictionary is creating for create_volume, and attaching for attach_volume. Is there any way to check if the volume is successfully created/attached within boto3?
Fortunately, boto3 has a concept called Waiters that can do the waiting for you!
See: EC2.Waiter.VolumeInUse
Polls EC2.Client.describe_volumes() every 15 seconds until a successful state is reached. An error is returned after 40 failed checks.
For those using ec2 client (ec2 = boto3.client('ec2')), you can do
ec2.get_waiter('volume_available').wait(VolumeIds=[new_volume['VolumeId']])
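Putting the two waiters together, a create-then-attach flow might look like the sketch below. `create_and_attach` and its parameter names are hypothetical. Note that, as far as I know, the `volume_in_use` waiter checks the volume state (`in-use`) rather than the attachment state, so a stricter check may still need describe_volumes.

```python
def create_and_attach(ec2_client, instance_id, availability_zone, size_gib, device):
    """Create a volume, wait until it is available, attach it, wait until in use.

    ec2_client is expected to behave like boto3.client('ec2').
    Returns the new volume ID.
    """
    volume = ec2_client.create_volume(AvailabilityZone=availability_zone, Size=size_gib)
    volume_id = volume["VolumeId"]

    # Block until the volume leaves the 'creating' state
    ec2_client.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2_client.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)

    # Block until the volume state becomes 'in-use'
    ec2_client.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    return volume_id
```

Both waiters raise an exception when they run out of attempts, so a normal return means both steps completed.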
See describe_volumes
Pass your volume_id and describe_volumes returns the information about:
Creation State:
'State': 'creating'|'available'|'in-use'|'deleting'|'deleted'|'error'
Attachment State:
'State': 'attaching'|'attached'|'detaching'|'detached'
and a lot more information about your volume.
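To read those two states directly, something like the following could be used. It is a sketch under the assumption that the response follows the describe_volumes structure listed above; `volume_states` is a hypothetical helper name.

```python
def volume_states(ec2_client, volume_id):
    """Return (creation_state, [(instance_id, attachment_state), ...]) for one volume.

    ec2_client is expected to behave like boto3.client('ec2').
    """
    response = ec2_client.describe_volumes(VolumeIds=[volume_id])
    volume = response["Volumes"][0]
    attachments = [
        (attachment["InstanceId"], attachment["State"])
        for attachment in volume.get("Attachments", [])
    ]
    return volume["State"], attachments
```

A volume that is fully attached would then show a creation state of `in-use` together with an attachment state of `attached`.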