AWS EC2 spot instance availability - amazon-web-services

I am using the API call request_spot_instances to create a spot instance without specifying an availability zone. Normally the API picks a random AZ. The spot request sometimes returns a no-capacity status, even though I can successfully request a spot instance in another AZ through the AWS console. What is the proper way to check the availability of a specific instance type as a spot instance before calling request_spot_instances?

There is no public API to check Spot Instance availability. Having said that, you can still achieve what you want by following the steps below:
Use request_spot_fleet instead, and configure it to launch a single instance.
Be flexible with the instance types you use: pick as many as you can and include them in the request. To help you pick instance types, check the Spot Instance Advisor for interruption rates and savings over On-Demand.
In the Spot Fleet request, set AllocationStrategy to capacityOptimized. This allows the fleet to allocate capacity from the most available Spot capacity pools among the instance types in your list and reduces the likelihood of Spot interruptions.
Don't set a max price (SpotPrice); the default maximum price will be used. The Spot pricing model has changed and is no longer based on bidding, so Spot prices are more stable and don't fluctuate as much.
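For illustration, here is a minimal boto3 sketch of such a request (a sketch only, not a definitive setup): the AMI ID, subnet IDs, IAM fleet role ARN, and the instance-type list are placeholder assumptions you would replace with your own.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        # Placeholder role ARN; the fleet role must exist in your account.
        'IamFleetRole': 'arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role',
        'AllocationStrategy': 'capacityOptimized',
        'TargetCapacity': 1,        # launch a single instance
        'Type': 'request',
        # Be flexible: spread the request across several instance types and subnets/AZs.
        'LaunchSpecifications': [
            {
                'ImageId': 'ami-0123456789abcdef0',   # placeholder AMI
                'InstanceType': instance_type,
                'SubnetId': subnet_id,
            }
            for instance_type in ['t3.nano', 't3a.nano', 't2.nano']    # example types
            for subnet_id in ['subnet-aaaa1111', 'subnet-bbbb2222']    # placeholder subnets
        ],
        # No SpotPrice set, so the default maximum price applies.
    }
)
print(response['SpotFleetRequestId'])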

This may be a bit overkill for what you are looking for, but with parts of this code you can pull the spot price history for the last hour (the window can be changed). It gives you the instance type, AZ, and additional information. From there you can loop through the instance types by AZ: if a spot instance doesn't come up in, say, 30 seconds, try the next AZ (a sketch of such a loop follows the code below).
And to Ahmed's point in his answer, this information can be used in the spot fleet request instead of looping through the AZs. If you pass the wrong AZ or subnet in the spot fleet request, it may pass the dry-run API call but still fail the real call; just a heads-up if you are using the DryRun parameter.
Here's the output of the code that follows:
In [740]: df_spot_instance_options
Out[740]:
AvailabilityZone InstanceType SpotPrice MemSize vCPUs CurrentGeneration Processor
0 us-east-1d t3.nano 0.002 512 2 True [x86_64]
1 us-east-1b t3.nano 0.002 512 2 True [x86_64]
2 us-east-1a t3.nano 0.002 512 2 True [x86_64]
3 us-east-1c t3.nano 0.002 512 2 True [x86_64]
4 us-east-1d t3a.nano 0.002 512 2 True [x86_64]
.. ... ... ... ... ... ... ...
995 us-east-1a p2.16xlarge 4.320 749568 64 True [x86_64]
996 us-east-1b p2.16xlarge 4.320 749568 64 True [x86_64]
997 us-east-1c p2.16xlarge 4.320 749568 64 True [x86_64]
998 us-east-1d p2.16xlarge 14.400 749568 64 True [x86_64]
999 us-east-1c p3dn.24xlarge 9.540 786432 96 True [x86_64]
[1000 rows x 7 columns]
And here's the code:
import boto3
import pandas as pd
from datetime import datetime, timedelta

ec2c = boto3.client('ec2')
ec2r = boto3.resource('ec2')

#### The rest of this code maps the instance details to spot price in case you are looking for certain memory or cpu
paginator = ec2c.get_paginator('describe_instance_types')
response_iterator = paginator.paginate()
df_hold_list = []
for page in response_iterator:
    df_hold_list.append(pd.DataFrame(page['InstanceTypes']))
df_instance_specs = pd.concat(df_hold_list, axis=0).reset_index(drop=True)
df_instance_specs['Spot'] = df_instance_specs['SupportedUsageClasses'].apply(lambda x: 1 if 'spot' in x else 0)
df_instance_spot_specs = df_instance_specs.loc[df_instance_specs['Spot'] == 1].reset_index(drop=True)
# unpack the memory and cpu dictionaries
df_instance_spot_specs['MemSize'] = df_instance_spot_specs['MemoryInfo'].apply(lambda x: x.get('SizeInMiB'))
df_instance_spot_specs['vCPUs'] = df_instance_spot_specs['VCpuInfo'].apply(lambda x: x.get('DefaultVCpus'))
df_instance_spot_specs['Processor'] = df_instance_spot_specs['ProcessorInfo'].apply(lambda x: x.get('SupportedArchitectures'))
# list of all spot-capable instance types
instance_list = df_instance_spot_specs['InstanceType'].unique().tolist()
#---------------------------------------------------------------------------------------------------------------------
# You can use this section by itself to get the instance type and availability zone;
# just replace instance_list with the instance(s) you want information for.
# look only in us-east-1
client = boto3.client('ec2', region_name='us-east-1')
prices = client.describe_spot_price_history(
    InstanceTypes=instance_list,
    ProductDescriptions=['Linux/UNIX', 'Linux/UNIX (Amazon VPC)'],
    StartTime=(datetime.now() - timedelta(hours=1)).isoformat(),
    # AvailabilityZone='us-east-1a',
    MaxResults=1000)
df_spot_prices = pd.DataFrame(prices['SpotPriceHistory'])
df_spot_prices['SpotPrice'] = df_spot_prices['SpotPrice'].astype('float')
df_spot_prices.sort_values('SpotPrice', inplace=True)
#---------------------------------------------------------------------------------------------------------------------
# merge memory size and cpu information into this dataframe
df_spot_instance_options = df_spot_prices[['AvailabilityZone', 'InstanceType', 'SpotPrice']].merge(
    df_instance_spot_specs[['InstanceType', 'MemSize', 'vCPUs', 'CurrentGeneration', 'Processor']],
    on='InstanceType')
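As mentioned above, here is a sketch of what the AZ-fallback loop could look like, reusing client and df_spot_instance_options from the code above. This is only an illustration: the AMI ID is a placeholder, the 30-second wait is an arbitrary choice, and you may want smarter state handling in practice.
import time

def try_spot_candidates(df_spot_instance_options, wait_seconds=30):
    # Try the cheapest (InstanceType, AZ) combinations first.
    candidates = df_spot_instance_options.sort_values('SpotPrice')
    for _, row in candidates.iterrows():
        resp = client.request_spot_instances(
            InstanceCount=1,
            LaunchSpecification={
                'ImageId': 'ami-0123456789abcdef0',   # placeholder AMI
                'InstanceType': row['InstanceType'],
                'Placement': {'AvailabilityZone': row['AvailabilityZone']},
            })
        request_id = resp['SpotInstanceRequests'][0]['SpotInstanceRequestId']
        time.sleep(wait_seconds)                      # give the request time to be fulfilled
        desc = client.describe_spot_instance_requests(SpotInstanceRequestIds=[request_id])
        if desc['SpotInstanceRequests'][0]['State'] == 'active':
            return request_id                         # capacity found in this AZ
        # No capacity here; cancel and try the next candidate.
        client.cancel_spot_instance_requests(SpotInstanceRequestIds=[request_id])
    return None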

Related

How to configure aws auto scaling group for scaling up/down using terraform

I have an ECS cluster (type: EC2) that has an auto-scaling group. Let's say I have a maximum of 10 instances, a desired count of 6 running instances, and each instance has 2 deployed services. Now I want to configure the auto-scaling group to dynamically scale up/down based on the service counts, which means:
if the desired number of deployed services is 6 and I make an update that scales up the replica count of a service, the cluster must scale up by one of the remaining instances (out of the 10) for the new replica, and so on until the maximum of 10 instances is reached; and if I decrease the number of replicas, the cluster must terminate the now-unused instance, and so on.
Why do I need this?
Because I don't want to have any instance with status Active that I'm not using. I assume I'm paying for any unused instance that is in an active state, so if you have a better idea, or my assumption is wrong, please tell me.
Here is my configuration:
resource "aws_autoscaling_policy" "asg_policy" {
name = "asg-policy"
scaling_adjustment = 1
policy_type = "SimpleScaling"
adjustment_type = "ChangeInCapacity"
cooldown = 100
autoscaling_group_name = aws_autoscaling_group.ecs_asg.name
}
resource "aws_autoscaling_group" "ecs_asg" {
name = "ecs-asg"
vpc_zone_identifier = ["${aws_subnet.public_1.id}", "${aws_subnet.public_2.id}", "${aws_subnet.public_3.id}"]
launch_configuration = aws_launch_configuration.ecs_launch_config.name
desired_capacity = 6
min_size = 0
max_size = 10
health_check_grace_period = 100
health_check_type = "ELB"
force_delete = true
target_group_arns = [aws_lb_target_group.asg_tg.arn]
termination_policies = ["OldestInstance"]
}
I tried to configure the asg_policy, but it doesn't seem to work as expected.
I also tried to set the max/min numbers, but that didn't work either.
Can anyone help? Thanks.

Dynamically distribute EC2s in available Subnets via Terraform [duplicate]

This question already has an answer here:
Terraform list element out of bounds?
(1 answer)
Closed 3 years ago.
The requirement is to create EC2s from the dynamically given list instance_names and distribute them evenly across the available subnets of the VPC.
I have tried looping and conditional statements with little luck.
Use Case 01 - (In a VPC with two subnets) If we are creating 2 servers, one EC2 should be in subnet 'a' and the other in subnet 'b'.
Use Case 02 - (In a VPC with two subnets) If we are creating 3 servers, two EC2s should be in subnet 'a' and the other EC2 in subnet 'b'.
Control Code
module "servers" {
source = "modules/aws-ec2"
instance_type = "t2.micro"
instance_names = ["server01", "server02", "server03"]
subnet_ids = module.prod.private_subnets
}
Module
resource "aws_instance" "instance" {
count = length(var.instance_names)
subnet_id = var.subnet_ids[count.index]
tags = {
Name = var.instance_names[count.index]
}
}
You can use element to loop around the subnet_ids list and get the correct id for each aws_instance.
In the docs you can see that element will give you the desired effect because:
If the given index is greater than the length of the list then the index is "wrapped around" by taking the index modulo the length of the list
Use Case 1
-> server01 - subnet 'a' <-> element(subnet_ids,0)
-> server02 - subnet 'b' <-> element(subnet_ids,1)
Use Case 2
-> server01 | subnet 'a' <-> element(subnet_ids,0)
-> server02 | subnet 'b' <-> element(subnet_ids,1)
# wrap around the subnet id list: back to the first id again
-> server03 | subnet 'a' <-> element(subnet_ids,2)
-> server04 | subnet 'b' <-> element(subnet_ids,3)
-> etc.
So the following update to the code should work:
resource "aws_instance" "instance" {
count = length(var.instance_names)
subnet_id = element(var.subnet_ids, count.index)
tags = {
Name = var.instance_names[count.index]
}
}
I found an interesting answer from Tom Lime to a similar question and derived an answer from it for this scenario. In the module, use the following logic for subnet_id:
subnet_id = "${var.subnet_ids[ count.index % length(var.subnet_ids) ]}"

Aerospike losing documents when node goes down

I've been doing some tests with Aerospike and I noticed behavior different from what is advertised.
I have a cluster of 4 nodes running on AWS in the same AZ; the instances are t2.micro (1 CPU, 1 GB RAM, 25 GB SSD) using Amazon Linux with the Aerospike AMI.
aerospike.conf:
heartbeat {
    mode mesh
    port 3002
    mesh-seed-address-port XXX.XX.XXX.164 3002
    mesh-seed-address-port XXX.XX.XXX.167 3002
    mesh-seed-address-port XXX.XX.XXX.165 3002
    #internal aws IPs
...
namespace teste2 {
    replication-factor 2
    memory-size 650M
    default-ttl 365d
    storage-engine device {
        file /opt/aerospike/data/bar.dat
        filesize 22G
        data-in-memory false
    }
}
What I did was a test to see whether I would lose documents when a node goes down. For that I wrote a little Python script:
from __future__ import print_function
import aerospike
import pandas as pd
import numpy as np
import time
import sys

config = {
    'hosts': [('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000),
              ('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000)]
}  # external aws ips
client = aerospike.client(config).connect()

for i in range(1, 10000):
    key = ('teste2', 'setTest3', ''.join(('p', str(i))))
    try:
        client.put(key, {'id11': i})
        print(i)
    except Exception as e:
        print("error: {0}".format(e), file=sys.stderr)
    time.sleep(1)
I used this code just to insert a sequence of integers that I could check afterwards. I ran the code and after a few seconds stopped the Aerospike service on one node for 10 seconds, using sudo service aerospike stop and sudo service aerospike coldstart to restart.
I waited a few seconds until the nodes finished all the migrations and then executed the following Python script:
query = client.query('teste2', 'setTest3')
query.select('id11')
te = []

def save_result((key, metadata, record)):
    te.append(record)

query.foreach(save_result)
d = pd.DataFrame(te)
d2 = d.sort(columns='id11')
te2 = np.array(d2.id11)

for i in range(0, len(te2)):
    if i > 0:
        if (te2[i] != (te2[i-1] + 1)):
            print('no %d' % int(te2[i-1] + 1))
print(te2)
And got as response:
no 3
no 6
no 8
no 11
no 13
no 17
no 20
no 22
no 24
no 26
no 30
no 34
no 39
no 41
no 48
no 53
[ 1 2 5 7 10 12 16 19 21 23 25 27 28 29 33 35 36 37 38 40 43 44 45 46 47 51 52 54]
Is my cluster configured wrong, or is this normal?
PS: I tried to include as much as I could; if you suggest more information to include, I will appreciate it.
Actually I found a solution, and to be honest it is pretty simple, even a bit silly.
In the configuration file we have some parameters for network communication between nodes, such as:
interval 150 # Number of milliseconds between heartbeats
timeout 10 # Number of heartbeat intervals to wait
# before timing out a node
These two parameters set the time it takes for the cluster to realize a node is down and drop it from the cluster (in this case 1.5 s).
What we found useful was tuning the write policies on the client to work with these parameters.
Depending on the client, you will have policies such as the number of retries before the operation fails, the timeout for the operation, and the time between retries.
You just need to adapt the client parameters. For example: set the number of retries to 4 (each executed after 500 ms) and the timeout to 2 s. With that, the client recognizes that the node is down and redirects the operation to another node.
This setup can be demanding on the cluster, generating significant extra load, but it worked for us.
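To make that concrete, here is a hedged sketch of such client-side tuning with the aerospike Python client used earlier in this thread. The policy key names differ between client versions (older clients use 'timeout'/'retry' instead), so treat the keys below as an assumption and check your client's documentation.
import aerospike

config = {
    'hosts': [('XX.XX.XX.XX', 3000)],
    'policies': {
        'write': {
            'max_retries': 4,              # retry up to 4 times before failing
            'sleep_between_retries': 500,  # milliseconds between tries
            'total_timeout': 2000,         # give up after 2 seconds overall
        }
    }
}
client = aerospike.client(config).connect()
# With ~1.5 s needed for the cluster to drop a dead node, the retries above
# give the client enough time to be redirected to another node.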

How can we use clustering results in weka ?

I am using Weka for my internship, but I have little knowledge of data mining. Maybe someone knows how I can apply the following results to my data sets to get all the data grouped by cluster? The method I use now is to compute the distances between my attributes and the mean value of each cluster, and then assign each instance to the cluster with the nearest value. But this method is too rough for me.
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: wcet_cluster6 - Copie-weka.filters.unsupervised.attribute.Remove-R1-3,5-weka.filters.unsupervised.attribute.Remove-R5-12
Instances: 467
Attributes: 4
max
alt
stmt
bb
Test mode:evaluate on training data
=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 6
Cluster
Attribute 0 1 2 3 4 5
(0.28) (0.11) (0.25) (0.16) (0.04) (0.17)
==================================================================
max
mean 9.0148 10.9112 11.2826 10.4329 11.2039 10.0546
std. dev. 1.8418 2.7775 3.0263 2.5743 2.2014 2.4614
alt
mean 0.0003 19.6467 0.4867 2.4565 44.191 8.0635
std. dev. 0.0175 5.7685 0.5034 1.3647 10.4761 3.3021
stmt
mean 0.7295 77.0348 3.2439 12.3971 140.9367 33.9686
std. dev. 1.0174 21.5897 2.3642 5.1584 34.8366 11.5868
bb
mean 0.4362 53.9947 1.4895 7.2547 114.7113 22.2687
std. dev. 0.5153 13.1614 0.9276 3.5122 28.0919 7.6968
Time taken to build model (full training data) : 4.24 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 163 ( 35%)
1 50 ( 11%)
2 85 ( 18%)
3 73 ( 16%)
4 18 ( 4%)
5 78 ( 17%)
Log likelihood: -9.09081
Thanks for your help!!
I think no-one can really answer this. Some tips off the top of my head.
You have used the EM clustering algorithm (see the animated GIF on the Wikipedia page). From Weka's documentation synopsis:
"EM assigns a probability distribution to each instance which
indicates the probability of it belonging to each of the clusters. "
Is this complex output really what you want?
It also selects a number of clusters for you (unless you constrain that number).
In Weka 3.7 you can use the unsupervised attribute filter "ClusterMembership" in the Preprocess dialog to replace your dataset with the result of the cluster assignments. You need to select one reference attribute, though; by default it selects the last one. This creates hard-to-interpret output.
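If you prefer scripting over the GUI, the same filter can in principle be applied with the python-weka-wrapper3 package. This is only a hedged sketch: the ARFF file name is a placeholder, and the filter/clusterer options should be verified against your Weka version.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.filters import Filter

jvm.start()

# Load the dataset (placeholder file name).
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("wcet_cluster6.arff")

# Replace the attributes with EM cluster-membership probabilities,
# mirroring the ClusterMembership filter from the Preprocess dialog
# (EM with default options here).
cluster_filter = Filter(
    classname="weka.filters.unsupervised.attribute.ClusterMembership",
    options=["-W", "weka.clusterers.EM"])
cluster_filter.inputformat(data)
clustered = cluster_filter.filter(data)

print(clustered)   # each row now holds per-cluster membership probabilities

jvm.stop()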

How to create a DS to store the accumulated result of another DS with rrdtool

I want to create an rrd file with two data sources included. One stores the original value of the data; name it 'dc'. The other stores the accumulated result of 'dc'; name it 'total'. The expected formula is current(total) = previous(total) + current(dc). For example, if I update the rrd file with the data sequence (2, 3, 5, 4, 9), I want 'dc' to be (2, 3, 5, 4, 9) and 'total' to be (2, 5, 10, 14, 23).
I tried to create the rrd file with the command line below. The command fails, saying that PREV is not supported with DS COMPUTE.
rrdtool create test.rrd --start 920804700 --step 300 \
DS:dc:GAUGE:600:0:U \
DS:total:COMPUTE:PREV,dc,ADDNAN \
RRA:AVERAGE:0.5:1:1200 \
RRA:MIN:0.5:12:2400 \
RRA:MAX:0.5:12:2400 \
RRA:AVERAGE:0.5:12:2400
Is there an alternative way to define the DS 'total' (DS:total:COMPUTE:PREV,dc,ADDNAN)?
rrdtool does not store 'original' values; rather, it samples the signal you provide via the update command at the rate you defined when you set up the database, in your case 1/300 Hz.
That said, a running total does not make much sense as a DS.
What you can do with a single DS, though, is compute the average value over a time range and multiply the result by the number of seconds in that range, and thus arrive at the 'total'.
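For illustration, here is a hedged sketch of that idea using the Python rrdtool bindings: fetch the AVERAGE-consolidated values of 'dc' and multiply each by the step length. The file name and time range are placeholders, and for a GAUGE DS this gives you 'value-seconds' rather than a plain sum of the updates.
import rrdtool

# fetch returns ((start, end, step), ds_names, rows); rows hold one value per DS (None for unknown)
(start, end, step), ds_names, rows = rrdtool.fetch(
    "test.rrd", "AVERAGE",
    "--start", "920804700",
    "--end", "920808300")

dc_index = ds_names.index("dc")
total = 0.0
for row in rows:
    value = row[dc_index]
    if value is not None:
        total += value * step      # average over the step times its length in seconds
print("accumulated total:", total)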
Sorry, a bit late, but this may be helpful for someone else.
It's better to use the 'mrtg-traffic-sum' tool: when I use it with an rrd that has a GAUGE DS and LAST RRAs, it lets me collect monthly traffic volumes and quota limits.
E.g., here is a basic traffic summary with no traffic quota:
root@server:~# /usr/bin/mrtg-traffic-sum --range=current --units=MB /etc/mrtg/R4.cfg
Subject: Traffic total for '/etc/mrtg/R4.cfg' (1.9) 2022/02
Start: Tue Feb 1 01:00:00 2022
End: Tue Mar 1 00:59:59 2022
Interface In+Out in MB
------------------------------------------------------------------------------
eth0 0
eth1 14026
eth2 5441
eth3 0
eth4 15374
switch0.5 12024
switch0.19 151
switch0.49 1
switch0.51 0
switch0.92 2116
root@server:~#
From this you can then write a script that generates a new rrd storing these values, and presto, you have a traffic volume / quota graph.
Example fixed traffic volume chart using GAUGE
This thread reminded me to fix this collector, which had stopped; I just got around to posting ;)