Unable to start Spot Instance VM after Resize - azure-virtual-machine

I am unable to start my spot instance VM after resizing it from HB120rs_v2 to HB120rs_v3, with the error:
Failed to start virtual machine 'VM2'. Error: Allocation failed. VM(s) with the following
constraints cannot be allocated, because the condition is too restrictive. Please remove some
constraints and try again. Constraints applied are:
Low Priority VMs
Networking Constraints (such as Accelerated Networking or IPv6)
Preemptible VMs (VM might be preempted by another VM with a higher priority)
VM Size
This is all done within the US East region. Interestingly, when resizing, HB120rs_v3 is not shown under the H-Series category, but instead under the Other category. It previously appeared under the H-Series category, when I was able to start it correctly. If I change the size back to HB120rs_v2 I am able to start the VM.
This is not a quota issue as I currently only have 1 VM and the VM was deallocated at the time of resizing. I have also previously successfully started this VM with the HB120rs_v3 size about 4 weeks earlier.
My questions are:
How can I determine the specific cause of the start failure?
What is the significance of the VM size being shown under the Other category?

We cannot change the generation of a VM after we create it.
How can I determine the specific cause of the start failure?
The reason for the VM start failure is that the last operation run on the VM failed after the input was accepted. Since you are trying to change a gen2 VM to gen3, which is not supported, this might be the reason for the start failure.
What is the significance of the VM size being shown under the Other category?
When resizing, HB120rs_v3 is not shown under the H-Series category because the hardware cluster currently hosting your HB120rs_v2 VM does not support the HB120rs_v3 size, which is why HB120rs_v3 appears under the Other category rather than under H-Series.

The error message should generally be taken at face value. It means that the platform could not satisfy all the given allocation (placement) constraints together. If you remove one or more constraints (some of which are expressed indirectly as feature/capability specifications in the VM model), the allocation may succeed. For a standalone VM (not in an Availability Set or a VMSS), common highly restrictive allocation constraints are IPv6 and Accelerated Networking, often in relation to the VM size. (But you can be certain the VM size by itself is compatible with Accelerated Networking; otherwise the request would've been rejected earlier, with a more specific error.)
'Spot' priority (referred to as "Preemptible VMs" in the error message) can also be restrictive since there must be ample excess capacity available to allow Spot deployments in a region/zone. A good resource for assessing Spot availability is the eviction history: https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms#pricing-and-eviction-history.
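A quick way to rule out a hard SKU restriction (as opposed to a transient Spot capacity shortage) is to list the SKU for your subscription in that region; this is a sketch assuming the Azure CLI is installed and logged in:
az vm list-skus --location eastus --size Standard_HB120rs_v3 --all --output table
If the Restrictions column is empty, the size itself is available to your subscription and the failure is more likely transient Spot capacity; retrying later, trying a different zone, or temporarily deploying with Regular priority can help narrow it down.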

Related

Experienced problems with our RDS instance

We experienced problems with our RDS instance.
RDS stops running. The RDS instance shows as "green" on the AWS console, but we cannot connect to it.
In the logs we found the following errors:
2018-03-07 8:52:31 47886953160896 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:52:32 47886953160896 [ERROR] Plugin 'InnoDB' init function returned error.
2018-03-07 8:53:46 47508779897024 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
InnoDB: Set innodb_force_recovery to ignore this error.
2018-03-07 8:53:46 47508779897024 [ERROR] Plugin 'InnoDB' init function returned error.
When we tried to reboot the RDS instance, it took almost 2 hours to reboot. After rebooting it is working fine again!
Can someone help us find the root cause of this incident?
A t2.small provides 2 GB of RAM. As you may know, most DB engines tend to use up to 75% of the memory for caching purposes such as queries, temporary tables and table scans, to make things go faster.
For the MariaDB engine, the following parameters are set to these pre-optimized values by default:
innodb_buffer_pool_size (DB instance memory * 3/4 = 1.5 GB)
key_buffer_size (16777216 bytes = 16 MiB)
innodb_log_buffer_size (8388608 bytes = 8 MiB)
Apart from that, the OS and the RDS processes also use some amount of RAM for their own operations. To summarize: roughly 1.6 GB is claimed by the DB engine, and the actually usable memory left after subtracting innodb_buffer_pool_size, key_buffer_size and innodb_log_buffer_size is only around 400 MB.
Overall, your FreeableMemory decreased to as low as ~137 MB, and as a result SwapUsage increased drastically over the same period to approximately 152 MB.
FreeableMemory was quite low and swap utilization was high. Due to this memory pressure (insufficient memory and high swap usage), the RDS internal monitoring system was not able to communicate with the host, which in turn resulted in the underlying host being replaced.
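If you want to confirm this memory pressure yourself, you can pull the same metrics from CloudWatch; a sketch with the AWS CLI, where my-rds-instance and the time window are placeholders:
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeableMemory \
  --dimensions Name=DBInstanceIdentifier,Value=my-rds-instance \
  --start-time 2018-03-07T00:00:00Z --end-time 2018-03-07T12:00:00Z \
  --period 300 --statistics Minimum
# repeat with --metric-name SwapUsage --statistics Maximum to see the swap spike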

AWS ElasticSearchService index_create_block_exception

I'm trying to create a new index in an AWS ElasticSearch cluster after increasing the cluster size and I am seeing index_create_block_exception. How can I rectify this? I tried searching but did not find exact answers. Thank you.
curl -XPUT 'http://<aws_es_endpoint>/optimus/'
{"error":{"root_cause":[{"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"}],"type":"index_create_block_exception","reason":"blocked by: [FORBIDDEN/10/cluster create-index blocked (api)];"},"status":403}
According to AWS, the above exception is thrown when the cluster is running low on memory:
For t2 instances, when the JVMMemoryPressure metric exceeds 92%, Amazon ES triggers a protection mechanism by blocking all write operations to prevent the cluster from getting into red status. When the protection is on, write operations will fail with a ClusterBlockException error, new indexes cannot be created, and the IndexCreateBlockException error will be thrown.
I'm afraid the issue is still ongoing.
You'll also get this error if you run out of disk space.
This should of course not happen after increasing the cluster size, but if you suddenly get this error it is worth checking that all your instances have storage left - i.e. don't look at the Total free storage space graph but at the Minimum free storage space one.
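To see which of the two it is, you can query the cluster directly (assuming your access policy allows it; <aws_es_endpoint> is the same endpoint as above):
# heap usage per node - a high heap_used_percent roughly corresponds to high JVMMemoryPressure
curl '<aws_es_endpoint>/_nodes/stats/jvm?pretty'
# free disk per node - the write block can also be triggered by a single full node
curl '<aws_es_endpoint>/_cat/allocation?v'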

AWS Lambda: Service error

What does this error mean?
I have 5 Lambda Functions deployed using Java that worked perfectly but since this afternoon all of them started displaying the same message when I execute each:
Service error.
No output, no logs, only that message in a red box.
In http://status.aws.amazon.com/ they say:
6:05 PM PDT We are investigating increased error rates and elevated
latencies for AWS Lambda requests in the US-EAST-1 Region. Newly
created functions and console editing are also affected.
Why does it happen and is there a way to prevent it?
From time to time, parts of Amazon's AWS service fail. Sometimes the failure is very small and short-lived, and in other cases there are larger distributed failures.
Your system design needs to take into account the possibility that the piece of AWS that you are counting on will not work at the moment, and try to route around the damage. For instance, you can run Lambda in multiple regions. (It already runs in multiple availability zones inside a single region, so you don't have to worry about that). This gives you some isolation against failures in any one region.
Getting distributed systems to work at small scale can be hard because the failures that you need to protect against don't happen very often. At large scale, you get systematic efforts like Netflix's "Chaos Monkey", which deliberately introduces failures so that automated processes can detect and correct those issues.
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport
"When a Fail-Safe system fails, it fails by failing to fail safe." -- John Gall

How to determine that a jvm app does more GC than normal work?

We recently had a problem where our EC2 instances ran at 90-100 percent CPU load because of a bug in a library we include that created too many objects instead of reusing them (which was easily solvable), so we spent too much time in GC.
Unfortunately, the AWS health checks and instance status metrics didn't cause the overloaded instances to be stopped and new ones started, so after some time we hit the max autoscaling number and... died. Also, our own health checks inside the app, which are used by the ELB, are so simple that they still answered often enough, so obviously they did not cause the instances to be terminated and restarted, which would have mitigated the problem for quite some time.
My idea is now to use our custom health check, which is already included in the ELB health checks, to report a failure if we spend too much time in GC.
How would I do such a thing inside the app?
There are a number of JVM parameters that allow GC monitoring
-Xloggc:<file> // logs gc activity to a file
-XX:+PrintGCDetails // tells you how different generations are impacted
You can either parse these logs yourself or use a dedicated tool such as GCViewer to analyse GC activity.
Use GarbageCollectorMXBean:
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// total time (ms) spent in GC across all collectors since JVM start
long gcTime = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
    gcTime += gcBean.getCollectionTime();
}
// compare against JVM uptime to get the overall share of time spent in GC
long jvmUptime = ManagementFactory.getRuntimeMXBean().getUptime();
System.out.println("GC ratio: " + (100 * gcTime / jvmUptime) + "%");
Note that getCollectionTime() is cumulative since JVM start, so for a health check you would sample it periodically and look at the delta between samples rather than at the lifetime ratio.
You can use VisualVM to monitor what happens inside the JVM, and you can monitor remote instances via JMX. You did not describe which application container you are using (Apache Tomcat, GlassFish, etc.); in the case of Tomcat you can set up a JMX connector, for example as sketched below.
Don't forget to adjust Security Groups in AWS to have the proper permission to access the JMX port.
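For example, a minimal test-only JMX setup for Tomcat is to add the standard com.sun.management.jmxremote system properties, e.g. in bin/setenv.sh; the port number is just an arbitrary example, and authentication/SSL are disabled here only for brevity - lock this down for anything beyond a quick test:
export CATALINA_OPTS="$CATALINA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.rmi.port=9010 \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
You can then point VisualVM (or jconsole) at the instance's address on port 9010.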
The JVM flags PrintGCApplicationConcurrentTime and PrintGCApplicationStoppedTime will log how long the application was active or suspended. They're a bit of a misnomer since they actually measure time spent in and out of safepoints, not just GCs.
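Putting the logging flags together, a typical Java 8 invocation might look like this (gc.log and your-app.jar are placeholders; on Java 9+ these flags were replaced by unified logging, e.g. -Xlog:gc*,safepoint):
java -Xloggc:gc.log \
     -XX:+PrintGCDetails \
     -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintGCApplicationConcurrentTime \
     -jar your-app.jar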

Why is Google Dataproc HDFS Name Node in Safemode?

I am trying to write to an HDFS directory at hdfs:///home/bryan/test_file/ by submitting a Spark job to a Dataproc cluster.
I get an error that the Name Node is in safe mode. I have a solution to get it out of safe mode, but I am concerned this could be happening for another reason.
Why is the Dataproc cluster in safe mode?
ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1443726448000 ms.0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /home/bryan/test_file/_temporary/0. Name node is in safe mode.
The reported blocks 125876 needs additional 3093 blocks to reach the threshold 0.9990 of total blocks 129098.
The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
What safe mode means
The NameNode is in safemode until data nodes report on which blocks are online. This is done to make sure the NameNode does not start replicating blocks even though there is (actually) sufficient (but unreported) replication.
Why this happened
Generally this should not occur with a Dataproc cluster as you describe. In this case I'd suspect a virtual machine in the cluster did not come online properly or ran into an issue (networking or otherwise) and, therefore, the cluster never left safe mode. The bad news is that this means the cluster is in a bad state. Since Dataproc clusters are quick to start, I'd recommend you delete the cluster and create a new one. The good news is that these errors should be quite uncommon.
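If you want to inspect the state yourself before recreating the cluster (assuming you can SSH to the master node), the standard HDFS admin commands apply:
hdfs dfsadmin -safemode get      # is safe mode still ON?
hdfs dfsadmin -report            # lists live/dead datanodes and their capacity
hdfs dfsadmin -safemode leave    # force the namenode out of safe mode (use with care)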
The reason is probably that you started the master node (housing the namenode) before starting the workers. If you shut down all the nodes, start the workers first and then start the master node, it should work. I suspect that when the master node starts first, it checks whether the workers are there; if they are offline it goes into safe mode. In general this should not happen because of the heartbeat mechanism, but it is what it is, and a restart of the master node will resolve the matter. In my case it happened with Spark on Dataproc.
HTH