Suspiciously timed EC2 instance restart - amazon-web-services

Yes, I've heard all the stories about EC2 instances being unreliable and how you need to proactively prepare for that. I've also heard stories from others about how they have never had a problem, and their instances just run and run.
Today I had a strange thing happen. I've had a Linux instance running for a couple of months, as I've been preparing to launch an e-commerce site. I've been periodically taking snapshots. I have my images on S3. I have my code in a private GitHub repo. All things considered, I've been doing a fairly good job of protecting myself against failure. Ironically, it was while I was doing even more in this regard today that I experienced something really strange.
Since I have these snapshots, I had assumed that the best thing to do if I needed to quickly spin up a new instance (whether due to a failed instance that wouldn't come back up, or if I just needed additional capacity) would be to take a snapshot and make a volume out of it, then make an image out of that volume, and then launch a new instance using that image.
For whatever reason, every time I've tried that lately, the new instance had a kernel panic during boot, so I decided to try a different approach. I right-clicked on my RUNNING INSTANCE, and chose "Create Image." That seemed like a reasonable shortcut. Then I went to that image and launched an instance.
At almost exactly the same time, my original instance rebooted. I didn't even see it happen. I only know it did from the system log. Is this just a wild coincidence? Or did I commit a silly mistake and accidentally screw up my instance?
Fortunately, I'm just getting this new thing off the ground, so the bit of downtime didn't kill me, and I was able to very quickly get things going again. But either I totally do not understand the "Create Image" feature from the instance list, or I got really unlucky today.

"Create image" takes the following actions:
1. Stop EC2 instance
2. Snapshot EBS volume
3. Start EC2 instance
4. Register EBS snapshot as an AMI
So, yes, this would look like a reboot because it is like a reboot.
Here's an article I wrote on the difference between stop/start and simple reboot: http://alestic.com/2011/09/ec2-reboot-stop-start
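For reference, the EC2 CreateImage API call has a NoReboot flag that skips the restart. Below is a minimal sketch with boto3, assuming `ec2` is an already configured client such as `boto3.client("ec2")`. The function name, instance ID and image name are placeholders. Note that when the reboot is skipped, the file system integrity of the resulting image is not guaranteed.

```python
def create_image_without_reboot(ec2, instance_id, name):
    """Create an AMI from a running instance without restarting it.

    `ec2` is assumed to be a boto3 EC2 client. With NoReboot=True the
    instance keeps running, so the snapshot may capture a file system
    that is in the middle of a write.
    """
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=name,
        NoReboot=True,
    )
    return response["ImageId"]

# Usage (placeholder IDs):
#   import boto3
#   ami_id = create_image_without_reboot(boto3.client("ec2"), "i-0123456789abcdef0", "pre-launch-backup")
```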

Your problem sounds a lot like my problem. After some searching this page helped me: http://www.raleche.com/node/138
"The problem turned out to be the kernel. Both when creating the AMI and the instance I selected default for the kernel image.
To resolve the problem, I recreated the AMI using the same kernel image as the original instance."
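A rough sketch of what "using the same kernel image" could look like through the API, assuming a boto3 EC2 client and an EBS snapshot of the root volume. The function name, the root device name `/dev/sda1` and the architecture default are assumptions for illustration, so check them against your own instance. Only older paravirtual instances report a kernel ID at all.

```python
def register_ami_with_same_kernel(ec2, instance_id, snapshot_id, name,
                                  root_device="/dev/sda1",
                                  architecture="x86_64"):
    """Register an AMI from a snapshot, reusing the source instance's kernel.

    `ec2` is assumed to be a boto3 EC2 client. `root_device` and
    `architecture` are illustrative defaults, not values from the question.
    """
    # Look up the kernel (AKI) the original instance was launched with.
    attr = ec2.describe_instance_attribute(
        InstanceId=instance_id, Attribute="kernel"
    )
    kernel_id = attr.get("KernelId", {}).get("Value")

    params = {
        "Name": name,
        "Architecture": architecture,
        "RootDeviceName": root_device,
        "BlockDeviceMappings": [
            {"DeviceName": root_device, "Ebs": {"SnapshotId": snapshot_id}}
        ],
    }
    # Pass the kernel explicitly instead of relying on the default.
    if kernel_id:
        params["KernelId"] = kernel_id
    return ec2.register_image(**params)["ImageId"]
```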

Related

"The selected AMI contains more instance store volumes than the instance allows" no matter which AMI I pick

I am trying to set up a g5.4xlarge instance to run AlphaFold on. When going over the storage section, I get the following warning:
The selected AMI contains more instance store volumes than the instance allows. Only the first 1 instance store volumes from the AMI will be accessible from the instance
This happens regardless of the AMI that I choose, and I only see one EBS AMI Root volume and one store-backed instance volume (which seems to match the volume included with a g5.4xlarge instance). I would like to use a deep learning AMI, such as:
Deep Learning AMI GPU PyTorch 1.13.1 (Ubuntu 20.04) 20230103
ami-0b7e0d9b36f4e8f14 (64-bit (x86))
This leaves a few questions:
1. Is this going to interfere with using the AMI?
2. Is the instance storage going to be used at all?
3. If 1, can I fix it?
4. If not 2, why is it included, and would there be a discount available if I could remove it?
I tried several different AMIs, expecting at least one to remove the warning based on what I read in this answer: AWS launch new instance using Ubuntu 22.04: image has more volumes than instances allows
No such luck.
Thanks in advance!
Edit: The two answers I have thus far both have part of the answer (not sure which to mark as the answer as a result). Ben Whaley gave a good direct solution for the problem (part 3), while Adil Hidistan gave a great answer to parts 1 and 2. I think I will mark Ben Whaley as the answer as it has the solution to the underlying problem, but I am also thankful for the information that Adil provided to help my understanding.
The AMI you mentioned, ami-0b7e0d9b36f4e8f14, does in fact have two instance store volumes:
Note the two ephemeral volumes under block devices.
It seems that the Amazon Linux 2 images do not have instance store volumes, however. For example, ami-0dc2e3e2f9cca7c15. I found this by searching for AMIs matching the phrase amazon/Deep Learning AMI GPU PyTorch with a creation date after 2023-01-01.
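The search described above can be sketched in Python. This assumes the image records come from boto3's `describe_images` response, where instance store mappings carry a `VirtualName` such as `ephemeral0`. The function name is hypothetical.

```python
def images_without_instance_store(images, name_contains, created_since):
    """Return IDs of images that match a name and date and map no ephemeral volumes.

    `images` is a list of dicts shaped like describe_images()["Images"].
    `created_since` is an ISO 8601 date string such as "2023-01-01"; these
    strings sort correctly as plain text, and images created on that date
    are included.
    """
    selected = []
    for image in images:
        if name_contains not in image.get("Name", ""):
            continue
        if image.get("CreationDate", "") < created_since:
            continue
        mappings = image.get("BlockDeviceMappings", [])
        # Instance store volumes appear with VirtualName "ephemeralN".
        if any((m.get("VirtualName") or "").startswith("ephemeral") for m in mappings):
            continue
        selected.append(image["ImageId"])
    return selected

# Usage, assuming a configured boto3 client:
#   images = ec2.describe_images(
#       Owners=["amazon"],
#       Filters=[{"Name": "name", "Values": ["Deep Learning AMI GPU PyTorch*"]}],
#   )["Images"]
#   print(images_without_instance_store(images, "Deep Learning AMI GPU PyTorch", "2023-01-01"))
```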
The problem is with the instance type you chose, and the warning is telling you exactly that: the chosen instance type supports one instance store volume, yet the AMI defines two.
Now, will that affect your usage? The answer is maybe. It depends on what the intention of the AMI owner was, why they added that volume, and whether they actively make use of it when the instance starts.
If, for example, their intention was to run a script that looks for that instance volume to store some data, then in its absence the instance may not boot, or may lack features it is meant to have. There is no way of telling unless you understand its purpose and usage.
If you choose an instance type that supports two instance store volumes, you will be fine, so that's one fix. Otherwise, there is a risk that it won't work as intended.
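To check the mismatch before launching, one option is to compare the number of ephemeral mappings in the AMI with the number of disks the instance type reports. This is a sketch, assuming the inputs are shaped like boto3's `describe_images` and `describe_instance_types` responses. The function name is made up.

```python
def instance_store_fit(image, instance_type_info):
    """Compare instance store volumes mapped by an AMI with what a type offers.

    `image` is one entry of describe_images()["Images"].
    `instance_type_info` is one entry of describe_instance_types()["InstanceTypes"].
    Returns (ami_count, type_count, fits).
    """
    ami_count = sum(
        1
        for m in image.get("BlockDeviceMappings", [])
        if (m.get("VirtualName") or "").startswith("ephemeral")
    )
    disks = instance_type_info.get("InstanceStorageInfo", {}).get("Disks", [])
    type_count = sum(d.get("Count", 0) for d in disks)
    return ami_count, type_count, ami_count <= type_count

# Usage, assuming a configured boto3 client:
#   image = ec2.describe_images(ImageIds=["ami-0b7e0d9b36f4e8f14"])["Images"][0]
#   info = ec2.describe_instance_types(InstanceTypes=["g5.4xlarge"])["InstanceTypes"][0]
#   print(instance_store_fit(image, info))
```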
If you have the ability, open a case with Amazon and have them provide information directly. If you do not have a support agreement, you could try their forums.

AWS Sagemaker: Jupyter Notebook kernel keeps dying

I get disconnected every now and then when running a piece of code in Jupyter Notebooks on Sagemaker. I usually just restart my notebook and run all the cells again. However, I want to know if there is a way to reconnect to my instance without having to lose my progress. At the moment, it shows that there is "No Kernel" at the bottom bar, but my file seems active in the kernel sessions tab. Can I recover my notebook's variables and contents? Also, is there a way to prevent future kernel disconnections?
Note that I reverted back to tornado = 5.1.1, which seems to decrease the number of disconnections, but it still happens every now and then.
Often, disconnections are caused by inactivity, because a job runs for a long time with no user input. If it's pre-processing that's taking a long time, you could increase the instance size of the processing job so that it executes faster, or increase the instance count. If you're using EMR, since December 2021 you can run a Spark query directly on the EMR cluster:
https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sagemaker-studio-data-notebook-integration-emr/
There's a useful blog post here that helps with getting up and running: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
Please let me know if you need more information, or vote for the answer if it's useful. :-)
For me a quick solution was to open a Terminal instead, save the notebook file as a Python file, and run it from the terminal within Sagemaker.
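The usual tool for the conversion is `jupyter nbconvert --to script`. If that is not available, a stdlib-only sketch like the one below may do, assuming a standard .ipynb JSON layout. The function name is made up. Lines starting with `%` or `!` are commented out because they are IPython-only syntax, which is a simplification that could also hit an unusual continuation line.

```python
import json

def notebook_to_script(notebook_path, script_path):
    """Write the code cells of a notebook to a plain .py file.

    Returns the number of code cells written.
    """
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)

    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Comment out IPython magics and shell escapes.
            if line.lstrip().startswith(("%", "!")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))

    with open(script_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(chunks) + "\n")
    return len(chunks)

# Then, from the terminal: python my_notebook.py
```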

Not able to run cell on a jupyterlab notebook Google cloud ai platform

I am running 2 instances under Google AI Platform, which basically launches 2 VM instances to run JupyterLab. I have been happily making notebooks on both VMs. I shut down both VMs for the day...
What's strange is that the next morning, the notebook from one VM will launch, but when I run any cell containing simple things like "import pandas", it never returns a result and hangs the whole thing (with a * where the cell number would have appeared). I created a whole new notebook with just a simple print("hello"), and it also never returns. I restarted the instance a few times and it still doesn't work. What I noticed is that the "dot" in the top right corner is filled black. I think it should be white when the kernel is restarted. So there could be a problem with the kernel.
Any ideas what could have gone wrong? I don't even know where to start debugging this. The strange thing is that the other VM still works. I don't want to do anything drastic like creating a new VM, since I would like to understand the cause and fix it.
Has anyone out there experienced the same thing?
In case you didn't attempt this, I would try refreshing the notebook window after restarting the machine.

EC2 AMI and installed third party software - how does this work?

I've been using a Windows 2008R2 EC2 instance for some time. As of today, it still works. I started working with the AWS API, and I was unable to start my instance using the API, the error message being "not authorized for images", specifically : An error occurred (AuthFailure) when calling the RunInstances operation: Not authorized for images: [ami-088dab1e]
That's when I learned about deprecation.
From what I read, what this means is that the AMI being used is no longer publicly available. When using the API call "describe-images", this image cannot be queried. While it apparently can still be used from the console, the API simply doesn't support it and will not start an instance using that image ID. On the console, the AMI description reads : Cannot load details for ami-088dab1e. You may not be permitted to view it.
I understand how to find a new image and I think I understand how to launch my instance using a new image. However, I have lots of custom software installed on this instance. So before I try it, I want to know if I will lose that custom software installation if I launch my existing instance with a new AMI. I'm hoping that my custom software won't change, but I'm skeptical. I don't want to fire up a brand new version of Windows and start from scratch. Mostly, I don't want to lose what I've already got.
I know this is a basic question, but I've looked all over, and I haven't yet found a straightforward answer. I was hoping y'all would know. Thanks.
I think I've found an answer here: AWS EC2 new instance from image AMI
When launching an instance from an Amazon Machine Image (AMI), the disks will contain an exact copy of the disk at the time that the AMI was created.
In other words, if I start a new instance, I'll lose my installed software. WRONG!
Launching != starting. More editing to come once I get this completely figured out.
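To make the distinction concrete, here is a sketch of the two API calls with boto3, assuming `ec2` is a configured client. The function names are made up. Starting brings back the existing instance with its existing EBS volumes, so installed software stays. Launching creates a brand new instance whose disks are a copy of the AMI.

```python
def start_existing(ec2, instance_id):
    """Start a stopped instance. Its EBS volumes, and the software on them, are kept."""
    ec2.start_instances(InstanceIds=[instance_id])
    return instance_id

def launch_from_ami(ec2, image_id, instance_type):
    """Launch a NEW instance from an AMI. Its disks start as a copy of the image."""
    response = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]
```

Only the second call needs an image ID.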
So, given that updated Windows images are created and deprecated all the time, and the Windows OS is constantly updated by Microsoft, one must wonder how a static Windows image can be used with other software. It seems like far more trouble than it's worth if you've got to constantly reinstall your software to keep your Windows system up to date.
Amazon recently came up with a solution for that, here: Patching Windows
I don't know how to do it yet, but this seems like exactly what I need in order to keep Windows up to date, and keep my installed software intact.

EC2 instance - complete reinstall

I have an EC2 instance up and running (Linux). I've made some installations I'd like to completely undo. What would be the best/simplest way to get back to a clean instance?
Start a new instance, running through your installation and configuration steps.
You can do this without terminating the old instance. This lets you look at configuration on the old instance in case you forgot how you set things up. It also lets you copy data or other files from the old instance to the new instance.
Once you're completely happy with the new instance, terminate the old instance.
This approach of starting the new before destroying the old didn't make sense back in the days of physical servers, but EC2 gives us new ways to think about things.
Also: Document and/or script your installation and configuration steps so you can easily reproduce them in the future on new instances. Think about separating your data onto a second EBS volume so it can be easily moved to a new instance.
You should be comfortable with testing your setup scripts/docs repeatedly until they work just right on a brand new instance.
Destroy it and create a new one using the same AMI, kernel, and any user-defined data/script that you passed to the original instance creation. Back any of your own data up to S3.
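One way to script the "same AMI, kernel, and user data" part is to build the launch parameters from the old instance's description. This is a sketch under the assumption that `instance` is shaped like an entry from boto3's `describe_instances`. The helper name and the setup script contents are placeholders.

```python
def clean_launch_params(instance, user_data=None):
    """Build run_instances() arguments that mirror an existing instance.

    `instance` is one entry of
    describe_instances()["Reservations"][i]["Instances"].
    """
    params = {
        "ImageId": instance["ImageId"],
        "InstanceType": instance["InstanceType"],
        "MinCount": 1,
        "MaxCount": 1,
    }
    # Optional fields: only present on some instances.
    for key in ("KernelId", "KeyName"):
        if instance.get(key):
            params[key] = instance[key]
    if user_data:
        # Setup script run at first boot.
        params["UserData"] = user_data
    return params

# Usage, assuming a configured boto3 client and a setup script of your own:
#   old = ec2.describe_instances(InstanceIds=["i-..."])["Reservations"][0]["Instances"][0]
#   ec2.run_instances(**clean_launch_params(old, user_data="#!/bin/bash\n# your setup steps\n"))
```

Keeping the setup steps in the user data script also covers the advice above about documenting and scripting the installation.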