AWS EC2 instance with custom AMI occasionally fails to launch upon creation

We are using AWS EC2 via CloudFormation to launch stacks of instances. We create stacks that are combinations of custom and marketplace images. This worked perfectly until Friday. Starting on Friday, 10/7, about 10% of all instances we launch simply stall upon launch. So far we have only seen this for the custom AMIs we created (both Win7 and Win10), but I'm not sure whether that is a coincidence, as the stack is mostly composed of instances launched from those 2 AMIs.
Note that we did not change the AMIs recently nor have we changed anything else about our process.
The issue eventually manifests as a failure when CloudFormation times out.
I detached one of the boot volumes and attached it to a working instance so that I could view the logs. There are simply no new entries from the attempted launch in the following logs (or in any accompanying error logs):
Ec2ConfigLog.txt
%WINDIR%\Panther\SetupAct.log
%WINDIR%\Panther\UnattendGC\SetupAct.log
The screenshot of the instance (via the AWS console) shows a static Windows icon with no text around it.
Grabbing the system log (via the AWS console) returns nothing (empty console).
Force stopping and then starting the instance does kick off the customization and launch of Windows, but building in a restart upon failure is a hack I really don't want to do, even if CloudFormation allows it (which I'm not sure about).
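As a troubleshooting aid, the empty system log and the force stop/start described above could be scripted. This is only a minimal sketch, not a fix for the root cause: the function names are hypothetical, and `ec2` is assumed to be a boto3 EC2 client (or any object exposing the same `get_console_output`, `stop_instances`, `start_instances` and `get_waiter` calls). A freshly launched healthy instance may also have an empty console for a while, so a grace period before checking is assumed.

```python
def has_empty_console(ec2, instance_id):
    """Return True when EC2 reports no console output for the instance."""
    response = ec2.get_console_output(InstanceId=instance_id)
    # The "Output" key can be missing or blank when nothing was written.
    return not (response.get("Output") or "").strip()


def find_stalled_instances(ec2, instance_ids):
    """Return the instance IDs whose console output is still empty."""
    return [i for i in instance_ids if has_empty_console(ec2, i)]


def force_restart(ec2, instance_id):
    """Force-stop the instance, wait until it is stopped, then start it."""
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
```

With boto3 installed, `ec2` would be created with `boto3.client("ec2")`. Logging the IDs returned by `find_stalled_instances` over several launches may also help show whether the stall really is limited to the custom AMIs.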
Does anyone have ideas for how we can troubleshoot further?
Thanks!
Jason

Related

Is it possible to determine the exact EC2 targets of a CodeDeploy deployment via CloudWatch events?

AWS CodeDeploy's model defines an Application, which is a long-lived high-level object and represents software that needs to be deployed somewhere. An application can have many Deployment Groups, which represent targets (e.g. particular EC2 servers that have a particular combination of tags). A deployment is the release of one particular revision of software onto a deployment group defined within an application.
It is possible to get feedback on the progress of CodeDeploy via CloudWatch events. Given that EC2 servers can be up or down at the time of a deployment, and given that the tags on EC2 servers may vary over time, is there a way of determining from a CloudWatch CodeDeploy event the exact set of EC2 servers that were targeted by a particular deployment?
Specifically:
If a server is down at the time a deployment is launched, will it be targeted for release when it comes back up?
If I add a new server with identical tags to the first one after I have done the deployment, or I change the tags on the first server, will the CloudWatch event associated with my CodeDeploy event contain details of exactly which servers were targeted for deployment at the time, even if their current state means that they would not be targeted for deployment if I were to re-release the same deployment?
I tested a few scenarios using a simple CodeDeploy setup. The deployment group was defined by instance tags only (no ASG). My observations are as follows:
Server down at the time a deployment is launched
I simulated this scenario by having a stopped instance. The deployment hung on the stopped instance. It would probably time out if I let it hang for long enough. Once the instance was restarted, the deployment continued.
New instances started with the same tag
CodeDeploy did not detect them automatically. I had to redeploy the last deployment so that the new instances were detected and ran the up-to-date application version.
Changing a tag of an instance
The instance with changed tag is not included in a new deployment. Thus you end up with one instance running an old version of your application, while the rest run the new version.
Deployment id and list-deployment-targets AWS CLI
The list-deployment-targets command prints the IDs of the instances that the deployment targeted at the time of deployment. When you redeploy (the deployment ID does not change in this case), the list will contain the instances targeted by the redeployment. The original list of instances is lost.
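Since the original list is lost on redeployment, one option is to save the target list right after each deployment. The following is a minimal sketch under assumptions: the function names and the JSON file layout are hypothetical, and `codedeploy` is assumed to be a boto3 CodeDeploy client, whose `list_deployment_targets` call returns `targetIds` and an optional `nextToken`.

```python
import json


def list_all_deployment_targets(codedeploy, deployment_id):
    """Collect every target ID of a deployment, following pagination."""
    target_ids = []
    kwargs = {"deploymentId": deployment_id}
    while True:
        page = codedeploy.list_deployment_targets(**kwargs)
        target_ids.extend(page.get("targetIds", []))
        token = page.get("nextToken")
        if not token:
            return target_ids
        kwargs["nextToken"] = token


def snapshot_targets(codedeploy, deployment_id, path):
    """Write the current target list to a JSON file and return it."""
    targets = list_all_deployment_targets(codedeploy, deployment_id)
    with open(path, "w") as handle:
        json.dump({"deploymentId": deployment_id, "targetIds": targets}, handle, indent=2)
    return targets
```

Running `snapshot_targets` from whatever triggers the deployment keeps a record that survives later redeployments and tag changes.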
Note
Deployments to an ASG will behave differently, since CodeDeploy integrates with ASG through its lifecycle hooks.
Hope this helps.

No changes to app after redeployment to EC2 instance

I've got development and production instances in EC2. I've been updating my app in Visual Studio 2019 and redeploying it to the dev instance, then creating an AMI of that instance and using that image to update the production instance(s).
Suddenly my app no longer updates when I deploy to the dev instance. The logs all show the update was applied, but when I look at the files on the server they have not changed for days. I suspect I may be using AMIs incorrectly, but I'm not sure what I'm doing wrong.
How do I get my updates to show again?
You are facing this issue because creating an AMI from a running environment isn't the right approach: Elastic Beanstalk (EB) runs several scripts under the hood to attach instances to that particular environment.
Note: Custom AMIs are ideal only when you're installing a lot of dependencies or software that you want baked into your AMI so that subsequent deployments go through quickly. The documentation walks you through the steps; here is a summary:
The best approach would be to launch a standalone EC2 instance using an EB AMI as the base (ideally an AMI with HVM virtualization).
Connect to the instance with SSH or RDP.
Perform any customizations you want.
(Windows platforms) Run the EC2Config service Sysprep. For information about EC2Config, see Configuring a Windows Instance Using the EC2Config Service. Ensure that Sysprep is configured to generate a random password that can be retrieved from the AWS Management Console.
In the Amazon EC2 console, stop the EC2 instance. Then on the Instance Actions menu, choose Create Image (EBS AMI).
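The last step above (stop the instance, then create the image) can also be scripted instead of done in the console. This is a minimal sketch, assuming `ec2` is a boto3 EC2 client; the function name and the image name passed in are hypothetical, and the customization and Sysprep steps are assumed to have been done by hand beforehand.

```python
def stop_and_create_image(ec2, instance_id, image_name):
    """Stop the instance, wait until it is stopped, then create an AMI from it."""
    ec2.stop_instances(InstanceIds=[instance_id])
    # Creating the image from a stopped instance mirrors the console steps.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    response = ec2.create_image(InstanceId=instance_id, Name=image_name)
    return response["ImageId"]
```

The returned image ID is what would then be configured as the custom AMI for the environment.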

Is it impossible for AWS CodeDeploy to deploy across multiple Availability Zones at the same time?

As the screenshot below shows, the deployment seems to succeed in only one Availability Zone.
I checked the CodeDeploy logs on a failed instance and found an error. I think the instance is being recognized as an on-premises instance.
2018-01-10 04:40:22 INFO [codedeploy-agent(2696)]: On Premises config file does not exist or not readable
2018-01-10 04:40:43 ERROR [codedeploy-agent(2696)]: CodeDeploy Instance Agent Service: CodeDeploy Instance Agent Service: error during start or run: InstanceMetadata::InstanceMetadataError - Not an EC2 instance and region not provided in the environment variable AWS_REGION. Please specify your region using environment variable AWS_REGION.......
I've searched for about three days for this issue, but found no mention of it in the AWS documentation. In the production environment, I plan to use two Availability Zones attached to the Auto Scaling group. I wonder if I'm overlooking something other than CodeDeploy. What should I check? Thank you in advance.
[Updated]
I have updated the question with screenshots of the ASG and the ASG launch configuration. There is nothing special about them; it is the vanilla, default process. I have been waiting 5 days for a response from the AWS support center, and it is still pending.
Auto Scaling Group -----
Auto Scaling Group Launch Config -----
Finally, I found out why CodeDeploy failed across multiple Availability Zones on Windows 2016. This problem seems to be an issue with Windows 2016 EC2 itself rather than with the ASG or CodeDeploy (I have not tested it on Linux). I found two solutions:
Shut down the server safely by clicking the "Shutdown with Sysprep" button in Ec2LaunchSettings. Then you can create the AMI as usual.
Run the C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts\InitializeInstance.ps1 -Schedule script manually. The "-Schedule" argument is required. Then you can create the AMI as usual.
The first method is intuitive and convenient (GUI), and the second is appropriate for automating with a PowerShell script. I have confirmed that both methods succeed in deploying to multiple AZs. There were no errors in the logs recorded by codedeployagent.
To be more specific, codedeployagent writes various logs at deployment time, and I found that the agent seems to use metadata from 169.254.169.254. When the deployment failed, the log said "You are On-Premise Instance." The deployment probably fails because the instance cannot get its metadata. The following document helped me a lot, and both of my solutions are listed in it.
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2launch.html
In particular, the document says:
.....In Windows PowerShell, run the following command so that the system schedules the script to run as a Windows Scheduled Task. The script runs one time during the next boot and then disables these tasks from running again....
C:\ProgramData\Amazon\EC2-Windows\Launch\Scripts\InitializeInstance.ps1 -Schedule
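To confirm the diagnosis above on an instance launched from the AMI, a quick check of whether the metadata service at 169.254.169.254 answers at all can help. This is a minimal sketch using only the Python standard library: the function name is hypothetical, it uses the token-based (IMDSv2) request flow, and whether the CodeDeploy agent queries the endpoint in the same way is an assumption.

```python
import urllib.request

METADATA_BASE = "http://169.254.169.254/latest"


def fetch_instance_metadata(path, open_url=urllib.request.urlopen, timeout=2):
    """Return the metadata value at `path`, or None if the endpoint is unreachable."""
    try:
        # First request a short-lived session token.
        token_request = urllib.request.Request(
            METADATA_BASE + "/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with open_url(token_request, timeout=timeout) as response:
            token = response.read().decode()
        # Then read the requested metadata item using that token.
        data_request = urllib.request.Request(
            METADATA_BASE + "/meta-data/" + path,
            headers={"X-aws-ec2-metadata-token": token},
        )
        with open_url(data_request, timeout=timeout) as response:
            return response.read().decode()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Calling `fetch_instance_metadata("placement/availability-zone")` on a healthy instance should return its zone; a `None` result on an instance built from the AMI would be consistent with the missing-route problem that the InitializeInstance.ps1 -Schedule step is meant to address.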

Codedeploy with AWS ASG

I have configured an AWS ASG using Ansible to provision new instances and then install the CodeDeploy agent via a "user_data" script, in a similar fashion to what is suggested in this question:
Can I use AWS code Deploy for pulling application code while autoscaling?
CodeDeploy works fine and I can install my application onto the asg once it has been created. When new instances are triggered in the ASG via one of my rules (e.g. high cpu usage), the codedeploy agent is installed correctly. The problem is, CodeDeploy does not install the application on these new instances. I suspect it is trying to run before the user_data script has finished. Has anyone else encountered this problem? Or know how to get CodeDeploy to automatically deploy the application to new instances which are spawned as part of the ASG?
AutoScaling tells CodeDeploy to start the deployment before the user data is started. To get around this CodeDeploy gives the instance up to an hour to start polling for commands for the first lifecycle event instead of 5 minutes.
Since you are having problems with automatic deployments but not manual ones, and assuming that you didn't make any manual changes to your instances that you forgot about, there is most likely a dependency specific to your deployment that is not yet available at the time the instance launches.
Try listing out all the things that your deployment needs to succeed and make sure that each of those is available before you install the host agent. If you can log onto the instance fast enough (before AutoScaling terminates the instance), you can try and grab the host agent logs and your application's logs to find out where the deployment is failing.
If you think the host agent is failing to install entirely, make sure you have Ruby 2.0 installed. It should be there by default on Amazon Linux, but Ubuntu and RHEL need to have it installed as part of the user data before you can install the host agent. There is an installer log in /tmp that you can check for problems in the initial install (again, you have to be quick to grab the log before the instance terminates).
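The advice to make sure every dependency is available before installing the host agent can be sketched as a small check that user data runs first. The function names are hypothetical and the list of commands to check (for example `ruby`) is an assumption that depends on your deployment.

```python
import shutil
import time


def missing_prerequisites(commands, which=shutil.which):
    """Return the commands from `commands` that are not found on the PATH."""
    return [command for command in commands if which(command) is None]


def wait_for_prerequisites(commands, attempts=10, delay_seconds=5,
                           which=shutil.which, sleep=time.sleep):
    """Poll until every command is available; return False if attempts run out."""
    for attempt in range(attempts):
        if not missing_prerequisites(commands, which=which):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A user data script could call `wait_for_prerequisites(["ruby"])` and only proceed to the agent install when it returns True, logging the output of `missing_prerequisites` otherwise.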

How to deploy to autoscaling group with only one active node without downtime

There are two questions about AWS autoscaling + deployment which I cannot clearly answer:
I'm currently trying to figure out what the best strategy is to deploy, without downtime, to an EC2 instance behind an ELB that is the only member of an autoscaling group.
Currently the EC2 setup is done with Puppet, including the deployment of the application, triggered after a successful build by Jenkins.
The best solution I have found is to check via a script how many instances are registered with the ELB. If only a single one is registered, spawn a new one, which runs Puppet on startup (so the new node will be up to date), and kill the old node.
How to deploy (autoscaling EC2 behind an ELB) without delivering two different versions of the application?
Possible solution: Check via a script how many EC2 instances are registered with the ELB, spawn the same number of instances, register all new instances and unregister all old ones.
My experience with AWS has taught me that AWS has a service for everything. So is there a service that meets my requirements and makes my own solutions unnecessary?
You can create an entirely new environment with its own ELB and when it's ready and checked, you switch the DNS record to the new ELB.
Note, however, that for a brief time (60 seconds or so, depending on the TTL of your DNS record) some users will see your old version while others will see the new version.
In the end there were two possible solutions. Both of them would temporarily deliver two versions of the app.
Use AWS CodeDeploy to perform a sequential deployment (one instance after another). This solution offers the possibility to roll back to a previous state and visually shows the state and results of the deployment.
Create a Python script to get the registered nodes (using Boto) and run the appropriate Puppet script on them (using Fabric). This solution offers more control over the deployment but requires some time to build the scripts. There can also be bugs.
For now I chose AWS CodeDeploy because it's already available and, hopefully, well tested.
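For comparison, the scripted option (list the nodes registered with the ELB, register replacements, then unregister the old ones) might look like the sketch below. It is a sketch under assumptions: the function names are hypothetical, `elb` is assumed to be a boto3 classic ELB client (`boto3.client("elb")`), and launching the new instances and running Puppet on them are left out.

```python
def registered_instance_ids(elb, load_balancer_name):
    """Return the IDs of the instances currently registered with the ELB."""
    response = elb.describe_load_balancers(LoadBalancerNames=[load_balancer_name])
    description = response["LoadBalancerDescriptions"][0]
    return [instance["InstanceId"] for instance in description["Instances"]]


def swap_instances(elb, load_balancer_name, new_instance_ids):
    """Register the new instances, wait for them, then deregister the old ones."""
    old_ids = registered_instance_ids(elb, load_balancer_name)
    new_instances = [{"InstanceId": i} for i in new_instance_ids]
    elb.register_instances_with_load_balancer(
        LoadBalancerName=load_balancer_name, Instances=new_instances
    )
    # Only remove the old nodes once the new ones pass the ELB health check.
    elb.get_waiter("instance_in_service").wait(
        LoadBalancerName=load_balancer_name, Instances=new_instances
    )
    stale_ids = [i for i in old_ids if i not in new_instance_ids]
    if stale_ids:
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=load_balancer_name,
            Instances=[{"InstanceId": i} for i in stale_ids],
        )
    return stale_ids
```

Between registration and deregistration both versions are in service, which matches the observation above that either approach temporarily delivers two versions of the app.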