I am deploying an Amazon Linux AMI to EC2, and have the following directive in my user_data:
packages:
- amazon-efs-utils
mounts:
- [ "fs-12345678:/", "/mnt/efs", "efs", "tls", "0", "0" ]
I am expecting this to add the appropriate line to my /etc/fstab and mount the Amazon EFS filesystem. However, this does not work; instead I see the following in /var/log/cloud-init.log:
May 10 15:16:51 cloud-init[2524]: cc_mounts.py[DEBUG]: Attempting to determine the real name of fs-12345678:/
May 10 15:16:51 cloud-init[2524]: cc_mounts.py[DEBUG]: Ignoring nonexistent named mount fs-12345678:/
May 10 15:16:51 cloud-init[2524]: cc_mounts.py[DEBUG]: changed fs-12345678:/ => None
If I manually add the expected entry to my /etc/fstab, I can indeed mount the filesystem as expected.
I've found a couple of bugs online that talk about similar things, but they're all either not quite the same problem, or they claim to be patched and fixed.
I need this filesystem to be mounted by the time I start executing scripts via the cloud_final_modules stage, so it would be highly desirable to have the mounts: directive work rather than having to do nasty, hacky things in my later startup scripts.
Can anybody suggest what I am doing wrong, or if this is just not supported?
It is clear that the cloud-init mounts module (cc_mounts) does not support the EFS "fs-12345678:/" device name: as your log shows, it tries to resolve it to a real device, fails, and ignores the entry.
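One workaround (a sketch only, not something I have tested against your exact setup) is to leave the EFS entry out of mounts: and instead append the fstab line and mount it from runcmd, which is also executed during the cloud_final_modules stage; the filesystem ID and mount point are the ones from the question, and you would need to check that the runcmd script runs before your own final-stage scripts:
#cloud-config
packages:
- amazon-efs-utils
runcmd:
# Create the mount point, add the fstab entry that cc_mounts refuses to write,
# then mount it; mount.efs from amazon-efs-utils understands the tls option.
- mkdir -p /mnt/efs
- echo 'fs-12345678:/ /mnt/efs efs tls 0 0' >> /etc/fstab
- mount /mnt/efs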
Related
We are using Amazon Elastic Container Service (ECS) to spin up a cluster with autoscaling groups. Until very recently this has been working fine, and generally it is still working fine... except that we are no longer able to connect to the underlying EC2 instances over SSH with our keypair. We get SSH "permission denied" errors, which is relatively new (a matter of weeks), and we have changed nothing. By contrast, we can spin up an EC2 instance directly and have no problem using SSH with the same keypair.
What I have done to investigate:
Drained the ECS cluster, detached the instance from it, and stopped it.
Detached the instance's root volume and attached it to a different EC2 instance.
Observed that /home/ec2-user/.ssh does not exist.
Found the following error in the instance's /var/log/cloud-init.log:
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh - wb: [644] 20 bytes
Oct 30 23:23:09 cloud-init[3195]: helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/i-0e13e9da194d2624a/sem/config_ssh'>)
Oct 30 23:23:09 cloud-init[3195]: util.py[WARNING]: Applying ssh credentials failed!
Oct 30 23:23:09 cloud-init[3195]: util.py[DEBUG]: Applying ssh credentials failed!
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_ssh.py", line 184, in handle
ssh_util.DISABLE_USER_OPTS)
AttributeError: 'module' object has no attribute 'DISABLE_USER_OPTS'
Oct 30 23:23:09 cloud-init[3195]: handlers.py[DEBUG]: finish: init-network/config-ssh: SUCCESS: config-ssh ran successfully
Examined the Python source code for /usr/lib/python2.7/site-packages/cloudinit. It looks OK to me; I see the reference in config/cc_ssh.py to ssh_util.DISABLE_USER_OPTS and it looks like ssh_util.py does indeed contain DISABLE_USER_OPTS as a file-level variable. (But I am not a master Python programmer, so I might be missing something subtle.)
Curiously, the compiled versions of ssh_util.py and cc_ssh.py date from October 16, which raises all sorts of red flags, because we had not seen any problems with ssh until recently. But I loaded uncompyle6 and decompiled those files, and the decompiled versions seem to be OK, too.
Looking at cloud-init, it's pretty clear that if the reference to ssh_util.DISABLE_USER_OPTS throws an exception, the .ssh directory won't be configured for ec2-user, so I understand what's happening.
What I don't understand is why? Has anyone else experienced issues with cloud-init with recently-created EC2 instances under ECS, and found a workaround?
For reference, we are using AMI amzn2-ami-ecs-hvm-2.0.20190815-x86_64-ebs (ami-0b16d80945b1a9c7d)
in us-east-1, and we had certainly not seen these issues as far back as August 15. I assume that some cloud-init change the instance picks up via a yum update explains the new behavior and the changed modification dates of the compiled Python modules in cloud-init.
I should also add that the EC2 instance I spun up to mount the root volume of the ECS-created instance has subtly-different cloud-init code. In particular, the cc_ssh.py module doesn't refer to ssh_util.DISABLE_USER_OPTS but rather to a local DISABLE_ROOT_OPTS variable. So this is all suspicious.
I have diagnosed this problem in a specific AWS deployment on an Amazon Linux 2 AMI. The root cause is running yum update from user_data during EC2 instance startup, which upgrades cloud-init while cloud-init itself is still running.
The user_data associated with an ECS launch configuration is executed by cloud-init, and our user_data initialization code included a yum update. Amazon has released a new version of cloud-init, 18.5-2amzn2, which is not yet baked into the AMI images (they ship cloud-init 18.2-72-amzn2.07), so the yum update upgrades cloud-init to 18.5-2amzn2. Analysis of the Python code for the 18.5-2amzn2 version shows that it includes a commit (https://github.com/number5/cloud-init/commit/757247f9ff2df57e792e29d8656ac415364e914d) which adds an attribute to ssh_util that is not present in the prior version. Ordinarily yum would produce a consistent cloud-init installation, as verified on a standalone EC2 instance. However, because the update happens while cloud-init is already running, the result is inconsistent: the ssh_util module seen by the running cloud-init is apparently not updated, so it cannot provide the DISABLE_USER_OPTS value added in the aforementioned commit.
So the problem was indeed the yum update command invoked from within cloud-init, which was updating cloud-init itself while it was in use.
I should point out that we were using Amazon EFS on our nodes and were following the exact instructions Amazon gives on its help page for using EFS with ECS, which include the yum update call in the user data script.
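One possible mitigation (not from our deployment; just a sketch, assuming your patching policy allows skipping this one package during first boot) is to keep the yum update in user_data but exclude cloud-init from it, so the running cloud-init is never replaced underneath itself:
#!/bin/bash
# user_data excerpt: update all packages except cloud-init; cloud-init can be
# updated later, e.g. when the AMI is next refreshed or on a subsequent boot,
# so that the copy currently executing this script is left untouched.
yum update -y --exclude=cloud-init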
My workflow:
Packer -> Kick off a kickstart build of my system on vCloud
Packer -> Export VM as OVA
Packer -> Upload OVA to s3 bucket in AWS
Packer -> Ask AWS to convert my OVA to AMI
Manual -> Launch AMI instance
When I launch my AMI instance, I start cloud-init early on, after the networking service has started. But cloud-init doesn't configure my interfaces, because this file is present:
/etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
This file tells cloud-init to bypass the network configuration and the end result is that my instance is unreachable.
I know cloud-init runs because it sets my hostname in another file that I have defined in my custom distro script. The host gets its hostname from AWS during boot, so I know the NIC is functional! I can also see in /var/log/messages that it gets an IP via DHCP.
Config excerpt:
system_info:
  # This will affect which distro class gets used
  distro: cent1
network:
  renderers: ['cent1']
Basically, the distro script is run (cloud-init/cloudinit/distros/cent1.py) but the renderers script isn't (cloud-init/cloudinit/net/cent1.py).
I have searched through the packer code base and the cloud-init code base as well as my own code base and nowhere is the actual creation or moving of such a file present. The only place the file is mentioned is a comment in the cloud-init source. The following comment is found in cloudinit/distros/debian.py (my instance is CentOS but the comment explains what the presence of this file does):
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
When I stop my instance and mount its volume on another system, I can see the 99-disable-network-config.cfg file is present. Even more confusing is that the top of the file says:
[root@host ~]# cat /root/drop1-3/etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
# Automatically generated by the vm import process
network: {config: disabled}
When I do a google search, I see other references to the string # Automatically generated by the vm import process.
For example, this post has such a reference.
Another bit of info: if I patch cloud-init's util.py to remove 99-disable-network-config.cfg just before it checks for that file, everything works exactly as it should. Networking gets configured and I can SSH to my instance.
I'm not putting the 99-disable-network-config.cfg file there. I don't see anything in packer's source suggesting that it's putting that file there. I don't see anything in cloud-init's source suggesting that it's putting that file there.
So the question is, where is that file coming from? I already have a work-around but I'd like to understand the root cause. I have not been able to find the root cause of why that file is present.
(Sorry this is so long-winded, but I've been staring at this problem for days and have zero solutions other than the heavy-handed workaround.)
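For completeness, the heavy-handed workaround can also be applied as a one-off against the detached root volume mounted on another instance (the /root/drop1-3 mount point is the one from the cat output above); this only removes the symptom, assumes nothing recreates the file at boot, and says nothing about where the file comes from:
# With the affected instance stopped and its root volume mounted on another
# system under /root/drop1-3, delete the file that disables cloud-init's
# network configuration, then move the volume back and start the instance.
rm /root/drop1-3/etc/cloud/cloud.cfg.d/99-disable-network-config.cfg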
I have an EC2 instance with a 20GB root volume. I attached an additional 200GB volume for partitioning to comply with DISA/NIST 800-53 STIG by creating separate partitions for directories such as /home, /var, and /tmp, as well as others required by company guidelines. Using Red Hat Enterprise Linux 7.5. I rarely ever do this and haven't done it for years so I'm willing to accept stupid solutions.
I've followed multiple tutorials using various methods and get the same result each time. The short version (from my understanding) is that the OS cannot access certain files/directories on these newly mounted partitions.
Example: When I mount say /var/log/audit, the audit daemon will fail.
"Job for auditd.service failed because the control process exited with error code. See..."
systemctl status auditd says "Failed to start Security Auditing Service". I am also unable to log in via public key when I mount /home, but these types of problems go away when I unmount the new partitions. journalctl -xe reveals "Could not open dir /var/log/audit (Permission denied)".
Permission for each dir is:
drwxr-xr-x root root /var
drwxr-xr-x root root /var/log
drws------ root root /var/log/audit
Which is consistent with the base OS image, which DOES work when the partition isn't mounted.
What I did (one iteration is sketched as commands below):
-Created an EC2 instance with 2 volumes (EBS)
-Partitioned the new volume, /dev/xvdb
-Formatted the partitions as ext4
-Created /etc/fstab entries for the partitions
-Mounted the partitions to a temp place in /mnt, then copied the contents using rsync -av <source> <dest>
-Unmounted the new partitions and updated fstab to reflect the actual mount locations, e.g. /var/log/audit
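For concreteness, one iteration of those steps looked roughly like this (the device name, ext4, and the temporary mount point are examples; the same pattern was repeated for /home, /var, /tmp, etc.):
#!/bin/bash
# Example for a single partition: /dev/xvdb1 destined for /var/log/audit.
mkfs -t ext4 /dev/xvdb1
mkdir -p /mnt/audit
mount /dev/xvdb1 /mnt/audit
# Copy the existing contents onto the new partition.
rsync -av /var/log/audit/ /mnt/audit/
umount /mnt/audit
# Mount the partition at its final location via fstab.
echo '/dev/xvdb1 /var/log/audit ext4 defaults 0 2' >> /etc/fstab
mount /var/log/audit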
I've tried:
-Variations such as different disk utilities (e.g. fdisk, parted)
-Using different partition schemes (GPT, DOS [default], Windows basic [default for the root volume, not sure why], etc.)
-Using the root volume from an EC2, detaching, attaching to other instance as a 2nd volume, and only repartitioning
Ownership, permissions, and directory structures are exactly the same, as far as I can tell, between the original directory and the directory created on the new partitions.
I know it sounds stupid, but did you try restarting the instance after adding the new mounts?
The reason I suggest this is that Linux caches directory/file path-to-inode mappings. When you change mounts, I am not sure whether that cache is invalidated, and that could possibly be the reason for the errors.
Also, though it is for Ubuntu, have a look at: https://help.ubuntu.com/community/Partitioning/Home/Moving
I am attempting to limit the number of successful CodeDeploy revisions that are preserved on the EC2 instances by editing the codedeployagent.yml file's max_revisions value. I have currently set the value to :max_revisions: 2.
I believe that the issue I am having is due to the way I am setting the value in the file. I am attempting to set it by deploying the file with the CodeDeploy package. To do this I have created a custom codedeployagent.yml file locally at the following location:
etc/codedeploy-agent/conf/codedeployagent.yml
In my appspec.yml file I am specifying the installation location of this file by the following lines:
- source: etc/codedeploy-agent/conf/codedeployagent.yml
  destination: /etc/codedeploy-agent/conf
I have found that this errors out when I attempt to deploy, due to the file already being in place. To work around this, I have added a script, hooked to BeforeInstall in my appspec.yml, that removes the existing file before the package is installed:
#!/bin/bash
sudo rm /etc/codedeploy-agent/conf/codedeployagent.yml
Okay, so after this I have ssh'd into the server and, sure enough, the :max_revisions: 2 value is set as expected. Unfortunately, in practice I am seeing many more revisions than just two being preserved on the EC2 instances.
So, to go back to the beginning of my question here… Clearly this workaround is not the best way to update the codedeployagent.yml file. I should add that I am deploying to an auto scaling group, so this needs to be a solution that can live in the deployment scripts or cloud formation templates rather than just logging in and hardcoding the value. With all this info, what am I missing here? How can I properly limit the revisions? Thanks.
Have you restarted the agent after updating the config file? New configuration values won't take effect until you restart the agent.
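For example, on Amazon Linux (a minimal sketch using the standard service name):
# Restart the CodeDeploy agent so it rereads codedeployagent.yml, then verify it is running.
sudo service codedeploy-agent restart
sudo service codedeploy-agent status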
You may try one of the approaches below.
Take an AMI of an instance where you have already modified max_revisions to 2 and update the ASG's launch configuration with this AMI, so that scaled-out instances will also have this config.
Alternatively, add this config in your UserData section while creating the launch configuration.
Commands to add in the UserData section:
"UserData" : { "Fn::Base64" : { "Fn::Join" : ["", [
"#!/bin/bash -xe\n",
"# Delete last line and add new line \n",
"sed '$ d' /etc/codedeploy-agent/conf/codedeployagent.yml > /etc/codedeploy-agent/conf/temp.yml\n",
"echo ':max_revisions: 2' >> /etc/codedeploy-agent/conf/temp.yml\n",
"rm -f /etc/codedeploy-agent/conf/codedeployagent.yml\n",
"mv /etc/codedeploy-agent/conf/temp.yml /etc/codedeploy-agent/conf/codedeployagent.yml\n",
"service codedeploy-agent restart\n"
]]}}
As per the reference, max_revisions applies per application per deployment group, so it keeps only 2 revisions under /opt/codedeploy-agent/deployment-root/<deployment_group_id>/. If the ASG is associated with multiple applications, CodeDeploy will store 2 revisions of each application in its deployment_group_id directory.
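A quick hypothetical spot check after a few deployments (substitute your actual deployment group ID for the placeholder):
# Each deployment group directory should now hold at most 2 revision directories.
ls /opt/codedeploy-agent/deployment-root/<deployment_group_id>/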
So I'm having some adventures with the vagrant-aws plugin, and I'm now stuck on the issue of syncing folders. This is necessary to provision the machines, which is the ultimate goal. However, running vagrant provision on my machine yields
[root@vagrant-puppet-minimal vagrant]# vagrant provision
[default] Rsyncing folder: /home/vagrant/ => /vagrant
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
mkdir -p '/vagrant'
I'm almost positive the error occurs because ssh-ing in manually and running that command yields 'permission denied' (obviously, a non-root user is trying to make a directory in the root directory). I tried ssh-ing in as root, but that seems like bad practice (and Amazon doesn't like it). How can I change the folder to be rsynced with vagrant-aws? I can't seem to find a setting for that. Thanks!
Most likely you are running into the known vagrant-aws issue #72: Failing with EC2 Amazon Linux Images.
Edit 3 (Feb 2014): Vagrant 1.4.0 (released Dec 2013) and later versions now support the boolean configuration parameter config.ssh.pty. Set the parameter to true to force Vagrant to use a PTY for provisioning. Vagrant creator Mitchell Hashimoto points out that you must not set config.ssh.pty on the global config, you must set it on the node config directly.
This new setting should fix the problem, and you shouldn't need the workarounds listed below anymore. (But note that I haven't tested it myself yet.) See Vagrant's CHANGELOG for details -- unfortunately the config.ssh.pty option is not yet documented under SSH Settings in the Vagrant docs.
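For illustration, setting the option on the node config looks roughly like this (a sketch based on the CHANGELOG note; the machine name "aws-node" is just a placeholder, and as said I haven't verified it myself yet):
Vagrant.configure("2") do |config|
  config.vm.define "aws-node" do |node|
    # Force a PTY for SSH commands on this machine only; per Mitchell's note,
    # do not set this on the global config object.
    node.ssh.pty = true
  end
end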
Edit 2: Bad news. It looks as if even a boothook is not guaranteed to run (and update /etc/sudoers.d/ for !requiretty) before Vagrant tries to rsync. During my testing today I started seeing sporadic "mkdir -p /vagrant" errors again when running vagrant up --no-provision. So we're back to the previous point where the most reliable fix seems to be a custom AMI image that already includes the applied patch to /etc/sudoers.d.
Edit: Looks like I found a more reliable way to fix the problem. Use a boothook to perform the fix. I manually confirmed that a script passed as a boothook is executed before Vagrant's rsync phase starts. So far it has been working reliably for me, and I don't need to create a custom AMI image.
Extra tip: And if you are relying on cloud-config, too, you can create a Mime Multi Part Archive to combine the boothook and the cloud-config. You can get the latest version of the write-mime-multipart helper script from GitHub.
Usage sketch:
$ cd /tmp
$ wget https://raw.github.com/lovelysystems/cloud-init/master/tools/write-mime-multipart
$ chmod +x write-mime-multipart
$ cat boothook.sh
#!/bin/bash
SUDOERS_FILE=/etc/sudoers.d/999-vagrant-cloud-init-requiretty
echo "Defaults:ec2-user !requiretty" > $SUDOERS_FILE
echo "Defaults:root !requiretty" >> $SUDOERS_FILE
chmod 440 $SUDOERS_FILE
$ cat cloud-config
#cloud-config
packages:
- puppet
- git
- python-boto
$ ./write-mime-multipart boothook.sh cloud-config > combined.txt
You can then pass the contents of 'combined.txt' to aws.user_data, for instance via:
aws.user_data = File.read("/tmp/combined.txt")
Sorry for not mentioning this earlier, but I am literally troubleshooting this right now myself. :)
Original answer (see above for a better approach)
TL;DR: The most reliable fix is to "patch" a stock Amazon Linux AMI image, save it and then use the customized AMI image in your Vagrantfile. See below for details.
Background
A potential workaround is described (and linked in the bug report above) at https://github.com/mitchellh/vagrant-aws/pull/70/files. In a nutshell, add the following to your Vagrantfile:
aws.user_data = "#!/bin/bash\necho 'Defaults:ec2-user !requiretty' > /etc/sudoers.d/999-vagrant-cloud-init-requiretty && chmod 440 /etc/sudoers.d/999-vagrant-cloud-init-requiretty\nyum install -y puppet\n"
Most importantly this will configure the OS to not require a tty for user ec2-user, which seems to be the root of the problem. I /think/ that the additional installation of the puppet package is not required for the actual fix (although Vagrant may use Puppet for provisioning the machine later, depending on how you configured Vagrant).
My experience with the described workaround
I have tried this workaround but Vagrant still occasionally fails with the same error. It might be a "race condition" where Vagrant happens to run its rsync phase faster than cloud-init (which is what aws.user_data feeds into) can apply the workaround for #72 on the machine. If Vagrant is faster you will see the same error; if cloud-init is faster it works.
What will work (but requires more effort on your side)
What definitely works is to run the command on a stock Amazon Linux AMI image, and then save the modified image (= create an image snapshot) as a custom AMI image of yours.
# Start an EC2 instance with a stock Amazon Linux AMI image and ssh-connect to it
$ sudo su - root
$ echo 'Defaults:ec2-user !requiretty' > /etc/sudoers.d/999-vagrant-cloud-init-requiretty
$ chmod 440 /etc/sudoers.d/999-vagrant-cloud-init-requiretty
# Note: Installing puppet is mentioned in the #72 bug report but I /think/ you do not need it
# to fix the described Vagrant problem.
$ yum install -y puppet
You must then use this custom AMI image in your Vagrantfile instead of the stock Amazon one. The obvious drawback is that you are not using a stock Amazon AMI image anymore -- whether this is a concern for you or not depends on your requirements.
What I tried but didn't work out
For the record: I also tried to pass a cloud-config to aws.user_data that included a bootcmd to set !requiretty in the same way as the embedded shell script above. According to the cloud-init docs bootcmd is run "very early" in the startup cycle for an EC2 instance -- the idea being that bootcmd instructions would be run earlier than Vagrant would try to run its rsync phase. But unfortunately I discovered that the bootcmd feature is not implemented in the outdated cloud-init version of current Amazon's Linux AMIs (e.g. ami-05355a6c has cloud-init 0.5.15-69.amzn1 but bootcmd was only introduced in 0.6.1).
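For reference, the cloud-config I tried was essentially the bootcmd equivalent of the shell script above (sketched here from that description; as explained, it does not help on AMIs whose cloud-init predates bootcmd support):
#cloud-config
bootcmd:
- echo 'Defaults:ec2-user !requiretty' > /etc/sudoers.d/999-vagrant-cloud-init-requiretty
- chmod 440 /etc/sudoers.d/999-vagrant-cloud-init-requiretty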