Google Cloud Compute Engine: sudo broke after "dnf upgrade" on Centos 8 - google-cloud-platform

The company I'm working with is developing a web application based on the Laravel framework, using Google Cloud Platform infrastructure. The frontend VM runs CentOS 8 with the Apache web server installed.
It seems that a developer ran a pretty massive "dnf upgrade" which included kernel, openssl, kerberos and other packages.
After the upgrade, ldconfig seems to have lost its mind:
[developer@webserver ~]$ sudo su - root
sudo: error in /etc/sudo.conf, line 19 while loading plugin "sudoers_policy"
sudo: unable to load /usr/libexec/sudo/sudoers.so: /lib64/libldap-2.4.so.2: undefined symbol: EVP_md4, version OPENSSL_1_1_0
sudo: fatal error, unable to load plugins
The same happens for other commands like "dnf" or "rpm":
[developer@webserver ~]$ rpm
rpm: symbol lookup error: /lib64/librpmio.so.8: undefined symbol: EVP_md2, version OPENSSL_1_1_0
After a bit of investigation, I found that the same commands work when the LD_LIBRARY_PATH variable is specified:
[developer@webserver ~]$ LD_LIBRARY_PATH=/lib64 rpm
RPM version 4.14.3
Copyright (C) 1998-2002 - Red Hat, Inc.
This program may be freely redistributed under the terms of the GNU GPL
...
...of course, I can't use the same trick with the "sudo" command, since it is a setuid binary and the dynamic linker ignores LD_LIBRARY_PATH for it.
An important fact is that the VM is still running and has never been rebooted (I'll explain later why I'm saying this).
(And finally... the point.)
The major problem is that we can't use the root account because "sudo" is not working and, by default, Google uses public key authentication (local users have random passwords generated by GCP). So I can't even run a "dnf reinstall" to try to fix the issue.
I was afraid that, once rebooted, every service would stop working because of the broken library paths, so instead of rebooting I created an image based on the VM and then a new VM based on that image.
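For reference, a rough sketch of the equivalent gcloud commands (the image/instance names and zone are placeholders; --force is needed because the source disk is still attached to the running VM):
# Create an image from the running VM's boot disk
gcloud compute images create webserver-rescue-image \
    --source-disk=webserver --source-disk-zone=europe-west1-b --force
# Create a clone VM from that image to experiment on
gcloud compute instances create webserver-clone \
    --image=webserver-rescue-image --zone=europe-west1-b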
As I expected, once the new VM booted, every service stopped working. I was able to read the logs over the serial console in the GCP web interface.
A snippet:
...
Oct 27 20:20:30 webserver google_oslogin_nss_cache[783]: /usr/bin/google_oslogin_nss_cache: /lib64/libjson-c.so.4: no version information available (required by /usr/bin/google_oslogin_nss_cache)
Oct 27 20:20:30 webserver NetworkManager[778]: /usr/sbin/NetworkManager: symbol lookup error: /lib64/libldap-2.4.so.2: undefined symbol: EVP_md4, version OPENSSL_1_1_0
Oct 27 20:20:30 webserver google_oslogin_nss_cache[783]: /usr/bin/google_oslogin_nss_cache: symbol lookup error: /lib64/libldap-2.4.so.2: undefined symbol: EVP_md4, version OPENSSL_1_1_0
Oct 27 20:20:30 webserver sssd[771]: ldb: unable to dlopen /usr/lib64/ldb/modules/ldb/ldap.so : /lib64/libldap-2.4.so.2: undefined symbol: EVP_md4, version OPENSSL_1_1_0
...
Using the official Google documentation, I found the "startup-script" section of the VM properties, which is run at every boot and can be used to "change" users' passwords.
I know that, by default, all VMs have root access disabled, so I wrote this and added it to the VM's "automation" script:
#! /bin/bash
echo 'developer:PASSWORD' | chpasswd
echo 'root:PASSWORD' | chpasswd
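For reference, the same startup script can also be attached as instance metadata with gcloud; a hedged sketch, where the instance name, zone and passwords are placeholders:
gcloud compute instances add-metadata webserver \
    --zone=europe-west1-b \
    --metadata startup-script='#! /bin/bash
echo "developer:NEW_PASSWORD" | chpasswd
echo "root:NEW_PASSWORD" | chpasswd'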
Once rebooted, I tried to log in using the "serial console" option in the web interface, but with no luck. I also used journalctl (as a normal user) to find something in the logs... but nothing.
I suppose this is a consequence of that "google_oslogin_nss_cache" error.
There seems to be no way to get that script to run.
Searching on the internet, I found some posts where someone was able to log in directly as "root" using the "gcloud compute ssh" command. So I tried to log in as described from another VM of the same project, using both my Google account user and the root user... but no luck this way either.
(I forgot to mention that my Google account has the "project owner" role, so I have all the necessary permissions.)
Is there another way to reset the "root" password without using "sudo", or do I have to rebuild the VM from scratch?
I'm sorry for the long explanation....hope that everything is clear
Thanks

So... this question actually breaks down into 2 different issues:
The only possible way for me to recover the "root" account was to stop the VM, detach the boot disk, attach it to a new VM, mount the filesystem and modify the user. Once the boot disk is reattached to the original VM, you can use the modified account.
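A minimal sketch of that rescue procedure with gcloud; the instance/disk names, zone and partition device are assumptions, so check them (e.g. with lsblk) before copying:
# Stop the broken VM and move its boot disk to a rescue VM
gcloud compute instances stop webserver --zone=europe-west1-b
gcloud compute instances detach-disk webserver --disk=webserver --zone=europe-west1-b
gcloud compute instances attach-disk rescue-vm --disk=webserver --zone=europe-west1-b
# On the rescue VM: mount the root filesystem and reset the password inside it
sudo mkdir -p /mnt/rescue
sudo mount /dev/sdb1 /mnt/rescue          # partition name is an assumption
sudo chroot /mnt/rescue passwd root
sudo umount /mnt/rescue
# Give the boot disk back to the original VM and start it
gcloud compute instances detach-disk rescue-vm --disk=webserver --zone=europe-west1-b
gcloud compute instances attach-disk webserver --disk=webserver --boot --zone=europe-west1-b
gcloud compute instances start webserver --zone=europe-west1-b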
The second issue was caused by upgrading openssl... in the end, the only way to get rid of those error messages was to create a new file, /etc/ld.so.conf.d/libc.conf, containing:
/usr/lib64
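A minimal sketch of applying that fix, assuming root (or working sudo) access has already been recovered as described above:
echo '/usr/lib64' | sudo tee /etc/ld.so.conf.d/libc.conf
sudo ldconfig          # rebuild /etc/ld.so.cache so the new path is picked up
rpm --version          # should now work without setting LD_LIBRARY_PATH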

Related

How to Connect Secure Shell App to a Google Cloud VM Instance

I would like to connect to a Google Cloud VM instance using Secure Shell App (SSA). I assumed this would be easy, as these are both Google products, and I had no problem connecting SSA to a Digital Ocean Droplet before. I found Google's own documentation for doing so here, and it looked easy enough to follow. However, the following link in the instructions, Providing public SSH keys to instances, leads down a rabbit hole of confusing and seemingly self-contradicting information. I tried to follow it the best I could but kept running into errors. I have searched in vain for better instructions and am still astounded that Google has made it so hard to connect their own products. Is it really this hard to make this work? Are there any better instructions out there? If not, would someone be willing to write up clear and simple instructions?
Please follow these step-by-step instructions:
create a new VM instance-1
connect to it with gcloud compute ssh instance-1 (as mentioned by @John Hanley)
check ~/.ssh folder
$ ls -l ~/.ssh
-rw------- 1 user usergroup 1856 Dec 9 17:12 google_compute_engine
-rw-r--r-- 1 user usergroup 417 Dec 9 17:12 google_compute_engine.pub
copy keys
cp ~/.ssh/google_compute_engine.pub mykey.pub
cp ~/.ssh/google_compute_engine mykey
follow instructions from step 7 - create connection and import identity
(optional) if you don't find your mykey in the Identity list, try to connect anyway (it ends with an error, as expected), then restart Secure Shell App and check the Identity menu again (the keys should be there without redoing the import)
After that, I successfully connected to my VM via Secure Shell App.

using oslogin on gcp with osAdminLogin role a user can't sudo on the instance

I have some GCP users with the roles:
* compute.instances.osAdminLogin
* iam.serviceAccountUser
They connect through SSH using the GCP web interface in Compute Engine.
When they do sudo ls, for some users a password is requested and for others it is not.
In the folder /var/google-sudoers.d/, for the users that can sudo without the prompt, we can read in their file:
user_name ALL=(ALL) NOPASSWD: ALL
For the others, the files are empty.
OS information:
uname -a
Linux xxx 4.15.0-1027-gcp #28~16.04.1-Ubuntu SMP Fri Jan 18 10:10:51 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
For the same users, on another VM in the same GCP project, they can all sudo.
I would expect all users with the same roles to have the same sudo behaviour on all instances.
What should I do for my users to be able to sudo? (Other than overwriting the empty files in /var/google-sudoers.d/ - that works, but may not be stable.)
I had a similar problem on a project that was originally set up with the legacy login system (based on SSH keys stored in instance or project metadata). When I converted the project to use OS Login, I lost the ability to sudo without a password on one VM instance. This was a major problem, since I had never set a password for my user account, and therefore was unable to sudo to troubleshoot the problem.
Things I tried that did NOT work:
Rebooting the instance
Explicitly adding role roles/compute.osAdminLogin to my IAM account (I was already a project owner)
I solved the problem by editing the project compute engine metadata to disable OS Login. After disabling, I confirmed that I was able to log into the problematic instance and sudo without a password. I then edited the project metadata again to re-enable OS Login. This time, passwordless sudo worked on the problematic instance. It appears that the instance was not fully reconfigured the first time I switched from legacy login to OS Login.
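As a rough sketch, the project-level toggle can be done with gcloud (enable-oslogin is the documented metadata key; the rest is just the sequence described above):
# Disable OS Login at the project level
gcloud compute project-info add-metadata --metadata enable-oslogin=FALSE
# ...log into the problematic instance and confirm passwordless sudo works...
# Re-enable OS Login
gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE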

Unable to sudo to Deep Learning Image

I installed the latest Google Cloud Deep Learning VM Image today. After the VM was launched, I was able to do sudo -i successfully via the web SSH.
Once logged in, I started my Tensorflow model training in the background (using &). A few hours later I was unable to log in as root.
I get the following message:
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
[sudo] password for my_username:
I tried:
sudo -i
su sudo -i
su root
I was able to replicate the issue. Any suggestions?
This issue was caused by an internal Google-side change that removes the user from the "google-sudoers" group. For all affected instances, I suggest the workaround below until the permanent fix has been rolled out.
Use a different username:
If using browser SSH window, click on the settings icon (top right), and click change Linux name in the drop down.
Using the SDK
$ gcloud compute ssh newusername@instance
Enable OS Login on the instance (set "enable-oslogin=TRUE" in its metadata), as described in this article; a sketch follows below.
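A short sketch of that instance-level setting (instance name and zone are placeholders):
gcloud compute instances add-metadata my-instance \
    --zone=us-central1-a \
    --metadata enable-oslogin=TRUE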
You can track the permanent fix by following the Public Issue tracker.
The original answer:
Maybe the solution would be to add an SSH key via the Google Cloud Console and log in with another SSH client.
Additional answer:
I do not know why, but sometimes the user suddenly stops being a member of the google-sudoers group...
Then it's enough to add your user back to this group, as some other user with administrator privileges:
# usermod -aG google-sudoers your_user_name
(-a appends, so the user's existing supplementary groups are preserved.) Of course, only if there is such a user...

Google cloud compute startup script ignored with no logging

I have a standard Debian 8.9 instance on Google Compute Engine (GCE) where my startup script is ignored.
In the custom metadata field, for startup-script, I am trying to run an Rscript (which is used for batch execution of R files), followed by a system shutdown, with the following:
#! /bin/bash
sudo /usr/bin/Rscript /home/myuser/launch_script.R
sudo shutdown -h now
Starting the instance is immediately followed by a shutdown, and the Rscript is ignored. Removing the final shutdown line causes the GCE instance to start, but the Rscript is still ignored. Running just "sudo /usr/bin/Rscript /home/myuser/launch_script.R" from the terminal results in the script being run. It has a chmod of 755, so I don't think this is a permissions issue.
In addition to this problem, I have read elsewhere that logging should happen in /var/log/, but there is nothing there. Instead, I have a bunch of log files (that only contain the startup script and nothing else) in the root of my instance.
I got in touch with Google cloud support, who gave the following response:
script definition is kept under /var/run/google.startup.script
If the script does not run initially, you can force it manually with:
$ sudo google_metadata_script_runner --script-type startup     # on Debian
$ sudo /usr/share/google/run-startup-scripts                   # on Ubuntu and older images
I'm posting this information here, because it is not in their documentation (as of August 2017). I'm not sure how helpful it is, since the google.startup.script didn't exist in my case (using the latest Debian image on GCE), but I did run the other commands.
However, I think my main issues were:
I was using autossh to connect to a remote database, and the startup script was running before autossh was up. Building a 40-second delay into the script and running the script as a regular user (not sudo-type root) seems to have solved this problem for now (a sketch follows this list). Autossh was being run as the main user, which I think gets loaded before lower-privilege user-defined scripts get loaded.
I was using some gcloud commands from the user account which had its own authentication issues. Running gcloud auth login as the user and ensuring correct permissions on my private key solved this.
Always remember to check the messages and syslog files in /var/log for troubleshooting. This allowed me to see the order of things being loaded at system-boot.
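A rough sketch of the delayed, run-as-user startup script described in the first point above (the username, paths and 40-second delay are illustrative placeholders):
#! /bin/bash
# Give autossh (started for the main user) time to bring the tunnel up
sleep 40
# Run the R script as a regular user instead of root
sudo -u myuser /usr/bin/Rscript /home/myuser/launch_script.R
sudo shutdown -h now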

Vagrant Rsync Error before provisioning

So I'm having some adventures with the vagrant-aws plugin, and I'm now stuck on the issue of syncing folders. This is necessary to provision the machines, which is the ultimate goal. However, running vagrant provision on my machine yields
[root@vagrant-puppet-minimal vagrant]# vagrant provision
[default] Rsyncing folder: /home/vagrant/ => /vagrant
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
mkdir -p '/vagrant'
I'm almost positive the error occurs because ssh-ing in manually and running that command yields 'permission denied' (obviously: a non-root user is trying to make a directory in the root directory). I tried ssh-ing as root, but that seems like bad practice (and Amazon doesn't like it). How can I change the folder that gets rsynced with vagrant-aws? I can't seem to find the setting for that. Thanks!
Most likely you are running into the known vagrant-aws issue #72: Failing with EC2 Amazon Linux Images.
Edit 3 (Feb 2014): Vagrant 1.4.0 (released Dec 2013) and later versions now support the boolean configuration parameter config.ssh.pty. Set the parameter to true to force Vagrant to use a PTY for provisioning. Vagrant creator Mitchell Hashimoto points out that you must not set config.ssh.pty on the global config, you must set it on the node config directly.
This new setting should fix the problem, and you shouldn't need the workarounds listed below anymore. (But note that I haven't tested it myself yet.) See Vagrant's CHANGELOG for details -- unfortunately the config.ssh.pty option is not yet documented under SSH Settings in the Vagrant docs.
Edit 2: Bad news. It looks as if even a boothook will not be "faster" to run (to update /etc/sudoers.d/ for !requiretty) than Vagrant is trying to rsync. During my testing today I started seeing sporadic "mkdir -p /vagrant" errors again when running vagrant up --no-provision. So we're back to the previous point where the most reliable fix seems to be a custom AMI image that already includes the applied patch to /etc/sudoers.d.
Edit: Looks like I found a more reliable way to fix the problem. Use a boothook to perform the fix. I manually confirmed that a script passed as a boothook is executed before Vagrant's rsync phase starts. So far it has been working reliably for me, and I don't need to create a custom AMI image.
Extra tip: And if you are relying on cloud-config, too, you can create a Mime Multi Part Archive to combine the boothook and the cloud-config. You can get the latest version of the write-mime-multipart helper script from GitHub.
Usage sketch:
$ cd /tmp
$ wget https://raw.github.com/lovelysystems/cloud-init/master/tools/write-mime-multipart
$ chmod +x write-mime-multipart
$ cat boothook.sh
#!/bin/bash
SUDOERS_FILE=/etc/sudoers.d/999-vagrant-cloud-init-requiretty
echo "Defaults:ec2-user !requiretty" > $SUDOERS_FILE
echo "Defaults:root !requiretty" >> $SUDOERS_FILE
chmod 440 $SUDOERS_FILE
$ cat cloud-config
#cloud-config
packages:
- puppet
- git
- python-boto
$ ./write-mime-multipart boothook.sh cloud-config > combined.txt
You can then pass the contents of 'combined.txt' to aws.user_data, for instance via:
aws.user_data = File.read("/tmp/combined.txt")
Sorry for not mentioning this earlier, but I am literally troubleshooting this right now myself. :)
Original answer (see above for a better approach)
TL;DR: The most reliable fix is to "patch" a stock Amazon Linux AMI image, save it and then use the customized AMI image in your Vagrantfile. See below for details.
Background
A potential workaround is described (and linked in the bug report above) at https://github.com/mitchellh/vagrant-aws/pull/70/files. In a nutshell, add the following to your Vagrantfile:
aws.user_data = "#!/bin/bash\necho 'Defaults:ec2-user !requiretty' > /etc/sudoers.d/999-vagrant-cloud-init-requiretty && chmod 440 /etc/sudoers.d/999-vagrant-cloud-init-requiretty\nyum install -y puppet\n"
Most importantly this will configure the OS to not require a tty for user ec2-user, which seems to be the root of the problem. I /think/ that the additional installation of the puppet package is not required for the actual fix (although Vagrant may use Puppet for provisioning the machine later, depending on how you configured Vagrant).
My experience with the described workaround
I have tried this workaround but Vagrant still occasionally fails with the same error. It might be a "race condition" where Vagrant happens to run its rsync phase faster than cloud-init (which is what aws.user_data is passing information to) can prepare the workaround for #72 on the machine for Vagrant. If Vagrant is faster you will see the same error; if cloud-init is faster it works.
What will work (but requires more effort on your side)
What definitely works is to run the command on a stock Amazon Linux AMI image, and then save the modified image (= create an image snapshot) as a custom AMI image of yours.
# Start an EC2 instance with a stock Amazon Linux AMI image and ssh-connect to it
$ sudo su - root
$ echo 'Defaults:ec2-user !requiretty' > /etc/sudoers.d/999-vagrant-cloud-init-requiretty
$ chmod 440 /etc/sudoers.d/999-vagrant-cloud-init-requiretty
# Note: Installing puppet is mentioned in the #72 bug report but I /think/ you do not need it
# to fix the described Vagrant problem.
$ yum install -y puppet
You must then use this custom AMI image in your Vagrantfile instead of the stock Amazon one. The obvious drawback is that you are not using a stock Amazon AMI image anymore -- whether this is a concern for you or not depends on your requirements.
What I tried but didn't work out
For the record: I also tried to pass a cloud-config to aws.user_data that included a bootcmd to set !requiretty in the same way as the embedded shell script above. According to the cloud-init docs bootcmd is run "very early" in the startup cycle for an EC2 instance -- the idea being that bootcmd instructions would be run earlier than Vagrant would try to run its rsync phase. But unfortunately I discovered that the bootcmd feature is not implemented in the outdated cloud-init version of current Amazon's Linux AMIs (e.g. ami-05355a6c has cloud-init 0.5.15-69.amzn1 but bootcmd was only introduced in 0.6.1).