I'm trying to add my first service on RHEL 7 (running in AWS/EC2), but the service does not seem to be configured correctly, as I get:
[ec2-user@ip-172-30-1-96 ~]$ systemctl status clouddirectd.service -l
● clouddirectd.service - CloudDirect Daemon
Loaded: loaded (/usr/lib/systemd/system/clouddirectd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Tue 2018-01-09 16:09:42 EST; 8s ago
Main PID: 10064 (code=exited, status=217/USER)
Jan 09 16:09:42 ip-172-30-1-96.us-west-1.compute.internal systemd[1]: clouddirectd.service: main process exited, code=exited, status=217/USER
Jan 09 16:09:42 ip-172-30-1-96.us-west-1.compute.internal systemd[1]: Unit clouddirectd.service entered failed state.
Jan 09 16:09:42 ip-172-30-1-96.us-west-1.compute.internal systemd[1]: clouddirectd.service failed.
Also:
[ec2-user@ip-172-30-1-96 ~]$ systemctl is-active clouddirectd
activating
[ec2-user@ip-172-30-1-96 ~]$ sudo systemctl list-units --type service --all | grep clouddirectd
clouddirectd.service loaded activating auto-restart CloudDirect Daemon
And my unit file is:
[ec2-user@ip-172-30-1-96 ~]$ cat /usr/lib/systemd/system/clouddirectd.service
[Unit]
Description=CloudDirect Daemon
After=network.target
[Service]
Environment=AWS_SHARED_CREDENTIALS_FILE=/etc/sonar/.aws/credentials
#ExecStart=/usr/lib/sonar/clouddirect/virtualenv/bin/python /usr/bin/sonar/clouddirectd -c /etc/sonar/clouddirect/clouddirectd.conf
ExecStart=/usr/lib/sonar/clouddirect/virtualenv/bin/python /usr/bin/clouddirect -c /etc/sonar/clouddirect.conf
# #PERM# allow group write permission on newly created files
UMask=0007
#User=clouddirectd
User=clouddirect
Group=sonar
KillSignal=SIGINT
TimeoutStopSec=60min
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Can you suggest how to debug this systemd service so it won't keep dying and auto-restarting?
The error 217/USER indicates the user did not exist at the time the service tried to start. In your case the user specified in your service is clouddirect.
Main PID: 10064 (code=exited, status=217/USER)
Jan 09 16:09:42 ip-172-30-1-96.us-west-1.compute.internal systemd[1]: clouddirectd.service: main process exited, code=exited, status=217/USER
This can happen if that is not the actual user name (for example, if it has a typo). It can also happen if the user comes from an external user store (e.g. LDAP or Active Directory) and the service that allows the Linux server to reach that external user store is not up yet. For example, vasd.service starts a product used to allow Linux to authenticate against Active Directory; if vasd.service is not up yet and you have specified a user that is only available in Active Directory, you would want to add that service to your After= line. For example:
After=network.target vasd.service
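Before chasing external user stores, it is worth confirming that the account and group actually resolve on the box; a quick check (names taken from the unit file above):
getent passwd clouddirect    # does the user resolve via NSS (local files, LDAP, AD, ...)?
getent group sonar           # does the group exist?
id clouddirect               # prints uid/gid if everything resolves
If getent returns nothing, systemd cannot resolve the user and fails the unit with 217/USER, exactly as shown above.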
There are two parts to the question: how to diagnose a 217/USER, and how to fix it. I'll focus on the former.
For the 217/USER there are some good pointers here:
https://www.reddit.com/r/linuxquestions/comments/oaya49/systemd_service_not_starting_with_status217/
217 doesn't always mean it's a user problem; it just means the process exited with status 217. It may or may not be user-related.
You can use journalctl to see the logs and check which services only come up after yours does during boot.
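For example (unit name taken from the question):
journalctl -u clouddirectd.service -b --no-pager    # everything this unit logged since boot
journalctl -xe                                      # recent journal entries with explanations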
It's possible that "network users" aren't yet available at the time the service is started during boot; you can fix that by adding After=nss-user-lookup.target (see https://systemd.io/UIDS-GIDS/), though that doesn't seem to be the case here, since it still fails on restart, which happens later. systemd expects the specified user to be available when the service starts. "System users" (used for early-running processes) need to exist on the local box; processes started later can use "network users".
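For completeness, if the user really did come from a network store, the usual way to add that ordering without touching the packaged unit would be a drop-in; a minimal sketch (the drop-in file name is just an example):
sudo mkdir -p /etc/systemd/system/clouddirectd.service.d
sudo tee /etc/systemd/system/clouddirectd.service.d/user-lookup.conf <<'EOF'
[Unit]
After=nss-user-lookup.target
Wants=nss-user-lookup.target
EOF
sudo systemctl daemon-reload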
You could also try switching to the group, user name and environment you think systemd is using and running the command manually to see what happens. https://serverfault.com/questions/410577/execute-a-command-from-another-group
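A rough way to do that by hand, using the user, group and environment from the unit file above (drop -g if your sudoers does not allow a runas group):
sudo -u clouddirect -g sonar env \
    AWS_SHARED_CREDENTIALS_FILE=/etc/sonar/.aws/credentials \
    /usr/lib/sonar/clouddirect/virtualenv/bin/python /usr/bin/clouddirect -c /etc/sonar/clouddirect.conf
If sudo complains about an unknown user, that by itself explains the 217/USER.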
I kind of wish systemd produced more debug output so you could tell more easily what it is actually running...
In certain bizarre cases you may need to specify both User= and Group= https://superuser.com/a/1452367/39364
In our case, running "vintela status" reported "SELinux may not be configured correctly", and sure enough, after disabling SELinux it started working as expected, with no more 217. [RHEL 8]
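If you suspect SELinux, it is gentler to look for denials first than to disable it outright; this is standard tooling, nothing specific to this service:
getenforce                        # Enforcing / Permissive / Disabled
sudo ausearch -m avc -ts recent   # recent SELinux denials, if any
sudo setenforce 0                 # temporarily switch to permissive to test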
Why is the log data also written to /var/log/messages if you specify StandardOutput=null and StandardError=journal in the systemd service? I'm using CentOS 7 as the operating system.
[Service]
Restart=always
TimeoutStartSec=1200
StandardOutput=null
StandardError=journal
I can see the same message both in the journal and in /var/log/messages.
journalctl -u my_service
----
systemd[1]: my_service.service holdoff time over, scheduling restart.
----
cat /var/log/messages | grep my_service
----
systemd[1]: my_service.service holdoff time over, scheduling restart.
----
What additional adjustments do I have to make so that the service only logs its error messages in the journal?
EDIT:
I use the default journald configuration (/etc/systemd/journald.conf); all lines are commented out.
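For what it's worth, on CentOS 7 /var/log/messages is written by rsyslog, which by default reads entries back out of the journal (imjournal), so the Standard* settings on the unit do not stop the duplication by themselves. One possible direction, assuming rsyslog is indeed what writes /var/log/messages and that the process logs under the identifier my_service, is a small rsyslog drop-in (an untested sketch; the drop-in file name is just an example):
sudo tee /etc/rsyslog.d/30-my_service.conf <<'EOF'
if $programname == 'my_service' then stop
EOF
sudo systemctl restart rsyslog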
I have an EMR cluster in AWS, configured in a CloudFormation template. In my template, I have a step that executes a script on the master node. The purpose of this script is to make changes to the hue.ini file.
The final step in the script is to restart Hue so the changes take effect. I'm following this documentation for the correct command; it is explicit that you should not run restart.
Running sudo systemctl stop hue followed by sudo systemctl start hue leaves Hue in the following state (per sudo systemctl status hue):
[root@ip-10-x-xxx-xxx ~]# sudo systemctl status hue
● hue.service - Hue web server
Loaded: loaded (/etc/systemd/system/hue.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2021-05-19 18:44:27 UTC; 2s ago
Process: 22743 ExecStart=/etc/init.d/hue start (code=exited, status=1/FAILURE)
Main PID: 17508 (code=exited, status=1/FAILURE)
Tasks: 0
Memory: 0B
CGroup: /system.slice/hue.service
May 19 18:44:27 ip-10-x-xxx-xxx systemd[1]: Failed to start Hue web server.
May 19 18:44:27 ip-10-x-xxx-xxx systemd[1]: Unit hue.service entered failed state.
May 19 18:44:27 ip-10-x-xxx-xxx systemd[1]: hue.service failed.
Running start again manually on the instance returns this:
Job for hue.service failed because the control process exited with error code. See "systemctl status hue.service" and "journalctl -xe" for details.
Those logs just show the same as above. I have also checked this similar question but the answer does not work for me.
EMR: emr-6.2.0
Hue: 4.8.0
After a little more research, it seems this is not the best approach. The better approach is to include a hue-ini classification block in my CloudFormation template. This applies the changes and performs the required restart for you.
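For reference, the same hue-ini classification can also be supplied on the CLI when the cluster is created; a rough sketch (the desktop section and http_port property are just placeholders for whichever hue.ini settings you actually need, and the remaining create-cluster arguments are omitted):
aws emr create-cluster \
    --release-label emr-6.2.0 \
    --configurations '[{"Classification":"hue-ini","Configurations":[{"Classification":"desktop","Properties":{"http_port":"8888"}}]}]' \
    ...
In a CloudFormation template, the equivalent goes into the Configurations property of the AWS::EMR::Cluster resource.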
I have a startup script, set by an instance template, that initializes the server on Google Compute Engine. After installing Postgres, I manually start it using:
/etc/init.d/postgresql start
This completes successfully, but when run from the startup script the server is not listening on 5432 (Postgres isn't started, even though that service start call completes successfully). After startup completes and I log in, the same command works. Does anyone know why it won't work within the startup script? I need to load data during startup, so I need to start Postgres during initialization.
I solved it by using a newer Debian image.
I had the same problem as you (installing postgresql in GCE startup-script results in the package being installed, but the server is not running), and I think I figured out the root cause.
Normally, the postgresql-11 package is supposed to start the PostgreSQL server after installation. Here is a snippet from its postinst script:
if [ "$1" = configure ]; then
. /usr/share/postgresql-common/maintscripts-functions
configure_version $VERSION "$2"
fi
Taking a look at /usr/share/postgresql-common/maintscripts-functions, we see:
configure_version() {
...
# reload systemd to let the generator pick up the new unit
if [ -d /run/systemd/system ]; then
systemctl daemon-reload
fi
invoke-rc.d postgresql start $VERSION # systemd: argument ignored, starts all versions
}
My Debian installation comes with init-system-helpers version "1.56+nmu1", which contains this bit of code in invoke-rc.d:
# avoid deadlocks during bootup and shutdown from units/hooks
# which call "invoke-rc.d service reload" and similar, since
# the synchronous wait plus systemd's normal behaviour of
# transactionally processing all dependencies first easily
# causes dependency loops
if ! systemctl --quiet is-active multi-user.target; then
sctl_args="--job-mode=ignore-dependencies"
fi
case $saction in
start|restart|try-restart)
[ "$_state" != "LoadState=masked" ] || exit 0
systemctl $sctl_args "${saction}" "${UNIT}" && exit 0
;;
The Debian postgresql-11 package makes use of templated systemd units. The main one is called postgresql.service but this is a dummy service that doesn't actually do anything. The PostgreSQL server is actually started by a templated unit named postgresql@11-main, which is usually started alongside the main service because it has ReloadPropagatedFrom=postgresql.service.
Note that when this issue occurs, the main unit is started but the templated one is not:
$ sudo systemctl status postgresql
● postgresql.service - PostgreSQL RDBMS
Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
Active: active (exited) since Fri 2021-04-02 05:40:48 UTC; 32min ago
Main PID: 1663 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4665)
Memory: 0B
CGroup: /system.slice/postgresql.service
Apr 02 05:40:48 hubnext-west-r21r systemd[1]: Starting PostgreSQL RDBMS...
Apr 02 05:40:48 hubnext-west-r21r systemd[1]: Started PostgreSQL RDBMS.
$ sudo systemctl status postgresql@11-main
● postgresql@11-main.service - PostgreSQL Cluster 11-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
Active: inactive (dead)
That's because when --job-mode=ignore-dependencies is specified, this link is ignored.
The GCE startup script runs as a systemd unit, which starts before multi-user.target is up:
$ find /etc/systemd | grep startup
/etc/systemd/system/multi-user.target.wants/google-startup-scripts.service
Therefore, invoke-rc.d notices that systemctl --quiet is-active multi-user.target is false and adds --job-mode=ignore-dependencies, which results in the PostgreSQL server not starting.
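You can see this for yourself from inside the startup script, e.g. by logging the target state before installing anything (purely a debugging aid; the log path is just an example):
echo "multi-user.target is: $(systemctl is-active multi-user.target)" >> /var/log/startup-script-debug.log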
One possible workaround is to explicitly run systemctl start postgresql@11-main.service from your startup script after installing Postgres, as sketched below.
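In startup-script form that workaround might look roughly like this (assuming PostgreSQL 11 and the default main cluster, as in this answer):
apt-get install -y postgresql-11
# invoke-rc.d ran with --job-mode=ignore-dependencies, so the templated unit
# may still be down; start it explicitly and verify:
systemctl start postgresql@11-main.service
systemctl is-active postgresql@11-main.service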
By the way, I noticed that a recent commit (Nov 2020) changed this invoke-rc.d behavior so that it no longer uses --job-mode=ignore-dependencies. That would help avoid this issue.
I'm using the Docker tools on Windows.
The create command was working perfectly last week and I managed to create a number of machines on DigitalOcean. Then I tried today with no success. I repeated the same command with different regions and I always get the same result:
λ docker-machine create -d digitalocean --digitalocean-access-token=MYTOKEN --digitalocean-region=ams2 vmname
Running pre-create checks...
Creating machine...
(fernu) Creating SSH key...
(fernu) Creating Digital Ocean droplet...
(fernu) Waiting for IP address to be assigned to the Droplet...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: ssh command error:
command : sudo systemctl -f start docker
err : exit status 1
output : Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
If I execute the suggested command:
root@fernu:~# systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-machine.conf
Active: inactive (dead) (Result: exit-code) since Fri 2017-06-30 20:56:13 UTC; 8min ago
Docs: https://docs.docker.com
Process: 4943 ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label provider=digitalocean (code=exited, status=1/FAILURE)
Main PID: 4943 (code=exited, status=1/FAILURE)
Jun 30 20:56:13 fernu systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jun 30 20:56:13 fernu systemd[1]: Failed to start Docker Application Container Engine.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Unit entered failed state.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 30 20:56:13 fernu systemd[1]: Stopped Docker Application Container Engine.
Jun 30 20:56:13 fernu systemd[1]: docker.service: Start request repeated too quickly.
Jun 30 20:56:13 fernu systemd[1]: Failed to start Docker Application Container Engine.
Any help would be appreciated
Update
It's working with Ubuntu 14.04:
--digitalocean-image=ubuntu-14-04-x64 so it seems like a problem with the default image (ubuntu-16-04-x64).
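For reference, the pinned-image workaround is just the original command with the image flag added (token, region and machine name as in the question):
docker-machine create -d digitalocean \
    --digitalocean-access-token=MYTOKEN \
    --digitalocean-image=ubuntu-14-04-x64 \
    --digitalocean-region=ams2 \
    vmname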
This seems to be hitting a lot of people. TL;DR: There is a bug in docker-machine v0.12.0 and this issue can be resolved by upgrading.
Logging in to the DigitalOcean instance and running journalctl -xe provides more information:
-- Unit docker.service has begun starting up.
Jul 07 20:03:52 docker-sandbox docker[4930]: `docker daemon` is not supported on Linux. Please run `do
Jul 07 20:03:52 docker-sandbox systemd[1]: docker.service: Main process exited, code=exited, status=1/
Jul 07 20:03:52 docker-sandbox systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
The key here is docker daemon is not supported on Linux. A bug in docker-machine's version comparison code caused an incorrect systemd unit file to be produced (located at /etc/systemd/system/docker.service.d/10-machine.conf) on certain versions of Ubuntu.
A fix has been committed and a new release (v0.12.1) was made.
You can grab the latest release at: https://github.com/docker/machine/releases/tag/v0.12.1
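On Linux or macOS, upgrading is just a matter of replacing the binary with the one from that release page (on Windows, swap out docker-machine.exe the same way); roughly:
curl -L "https://github.com/docker/machine/releases/download/v0.12.1/docker-machine-$(uname -s)-$(uname -m)" \
    -o /tmp/docker-machine
sudo install /tmp/docker-machine /usr/local/bin/docker-machine
docker-machine version    # should now report 0.12.1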
glusterfs-server won't stop with a clean status.
I have been using the following steps to install glusterfs and start the service.
yum install centos-release-gluster
yum install glusterfs-server
systemctl start glusterd
and stop it.
systemctl stop glusterd
Then the following status is displayed: "Active: failed".
glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled)
Active: failed (Result: exit-code) since 火 2017-01-24 18:23:55 JST; 4s ago
Process: 2523 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 2524 (code=exited, status=15)
1月 24 18:23:52 ds009 systemd[1]: Started GlusterFS, a clustered file-system server.
1月 24 18:23:55 ds009 systemd[1]: Stopping GlusterFS, a clustered file-system server...
1月 24 18:23:55 ds009 systemd[1]: glusterd.service: main process exited, code=exited, status=15/n/a
1月 24 18:23:55 ds009 systemd[1]: Stopped GlusterFS, a clustered file-system server.
1月 24 18:23:55 ds009 systemd[1]: Unit glusterd.service entered failed state.
Hint: Some lines were ellipsized, use -l to show in full.
The environment is "CentOS Linux release 7.2.1511 (Core)",
and the installed glusterfs-server version is 3.8.8-1.el7.
Does anyone have an idea what is wrong and how to fix this?
I have found a workaround.
Edit the /usr/lib/systemd/system/glusterd.service file as follows.
[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
# Add this line (inline comments are not supported in unit files):
ExecStopPost=/usr/bin/systemctl reset-failed glusterd
KillMode=process
The failed status is cleared when the GlusterFS service stops.
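After editing the unit, reload systemd and repeat the start/stop cycle to confirm the unit no longer ends up in the failed state:
sudo systemctl daemon-reload
sudo systemctl start glusterd
sudo systemctl stop glusterd
systemctl is-failed glusterd    # should now print "inactive" rather than "failed"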