Command failed: wait ${SUBPROC} on line 326 - google-cloud-platform

I am trying to create a Hadoop cluster in GCP using the commands below:
cd bdutil
$ ./bdutil -b [Bucket Name] \
-z us-east1-b \
-n 2 \
-P [Project-ID] \
deploy
...
At the (y/n) prompt I enter y.
I am facing the issues below; please help me resolve them:
Mon Oct 8 05:35:49 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-w-0 --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-w-1 --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-m --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Command failed: wait ${SUBPROC} on line 326.
Mon Oct 8 05:35:50 UTC 2018: Exit code of failed command: 1
Mon Oct 8 05:35:50 UTC 2018: Detailed debug info available in file: /tmp/bdutil-20181008-053541-yeq/debuginfo.txt
Mon Oct 8 05:35:50 UTC 2018: Check console output for error messages and/or retry your command.

Check the output in the debug file it points you to (Detailed debug info available in file: /tmp/bdutil-20181111-105223-nET/debuginfo.txt):
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Invalid value for field 'resource.name': 'antonmobile_gmail-w-0'. Must be a match of regex '(?:[a-z](?:[-a-z0-9$
I personally had an issue where my UNIQUE_ID used an underscore, which does not conform to the name restrictions.
PS: I know it is probably too late to answer, but somebody else could have the same issue.
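For illustration, a small shell check of a candidate name or prefix against the GCE naming rule (lowercase letters, digits and hyphens, starting with a letter; the example name here is hypothetical, since the regex in the error output above is truncated):
CANDIDATE="antonmobile-gmail"   # hyphens instead of the underscore that caused the error
printf '%s\n' "$CANDIDATE" | grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$' \
  && echo "name looks valid" \
  || echo "invalid: use only lowercase letters, digits and hyphens, starting with a letter"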

Related

Google Cloud VM metadata-based keys ssh: handshake failed unable to authenticate and oslogin_cache_refresh: Failure getting groups, quitting

When trying to SSH to GCE VMs using metadata-based SSH keys I get the following error:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
While troubleshooting I can see the keys in the instance metadata, but they are not being added to the user's authorized_keys file:
$ curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/attributes/ssh-keys"
username:ssh-ed25519 AAAAC3NzaC....omitted....
admin:ssh-ed25519 AAAAC3NzaC....omitted....
$ sudo ls -hal /home/**/.ssh/
/home/ubuntu/.ssh/:
total 8.0K
drwx------ 2 ubuntu ubuntu 4.0K Aug 11 23:19 .
drwxr-xr-x 3 ubuntu ubuntu 4.0K Aug 11 23:19 ..
-rw------- 1 ubuntu ubuntu 0 Aug 11 23:19 authorized_keys
# Only result is the default zero-length file for ubuntu user
I also see the following errors in the ssh server auth log and Google Guest Environment services:
$ sudo less /var/log/auth.log
Aug 11 23:28:59 test-vm sshd[2197]: Invalid user admin from 1.2.3.4 port 34570
Aug 11 23:28:59 test-vm sshd[2197]: Connection closed by invalid user admin 1.2.3.4 port 34570 [preauth]
$ sudo journalctl -u google-guest-agent.service
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Refreshing passwd entry cache
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Refreshing group entry cache
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Failure getting groups, quitting
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Failed to get groups, not updating group cache file, removing /etc/oslogin_group.cache.bak.
# or
Aug 11 23:19:37 test-vm GCEGuestAgent[766]: 2022-08-11T23:19:37.6541Z GCEGuestAgent Info: Creating user admin.
Aug 11 23:19:37 test-vm useradd[885]: failed adding user 'admin', data deleted
Aug 11 23:19:37 test-vm GCEGuestAgent[766]: 2022-08-11T23:19:37.6869Z GCEGuestAgent Error non_windows_accounts.go:144:
Error creating user: useradd: group admin exists - if you want to add this user to that group, use -g.
Currently the latest cloud-init and guest-oslogin packages for Ubuntu 20.04.4 LTS (focal) seem to have an issue that causes google-guest-agent.service to exit before completing its task. The issue was fixed and committed but not yet released for focal (and likely other Ubuntu versions).
For now you can try disabling OS Login by setting the instance or project metadata enable-oslogin=FALSE (a gcloud sketch is at the end of this answer). After that you should see the expected results and be able to SSH using those keys:
$ sudo journalctl -u google-guest-agent.service
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.0517Z GCEGuestAgent Info: Created google sudoers file
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.0522Z GCEGuestAgent Info: Creating user username.
Aug 11 23:10:33 test-vm useradd[881]: new group: name=username, GID=1002
Aug 11 23:10:33 test-vm useradd[881]: new user: name=username, UID=1001, GID=1002, home=/home/username, shell=/bin/bash, from=none
Aug 11 23:10:33 test-vm gpasswd[895]: user username added by root to group ubuntu
Aug 11 23:10:33 test-vm gpasswd[904]: user username added by root to group adm
Aug 11 23:10:33 test-vm gpasswd[983]: user username added by root to group google-sudoers
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.7615Z GCEGuestAgent Info: Updating keys for user username.
$ sudo ls -hal /home/username/.ssh/
/home/username/.ssh/:
total 12K
drwx------ 2 username username 4.0K Aug 11 23:19 .
drwxr-xr-x 4 username username 4.0K Aug 11 23:35 ..
-rw------- 1 username username 589 Aug 11 23:19 authorized_keys
The admin user, however, will not work, since it conflicts with an existing Linux group. You should pick a username that does not conflict with any of the group names (the name in name:x:123:) listed by getent group.
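For reference, a rough sketch of setting that metadata with gcloud (test-vm is the instance from the logs above; the zone is a placeholder, adjust to your VM):
# per instance:
gcloud compute instances add-metadata test-vm --zone=us-central1-a --metadata enable-oslogin=FALSE
# or project-wide:
gcloud compute project-info add-metadata --metadata enable-oslogin=FALSE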

How to set cpu_manager_policy to static in an EKS managed nodegroup?

Hi, I have been trying to do CPU pinning in my EKS cluster. I am using the latest Amazon Linux release, and my EKS version is 1.22. I created a launch template with the user data shown below.
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0
--//
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh $CLUSTER_NAME
sleep 2m
yum update -y
sudo rm /var/lib/kubelet/cpu_manager_state
sudo chmod 777 kubelet.service
sudo cat > /etc/systemd/system/kubelet.service <<EOF
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service iptables-restore.service
Requires=docker.service
[Service]
ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5
ExecStart=/usr/bin/kubelet --cloud-provider aws \
--image-credential-provider-config /etc/eks/ecr-credential-provider/ecr-credential-provider-config \
--image-credential-provider-bin-dir /etc/eks/ecr-credential-provider \
--cpu-manager-policy=static \
--kube-reserved=cpu=0.5,memory=1Gi,ephemeral-storage=0.5Gi \
--system-reserved=cpu=0.5,memory=1Gi,ephemeral-storage=0.5Gi \
--config /etc/kubernetes/kubelet/kubelet-config.json \
--kubeconfig /var/lib/kubelet/kubeconfig \
--container-runtime docker \
--network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS
Restart=always
RestartSec=5
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
sudo chmod 644 kubelet.service
sudo systemctl daemon-reload
sudo systemctl stop kubelet
sudo systemctl start kubelet
--//
After creating the template, I used it for the EKS nodegroup creation. After waiting a while, I got this error on the EKS dashboard:
Health issues (1)
NodeCreationFailure Instances failed to join the kubernetes cluster .
I got into that EC2 instance and used the following command to view the kubelet logs:
$ journalctl -f -u kubelet
The output is:
[ec2-user@ip-10.100.11.111 kubelet]$ journalctl -f -u kubelet
-- Logs begin at Thu 2022-04-21 07:27:50 UTC. --
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.199868 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-
441d-4e25-9441-093b4c2ed548" containerName="efs-plugin" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.244811 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-
441d-4e25-9441-093b4c2ed548" containerName="csi-provisioner" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.305206 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-
441d-4e25-9441-093b4c2ed548" containerName="liveness-probe" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.335744 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-
f5ac-4039-a151-110ddf27d140" containerName="efs-plugin" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.388843 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-
f5ac-4039-a151-110ddf27d140" containerName="csi-driver-registrar" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.464789 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-
f5ac-4039-a151-110ddf27d140" containerName="liveness-probe" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.545206 12225 state_mem.go:80] "Updated desired CPUSet" podUID="a2f09d0d-
69f5-4bb7-82bb-edfa86cb87e2" containerName="kube-controller" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.633078 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3ec70fe1-
3680-4e3c-bcfa-81f80ebe20b0" containerName="kube-proxy" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421
07:31:21.696852 12225 state_mem.go:80] "Updated desired CPUSet" podUID="adbd9bef-
c4e0-4bd1-a6a6-52530ad4bea3" containerName="aws-node" cpuSet="0-7"
Apr 21 07:46:12 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: E0421
07:46:12.424801 12225 certificate_manager.go:488] kubernetes.io/kubelet-serving:
certificate request was not signed: timed out waiting for the condition
Apr 21 08:01:16 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: E0421
08:01:16.810385 12225 certificate_manager.go:488] kubernetes.io/kubelet-serving:
certificate request was not signed: timed out waiting for the condition
That was the output.
Before using this method I also tried another approach: I created a node group, then created an AMI from one of the nodes in that node group, modified the kubelet.service file and removed the old cpu_manager_state file, and then used that image to create the node group. That worked fine, but the problem was that I was unable to get into the pods running on those nodes and also unable to get the logs of the pods running there. And strangely, if I use
$ kubectl get nodes -o wide
the output shows neither the internal nor the external IP addresses.
So I moved on to using the user data instead of this method.
Kindly give me instructions to create a managed nodegroup with the CPU manager policy set to static for an EKS cluster.
I had the same question. I added the following user data script to my launch template.
User data script
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
yum install -y jq
set -o xtrace
cp /etc/kubernetes/kubelet/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json.back
jq '. += { "cpuManagerPolicy":"static"}' /etc/kubernetes/kubelet/kubelet-config.json.back > /etc/kubernetes/kubelet/kubelet-config.json
--==MYBOUNDARY==--
Verification
You can verify the change took effect using kubectl:
# start a k8s API proxy
$ kubectl proxy
# get the node name
$ kubectl get nodes
# get kubelet config
$ curl -sSL "http://localhost:8001/api/v1/nodes/<<node_name>>/proxy/configz"
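If jq is available locally, you can filter the response down to just the policy (a sketch assuming the configz response nests the settings under a kubeletconfig key, as in the AWS guide):
$ curl -sSL "http://localhost:8001/api/v1/nodes/<<node_name>>/proxy/configz" | jq '.kubeletconfig.cpuManagerPolicy'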
I got the solution from this guide: https://aws.amazon.com/premiumsupport/knowledge-center/eks-worker-nodes-image-cache/. However, I could not get the sed command to work properly, so I used jq instead.
Logs
If you can SSH into the node, you can check the user data logs in /var/log/cloud-init-output.log; see https://stackoverflow.com/a/32460849/4400704
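For example, on the node (path as above):
$ sudo tail -n 200 /var/log/cloud-init-output.log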
CPU pinning
I have a pod with Guaranteed QoS (CPU limit and request = 2), and I can verify it has two CPUs reserved (a minimal pod spec for this is sketched after the output):
$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,10
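For context, a minimal sketch of such a Guaranteed-QoS pod (requests equal to limits, with an integer CPU count; the pod name and image are placeholders):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pinning-test
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "1Gi"
EOF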

Strange Offset in Apache Superset time selection - can this be fixed with some sort of timezone setting?

I am experiencing a strange issue where my data is in UTC but the date picker selects time with a large offset. I'd like the UI to select UTC time only.
The documentation states that Superset is built to run on UTC time only. I also found some threads saying one can change this by setting the Linux TZ environment variable to another timezone:
ENV TZ Europe/Amsterdam
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
Wrong behaviour (UTC is approx 17h December 22nd):
(same behaviour in all dashboards)
Query behind this data (I used "view query in SQLLab")
SELECT toDateTime(intDiv(toUInt32(toDateTime(delivery_date)), 300)*300) AS __timestamp,
COUNT(*) AS count
FROM spearad_data.a_performance_view
WHERE delivery_date >= toDateTime('2020-12-21 23:00:00')
AND delivery_date < toDateTime('2020-12-22 00:00:00')
GROUP BY toDateTime(intDiv(toUInt32(toDateTime(delivery_date)), 300)*300)
ORDER BY count DESC
LIMIT 10000;
In my case I checked the ECS Docker container and the EC2 instance that the task (container) runs on.
EC2 machine:
[ec2-user@1.2.3.4]$ date
Tue Dec 22 16:31:07 UTC 2020
[ec2-user@1.2.3.4]$ echo $TZ
[ec2-user@ip-1-2-3-4]$ date +'%:z %Z'
+00:00 UTC
[ec2-user@ip-1-2-3-4]$ cat /etc/timezone
cat: /etc/timezone: No such file or directory
[ec2-user@ip-1-2-3-4]$ timedatectl
Local time: Tue 2020-12-22 16:40:47 UTC
Universal time: Tue 2020-12-22 16:40:47 UTC
RTC time: Tue 2020-12-22 16:40:42
Time zone: n/a (UTC, +0000)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
ECS container:
/ # date
Tue Dec 22 16:43:53 UTC 2020
/ # echo $TZ
/ # date +'%:z %Z'
/ # cat /etc/timezone
cat: can't open '/etc/timezone': No such file or directory
Clickhouse DB:
SELECT now();
2020-12-22T16:50:32
Superset MySQL (AWS RDS):
SELECT now();
2020-12-22T16:52:27
https://time.is/de/UTC
16:57:01
Saving a query in SQL Lab:
created_on
2020-12-22T16:59:03
Data is UTC based as well. So where do I need to change this? It seems there's another setting or configuration missing.
I have the following setup:
Apache Superset 0.37 on AWS ECS
Superset ConfigDB: AWS RDS
Fact DB: Clickhouse DB 20.7
Driver: clickhouse-sqlalchemy (native mode)

Sphinx installation on CentOS 7

I just updated Sphinx to the latest version on a dedicated server running CentOS 7, but after hours of searching I can't find the problem.
The Sphinx index was created fine, but I can't start the search daemon. I get these messages all the time:
systemctl status searchd.service
searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: failed (Result: timeout) since Sat 2018-03-24 21:14:09 CET; 3min 4s ago
Process: 17865 ExecStartPre=/bin/chown sphinx.sphinx /var/run/sphinx (code=exited, status=0/SUCCESS)
Process: 17863 ExecStartPre=/bin/mkdir -p /var/run/sphinx (code=killed, signal=TERM)
Mar 24 21:14:09 systemd[1]: Starting SphinxSearch Search Engine...
Mar 24 21:14:09 systemd[1]: searchd.service start-pre operation timed out. Terminating.
Mar 24 21:14:09 systemd[1]: Failed to start SphinxSearch Search Engine.
Mar 24 21:14:09 systemd[1]: Unit searchd.service entered failed state.
Mar 24 21:14:09 systemd[1]: searchd.service failed.
I really have no idea where this problem comes from.
In your systemd service file (mine is in /usr/lib/systemd/system/searchd.service), comment out the ExecStartPre lines:
/bin/chown sphinx.sphinx /var/run/sphinx
/bin/mkdir -p /var/run/sphinx
(You can run these commands manually if that hasn't been done yet; see the example below.)
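For example, as root:
mkdir -p /var/run/sphinx
chown sphinx:sphinx /var/run/sphinx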
Then change from
Type=forking
to
Type=simple
Then do systemctl daemon-reload and you can start/stop/status the service:
[root@server ~]# cat /usr/lib/systemd/system/searchd.service
[Unit]
Description=SphinxSearch Search Engine
After=network.target remote-fs.target nss-lookup.target
After=syslog.target
[Service]
Type=simple
User=sphinx
Group=sphinx
# Run ExecStartPre with root-permissions
PermissionsStartOnly=true
#ExecStartPre=/bin/mkdir -p /var/run/sphinx
#ExecStartPre=/bin/chown sphinx.sphinx /var/run/sphinx
# Run ExecStart with User=sphinx / Group=sphinx
ExecStart=/usr/bin/searchd --config /etc/sphinx/sphinx.conf
ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait
KillMode=process
KillSignal=SIGTERM
SendSIGKILL=no
LimitNOFILE=infinity
TimeoutStartSec=infinity
PIDFile=/var/run/sphinx/searchd.pid
[Install]
WantedBy=multi-user.target
Alias=sphinx.service
Alias=sphinxsearch.service
[root@server ~]# systemctl start searchd
[root@server ~]# systemctl status searchd
● searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: active (running) since Sun 2018-03-25 10:41:24 EDT; 4s ago
Process: 111091 ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait (code=exited, status=1/FAILURE)
Main PID: 112030 (searchd)
CGroup: /system.slice/searchd.service
├─112029 /usr/bin/searchd --config /etc/sphinx/sphinx.conf
└─112030 /usr/bin/searchd --config /etc/sphinx/sphinx.conf
Mar 25 10:41:24 server.domain.com searchd[112026]: Sphinx 2.3.2-id64-beta (4409612)
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2001-2016, Andrew Aksyonoff
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Mar 25 10:41:24 server.domain.com searchd[112026]: Sphinx 2.3.2-id64-beta (4409612)
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2001-2016, Andrew Aksyonoff
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'test1'
Mar 25 10:41:24 server.domain.com searchd[112026]: WARNING: index 'test1': prealloc: failed to open /var/lib/sphinx/test1.sph: No such file or directory...T SERVING
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'testrt'
Mar 25 10:41:24 server.domain.com systemd[1]: searchd.service: Supervising process 112030 which is not our child. We'll most likely not notice when it exits.
Hint: Some lines were ellipsized, use -l to show in full.
[root@server ~]# systemctl stop searchd
[root@server ~]# systemctl status searchd
● searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2018-03-25 10:41:36 EDT; 1s ago
Process: 112468 ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait (code=exited, status=1/FAILURE)
Main PID: 112030
Mar 25 10:41:24 server.domain.com searchd[112026]: WARNING: index 'test1': prealloc: failed to open /var/lib/sphinx/test1.sph: No such file or directory...T SERVING
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'testrt'
Mar 25 10:41:24 server.domain.com systemd[1]: searchd.service: Supervising process 112030 which is not our child. We'll most likely not notice when it exits.
Mar 25 10:41:33 server.domain.com systemd[1]: Stopping SphinxSearch Search Engine...
Mar 25 10:41:33 server.domain.com searchd[112468]: [Sun Mar 25 10:41:33.183 2018] [112468] using config file '/etc/sphinx/sphinx.conf'...
Mar 25 10:41:33 server.domain.com searchd[112468]: [Sun Mar 25 10:41:33.183 2018] [112468] stop: successfully sent SIGTERM to pid 112030
Mar 25 10:41:36 server.domain.com systemd[1]: searchd.service: control process exited, code=exited status=1
Mar 25 10:41:36 server.domain.com systemd[1]: Stopped SphinxSearch Search Engine.
Mar 25 10:41:36 server.domain.com systemd[1]: Unit searchd.service entered failed state.
Mar 25 10:41:36 server.domain.com systemd[1]: searchd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
I had the same problem and finally found a solution that worked for me.
I edited my /etc/systemd/system/sphinx.service to look like this:
[Unit]
Description=SphinxSearch Search Engine
After=network.target remote-fs.target nss-lookup.target
After=syslog.target
[Service]
User=sphinx
Group=sphinx
RuntimeDirectory=sphinxsearch
RuntimeDirectoryMode=0775
# Run ExecStart with User=sphinx / Group=sphinx
ExecStart=/usr/bin/searchd --config /etc/sphinx/sphinx.conf
ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait
KillMode=process
KillSignal=SIGTERM
SendSIGKILL=no
LimitNOFILE=infinity
TimeoutStartSec=infinity
#PIDFile=/var/run/sphinx/searchd.pid
PIDFile=/var/run/sphinxsearch/searchd.pid
[Install]
WantedBy=multi-user.target
Alias=sphinx.service
Alias=sphinxsearch.service
With this, my searchd is able to survive a reboot. In my case the solution from the previous post had the problem that, after a reboot, searchd was started before the /var/run/sphinx directory existed again (it is deleted on reboot); RuntimeDirectory= takes care of recreating it.
The fact is that RHEL (CentOS) 7 does not accept the value "infinity" for the "TimeoutStartSec" parameter. You must set a numeric value, for example TimeoutStartSec=600.
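A minimal sketch of applying that change to the unit file from the previous answer (adjust the unit path and service name if yours differ):
sudo sed -i 's/^TimeoutStartSec=.*/TimeoutStartSec=600/' /etc/systemd/system/sphinx.service
sudo systemctl daemon-reload
sudo systemctl restart sphinx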

k8s: Error pulling images from ECR

We constantly get Waiting: ImagePullBackOff during CI upgrades. Does anybody know what's happening? The k8s cluster is 1.6.2, installed via kops. During upgrades we run kubectl set image, and for the last 2 days we have been seeing the following error:
Failed to pull image "********.dkr.ecr.eu-west-1.amazonaws.com/backend:da76bb49ec9a": rpc error: code = 2 desc = net/http: request canceled
Error syncing pod, skipping: failed to "StartContainer" for "backend" with ErrImagePull: "rpc error: code = 2 desc = net/http: request canceled"
journalctl -r -u kubelet
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: W0726 09:32:40.731903 840 docker_sandbox.go:263] NetworkPlugin kubenet failed on the status hook for pod "backend-1277054742-bb8zm_default": Unexpected command output nsenter: cannot open : No such file or directory
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724387 840 generic.go:239] PLEG: Ignoring events for pod frontend-1493767179-84rkl/default: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724371 840 kuberuntime_manager.go:858] getPodContainerStatuses for pod "frontend-1493767179-84rkl_default(0fff3b22-71c8-11e7-9679-02c1112ca4ec)" failed: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724358 840 kuberuntime_container.go:385] ContainerStatus for 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d error: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724329 840 remote_runtime.go:269] ContainerStatus "2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: with error: exit status 1
Try running kubectl create configmap -n kube-system kube-dns
For more context, check out the known issues with Kubernetes 1.6: https://github.com/kubernetes/kops/releases/tag/1.6.0
This may be caused by a known Docker bug where shutdown occurs before the content is synced to disk on layer creation. The fix is included in Docker v1.13.
The workaround is to remove the empty files and re-pull the image.
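A rough sketch of that workaround on an affected node (the exact paths depend on your Docker storage driver, so inspect the zero-byte files before deleting anything; the image reference is the redacted one from the error above):
# locate zero-byte files under the Docker data directory
sudo find /var/lib/docker -type f -size 0 -print
# after removing the offending empty files, re-pull the image on the node
sudo docker pull ********.dkr.ecr.eu-west-1.amazonaws.com/backend:da76bb49ec9a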