Hi, I have been trying to do CPU pinning in my EKS cluster. I am using the latest Amazon Linux release, and my EKS version is 1.22. I created a launch template with the user data below.
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0
--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh $CLUSTER_NAME
sleep 2m
yum update -y
sudo rm /var/lib/kubelet/cpu_manager_state
sudo chmod 777 kubelet.service
sudo cat > /etc/systemd/system/kubelet.service <<EOF
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service iptables-restore.service
Requires=docker.service
[Service]
ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5
ExecStart=/usr/bin/kubelet --cloud-provider aws \
--image-credential-provider-config /etc/eks/ecr-credential-provider/ecr-credential-provider-config \
--image-credential-provider-bin-dir /etc/eks/ecr-credential-provider \
--cpu-manager-policy=static \
--kube-reserved=cpu=0.5,memory=1Gi,ephemeral-storage=0.5Gi \
--system-reserved=cpu=0.5,memory=1Gi,ephemeral-storage=0.5Gi \
--config /etc/kubernetes/kubelet/kubelet-config.json \
--kubeconfig /var/lib/kubelet/kubeconfig \
--container-runtime docker \
--network-plugin cni $KUBELET_ARGS $KUBELET_EXTRA_ARGS
Restart=always
RestartSec=5
KillMode=process
[Install]
WantedBy=multi-user.target
EOF
sudo chmod 644 kubelet.service
sudo systemctl daemon-reload
sudo systemctl stop kubelet
sudo systemctl start kubelet
--//--
After creating the template, I used it for the EKS node group creation. After waiting a while, I get this error on the EKS dashboard:
Health issues (1)
NodeCreationFailure: Instances failed to join the Kubernetes cluster.
I then connected to that EC2 instance and used the following command to view the kubelet logs:
$ journalctl -f -u kubelet
The output is:
[ec2-user@ip-10.100.11.111 kubelet]$ journalctl -f -u kubelet
-- Logs begin at Thu 2022-04-21 07:27:50 UTC. --
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.199868 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-441d-4e25-9441-093b4c2ed548" containerName="efs-plugin" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.244811 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-441d-4e25-9441-093b4c2ed548" containerName="csi-provisioner" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.305206 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3b513cfa-441d-4e25-9441-093b4c2ed548" containerName="liveness-probe" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.335744 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-f5ac-4039-a151-110ddf27d140" containerName="efs-plugin" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.388843 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-f5ac-4039-a151-110ddf27d140" containerName="csi-driver-registrar" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.464789 12225 state_mem.go:80] "Updated desired CPUSet" podUID="de537700-f5ac-4039-a151-110ddf27d140" containerName="liveness-probe" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.545206 12225 state_mem.go:80] "Updated desired CPUSet" podUID="a2f09d0d-69f5-4bb7-82bb-edfa86cb87e2" containerName="kube-controller" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.633078 12225 state_mem.go:80] "Updated desired CPUSet" podUID="3ec70fe1-3680-4e3c-bcfa-81f80ebe20b0" containerName="kube-proxy" cpuSet="0-7"
Apr 21 07:31:21 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: I0421 07:31:21.696852 12225 state_mem.go:80] "Updated desired CPUSet" podUID="adbd9bef-c4e0-4bd1-a6a6-52530ad4bea3" containerName="aws-node" cpuSet="0-7"
Apr 21 07:46:12 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: E0421 07:46:12.424801 12225 certificate_manager.go:488] kubernetes.io/kubelet-serving: certificate request was not signed: timed out waiting for the condition
Apr 21 08:01:16 ip-10.100.11.111.us-west-2.compute.internal kubelet[12225]: E0421 08:01:16.810385 12225 certificate_manager.go:488] kubernetes.io/kubelet-serving: certificate request was not signed: timed out waiting for the condition
That was the output.
Before using this method I also tried another approach: I created a node group, then created an AMI from one of the nodes in that node group, modified the kubelet.service file, and removed the old cpu_manager_state file. I then used this image to create the node group. That worked, but I was unable to exec into the pods running on those nodes, and I was also unable to get the logs of the pods running there. And strangely, if I use
$ kubectl get nodes -o wide
the output was missing both the internal and external IP addresses.
So I moved on to using the user data instead of this method.
Kindly give me instructions to create a managed node group for an EKS cluster with the CPU manager policy set to static.
I had the same question. I added the following user data script to my launch template.
User data script
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
yum install -y jq
set -o xtrace
cp /etc/kubernetes/kubelet/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json.back
jq '. += { "cpuManagerPolicy":"static"}' /etc/kubernetes/kubelet/kubelet-config.json.back > /etc/kubernetes/kubelet/kubelet-config.json
--==MYBOUNDARY==--
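If you also want the CPU/memory reservations from the question's script, the same jq merge can set several kubelet options in one pass. A sketch, assuming you want those reservations too (the values here are illustrative, not required):
# merge cpuManagerPolicy and kubeReserved into kubelet-config.json in one pass
jq '. += {"cpuManagerPolicy":"static","kubeReserved":{"cpu":"500m","memory":"1Gi"}}' /etc/kubernetes/kubelet/kubelet-config.json.back > /etc/kubernetes/kubelet/kubelet-config.json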
Verification
You can verify the change took effect using kubectl:
# start a k8s API proxy
$ kubectl proxy
# get the node name
$ kubectl get nodes
# get kubelet config
$ curl -sSL "http://localhost:8001/api/v1/nodes/<<node_name>>/proxy/configz"
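If you have jq installed locally, you can pull just the policy out of that response. A minimal sketch, assuming the kubectl proxy from the step above is still running:
# the configz payload nests the config under "kubeletconfig"; this should print "static"
$ curl -sSL "http://localhost:8001/api/v1/nodes/<<node_name>>/proxy/configz" | jq -r '.kubeletconfig.cpuManagerPolicy'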
I got the solution from this guide: https://aws.amazon.com/premiumsupport/knowledge-center/eks-worker-nodes-image-cache/. However, I could not make the sed command work properly, so I used jq instead.
Logs
If you can SSH into the node, you can check the user data logs in /var/log/cloud-init-output.log. See https://stackoverflow.com/a/32460849/4400704
CPU pinning
I have a pod in the Guaranteed QoS class (CPU limit and request = 2) and I can verify it has two CPUs reserved:
$ cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,10
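For reference, a minimal sketch of a pod that should land in the Guaranteed QoS class and get exclusive CPUs under the static policy (the name and image are placeholders; requests must equal limits and the CPU count must be an integer):
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pinned-test            # hypothetical name
spec:
  containers:
  - name: app
    image: public.ecr.aws/amazonlinux/amazonlinux:2   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"               # integer CPUs -> eligible for exclusive pinning
        memory: 1Gi
      limits:
        cpu: "2"               # must match requests for Guaranteed QoS
        memory: 1Gi
EOF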
Related
When trying to SSH to GCE VMs using metadata-based SSH keys I get the following error:
ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain
While troubleshooting I can see the keys in the instance metadata, but they are not being added to the user's authorized_keys file:
$ curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/attributes/ssh-keys"
username:ssh-ed25519 AAAAC3NzaC....omitted....
admin:ssh-ed25519 AAAAC3NzaC....omitted....
$ sudo ls -hal /home/**/.ssh/
/home/ubuntu/.ssh/:
total 8.0K
drwx------ 2 ubuntu ubuntu 4.0K Aug 11 23:19 .
drwxr-xr-x 3 ubuntu ubuntu 4.0K Aug 11 23:19 ..
-rw------- 1 ubuntu ubuntu 0 Aug 11 23:19 authorized_keys
# Only result is the default zero-length file for ubuntu user
I also see the following errors in the ssh server auth log and Google Guest Environment services:
$ sudo less /var/log/auth.log
Aug 11 23:28:59 test-vm sshd[2197]: Invalid user admin from 1.2.3.4 port 34570
Aug 11 23:28:59 test-vm sshd[2197]: Connection closed by invalid user admin 1.2.3.4 port 34570 [preauth]
$ sudo journalctl -u google-guest-agent.service
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Refreshing passwd entry cache
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Refreshing group entry cache
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Failure getting groups, quitting
Aug 11 22:24:42 test-vm oslogin_cache_refresh[907]: Failed to get groups, not updating group cache file, removing /etc/oslogin_group.cache.bak.
# or
Aug 11 23:19:37 test-vm GCEGuestAgent[766]: 2022-08-11T23:19:37.6541Z GCEGuestAgent Info: Creating user admin.
Aug 11 23:19:37 test-vm useradd[885]: failed adding user 'admin', data deleted
Aug 11 23:19:37 test-vm GCEGuestAgent[766]: 2022-08-11T23:19:37.6869Z GCEGuestAgent Error non_windows_accounts.go:144: Error creating user: useradd: group admin exists - if you want to add this user to that group, use -g.
Currently the latest cloud-init and guest-oslogin packages for Ubuntu 20.04.4 LTS (focal) seem to have an issue that causes google-guest-agent.service to exit before completing its task. The issue was fixed and committed but not yet released for focal (and likely other Ubuntu versions).
For now you can try disabling OS Login by setting the instance or project metadata enable-oslogin=FALSE.
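A sketch of setting that metadata with the gcloud CLI (the instance name is taken from the logs above; the zone is a placeholder, and project-wide metadata would use gcloud compute project-info add-metadata instead):
$ gcloud compute instances add-metadata test-vm --zone=us-central1-a \
    --metadata enable-oslogin=FALSE
After that you should see the expected results and be able to SSH using those keys: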
$ sudo journalctl -u google-guest-agent.service
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.0517Z GCEGuestAgent Info: Created google sudoers file
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.0522Z GCEGuestAgent Info: Creating user username.
Aug 11 23:10:33 test-vm useradd[881]: new group: name=username, GID=1002
Aug 11 23:10:33 test-vm useradd[881]: new user: name=username, UID=1001, GID=1002, home=/home/username, shell=/bin/bash, from=none
Aug 11 23:10:33 test-vm gpasswd[895]: user username added by root to group ubuntu
Aug 11 23:10:33 test-vm gpasswd[904]: user username added by root to group adm
Aug 11 23:10:33 test-vm gpasswd[983]: user username added by root to group google-sudoers
Aug 11 23:10:33 test-vm GCEGuestAgent[761]: 2022-08-11T23:10:33.7615Z GCEGuestAgent Info: Updating keys for user username.
$ sudo ls -hal /home/username/.ssh/
/home/username/.ssh/:
total 12K
drwx------ 2 username username 4.0K Aug 11 23:19 .
drwxr-xr-x 4 username username 4.0K Aug 11 23:35 ..
-rw------- 1 username username 589 Aug 11 23:19 authorized_keys
The admin user, however, will not work, since it conflicts with an existing Linux group. You should pick a username that does not conflict with any of the name:x:123: names listed by getent group.
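A quick way to check a candidate username against the existing groups, using standard tools (the name here is just an example):
# prints the name if it collides with an existing group, nothing otherwise
$ getent group | cut -d: -f1 | grep -x 'admin'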
I am trying to create a Hadoop cluster in GCP using the commands below:
$ cd bdutil
$ ./bdutil -b [Bucket Name] \
-z us-east1-b \
-n 2 \
-P [Project-ID] \
deploy
...
At the (y/n) prompt, enter y.
I am facing the issues below; please help me resolve them:
Mon Oct 8 05:35:49 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-w-0 --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-w-1 --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Exited 1 : gcloud --project=geslanu-218716 --quiet --verbosity=info compute instances create geslanu-218716-m --machine-type=n1-standard-1 --image-family=debian-8 --image-project=debian-cloud --network=default --tags=bdutil --scopes storage-full --boot-disk-type=pd-standard --zone=us-east1-b
Mon Oct 8 05:35:50 UTC 2018: Command failed: wait ${SUBPROC} on line 326.
Mon Oct 8 05:35:50 UTC 2018: Exit code of failed command: 1
Mon Oct 8 05:35:50 UTC 2018: Detailed debug info available in file: /tmp/bdutil-20181008-053541-yeq/debuginfo.txt
Mon Oct 8 05:35:50 UTC 2018: Check console output for error messages and/or retry your command.
Check the detailed debug info file named in the output; in my case /tmp/bdutil-20181111-105223-nET/debuginfo.txt contained:
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Invalid value for field 'resource.name': 'antonmobile_gmail-w-0'. Must be a match of regex '(?:[a-z](?:[-a-z0-9$
I personally had an issue with UNIQUE_ID, where I had used an underscore; that does not conform to the name restrictions.
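As an illustration, you can test a candidate name against a slightly simplified form of that regex before running bdutil (the first name is the one from the error above, the second is a corrected variant):
# underscores fail the check, hyphens pass
$ echo 'antonmobile_gmail-w-0' | grep -E '^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$' || echo 'invalid name'
$ echo 'antonmobile-gmail-w-0' | grep -E '^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$'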
PS: I know it is probably too late to answer, but somebody else could have the same issue.
I just updated Sphinx to the latest version on a dedicated server running CentOS 7, but after hours of searching I can't find the problem.
The Sphinx index was created fine, but I can't start the search daemon. I get these messages all the time:
systemctl status searchd.service
searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: failed (Result: timeout) since Sat 2018-03-24 21:14:09 CET; 3min 4s ago
Process: 17865 ExecStartPre=/bin/chown sphinx.sphinx /var/run/sphinx (code=exited, status=0/SUCCESS)
Process: 17863 ExecStartPre=/bin/mkdir -p /var/run/sphinx (code=killed, signal=TERM)
Mar 24 21:14:09 systemd[1]: Starting SphinxSearch Search Engine...
Mar 24 21:14:09 systemd[1]: searchd.service start-pre operation timed out. Terminating.
Mar 24 21:14:09 systemd[1]: Failed to start SphinxSearch Search Engine.
Mar 24 21:14:09 systemd[1]: Unit searchd.service entered failed state.
Mar 24 21:14:09 systemd[1]: searchd.service failed.
I have really no idea where this problem comes from.
In your systemd service file (mine is in /usr/lib/systemd/system/searchd.service), comment out:
/bin/chown sphinx.sphinx /var/run/sphinx
/bin/mkdir -p /var/run/sphinx
(you can run these commands manually if that hasn't been done yet).
Then change from
Type=forking
to
Type=simple
Then do systemctl daemon-reload and you can start/stop/status the service:
[root@server ~]# cat /usr/lib/systemd/system/searchd.service
[Unit]
Description=SphinxSearch Search Engine
After=network.target remote-fs.target nss-lookup.target
After=syslog.target
[Service]
Type=simple
User=sphinx
Group=sphinx
# Run ExecStartPre with root-permissions
PermissionsStartOnly=true
#ExecStartPre=/bin/mkdir -p /var/run/sphinx
#ExecStartPre=/bin/chown sphinx.sphinx /var/run/sphinx
# Run ExecStart with User=sphinx / Group=sphinx
ExecStart=/usr/bin/searchd --config /etc/sphinx/sphinx.conf
ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait
KillMode=process
KillSignal=SIGTERM
SendSIGKILL=no
LimitNOFILE=infinity
TimeoutStartSec=infinity
PIDFile=/var/run/sphinx/searchd.pid
[Install]
WantedBy=multi-user.target
Alias=sphinx.service
Alias=sphinxsearch.service
[root@server ~]# systemctl start searchd
[root@server ~]# systemctl status searchd
● searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: active (running) since Sun 2018-03-25 10:41:24 EDT; 4s ago
Process: 111091 ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait (code=exited, status=1/FAILURE)
Main PID: 112030 (searchd)
CGroup: /system.slice/searchd.service
├─112029 /usr/bin/searchd --config /etc/sphinx/sphinx.conf
└─112030 /usr/bin/searchd --config /etc/sphinx/sphinx.conf
Mar 25 10:41:24 server.domain.com searchd[112026]: Sphinx 2.3.2-id64-beta (4409612)
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2001-2016, Andrew Aksyonoff
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Mar 25 10:41:24 server.domain.com searchd[112026]: Sphinx 2.3.2-id64-beta (4409612)
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2001-2016, Andrew Aksyonoff
Mar 25 10:41:24 server.domain.com searchd[112026]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'test1'
Mar 25 10:41:24 server.domain.com searchd[112026]: WARNING: index 'test1': prealloc: failed to open /var/lib/sphinx/test1.sph: No such file or directory...T SERVING
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'testrt'
Mar 25 10:41:24 server.domain.com systemd[1]: searchd.service: Supervising process 112030 which is not our child. We'll most likely not notice when it exits.
Hint: Some lines were ellipsized, use -l to show in full.
[root@server ~]# systemctl stop searchd
[root@server ~]# systemctl status searchd
● searchd.service - SphinxSearch Search Engine
Loaded: loaded (/usr/lib/systemd/system/searchd.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sun 2018-03-25 10:41:36 EDT; 1s ago
Process: 112468 ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait (code=exited, status=1/FAILURE)
Main PID: 112030
Mar 25 10:41:24 server.domain.com searchd[112026]: WARNING: index 'test1': prealloc: failed to open /var/lib/sphinx/test1.sph: No such file or directory...T SERVING
Mar 25 10:41:24 server.domain.com searchd[112026]: precaching index 'testrt'
Mar 25 10:41:24 server.domain.com systemd[1]: searchd.service: Supervising process 112030 which is not our child. We'll most likely not notice when it exits.
Mar 25 10:41:33 server.domain.com systemd[1]: Stopping SphinxSearch Search Engine...
Mar 25 10:41:33 server.domain.com searchd[112468]: [Sun Mar 25 10:41:33.183 2018] [112468] using config file '/etc/sphinx/sphinx.conf'...
Mar 25 10:41:33 server.domain.com searchd[112468]: [Sun Mar 25 10:41:33.183 2018] [112468] stop: successfully sent SIGTERM to pid 112030
Mar 25 10:41:36 server.domain.com systemd[1]: searchd.service: control process exited, code=exited status=1
Mar 25 10:41:36 server.domain.com systemd[1]: Stopped SphinxSearch Search Engine.
Mar 25 10:41:36 server.domain.com systemd[1]: Unit searchd.service entered failed state.
Mar 25 10:41:36 server.domain.com systemd[1]: searchd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
I had the same problem and finally found the solution that worked for me.
I have edited my /etc/systemd/system/sphinx.service to look like this:
[Unit]
Description=SphinxSearch Search Engine
After=network.target remote-fs.target nss-lookup.target
After=syslog.target
[Service]
User=sphinx
Group=sphinx
RuntimeDirectory=sphinxsearch
RuntimeDirectoryMode=0775
# Run ExecStart with User=sphinx / Group=sphinx
ExecStart=/usr/bin/searchd --config /etc/sphinx/sphinx.conf
ExecStop=/usr/bin/searchd --config /etc/sphinx/sphinx.conf --stopwait
KillMode=process
KillSignal=SIGTERM
SendSIGKILL=no
LimitNOFILE=infinity
TimeoutStartSec=infinity
#PIDFile=/var/run/sphinx/searchd.pid
PIDFile=/var/run/sphinxsearch/searchd.pid
[Install]
WantedBy=multi-user.target
Alias=sphinx.service
Alias=sphinxsearch.service
With this unit my searchd is able to survive a reboot. The solution from the previous post had a problem in my case: /var/run is cleared on reboot, so after a reboot searchd would start before its /var/run/sphinxsearch directory existed. RuntimeDirectory= makes systemd recreate that directory every time the service starts.
Note that the systemd shipped with RHEL/CentOS 7 does not accept the "infinity" value for the TimeoutStartSec parameter. You must set a numeric value, for example TimeoutStartSec=600.
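If you would rather not edit the packaged unit file directly, both fixes can go into a drop-in override instead. A sketch, assuming the unit is named searchd.service as in the question:
$ sudo mkdir -p /etc/systemd/system/searchd.service.d
$ sudo tee /etc/systemd/system/searchd.service.d/override.conf <<'EOF'
[Service]
RuntimeDirectory=sphinxsearch
RuntimeDirectoryMode=0775
TimeoutStartSec=600
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart searchd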
We constantly get Waiting: ImagePullBackOff during CI upgrades. Does anybody know what is happening? The k8s cluster is 1.6.2, installed via kops. During upgrades we run kubectl set image, and for the last two days we have been seeing the following error:
Failed to pull image "********.dkr.ecr.eu-west-1.amazonaws.com/backend:da76bb49ec9a": rpc error: code = 2 desc = net/http: request canceled
Error syncing pod, skipping: failed to "StartContainer" for "backend" with ErrImagePull: "rpc error: code = 2 desc = net/http: request canceled"
journalctl -r -u kubelet
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: W0726 09:32:40.731903 840 docker_sandbox.go:263] NetworkPlugin kubenet failed on the status hook for pod "backend-1277054742-bb8zm_default": Unexpected command output nsenter: cannot open : No such file or directory
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724387 840 generic.go:239] PLEG: Ignoring events for pod frontend-1493767179-84rkl/default: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724371 840 kuberuntime_manager.go:858] getPodContainerStatuses for pod "frontend-1493767179-84rkl_default(0fff3b22-71c8-11e7-9679-02c1112ca4ec)" failed: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724358 840 kuberuntime_container.go:385] ContainerStatus for 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d error: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: E0726 09:32:40.724329 840 remote_runtime.go:269] ContainerStatus "2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d" from runtime service failed: rpc error: code = 2 desc = Error: No such container: 2421109e0d1eb31242c5088b547c0f29377816ca068a283b8fe6c2d8e7e5874d
Jul 26 09:32:40 ip-10-0-49-227 kubelet[840]: with error: exit status 1
Try running: kubectl create configmap -n kube-system kube-dns
For more context, check out the known issues with Kubernetes 1.6: https://github.com/kubernetes/kops/releases/tag/1.6.0
This may be caused by a known docker bug where shutdown occurs before the content is synced to disk on layer creation. The fix is included in docker v1.13.
The workaround is to remove the empty layer files and re-pull the image.
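A sketch of that workaround on an affected node (the image reference is the one from the error above; any pod still using the image will pull it again on restart):
# remove the locally cached image, then pull it fresh
$ docker rmi ********.dkr.ecr.eu-west-1.amazonaws.com/backend:da76bb49ec9a
$ docker pull ********.dkr.ecr.eu-west-1.amazonaws.com/backend:da76bb49ec9a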
Below is what happened to one mail sent from a Drupal client.
$ grep 'B6693C0977' /var/log/maillog
Jan 19 14:12:30 instance-1 postfix/pickup[19329]: B6693C0977: uid=0 from=<admin@mailgun.domainA.com>
Jan 19 14:12:30 instance-1 postfix/cleanup[20035]: B6693C0977: message-id=<20170119141230.B6693C0977@mail.instance-1.c.tw-pilot.internal>
Jan 19 14:12:30 instance-1 postfix/qmgr[19330]: B6693C0977: from=<admin@mailgun.domainA.com>, size=5681, nrcpt=1 (queue active)
Jan 19 14:12:33 instance-1 postfix/smtp[20039]: B6693C0977: to=<username@hotmail.com>, relay=smtp.mailgun.org[52.41.19.62]:2525, delay=2.4, delays=0.02/0.05/1.8/0.53, dsn=5.7.1, status=bounced (host smtp.mailgun.org[52.41.19.62] said: 550 5.7.1 Relaying denied (in reply to RCPT TO command))
Jan 19 14:12:33 instance-1 postfix/bounce[20050]: B6693C0977: sender non-delivery notification: ABB94C0976
Jan 19 14:12:33 instance-1 postfix/qmgr[19330]: B6693C0977: removed
Relevant excerpts from my /etc/postfix/main.cf are below
# RELAYHOST SETTINGS
smtp_tls_security_level = encrypt
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
sender_dependent_relayhost_maps = hash:/etc/postfix/relayhost_map
and from /etc/postfix/sasl_passwd as follows:
@mailgun.domainA.com postmaster@mailgun.domainA.com:password
and from /etc/postfix/relayhost_map as follows:
@mailgun.domainA.com [smtp.mailgun.org]:2525
The permissions of the db files are as follows
# ls -lZ /etc/postfix/relayhost_map.db
-rw-r-----. root postfix unconfined_u:object_r:postfix_etc_t:s0 /etc/postfix/relayhost_map.db
# ls -lZ /etc/postfix/sasl_passwd.db
-rw-r-----. root postfix unconfined_u:object_r:postfix_etc_t:s0 /etc/postfix/sasl_passwd.db
The problem is:
Outbound mails are not going out.
No logs are shown in the Mailgun console.
Any insight is appreciated.
I know that this is an old question now, but I've just had the same issue and wanted to post a response for anyone who comes across this article in future.
I believe your issue is in /etc/postfix/relayhost_map, where you should have the following. Note that there are no brackets; for me it was the inclusion of brackets that was causing the issue:
@mailgun.domainA.com smtp.mailgun.org:2525
For anyone who is not using /etc/postfix/relayhost_map and is doing it all in /etc/postfix/sasl_passwd directly, the same applies:
smtp.mailgun.org:2525 postmaster@mailgun.domainA.com:password
Don't forget to regenerate the Postfix hash database files (both of them, if you changed both maps) and restart the service afterwards:
sudo postmap /etc/postfix/relayhost_map
sudo postmap /etc/postfix/sasl_passwd
sudo systemctl restart postfix
Or sudo service postfix restart if you're on an older system / not running systemd.
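To confirm the relay is actually being used, you can watch the mail log while sending a test message. A sketch; the recipient address is a placeholder, and the envelope sender is set so it matches the relayhost_map key:
$ echo 'relay test' | sendmail -f admin@mailgun.domainA.com recipient@example.com
$ sudo tail -f /var/log/maillog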
Usually this is related to problems on their platform. If everything was working OK previously, just open a ticket and they usually fix it in a few hours (yes, it is kind of hard that it takes a few hours).