AWS Elastic Beanstalk Django 1.5 application not working

I'm new to AWS so please help me. I'll mention only the things that might matter for my problem. If you need more info, just ask in the comments section.
When I ping ELB address or app address, I get "Request timeout".
Server:
Instance type: micro
Custom AMI: ami-c37474b7
Load balancer:
Only HTTP, port 80
I also use RDS, S3, ElastiCache, and SQS.
I use S3 to store Django's static files as well, which works: I can see those files in my bucket.
RDS and SQS also work. The problem with ElastiCache is a timeout raised by libmemcached, but that isn't the main problem.
sg-cced0da3 | SecurityGroup for ElasticBeanstalk environment.
22 (SSH) 0.0.0.0/0
80 (HTTP) sg-ceed0da1
sg-ceed0da1 | ELB created security group used when no security group is specified during ELB creation - modifications could impact traffic to future ELBs
80 (HTTP) 0.0.0.0/0
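For reference, the rules above can be cross-checked from the CLI. A minimal sketch, assuming the AWS CLI is configured (group IDs and region taken from this post):
# Show the inbound rules of both groups: the instance SG should allow
# :80 from the ELB SG, and the ELB SG should allow :80 from 0.0.0.0/0.
aws ec2 describe-security-groups \
    --group-ids sg-cced0da3 sg-ceed0da1 \
    --region eu-west-1 \
    --query 'SecurityGroups[].[GroupId,IpPermissions]'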
Config file
packages:
  yum:
    libevent: []
    libmemcached: []
    libmemcached-devel: []

container_commands:
  01_collectstatic:
    command: "django-admin.py collectstatic --noinput"
  02_syncdb:
    command: "django-admin.py syncdb --noinput"
    leader_only: true
  03_createadmin:
    command: "utilities/scripts/createadmin.py"
    leader_only: true

option_settings:
  - namespace: aws:elasticbeanstalk:container:python
    option_name: WSGIPath
    value: findtofun/wsgi.py
  - option_name: DJANGO_SETTINGS_MODULE
    value: findtofun.settings
  - option_name: AWS_ACCESS_KEY_ID
    value: ...
  - option_name: AWS_SECRET_KEY
    value: ...
  - namespace: aws:elasticbeanstalk:container:python:staticfiles
    option_name: /static/
    value: static/
LOGS
/var/log/eb-tools.log
2013-06-03 14:52:47,908 [INFO] (27814 MainThread) [directoryHooksExecutor.py-29] [root directoryHooksExecutor info] Script succeeded.
2013-06-03 14:52:47,908 [INFO] (27814 MainThread) [directoryHooksExecutor.py-29] [root directoryHooksExecutor info] Executing script: /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py
2013-06-03 14:52:50,019 [INFO] (27814 MainThread) [directoryHooksExecutor.py-29] [root directoryHooksExecutor info] Output from script: New python executable in /opt/python/run/venv/bin/python2.6
Installing
/var/log/httpd/error_log
Python/2.6.8 configured -- resuming normal operations
[Mon Jun 03 16:53:06 2013] [error] Exception KeyError: KeyError(140672020449248,) in ignored
[Mon Jun 03 14:53:06 2013] [notice] caught SIGTERM, shutting down
[Mon Jun 03 14:53:08 2013] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Mon Jun 03 14:53:08 2013] [notice] Digest: generating secret for digest authentication
...
[Mon Jun 03 14:53:08 2013] [notice] Digest: done
[Mon Jun 03 14:53:08 2013] [notice] Apache/2.2.22 (Unix) DAV/2 mod_wsgi/3.2 Python/2.6.8 configured -- resuming normal operations
/opt/python/log/supervisord.log
2013-06-03 04:39:35,544 CRIT Supervisor running as root (no user in config file)
2013-06-03 04:39:35,650 INFO RPC interface 'supervisor' initialized
2013-06-03 04:39:35,651 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2013-06-03 04:39:35,651 INFO supervisord started with pid 3488
2013-06-03 04:39:36,658 INFO spawned: 'httpd' with pid 3498
2013-06-03 04:39:37,660 INFO success: httpd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2013-06-03 04:44:51,265 INFO stopped: httpd (exit status 0)
2013-06-03 04:44:52,280 INFO spawned: 'httpd' with pid 3804
2013-06-03 04:44:53,283 INFO success: httpd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2013-06-03 14:53:06,986 INFO stopped: httpd (exit status 0)
2013-06-03 14:53:08,000 INFO spawned: 'httpd' with pid 27871
2013-06-03 14:53:09,003 INFO success: httpd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
/var/log/cfn-init.log
2013-06-03 14:53:05,520 [DEBUG] Running test for command 03_createadmin
2013-06-03 14:53:05,535 [DEBUG] Test command output:
2013-06-03 14:53:05,536 [DEBUG] Test for command 03_createadmin passed
2013-06-03 14:53:05,986 [INFO] Command 03_createadmin succeeded
2013-06-03 14:53:05,987 [DEBUG] Command 03_createadmin output:
2013-06-03 14:53:05,987 [DEBUG] No services specified
2013-06-03 14:53:05,994 [INFO] ConfigSets completed
2013-06-03 14:53:06,000 [DEBUG] Not clearing reboot trigger as scheduling support is not available
2013-06-03 14:53:06,292 [DEBUG] CloudFormation client initialized with endpoint https://cloudformation.eu-west-1.amazonaws.com
2013-06-03 14:53:06,292 [DEBUG] Describing resource AWSEBAutoScalingGroup in stack arn:aws:cloudformation:eu-west-1:352769977590:stack/awseb-e-bwrsuih23z-stack/52c9b3c0-cbf6-11e2-ace7-5017c2ccb886
2013-06-03 14:53:06,489 [DEBUG] Not setting a reboot trigger as scheduling support is not available
2013-06-03 14:53:06,510 [INFO] Running configSets: Hook-EnactAppDeploy
2013-06-03 14:53:06,511 [INFO] Running configSet Hook-EnactAppDeploy
2013-06-03 14:53:06,512 [INFO] Running config Hook-EnactAppDeploy
2013-06-03 14:53:06,512 [DEBUG] No packages specified
2013-06-03 14:53:06,512 [DEBUG] No groups specified
2013-06-03 14:53:06,512 [DEBUG] No users specified
2013-06-03 14:53:06,513 [DEBUG] No sources specified
2013-06-03 14:53:06,513 [DEBUG] /etc/httpd/conf.d/01ebsys.conf already exists
2013-06-03 14:53:06,513 [DEBUG] Moving /etc/httpd/conf.d/01ebsys.conf to /etc/httpd/conf.d/01ebsys.conf.bak
2013-06-03 14:53:06,513 [DEBUG] Writing content to /etc/httpd/conf.d/01ebsys.conf
2013-06-03 14:53:06,514 [DEBUG] No mode specified for /etc/httpd/conf.d/01ebsys.conf
2013-06-03 14:53:06,514 [DEBUG] Running command aclean
2013-06-03 14:53:06,514 [DEBUG] No test for command aclean
2013-06-03 14:53:06,532 [INFO] Command aclean succeeded
2013-06-03 14:53:06,533 [DEBUG] Command aclean output:
2013-06-03 14:53:06,533 [DEBUG] Running command clean
2013-06-03 14:53:06,534 [DEBUG] No test for command clean
2013-06-03 14:53:06,547 [INFO] Command clean succeeded
2013-06-03 14:53:06,548 [DEBUG] Command clean output:
2013-06-03 14:53:06,548 [DEBUG] Running command hooks
2013-06-03 14:53:06,548 [DEBUG] No test for command hooks
2013-06-03 14:53:19,278 [INFO] Command hooks succeeded
2013-06-03 14:53:19,279 [DEBUG] Command hooks output: Executing directory: /opt/elasticbeanstalk/hooks/appdeploy/enact/
Executing script: /opt/elasticbeanstalk/hooks/appdeploy/enact/01flip.py
Output from script: httpd: stopped
httpd: started
httpd RUNNING pid 27871, uptime 0:00:03
Script succeeded.
Executing script: /opt/elasticbeanstalk/hooks/appdeploy/enact/09clean.sh
Output from script:
Script succeeded.

I don't know if this is the right answer, but it seems to work when I put DEBUG = False in the settings.py file. Can someone clarify this?
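For context: Django 1.5 introduced ALLOWED_HOSTS, and it is only enforced when DEBUG is False, so the production side of this looks roughly like the sketch below (the host suffix is an assumption based on the default Elastic Beanstalk CNAME, not taken from this post):
# Hedged sketch: append production settings to the project settings file.
cat >> findtofun/settings.py <<'EOF'
DEBUG = False
ALLOWED_HOSTS = ['.elasticbeanstalk.com']  # assumed EB CNAME suffix
EOF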

Related

Run Ops Agent on Linux in Azure

I would like to be able to monitor (logs, performance metrics) VMs in Azure (and other clouds) using Google Cloud Logging and Monitoring.
As a proof of concept:
- I'm using an Ubuntu 20.04 instance in Azure
- I have installed the Ops Agent
- I have put a key file in place for a service account with the required roles (Logging, Monitoring)
When I check the status of the Ops Agent, I see the following (mildly redacted)
● google-cloud-ops-agent-opentelemetry-collector.service - Google Cloud Ops Agent - Metrics Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-opentelemetry-collector.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730195 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=otel -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730208 ExecStart=/opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --config=${RUNTIME_DIRECTORY}/otel.yaml (code=exited, status=1/FAILURE)
Main PID: 2730208 (code=exited, status=1/FAILURE)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Metrics Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-opentelemetry-collector.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Metrics Agent.
● google-cloud-ops-agent-fluent-bit.service - Google Cloud Ops Agent - Logging Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent-fluent-bit.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-02-16 22:39:22 UTC; 1min 5s ago
Process: 2730194 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -service=fluentbit -in /etc/google-cloud-ops-agent/config.yaml -logs ${LOGS_DIRECTORY} -state ${STATE_DIRECTORY} (code=exited, status=0/SUCCESS)
Process: 2730207 ExecStart=/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit --config ${RUNTIME_DIRECTORY}/fluent_bit_main.conf --parser ${RUNTIME_DIRECTORY}/fluent_bit_parser.conf --log_file ${LOGS_DIRECTORY}/logging-module.log --storage_path ${STATE_DIRECTORY}/buffers (co>
Main PID: 2730207 (code=exited, status=255/EXCEPTION)
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Scheduled restart job, restart counter is at 5.
Feb 16 22:39:22 HOSTNAME systemd[1]: Stopped Google Cloud Ops Agent - Logging Agent.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Start request repeated too quickly.
Feb 16 22:39:22 HOSTNAME systemd[1]: google-cloud-ops-agent-fluent-bit.service: Failed with result 'exit-code'.
Feb 16 22:39:22 HOSTNAME systemd[1]: Failed to start Google Cloud Ops Agent - Logging Agent.
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2022-02-16 22:39:21 UTC; 1min 7s ago
Process: 2730090 ExecStartPre=/opt/google-cloud-ops-agent/libexec/google_cloud_ops_agent_engine -in /etc/google-cloud-ops-agent/config.yaml (code=exited, status=0/SUCCESS)
Process: 2730102 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
Main PID: 2730102 (code=exited, status=0/SUCCESS)
Feb 16 22:39:21 HOSTNAME systemd[1]: Starting Google Cloud Ops Agent...
Feb 16 22:39:21 HOSTNAME systemd[1]: Finished Google Cloud Ops Agent.
The Ops Agent logs show
[2022/02/16 22:39:22] [ info] [engine] started (pid=2730207)
[2022/02/16 22:39:22] [ info] [storage] version=1.1.5, initializing...
[2022/02/16 22:39:22] [ info] [storage] root path '/var/lib/google-cloud-ops-agent/fluent-bit/buffers'
[2022/02/16 22:39:22] [ info] [storage] normal synchronization mode, checksum enabled, max_chunks_up=128
[2022/02/16 22:39:22] [ info] [storage] backlog input plugin: storage_backlog.2
[2022/02/16 22:39:22] [ info] [cmetrics] version=0.2.2
[2022/02/16 22:39:22] [ info] [input:storage_backlog:storage_backlog.2] queue memory limit: 47.7M
[2022/02/16 22:39:22] [ info] [output:stackdriver:stackdriver.0] metadata_server set to http://metadata.google.internal
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] client_email is not defined, using a default one
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] private_key is not defined, fetching it from metadata server
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch token from the metadata server
[2022/02/16 22:39:22] [ warn] [output:stackdriver:stackdriver.0] token retrieval failed
[2022/02/16 22:39:22] [ warn] [net] getaddrinfo(host='metadata.google.internal', err=-2): Name or service not known
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] failed to create metadata connection
[2022/02/16 22:39:22] [error] [output:stackdriver:stackdriver.0] can't fetch project id from the metadata server
[2022/02/16 22:39:22] [error] [output] failed to initialize 'stackdriver' plugin
[2022/02/16 22:39:22] [ info] [input] pausing fluentbit_metrics.0
[2022/02/16 22:39:22] [ info] [input] pausing tail.1
[2022/02/16 22:39:22] [ info] [input] pausing storage_backlog.2
I notice "private_key is not defined, fetching it from metadata server", which suggests that the key file is not being picked up.
The documentation says "The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances."
Can the Ops Agent only be run on Compute Engine instances or is it reasonable to expect that it could be run anywhere if properly configured?
When google-cloud-ops-agent.service is started, it starts google-cloud-ops-agent-fluent-bit.service and google-cloud-ops-agent-opentelemetry-collector.service and then exits. Environment variables added as overrides to google-cloud-ops-agent.service do not persist to the others.
I found that I had to add GOOGLE_APPLICATION_CREDENTIALS to google-cloud-ops-agent-opentelemetry-collector.service and GOOGLE_SERVICE_CREDENTIALS to google-cloud-ops-agent-fluent-bit.service. You can override the systemd units non-interactively:
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-fluent-bit.service <<'EOF'
[Service]
Environment='GOOGLE_SERVICE_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
SYSTEMD_EDITOR=tee systemctl edit google-cloud-ops-agent-opentelemetry-collector.service <<'EOF'
[Service]
Environment='GOOGLE_APPLICATION_CREDENTIALS=/etc/google/auth/application_default_credentials.json'
EOF
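After writing the overrides, reload systemd and restart the top-level unit so the new environment takes effect (my understanding is that the sub-agents are restarted along with it):
sudo systemctl daemon-reload
sudo systemctl restart google-cloud-ops-agent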
The Ops Agent is looking for credentials and not finding them.
This means either you did not copy the service account key to the correct location with the correct file access permissions, or you did not set the GOOGLE_APPLICATION_CREDENTIALS environment variable correctly (again with the correct file access permissions).
The agent then falls back to the metadata service, which on Azure does not serve Google OAuth access tokens (Azure provides MSI credentials, if set up).
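A minimal sketch of the key placement, assuming the path used in the overrides above and a locally downloaded key.json (hypothetical filename):
# Put the service-account key where the unit overrides point,
# readable by root (the sub-agents run as root).
sudo mkdir -p /etc/google/auth
sudo install -m 0400 key.json /etc/google/auth/application_default_credentials.json
sudo systemctl restart google-cloud-ops-agent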

"Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"

Steps
I have a use case in which I want to create a Kubernetes cluster from scratch using kubeadm.
$ kubeadm init --config admin.yaml --v=7
admin.yaml:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  ignorePreflightErrors:
    - SystemVerification
localAPIEndpoint:
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: CSIBlockVolume=true,CSIDriverRegistry=true,CSINodeInfo=true,VolumeSnapshotDataSource=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSIBlockVolume: true
  CSIDriverRegistry: true
  CSINodeInfo: true
All operations seem to work until the connection to the kubelet needs to be established.
This is the final log before the crash. The GET requests are sent approximately 100 times before it crashes.
LOG:
I1216 12:31:45.043530 15460 round_trippers.go:463] GET https://<IP>:6443/healthz?timeout=10s
I1216 12:31:45.043550 15460 round_trippers.go:469] Request Headers:
I1216 12:31:45.043555 15460 round_trippers.go:473] Accept: application/json, */*
I1216 12:31:45.043721 15460 round_trippers.go:574] Response Status: in 0 milliseconds
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
I read the logs for kubelet with
$ journalctl -xeu kubelet
This is the output I received:
- The job identifier is 49904.
Dec 16 13:40:42 <IP> kubelet[24113]: I1216 13:40:42.883879 24113 server.go:198] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in tha>
Dec 16 13:40:42 <IP> kubelet[24113]: E1216 13:40:42.885069 24113 server.go:217] "Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"
Dec 16 13:40:42 <IP> systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- An ExecStart= process belonging to unit kubelet.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 1.
Dec 16 13:40:42 <IP> systemd[1]: kubelet.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit kubelet.service has entered the 'failed' state with result 'exit-code'.
Setup
Software Versions:
$ kubelet --version
Kubernetes v1.23.0
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:15:11Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes and platform
$ uname -a
Linux 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The server is deployed on Amazon AWS.
Container runtime: containerd
Docker installed: No
I also checked the Kubernetes documentation, which, if I read it correctly, states that all of these feature gates are now GA, i.e. integrated into Kubernetes and no longer experimental.
...kubelet[24113]: E1216 13:40:42.885069 24113 server.go:217] "Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"
The CSIBlockVolume feature gate applies to the API server, not the kubelet. You need to enable it in /etc/kubernetes/manifests/kube-apiserver.yaml by adding CSIBlockVolume=true to --feature-gates.
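Conversely, the kubelet crash itself goes away once the GA'd gates are removed from the KubeletConfiguration document. A hedged sketch of a trimmed admin.yaml (whether a v1.23 API server still accepts these gates is version-dependent, so treat the extraArgs line as an assumption):
# Rewrite admin.yaml without the kubelet featureGates block, then retry.
cat > admin.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  ignorePreflightErrors:
    - SystemVerification
localAPIEndpoint:
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: CSIBlockVolume=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
EOF
kubeadm reset -f
kubeadm init --config admin.yaml --v=7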

Amazon Linux 2 worker fails to reboot

I'm running a Node.js application on an Amazon Linux 2 worker instance, connected to SQS.
The problem
It all runs fine, except that for technical reasons I need to restart the server regularly. To do this, I've set up a cron to run /sbin/shutdown -r now at night.
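The cron entry amounts to the following (the hour shown is illustrative):
# root crontab: reboot nightly; the time of day is an example
0 3 * * * /sbin/shutdown -r now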
As the instance boots back up, I get an error regarding the SQS daemon service:
[INFO] Executing instruction: configureSqsd
[INFO] get sqsd conf from cfn metadata and write into sqsd conf file ...
[INFO] Executing instruction: startSqsd
[INFO] Running command /bin/sh -c systemctl show -p PartOf sqsd.service
[INFO] Running command /bin/sh -c systemctl is-active sqsd.service
[INFO] Running command /bin/sh -c systemctl start sqsd.service
[ERROR] An error occurred during execution of command [self-startup] - [startSqsd].
Stop running the command. Error: startProcess Failure: starting process "sqsd" failed:
Command /bin/sh -c systemctl start sqsd.service failed with error exit status 1.
Stderr:Job for sqsd.service failed because the control process exited with error code.
See "systemctl status sqsd.service" and "journalctl -xe" for details.
The instance is then stuck in a loop where the initialization runs until it hits the sqsd.service error and then starts over again.
Logs
The systemctl status sqsd.service command doesn't appear to show much more information than we already got, only that it exited with status 1:
● sqsd.service - This is sqsd daemon
Loaded: loaded (/etc/systemd/system/sqsd.service; enabled; vendor preset: disabled)
Active: deactivating (stop-sigterm) (Result: exit-code)
Process: 2748 ExecStopPost=/bin/sh -c (code=exited, status=0/SUCCESS)
Process: 2745 ExecStopPost=/bin/sh -c rm -f /var/pids/sqsd.pid (code=exited, status=0/SUCCESS)
Process: 2753 ExecStart=/bin/sh -c /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start (code=exited, status=1/FAILURE)
CGroup: /system.slice/sqsd.service
└─2789 /opt/elasticbeanstalk/lib/ruby/bin/ruby /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd start
The most interesting find when checking journalctl -xe is:
sqsd[9704]: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `initialize': No such file or directory @ rb_sysopen - /var/run/aws-sqsd/default.pid (Errno::ENOENT)
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `open'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:58:in `start'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:83:in `launch'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.3/bin/aws-sqsd:111:in `<top (required)>'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
sqsd[9704]: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
systemd[1]: sqsd.service: control process exited, code=exited status=1
systemd[1]: Failed to start This is sqsd daemon.
Further investigation
As per the logs, the file /var/run/aws-sqsd/default.pid does not exist when rebooting the server. It does exist on a rebuild and contains the application process ID.
If I add the file, the setup process gets a little bit further until a similar file is missing.
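One workaround I'm considering (an assumption on my part: /var/run is a tmpfs on Amazon Linux 2, so /var/run/aws-sqsd only exists after a full environment build) is to have systemd-tmpfiles recreate the directory on every boot:
# Recreate sqsd's runtime directory at boot, since tmpfs contents
# under /var/run do not survive a reboot.
echo 'd /var/run/aws-sqsd 0755 root root -' | sudo tee /etc/tmpfiles.d/aws-sqsd.conf
sudo systemd-tmpfiles --create /etc/tmpfiles.d/aws-sqsd.conf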
Solutions?
Has anyone run into this issue before? Not sure why starting sqsd.service fails after a normal reboot but works fine on initial deploy and after rebuilding the environment... It almost seems like it's looking for a config file that doesn't exist...
Are there any other ways to safely reboot the instance that I should try?
I have the exact same issue. I'm not posting a solution, just some more data on it. I found errors in /var/log/messages that suggest the sqsd daemon ran out of memory.
Apr 28 15:43:05 ip-172-31-121-3 sqsd: /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `fork': Cannot allocate memory - fork(2) (Errno::ENOMEM)
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:42:in `start'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:83:in `launch'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.6.0/gems/aws-sqsd-3.0.4/bin/aws-sqsd:111:in `<top (required)>'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `load'
Apr 28 15:43:05 ip-172-31-121-3 sqsd: from /opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd:23:in `<main>'
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service: control process exited, code=exited status=1
Apr 28 15:43:05 ip-172-31-121-3 systemd: Failed to start This is sqsd daemon.
Apr 28 15:43:05 ip-172-31-121-3 systemd: Unit sqsd.service entered failed state.
Apr 28 15:43:05 ip-172-31-121-3 systemd: sqsd.service failed.
After moving to a larger instance class it went through fine, but I'm not sure whether it was the extra memory or simply the refreshed instance (as david.emilsson mentioned) that did it.

Issue with awslogs service and CloudWatch Logs Agent on Ubuntu 16.04

On one of my AWS EC2 instances running Ubuntu 16.04, I'm getting the following errors filling up my /var/log/syslog.
Jul 17 18:11:21 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:21 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Main process exited, code=exited, status=255/n/a
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Unit entered failed state.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Failed with result 'exit-code'.
Jul 17 18:11:26 Mysql-Slave systemd[1]: awslogs.service: Service hold-off time over, scheduling restart.
Jul 17 18:11:26 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:26 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Main process exited, code=exited, status=255/n/a
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Unit entered failed state.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Failed with result 'exit-code'.
Jul 17 18:11:32 Mysql-Slave systemd[1]: awslogs.service: Service hold-off time over, scheduling restart.
Jul 17 18:11:32 Mysql-Slave systemd[1]: Stopped The CloudWatch Logs agent.
Jul 17 18:11:32 Mysql-Slave systemd[1]: Started The CloudWatch Logs agent.
The /var/log/awslogs.log contains these messages:
database is locked
2018-07-17 20:59:01,055 - cwlogs.push - INFO - 27074 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:01,055 - cwlogs.push - INFO - 27074 - MainThread - Using default logging configuration.
database is locked
2018-07-17 20:59:06,549 - cwlogs.push - INFO - 27104 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:06,549 - cwlogs.push - INFO - 27104 - MainThread - Using default logging configuration.
database is locked
2018-07-17 20:59:12,054 - cwlogs.push - INFO - 27110 - MainThread - Missing or invalid value for use_gzip_http_content_encoding config. Defaulting to using gzip encoding.
2018-07-17 20:59:12,054 - cwlogs.push - INFO - 27110 - MainThread - Using default logging configuration.
Any pointers on troubleshooting this would be of great help.
A similar issue was posted at the following link: https://forums.aws.amazon.com/thread.jspa?threadID=165134
I did the following:
a) Stopped the awslogs service:
$ service awslogs stop ## Amazon Linux
OR
$ service awslogsd stop ## Amazon Linux 2
b) Deleted the agent-state file (I renamed it in my case):
$ cd /var/awslogs/state; mv agent-state agent-state.old ## Amazon Linux
OR
$ cd /var/lib/awslogs; mv agent-state agent-state.old ## Amazon Linux 2
c) Restarted the awslogs service:
$ service awslogs start ## Amazon Linux
OR
$ sudo systemctl start awslogsd ## Amazon Linux 2
A new agent-state file was created as a result, and the errors mentioned in my post disappeared.
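Put together, a one-shot version of steps (a) through (c) for Amazon Linux 2:
# Stop the agent, rotate the state file, start the agent again.
sudo systemctl stop awslogsd
sudo mv /var/lib/awslogs/agent-state /var/lib/awslogs/agent-state.old
sudo systemctl start awslogsd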
Please try the following commands, based on your Linux version:
sudo service awslogs start
If you are running Amazon Linux 2, try the command below:
sudo systemctl start awslogsd
It took me 2 hours to figure this out.
In my case, I found duplicate entries for some properties in the /etc/awslogs/awslogs.conf file.
(Not all were duplicates; some of the properties were commented out, and I had uncommented them to set values.)
That didn't work. Then I scrolled to the bottom of the file and found the following entries. Setting the values on these properties made it work:
[/var/log/messages]
datetime_format = %b %d %H:%M:%S
file = /home/ec2-user/application.log
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = MyProject
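As a quick sanity check, duplicated section names in the file can be listed with something like:
# Print any [section] headers that occur more than once in the config
grep -o '^\[[^]]*\]' /etc/awslogs/awslogs.conf | sort | uniq -d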

Docker-Machine provisioned aws instance can not start docker engine

When I start an EC2 instance with
docker-machine create --driver amazonec2 --amazonec2-region eu-central-1 --amazonec2-instance-type t2.2xlarge aws-test
docker-machine can create the VM and exchange the certs, but starting the engine fails.
Log in the EC2:
ubuntu@aws-manager2:~$ systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─10-machine.conf
Active: inactive (dead) (Result: exit-code) since Thu 2017-06-29 09:18:44 UTC; 1min 53s ago
Docs: https://docs.docker.com
Process: 5263 ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --s
Main PID: 5263 (code=exited, status=1/FAILURE)
Jun 29 09:18:44 aws-manager2 systemd[1]: Failed to start Docker Application Container Engine.
Jun 29 09:18:44 aws-manager2 systemd[1]: docker.service: Unit entered failed state.
Jun 29 09:18:44 aws-manager2 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 29 09:18:44 aws-manager2 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 29 09:18:44 aws-manager2 systemd[1]: Stopped Docker Application Container Engine.
Jun 29 09:18:44 aws-manager2 systemd[1]: docker.service: Start request repeated too quickly.
Jun 29 09:18:44 aws-manager2 systemd[1]: Failed to start Docker Application Container Engine.
Log at startup:
$ docker-machine create --driver amazonec2 --amazonec2-region eu-central-1 --amazonec2-instance-type t2.2xlarge aws-manager2
Running pre-create checks...
Creating machine...
(aws-manager2) Launching instance...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with ubuntu(systemd)...
Installing Docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Error creating machine: Error running provisioning: ssh command error:
command : sudo systemctl -f start docker
err : exit status 1
output : Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
Yesterday it worked with the same configuration. Could there be a change in the AMI used since yesterday? I tried it from a different host but got the same error.
It seems there is a bug in the Docker version that was rolled out yesterday. A workaround for us:
docker-machine create --driver amazonec2 --engine-install-url=https://web.archive.org/web/20170623081500/https://get.docker.com
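To confirm the workaround took, check which engine version the machine came up with (reusing the aws-test machine name from above as an example):
# Show the engine version on the newly provisioned machine
docker-machine ssh aws-test "docker version"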