My flask app deployment via App Engine Flex is timing out and after setting debug=True. I see the following line repeating over and over until it fails. I am not sure however what this is and cannot find anything useful in logs explorer.
Updating service [default] (this may take several minutes)...working DEBUG: Operation [apps/enhanced-bonito-349015/operations/81b83124-17b1-4d90-abdc-54b3fa28df67] not complete. Waiting to retry.
Could anyone share advice on where to look to resolve this issue?
Here is my app.yaml (I thought this was due to a memory issue..):
runtime: python
env:flex
entrypoint: gunicorn - b :$PORT main:app
runtime_config:
python_version:3
resources:
cpu:4
memory_gb: 12
disk_size_gb: 1000
readiness_check:
path: "/readines_check"
check_interval_sec: 5
timeout_sec: 4
failure_threshold: 2
success_threshold: 2
app_start_timeout_sec: 300
Error logs:
ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-10T23:21:10.941Z47607.vt.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
There could be possible ways to resolve such deployment errors.
Increase the value of app_start_timeout_sec to the maximum value which is 1800
Make sure that all the Google Cloud services that Endpoints and ESP require are enabled on your project.
Assuming that splitHealthChecks feature is enabled, make sure to follow all the steps needed when migrating from the legacy version.
I am pushing a minimalistic Spring Boot web application on Cloud Foundry. My manifest looks like
---
applications:
- name: training-app
path: target/spring-boot-initial-0.0.1-SNAPSHOT.jar
instances: 1
memory: 1G
buildpacks:
- java_buildpack
env:
TRAINING_KEY_3: from manifest
When I push the application with Java Buildpack (https://github.com/cloudfoundry/java-buildpack/releases/tag/v4.45) , I see that it is creating an additional process of type -task which does not have any running instance though.
name: training-app
requested state: started
isolation segment: trial
routes: ***************************
last uploaded: Thu 20 Jan 21:29:31 IST 2022
stack: cflinuxfs3
buildpacks:
isolation segment: trial
name version detect output buildpack name
java_buildpack v4.45-offline-https://github.com/cloudfoundry/java-buildpack.git#f1b695a0 java java
type: web
sidecars:
instances: 1/1
memory usage: 1024M
start command: JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc)
-Djava.ext.dirs=$PWD/.java-buildpack/container_security_provider:$PWD/.java-buildpack/open_jdk_jre/lib/ext -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" &&
CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=13109 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration:
$CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/. org.springframework.boot.loader.JarLauncher
state since cpu memory disk details
#0 running 2022-01-20T15:59:55Z 0.0% 62.2M of 1G 130M of 1G
type: task
sidecars:
instances: 0/0
memory usage: 1024M
start command: JAVA_OPTS="-agentpath:$PWD/.java-buildpack/open_jdk_jre/bin/jvmkill-1.16.0_RELEASE=printHeapHistogram=1 -Djava.io.tmpdir=$TMPDIR -XX:ActiveProcessorCount=$(nproc)
-Djava.ext.dirs=$PWD/.java-buildpack/container_security_provider:$PWD/.java-buildpack/open_jdk_jre/lib/ext -Djava.security.properties=$PWD/.java-buildpack/java_security/java.security $JAVA_OPTS" &&
CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-3.13.0_RELEASE -totMemory=$MEMORY_LIMIT -loadedClasses=13109 -poolType=metaspace -stackThreads=250 -vmOptions="$JAVA_OPTS") && echo JVM Memory Configuration:
$CALCULATED_MEMORY && JAVA_OPTS="$JAVA_OPTS $CALCULATED_MEMORY" && MALLOC_ARENA_MAX=2 SERVER_PORT=$PORT eval exec $PWD/.java-buildpack/open_jdk_jre/bin/java $JAVA_OPTS -cp $PWD/. org.springframework.boot.loader.JarLauncher
There are no running instances of this process.
I understand that it is a Springboot Web application , and that corresponds to the process of type web , however I do not know
Who is creating the process of type task
What is the purpose of this process ?
It would be great of someone is able to help me here.
Regards
AM
Who is creating the process of type task
The buildpack creates both. This is what's been happening for a while, but recent cf cli changes are making this more visible.
What is the purpose of this process ?
I didn't add that into the buildpack so I can't 100% say its purpose, but I believe it is meant to be used in conjunction with running Java apps ask tasks on CF.
See this commit.
When you run a task, there is a --process flag to the cf run-task command which can be used to set a process to use as the command template. I believe the idea is that you'd set it to task so it can use that command to run your ask. See here for reference to that flag.
I have an Elastic Beanstalk instance which then runs out of memory starts giving 500 errors and the health goes to degraded which is expected behavior. Now when the memory is released server health is back to normal the HTTPS requests are stalling. There is no CPU or memory activity on the server but server still fails to load 2 out of 5 requests. The issue is resolved once i restart the Elastic Beanstalk instance.
I checked the error logs but there are no records seems as if the requests are not hitting the server they are being blocked at the load balancer. Is any one facing this issue? I tried running few different applications on the Elastic Beanstalk instance but results are the same.
Any suggestions or pointing me to docs will be appreciated as i didn't find anything in their docs about this.
Thanks
It is possible that some important process was OOM killed, and has not started back up properly. You can look at this question for help on that: Finding which process was killed by Linux OOM killer
It may be possible to simply reboot the server, and everything may start back up properly again.
The best solution is to not run out of memory in the first place. If you are unable to upgrade the server, perhaps due to cost reasons, then consider adding a swap file.
I have an Elastic Beanstalk application in which I add a swap file using an ebextension.
# .ebextensions/01-swap.config
commands:
"01-swap":
command: |
dd if=/dev/zero of=/var/swapfile bs=1M count=512
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile
echo "/var/swapfile none swap sw 0 0" >> /etc/fstab
test: test ! -f /var/swapfile
https://github.com/stefansundin/rssbox/blob/1e40fe60f888ad0143e5c4fb83c1471986032963/.ebextensions/01-swap.config
(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs)
Why are some streams of a CloudWatch Logs Group incomplete (i.e., the Fargate Docker Container exits successfully but the logs stop being updated abruptly)? Seeing this intermittently, in almost all log groups, however, not on every log stream/task run. I'm running on version 1.3.0
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS Tasks on Fargate, the log driver for is set as awslogs from a CloudFormation Template.
...
LogConfiguration:
LogDriver: 'awslogs'
Options:
awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
awslogs-region: !Ref AWS::Region
awslogs-stream-prefix: ecs
...
Seeing that sometimes the cloduwatch logs output is incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this is an issue with node js exiting asynchronously before console.log() is flushed, or that the process is exiting too soon, but the same problem occurs when i use a different language as well - which makes me believe this is not an issue with the code, but rather with cloudwatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that since the docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CWLogs, but there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command
"$#"
main.py
import logging
import time
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
if __name__ == "__main__":
# After testing some random values, had most luck to induce the
# issue by sleeping 9 seconds here; would occur ~30% of the time
time.sleep(9)
logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar and try
to expedite the process
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
Contacted AWS around August 1st, 2019, they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
import atexit
from time import sleep
atexit.register(finalizer)
def finalizer():
logger.info("All tasks have finished. Exiting.")
# Workaround:
# Fargate will exit and final batch of CloudWatch logs will be lost
sleep(10)
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer I switched from exec form to shell form of the ENTRYPOINT and added a 10 sec sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10
The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
airflow_tree_view
On hover on a failing task, it only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far :
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart the airflow completely the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note, I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stay in 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes