invalid cloud build timeout? - google-cloud-platform

We have a build that takes anywhere from 1 minute to 15 minutes (a monobuild that is not parallelized yet, so it may build 8 servers or just 1). It was timing out, so I modified the build file to
steps:
- name: gcr.io/$PROJECT_ID/continuous-deploy
  timeout: 1200s
I also ran these commands (the last one failed, even though I got it from another post where it apparently worked for them)...
Deans-MacBook-Pro:orderly dean$ gcloud config set app/cloud_build_timeout 1250
Updated property [app/cloud_build_timeout].
Deans-MacBook-Pro:orderly dean$ gcloud config set builds/timeout 1300
Updated property [builds/timeout].
Deans-MacBook-Pro:orderly dean$ gcloud config set container/build-timeout 1350
ERROR: (gcloud.config.set) Section [container] has no property [build-timeout].
Deans-MacBook-Pro:orderly dean$
I get the following error saying that anything greater than 10 minutes is invalid:
invalid build: invalid timeout in build step #0: build step timeout "20m0s" must be <= build timeout "10m0s"
Why MUST it be less than 10m0s? I really need our builds to be about 20 minutes.
I was going off of
Why can't I override the timeout on my Google Cloud Build?
and
GCP Cloud build ignores timeout settings
thanks,
Dean

The timeout of each step must be less than or equal to the timeout of the whole build.
Setting the step-level timeout to 20 minutes causes the error because the timeout for the whole build defaults to 10 minutes.
The way to avoid this is to set the timeout of the full build to be greater than or equal to the timeout of the individual steps.
I added a small example of how to define this.
steps:
- name: gcr.io/$PROJECT_ID/continuous-deploy
  timeout: 1200s  # step timeout
timeout: 1200s    # full build timeout
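If you prefer not to keep the build-level timeout in the config file, the same limit can also be passed when submitting the build. A minimal sketch, assuming the config above is saved as cloudbuild.yaml (the 1200s value is just an example, and I haven't verified how the flag interacts with an in-file timeout):
gcloud builds submit --config=cloudbuild.yaml --timeout=1200s .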

Related

How do I use the timeout flag in gcloud run deploy?

I am trying to deploy a Node.js app to Cloud Run using the reference https://cloud.google.com/sdk/gcloud/reference/run/deploy. The build takes around 12 minutes, so I am including the timeout flag mentioned here: https://cloud.google.com/sdk/gcloud/reference/run/deploy#--timeout.
My command looks like this:
gcloud run deploy my-node-project --timeout 900 --source .
My build still times out after 10 minutes though. How do I adjust it to give me the 12 minutes I need?
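(Note that the --timeout flag on gcloud run deploy sets the request timeout of the deployed service, not the Cloud Build timeout used when building from --source. One option, sketched below with a hypothetical image name, is to build the image with an explicit build timeout first and then deploy the prebuilt image:)
gcloud builds submit --tag gcr.io/PROJECT_ID/my-node-project --timeout=900s .
gcloud run deploy my-node-project --image gcr.io/PROJECT_ID/my-node-project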

App Engine Flexible deployment fails to become healthy in the allotted time

My Flask app deployment via App Engine Flex is timing out. After setting debug=True, I see the following line repeating over and over until it fails. I am not sure what this is and cannot find anything useful in Logs Explorer.
Updating service [default] (this may take several minutes)...working DEBUG: Operation [apps/enhanced-bonito-349015/operations/81b83124-17b1-4d90-abdc-54b3fa28df67] not complete. Waiting to retry.
Could anyone share advice on where to look to resolve this issue?
Here is my app.yaml (I thought this was due to a memory issue..):
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app
runtime_config:
  python_version: 3
resources:
  cpu: 4
  memory_gb: 12
  disk_size_gb: 1000
readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
Error logs:
ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-10T23:21:10.941Z47607.vt.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
There are a few possible ways to resolve such deployment errors.
Increase the value of app_start_timeout_sec up to its maximum of 1800 (see the sketch after this list).
Make sure that all the Google Cloud services that Endpoints and ESP require are enabled on your project.
Assuming that the splitHealthChecks feature is enabled, make sure to follow all the steps needed when migrating from the legacy version.
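For the first point, a sketch of the adjusted readiness_check section of the app.yaml above, with only app_start_timeout_sec changed from the posted values:
readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 1800  # raised from 300 to the documented maximum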

How to increase the cloud build timeout when using ```gcloud run deploy```?

When attempting to deploy to Cloud Run using gcloud run deploy, I am hitting the 10m Cloud Build timeout limit. gcloud run deploy works well as long as the build step does not exceed 10m. When the build step exceeds 10m, the build fails with the "Timed out" status, as shown in the screenshot below. AFAIK there are no arguments to gcloud run deploy that can set the Cloud Build timeout limit. The gcloud run deploy docs are here: https://cloud.google.com/sdk/gcloud/reference/run/deploy
I've attempted to increase the Cloud Build timeout limit using gcloud config set builds/timeout 20m and gcloud config set container/build_timeout 20m, but these settings are not reflected in the execution details of the cloud build process when using gcloud run deploy.
In the GUI, this is the setting I want to change:
Is it possible to increase the Cloud Build timeout limit using gcloud run deploy?
How about splitting the command into (more easily configured) constituents?
[I've not tried this]
Build the container image, specifying the timeout:
gcloud builds submit --source=.... --timeout=...
Then reference the image that results when you gcloud run deploy:
gcloud run deploy ... --image=...
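For example, a concrete sketch of that two-step flow (project, service, and region names are placeholders):
gcloud builds submit --tag gcr.io/PROJECT_ID/my-service --timeout=1200s .
gcloud run deploy my-service --image gcr.io/PROJECT_ID/my-service --region us-central1
The trade-off is that the build timeout is now controlled explicitly by gcloud builds submit, while gcloud run deploy only deploys the already-built image.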
I know this is answered and confirmed, but @DazWikin's solution was a harder way to solve this problem than @SimonKarman's solution.
For those who do not have a cloudbuild.yml file, like myself, this solution is still a valid one; you just need to edit the one created by Google itself. You can find it under Builds > Triggers > Desired Trigger (Edit).
Then, when you open the editor, you can apply the timeout. If you want other changes to the YAML file, you can also check out the schema here:
https://cloud.google.com/build/docs/build-config-file-schema#yaml
Note: I am using Cloud Run and this worked for me, so I am not 100% sure whether it works with all builds generated by Google.
Hope it will be helpful for someone else in the future :)
If you're using a --source such as the cloudbuild.yaml you can add the following property to alter the timeout in seconds:
...
timeout: "1800s"
...
You can find this in the documentation.

Missing log lines when writing to cloudwatch from ECS Docker containers

(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs)
Why are some streams of a CloudWatch Logs group incomplete (i.e., the Fargate Docker container exits successfully but the logs stop being updated abruptly)? I'm seeing this intermittently, in almost all log groups, though not on every log stream/task run. I'm running on version 1.3.0.
Description:
A Dockerfile runs node.js or Python scripts using the CMD command.
These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.
Sample Dockerfile:
FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]
All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS tasks on Fargate, the log driver is set to awslogs via a CloudFormation template.
...
LogConfiguration:
  LogDriver: 'awslogs'
  Options:
    awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: ecs
...
Seeing that sometimes the CloudWatch Logs output is incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this was an issue with Node.js exiting asynchronously before console.log() is flushed, or that the process was exiting too soon, but the same problem occurs when I use a different language as well, which makes me believe this is not an issue with the code but rather with CloudWatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.
It's possible that since the docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CWLogs, but there must be a way to ensure that this doesn't happen?
sample logs:
incomplete stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
completed log stream:
{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename
stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below
I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!
I think it's due to CloudWatch Logs Agent publishing log events in batches:
How are log events batched?
A batch becomes full and is published when any of the following conditions are met:
The buffer_duration amount of time has passed since the first log event was added.
Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.
The number of log events has reached batch_count.
Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.
(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)
So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)
There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.
With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.
Dockerfile
FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .
# The original command
# ENTRYPOINT ["python", "main.py"]
# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]
run_then_wait.sh
#!/bin/sh
set -e
# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT
# Run the given command
"$#"
main.py
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")
Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.
It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!
Just contacted AWS support about this issue and here is their response:
...
Based on that case, I can see that this occurs for containers in a
Fargate Task that exit quickly after outputting to stdout/stderr. It
seems to be related to how the awslogs driver works, and how Docker in
Fargate communicates to the CW endpoint.
Looking at our internal tickets for the same, I can see that our
service team are still working to get a permanent resolution for this
reported bug. Unfortunately, there is no ETA shared for when the fix
will be deployed. However, I've taken this opportunity to add this
case to the internal ticket to inform the team of the similar case and
try to expedite the process.
In the meantime, this can be avoided by extending the lifetime of the
exiting container by adding a delay (~>10 seconds) between the logging
output of the application and the exit of the process (exit of the
container).
...
Update:
Contacted AWS around August 1st, 2019, they say this issue has been fixed.
I observed this as well. It must be an ECS bug?
My workaround (Python 3.7):
import atexit
from time import sleep

import logging
logger = logging.getLogger()  # assumes logging is configured elsewhere in the app

def finalizer():
    logger.info("All tasks have finished. Exiting.")
    # Workaround:
    # Fargate will exit and the final batch of CloudWatch logs will be lost,
    # so give the log driver time to flush before the process exits.
    sleep(10)

# Register after the function is defined so the name resolves correctly
atexit.register(finalizer)
I had the same problem with flushing logs to CloudWatch.
Following asavoy's answer, I switched from the exec form to the shell form of ENTRYPOINT and added a 10-second sleep at the end.
Before:
ENTRYPOINT ["java","-jar","/app.jar"]
After:
ENTRYPOINT java -jar /app.jar; sleep 10
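A possible variant (my own sketch, not from the original answer) that still sleeps but preserves the application's exit status, so a failing container is still reported as failed:
ENTRYPOINT ["sh", "-c", "java -jar /app.jar; status=$?; sleep 10; exit $status"]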

All jobs failing in C COMPSs execution

I have downloaded COMPSs 1.4 and some test programs from http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation and I am trying to test them. Java executions went fine; however, I am having problems with C.
I am currently trying to execute the Simple application. The Readme states that I only need two commands:
buildapp simple
runcompss --lang=c master/simple 1
The app builds fine, but when executing with this command, I get the following error:
[ERRMGR] - WARNING: Job 1 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR] - WARNING: Task 1 execution on worker localhost has failed; rescheduling task execution. (changing worker)
[ERRMGR] - WARNING: No task could be scheduled to any of the available resources.
This could end up blocking COMPSs. Will check it again in 20 seconds.
Possible causes:
-Network problems: non-reachable nodes, sshd service not started, etc.
-There isn't any computing resource that fits the defined tasks constraints.
If this happens 2 more times, the runtime will shutdown.
After 3 checks, the execution ends with no results. Is there something I am missing?
When running an application with the C binding, the default project.xml is not valid: you have to define a project.xml that specifies where the worker binaries are deployed on each host.
<Project>
  <Worker Name="localhost">
    <InstallDir>/opt/COMPSs/Runtime/scripts/system/</InstallDir>
    <WorkingDir>[/path/to/dir/used_as_working_dir]</WorkingDir>
    <AppDir>[/path/to/installation]</AppDir>
    <LimitOfTasks>4</LimitOfTasks>
  </Worker>
</Project>
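Then point runcompss at that file when launching. A sketch, under the assumption that the option is named --project in your COMPSs version (check runcompss --help to confirm):
runcompss --lang=c --project=/path/to/project.xml master/simple 1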