Cloud Run Error 504 (Upstream Request Timeout) after successful deploy - google-cloud-platform

I was following this tutorial from Google to deploy a service to Cloud Run (https://codelabs.developers.google.com/codelabs/cloud-run-hello-python3#5). In Cloud Shell my project deploys successfully (screenshot below). However, when I open the service URL I get a timeout. If I test it locally from Cloud Shell, it works fine.
Why could this be happening? Where could I find more information about the issue?

As mentioned in the documentation:
For Cloud Run services, the request timeout setting specifies the time
within which a response must be returned by services deployed to Cloud
Run. If a response isn't returned within the time specified, the
request ends and error 504 is returned.
The timeout is set by default to 5 minutes and can be extended up to
60 minutes. You can change this setting when you deploy a container
image or by updating the service configuration. In addition to
changing the Cloud Run request timeout, you should also check your
language framework to see whether it has its own request timeout
setting that you must also update.
You can also refer to this public group issue, which may be helpful in resolving the error.

You can increase the timeout from the Cloud Console by clicking EDIT & DEPLOY NEW REVISION and then adjusting the Request timeout value.
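If you prefer the CLI, the same setting can be changed with gcloud (a sketch; the service name and region below are placeholders):
gcloud run services update <my_service> --region=<my_region> --timeout=900
The --timeout flag is also accepted by gcloud run deploy, so you can set it at deploy time as well.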

Related

Airflow web-server produces temporary 502 errors in Cloud Composer

I'm encountering 502 errors in the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day; once they're gone, everything works fine.
At the moment of errors:
there is a gap in the logs, after which the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in the web server's resource usage
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think this is the result of an OOM error, because we have DAGs with a large number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow server VMs, or have any other ideas?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual workers' logs, and it even mentions OOM issues (direct link).
The troubleshooting section is generally well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure how valuable that will be in this use case (reference).
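If you want to look at the web server logs directly, Cloud Logging keeps them even though the VM itself lives in the tenant project. A minimal sketch with gcloud, assuming the usual Composer log names (the resource type and log ID below may need adjusting for your environment):
gcloud logging read 'resource.type="cloud_composer_environment" AND log_id("airflow-webserver")' --project=<my_project> --limit=50
Memory pressure around the log gaps should also show up in the web server memory metrics in Cloud Monitoring.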

Media Tailor ad returning 504 error in AWS

I'm using AWS Elemental MediaTailor to test an ad insertion demo. The demo page is this one: https://github.com/aws-samples/aws-media-services-simple-vod-workflow/tree/master/12-AdMarkerInsertion.
When I load my manifest in THEOplayer I always get a 504 error. My manifest URL is: https://ebf348c58b834d189af82777f4f742a6.mediatailor.us-west-2.amazonaws.com/v1/master/3c879a81c14534e13d0b39aac4479d6d57e7c462/MyTestCampaign/llama.m3u8.
I have also tried with: https://ebf348c58b834d189af82777f4f742a6.mediatailor.us-west-2.amazonaws.com/v1/master/3c879a81c14534e13d0b39aac4479d6d57e7c462/MyTestCampaign/llama_with_slates.m3u8.
The specific error is:
{"message":"failed to generate manifest: Unable to obtain template playlist. sessionId:[c915d529-3527-4e37-89e0-087e393e75de]"}
I have read about this error: https://docs.aws.amazon.com/mediatailor/latest/ug/playback-errors-examples.html
But I don't know how to fix it.
Maybe I did something wrong, or do I need a quota increase in AWS?
Any ideas?
Thanks for the inquiry!
The following example shows the result when a timeout occurs between AWS Elemental MediaTailor and either the ad decision server (ADS) or the origin server.
An HTTP 504 error is known as a Gateway Timeout, meaning that a resource was unresponsive and prevented the request from completing successfully. In this case, since MediaTailor is returning an HTTP 504, either the ADS or the origin failed to respond within the timeout period.
To troubleshoot this, you will need to determine which dependency is failing to respond to MediaTailor and correct it. Typically the issue is the ADS failing to respond to a VAST request performed by MediaTailor, which you can confirm by reviewing your CloudWatch logs.
https://docs.aws.amazon.com/mediatailor/latest/ug/monitor-cloudwatch-ads-logs.html
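For example, you can pull the recent ADS interaction logs from the CLI and search for the session ID from your error message (a sketch; the log group name below is the one MediaTailor typically uses for ADS logs, so adjust it if your configuration differs):
aws logs filter-log-events --log-group-name MediaTailor/AdDecisionServerInteractions --filter-pattern "c915d529-3527-4e37-89e0-087e393e75de" --limit 20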
Make sure that your ADS follows the guidelines listed below for integrating with MediaTailor.
https://docs.aws.amazon.com/mediatailor/latest/ug/vast-integration.html

Postman: Execute request in collection runner after successfully completing first request

I am trying to deploy cloud VMs using Postman, and below is the workflow that I am trying to accomplish.
1.) Send a request to deploy the VM image (it may take a few minutes for the VM to be successfully deployed).
2.) Send another request to check the status of the VM deployment, and check the response for completion.
3.) If the response is not complete, send another health check request after 10 seconds, until the response contains completed.
4.) If the response for the above health check is successful, execute the next request in the collection.
Thanks
Add the logic below as the test script for the request that checks the status of the VM deployment (a consolidated sketch follows these steps).
Send a request to check the deploy status.
If the deploy is not complete, add a wait time of 10 seconds:
setTimeout(function(){}, 10000);
then set the next request back to the status check:
postman.setNextRequest("request name of check deploy status")
If the deploy is complete, continue with the next request in the collection using postman.setNextRequest().
If the deploy is not complete, repeat with the delay and run the status check request again using postman.setNextRequest().
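Putting those steps together, the test script on the status check request could look roughly like this (a sketch; the request names and the "completed" status field are placeholders for whatever your API actually returns):
// Test script on the "Check deploy status" request (request names are placeholders)
var body = pm.response.json();
if (body.status === "completed") {
    // VM is ready: hand off to the next request in the collection
    postman.setNextRequest("Next request in collection");
} else {
    // Not ready yet: wait ~10 seconds, then run this status check request again
    setTimeout(function () {}, 10000);
    postman.setNextRequest("Check deploy status");
}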

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and exits successfully. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a 120-second server.timeout limit. I installed Node 13 in my Docker image and the problem is gone.
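If upgrading Node isn't an option, you can also raise or disable that limit on the server object yourself (a sketch for a plain Node HTTP server; the handler and port are placeholders):
const http = require('http');
const server = http.createServer(requestHandler);   // requestHandler is your existing handler
server.setTimeout(0);   // 0 disables the default 120 s timeout on Node < 13; or pass a higher value in ms
server.listen(process.env.PORT || 8080);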
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I've run an incremental sleep test on my Flask endpoint, which returns 200 after 1 min, 2 min, and 10 min of waiting time. Having triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. I found that one of the properties of my Cloud Scheduler job was causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called attemptDeadline, which is set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

Health check in Cloud Foundry

Does anyone know how I can tell my Cloud Foundry instance to monitor my health endpoint, so that when the endpoint reports the app's health is not status: UP, the app is restarted?
The cf CLI 6.24.0 (released Feb 2017) exposed this type of health checking.
In your app manifest, use:
applications:
- name: myapp
  health-check-type: http
  health-check-http-endpoint: /admin/health
Your app needs to return a 200 status code from that path, or an error code when it's not status UP.
You can also use the cf set-health-check command to configure it on existing apps.
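For example, to point an already-pushed app at the endpoint from the manifest above (the app name is just the one from the example; you may need to restart the app for the change to take effect):
cf set-health-check myapp http --endpoint /admin/health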
Check out this documentation for more details on the different health check types.
If an app instance dies, Cloud Foundry will, by default, spin up a new instance and try to start it. That resiliency is built into Cloud Foundry.
Actuators are REST endpoints automatically added to your app that let you see the app's status and health at runtime.
https://spring.io/guides/gs/actuator-service/
Try Actuators out.
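As a minimal sketch of that (assuming a Maven-based Spring Boot app), it is mostly just a matter of adding the starter; the health endpoint is then served at /health in Spring Boot 1.x or /actuator/health in 2.x:
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>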
I don't believe that custom URL health checking is available today in CF. If your application instance is no longer healthy and you want to restart it, you can call System.exit(1) and CF will restart it for you.
I've heard rumors of custom health checks possibly coming in the future with the CC v3 API and Diego.
The way to do a health check in PCF:
cf set-health-check APP-NAME <HEALTH-CHECK-TYPE> --endpoint <CUSTOM-HTTP-ENDPOINT>
HEALTH-CHECK-TYPE = process | port | http (ideally http for web apps)
CUSTOM-HTTP-ENDPOINT = /health
Reference: https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html