Airflow web-server produces temporary 502 errors in Cloud Composer - google-cloud-platform

I'm encountering 502 errors on the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day; after the error is gone, everything works fine.
At the moment of the errors:
there is a gap in the logs, and afterwards we can see that the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in resource usage of the web server
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think this is the result of an OOM error, because we have DAGs with a large number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow web server VMs, or have any other ideas?

The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check the logs of individual workers, and it even mentions OOM issues (direct link).
Generally, the troubleshooting section is well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure whether this will be valuable in this use case (reference).
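If Cloud Logging does turn out to be useful, one place to start is pulling the web server logs for the environment from there. A rough sketch (untested; my-composer-env and my-project are placeholders, and the airflow-webserver log name is an assumption that may differ by Composer version):
# Read recent Airflow web server log entries for the environment
gcloud logging read 'resource.type="cloud_composer_environment" AND resource.labels.environment_name="my-composer-env" AND log_id("airflow-webserver")' --project=my-project --freshness=1d --limit=100
Entries around the gap you described (gunicorn restarts, worker timeouts) would help support or rule out the OOM theory.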

Related

Cloud Run Error 504 (Upstream Request Timeout) after successful deploy

I was following this tutorial from Google to deploy a service to Cloud Run (https://codelabs.developers.google.com/codelabs/cloud-run-hello-python3#5). In Cloud Shell my project is deployed successfully (screenshot below). However, once I click on the link I get a timeout. If I test it locally from Cloud Shell it works fine.
Why could this be happening? Where could I get more data about the issue?
As mentioned in the documentation:
For Cloud Run services, the request timeout setting specifies the time within which a response must be returned by services deployed to Cloud Run. If a response isn't returned within the time specified, the request ends and error 504 is returned.
The timeout is set by default to 5 minutes and can be extended up to 60 minutes. You can change this setting when you deploy a container image or by updating the service configuration. In addition to changing the Cloud Run request timeout, you should also check your language framework to see whether it has its own request timeout setting that you must also update.
You can refer to this Public group issue, which will be helpful in resolving the current error.
You can increase the timeout by clicking EDIT & DEPLOY NEW REVISION and then adjusting the Request timeout value.
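The same change can also be made from the command line; a minimal sketch, assuming a managed Cloud Run service named my-service in us-central1 (both placeholders):
# Raise the request timeout to 15 minutes for an existing service
gcloud run services update my-service --region=us-central1 --timeout=900
The new revision created by this update will apply the longer timeout to subsequent requests.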

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service set up and a Cloud Scheduler task that calls an endpoint on that service. When the task completes (the HTTP handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and successfully exits. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a 120-second server.timeout limit. I installed Node 13 in Docker and the problem is gone.
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I ran an incremental sleep test on my Flask endpoint, which returns 200 after 1 min, 2 min, and 10 min of waiting time. Having triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. I found that one of the properties of my Cloud Scheduler job was causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline', which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometimes in the morning, I see that a few tasks failed without a reason during the night.
When checking the logs in the UI, I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean that the Airflow worker pod was evicted (i.e., it died before it could flush its logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Task Instances" that the affected tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is proof that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine; both checks are sketched below.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
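To make the eviction check and the concurrency change concrete, here is a rough sketch, assuming placeholder names (my-composer-env, us-central1) and that kubectl access to the Composer GKE cluster is already configured:
# Look for evicted airflow-worker pods in the Composer GKE cluster
kubectl get pods --all-namespaces | grep -i evicted
# Inspect one of them to confirm the eviction reason (often "The node was low on resource: memory")
kubectl describe pod <evicted-pod-name> --namespace=<its-namespace>
# Lower the Celery worker concurrency via an Airflow config override
gcloud composer environments update my-composer-env --location=us-central1 --update-airflow-configs=celery-worker_concurrency=6
The concurrency value of 6 is only an example; pick a value your node's memory can sustain.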

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the react-native-aws-cognito-js library in our code. The apps were working fine until the last 2 days. The apps are experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these API calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause of, and solution to, this?
According to this AWS doc (in Chinese), CloudWatch will not write to the log when the error is due to insufficient IPs/ENIs. That explains the increase in the error count with no logs in CloudWatch.
Update 2
We have found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have many users log in to the server due to the "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all network interfaces in the VPC?
Short answer: CloudTrail.
Long answer with a suggestion
Assuming your application code is fine, the most likely cause of your 500 error is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to prove the limits first, I would suggest adding some logs around the API call yourself; in development you could hit your app/API with a high number of calls, and most likely you will see the 500 error due to the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done
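Regarding the "Client.NetworkInterfaceLimitExceeded" part of Update 2, a rough sketch (untested; the interface ID is a placeholder) of how leftover, unattached ENIs can be listed and removed with the AWS CLI:
# List network interfaces that are not attached to anything
aws ec2 describe-network-interfaces --filters Name=status,Values=available --query 'NetworkInterfaces[].NetworkInterfaceId' --output text
# Delete one of the unattached interfaces (repeat per interface)
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0
Only interfaces in the "available" (unattached) state can be deleted this way.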

Error when deploying applications to cloud foundry using cloud bees plugin

I have integrated my Cloud Foundry account with CloudBees as mentioned in this URL:
http://docs.cloudfoundry.com/docs/dotcom/integration/cloudbees/
and am trying to deploy a few sample applications from GitHub.
The build was successful every time, but when I attempted app deployment using this plugin, it gave one exception (the same exception for the 2-3 applications I have tried).
[INFO] Deployment done in 1.2 sec
[cloudbees-deployer] Deploying as (jenkins) to the svcnvghi293 account
[cloudbees-deployer] Deploying null
com.cloudbees.plugins.deployer.exceptions.DeployException: Could not create DeployEvent
at com.cloudbees.plugins.deployer.impl.run.RunEngineImpl.createEvent(RunEngineImpl.java:132)
at com.cloudbees.plugins.deployer.impl.run.RunEngineImpl.createEvent(RunEngineImpl.java:51)
at com.cloudbees.plugins.deployer.engines.Engine.perform(Engine.java:82)
at com.cloudbees.plugins.deployer.DeployPublisher.perform(DeployPublisher.java:95)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:728)
at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:703)
at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.post2(MavenModuleSetBuild.java:994)
at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:650)
at hudson.model.Run.execute(Run.java:1530)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:477)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:237)
Caused by: java.lang.NullPointerException
at com.cloudbees.plugins.deployer.impl.run.RunEngineImpl$EventImpl.<init>(RunEngineImpl.java:208)
at com.cloudbees.plugins.deployer.impl.run.RunEngineImpl.createEvent(RunEngineImpl.java:124)
... 12 more
Build step 'Deploy applications' marked build as failure
Finished: FAILURE
Does anyone have any idea about this?
Thanks in advance.
After a bit of digging, I figured out which account you have.
The issue is that you had left the CloudBees RUN#Cloud host service in the list of host services to deploy to, but you had not provided a complete configuration for it; e.g., see the "Application Id cannot be empty" red error text in this screenshot.
I have removed this host section and saved your hellospring job. Build 8 shows a successful deployment.