Some scheduled tasks failing with error: Connection Failure: Status code unavailable - coldfusion

I've googled the issue and pretty much every answer has been related to a certificate issue. Problem is, we have other tasks on the same server that trigger just fine. The file runs as expected directly from a browser, so it's not an issue in the CF code. And with other scheduled tasks running fine I don't see it being a problem with any certificate. Any suggestions on what else could cause this?
From the log:
Information [DefaultQuartzScheduler_Worker-9] - MyTask - myreport triggered.
Information [DefaultQuartzScheduler_Worker-9] - Starting HTTP request {URL='http://myserver/reports/myreport.cfm', method='get'}
Error [DefaultQuartzScheduler_Worker-9] - Connection Failure: Status code unavailable

Related

Troubleshooting error 503 on Google Cloud Run

I am running a container on Google Cloud Run. For each request a new instance is started. The requests need around 15 minutes to be processed. I modified the default timeout and everything works fine. But sometimes, for around 10% of the requests, I get this error:
The request failed because either the HTTP response was malformed or
connection to the instance had an error. Additional troubleshooting
documentation can be found at:
https://cloud.google.com/run/docs/troubleshooting#timeout-503
When I re-run the exact same request, I get no errors. I tried putting try/catch blocks everywhere, but I am not able to figure out what is happening. I checked the CPU and memory usage, and everything looks fine: the maximum reached is 50%. Any advice on how I can get more information about the problem?

Google Cloud Composer Airflow sqlalchemy OperationalError causing DAG to hang forever

I have a bunch of tasks within a Cloud Composer Airflow DAG, one of which is a KubernetesPodOperator. This task seems to get stuck in the scheduled state forever and so the DAG runs continuously for 15 hours without finishing (it normally takes about an hour). I have to manually mark it failed for it to end.
I've set the DAG timeout to 2 hours but it does not make any difference.
The Cloud Composer logs show the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server:
Connection refused
Is the server running on host "airflow-sqlproxy-service.default.svc.cluster.local" (10.7.124.107)
and accepting TCP/IP connections on port 3306?
The error log also gives me a link to this documentation about that error type: https://docs.sqlalchemy.org/en/13/errors.html#operationalerror
When the DAG is next triggered on schedule, it works fine without any fix required. This issue happens intermittently, we've not been able to reproduce it.
Does anyone know the cause of this error and how to fix it?
The issue is related to how SQLAlchemy sessions are used: a session is created per thread, and a callable session is kept so that it can be used later in the Airflow code. If there are delays between the queries on such a session, MySQL might close the connection in the meantime. The connection timeout is set to approximately 10 minutes.
Solutions:
Use the airflow.utils.db.provide_session decorator. This decorator provides a valid session to the Airflow database in the session parameter and closes the session at the end of the function.
Do not use a single long-running function. Instead, move all database queries to separate functions, so that there are multiple functions with the airflow.utils.db.provide_session decorator. In this case, sessions are automatically closed after retrieving query results.
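The pattern described in the two solutions above can be sketched without Airflow installed. This is only an illustration: `FakeSession` and `provide_fake_session` are hypothetical stand-ins written for this example, not Airflow's implementation. In a real DAG you would presumably import the decorator the answer names instead.

```python
import functools


class FakeSession:
    """Hypothetical stand-in for a database session; it only tracks open/closed state."""

    def __init__(self):
        self.closed = False

    def query(self, what):
        if self.closed:
            raise RuntimeError("session already closed")
        return f"rows for {what}"

    def close(self):
        self.closed = True


def provide_fake_session(func):
    """Stand-in decorator (assumption, not Airflow code): open a fresh
    session for each call and always close it when the call returns."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Reuse a session the caller passed in explicitly.
        if kwargs.get("session") is not None:
            return func(*args, **kwargs)
        session = FakeSession()
        kwargs["session"] = session
        try:
            return func(*args, **kwargs)
        finally:
            session.close()

    return wrapper


opened = []  # records every session handed out, for inspection only


@provide_fake_session
def load_task_ids(session=None):
    opened.append(session)
    return session.query("task ids")


@provide_fake_session
def load_run_states(session=None):
    opened.append(session)
    return session.query("run states")


def long_running_job():
    # Each query runs in its own short function with its own session,
    # so no session is held open across the slow part of the job.
    ids = load_task_ids()
    # ... slow non-database work would happen here ...
    states = load_run_states()
    return ids, states
```

Because each decorated function gets its own short-lived session, a long pause between the two calls cannot leave a stale connection behind, which is the point of the second solution.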

Redislabs UI logging error when number of nodes more than one

I am new to Redis Enterprise and can't fix this problem:
I have a Redis Enterprise cluster (v.6.0) in AWS with two nodes. With only one node I can log in to the UI, but after adding a second node the UI always throws me back to the login page after I enter my credentials. Meanwhile, the cluster works fine (according to rladmin).
In what direction should I investigate the issue?
P.S.: Can this error from logs cause an issue?
ERROR redis_mgr MainThread: Connect failed: connect: connection failed: Error 2 connecting to unix socket: /var/opt/redislabs/run/ccs.sock. No such file or directory.: retrying
Possibly this solution will help someone:
The reason was that the ALB in front of the UI didn't use sticky sessions.
The solution was to enable sticky sessions, and now it works.

Webhook call failed: URL_REJECTED error in DialogFlow v2 Fulfillments

Error description
Upon calling DialogFlow v2 detectIntent API, we randomly get an internal error with status code 13:
Webhook call failed. Fetch failure with no HTTP status code. Status: State: URL_REJECTED Reason: 67
This error seems to happen randomly. The same request can succeed or fail.
Interestingly, the service has been deteriorating since Friday 23rd August 2019, and today it fails on almost every call.
Our investigation
We didn't find anything at all about URL_REJECTED with DialogFlow or Google on the internet.
But we found the meaning of the status code 13 on this page:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
We also checked that we aren't banning Google IPs, and that our load balancing is not messed up (we thought of that since it would be consistent with random failures).
The webhook is up and running, and we can call it ourselves. The problem seems to happen in Google's infra, as the error code 13 seems to show.
(I am answering immediately because we fixed it before posting the question. I posted it nevertheless because it may be useful for others.)
The problem was that the webhook was called using http.
Setting https solved the problem.
It seems that Google activated a policy on their servers of rejecting insecure webhook calls.
It may have been deployed gradually on their cluster, which would explain the gradual degradation.
We know that we should have migrated to https a long time ago, but still we didn't find any mention of the application of this policy on the net.
Thank you for posting this. I came across the same issue. Changing my webhook to HTTPS seems to fix the problem.
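Since the fix was simply switching the webhook from http to https, a small guard that flags non-HTTPS webhook URLs before they are configured could catch this earlier. The sketch below uses only the Python standard library; the function name `is_https_url` and the example URLs are made up for illustration.

```python
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Return True only for absolute URLs that use the https scheme."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


# Hypothetical webhook URLs, for illustration only.
for candidate in ("https://example.com/webhook", "http://example.com/webhook"):
    status = "ok" if is_https_url(candidate) else "rejected: not https"
    print(candidate, "->", status)
```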

Route53 Domain Transfer - Registry error - 2400 : Command failed (421 SESSION TIMEOUT)

I am trying to transfer a domain using Route53 and after a few minutes I receive an email with the following error.
Registry error - 2400 : Command failed (421 SESSION TIMEOUT)
Anyone have any ideas what this means or how to get around it?
I have never seen your error. There is a document on transferring domains that lists error messages. The reason I am responding is that I have seen domain transfers to Route 53 fail without ever learning why they failed. Maybe this will help you.
NSI Registry Registrar Protocol (RRP)
421 Command failed due to server error. Client should try again
A transient server error has caused RRP command failure. A subsequent retry may produce successful results.
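The "try again" advice for a transient error is commonly implemented as a bounded retry with increasing delays. The following is a generic sketch of that idea, not anything specific to Route 53 or RRP: `TransientError` and `retry_transient` are hypothetical names, and the operation being retried is whatever call you would otherwise repeat by hand.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying (e.g. a 421)."""


def retry_transient(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay after
    each transient failure; re-raise if the last attempt also fails."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the delays can be observed or skipped in tests; by default it waits for real.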