Firebase function connection with GCP Redis instance in the same VPC keeps on disconnecting - google-cloud-platform

I am working on multiple Firebase Cloud Functions (all hosted in the same region) that connect to a GCP-hosted Redis instance in the same region, using a VPC connector. I am using version 3.0.2 of the Node.js Redis library. In the Cloud Functions' debug logs, I am seeing frequent connection resets for each Cloud Function, with no fixed pattern in the timing, and each time the error captured in the error event handler is ECONNRESET. While creating the Redis client, I have provided a retry_strategy to reconnect after 5 ms, with a maximum of 10 such attempts, along with retry_unfulfilled_commands set to true, expecting that any unfulfilled command at the time of a connection reset will be automatically retried (see the code below).
const redisLib = require('redis');
const client = redisLib.createClient(REDIS_PORT, REDIS_HOST, {
  enable_offline_queue: true,
  retry_unfulfilled_commands: true,
  retry_strategy: function (options) {
    if (options.error && options.error.code === "ECONNREFUSED") {
      // End reconnecting on a specific error and flush all commands with
      // an individual error
      return new Error("The server refused the connection");
    }
    if (options.attempt > REDIS_CONNECTION_RETRY_ATTEMPTS) {
      // End reconnecting with built-in error
      console.log('Connection retry count exceeded 10');
      return undefined;
    }
    // Reconnect after 5 ms
    console.log('Retrying connection after 5 ms');
    return 5;
  },
});
client.on('connect', () => {
  console.log('Redis instance connected');
});

client.on('error', (err) => {
  console.error(`Error connecting to Redis instance - ${err}`);
});
exports.getUserDataForId = (userId) => {
  console.log('getUserDataForId invoked');
  return new Promise((resolve, reject) => {
    if (!client.connected) {
      console.log('Redis instance not yet connected');
    }
    client.get(userId, (err, reply) => {
      if (err) {
        console.error(JSON.stringify(err));
        reject(err);
      } else {
        resolve(reply);
      }
    });
  });
};
// more such exports for different operations
Following are the questions/issues I am facing.
Why is the connection getting reset intermittently?
I have seen in the logs that even while a cloud function is executing, the connection to the Redis server is lost, resulting in failure of the command.
With retry_unfulfilled_commands set to true, I hoped it would handle the scenario mentioned in point 2 above, but as per the debug logs, the cloud function times out in that scenario. This is what I observed in the logs in that case:
getUserDataForId invoked
Retrying connection after 5 ms
Redis instance connected
Function execution took 60002 ms, finished with status: 'timeout' --> coming from wrapper cloud function
Should I, instead of having a Redis connection instance at the global level, create a connection during each such Redis operation? That might have performance issues, as well as issues around the number of concurrent Redis connections (since I have multiple cloud functions, and all of them would be creating Redis connections for each simultaneous invocation), right?
So, how do I best handle this? I am facing all these issues during development itself, so I am not sure whether it is a code-related issue or an infrastructure-configuration-related issue.

This behavior could be caused by background activity.
"Background activity is anything that happens after your function has
terminated"
When the background activity interferes with subsequent invocations in Cloud Functions, unexpected behavior and errors that are hard to diagnose may occur. Accessing the network after a function terminates usually leads to "ECONNRESET" errors.
To troubleshoot this, make sure that there is no background activity by searching the logs for entries after the line saying that the invocation finished. Background activity can sometimes be buried deeper in the code, especially when asynchronous operations such as callbacks or timers are present. Review your code to make sure all asynchronous operations finish before you terminate the function.
Source
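For example, here is a minimal sketch (with hypothetical wrapper and module names) of awaiting the Redis work inside the invocation so that no network activity outlives the function:
const functions = require('firebase-functions');
// Hypothetical module that exports the getUserDataForId helper from the question
const { getUserDataForId } = require('./redisClient');

exports.getUser = functions.https.onRequest(async (req, res) => {
  try {
    // Await the Redis read so it completes before the function terminates
    const data = await getUserDataForId(req.query.userId);
    res.status(200).json({ data });
  } catch (err) {
    console.error(err);
    res.status(500).send('Redis lookup failed');
  }
});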

Related

Errors connecting to AWS Keyspaces using a lambda layer

Intermittently getting the following error when connecting to an AWS keyspace using a lambda layer
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a nodejs lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], { prepare: true }, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation:
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (Cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already manage a 6-node cluster) and don't have any issues with that.
Could this be a timeout, or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
I am occasionally (about 1 in 50 calls, when I invoke the function in parallel with 5 concurrent calls) getting the below error:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling on the Keyspaces side, resulting in the DriverError that you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue or at least reveal the true cause of the exception.
For some of the errors you see in the logs, you will need to investigate Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A system error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity can result in server errors. A user error indicates an event that can often be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes a system error back as a Cassandra ServerError. In most cases this is a transient error, in which case you can retry your request until it succeeds. Using the Cassandra driver's default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours, "All host(s) tried for query failed". This is a client-side exception thrown once all hosts in the load balancing policy's query plan have attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are fairly crude and cannot implement more sophisticated approaches such as circuit breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.
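As one possible shape for that, here is a minimal sketch of an application-level retry with exponential backoff (the attempt counts and delays are illustrative assumptions, not values from the linked policy):
async function executeWithRetry(client, query, params, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // cassandra-driver's execute() returns a promise when no callback is given
      return await client.execute(query, params, { prepare: true });
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up and surface the original error
      const backoffMs = 100 * 2 ** (attempt - 1); // 100 ms, 200 ms, 400 ms, ...
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}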

Connection to AWS MemoryDB cluster sometimes fails

We have an application that is using AWS MemoryDB for Redis. We have setup a cluster with one shard and two nodes. One of the nodes (named 0001-001) is a primary read/write while the other one is a read replica (named 0001-002).
After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint connection string. If we restart the application a few times, it suddenly starts working; whether it succeeds or not seems random. The error we get is the following:
Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
If we connect directly to the primary read/write node we get no such errors.
If we connect directly to the read replica, it always fails. It even gets the error above, complaining about the "0001-001" node.
We use .NET Core 6
We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
The application is hosted in AWS ECS
StackExchangeRedisCache is added to the service collection in a startup file:

services.AddStackExchangeRedisCache(o =>
{
    o.InstanceName = redisConfiguration.Instance;
    o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});
...where ToRedisConfigurationOptions returns a basic ConfigurationOptions object:

new ConfigurationOptions()
{
    EndPoints =
    {
        { "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
    },
    User = "username",
    Password = "password",
    Ssl = true,
    AbortOnConnectFail = false,
    ConnectTimeout = 60000
};
We tried multiple shards with multiple nodes, and connecting to the cluster still sometimes fails. We even tried updating the StackExchange.Redis dependency to 2.5.43, but no luck.
We could "solve" it by connecting directly to the primary node, but if a failover occurs and 0001-002 becomes the primary node, we would have to manually change our connection string, which is not acceptable in a production environment.
Any help or advice is appreciated, thanks!

Do Google Cloud Tasks Delete themselves once they have executed?

In my application, I have implemented Google Tasks so that my users can receive notifications when their ToDo item is due.
My main issue is that when my Cloud Task fires, I still see it in my Cloud Tasks console. So, do they delete themselves once they are fired? For my application, I want the Cloud Tasks to delete themselves once they are done.
I noticed this line in the documentation: "you can also fine-tune the configuration for the task, like scheduling a time in the future when it should be executed or limiting the number of times you want the task to be retried if it fails". The thing is, my task is not failing, and yet I see the number of retries at 4.
Firebase Cloud Functions:
exports.firestoreTtlCallback = functions.https.onRequest(async (req, res) => {
  try {
    const payload = req.body;
    const entry = (await admin.firestore().doc(payload.docPath).get()).data();
    const tokens = (await admin.firestore().doc(`/users/${payload.uid}`).get()).get('tokens');
    await admin.messaging().sendMulticast({
      tokens,
      notification: {
        title: "App",
        body: entry['text']
      }
    }).then((response) => {
      log('Successfully sent message:');
      log(response);
    }).catch((error) => {
      log('Error in sending Message');
      log(error);
    });
    const taskClient = new CloudTasksClient();
    // Await the document read and call .data() before destructuring;
    // the original line was missing both
    const { expirationTask } = (await admin.firestore().doc(payload.docPath).get()).data();
    await taskClient.deleteTask({ name: expirationTask });
    await admin.firestore().doc(payload.docPath).update({ expirationTask: admin.firestore.FieldValue.delete() });
    // Send the response so Cloud Tasks receives the 2xx acknowledgement
    res.status(200).send();
  } catch (err) {
    log(err);
    res.status(500).send(err);
  }
});
A task can be deleted if it is scheduled or dispatched. A task cannot be deleted if it has completed successfully or permanently failed, according to this documentation.
The task attempt has succeeded if the app's request handler returns an HTTP response code in the range [200 - 299].
The task attempt has failed if the app's handler returns a non-2xx response code or Cloud Tasks does not receive a response before the deadline, which is:
For HTTP tasks, 10 minutes. The deadline must be in the interval [15 seconds, 30 minutes].
For App Engine tasks, 0 indicates that the request has the default deadline. The default deadline depends on the scaling type of the service: 10 minutes for standard apps with automatic scaling, 24 hours for standard apps with manual and basic scaling, and 60 minutes for flex apps.
Failed tasks will be retried according to the retry configuration. Check your queue.yaml file for the retry configuration that is set; if you want to specify your own values, follow this.
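As an illustrative sketch (the queue name and values here are assumptions, not taken from the original post), a retry configuration in queue.yaml looks like this:
queue:
- name: todo-notifications
  rate: 10/s
  retry_parameters:
    task_retry_limit: 3        # at most 3 retry attempts
    min_backoff_seconds: 5
    max_backoff_seconds: 300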
The task will be pushed to the worker as an HTTP request. If the worker or the redirected worker acknowledges the task by returning a successful HTTP response code ([200 - 299]), the task will be removed from the queue, as per this documentation. If any other HTTP response code is returned, or no response is received, the task will be retried according to the following:
User-specified throttling: retry configuration, rate limits, and the queue's state.
System throttling: to prevent the worker from overloading, Cloud Tasks may temporarily reduce the queue's effective rate. User-specified settings will not be changed.
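Tying that back to the code above, here is a minimal sketch (handleNotification is a hypothetical helper standing in for the Firestore reads and the FCM send) of acknowledging the task only after all of the work has completed:
exports.firestoreTtlCallback = functions.https.onRequest(async (req, res) => {
  try {
    await handleNotification(req.body); // hypothetical helper for the Firestore + FCM work
    res.status(200).send('OK');         // 2xx: the attempt succeeded, so the task is removed from the queue
  } catch (err) {
    console.error(err);
    res.status(500).send(err.message);  // non-2xx: Cloud Tasks retries per the queue's retry configuration
  }
});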

Google GCP cloud run redis client loses connection to the instance

I am running my Node.js application on Google Cloud Run. My application connects to Google Memorystore for Redis. Every few minutes I am getting the following error:
Error: read Connection Reset
Followed by:
AbortError: Redis connection lost and command aborted. It might have been processed.
Please help; what am I missing?
My Node.js code:
const redis = require('redis')
const redisClient = redis.createClient({ host: 'xxx', port: 6379 })
redisClient.on('error', function (err) {
  console.log(err)
})
const data = await redisClient.getExAsync('key')
Use "setInterval" function in order to invoke Redis operation every minute.
async function RedisKA() {
client.get("key2", (err, reply) => {
console.log(`${kaCount} redis keep `);
});
}
let updateIntervalId = setInterval(RedisKA, 60000);
If you want to avoid the request timeout on the Cloud Run side, which is 5 minutes by default then set your value based on your requirement.
The issue may be caused by a socket timeout. This is expected to happen when there is no activity for a period of time.
It can be avoided by periodically executing any command on the connection, for example one command per minute, which keeps the socket alive and prevents the connection from being aborted.

API Gateway occasionally spikes 5XX errors in Production

Our API Gateway and Lambdas are regularly used and work just fine most of the time; however, we see spikes in 5XX errors now and then, which cause a spike in customer complaints and other issues. When I look at the logs during this time, I see a flood of the following error:
Execution failed due to configuration error: Malformed Lambda proxy response
There are no other details beyond this. After 10 or 15 minutes it goes away, along with the customer complaints. I've read that this may happen if you exceed your concurrency limit, but looking at the dashboard, it doesn't look like we ever go above 150 concurrent executions.
The calls themselves work consistently as well, aside from these random spikes in 5XXs.
What else might be causing this inconsistency?
I am looking through the logs to try to figure this out. I have made the logs as verbose as possible, and there is nothing there. We'll have a normal call with a success response; then, a few minutes later, this error comes up with no other logging, just the error alone. Then, a few minutes after that, logging starts for the next successful call:
10:25:42 Successfully completed execution
10:25:42 Method completed with status: 200
10:42:01 Execution failed due to configuration error: Malformed Lambda proxy response
12:21:21 Successfully completed execution
12:21:21 Method completed with status: 200
Logging can't go further because the Lambdas are never even executed, so we have no details on the payload sent to them, nor any internal logging for the call. It just immediately fails at the API Gateway level.
Edit: We still get these spikes, but we are working on splitting the Lambdas out more. We have an ExpressJS app that handles the lion's share of all requests, so we are breaking more off, especially high-traffic requests, into their own Lambdas to see if this helps, in case a container gets too backlogged or times out because it is handling long-running requests (upwards of 20 s) while also being hammered by requests that finish in under 500 ms.
The other theory is that an error gets triggered somewhere that kills the process or something else, and that container stays bad until it gets destroyed and respawned, since these errors spike and then go away after a few minutes. Breaking the Lambdas up more should reduce the odds of an error in one cascading and impacting all other requests.
We also increased the resources of the Lambdas to see if that would help them handle so many requests.
This usually happens when your call times out or there is a delay in your Lambda execution.
If you are accessing an external resource such as RDS, or making an external network call, wrap it with a promise and apply a timeout. This way you can identify which resource is the bottleneck or is taking a long time to execute:
exports.handler = function (event, context, callback) {
  var response = {}; // the response object to return
  var err = "An error occurred";
  // Fail the invocation explicitly if the work below has not completed
  // within 3000 ms, so the slow resource shows up in the logs
  var timer = setTimeout(function () {
    callback(err, response);
  }, 3000);
  // Actual code here: call the external resource, then clearTimeout(timer)
  // and invoke callback with the real result
};
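A promise-based variant of the same idea, as a sketch (queryExternalResource is a hypothetical stand-in for the RDS or network call): race the call against a timeout so slow external calls surface as explicit, logged errors.
exports.handler = async (event) => {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('External call timed out')), 3000);
  });
  try {
    const result = await Promise.race([queryExternalResource(event), timeout]);
    // Always return a well-formed proxy response so API Gateway can map it
    return { statusCode: 200, body: JSON.stringify(result) };
  } finally {
    clearTimeout(timer); // avoid a dangling timer keeping the event loop busy
  }
};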
Also, check for any missing callbacks; those will also cause this issue.
Hope this helps.