Errors connecting to AWS Keyspaces using a Lambda layer

I intermittently get the following error when connecting to an AWS Keyspaces keyspace from a Lambda function that uses a layer:
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a Node.js Lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [ fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], {prepare: true}, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
I occasionally get the error below (about 1 in 50 invocations when I call the function in parallel with 5 concurrent calls):
All host(s) tried for query failed. First host tried, 3.248.244.5:9142: DriverError: Socket was closed
    at Connection.clearAndInvokePending (/opt/node_modules/cassandra-driver/lib/connection.js:265:15)
    at Connection.close (/opt/node_modules/cassandra-driver/lib/connection.js:618:8)
    at TLSSocket.<anonymous> (/opt/node_modules/cassandra-driver/lib/connection.js:93:10)
    at TLSSocket.emit (node:events:525:35)
    at node:net:313:12
    at TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error', isSocketError: true, coordinator: '3.248.244.5:9142' }

This exception may be caused by throttling on the Keyspaces side, resulting in the driver error that you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue or at least reveal the true cause of the exception.

For some of the errors you see in the logs, you will need to investigate Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A System Error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity could result in server errors. A User Error indicates an event that can often be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes the System Error back as a Cassandra ServerError. In most cases this is a transient error, in which case you can retry your request until it succeeds. Using the Cassandra driver’s default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours: "All host(s) tried for query failed". This is a client-side exception that is thrown once all hosts in the load balancing policy’s query plan have attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit breaker patterns. You may eventually want to use a "failfast" retry policy for the driver and handle the exceptions in your application code.
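The idea of handling exceptions in application code can be sketched as a small retry wrapper with exponential backoff and jitter. This is only a minimal sketch, not the retry policy referenced above: `withRetry`, `isTransient`, `sleep`, and the option names are hypothetical, and the transient-error check assumes the error either carries `isSocketError` (as in the log in the question) or has a message containing "All host(s) tried".

```javascript
// Hypothetical helper names; none of this is part of cassandra-driver.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumption: transient failures look like the errors quoted in the question.
// Handles both Error objects and plain strings (getOrganisation rejects with err.message).
function isTransient(err) {
  const message = String((err && err.message) || err);
  return Boolean(err && err.isSocketError) || message.includes('All host(s) tried');
}

async function withRetry(operation, { maxAttempts = 3, baseDelayMs = 50, maxDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Stop on errors a retry cannot fix, or when attempts are exhausted.
      if (!isTransient(err) || attempt === maxAttempts) break;
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * cap); // random delay up to the current cap
    }
  }
  throw lastError;
}

// Possible usage with the class from the question:
// const rows = await withRetry(() => keyspace.getOrganisation(orgKey));
```

Keeping the retry in application code means the original error is rethrown once the attempts are exhausted, so the caller can still log it or apply its own circuit-breaking logic.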

Related

Can't reach local DynamoDB from Lambda in AWS SAM

I have a simple AWS SAM setup with a Go lambda function. It's using https://github.com/aws/aws-sdk-go-v2 for dealing with DynamoDB, the endpoint is set via environment variables from the template. (Here: https://github.com/abtercms/abtercms2/blob/main/template.yaml)
My problem is that I can't get my lambda to return anything from DynamoDB, as a matter of fact I don't think it can reach it.
{
  "level": "error",
  "status": 500,
  "error": "failed to fetch item (status 404), err: operation error DynamoDB: GetItem, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"http://127.0.0.1:8000/\": dial tcp 127.0.0.1:8000: connect: connection refused",
  "path": "GET /websites/abc",
  "time": 1658243299,
  "message": "..."
}
I'm able to reach the local DynamoDB just fine from the host machine.
Results were the same both for running DynamoDB in docker compose and via the .jar file, so the problem is not there as far as I can tell.
Also feel free to check out the whole project, I'm just running make sam-local and make curl-get-website-abc above. The code is here: https://github.com/abtercms/abtercms2.

Connection to AWS MemoryDB cluster sometimes fails

We have an application that is using AWS MemoryDB for Redis. We have setup a cluster with one shard and two nodes. One of the nodes (named 0001-001) is a primary read/write while the other one is a read replica (named 0001-002).
After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint connection string to connect. If we restart the application a few times it suddenly starts working. It seems to be random when it succeeds or not. The error we get is the following:
Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
If we connect directly to the primary read/write node we get no such errors.
If we connect directly to the read replica it always fails. It even gets the error above, complaining about the "0001-001" node.
We use .NET Core 6
We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
The application is hosted in AWS ECS
StackExchangeRedisCache is added to the service collection in a startup file:
services.AddStackExchangeRedisCache(o =>
{
    o.InstanceName = redisConfiguration.Instance;
    o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});
...where ToRedisConfigurationOptions returns a basic ConfigurationOptions object:
new ConfigurationOptions()
{
    EndPoints =
    {
        { "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
    },
    User = "username",
    Password = "password",
    Ssl = true,
    AbortOnConnectFail = false,
    ConnectTimeout = 60000
};
We tried multiple shards with multiple nodes, and it also sometimes fails to connect to the cluster. We even tried updating the StackExchange.Redis dependency to 2.5.43, but no luck.
We could "solve" it by directly connecting to the primary node, but if a failover occurs and 0001-002 becomes the primary node we would have to manually change our connection string, which is not acceptable in a production environment.
Any help or advice is appreciated, thanks!

AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically @aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
  CloudWatchLogsClient,
  DescribeLogStreamsCommand,
  GetLogEventsCommand
} = require("@aws-sdk/client-cloudwatch-logs");

// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
  region: process.env.AWS_REGION,
});

// APPLICATION CODE
const logGroupName = getLogGroupName();

// Get the log streams for the given log group.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
  descending: true,
  logGroupName,
  orderBy: 'LastEventTime',
  limit: 50,
}));

// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;

// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
  logGroupName,
  logStreamName,
}));
const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me yet, the AWS SDK doesn't seem to recognize that.
Here's some interesting details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of pretty small log streams, whereas env-b has three extremely large log streams for each of its long-lived EC2 instances. The logs are easily larger than 1 MB.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit that the AWS SDK handles by returning 0 log events for env-b? I tried setting a limit on the GetLogEventsCommand above, but the AWS SDK still returns 0 events for env-b.
Another interesting note: if I go to Amazon CloudWatch > Log Group and select env-a's Log Group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors on the console. This seems to confirm that the log stream's event count is simply too large for both the AWS console and AWS SDK to process, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?
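One thing worth ruling out: the GetLogEvents API can return an empty page even when more events are available through the returned token, and it signals the end of the stream by returning the same token that was passed in. The sketch below follows `nextForwardToken` until that happens. It is a minimal sketch under assumptions, not a confirmed fix for this case: `collectLogEvents`, `fetchPage`, and `maxPages` are hypothetical names, and `fetchPage` stands in for a call to `awsCloudWatchClient.send(new GetLogEventsCommand(...))`.

```javascript
// Hypothetical helper: follows nextForwardToken until the service returns the
// same token that was passed in, which marks the end of the log stream.
async function collectLogEvents(fetchPage, { maxPages = 100 } = {}) {
  const events = [];
  let nextToken; // undefined on the first call
  for (let page = 0; page < maxPages; page++) {
    const res = await fetchPage(nextToken);
    // A page may be empty even though later pages contain events.
    events.push(...(res.events || []));
    if (!res.nextForwardToken || res.nextForwardToken === nextToken) break;
    nextToken = res.nextForwardToken;
  }
  return events;
}

// Possible usage with the client from the question (startFromHead reads oldest first):
// const logEvents = await collectLogEvents((nextToken) =>
//   awsCloudWatchClient.send(new GetLogEventsCommand({
//     logGroupName, logStreamName, startFromHead: true, nextToken,
//   })));
```

The `maxPages` cap is there so that a very large stream cannot turn into an unbounded loop of requests, which matters if the account is already close to its rate limits.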

AWS DAX stability issues

I am attempting to introduce DAX to our architecture, but so far with no success. The connection to DAX happens through Lambdas, and the setup follows the examples in the AWS documentation. Lambda and DAX are in the same VPC, they can see each other most of the time, and DAX is returning responses. DAX also has port 8111 open.
However, after running our regression tests a few times, errors start popping up in CloudWatch. The most frequent ones are:
"Failed to pull from [daxurlhere] (10.0.1.177,10.0.1.25,10.0.2.11): TimeoutError: Connection timeout after 10000ms"
Error: NoRouteException: not able to resolve address: [{"host":"[daxurlhere]","port":8111}]
ERROR caught exception during cluster refresh: DaxClientError: NoRouteException: not able to resolve address:[{"host":"[daxurlhere]","port":8111}]
ERROR Failed to resolve [daxurl]: Error: queryA ECONNREFUSED [daxurl]
When those errors happen, they break a few of our regression tests. The funny thing is that they are not persistent, so the issue is very hard to track down.
Any suggestions would be more than welcome!
Your configuration seems fine. Check the following:
1. Make sure you are not performing strongly consistent reads
From the AWS doc:
DAX can't serve strongly consistent reads by itself because it's not tightly coupled to DynamoDB. For this reason, any subsequent reads from DAX would have to be eventually consistent reads
Also see the following code, where the request changes on every call, so it is never served from the cache, which can make the connection look unstable:
const AmazonDaxClient = require('amazon-dax-client');
const AWS = require('aws-sdk');

const parameters = {
  TableName: 'Travels',
  ConsistentRead: false,
  ExpressionAttributeNames: {
    '#createdAt': 'createdAt',
  },
  ExpressionAttributeValues: {
    ':createdAt': Date.now(), // <-- look at this: a new value on every call
  },
  KeyConditionExpression: '#createdAt >= :createdAt',
};
// DAX_CLUSTER_ENDPOINT and region are assumed to be defined elsewhere.
const endpoint = DAX_CLUSTER_ENDPOINT;
const daxService = new AmazonDaxClient({ endpoints: [endpoint], region });
const daxClient = new AWS.DynamoDB.DocumentClient({ service: daxService });
const response = await daxClient.query(parameters).promise();
Date.now() does not generate the same value every time. If a request does not exactly match a previous request, it won't be a cache hit. Also check the parameters on your large requests, such as limit, projection expression, and exclusive start key.
2. Check the cluster's monitoring in CloudWatch (query/scan cache hits) to confirm that the cluster is caching the data.
3. Other helpful links:
https://forums.aws.amazon.com/thread.jspa?messageID=896762
AWS DAX cluster has zero cache hits and cache miss
Be aware that although DAX distributes reads among the nodes in the cluster, all writes go through the master node. We have seen cascading failovers of nodes during write-intensive periods: the master node gets overwhelmed and reboots, another node becomes master and reboots, and so on.
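To illustrate the Date.now() point from step 1: the DAX query cache is keyed on the request parameters, so a value that changes every millisecond makes every request unique. One possible workaround, if the application can tolerate coarser time boundaries, is to round the timestamp down to a fixed window so that requests inside the same window carry identical parameters. This is a minimal sketch: `bucketTimestamp` and `buildQueryParams` are hypothetical names, the one-minute window is an arbitrary assumption, and the parameters simply mirror the example above.

```javascript
// Hypothetical helper: rounds a millisecond timestamp down to the start of its window.
function bucketTimestamp(timestampMs, windowMs = 60000) {
  return Math.floor(timestampMs / windowMs) * windowMs;
}

// Builds query parameters that stay identical for all calls within one window,
// so repeated requests can match a previously cached request.
function buildQueryParams(nowMs = Date.now()) {
  return {
    TableName: 'Travels',
    ConsistentRead: false,
    ExpressionAttributeNames: { '#createdAt': 'createdAt' },
    ExpressionAttributeValues: { ':createdAt': bucketTimestamp(nowMs) },
    KeyConditionExpression: '#createdAt >= :createdAt',
  };
}
```

The trade-off is precision: results are cut off at the window boundary rather than at the exact call time, so the window size should match how stale a boundary the application can accept.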

Firebase function connection with GCP Redis instance in the same VPC keeps on disconnecting

I am working on multiple Firebase cloud functions (all hosted in the same region) that connect to a GCP-hosted Redis instance in the same region through a VPC connector. I am using version 3.0.2 of the Node.js library for Redis. In the cloud functions' debug logs, I see frequent connection resets, triggered for each cloud function with no fixed pattern in their timing. Each time, the error captured in the error event handler is ECONNRESET. When creating the Redis client, I provided a retry_strategy to reconnect after 5 ms with a maximum of 10 attempts, along with retry_unfulfilled_commands set to true, expecting that any command left unfulfilled at the time of a connection reset would be retried automatically (see the code below).
const redisLib = require('redis');
const client = redisLib.createClient(REDIS_PORT, REDIS_HOST, {
  enable_offline_queue: true,
  retry_unfulfilled_commands: true,
  retry_strategy: function(options) {
    if (options.error && options.error.code === "ECONNREFUSED") {
      // End reconnecting on a specific error and flush all commands with
      // a individual error
      return new Error("The server refused the connection");
    }
    if (options.attempt > REDIS_CONNECTION_RETRY_ATTEMPTS) {
      // End reconnecting with built in error
      console.log('Connection retry count exceeded 10');
      return undefined;
    }
    // reconnect after 5 ms
    console.log('Retrying connection after 5 ms');
    return 5;
  },
});
client.on('connect', () => {
  console.log('Redis instance connected');
});
client.on('error', (err) => {
  console.error(`Error connecting to Redis instance - ${err}`);
});
exports.getUserDataForId = (userId) => {
  console.log('getUserDataForId invoked');
  return new Promise((resolve, reject) => {
    if(!client.connected) {
      console.log('Redis instance not yet connected');
    }
    client.get(userId, (err, reply) => {
      if(err) {
        console.error(JSON.stringify(err));
        reject(err);
      } else {
        resolve(reply);
      }
    });
  });
}
// more such exports for different operations
Following are the questions / issues I am facing.
1. Why is the connection getting reset intermittently?
2. I have seen logs showing that even while the cloud function is executing, the connection to the Redis server is lost, resulting in failure of the command.
3. With retry_unfulfilled_commands set to true, I hoped it would handle the scenario in point 2 above, but according to the debug logs, the cloud function times out in that scenario. This is what I observed in the logs in that case:
getUserDataForId invoked
Retrying connection after 5 ms
Redis instance connected
Function execution took 60002 ms, finished with status: 'timeout' --> coming from wrapper cloud function
4. Should I, instead of having a Redis connection instance at global level, create a connection during each such Redis operation? It might have some performance issues as well as issues around the number of concurrent Redis connections (since I have multiple cloud functions and all of them will be creating Redis connections for each simultaneous invocation), right?
So, how do I best handle this? I am facing all these issues during development itself, so I am not sure whether it's a code-related issue or an infrastructure configuration issue.
This behavior could be caused by background activities.
"Background activity is anything that happens after your function has terminated"
When the background activity interferes with subsequent invocations in Cloud Functions, unexpected behavior and errors that are hard to diagnose may occur. Accessing the network after a function terminates usually leads to "ECONNRESET" errors.
To troubleshoot this, make sure that there is no background activity by searching the logs for entries after the line saying that the invocation finished. Background activity can sometimes be buried deeper in the code, especially when asynchronous operations such as callbacks or timers are present. Review your code to make sure all asynchronous operations finish before you terminate the function.
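The advice to let every asynchronous operation finish before the function terminates can be sketched as follows. This is a minimal sketch in plain Node.js rather than Firebase-specific code: `handleRequestLeaky`, `handleRequest`, `readUser`, and `writeAuditLog` are hypothetical names standing in for a function handler and the Redis or network calls it makes.

```javascript
// Anti-pattern: the write is started but not awaited, so it may still be
// running as background activity after the handler has returned.
async function handleRequestLeaky(userId, { readUser, writeAuditLog }) {
  const user = await readUser(userId);
  writeAuditLog(userId); // fire-and-forget
  return user;
}

// Safer: every promise started by the handler is awaited before it returns.
async function handleRequest(userId, { readUser, writeAuditLog }) {
  const user = await readUser(userId);
  await writeAuditLog(userId);
  return user;
}
```

The same applies to callback-style APIs: wrap them in a promise (as getUserDataForId in the question already does) and make sure the handler awaits or returns that promise.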
Source