AWS DAX stability issues - amazon-web-services

I am attempting to introduce DAX into our architecture, but so far with no success. The connection to DAX happens through Lambdas, and the setup follows the examples in the AWS documentation. The Lambda and DAX are in the same VPC, they can see each other most of the time, and DAX returns responses. DAX also has port 8111 open.
However, after running our regression tests a few times, errors start popping up in CloudWatch. The most frequent ones are:
"Failed to pull from [daxurlhere] (10.0.1.177,10.0.1.25,10.0.2.11):
TimeoutError: Connection timeout after 10000ms"
Error: NoRouteException: not able to resolve address:
[{"host":"[daxurlhere]","port":8111}]
ERROR caught exception during cluster refresh: DaxClientError:
NoRouteException: not able to resolve address:[{"host":"[daxurlhere]","port":8111}]
ERROR Failed to resolve [daxurl]: Error: queryA ECONNREFUSED [daxurl]
When these errors happen, they break a few of our regression tests. The odd thing is that they are not persistent, which makes the issue very hard to track down.
Any suggestions would be more than welcome!

Your configuration seems fine. Check the steps below:
1. Make sure you are not using strongly consistent reads
From the AWS doc:
DAX can't serve strongly consistent reads by itself because it's not tightly coupled to DynamoDB. For this reason, any subsequent reads from DAX would have to be eventually consistent reads
For example, the following code produces a request that can never be served from the cache, which makes the connection look unstable:
const AWS = require('aws-sdk');
const AmazonDaxClient = require('amazon-dax-client');

const parameters = {
  TableName: 'Travels',
  ConsistentRead: false,
  ExpressionAttributeNames: {
    '#createdAt': 'createdAt',
  },
  ExpressionAttributeValues: {
    ':createdAt': Date.now(), // <-- look at this: a different value on every call
  },
  KeyConditionExpression: '#createdAt >= :createdAt',
};

// DAX_CLUSTER_ENDPOINT and region are assumed to be defined elsewhere
const endpoint = DAX_CLUSTER_ENDPOINT;
const daxService = new AmazonDaxClient({ endpoints: [endpoint], region });
const daxClient = new AWS.DynamoDB.DocumentClient({ service: daxService });
response = await daxClient.query(parameters).promise();
Date.now() won't generate the same value every time, and if a request does not exactly match a previous request, it won't be a cache hit. Also check the parameters on your large requests, such as Limit, ProjectionExpression, and ExclusiveStartKey.
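For illustration, one way to make such a query cache-friendly is to round the timestamp down to a fixed window so that every call within the window builds an identical request. This is only a sketch; the one-minute window and the bucketedNow helper are inventions for the example:

// Bucket the timestamp so identical requests repeat within the window.
const CACHE_WINDOW_MS = 60 * 1000; // hypothetical 1-minute window

function bucketedNow() {
  // Round down to the start of the current window so every call in the same
  // window produces exactly the same ExpressionAttributeValues.
  return Math.floor(Date.now() / CACHE_WINDOW_MS) * CACHE_WINDOW_MS;
}

const parameters = {
  TableName: 'Travels',
  ConsistentRead: false,
  ExpressionAttributeNames: { '#createdAt': 'createdAt' },
  ExpressionAttributeValues: { ':createdAt': bucketedNow() }, // now cacheable
  KeyConditionExpression: '#createdAt >= :createdAt',
};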
2. Check the cluster's monitoring in CloudWatch, specifically the query/scan cache hit metrics, to confirm the cluster is actually caching data (see the sketch after this list).
3. Other helpful links:
https://forums.aws.amazon.com/thread.jspa?messageID=896762
AWS DAX cluster has zero cache hits and cache miss
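If you prefer to pull those cache metrics programmatically rather than through the console, a sketch like the following may help. The AWS/DAX namespace, the QueryCacheHits/QueryCacheMisses metric names, and the ClusterId dimension are my assumptions; verify them against the metrics your cluster actually publishes:

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' }); // assumed region

// Sketch: sum the DAX query cache hits and misses over the last hour.
async function queryCacheStats(clusterId) {
  const base = {
    Namespace: 'AWS/DAX',
    Dimensions: [{ Name: 'ClusterId', Value: clusterId }],
    StartTime: new Date(Date.now() - 60 * 60 * 1000),
    EndTime: new Date(),
    Period: 300,
    Statistics: ['Sum'],
  };
  const hits = await cloudwatch
    .getMetricStatistics({ ...base, MetricName: 'QueryCacheHits' })
    .promise();
  const misses = await cloudwatch
    .getMetricStatistics({ ...base, MetricName: 'QueryCacheMisses' })
    .promise();
  return { hits: hits.Datapoints, misses: misses.Datapoints };
}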

Be aware that although DAX distributes reads among the nodes in the cluster, all writes go through the primary (master) node. We have seen cascading failovers of nodes during write-intensive periods: the primary node gets overwhelmed, reboots, another node becomes primary, gets overwhelmed, reboots, and so on.

Related

Errors connecting to AWS Keyspaces using a Lambda layer

I am intermittently getting the following error when connecting to an AWS keyspace using a Lambda layer:
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a Node.js Lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };
    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], { prepare: true }, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
I am occasionally (about 1 in 50 calls when I invoke the function with 5 concurrent calls) getting the error below:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling on the Keyspaces side, resulting in the driver error that you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue or at least reveal its true cause.
For some of the errors you see in the logs, you will need to investigate Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A system error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity can result in server errors. A user error indicates an event that can usually be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes a system error back as a Cassandra ServerError. In most cases this is a transient error, and you can retry your request until it succeeds. With the Cassandra driver's default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours, "All host(s) tried for query failed". This is a client-side exception thrown once all hosts in the load balancing policy's query plan have attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit-breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.
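For illustration, a "fail fast" policy with the Node.js cassandra-driver might look like the sketch below. It is based on the driver's RetryPolicy hooks and simply rethrows every failure to the application; the class name is made up, and the client options are abbreviated (reuse the auth/SSL options from the question):

const cassandra = require('cassandra-driver');

// Sketch: never retry inside the driver; surface the original error so the
// application can apply its own backoff or circuit-breaker logic.
class FailFastRetryPolicy extends cassandra.policies.retry.RetryPolicy {
  onReadTimeout(info, consistency, received, blockFor, isDataPresent) {
    return this.rethrowResult();
  }
  onWriteTimeout(info, consistency, received, blockFor, writeType) {
    return this.rethrowResult();
  }
  onUnavailable(info, consistency, required, alive) {
    return this.rethrowResult();
  }
  onRequestError(info, consistency, err) {
    return this.rethrowResult();
  }
}

const client = new cassandra.Client({
  contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
  localDataCenter: 'eu-west-1',
  keyspace: 'tpms',
  policies: { retry: new FailFastRetryPolicy() },
  // authProvider, sslOptions, protocolOptions as in the question
});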

Firebase function connection with GCP Redis instance in the same VPC keeps on disconnecting

I am working on multiple Firebase Cloud Functions (all hosted in the same region) that connect to a GCP-hosted Redis instance in the same region, using a VPC connector. I am using version 3.0.2 of the Node.js library for Redis. In the cloud functions' debug logs, I am seeing frequent connection-reset logs, triggered for each cloud function with no fixed pattern in the timing of the resets. Each time, the error captured in the error event handler is ECONNRESET. While creating the Redis client, I provided a retry_strategy to reconnect after 5 ms with a maximum of 10 attempts, along with retry_unfulfilled_commands set to true, expecting that any command unfulfilled at the time of a connection reset would be retried automatically (see the code below).
const redisLib = require('redis');

const client = redisLib.createClient(REDIS_PORT, REDIS_HOST, {
  enable_offline_queue: true,
  retry_unfulfilled_commands: true,
  retry_strategy: function (options) {
    if (options.error && options.error.code === "ECONNREFUSED") {
      // End reconnecting on a specific error and flush all commands with
      // an individual error
      return new Error("The server refused the connection");
    }
    if (options.attempt > REDIS_CONNECTION_RETRY_ATTEMPTS) {
      // End reconnecting with a built-in error
      console.log('Connection retry count exceeded 10');
      return undefined;
    }
    // Reconnect after 5 ms
    console.log('Retrying connection after 5 ms');
    return 5;
  },
});

client.on('connect', () => {
  console.log('Redis instance connected');
});

client.on('error', (err) => {
  console.error(`Error connecting to Redis instance - ${err}`);
});

exports.getUserDataForId = (userId) => {
  console.log('getUserDataForId invoked');
  return new Promise((resolve, reject) => {
    if (!client.connected) {
      console.log('Redis instance not yet connected');
    }
    client.get(userId, (err, reply) => {
      if (err) {
        console.error(JSON.stringify(err));
        reject(err);
      } else {
        resolve(reply);
      }
    });
  });
};

// more such exports for different operations
Following are the questions / issues I am facing:
1. Why is the connection getting reset intermittently?
2. I have seen in the logs that even while the cloud function is executing, the connection to the Redis server can be lost, resulting in failure of the command.
3. With retry_unfulfilled_commands set to true, I hoped it would handle the scenario mentioned in point 2 above, but per the debug logs, the cloud function times out in that scenario. This is what I observed in the logs in that case:
getUserDataForId invoked
Retrying connection after 5 ms
Redis instance connected
Function execution took 60002 ms, finished with status: 'timeout' --> coming from wrapper cloud function
4. Should I, instead of having a Redis connection instance at the global level, create a connection during each Redis operation? That might have performance issues, as well as issues around the number of concurrent Redis connections (since I have multiple cloud functions, and each simultaneous invocation would create its own connection), right?
So, how do I best handle this? I am facing all these issues during development itself, so I am not really sure whether it is a code issue or an infrastructure configuration issue.
This behavior could be caused by background activities.
"Background activity is anything that happens after your function has
terminated"
When the background activity interferes with subsequent invocations in Cloud Functions, unexpected behavior and errors that are hard to diagnose may occur. Accessing the network after a function terminates usually leads to "ECONNRESET" errors.
To troubleshoot this, make sure that there is no background activity by searching the logs for entries after the line saying that the invocation finished. Background activity can sometimes be buried deeper in the code, especially when asynchronous operations such as callbacks or timers are present. Review your code to make sure all asynchronous operations finish before you terminate the function.
Source
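As a minimal sketch of that advice, assuming an HTTPS-triggered Firebase function and the getUserDataForId helper from the question (the function name and response shape are made up for illustration), make sure the handler awaits every pending Redis call before it responds:

const functions = require('firebase-functions');

// Sketch: all async work is awaited before the response is sent, so nothing
// keeps running after the function instance is frozen. getUserDataForId is
// the helper defined in the question above.
exports.getUser = functions.https.onRequest(async (req, res) => {
  try {
    const data = await getUserDataForId(req.query.userId);
    res.status(200).send(data);
  } catch (err) {
    console.error(`Redis lookup failed - ${err}`);
    res.status(500).send('lookup failed');
  }
  // No fire-and-forget promises, timers, or client.get calls after this point.
});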

A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool'

I have a series of Azure SQL Data Warehouse databases (for our development/evaluation purposes). Following a recent unplanned extended outage (caused by an issue with the Tenant Ring associated with some of these databases), I decided to resume the canary queries I had been running before, which I had quiesced for a couple of months due to frequent exceptions.
The canary queries are not running particularly frequently on any specific database, say every 15 minutes. On one database, I've received two indications of issues completing the canary query in 24 hours. The error is:
Msg 110802, Level 16, State 1, Server adwscdev1, Line 1110802;An internal DMS error occurred that caused this operation to fail. Details: A timeout occurred while waiting for memory resources to execute the query in resource pool 'SloDWPool' (2000000007). Rerun the query.
This database is under essentially no load, running at more than 100 DWU.
Other databases on the same logical server may be running under a load, but I have not seen the error on them.
What is the explanation for this error?
Please open a support ticket for this issue. Support will have full access to the DMS logs and will be able to see exactly what is going on; this behavior is not expected.
While I agree a support case would be reasonable, I think you should also try scaling up to, say, DWU400 and retrying. I would also consider trying largerc or xlargerc on DWU100 and DWU400, as described here. Note that a larger resource class gets more memory and resources per query.
Run the following then retry your query:
EXEC sp_addrolemember 'largerc', 'yourLoginName'

DynamoDB slow response

So my problem is that DynamoDB is taking quite some time to return a single object. I'm using Node.js and the AWS DocumentClient. The weird thing is that it takes from 100 ms to 200 ms to "select" a single item from the DB.
Is there any way to make it faster?
Example code:
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

console.time("user get");
var params = {
  TableName: 'User',
  Key: {
    "id": "2f34rf23-4523452-345234"
  }
};
docClient.get(params, function (err, data) {
  if (err) {
    callback(err);
  } else {
    console.timeEnd("user get");
  }
});
The average for this simple piece of code in Lambda is 130 ms. Any idea what I could do to make it faster? The User table has only a primary partition key "id" and a global secondary index with primary key email. When I try this from my console it takes even more time.
Any help will be much appreciated!
I faced exactly the same issue using Lambda@Edge. Responses from DynamoDB took 130-140 ms on average, while the DynamoDB latency graph showed 10-20 ms latency.
I managed to improve response times to ~30 ms on average by disabling SSL, parameter validation, and convertResponseTypes:
const docClient = new AWS.DynamoDB.DocumentClient({
  apiVersion: '2012-08-10',
  sslEnabled: false,
  paramValidation: false,
  convertResponseTypes: false
});
Most likely the cause of the issue was CPU/network throttling in the Lambda itself. A Lambda@Edge for viewer requests can have at most 128 MB of memory, which makes for a pretty slow Lambda. So disabling the extra checks and SSL validation made things a lot faster.
If you are running just a regular Lambda, increasing memory should fix the issue.
Have you warmed up your Lambda function? If you are only running it ad-hoc, and not running a continuous load, the function might not be available yet on the container running it, so additional time might be taken there. One way to support or refute this theory would be to look at latency metrics for the GetItem API. Finally, you could try using AWS X-Ray to find other spots of latency in your stack.
The DynamoDB SDK could also be retrying, adding to your perceived latency in the Lambda function. Given that your items are around 10 KB, it is possible you are getting throttled. Have you provisioned enough read capacity? You can verify both your read latency and read throttling metrics in the DynamoDB console for your table.
I know this is a little old, but for anyone finding this question now: instantiation of the client can be extremely slow. Local testing was fast, yet accessing DynamoDB from an Elastic Beanstalk instance in the same region was extremely slow!
Accessing Dynamo from a single client instance improved the speeds significantly.
Reusing the connection helped speed up my calls from ~120ms to ~35ms.
Reusing Connections with Keep-Alive in Node.js
By default, the Node.js HTTP/HTTPS agent creates a new TCP connection for every new request. To avoid the cost of establishing a new connection, you can reuse an existing connection.
For short-lived operations, such as DynamoDB queries, the latency overhead of setting up a TCP connection might be greater than the operation itself. Additionally, since DynamoDB encryption at rest is integrated with AWS KMS, you may experience latencies from the database having to re-establish new AWS KMS cache entries for each operation.
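For reference, here is a minimal sketch of that advice with the v2 JavaScript SDK: create a keep-alive agent and the client once at module scope so warm Lambda invocations reuse the TCP/TLS connection (the table and key names come from the question; the handler shape is an assumption):

const https = require('https');
const AWS = require('aws-sdk');

// Created once, outside the handler, so warm invocations reuse the socket
// instead of paying for a new TCP + TLS handshake on every request.
const agent = new https.Agent({ keepAlive: true });
const docClient = new AWS.DynamoDB.DocumentClient({
  httpOptions: { agent },
});

exports.handler = async (event) => {
  const data = await docClient
    .get({ TableName: 'User', Key: { id: event.id } })
    .promise();
  return data.Item;
};

In recent v2 SDK versions, setting the environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 enables the same keep-alive behavior without code changes.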

Amazon Service to submit CPU intensive task

I have a web application running 24/7 on an AWS micro instance, and it works just fine.
Occasionally (10 to 50 times a day) I need to process big amounts of data (stored on RDS) in a CPU-intensive task. That is too much for my micro instance.
Starting an EC2 server for these tasks doesn't seem like a good idea, because they must be executed on demand when a user asks for them, and I need low latency (less than 10 seconds).
Is there any Amazon service where I can submit my task and take advantage of higher CPU capacity?
Keep in mind that my task needs to read a large amount of data from RDS.
You can queue up all those tasks (10 to 50 a day) until a particular cut-off time during the day, then launch an instance to process them and terminate it when the processing is done. The scheduling part can be handled by the micro instance.
Once the micro instance starts the high-compute instance, the rest can be carried out by the high-compute instance; once the queue of tasks to be processed is empty, you can terminate that instance.
This is like going from 0 instances to 1 instance on a schedule. A rough sketch of the pattern is below.
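As an illustration only, the micro instance side of that pattern could enqueue tasks in SQS and boot a worker from a prepared AMI. The queue URL, AMI ID, region, and instance type below are placeholders, not recommendations:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' }); // placeholder region
const ec2 = new AWS.EC2({ region: 'us-east-1' });

// Enqueue a task; the high-compute instance drains the queue when it boots.
async function submitTask(task) {
  await sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/tasks', // placeholder
    MessageBody: JSON.stringify(task),
  }).promise();
}

// Launch the worker; its AMI should poll the queue and shut itself down when empty.
async function launchWorker() {
  await ec2.runInstances({
    ImageId: 'ami-xxxxxxxx', // placeholder AMI with the worker pre-installed
    InstanceType: 'c5.xlarge',
    MinCount: 1,
    MaxCount: 1,
    InstanceInitiatedShutdownBehavior: 'terminate', // shutdown == terminate
  }).promise();
}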
It depends on the cost you are willing to pay, the complexity of the processing, the future scale of your service, your ability to pre-calculate results, and so on.
One option is to have a larger instance (or a pool of instances, as your service scales) ready for processing that you can trigger. You can lower the cost of this machine by using Reserved Instance pricing (http://aws.amazon.com/ec2/purchasing-options/reserved-instances/), or even better, Spot Instances (http://aws.amazon.com/ec2/purchasing-options/spot-instances/). With Spot you face the risk that at times you won't be able to get the instances up and running, so you need to back it with on-demand instance(s).
This is probably a more expensive solution, but as you run more and more jobs like this, the cost per job decreases dramatically.
Another option is to offload the processing to a different service. If you can express your calculation in the query syntax of an external service such as DynamoDB or Redis, you can keep using your micro instance to trigger the query. For example, Redis with ElastiCache can perform complex data manipulations such as sorted set intersections. You do need to make sure your data is also in that other data store, and to write the query.
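For illustration, here is a rough sketch of such a server-side computation with the node_redis v3 callback API; the endpoint and key names are hypothetical:

const redis = require('redis');
const client = redis.createClient(6379, 'my-elasticache-endpoint'); // placeholder host

// Intersect two sorted sets inside Redis and store the result, so the heavy
// lifting happens in ElastiCache rather than on the micro instance.
client.zinterstore('result:active_premium', 2, 'users:active', 'users:premium',
  (err, count) => {
    if (err) return console.error(err);
    console.log(`intersection has ${count} members`);
  });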
Another option is to pre-calculate these results. It really depends on the type of jobs you need to run. If you can prepare the output in advance and only update it with the data that arrived between calculation time and request time, it might also be easier on your machines to prepare for it without these unpredictable CPU peaks.
We used AWS Batch for this, basically because it stands above the other options out there. Besides AWS Batch, we considered additional servers, AWS Lambda, using a separate server for each task, and so on.
We finally chose AWS Batch for a number of reasons:
In AWS Batch, all processes are completely isolated. This means the tasks will not impact each other or break the workflow.
You can set the minimum and maximum RAM you want to use.
AWS Batch supports containers, which makes it easier to integrate if you already use containers.
You can also create queues for different tasks to gain more control over your resources and expenses.
AWS Batch is very cost-effective, meaning you pay for what you use.
It's also pretty easy to set up. Here's a quick code snippet showing how to launch rockets for each individual user. For more info go here: https://fulcrum.rocks/blog/cpu-intensive-tasks
const AWS = require('aws-sdk');
const batch = new AWS.Batch();

// (inside an async function)
const command = "npm run rocket";
const newJob = await new Promise((resolve, reject) => {
  batch.submitJob(
    {
      jobName: "your_important_job",
      jobDefinition: "killer_process",
      jobQueue: "night_users",
      timeout: {
        attemptDurationSeconds: 600
      },
      retryStrategy: {
        attempts: 1
      },
      containerOverrides: {
        vcpus: 2,
        memory: 2048,
        command: [command]
      }
    },
    (err, data) => {
      if (err) {
        console.error(err.message);
        return reject(err);
      }
      resolve(data);
    }
  );
});