Connection to AWS MemoryDB cluster sometimes fails

Connection to AWS MemoryDB cluster sometimes fails - amazon-web-services

We have an application that is using AWS MemoryDB for Redis. We have setup a cluster with one shard and two nodes. One of the nodes (named 0001-001) is a primary read/write while the other one is a read replica (named 0001-002).
After deploying the application, connecting to MemoryDB sometimes fails when we use the cluster endpoint connection string to connect. If we restart the application a few times it suddenly starts working. It seems to be random when it succeeds or not. The error we get is the following:
Endpoint Unspecified/ourapp-memorydb-cluster-0001-001.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com:6379 serving hashslot 6024 is not reachable at this point of time. Please check connectTimeout value. If it is low, try increasing it to give the ConnectionMultiplexer a chance to recover from the network disconnect. IOCP: (Busy=0,Free=1000,Min=2,Max=1000), WORKER: (Busy=0,Free=32767,Min=2,Max=32767), Local-CPU: n/a
If we connect directly to the primary read/write node we get no such errors.
If we connect directly to the read replica it always fails. It even gets the error above, compaining about the "0001-001" node.
We use .NET Core 6
We use Microsoft.Extensions.Caching.StackExchangeRedis 6.0.4 which depends on StackExchange.Redis 2.2.4
The application is hosted in AWS ECS
StackExchangeRedisCache is added to the service collection in a startup file :
services.AddStackExchangeRedisCache(o =>
{
o.InstanceName = redisConfiguration.Instance;
o.ConfigurationOptions = ToRedisConfigurationOptions(redisConfiguration);
});
...where ToRedisConfiguration returns a basic ConfigurationOptions object :
new ConfigurationOptions()
{
EndPoints =
{
{ "clustercfg.ourapp-memorydb-cluster.xxxxx.memorydb.eu-west-1.amazonaws.com", 6379 } // Cluster endpoint
},
User = "username",
Password = "password",
Ssl = true,
AbortOnConnectFail = false,
ConnectTimeout = 60000
};
We tried multiple shards with multiple nodes and it also sometimes fail to connect to the cluster. We even tried to update the dependency StackExchange.Redis to 2.5.43 but no luck.
We could "solve" it by directly connecting to the primary node, but if a failover occurs and 0001-002 becomes the primary node we would have to manually change our connection string, which is not acceptable in a production environment.
Any help or advice is appreciated, thanks!

Related

Errors connecting to AWS Keyspaces using a lambda layer

Intermittently getting the following error when connecting to an AWS keyspace using a lambda layer
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a nodejs lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';
export default class AmazonKeyspace {
tpmsClient = null;
constructor () {
let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
let sslOptions1 = {
ca: [ fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
host: 'cassandra.eu-west-1.amazonaws.com',
rejectUnauthorized: true
};
this.tpmsClient = new cassandra.Client({
contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
localDataCenter: 'eu-west-1',
authProvider: auth,
sslOptions: sslOptions1,
keyspace: 'tpms',
protocolOptions: { port: 9142 }
});
}
getOrganisation = async (orgKey) => {
const SQL = 'select * FROM organisation where organisation_id=?;';
return new Promise((resolve, reject) => {
this.tpmsClient.execute(SQL, [orgKey], {prepare: true}, (err, result) => {
if (!err?.message) resolve(result.rows);
else reject(err.message);
});
});
};
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
Followed the recommended guides. Checked from the AWS console for any errors but none shown.
UPDATE:
Update to the above question....
I am occasionally (1 in 50 if I parallel call the function (5 concurrent calls)) getting the below error:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}

This exception may be caused by throttling in the keyspaces side, resulting the Driver Error that you are seeing sporadically.
I would suggest taking a look over this repo which should help you to put measures in place to either prevent the occurrence of this issue or at least reveal the true cause of the exception.

Some of the errors you see in the logs you will need to investigate Amazon CloudWatch metrics to see if you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A System Error indicates an event that must be resolved by AWS and often part of normal operations. Activities such as timeouts, server faults, or scaling activity could result in server errors. A User error indicates an event that can often be resolved by the user such as invalid query or exceeding a capacity quota. Amazon Keyspaces passes the System Error back as a Cassandra ServerError. In most cases this a transient error, in which case you can retry your request until it succeeds. Using the Cassandra driver’s default retry policy customers can also experience NoHostAvailableException or AllNodesFailedException or messages like yours "All host(s) tried for query failed". This is a client side exception that is thrown once all host in the load balancing policy’s query plan have attempted the request.
Take a look at this retry policy for NodeJs which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and will not be able to do more sophisticated things like circuit breaker patters. You may want to eventually use a "failfast" retry policy for the driver and handle the exceptions in your application code.

Kafka Multi broker setup with ec2 machine: Timed out waiting for a node assignment. Call: createTopics

I am trying to setup kafka with 3 broker nodes and 1 zookeeper node in AWS EC2 instances. I have following server.properties for every broker:
kafka-1:
broker.id=0
listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
advertised.listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
listener.security.protocol.map=,PLAINTEXT_1:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_1
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-2:
broker.id=1
listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
advertised.listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
listener.security.protocol.map=,PLAINTEXT_2:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_2
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-3:
broker.id=2
listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
advertised.listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
listener.security.protocol.map=,PLAINTEXT_3:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_3
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
zookeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
When I ran following command in zookeeper I see that they are connected
I also telnetted from any broker to other ones with broker port they are all connected
However, when I try to create topic with 2 replication factor I get Timed out waiting for a node assignment
I cannot understand what is incorrect with my setup, I see 3 nodes running in zookeeper, but having problems when creating topic. BTW, when I make replication factor 1 I get the same error. How can I make sure that everything is alright with my cluster?

It's good that telnet checks the port is open, but it doesn't verify the Kafka protocol works. You could use kcat utility for that, but the fix includes
listeners are set to either PLAINTEXT://:9092 or PLAINTEXT://0.0.0.0:9092 for every broker, which means using the same port
Removing the number from the listener mapping and advertised listeners property such that each broker is the same
I'd also recommend looking at using Ansible/Terraform/Cloudformation to ensure you consistently modify the cluster rather than edit individual settings manually

An Issue with an AWS EC2 instance WebSocket connection failed: Error in connection establishment: net::ERR_CONNECTION_TIMED_OUT

As I tried to run the chat app from localhost connected to MySQL database which had been coded with PHP via WebSocket it was successful.
Also when I tried to run from the PuTTY terminal logged into SSH credentials, it was displaying as Server Started with the port# 8080
ubuntu#ec3-193-123-96:/home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/server$ php websocket_server.php
PHP Fatal error: Uncaught React\Socket\ConnectionException: Could not bind to tcp://0.0.0.0:8080: Address already in use in /home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/vendor/react/socket/src/Server.php:29
Stack trace:
#0 /home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/vendor/cboden/ratchet/src/Ratchet/Server/IoServer.php(70): React\Socket\Server->listen(8080, '0.0.0.0')
#1 /home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/server/websocket_server.php(121): Ratchet\Server\IoServer::factory(Object(Ratchet\Http\HttpServer), 8080)
#2 {main}
thrown in /home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/vendor/react/socket/src/Server.php on line 29
ubuntu#ec3-193-123-96:/home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/server$
So I tried to change the port#8080 to port# 8282, it was successful
ubuntu#ec3-193-123-96:/home/admin/web/ec3-193-123-96.eu-central-1.compute.amazonaws.com/public_html/application/libraries/server$ php websocket_server.php
Keeping the shell script running, open a couple of web browser windows, and open a Javascript console or a page with the following Javascript:
var conn = new WebSocket('ws://0.0.0.0:8282');
conn.onopen = function(e) {
console.log("Connection established!");
};
conn.onmessage = function(e) {
console.log(e.data);
};
From the browser console results:
WebSocket connection to 'ws://5.160.195.94:8282/' failed: Error in
connection establishment: net::ERR_CONNECTION_TIMED_OUT
websocket_server.php
<?php
use Ratchet\Server\IoServer;
use Ratchet\Http\HttpServer;
use Ratchet\WebSocket\WsServer;
use MyApp\Chat;
require dirname(__DIR__) . '/vendor/autoload.php';
$server = IoServer::factory(
new HttpServer(
new WsServer(
new Chat()
)
),
8282
);
$server->run();
I even tried to assign Public IP and Private IP, but with no good it resulted in the same old result?
This was the composer files generated after executing and adding src folder $composer require cboden/ratchet
composer.json(On AmazonWebServer)
{
"autoload": {
"psr-4": {
"MyApp\\": "src"
}
},
"require": {
"cboden/ratchet": "^0.4.1"
}
}
composer.json(On localhost)
{
"autoload": {
"psr-4": {
"MyApp\\": "src"
}
},
"require": {
"cboden/ratchet": "^0.4.3"
}
}
How am I suppose to resolve/overcome while connecting it from the WebSocket especially from the hosted server with the domain name such as
http://ec3-193-123-96.eu-central-1.compute.amazonaws.com/
var conn = new WebSocket('ws://localhost:8282');
From the Security Group
Under Inbound tab
Under Outbound tab

When it comes to a connectivity issue with an EC2 there are few things you need to check to find the root cause.
SSH into the EC2 instance that the application is running and make sure you can access it from within the EC2 instance. If it works then its a network related issue that we need to solve.
If step 1 was successful. You have now identified it is a network issue to solve this you need to check the following.
Check if an Internet Gateway is created and attached to your VPC.
Next check if your subnets routing table has its default route pointing to the internet gateway. check this link to complete this and the above step.
Check your subnets Network ACLs rules to see if ports are not blocked
finally, you would want to check your Instances Security group as you have shown.
If you need access via a EC2 dns you will need to provision your ec2 instance in a public subnet and assign an elastic IP
If an issue still exists check if the EC2 status checks pass, or try provisioning a new instance.

How to reconnect if AWS RDS recovery happens

How have I written the code
createPool is used at the start of the app
then for every request I am using getConnection
I am using AWS RDS & it went into sudden recovery mode, due to which my db url was unchanged but instance IP must have changed as it was created in another AZ
So for such a scenario I am supposed to reinitialize my db connection so that new instance DNS is updated.
The issue is in such a scenario I did not received any timeout error or connection error. So how do I capture this type of error?
Kindly guide if possible.
Thanks

It is unclear from your description what exactly you have built, but it sounds like you've created a connection pool.
If you open a connection to the db, the first time you call getConnection you should validate that the connection is still active - obviously if the db fails over, the existing connection will get closed, and you will either need to create a new connection or re-open the existing one.

SSL connection from AWS lambda to AWS Redshift

I am trying to connect to an AWS Redshift database from a lambda function using c#, dotnet core 2.0, and npgsql. I am having difficulty with SSL.
I have created two non-publicly-accessible Redshift databases in a dedicated VPC. The lambda executes in the same VPC. The two databases are identical in every way except that one has the "force SSL" parameter set to true.
Using the following code snippet, I can access the non-SSL database just fine:
using (var conn = new NpgsqlConnection ("Host=x; Port=5439; Username=x;
Password=x;Database=xxx")
{
Console.WriteLine("Redshift pre-Open!");
conn.Open();
Console.WriteLine("Redshift: post-Open!");
...
}
When I access the SSL database, I get the "missing hba.conf" error message - seems standard, I've seen it before ...
When I append to the connection string: "ssl Mode=Require;Server Compatibility Mode=Redshift;Trust Server Certificate=true"
the conn.open statement hangs, and the second write statement never shows up in cloudwatch.
And yet ... this connection statement works when accessing the same database thru a rest API and C#/dotnetcore 2 WEBAPI (same runtime environment), with
an EC2 instance and load balancer.
A Python lambda connecting to the SSL database, in the same environment - subnets, security groups, lambda triggers, lambda parameters, ... is working just fine.
The csproj references Amazon.Lambda.Core 1.0.0, Amazon.Lambda.Serialization.Json 1.1.0, and
Npgsql.EntityFrameworkCore.PostgreSQL 2.0.1.
I'd try Wireshark, maybe, in another environment - but running as a lambda, I'm not sure how best to debug. I've tried many permutations and combinations, and I wouldn't put it past myself to be missing something blindingly obvious,
but I absolutely do not see why hangs. Thank you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js