AWS Keyspace DSBulk unload failed, "Token metadata not present" - amazon-web-services

Getting error when trying to unload or count data from AWS Keyspace using dsbulk.
Error:
Operation COUNT_20221021-192729-813222 failed: Token metadata not present.
Command line:
$ dsbulk count/unload -k my_best_storage -t book_awards -f ./dsbulk_keyspaces.conf
Config:
datastax-java-driver {
basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142"]
advanced.auth-provider {
class = PlainTextAuthProvider
username = "aw.keyspaces-at-XXX"
password = "XXXX"
}
basic.load-balancing-policy {
local-datacenter = "us-east-2"
}
basic.request {
consistency = LOCAL_QUORUM
default-idempotence = true
}
advanced {
request{
log-warnings = true
}
ssl-engine-factory {
class = DefaultSslEngineFactory
truststore-path = "./cassandra_truststore.jks"
truststore-password = "XXX"
hostname-validation = false
}
metadata {
token-map.enabled = false
}
}
}
dsbulk load - loading operator works fine...

I suspect the problem here is that your cluster is using the proprietary com.amazonaws.cassandra.DefaultPartitioner partitioner which most open-source tools and drivers don't recognise.
The DataStax Bulk Loader (DSBulk) tool uses the Cassandra Java driver under the hood to connect to Cassandra clusters. The Java driver uses the partitioner to determine which nodes own tokens [ranges]. Only the following Cassandra partitioners are supported:
Murmur3Partitioner
RandomPartitioner
ByteOrderedPartitioner
Since the Java driver doesn't know about DefaultPartitioner, it doesn't have a map of token range owners (token metadata) and so can't determine how to "split" the Cassandra ring to query the nodes.
As you already figured out, this doesn't affect the load command because it simply sends writes to coordinators and lets the coordinators figure out how the data is partitioned. But for unload and count commands which require reads, the Java driver can't determine which coordinators to pick for sub-range queries with an unsupported partitioner.
Maybe as a workaround you can try to disable token-awareness with:
$ dsbulk count [...]
--driver.advanced.metadata.token-map.enabled false
but I don't have an AWS Keyspaces cluster I could test and I'm doubtful it will work. In any case, you're welcome to try.
There is an outstanding DSBulk feature request to provide the ability to completely disable token-awareness (internal ticket ID DAT-622) but it is unassigned at the time of writing so I'm not in a position to provide any expectation on when it will be prioritised. Cheers!

Amazon Keyspaces now supports multiple partitioners including MurMr3Partitioner. See the following to update your partitioner. You will also want to set token-map.enabled to true.
metadata {
token-map.enabled = true
}
Additionally, if you are using VPC Endpoints you will need the following permissions to make sure that you will see available peers.
{
"Version":"2012-10-17",
"Statement":[
{
"Sid":"ListVPCEndpoints",
"Effect":"Allow",
"Action":[
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeVpcEndpoints"
],
"Resource":"*"
}
]
}
I would also recommend increasing the connection pool size for the data load process.
advanced.connection.pool.local.size = 3
Finally, I would recommend using AWS glue instead of DSBulk. DSBulk is single process tool and will not scale for larger data loads. Additionally, learning glue will be helpful in managing other aspects of the data lifecycle. See my example on how to unload/export data using AWS Glue.

Related

Errors connecting to AWS Keyspaces using a lambda layer

Intermittently getting the following error when connecting to an AWS keyspace using a lambda layer
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a nodejs lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';
export default class AmazonKeyspace {
tpmsClient = null;
constructor () {
let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
let sslOptions1 = {
ca: [ fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
host: 'cassandra.eu-west-1.amazonaws.com',
rejectUnauthorized: true
};
this.tpmsClient = new cassandra.Client({
contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
localDataCenter: 'eu-west-1',
authProvider: auth,
sslOptions: sslOptions1,
keyspace: 'tpms',
protocolOptions: { port: 9142 }
});
}
getOrganisation = async (orgKey) => {
const SQL = 'select * FROM organisation where organisation_id=?;';
return new Promise((resolve, reject) => {
this.tpmsClient.execute(SQL, [orgKey], {prepare: true}, (err, result) => {
if (!err?.message) resolve(result.rows);
else reject(err.message);
});
});
};
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
Followed the recommended guides. Checked from the AWS console for any errors but none shown.
UPDATE:
Update to the above question....
I am occasionally (1 in 50 if I parallel call the function (5 concurrent calls)) getting the below error:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling in the keyspaces side, resulting the Driver Error that you are seeing sporadically.
I would suggest taking a look over this repo which should help you to put measures in place to either prevent the occurrence of this issue or at least reveal the true cause of the exception.
Some of the errors you see in the logs you will need to investigate Amazon CloudWatch metrics to see if you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A System Error indicates an event that must be resolved by AWS and often part of normal operations. Activities such as timeouts, server faults, or scaling activity could result in server errors. A User error indicates an event that can often be resolved by the user such as invalid query or exceeding a capacity quota. Amazon Keyspaces passes the System Error back as a Cassandra ServerError. In most cases this a transient error, in which case you can retry your request until it succeeds. Using the Cassandra driver’s default retry policy customers can also experience NoHostAvailableException or AllNodesFailedException or messages like yours "All host(s) tried for query failed". This is a client side exception that is thrown once all host in the load balancing policy’s query plan have attempted the request.
Take a look at this retry policy for NodeJs which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and will not be able to do more sophisticated things like circuit breaker patters. You may want to eventually use a "failfast" retry policy for the driver and handle the exceptions in your application code.

GCP BigTable Metrics - what do 404 requests mean?

We switched to BigTable some time ago and since then there is a number of "404 requests" and also a high number of errors in the GCP Metrics console.
We see no errors in our logs and even data storage/retrieval seems to work as expected.
What is the cause for these errors and how is it possible to find out what is causing them?
As mentioned previously 404 means resource is not found. The relevant resource here is the Bigtable table (which could mean that either the instance id or table id are misconfigured in your application).
I'm guessing that you are looking at the metrics under APIs & Services > Cloud Bigtable API. These metrics show the response code from the Cloud Bigtable Service. You should be able to see this error rate under Monitoring > Metrics Explorer > metric:bigtable.googleapis.com/server/error_count and grouping by instance, method, error_code and app_profile. This will tell which instance and which RPC is causing the errors. Which let you grep your source code for incorrect usages.
A significantly more complex approach is that you can install an interceptor in Bigtable client that:
dumps the resource name of the RPC
once you identify the problematic table name, logs the stack trace of the caller
Something along these lines:
BigtableDataSettings.Builder builder = BigtableDataSettings.newBuilder()
.setProjectId("...")
.setInstanceId("...");
ConcurrentHashMap<String, Boolean> seenTables = new ConcurrentHashMap<>();
builder.stubSettings().setTransportChannelProvider(
EnhancedBigtableStubSettings.defaultGrpcTransportProviderBuilder()
.setInterceptorProvider(() -> ImmutableList.of(new ClientInterceptor() {
#Override
public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
MethodDescriptor<ReqT, RespT> methodDescriptor, CallOptions callOptions,
Channel channel) {
return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(channel.newCall(methodDescriptor, callOptions)) {
#Override
public void sendMessage(ReqT message) {
Message protoMessage = (Message) message;
FieldDescriptor desc = protoMessage.getDescriptorForType()
.findFieldByName("table_name");
if (desc != null) {
String tableName = (String) protoMessage.getField(desc);
if (seenTables.putIfAbsent(tableName, true) == null) {
System.out.println("Found new tableName: " + tableName);
}
if ("projects/my-project/instances/my-instance/tables/my-mispelled-table".equals(
tableName)) {
new RuntimeException(
"Fake error to get caller location of mispelled table id").printStackTrace();
}
}
delegate().sendMessage(message);
}
};
}
}))
.build()
);
Google Cloud Support here,
Without more insight I won’t be able to provide valid information about this 404 issue.
The issue must be either a typo or with the configuration, but cannot confirm with the shared data.
In order to provide more meaningful support, I would suggest you to open a Public Issue Tracker or a Google Cloud Support ticket.

AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically #aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
CloudWatchLogsClient,
DescribeLogStreamsCommand,
GetLogEventsCommand
} = require("#aws-sdk/client-cloudwatch-logs");
// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
region: process.env.AWS_REGION,
});
// APPLICATION CODE
const logGroupName = getLogGroupName();
// Get the log streams for the given log group.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
descending: true,
logGroupName,
orderBy: 'LastEventTime',
limit: 50,
}))
// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;
// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
logGroupName,
logStreamName,
}));
const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me yet, the AWS SDK doesn't seem to recognize that.
Here's some interesting details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of pretty small log streams whereas env-b` has three extremely large log streams for each of its long-lived EC2 instances. The logs are easily >1 MBs in size.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit and the AWS SDK is handling it by returning 0 log events for env-b? I tried setting a limit on the GetLogEventsCommand above, but still causes the AWS SDK to return 0 events for env-a.
Another interesting note: if I go to Amazon CloudWatch > Log Group and select env-a's Log Group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors on the console. This seems to confirm that the log stream's event count is simply too large for both the AWS console and AWS SDK to process, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?

How to increase _cluster/settings/cluster.max_shards_per_node for AWS Elasticsearch Service

I uses AWS Elasticsearch service version 7.1 and its built-it Kibana to manage application logs. New indexes are created daily by Logstash. My Logstash gets error about maximum shards limit reach from time to time and I have to delete old indexes for it to become working again.
I found from this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-handling-errors.html) that I have an option to increase _cluster/settings/cluster.max_shards_per_node.
So I have tried that by put following command in Kibana Dev Tools
PUT /_cluster/settings
{
"defaults" : {
"cluster.max_shards_per_node": "2000"
}
}
But I got this error
{
"Message": "Your request: '/_cluster/settings' payload is not allowed."
}
Someone suggests that this error occurs when I try to update some settings that are not allowed by AWS, but this document (https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-supported-es-operations.html#es_version_7_1) tells me that cluster.max_shards_per_node is one in the allowed list.
Please suggest how to update this settings.
You're almost there, you need to rename defaults to persistent
PUT /_cluster/settings
{
"persistent" : {
"cluster.max_shards_per_node": "2000"
}
}
Beware though, that the more shards you allow per node, the more resources each node will need and the worse the performance can get.

How to avoid a TooManyApplicationVersion exception on AWS Elastic Beanstalk?

Lately has anyone witnessed the
TooManyApplicationVersions Exception
on AWS Elastic Beanstalk console while deploying a new application version (war)? It's so annoying to see this message as it appears only after you have finished uploading the war.
I would be interested to know why this exception occurs and what precautions one should take to avoid such situations?
Cause
The exception you are seeing stems from reaching your respective account limits for AWS Elastic Beanstalk, see section Errors in CreateApplicationVersion [paraphrased]:
TooManyApplicationVersions - The caller has exceeded the limit on the
number of application versions associated with their account.
TooManyApplications - The caller has exceeded the limit on the number of applications associated with their account.
The current limits are outlined in the respective FAQ How many applications can I run with AWS Elastic Beanstalk?:
You can create up to 25 applications and 500 application versions. By
default you can run up to 10 environments across all of your
applications. If you are also using AWS outside of Elastic Beanstalk,
you may not be [...] If you need more resources, complete the AWS Elastic
Beanstalk request form and your request will be promptly evaluated. [emphasis mine]
Solution
As emphasized, AWS offers the usual escalation option and allows you to submit a Request to Increase AWS Elastic Beanstalk Limits, if you really need that many application versions to be available for reuse still. Otherwise you might just delete older ones you will not use anymore and the problem should vanish accordingly.
Good luck!
Here's a one liner that uses the AWS CLI that will help you clear out old application versions:
aws elasticbeanstalk describe-application-versions --output text --query 'ApplicationVersions[*].[ApplicationName,VersionLabel,DateCreated]' | grep "2014-02" | while read app ver date; do aws elasticbeanstalk delete-application-version --application-name $app --version-label $ver --delete-source-bundle; done
Replace the grep with whatever date, (2013, 2014-01, 2014-02-0, etc) you see fit.
As of EB CLI 3.3, you can now run the following to clear out old versions:
$ eb labs cleanup-versions
By default this will cleanup to the last 10 versions and/or older than 60 days. Adding --help, outputs the following:
usage: eb labs cleanup-versions [options...]
Cleans up old application versions.
optional arguments:
--num-to-leave NUM number of versions to leave DEFAULT=10
--older-than DAYS delete only versions older than x days DEFAULT=60
--force don't prompt for confirmation
You're approaching maximum number of versions and need to delete old or unused ones.
In the current web console you can simply do it on the Application Versions tab of your Beanstalk environment.
This is the piece of code that we use in our deploy script to delete the oldest application version.
console.log('Deleting oldest application version.');
params = {};
local.waitFor(function(done) {
eb.describeApplicationVersions(params, function(err, data) {
if (err) {
console.error(err, err.stack);
local.abort('Could not retrieve the list of application version.');
} else {
// This is probably not needed as the list is already sorted but it is
// not written anywhere that this will always be the case
function compare(a,b) {
if (a.DateCreated > b.DateCreated)
return -1;
if (a.DateCreated < b.DateCreated)
return 1;
return 0;
}
var applicationsVersion = data['ApplicationVersions'].sort(compare),
oldestApplication = applicationsVersion[applicationsVersion.length - 1],
applicationName = oldestApplication['ApplicationName'],
versionLabel = oldestApplication['VersionLabel'];
params = {
ApplicationName: applicationName, /* required */
VersionLabel: versionLabel, /* required */
DeleteSourceBundle: false /* Do not delete source bundle from S3 */
};
eb.deleteApplicationVersion(params, function(err, data) {
if (err) {
console.error(err, err.stack);
local.abort('Could not delete the oldest application version. (' + versionLabel + ')')
} else {
console.log('Successfully deleted the oldest application version. (' + versionLabel + ')');
}
});
}
});
});
Documentation for the Elastic Beantalk API (js): http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/ElasticBeanstalk.html