GCP BigTable Metrics - what do 404 requests mean? - google-cloud-platform

We switched to Bigtable some time ago, and since then we have been seeing a number of "404 requests" and also a high number of errors in the GCP Metrics console.
We see no errors in our logs, and even data storage/retrieval seems to work as expected.
What is causing these errors, and how can we find out where they come from?

As mentioned previously, a 404 means the resource was not found. The relevant resource here is the Bigtable table (which could mean that either the instance id or the table id is misconfigured in your application).
I'm guessing that you are looking at the metrics under APIs & Services > Cloud Bigtable API. These metrics show the response codes from the Cloud Bigtable service. You should be able to see this error rate under Monitoring > Metrics Explorer > metric:bigtable.googleapis.com/server/error_count, grouping by instance, method, error_code and app_profile. This will tell you which instance and which RPC are causing the errors, which lets you grep your source code for incorrect usages.
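If you prefer to pull these numbers programmatically rather than through the console, here is a minimal sketch (assuming the google-cloud-monitoring Java client; the project id is a placeholder) that lists the server/error_count time series for the last hour so you can inspect the method and error_code labels:

import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest.TimeSeriesView;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.protobuf.util.Timestamps;

public class BigtableErrorCount {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project"; // placeholder: replace with your project id
    try (MetricServiceClient client = MetricServiceClient.create()) {
      long now = System.currentTimeMillis();
      // Look at the last hour of data
      TimeInterval interval = TimeInterval.newBuilder()
          .setStartTime(Timestamps.fromMillis(now - 3_600_000L))
          .setEndTime(Timestamps.fromMillis(now))
          .build();
      String filter = "metric.type=\"bigtable.googleapis.com/server/error_count\"";
      client.listTimeSeries(ProjectName.of(projectId), filter, interval, TimeSeriesView.FULL)
          .iterateAll()
          .forEach(ts -> System.out.println(
              ts.getResource().getLabelsMap() + " " + ts.getMetric().getLabelsMap()));
    }
  }
}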
A significantly more complex approach is to install an interceptor in the Bigtable client that:
dumps the resource name of the RPC
once you identify the problematic table name, logs the stack trace of the caller
Something along these lines:
import com.google.cloud.bigtable.data.v2.BigtableDataSettings;
import com.google.cloud.bigtable.data.v2.stub.EnhancedBigtableStubSettings;
import com.google.common.collect.ImmutableList;
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Message;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall;
import io.grpc.MethodDescriptor;
import java.util.concurrent.ConcurrentHashMap;

BigtableDataSettings.Builder builder = BigtableDataSettings.newBuilder()
    .setProjectId("...")
    .setInstanceId("...");

// Remembers which table names have already been logged
ConcurrentHashMap<String, Boolean> seenTables = new ConcurrentHashMap<>();

builder.stubSettings().setTransportChannelProvider(
    EnhancedBigtableStubSettings.defaultGrpcTransportProviderBuilder()
        .setInterceptorProvider(() -> ImmutableList.of(new ClientInterceptor() {
          @Override
          public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
              MethodDescriptor<ReqT, RespT> methodDescriptor, CallOptions callOptions,
              Channel channel) {
            return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
                channel.newCall(methodDescriptor, callOptions)) {
              @Override
              public void sendMessage(ReqT message) {
                Message protoMessage = (Message) message;
                // Bigtable data request protos carry the target table in a "table_name" field
                FieldDescriptor desc = protoMessage.getDescriptorForType()
                    .findFieldByName("table_name");
                if (desc != null) {
                  String tableName = (String) protoMessage.getField(desc);
                  if (seenTables.putIfAbsent(tableName, true) == null) {
                    System.out.println("Found new tableName: " + tableName);
                  }
                  if ("projects/my-project/instances/my-instance/tables/my-mispelled-table".equals(
                      tableName)) {
                    new RuntimeException(
                        "Fake error to get caller location of mispelled table id").printStackTrace();
                  }
                }
                delegate().sendMessage(message);
              }
            };
          }
        }))
        .build()
);
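With the interceptor installed, you build the data client from these settings as usual, e.g.:

BigtableDataClient dataClient = BigtableDataClient.create(builder.build());

Every data RPC then passes through sendMessage(), so a misspelled table name shows up in your logs together with a stack trace pointing at the caller.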

Google Cloud Support here,
Without more insight I won't be able to provide much concrete information about this 404 issue.
The issue is most likely either a typo or a configuration problem, but I cannot confirm that from the shared data.
In order to provide more meaningful support, I would suggest opening a Public Issue Tracker issue or a Google Cloud Support ticket.

Related

Errors connecting to AWS Keyspaces using a lambda layer

Intermittently getting the following error when connecting to an AWS keyspace using a lambda layer
All host(s) tried for query failed. First host tried, 3.248.244.53:9142: Host considered as DOWN. See innerErrors.
I am trying to query a table in a keyspace using a nodejs lambda function as follows:
import cassandra from 'cassandra-driver';
import fs from 'fs';

export default class AmazonKeyspace {
  tpmsClient = null;

  constructor () {
    let auth = new cassandra.auth.PlainTextAuthProvider('cass-user-at-xxxxxxxxxx', 'zzzzzzzzz');
    let sslOptions1 = {
      ca: [fs.readFileSync('/opt/utils/AmazonRootCA1.pem', 'utf-8')],
      host: 'cassandra.eu-west-1.amazonaws.com',
      rejectUnauthorized: true
    };

    this.tpmsClient = new cassandra.Client({
      contactPoints: ['cassandra.eu-west-1.amazonaws.com'],
      localDataCenter: 'eu-west-1',
      authProvider: auth,
      sslOptions: sslOptions1,
      keyspace: 'tpms',
      protocolOptions: { port: 9142 }
    });
  }

  getOrganisation = async (orgKey) => {
    const SQL = 'select * FROM organisation where organisation_id=?;';
    return new Promise((resolve, reject) => {
      this.tpmsClient.execute(SQL, [orgKey], { prepare: true }, (err, result) => {
        if (!err?.message) resolve(result.rows);
        else reject(err.message);
      });
    });
  };
}
I am basically following this recommended AWS documentation.
https://docs.aws.amazon.com/keyspaces/latest/devguide/using_nodejs_driver.html
It seems that around 10-20% of the time the lambda function (cassandra driver) cannot connect to the endpoint.
I am pretty familiar with Cassandra (I already use a 6 node cluster that I manage) and don't have any issues with that.
Could this be a timeout or do I need more contact points?
I followed the recommended guides and checked the AWS console for errors, but none are shown.
UPDATE:
I am occasionally (about 1 in 50 calls, when I invoke the function with 5 concurrent calls) getting the error below:
"All host(s) tried for query failed. First host tried,
3.248.244.5:9142: DriverError: Socket was closed at Connection.clearAndInvokePending
(/opt/node_modules/cassandra-driver/lib/connection.js:265:15) at
Connection.close
(/opt/node_modules/cassandra-driver/lib/connection.js:618:8) at
TLSSocket.
(/opt/node_modules/cassandra-driver/lib/connection.js:93:10) at
TLSSocket.emit (node:events:525:35)\n at node:net:313:12\n at
TCP.done (node:_tls_wrap:587:7) { info: 'Cassandra Driver Error',
isSocketError: true, coordinator: '3.248.244.5:9142'}
This exception may be caused by throttling on the Keyspaces side, resulting in the DriverError that you are seeing sporadically.
I would suggest taking a look at this repo, which should help you put measures in place to either prevent this issue from occurring or at least reveal the true cause of the exception.
For some of the errors you see in the logs, you will need to investigate the Amazon CloudWatch metrics to see whether you have throttling or system errors. I've built this AWS CloudFormation template to deploy a CloudWatch dashboard with all the appropriate metrics. This will provide better observability for your application.
A System Error indicates an event that must be resolved by AWS and is often part of normal operations. Activities such as timeouts, server faults, or scaling activity could result in server errors. A User Error indicates an event that can often be resolved by the user, such as an invalid query or exceeding a capacity quota. Amazon Keyspaces passes a System Error back as a Cassandra ServerError. In most cases this is a transient error, in which case you can retry your request until it succeeds. With the Cassandra driver's default retry policy, customers can also experience NoHostAvailableException or AllNodesFailedException, or messages like yours: "All host(s) tried for query failed". This is a client-side exception that is thrown once all hosts in the load balancing policy's query plan have attempted the request.
Take a look at this retry policy for Node.js, which should help resolve your "All hosts failed" exception or pass back the original exception.
The retry policies in the Cassandra drivers are pretty crude and cannot do more sophisticated things like circuit breaker patterns. You may eventually want to use a "fail fast" retry policy for the driver and handle the exceptions in your application code.

AWS Keyspace DSBulk unload failed, "Token metadata not present"

Getting an error when trying to unload or count data from AWS Keyspaces using DSBulk.
Error:
Operation COUNT_20221021-192729-813222 failed: Token metadata not present.
Command line:
$ dsbulk count/unload -k my_best_storage -t book_awards -f ./dsbulk_keyspaces.conf
Config:
datastax-java-driver {
  basic.contact-points = [ "cassandra.us-east-2.amazonaws.com:9142" ]
  advanced.auth-provider {
    class = PlainTextAuthProvider
    username = "aw.keyspaces-at-XXX"
    password = "XXXX"
  }
  basic.load-balancing-policy {
    local-datacenter = "us-east-2"
  }
  basic.request {
    consistency = LOCAL_QUORUM
    default-idempotence = true
  }
  advanced {
    request {
      log-warnings = true
    }
    ssl-engine-factory {
      class = DefaultSslEngineFactory
      truststore-path = "./cassandra_truststore.jks"
      truststore-password = "XXX"
      hostname-validation = false
    }
    metadata {
      token-map.enabled = false
    }
  }
}
dsbulk load - the load operation works fine...
I suspect the problem here is that your cluster is using the proprietary com.amazonaws.cassandra.DefaultPartitioner partitioner which most open-source tools and drivers don't recognise.
The DataStax Bulk Loader (DSBulk) tool uses the Cassandra Java driver under the hood to connect to Cassandra clusters. The Java driver uses the partitioner to determine which nodes own which token ranges. Only the following Cassandra partitioners are supported:
Murmur3Partitioner
RandomPartitioner
ByteOrderedPartitioner
Since the Java driver doesn't know about DefaultPartitioner, it doesn't have a map of token range owners (token metadata) and so can't determine how to "split" the Cassandra ring to query the nodes.
As you already figured out, this doesn't affect the load command because it simply sends writes to coordinators and lets the coordinators figure out how the data is partitioned. But for unload and count commands which require reads, the Java driver can't determine which coordinators to pick for sub-range queries with an unsupported partitioner.
Maybe as a workaround you can try to disable token-awareness with:
$ dsbulk count [...]
--driver.advanced.metadata.token-map.enabled false
but I don't have an AWS Keyspaces cluster to test against, and I'm doubtful it will work. In any case, you're welcome to try.
There is an outstanding DSBulk feature request to provide the ability to completely disable token-awareness (internal ticket ID DAT-622) but it is unassigned at the time of writing so I'm not in a position to provide any expectation on when it will be prioritised. Cheers!
Amazon Keyspaces now supports multiple partitioners, including the Murmur3Partitioner. See the following to update your partitioner. You will also want to set token-map.enabled to true.
metadata {
  token-map.enabled = true
}
Additionally, if you are using VPC Endpoints you will need the following permissions to make sure that you will see available peers.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListVPCEndpoints",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcEndpoints"
      ],
      "Resource": "*"
    }
  ]
}
I would also recommend increasing the connection pool size for the data load process:
advanced.connection.pool.local.size = 3
Finally, I would recommend using AWS Glue instead of DSBulk. DSBulk is a single-process tool and will not scale for larger data loads. Additionally, learning Glue will be helpful in managing other aspects of the data lifecycle. See my example of how to unload/export data using AWS Glue.

Response is different between google list jobs API and its corresponding java library

I am testing the list jobs API from the Google side and want to use it in my Java application, so I am using the corresponding Java Maven library.
public void listJobs() {
  try {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    BigQuery bigquery = getBigQuery();
    long time = 1653340103419L;
    Page<Job> jobs = bigquery.listJobs(BigQuery.JobListOption.minCreationTime(1653264000000L),
        BigQuery.JobListOption.maxCreationTime(1653350399000L),
        BigQuery.JobListOption.allUsers(), BigQuery.JobListOption.pageSize(100));
    if (jobs == null) {
      System.out.println("Dataset does not contain any jobs.");
      return;
    }
    jobs.getValues().forEach(job -> System.out.println("Success! Job ID: " + job.getJobId()));
  } catch (BigQueryException | IOException e) {
    System.out.println("Jobs not listed in dataset due to error: \n" + e.toString());
  }
}
But the response is different from the web API mentioned above. For example, if I set minCreationTime to the current timestamp and then execute some jobs in the BigQuery console, those jobs show up immediately in the web API but not on the Java library side; there is a latency of some 3-4 hours before those jobs appear through the library.
Ultimately this library calls the same API, so what am I missing? My desired end result is simple: I want to poll jobs repeatedly with this library after a particular timestamp, i.e. using the minCreationTime parameter. That works exactly as expected in the web API but not through the library. I have inspected the web API to check whether it converts the given timestamp, and it sends the exact parameters I provide. Where exactly am I going wrong, and why is there a difference in output?

Streaming pubsub -bigtable using apache beam dataflow java

I am trying to write a Pub/Sub JSON message to Bigtable. I am running the code from my local machine. The Dataflow job gets created, but I don't see any data written to the Bigtable instance, and no error is thrown in the console or in the Dataflow job. I also tried writing a hardcoded value to Bigtable, but that didn't work either. Can anyone suggest what the issue might be or guide me here?
try {
  PipelineOptions options = PipelineOptionsFactory.fromArgs(projectArgs).create();
  options.setRunner(DataflowRunner.class);
  System.out.println("tempfile-" + options.getTempLocation());
  Pipeline p = Pipeline.create(options);
  System.out.println("options" + options.getTempLocation());
  p.apply("Read PubSub Messages", PubsubIO.readStrings().fromTopic(PUBSUB_SUBSCRIPTION))
      .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
      .apply(ParDo.of(new RowGenerator()))
      .apply(CloudBigtableIO.writeToTable(bigtableConfig));
  p.run();
} catch (Exception e) {
  // TODO: handle exception
  System.out.println(e);
}
}

@ProcessElement
public void processElement(ProcessContext context) {
  try {
    System.out.println("In for RowGenerator");
    String decodedMessageAsJsonString = context.element();
    System.out.println("decodedMessageAsJsonString" + decodedMessageAsJsonString);
    String rowKey = String.valueOf(
        LocalDateTime.ofInstant(Instant.now(), ZoneId.of("UTC"))
            .toEpochSecond(ZoneOffset.UTC));
    System.out.println("rowKey" + rowKey);
    Put put = new Put(rowKey.getBytes());
    put.addColumn("VALUE".getBytes(), "VALUE".getBytes(), decodedMessageAsJsonString.getBytes());
    // put.addColumn(Bytes.toBytes("IBS"), Bytes.toBytes("name"), Bytes.toBytes("ram"));
    context.output(put);
  } catch (Throwable e) {
    // TODO: handle exception
    System.out.println(e);
  }
}
I don't see any issue with the Bigtable side of the template. Just make sure that the column family (which I am assuming is "VALUE") exists on the destination table.
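If you want to check that from code rather than from the console, here is a minimal sketch (assuming the Bigtable table admin Java client; the project, instance and table ids are placeholders, not values from the question) that prints the column families of the destination table:

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.ColumnFamily;
import com.google.cloud.bigtable.admin.v2.models.Table;

public class CheckColumnFamily {
  public static void main(String[] args) throws Exception {
    // Placeholders: replace with your actual project, instance and table ids
    try (BigtableTableAdminClient adminClient =
        BigtableTableAdminClient.create("my-project", "my-instance")) {
      Table table = adminClient.getTable("my-table");
      for (ColumnFamily family : table.getColumnFamilies()) {
        System.out.println("Column family: " + family.getId());
      }
    }
  }
}

If "VALUE" does not appear in that list, the mutations will be rejected on the Bigtable side.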
Are you sure that you are reading the right Pub/Sub subscription and that messages are being sent to Pub/Sub? If that is all correct, it seems there is some issue in the Pub/Sub configuration. Maybe add the Pub/Sub tag to the question and someone from the Pub/Sub community can help.
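As a quick sanity check on the Pub/Sub side, you could attach a plain subscriber and confirm that messages actually arrive. Here is a minimal sketch (assuming the google-cloud-pubsub Java client; the project and subscription ids are placeholders). Note that it pulls and acks messages, so run it against a test subscription or while the pipeline is stopped:

import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class SubscriptionSanityCheck {
  public static void main(String[] args) throws Exception {
    // Placeholders: replace with your actual project and subscription ids
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");
    Subscriber subscriber = Subscriber.newBuilder(subscription, (message, consumer) -> {
      System.out.println("Received: " + message.getData().toStringUtf8());
      consumer.ack();
    }).build();
    subscriber.startAsync().awaitRunning();
    // Listen for one minute, then shut down
    Thread.sleep(60_000);
    subscriber.stopAsync().awaitTerminated();
  }
}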

Google Cloud Monitoring - Get uptime check current status

I created an uptime check for my website. Then, I found this documentation page that shows how to extract information regarding the uptime check with C#.
After running the code:
public static object GetUptimeCheckConfig(string configName)
{
    var client = UptimeCheckServiceClient.Create();
    UptimeCheckConfig config = client.GetUptimeCheckConfig(configName);
    if (config == null)
    {
        Console.Error.WriteLine(
            "No configuration found with the name {0}", configName);
        return -1;
    }
    Console.WriteLine("Name: {0}", config.Name);
    Console.WriteLine("Display Name: {0}", config.DisplayName);
    Console.WriteLine("Http Path: {0}", config.HttpCheck.Path);
    return 0;
}
I found that this method provides information only about the configuration of the check. I want to get information about its current status (working / broken). It seems like this information is missing.
I also tried this REST call helper - the requested information is missing there too.
Is it possible to extract the current health status of the resource?
Or do I need to choose a more complex way to extract the data (e.g. via webhooks)?
From GCP metrics docs:
To monitor the availability of a service, create an uptime check. These checks monitor the monitoring.googleapis.com/uptime_check/check_passed metric type. Don't configure an alerting policy to track a metric type such as compute.googleapis.com/instance/uptime if your goal is to monitor the availability of a service.
And then at uptime check docs:
To determine the status of your uptime checks using the API, monitor the metric monitoring.googleapis.com/uptime_check/check_passed. See Google Cloud metrics list for details.
Original answer:
Instead of GetUptimeCheckConfig, you want to use the timeSeries API.
You can try it in API explorer at https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.timeSeries/query
Request args:
projects/YOUR_PROJECT_ID
Request body:
{
  "query": "fetch uptime_url::monitoring.googleapis.com/uptime_check/request_latency | filter check_id = 'YOUR_CHECK_ID' | group_by [checker_location]"
}
* just make sure you replace YOUR_PROJECT_ID and YOUR_CHECK_ID with actual ids
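If you would rather run this query from code than from the API Explorer, here is a minimal sketch using the Java monitoring client (an assumption on my part - the question uses C#, but any client that can call projects.timeSeries.query works the same way); it issues the same query as the request body above:

import com.google.cloud.monitoring.v3.QueryServiceClient;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.QueryTimeSeriesRequest;

public class UptimeCheckStatus {
  public static void main(String[] args) throws Exception {
    try (QueryServiceClient client = QueryServiceClient.create()) {
      QueryTimeSeriesRequest request = QueryTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of("YOUR_PROJECT_ID").toString())
          .setQuery("fetch uptime_url::monitoring.googleapis.com/uptime_check/request_latency"
              + " | filter check_id = 'YOUR_CHECK_ID'"
              + " | group_by [checker_location]")
          .build();
      // Each returned series holds the recent data points per checker location
      client.queryTimeSeries(request).iterateAll()
          .forEach(series -> System.out.println(series));
    }
  }
}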