How can I manually specify an X-Cloud-Trace-Context header value to correlate and trace logs across separate Cloud Run requests?

I'm using Cloud Run and Cloud Tasks to do some async processing of webhooks. When I get a request to my Cloud Run service, I queue up a task in my Cloud Tasks queue and return a response from my service immediately. Cloud Tasks will then trigger my service again (different endpoint) and do some processing. I want to correlate all the logs in these steps by using the same trace id, but it is not working.
When creating a task in Cloud Tasks, I request it to send the X-Cloud-Trace-Context header and I fill it with the original request's X-Cloud-Trace-Context header value. Theoretically, when the request comes to my Cloud Run service from Cloud Tasks, it should have this header value, and all my logs will be correlated correctly. However, when this second request comes, it looks like Cloud Run is overriding the header with a new trace id.
Is there a way to prevent this from happening? If not, what is the recommended solution to correlate all the logs (generated by service code and also the logs auto generated by GCP) in the steps described above?
Thanks for the help!

We found that passing along the traceparent header into the Cloud Tasks task works. The trace id is preserved and a new span/parent id is automatically assigned by Cloud Run.
task = {
    "http_request": {
        "url": url,
        "headers": {
            "traceparent": request.headers.get("traceparent", "")
        }
    }
}
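For context, a minimal sketch of how the task itself might be enqueued with the google-cloud-tasks client; the project, location and queue names below are placeholders:

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholder project/location/queue values.
parent = client.queue_path("my-project", "us-central1", "my-queue")
task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": url,  # the Cloud Run endpoint that does the async processing
        "headers": {
            # Forward the incoming traceparent so the same trace id is reused.
            "traceparent": request.headers.get("traceparent", ""),
        },
    }
}
client.create_task(request={"parent": parent, "task": task})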
Note it also appears to work with "X-Cloud-Trace-Context", but you have to split the value and pass along only the trace id (e.g. the Cloud Run header value looks like "trace_id/span_id;o=flags"; you have to split out just the trace_id and set that as the task header value). Otherwise Cloud Run seems to consider the header invalid and, as you mentioned, sets a whole new trace context.
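In code, that split might look something like this, against the task dict above:

# X-Cloud-Trace-Context arrives as "TRACE_ID/SPAN_ID;o=FLAGS";
# keep only the trace id before handing it to Cloud Tasks.
trace_id = request.headers.get("X-Cloud-Trace-Context", "").split("/")[0]
task["http_request"]["headers"]["X-Cloud-Trace-Context"] = trace_id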
As a related note: while this gets the right header into place, you still need to actually log the trace_id in some fashion for your logs to correlate. It looks to me like the logs generated by Cloud Run itself do this, but I had to configure my logger so that my own logs would also be correlated.
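For illustration, one way to do that on Cloud Run is to emit structured JSON logs carrying the special "logging.googleapis.com/trace" key that Cloud Logging correlates on; the fallback project id below is a placeholder:

import json
import os

def log_with_trace(message, request):
    # Cloud Run sets X-Cloud-Trace-Context on incoming requests.
    trace_header = request.headers.get("X-Cloud-Trace-Context", "")
    entry = {"message": message, "severity": "INFO"}
    if trace_header:
        trace_id = trace_header.split("/")[0]
        project = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project")
        entry["logging.googleapis.com/trace"] = f"projects/{project}/traces/{trace_id}"
    # One JSON object per stdout line becomes one structured log entry.
    print(json.dumps(entry))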

I don't think you can override the HTTP headers set by Cloud Tasks, but you can override the trace member in the log records sent to Stackdriver.
So you could include the original trace ID in the task payload and then override the trace in the logs produced by your Cloud Run endpoint which performs the real work.
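A sketch of that approach with the google-cloud-logging client; the "trace_id" payload field and the log name are made up for illustration:

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("webhook-worker")  # hypothetical log name

def handle_task(payload):
    # "trace_id" is whatever field you chose to carry the original
    # trace id in the task payload when the task was created.
    trace = f"projects/{client.project}/traces/{payload['trace_id']}"
    logger.log_text("processing webhook", trace=trace, severity="INFO")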

Related

AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically @aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
  CloudWatchLogsClient,
  DescribeLogStreamsCommand,
  GetLogEventsCommand
} = require("@aws-sdk/client-cloudwatch-logs");
// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
  region: process.env.AWS_REGION,
});
// APPLICATION CODE
const logGroupName = getLogGroupName();
// Get the most recently active log streams for the given log group.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
  descending: true,
  logGroupName,
  orderBy: 'LastEventTime',
  limit: 50,
}));
// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;
// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
  logGroupName,
  logStreamName,
}));
const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me yet, the AWS SDK doesn't seem to recognize that.
Here are some interesting details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of pretty small log streams, whereas env-b has three extremely large log streams for each of its long-lived EC2 instances. The logs are easily >1 MB in size.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit that the AWS SDK handles by returning 0 log events for env-b? I tried setting a limit on the GetLogEventsCommand above, but it still causes the AWS SDK to return 0 events for env-b.
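For reference, GetLogEvents is paginated: each call returns at most 1 MB (or 10,000 events) plus nextForwardToken/nextBackwardToken, and an empty page does not necessarily mean the stream has no events; you are expected to keep calling until the returned token stops changing. A rough sketch of draining a stream, shown with boto3, which exposes the same tokens as the JS SDK:

import boto3

client = boto3.client("logs", region_name="us-east-1")  # region is a placeholder

def get_all_events(log_group, log_stream):
    """Page through a stream until the forward token stops changing."""
    events, token = [], None
    while True:
        kwargs = {
            "logGroupName": log_group,
            "logStreamName": log_stream,
            "startFromHead": True,  # read oldest-first
        }
        if token:
            kwargs["nextToken"] = token
        resp = client.get_log_events(**kwargs)
        events.extend(resp["events"])
        if resp["nextForwardToken"] == token:
            break  # getting the same token back means the stream is exhausted
        token = resp["nextForwardToken"]
    return events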
Another interesting note: if I go to Amazon CloudWatch > Log Group and select env-a's Log Group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors on the console. This seems to confirm that the log stream's event count is simply too large for both the AWS console and AWS SDK to process, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?

Should the django health-check endpoint /ht/ be accessible to everybody?

From the documentation reported here I read
This project checks for various conditions and provides reports when anomalous behavior is detected. The following health checks are bundled with this project: cache, database, storage, disk and memory utilization (via psutil), AWS S3 storage, Celery task queue, Celery ping, RabbitMQ, Migrations.
and from the use case section
The primary intended use case is to monitor conditions via HTTP(S), with responses available in HTML and JSON formats. When you get back a response that includes one or more problems, you can then decide the appropriate course of action, which could include generating notifications and/or automating the replacement of a failing node with a new one.
And then
The /ht/ endpoint will respond with an HTTP 200 if all checks passed and an HTTP 500 if any of the tests failed.
From a security point of view: should this URL (https://example.com/ht) be reachable by everybody? It seems to give away quite a bit of information about the system's internals.
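One common way to limit exposure is to keep /ht/ private, e.g. behind a shared secret that only your monitoring system sends. A minimal Django middleware sketch; the header name and the HEALTH_CHECK_TOKEN setting are assumptions, not part of django-health-check:

# middleware.py
from django.conf import settings
from django.http import HttpResponseNotFound

class HealthCheckTokenMiddleware:
    """Return 404 for /ht/ unless the caller presents the shared token."""
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.path.startswith("/ht/"):
            token = request.headers.get("X-Health-Token", "")
            if token != getattr(settings, "HEALTH_CHECK_TOKEN", None):
                return HttpResponseNotFound()
        return self.get_response(request)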

AWS Storage Gateway: refreshCache "Too many requests have been sent to server"

I am calling the AWS Storage Gateway refreshCache method a bit too frequently, I guess (as the error message suggests), but I am not sure how long I need to wait before I can hit it again. Any help will be appreciated.
AWSStorageGateway gatewayClient = AWSStorageGatewayClientBuilder.standard().build();
RefreshCacheRequest cacheRequest = new RefreshCacheRequest();
cacheRequest.setFileShareARN(this.fileShareArn);
gatewayClient.refreshCache(cacheRequest);
com.amazonaws.services.storagegateway.model.InvalidGatewayRequestException: Too many requests have been sent to server. (Service: AWSStorageGateway; Status Code: 400; Error Code: InvalidGatewayRequestException; Request ID: f1ffa249-6908-4ae1-9f71-93fe7f26b2af)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
I think you can refer to the official documentation: https://docs.aws.amazon.com/storagegateway/latest/APIReference/API_RefreshCache.html
As it says:
When this API is called, it only initiates the refresh operation. When the API call completes and returns a success code, it doesn't necessarily mean that the file refresh has completed. You should use the refresh-complete notification to determine that the operation has completed before you check for new files on the gateway file share.
So I guess that after you call the AWS Storage Gateway refreshCache method, you must wait until the refresh operation has completed; if you call the method again during this period, an exception will be raised.
For the solution, you can refer to Monitoring Your File Share to set up a notification.
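If you just need the call to succeed eventually, a simple retry with backoff also works; a rough sketch with boto3 (the retry count and delays are arbitrary assumptions, not documented intervals):

import time
import boto3
from botocore.exceptions import ClientError

client = boto3.client("storagegateway")

def refresh_with_backoff(file_share_arn, max_attempts=5):
    """Retry RefreshCache with exponential backoff while a refresh is in flight."""
    delay = 30  # seconds; arbitrary starting point
    for _ in range(max_attempts):
        try:
            return client.refresh_cache(FileShareARN=file_share_arn)
        except ClientError as err:
            if err.response["Error"]["Code"] != "InvalidGatewayRequestException":
                raise
            time.sleep(delay)  # a refresh is likely still running; wait and retry
            delay *= 2
    raise RuntimeError("RefreshCache still throttled after retries")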

What is the best way to get the response time in OSB?

I am doing it like this:
Inside the OSB pipeline's message flow, at the beginning of the request, I assign the current time to a variable. Then in the response, I subtract that variable from the current time to calculate the response time. Finally, I use a reporting action to report this number.
I know OSB has a built-in monitoring tool that can display the response time for the proxy service, pipeline and business service. As you can see, my solution only includes the time from the beginning of the pipeline plus the business service, not the time the request and response messages spend going through the proxy service. Besides that, calculating it this way feels like a non-standard approach.
OSB provides a JMX API which can retrieve this built-in monitoring data, but using it would make our project more complicated.
If we want to use the OSB reporting action to report the response time, what is the best way to do it?
Just switch WebLogic to use the extended log format, and tell it to add time-taken to the list of tokens it logs on each response.
http://middlewaretechnologies.blogspot.com.au/2012/03/configure-extended-logging-in-http.html
or if you want to read the official docs:
http://docs.oracle.com/cd/E14571_01/web.1111/e13701/web_server.htm#CNFGD207
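For illustration, with extended logging enabled the access log carries a W3C-style fields directive; adding time-taken to the configured fields yields lines along these lines (the values are made up):

#Fields: date time time-taken cs-method cs-uri sc-status
2012-03-15 10:12:01 0.042 GET /osb/myProxyService 200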

Timestamp of server from a web service call

Is there a way that I can retrieve the timestamp of a web service call? I'm trying to get the time of the server hosting the web service.
The easiest thing to do is to log the timestamp in the server-side implementation of your service contract; you can use PostSharp to write attributes to take care of this aspect.
For instance, you can write a Trace attribute which simply logs a debug message when a method is invoked. Here's one I wrote a while back which tracks how long a method takes and logs a warning message if it takes longer than a set threshold:
http://theburningmonk.com/2010/03/aop-method-execution-time-watcher-with-postsharp/
I came across some 'trace' attribute examples before; if you want, I can look for them for you.
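The linked post is PostSharp/.NET, but the underlying idea is just a timing wrapper around each method. A language-agnostic sketch of the same thing as a Python decorator (the threshold value is arbitrary):

import functools
import logging
import time

def trace(threshold_ms=500):
    """Warn when the wrapped function runs longer than the threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    logging.warning("%s took %.1f ms", func.__name__, elapsed_ms)
        return wrapper
    return decorator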