AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically @aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
  CloudWatchLogsClient,
  DescribeLogStreamsCommand,
  GetLogEventsCommand,
} = require("@aws-sdk/client-cloudwatch-logs");

// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
  region: process.env.AWS_REGION,
});

// APPLICATION CODE
const logGroupName = getLogGroupName();

// Get the log streams for the given log group, most recent first.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
  descending: true,
  logGroupName,
  orderBy: 'LastEventTime',
  limit: 50,
}));

// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;

// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
  logGroupName,
  logStreamName,
}));
const logEvents = logEventRes.events;
const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me, the AWS SDK doesn't seem to recognize that.
Here are some details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of fairly small log streams, whereas env-b has three extremely large log streams, one for each of its long-lived EC2 instances. The logs are easily >1 MB in size.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit that the AWS SDK handles by returning 0 log events? I tried setting a limit on the GetLogEventsCommand above, but the SDK still returns 0 events for env-b.
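For reference, here's the pagination pattern I understand GetLogEventsCommand to support via the nextForwardToken it returns - a rough sketch I haven't fully verified yet, with getAllLogEvents being just an illustrative helper name:
const { CloudWatchLogsClient, GetLogEventsCommand } = require("@aws-sdk/client-cloudwatch-logs");

const client = new CloudWatchLogsClient({ region: process.env.AWS_REGION });

async function getAllLogEvents(logGroupName, logStreamName) {
  const events = [];
  let nextToken;
  while (true) {
    const res = await client.send(new GetLogEventsCommand({
      logGroupName,
      logStreamName,
      startFromHead: true, // read oldest events first instead of starting from the tail
      nextToken,
    }));
    events.push(...(res.events ?? []));
    // The API signals the end of the stream by returning the same forward token again.
    if (!res.nextForwardToken || res.nextForwardToken === nextToken) break;
    nextToken = res.nextForwardToken;
  }
  return events;
}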
Another interesting note: if I go to Amazon CloudWatch > Log groups and select env-a's log group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors in the console. This seems to suggest that these log streams are simply too large for both the AWS console and the AWS SDK to handle, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?

Related

How to redirect multiple ECS log streams into a single log stream in CloudWatch

I currently have my application running in ECS. I have enabled the awslogs agent, indicating the Log group and the region. Everything works great: it sends the logs to the Log group and creates a Log stream. However, every time I restart the container, it creates a new Log stream.
Is there a way to have everything go into a single Log stream instead of creating a new one each time the container restarts?
I've been looking for a solution for a long time and I haven't found anything.
For example, instead of there being 2 Log streams after a container restart, there would be only 1.
The simplest way is to use the PutLogEvents API directly. Beyond that, you can get as fancy as you want. You could use a FireLens sidecar container in your task to handle all events using a logging API that writes directly to CloudWatch.
For example, you can do this in Python with the boto3 CloudWatch Logs put_log_events call:
import boto3

# put_log_events expects timestamps in milliseconds since the Unix epoch.
response = boto3.client("logs").put_log_events(
    logGroupName="your-log-group",
    logStreamName="your-log-stream",
    logEvents=[
        {"timestamp": 123, "message": "log message"},
    ],
)

How can I manually specify an X-Cloud-Trace-Context header value to correlate and trace logs across separate Cloud Run requests?

I'm using Cloud Run and Cloud Tasks to do some async processing of webhooks. When I get a request to my Cloud Run service, I queue up a task in my Cloud Tasks queue and return a response from my service immediately. Cloud Tasks will then trigger my service again (different endpoint) and do some processing. I want to correlate all the logs in these steps by using the same trace id, but it is not working.
When creating a task in Cloud Tasks, I request it to send the X-Cloud-Trace-Context header and I fill it with the original request's X-Cloud-Trace-Context header value. Theoretically, when the request comes to my Cloud Run service from Cloud Tasks, it should have this header value, and all my logs will be correlated correctly. However, when this second request comes, it looks like Cloud Run is overriding the header with a new trace id.
Is there a way to prevent this from happening? If not, what is the recommended solution to correlate all the logs (generated by service code and also the logs auto generated by GCP) in the steps described above?
Thanks for the help!
We found that passing along the traceparent header into the Cloud Task works. The trace ID is preserved, and a new span/parent ID is automatically assigned by Cloud Run:
task = {
    "http_request": {
        "url": url,
        "headers": {
            "traceparent": request.headers.get('traceparent', "")
        }
    }
}
Note that it also appears to work with "X-Cloud-Trace-Context", but you have to split the value and pass along only the trace ID (e.g. the Cloud Run header value looks like "trace_id/span_id;flags" -- you have to split out just the trace_id and set that as the task header value). Otherwise, Cloud Run seems to consider the header invalid and, as you mentioned, sets a whole new trace context.
As a related note: while this gets the right header into place, you still need to actually log the trace_id in some fashion for your logs to correlate. The logs generated by Cloud Run itself appear to do this, but I had to configure my logger so that my own logs would also be correlated.
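For example, if your service happens to run on Node.js, a minimal sketch of that logger setup could look like the following - the Express handler and the GOOGLE_CLOUD_PROJECT environment variable are assumptions for illustration; the key part is the logging.googleapis.com/trace field, which Cloud Logging uses to correlate structured log entries with a trace:
const express = require("express");
const app = express();

app.post("/process-task", (req, res) => {
  // Pull the trace ID out of the incoming header ("trace_id/span_id;flags").
  const traceHeader = req.header("X-Cloud-Trace-Context") || "";
  const traceId = traceHeader.split("/")[0];

  // Cloud Run forwards JSON lines on stdout to Cloud Logging; the
  // "logging.googleapis.com/trace" field is what ties entries to a trace.
  const log = (message) =>
    console.log(
      JSON.stringify({
        severity: "INFO",
        message,
        // GOOGLE_CLOUD_PROJECT is assumed to be set on the service.
        "logging.googleapis.com/trace": traceId
          ? `projects/${process.env.GOOGLE_CLOUD_PROJECT}/traces/${traceId}`
          : undefined,
      })
    );

  log("Processing task");
  res.sendStatus(200);
});

app.listen(process.env.PORT || 8080);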
I don't think you can override the HTTP headers set by Cloud Tasks, but you can override the trace member in the log records sent to Stackdriver.
So you could include the original trace ID in the task payload and then override the trace in the logs produced by the Cloud Run endpoint that performs the real work.

How can I avoid "IN_USE_ADDRESSES" errors when starting multiple Dataflow jobs from the same template?

I have created a Dataflow template which allows me to import data from a CSV file in Cloud Storage into BigQuery. I use Cloud Functions for Firebase to create jobs from this template at a certain time every day. This is the code in the function (with some irrelevant parts removed):
const filePath = object.name?.replace(".csv", "");

// Exit function if file changes are in temporary or staging folder
if (
  filePath?.includes("staging") ||
  filePath?.includes("temp") ||
  filePath?.includes("templates")
)
  return;

const dataflow = google.dataflow("v1b3");
const auth = await google.auth.getClient({
  scopes: ["https://www.googleapis.com/auth/cloud-platform"],
});

let request = {
  auth,
  projectId: process.env.GCLOUD_PROJECT,
  location: "asia-east1",
  gcsPath: "gs://my_project_bucket/templates/csv_to_bq",
  requestBody: {
    jobName: `csv-to-bq-${filePath?.replace(/\//g, "-")}`,
    environment: {
      tempLocation: "gs://my_project_bucket/temp",
    },
    parameters: {
      input: `gs://my_project_bucket/${object.name}`,
      output: biqQueryOutput,
    },
  },
};

return dataflow.projects.locations.templates.launch(request);
This function is triggered every time any file is written to Cloud Storage. I am working with sensors, so I have to import at least 89 different data sets, i.e. different CSV files, within 15 minutes.
The whole process works fine if there are only 4 jobs working at the same time. However, when the function tried to create the fifth job, the API returned many different types of errors.
Error 1 (not exact since somehow I cannot find the error anymore):
Error Response: [400] The following quotas were exceeded: IN_USE_ADDRESSES
Error 2:
Dataflow quota error for jobs-per-project quota. Project *** is running 25 jobs.
Please check the quota usage via GCP Console.
If it exceeds the limit, please wait for a workflow to finish or contact Google Cloud Support to request an increase in quota.
If it does not, contact Google Cloud Support.
Error 3:
Quota exceeded for quota metric 'Job template requests' and limit 'Job template requests per minute per user' of service 'dataflow.googleapis.com' for consumer 'project_number:****'.
I know I can space out starting jobs to avoid Errors 2 and 3. However, I don't know how to start jobs in a way that won't fill up the addresses. So, how do I avoid that? If I can't, what approach should I use?
I had answered this in another post here - Which Compute Engine quotas need to be updated to run Dataflow with 50 workers (IN_USE_ADDRESSES, CPUS, CPUS_ALL_REGIONS ..)?.
Let me know if that helps.
This is a GCP external IP quota issue, and the best solution is not to use any public IPs for Dataflow jobs as long as your pipeline resources stay within GCP networks.
To run Dataflow jobs without public IPs:
Create or update your subnetwork to allow Private Google Access. This is fairly simple to do using the console: VPC network > Subnetworks > tick Enable Private Google Access.
In the parameters of your Cloud Dataflow job, specify --usePublicIps=false and --network=[NETWORK] or --subnetwork=[SUBNETWORK] (see the sketch after this note for the templated-job equivalent).
Note: for internal IP IN_USE_ADDRESSES errors, just change your subnet CIDR range to accommodate more addresses; for example, 20.0.0.0/16 will give you close to 65k internal IP addresses.
This way, you will never exceed your internal IP range.
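For a templated job launched through the API, as in the question's Cloud Function, I believe the equivalent settings go into the RuntimeEnvironment block of templates.launch (ipConfiguration, network, and subnetwork are RuntimeEnvironment fields). A rough sketch, reusing the variables from the question's code and with a placeholder subnetwork path:
let request = {
  auth,
  projectId: process.env.GCLOUD_PROJECT,
  location: "asia-east1",
  gcsPath: "gs://my_project_bucket/templates/csv_to_bq",
  requestBody: {
    jobName: `csv-to-bq-${filePath?.replace(/\//g, "-")}`,
    environment: {
      tempLocation: "gs://my_project_bucket/temp",
      // Keep workers on internal IPs only (requires Private Google Access on the subnet).
      ipConfiguration: "WORKER_IP_PRIVATE",
      // Placeholder subnetwork path - replace with your own region and subnet.
      subnetwork: "regions/asia-east1/subnetworks/my-subnet",
    },
    parameters: {
      input: `gs://my_project_bucket/${object.name}`,
      output: biqQueryOutput,
    },
  },
};

return dataflow.projects.locations.templates.launch(request);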

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the react-native-aws-cognito-js library in our code. The apps were working fine until the last 2 days; they are now experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause of this, and what is the solution?
According to this AWS doc (in Chinese), CloudWatch will not write to the log when the error is due to insufficient IPs/ENIs. That explains the increase in errors with no corresponding logs in CloudWatch.
Update 2
We have found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have many users log in to the server due to the "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all network interfaces in the VPC?
Short answer: CloudTrail.
Long answer, with a suggestion:
Assuming your application code is fine, the most likely cause of your 500 errors is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to prove that the limits are the issue, I would suggest adding some logging around the API call yourself; then, in development, you could hit your app/API with a high number of calls, and most likely you will see the 500 error once you exceed the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done

Forwarding journald to Cloudwatch Logs

I'm a newbie to CentOS and wanted to know the best way to forward journal logs to CloudWatch Logs.
My thoughts so far are:
Use a FIFO to parse the journal logs and ingest them into CloudWatch Logs - it looks like this could come with drawbacks, where logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?
Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, and CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which puts it out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private IP address so that you can easily find the log you are looking for.
We also include a Packer file to show you how to build and configure a systemd-cloud-watch image with EC2/CentOS/systemd. There is no question about how to configure systemd because we have a working example.
Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use Go's concurrent channels to read the logs and batch the writes.
Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS CloudWatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of: the max batch size for PutLogEvents is 1 MB, and there is a max of 5 requests per second per stream. See the limits here.
To help with that, in my setup each journald unit has its own log stream. Also, the Vector journald source includes a lot of fields, so I used a Vector transform to remove all the ones I didn't need. However, I'm still running into rate limits.
EDIT 10/6/2020
I have this running in production now. I had to update the version of Vector I was using from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max-bytes-per-batch requirement for AWS CloudWatch Logs. As for the rate limit issues I was experiencing, it turns out I wasn't actually having any. I was getting this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What that actually means is that Vector is temporarily pausing sending logs to respect the rate limit of the sink. Also, each CloudWatch log stream can consume up to 18 GB per hour, which is fine for my 30 GB per day requirement for over 30 different services on 6 VMs.
One issue I did run into was causing the CPU to spike on our main API service. I had a source for each service unit to tail the journald logs. I believe this somehow blocked our API from being able to write to journald (not 100% sure, though). What I did was have one source and specify multiple units to follow, so there was only one command tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the Log Group and Log Stream based on the service name. Below is an example configuration:
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine to get the logs and send them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it has never caused any issues. Once we get Vector working reliably, we'll retire the Elastic Stack, but for now we have them running side by side.