How to redirect multiple ECS log streams into a single log stream in CloudWatch

I currently have my application running in ECS. I have enabled awslogs, specifying the Log group and the region. Everything works great: the logs are sent to the Log group and a Log stream is created. However, every time I restart the container, it creates a new Log stream.
Is there a way to have everything go into a single Log stream, instead of a new Log stream being created each time the container restarts?
I've been looking for a solution for a long time and I haven't found anything.
For example, instead of there being 2 Log streams, there would be only 1, no matter how many times the container is restarted.

The simplest way is to call the PutLogEvents API directly, which lets you choose the log stream name yourself. Beyond that, you can get as fancy as you want. You could use a FireLens sidecar container in your task to handle all events, using a logging API that writes directly to CloudWatch.
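For example, you can do this in Python with the boto3 CloudWatch Logs put_log_events call:
import time
import boto3
# The log group and log stream must already exist.
response = boto3.client("logs").put_log_events(
    logGroupName="your-log-group",
    logStreamName="your-log-stream",
    logEvents=[
        # timestamp is in milliseconds since the Unix epoch
        {"timestamp": int(time.time() * 1000), "message": "log message"},
    ],
)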

Related

AWS SDK for JavaScript CloudWatch Logs - GetLogEventsCommand isn't fetching logs, potentially due to a log stream size issue?

I have multiple Node.js applications deployed via AWS Elastic Beanstalk on the Docker platform. I can manually download the full logs for every environment without trouble via the AWS console. Let's say I have two AWS Elastic Beanstalk Environments: env-a and env-b.
I've started using the AWS SDK for JavaScript, specifically @aws-sdk/client-cloudwatch-logs, in a Node app so that I can programmatically fetch logs, render them in a custom UI, and do my own analysis as needed.
I'm running the following code in order to fetch the log events for a given app (pseudocode):
// IMPORTS
const {
  CloudWatchLogsClient,
  DescribeLogStreamsCommand,
  GetLogEventsCommand,
} = require("@aws-sdk/client-cloudwatch-logs");
// SETUP
const awsCloudWatchClient = new CloudWatchLogsClient({
  region: process.env.AWS_REGION,
});
// APPLICATION CODE
const logGroupName = getLogGroupName();
// Get the log streams for the given log group.
const logStreamRes = await awsCloudWatchClient.send(new DescribeLogStreamsCommand({
  descending: true,
  logGroupName,
  orderBy: 'LastEventTime',
  limit: 50,
}));
// For testing purposes, I'll just use the first log stream name I find.
const logStreamName = logStreamRes.logStreams[0].logStreamName;
// Get the log events for the first log stream.
const logEventRes = await awsCloudWatchClient.send(new GetLogEventsCommand({
  logGroupName,
  logStreamName,
}));
const logEvents = logEventRes.events;
Now, I can fetch the log events for env-a without trouble using this code. However, GetLogEventsCommand always returns an empty collection when I attempt to fetch the logs for env-b. If I download the logs manually via the AWS console, I can definitely see that logs exist - yet for a reason that isn't clear to me yet, the AWS SDK doesn't seem to recognize that.
Here's some interesting details that may help diagnose the issue.
env-a is configured in Elastic Beanstalk so that each new deploy (which happens potentially multiple times a day) replaces EC2 instances. On the other hand, env-b is configured so that new application code is deployed to existing EC2 instances without actually replacing them. Since log streams map to EC2 instances, env-a has a high number of pretty small log streams whereas env-b has three extremely large log streams, one for each of its long-lived EC2 instances. The logs are easily larger than 1 MB.
Considering that GetLogEventsCommand returns responses up to 1 MB in size, am I hitting some size limit that the AWS SDK handles by returning 0 log events for env-b? I tried setting a limit on the GetLogEventsCommand above, but the AWS SDK still returns 0 events for env-b.
Another interesting note: if I go to Amazon CloudWatch > Log Group and select env-a's Log Group, I can see the log events for every log stream without trouble. If I try to view the log events for env-b's three very large log streams, I run into "Rate exceeded" errors on the console. This seems to confirm that the log stream's event count is simply too large for both the AWS console and AWS SDK to process, though I'm not certain.
Is there anything I can do to get the AWS SDK to fetch env-b's logs? How can I further confirm that excessive log stream size is the culprit here? And if that's the case, is there anything I can do about it, e.g. purge logs?
Or could this be some other issue that I'm not seeing?
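One thing that may be worth ruling out before blaming stream size: the GetLogEvents API is documented as being able to return an empty page of events while more events are still available through the returned token, so a single call is not conclusive. Below is a minimal Python sketch of paging until the forward token stops changing. The function name and the max_pages guard are hypothetical; client is assumed to be a boto3 CloudWatch Logs client (or anything exposing the same get_log_events method).

```python
def fetch_all_log_events(client, log_group_name, log_stream_name, max_pages=1000):
    """Page through one log stream until the forward token stops changing.

    `client` is assumed to behave like boto3.client("logs").
    `max_pages` is a hypothetical safety cap, not an AWS parameter.
    """
    events = []
    kwargs = {
        "logGroupName": log_group_name,
        "logStreamName": log_stream_name,
        "startFromHead": True,  # read oldest events first
    }
    previous_token = None
    for _ in range(max_pages):
        response = client.get_log_events(**kwargs)
        # A page may be empty even though later pages contain events.
        events.extend(response.get("events", []))
        token = response.get("nextForwardToken")
        # The end of the stream is signalled by getting back the token we sent.
        if token is None or token == previous_token:
            break
        previous_token = token
        kwargs["nextToken"] = token
    return events
```

With boto3 installed, this could be called as fetch_all_log_events(boto3.client("logs"), "my-log-group", "my-log-stream"), where both names are placeholders. The same token-following loop should carry over to GetLogEventsCommand in the JavaScript SDK.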

How can I manually specify an X-Cloud-Trace-Context header value to correlate and trace logs in separate Cloud Run requests?

I'm using Cloud Run and Cloud Tasks to do some async processing of webhooks. When I get a request to my Cloud Run service, I queue up a task in my Cloud Tasks queue and return a response from my service immediately. Cloud Tasks will then trigger my service again (different endpoint) and do some processing. I want to correlate all the logs in these steps by using the same trace id, but it is not working.
When creating a task in Cloud Tasks, I request it to send the X-Cloud-Trace-Context header and I fill it with the original request's X-Cloud-Trace-Context header value. Theoretically, when the request comes to my Cloud Run service from Cloud Tasks, it should have this header value, and all my logs will be correlated correctly. However, when this second request comes, it looks like Cloud Run is overriding the header with a new trace id.
Is there a way to prevent this from happening? If not, what is the recommended solution to correlate all the logs (generated by service code and also the logs auto generated by GCP) in the steps described above?
Thanks for the help!

We found that passing along the traceparent header into the Cloud Task works. The trace ID is preserved and a new span/parent ID is automatically assigned by Cloud Run.
task = {
    "http_request": {
        "url": url,
        "headers": {
            # Forward the incoming W3C trace context to the task request.
            "traceparent": request.headers.get("traceparent", "")
        }
    }
}
Note that it also appears to work with "X-Cloud-Trace-Context", but you have to split the value and pass along only the trace ID. For example, the Cloud Run header value looks like "trace_id/span_id;flags", so you have to split out just the trace_id and set that as the task header value. Otherwise it seems that Cloud Run considers the header invalid and, as you mentioned, sets a whole new trace context.
As a related note: while this gets the right header into place, you still need to actually log the trace_id in some fashion for your logs to correlate. It looks to me like the logs generated by Cloud Run itself do this, but I had to configure my logger so that my logs would also be correlated.
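As a rough illustration of the two steps described above (pulling the trace ID out of the header, then attaching it to your own log lines), here is a minimal Python sketch. The function names and the project_id argument are hypothetical. It assumes headers is a plain dict with these exact key spellings (framework header objects are usually case-insensitive) and relies on the structured logging field logging.googleapis.com/trace, whose value has the form projects/PROJECT_ID/traces/TRACE_ID.

```python
import json
import re

# W3C traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$")


def extract_trace_id(headers):
    """Return the trace ID from traceparent or X-Cloud-Trace-Context, else None."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip().lower())
    if match:
        return match.group(1)
    # Legacy header looks like "trace_id/span_id;flags": keep the part before "/".
    legacy = headers.get("X-Cloud-Trace-Context", "")
    trace_id = legacy.split("/", 1)[0].strip()
    return trace_id or None


def structured_log(message, trace_id, project_id, severity="INFO"):
    """Build a JSON log line carrying the trace field used for correlation."""
    entry = {"severity": severity, "message": message}
    if trace_id:
        entry["logging.googleapis.com/trace"] = (
            f"projects/{project_id}/traces/{trace_id}"
        )
    return json.dumps(entry)
```

Printing the result of structured_log(...) to stdout from the service is one way to emit it; a logging formatter that adds the same field would work equally well.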

I don't think you can override the HTTP headers set by Cloud Tasks, but you can override the trace member in the log records sent to Stackdriver.
So you could include the original trace ID in the task payload and then override the trace in the logs produced by your Cloud Run endpoint which performs the real work.

How to track distributed tasks progress

Here is my case:
When my server receives a request, it triggers distributed tasks, in my case many AWS Lambda functions (the peak could be 3000).
I need to track each task's progress / status, i.e. pending, running, success, error.
My server could have many replicas.
I still want to know the task progress / status even if one of my server replicas goes down.
My current design:
I chose AWS S3 as my helper.
When a task starts to execute, it creates a marker file in a special folder on S3, e.g. a running folder.
When the task fails or succeeds, it moves the marker file from the running folder to the fail folder or the success folder.
I check the marker files on S3 to check the progress of the tasks.
The problems:
There is a limit on concurrent access to AWS S3.
My case is likely to exceed that limit some day.
Attempted solutions:
I have tried my best to reduce the number of requests to S3.
I don't want to track the progress by storing data in my DB because my DB is already under a heavy workload.
To be honest, it is kind of weird to use marker files on S3 to track the progress of the tasks. However, it has worked so far.
Are there any recommendations?
Thanks in advance!

This sounds like a perfect application of persistent event queueing, specifically Kinesis. As each Lambda starts it generates a “starting” event on Kinesis. When it succeeds or fails, it generates the appropriate event. You could even create progress events along the way if you want to see how far they have gotten.
Your server can then monitor the number of starting events against ending events (success or failure) until these two numbers are equal. It can query the error events to see which processes failed and why. All servers can query the same events without disrupting each other, and any server can go down and recover without losing data.
Make sure to put an Origination Key on events that are supposed to be grouped together so they don't get mixed up with events from a subsequent request. Also, each Lambda should be given its own key so you can trace progress per Lambda. GUIDs are perfect for this.
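The bookkeeping described above can be sketched in plain Python, independent of how the events are transported. Everything named here (the event field names, make_event, summarize, the status strings) is a hypothetical illustration rather than a Kinesis API. With boto3, each event could be sent with kinesis.put_record(StreamName=..., Data=json.dumps(event).encode(), PartitionKey=task_id), where the stream name is yours to choose.

```python
import uuid

TERMINAL = {"success", "error"}


def new_key():
    """GUID used as an origination key (per request) or a task key (per Lambda)."""
    return str(uuid.uuid4())


def make_event(origination_key, task_id, status):
    """Event a Lambda would emit when it starts, succeeds, or fails."""
    return {"origination_key": origination_key, "task_id": task_id, "status": status}


def summarize(events, origination_key):
    """Compare started tasks against finished ones for a single request."""
    latest = {}
    for event in events:
        if event["origination_key"] != origination_key:
            continue  # belongs to another request
        task_id = event["task_id"]
        # Terminal states are sticky, so out-of-order delivery cannot undo them.
        if latest.get(task_id) in TERMINAL:
            continue
        latest[task_id] = event["status"]
    finished = sum(1 for status in latest.values() if status in TERMINAL)
    return {
        "started": len(latest),
        "finished": finished,
        "failed": sorted(t for t, s in latest.items() if s == "error"),
        "done": len(latest) > 0 and finished == len(latest),
    }
```

Any server replica can run summarize over the same events and reach the same answer, which is the property the S3 marker files were providing. Note that done only compares started against finished; if the server knows how many tasks it launched, comparing finished against that number is stricter.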

Cloudwatch logs - No Event Data after time elapses

I've looked on the AWS forums and elsewhere but haven't found a solution. I have a lambda function that, when invoked, creates a log stream which populates with log events. After about 12 hours or so, the log stream is still present, but when I open it, I see the following:
The link explains how to start sending event data, but I already have this set up, and I am sending event data, it just disappears after a certain time period.
I'm guessing there is some setting somewhere (either for max storage allowed or for whether logs get purged), but if there is, I haven't found it.

Another reason for missing data in the log stream might be a corrupted agent-state file. First check the agent log:
vim /var/log/awslogs.log
If you find something like "Caught exception: An error occurred (InvalidSequenceTokenException) when calling the PutLogEvents operation: The given sequenceToken is invalid. The next expected sequenceToken is:" you can regenerate the agent-state file as follows:
sudo rm /var/lib/awslogs/agent-state
sudo service awslogs stop
sudo service awslogs start

TL;DR: Just use the CLI. See Update 2 below.
This is really bizarre but I can replicate it...
I un-checked the "Expire Events After" box, and lo and behold I was able to open older log streams. What seems REALLY odd is that if I choose to display the "Stored Bytes" data, many of the files are listed at 0 bytes even though they have log events:
Update 1:
This solution no longer works as I can only view the log events in the first two log streams. What's more is that the Stored Bytes column displays different (and more accurate) data:
This leads me to believe that AWS made some kind of update.
UPDATE 2:
Just use the CLI. I've verified that I can retrieve log events from the CLI that I cannot retrieve via the web console.
First install the CLI (if you haven't already) and use the following command:
aws logs get-log-events --log-group-name NAME-OF-LOGGROUP --log-stream-name LOG-STREAM-NAME
Be sure to quote or escape special characters such as /, [, $ etc. in the names.

Forwarding journald to Cloudwatch Logs

I'm a newbie to CentOS and wanted to know the best way to ship journal logs to CloudWatch Logs.
My thoughts so far are:
Use a FIFO to read the journal logs and ingest them into CloudWatch Logs. It looks like this could come with drawbacks, as logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?

Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, so CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which puts it out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private ip address so that you can easily find the log you are looking for.
We also include a Packer file to show you how to build and configure a systemd-cloud-watch image with EC2/CentOS/systemd. There is no question about how to configure systemd because we have a working example.

Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use Go's channels to read the logs concurrently and batch the writes.

Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS Cloudwatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of. The max batch size for PutLogEvents is 1 MB and there is a max of 5 requests per second per stream. See the CloudWatch Logs quotas documentation for details.
To help with that, in my setup each journald unit has its own log stream. Also, the Vector journald source includes a lot of fields, so I used a Vector transform to remove all the ones I didn't need. However, I'm still running into rate limits.
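For anyone calling PutLogEvents themselves rather than through Vector, the batch limit can be respected with a small splitter like the sketch below. It assumes the documented accounting for a batch: at most 1,048,576 bytes, counted as the UTF-8 size of each message plus 26 bytes per event, and at most 10,000 events. The function name is hypothetical, and each event is assumed to be a dict with a "message" string, as in the boto3 logEvents parameter.

```python
MAX_BATCH_BYTES = 1_048_576   # PutLogEvents batch size limit
PER_EVENT_OVERHEAD = 26       # bytes counted per event on top of the message
MAX_BATCH_EVENTS = 10_000     # maximum events per batch


def batch_log_events(events):
    """Split events into batches that fit within the PutLogEvents limits."""
    batches, current, current_size = [], [], 0
    for event in events:
        size = len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD
        if size > MAX_BATCH_BYTES:
            raise ValueError("a single event exceeds the batch size limit")
        too_big = current_size + size > MAX_BATCH_BYTES
        too_many = len(current) >= MAX_BATCH_EVENTS
        if current and (too_big or too_many):
            batches.append(current)
            current, current_size = [], 0
        current.append(event)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch can then be passed as logEvents to one put_log_events call. Pacing those calls to stay under the per-stream request rate is a separate concern that this sketch does not handle.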
EDIT 10/6/2020
I have this running in production now. I had to upgrade Vector from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max bytes per batch requirement for AWS CloudWatch Logs. As for the rate limit issues I was experiencing, it turns out I wasn't having any issues. I was getting this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What that actually means is that Vector is temporarily pausing sending logs to respect the rate limit of the sink. Also, each CloudWatch log stream can consume up to 18 GB per hour, which is fine for my requirement of 30 GB per day spread over 30 different services on 6 VMs.
One issue I did run into was a CPU spike on our main API service. I had a source for each service unit to tail the journald logs. I believe this somehow blocked our API from writing to journald (I'm not 100% sure though). What I did was define one source and specify multiple units to follow, so there was only one command tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the log group and log stream based on the service name. Below is an example configuration:
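As a sanity check on the 18 GB per hour figure, here is the arithmetic in Python. This assumes the figure is derived from the two limits quoted earlier (5 requests per second per stream and a 1 MB maximum batch), which is my reading rather than something the limits page states in this form; the function names are hypothetical.

```python
REQUESTS_PER_SECOND = 5       # per-stream PutLogEvents rate quoted above
MAX_BATCH_BYTES = 1_048_576   # 1 MB maximum batch size
SECONDS_PER_HOUR = 3600


def max_stream_bytes_per_hour():
    """Upper bound on bytes one log stream can accept per hour."""
    return REQUESTS_PER_SECOND * MAX_BATCH_BYTES * SECONDS_PER_HOUR


def required_bytes_per_hour(gb_per_day):
    """Average hourly volume for a given daily volume in decimal GB."""
    return gb_per_day * 1e9 / 24


ceiling = max_stream_bytes_per_hour()    # 18_874_368_000 bytes, roughly 18 GB
needed = required_bytes_per_hour(30)     # 1.25e9 bytes, about 1.25 GB
print(f"ceiling per stream: {ceiling / 1e9:.1f} GB/h, needed: {needed / 1e9:.2f} GB/h")
```

So even if all 30 GB per day went to a single stream, the average load would sit far below the per-stream ceiling. Bursts matter more than the average, which is where Vector's pausing comes in.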
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine getting the logs and sending them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it's never caused any issues. Once we get vector working reliably we'll retire the Elastic Stack, but for now we have them running side by side.
