I use an AWS MSK cluster with broker logging to CloudWatch turned on. Logging works and I can see the brokers' logs. We have some topics with cleanup.policy=compact and some with cleanup.policy=delete. The system has been running on the new cluster for about 2 weeks now.
From my research (e.g. https://zendesk.engineering/an-investigation-into-kafka-log-compaction-5e520f4291f0) I see that Kafka should be running the log cleaner and that this activity should leave some traces in the logs. However, in my CloudWatch log group I cannot find the word "cleaner" or "cleaned", and I cannot find any trace of the log cleaner running.
Is the log cleaner running at all? It obviously should be, but I can't find anything in the logs to confirm it, and we also have a lot of messages that have been eligible for cleanup for about 2 weeks now but still have not been cleaned.
The Kafka cluster version is 2.8.1.
It is quite likely these logs are not being shown in MSK, since by default they do not go to the main broker log. From https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-log-LogCleaner.html:
Please note that Kafka comes with a preconfigured kafka.log.LogCleaner logger in config/log4j.properties:
log4j.appender.cleanerAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.cleanerAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.cleanerAppender.File=${kafka.logs.dir}/log-cleaner.log
log4j.appender.cleanerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.cleanerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.logger.kafka.log.LogCleaner=INFO, cleanerAppender
log4j.additivity.kafka.log.LogCleaner=false
That means the LogCleaner logs go to the logs/log-cleaner.log file at INFO level and are not added to the main logs (because log4j.additivity is set to false for that logger).
It is a bit misleading though: the LogCleaner only takes care of compacted topics, and I'm not sure where (or at which log level, since AWS MSK only exports INFO-level logs) the deletion of messages in topics with the delete cleanup policy is logged.
I would contact AWS support to find out whether there is a way to access these logs, or what they do with them.
Alternatively, you could set up open monitoring with Prometheus, which exposes the metrics Kafka exports via JMX. If enabled, there should be a metric (max-clean-time-secs) that will at least tell you whether the cleaner is running, and you may get other useful information to troubleshoot your issue.
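If you do enable open monitoring, one quick hedged check that the cleaner thread is alive is to scrape the Prometheus JMX exporter that MSK exposes on each broker and grep for the LogCleaner metrics. This is only a sketch and assumes the default open-monitoring port (11001) and that the broker is reachable from your client machine:
# Broker hostname is an example - take one from your bootstrap brokers string
BROKER=b-1.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com
# MSK open monitoring exposes the JMX exporter on port 11001; if the cleaner is running,
# LogCleaner metrics such as max-clean-time-secs should appear here
curl -s "http://${BROKER}:11001/metrics" | grep -i logcleaner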
Related
A task is running for a few seconds before terminating, I don't know why, and it's not pushing any logs.
I'm using the "awslogs" driver and the log group exists in CloudWatch.
The "Logs" tab is empty. The log-stream is created in CW but it's devoid of actual log events. There are also no results under Insights for that stream.
The task role has the permissions mentioned at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html .
Any idea what the deal is with the logs?
The command wasn't valid, nor was it comma-separated. It was failing too early in the workflow to log anything, yet after the point where any other deployment issue would have been identified, so it looked successful when in reality it was never even running. Interestingly, it would still take around a minute to terminate, so maybe that includes the overhead of pulling the image.
The timestamps indicate that the task started and exited after a few seconds. awslogs only sends logs once the container has successfully started, so in this case it may not help. You can follow step 6 of the documentation to diagnose. Specifically, if you have a container that has stopped, expand the container and inspect the Status reason row to see what caused the task state to change. In most cases, that will lead you to the actual cause.
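If you prefer the CLI over the console, a rough equivalent of reading the Status reason row is to list the stopped tasks and pull out their stop reasons; note that ECS only keeps stopped tasks visible for a short time. A sketch, with the cluster name being an example:
# List recently stopped tasks in the cluster
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
# Describe one of them and extract the task-level and container-level stop reasons
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query 'tasks[].{stoppedReason:stoppedReason,containerReasons:containers[].reason}'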
I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings I see that a few tasks failed during the night without any apparent reason.
When checking the logs in the UI I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is configured with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no Start Date, Job Id or worker (Hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a larger machine type.
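To confirm this yourself, you can point kubectl at the GKE cluster behind Composer and look at the airflow-worker pods; evicted pods show up as Failed with an "Evicted" reason. A minimal sketch (the namespace and pod name are placeholders):
# Look for airflow-worker pods that are not Running
kubectl get pods --all-namespaces | grep airflow-worker
# Inspect a suspicious pod; the status/reason of an evicted pod usually mentions memory pressure
kubectl describe pod <airflow-worker-pod-name> -n <namespace>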
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
I have a fargate task which I have scheduled to run with CloudWatch Event rules, and output a timestamp to a database on a successful run. It also outputs a logfile to CloudWatch for every time it runs.
However, there was 1 time where the log file was not created, and the database not updated. I suspect the task was never even started, or had failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify I'm looking for a way to read this information in a log file somewhere. I know I can see failed task reason in the web console, but that's only for relatively recent tasks.
I have posted the same question on Reddit: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and on the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=884638
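One way to keep a durable record of stopped tasks, beyond what the console retains, is to match the ECS "Task State Change" events with a CloudWatch Events rule and send them to a target such as a CloudWatch Logs group or a Lambda function; the event detail includes the stopped reason. A sketch of the rule, where the rule name is just an example:
# Match every ECS task state change; the STOPPED events carry stoppedReason in their detail
aws events put-rule --name ecs-task-state-changes \
  --event-pattern '{"source":["aws.ecs"],"detail-type":["ECS Task State Change"]}'
# Attach a target with "aws events put-targets" (e.g. a CloudWatch Logs group or a Lambda)
# so the events are kept even after the task is no longer visible in the console.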
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired Task and click its GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
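That particular error usually means the awslogs-region option is missing from the awslogs log configuration in the task definition. If you hit something similar, a quick check from the CLI (the task definition family name below is just an example) is:
# Inspect the log configuration of the task definition's containers
aws ecs describe-task-definition --task-definition my-task \
  --query 'taskDefinition.containerDefinitions[].logConfiguration'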
Try going to "CloudWatch -> Logs -> Insights" and clicking "Run Query".
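The default Insights query is usually enough to confirm whether anything reached the log group at all. Roughly, either in the console or via the CLI (the log group name is a placeholder, and date -d assumes GNU date):
# Run the default Insights query against the task's log group for the last hour
aws logs start-query \
  --log-group-name /ecs/my-task \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | sort @timestamp desc | limit 20'
# Fetch the results once the query finishes:
# aws logs get-query-results --query-id <query-id>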
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was that the security group used for the task had been deleted. Hope this helps if anyone has a similar issue.
We have 2 React Native apps that use AWS Cognito for authentication, via the react-native-aws-cognito-js library. The apps were working fine until the last 2 days; they are now experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause, and what is the solution?
According to this AWS doc (in Chinese), CloudWatch will not receive logs when the error is due to insufficient IP addresses/ENIs. That explains the increase in errors with no corresponding logs in CloudWatch.
Update 2
We found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have many users log in because of the "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" events and very few "DeleteNetworkInterface" events. How can I "clean up / reset" all the network interfaces in the VPC?
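If the interfaces really are leaked (left in the "available" state and no longer attached to anything), they can be listed and deleted from the CLI. A cautious sketch; check each ENI's description before deleting anything, since whether an interface is safe to remove is an assumption only you can verify:
# List network interfaces that are not attached to anything (status "available")
aws ec2 describe-network-interfaces \
  --filters Name=status,Values=available \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Description]' --output table
# Delete a specific unattached interface once you are sure it is safe to remove (example ID)
aws ec2 delete-network-interface --network-interface-id eni-0123456789abcdef0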
Short answer: CloudTrail.
Long answer, with a suggestion:
Assuming your application code is fine, the most likely cause of your 500 errors is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to prove it is the limits, I would suggest first adding some logging around the API call yourself; then, in development, hit your app/API with a high number of calls, and you will most likely see the 500 error caused by the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done
I'm a newbie to CentOS and wanted to know the best way to ship journal logs to CloudWatch Logs.
My thought processes so far are:
Use a FIFO to read the journal logs and ingest them into CloudWatch Logs. This looks like it could come with drawbacks: logs could be dropped if we hit buffering limits.
Forward journal logs to syslog and send the syslog output to CloudWatch Logs.
The idea is essentially to have everything logging to journald as JSON and then forward this across to CloudWatch Logs.
What is the best way to do this? How have others solved this problem?
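Whichever forwarder ends up being used, it helps to look at the structured JSON records journald already produces, since that is what would be shipped. For example, to inspect a unit's output as JSON (the unit name is just an example):
# Show the most recent journal entry for a unit in the JSON form a forwarder would consume
journalctl -u sshd -n 1 -o json --no-pager
# Follow the journal as a continuous JSON stream
journalctl -f -o json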
Take a look at https://github.com/advantageous/systemd-cloud-watch
We had problems with journald-cloudwatch-logs. It just did not work for us at all.
It does not limit the size of the message or commandLine that it sends to CloudWatch, so CloudWatch sends back an error that journald-cloudwatch-logs cannot handle, which puts it out of sync.
systemd-cloud-watch is stateless and it asks CloudWatch where it left off.
systemd-cloud-watch also creates the log-group if missing.
systemd-cloud-watch also uses the name tag and the private ip address so that you can easily find the log you are looking for.
We also include a packer file to show you how to build and configure a systemd-cloud-watch image with EC2/Centos/Systemd. There is no question about how to configure systemd because we have a working example.
Take a look at https://github.com/saymedia/journald-cloudwatch-logs by Martin Atkins.
This open source project creates a binary that does exactly what you want - ship your (systemd) journald logs to AWS CloudWatch Logs.
The project depends on libsystemd to forward directly to CloudWatch. It does not rely on forwarding to syslog. This is a good thing.
The project appears to use Go's concurrent channels to read the logs and batch the writes.
Vector can be used to ship logs from journald to AWS CloudWatch Logs.
journald can be used as a source and AWS Cloudwatch Logs as a sink.
I'm working on integrating this with an existing deployment of about 6 EC2 instances that generate about 30 GB of logs daily. I'll update this answer with any caveats or gotchas after we've used Vector in production for a few weeks.
EDIT 8/17/2020
A few things to be aware of: the maximum batch size for PutLogEvents is 1 MB, and there is a maximum of 5 requests per second per log stream (see the CloudWatch Logs service limits).
To help with that, in my setup each journald unit has its own log stream. Also, the Vector journald source includes a lot of fields, so I used a Vector transform to remove all the ones I didn't need. However, I'm still running into rate limits.
EDIT 10/6/2020
I have this running in production now. I had to update the version of Vector I was using from 0.8.1 to 0.10.0 to take care of an issue with Vector not respecting the max-bytes-per-batch requirement for AWS CloudWatch Logs. As for the rate limit issues I was experiencing, it turns out I wasn't actually having any. I was getting this message in the Vector logs: tower_limit::rate::service: rate limit exceeded, disabling service. What that actually means is that Vector is temporarily pausing sending logs to respect the rate limit of the sink. Also, each CloudWatch log stream can consume up to 18 GB per hour, which is fine for my 30 GB per day requirement across more than 30 different services on 6 VMs.
One issue I did run into was a CPU spike on our main API service. I had a separate source for each service unit tailing the journald logs, and I believe this somehow blocked our API from writing to journald (not 100% sure, though). What I did was use one source and specify multiple units to follow, so there is only one process tailing the logs, and I increased the batch size since each service generates a lot of logs. I then used Vector's template syntax to split the Log Group and Log Stream based on the service name. Below is an example configuration:
[sources.journald_logs]
type = "journald"
units = ["api", "sshd", "vector", "review", "other-service"]
batch_size = 100
[sinks.cloud_watch_logs]
type = "aws_cloudwatch_logs"
inputs = ["journald_logs"]
group_name = "/production/{{host}}/{{_SYSTEMD_UNIT}}"
healthcheck = true
region = "${region}"
stream_name = "{{_SYSTEMD_UNIT}}"
encoding = "json"
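To try a configuration like the one above, point the vector binary at the file (the path below is just where it happens to live on my hosts):
# Run Vector in the foreground with the configuration above
vector --config /etc/vector/vector.toml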
I have one final issue I need to iron out, but it's not related to this question. I'm using a file source for nginx since it writes to an access log file. Vector is consuming 80% of the CPU on that machine getting the logs and sending them to AWS CloudWatch. Filebeat also runs on the same box sending the logs to Logstash, but it's never caused any issues. Once we get vector working reliably we'll retire the Elastic Stack, but for now we have them running side by side.