eks fluent-bit to elasticsearch timeout - amazon-web-services

So I had a working configuration with fluent-bit on eks and elasticsearch on AWS that was pointing on the AWS elasticsearch service but for cost saving purpose, we deleted that elasticsearch and created an instance with a solo elasticsearch, enough for dev purpose. And the aws service doesn't manage well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration, if I comment the kubernetes filter I don't have the errors anymore but I'm loosing the fields in the indices...
I tried tweeking some parameters in fluent-bit to no avail, if anyone has a suggestion?
So, the previous logs did not indicate anything, but I finaly found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?

So, after looking into the logs and finding the mapping issue I ssem to have resolved the issue. The logs are now corretly parsed and send to the elasticsearch.
To resolve it I had to augment the limit of output retry and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent, because of that the error seemed to have continued after the changed until a new index was created making me think that the error was still not resolved.

Related

Attempting to put a CloudWatch Agent onto an On-Premise server. Issues with cwagent-otel-collector

As said in the title, I am attempting to put a CloudWatch Agent (CW agent) on my On-Premise-Server (OPS).
After running this line of code that I got from the AWS User Guide to start the CW agent:
& $Env:ProgramFiles\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent-ctl.ps1 -m ec2 -a start
I got this error:
****** processing cwagent-otel-collector ******
cwagent-otel-collector will not be started as it has not been configured yet.
****** processing amazon-cloudwatch-agent ******
AmazonCloudWatchAgent has been started
I did/do not know what this was so I searched and found that when someone else had this issue, they did not create a config file.
I did create a config file (named config.json by default) using the configuration wizard and I am still having the issue.
I have tried looking into a number of pages on that user guide, but nothing has resolved the issue.
Thank you in advance for any assistance you can provide.
This message is info and not an error.
CloudWatch agent is bundled with the AWS OpenTelemetry collector agent. They're actually two agents. CloudWatch agent and Otel collector have separate configuration files. If you provide a config for one and not the other, it will only start the one that is configured. This is expected behavior.
Thank you for taking the time to answer. I have since resolved the issue (recently).
Everything from the command I was using to the path where the file resided was incorrect.
Starting over and going through all the steps again with background information helped.
The first installation combined with learning everything for the first time produced the issue.
Anyone having this issue I recommend that when you hit a wall like this you start over. I know it is not what anyone wants to do, but in the end it saved time.

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in PENDING state
it takes couple of minutes
eventually (using refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach outside net
What can I be doing wrong? How to fix?
In my case too the log group was the problem. The one I had configured wasnt working. Hence I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also if you open the stopped task, navigate to the container section, expand it, under the Details section you can see a detailed error message. Screenshot below
It could be a problem with the entry point as pointed in the comments of the question (in the task definition) Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just create de group log in my cloudwatch console because it have not created, and now everything is going well.

GCP load balancer shows "Invalid Fingerprint"

I'm using a GCP Load Balancer that is currently functioning correctly, however when I view it in Cloud Console, it shows an "Invalid Fingerprint" status. More importantly, when I try to edit it--like, for example, by adding a new backend service--it fails with "Invalid Fingerprint." I'm under the impression this is some sort of SSL failure, but I have no idea what it means more than that or how to fix it.
tl;dr: GCP Load Balancer edits fail with "Invalid Fingerprint"
To add a little more context to this type of error status:
This issue happened around May 2016 for all kinds of load balancers in GCP and the cause was how the hash value for the fingerprint was calculated. This specific issue was solved by a roll-out a few days later.
Around 2017 the issue reappeared but for kubernetes, where the Google Kubernetes Engine(GKE) was updating the fingerprint every minute and when someone modifies the load balancer with the GUI, it causes multiple updates to trigger that does not finish by the time GKE sends a new update causing the invalid fingerprint error.
Most of this information I provided is linked on this Issue Tracker
that I invite you to check and 'star' for updates
Troubleshooting possibilities:
a) Using the command line to add or remove backend tend to be faster and less likely produce the "Invalid Fingerprint".
E.g.
$ gcloud compute backend-services create BACKEND_SERVICE_NAME
$ gcloud compute backend-services delete BACKEND_SERVICE_NAME
For all the flags, please reference: gcloud compute backend-services
b) When the issue is related to a GKE Ingress Load Balancer, in some cases, the way it was structured by the YAML file does not get updated (there is a small chance of happening when a new roll-out is deployed).
Recreating it should solve the issue, please refer to the Ingress Kubernetes Page to know all about it.
c) Possibly it is not directly related to the SSL failure since for these a more specific error will appear, like:
RSA key error: n does not equal p q RSA key error: d e not congruent to 1 RSA key error: dmp1 not congruent to d RSA key error: dmq1 not congruent to d RSA key error: iqmp not inverse of q
Curiosity: The error message "Invalid Fingerprint" was originally designed to be the error message when 'conditions were not met' but now is specific for the fingerprint hash calculation.
--- This answer was improved based on the comments of original Poster, thank you. ---

AWS Elastic Beanstalk GUI: Query failed to deserialize response

after running AWS Elastic Beanstalk application for few weeks suddenly I can't open my application. Page simply displays an error which doesn't provide much information how to fix it.
Error
A problem occurred while loading your page: AWS Query failed to deserialize response
(and there is no more information, Googling also haven't found any answer)
So before updating my subscription and starting paying to Amazon not insignificant amount of money for being able to contact their technical support I thought I will ask here first if someone here encountered this issue.
Thanks for any suggestions.
After receiving this generic error, I was able to dig into the actual error message by using the EB CLI. In my case the CLI threw "ZIP does not support timestamps before 1980".

Enabling HA namenodes on a secure cluster in Cloudera Manager fails

I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode joinjava.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0 at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147) at
org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480) at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140) at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?
It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager free edition, you must have had to add the appropriate configs to the safety valves to enable security. As part of that, you need to configure the dfs.https.port.
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.