GCP load balancer shows "Invalid Fingerprint" - google-cloud-platform

I'm using a GCP Load Balancer that is currently functioning correctly, however when I view it in Cloud Console, it shows an "Invalid Fingerprint" status. More importantly, when I try to edit it--like, for example, by adding a new backend service--it fails with "Invalid Fingerprint." I'm under the impression this is some sort of SSL failure, but I have no idea what it means more than that or how to fix it.
tl;dr: GCP Load Balancer edits fail with "Invalid Fingerprint"

To add a little more context to this type of error status:
This issue happened around May 2016 for all kinds of load balancers in GCP and the cause was how the hash value for the fingerprint was calculated. This specific issue was solved by a roll-out a few days later.
Around 2017 the issue reappeared but for kubernetes, where the Google Kubernetes Engine(GKE) was updating the fingerprint every minute and when someone modifies the load balancer with the GUI, it causes multiple updates to trigger that does not finish by the time GKE sends a new update causing the invalid fingerprint error.
Most of this information I provided is linked on this Issue Tracker
that I invite you to check and 'star' for updates
Troubleshooting possibilities:
a) Using the command line to add or remove backend tend to be faster and less likely produce the "Invalid Fingerprint".
E.g.
$ gcloud compute backend-services create BACKEND_SERVICE_NAME
$ gcloud compute backend-services delete BACKEND_SERVICE_NAME
For all the flags, please reference: gcloud compute backend-services
b) When the issue is related to a GKE Ingress Load Balancer, in some cases, the way it was structured by the YAML file does not get updated (there is a small chance of happening when a new roll-out is deployed).
Recreating it should solve the issue, please refer to the Ingress Kubernetes Page to know all about it.
c) Possibly it is not directly related to the SSL failure since for these a more specific error will appear, like:
RSA key error: n does not equal p q RSA key error: d e not congruent to 1 RSA key error: dmp1 not congruent to d RSA key error: dmq1 not congruent to d RSA key error: iqmp not inverse of q
Curiosity: The error message "Invalid Fingerprint" was originally designed to be the error message when 'conditions were not met' but now is specific for the fingerprint hash calculation.
--- This answer was improved based on the comments of original Poster, thank you. ---

Related

Unable to create environments on Google Cloud Composer

I tried to create a Google Cloud Composer environment but in the page to set it up I get the following errors:
Service Error: Failed to load GKE machine types. Please leave the field
empty to apply default values or retry later.
Service Error: Failed to load regions. Please leave the field empty to
apply default values or retry later.
Service Error: Failed to load zones. Please leave the field empty to apply
default values or retry later.
Service Error: Failed to load service accounts. Please leave the field
empty to apply default values or retry later.
The only parameters GCP lets me change are the region and the number of nodes, but still lets me create the environment. After 30 minutes the environment crashes with the following error:
CREATE operation on this environment failed 1 day ago with the following error message:
Http error status code: 400
Http error message: BAD REQUEST
Errors in: [Web server]; Error messages:
Failed to deploy the Airflow web server. This might be a temporary issue. You can retry the operation later.
If the issue persists, it might be caused by problems with permissions or network configuration. For more information, see https://cloud.google.com/composer/docs/troubleshooting-environment-creation.
An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-07-20T14:31:23.047Z7050.xd.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
Got error "Another operation failed." during CP_DEPLOYMENT_CREATING_STANDARD []
Is it a problem with permissions? If so, what permissions do I need? Thank you!
It looks like more of a temporary issue:
the first set of errors is stating you cannot load the metadata :
regions list, zones list ....
you dont have a clear
PERMISSION_DENIED error
the second error: is suggesting also:
This might be a temporary issue.

eks fluent-bit to elasticsearch timeout

So I had a working configuration with fluent-bit on eks and elasticsearch on AWS that was pointing on the AWS elasticsearch service but for cost saving purpose, we deleted that elasticsearch and created an instance with a solo elasticsearch, enough for dev purpose. And the aws service doesn't manage well with only one instance.
The issue is that during this migration the fluent-bit seems to have broken, and I get lots of "[warn] failed to flush chunk" and some "[error] [upstream] connection #55 to ES-SERVER:9200 timed out after 10 seconds".
My current configuration:
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kube.var.log.containers.
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[INPUT]
Name tail
Tag kube.*
Path /var/log/containers/*.log
Parser docker
DB /var/log/flb_kube.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Ignore_Older 1m
I think the issue is in one of those configuration, if I comment the kubernetes filter I don't have the errors anymore but I'm loosing the fields in the indices...
I tried tweeking some parameters in fluent-bit to no avail, if anyone has a suggestion?
So, the previous logs did not indicate anything, but I finaly found something when activating trace_error in the elasticsearch output:
{"index":{"_index":"fluent-bit-2021.04.16","_type":"_doc","_id":"Xkxy 23gBidvuDr8mzw8W","status":400,"error":{"type":"mapper_parsing_exception","reas on":"object mapping for [kubernetes.labels.app] tried to parse field [app] as o bject, but found a concrete value"}}
Did someone get that error before and knows how to solve it?
So, after looking into the logs and finding the mapping issue I ssem to have resolved the issue. The logs are now corretly parsed and send to the elasticsearch.
To resolve it I had to augment the limit of output retry and add the Replace_Dots option.
[OUTPUT]
Name es
Match *
Host ELASTICSERVER
Port 9200
Index <fluent-bit-{now/d}>
Retry_Limit 20
Replace_Dots On
It seems that at the beginning I had issues with the content being sent, because of that the error seemed to have continued after the changed until a new index was created making me think that the error was still not resolved.

Random “upstream connect error or disconnect/reset before headers” between services with Istio 1.3

So, this problem is happening randomly (it seems) and between different services.
For example we have a service A which needs to talk to service B, and some times we get this error, but after a while, the error goes away. And this error doesn't happen too often.
When this happens, we see the error log in service A throwing the “upstream connect error” message, but none in service B. So we think it might be related with the sidecars.
One thing we notice is that in service B, we get a lot of this error messages in the istio-proxy container:
[src/istio/mixerclient/report_batch.cc:109] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
And according to documentation when a request comes in, envoy asks Mixer if everything is good (authorization and other things), and if Mixer doesn’t reply, the request is not success. So that’s why exists an option called policyCheckFailOpen.
We have that in false, I guess is a sane default, we don’t want the request to go through if Mixer cannot be reached, but why can’t?
disablePolicyChecks: true
policyCheckFailOpen: false
controlPlaneSecurityEnabled: false
NOTE: istio-policy is running with the istio-proxy sidecar. Is that correct?
We don’t see that error in some other service which can also fail.
Another log that I can see a lot, and this one happens in all the services not running as root with fsGroup defined in the YAML files is:
watchFileEvents: "/etc/certs": MODIFY|ATTRIB
watchFileEvents: "/etc/certs/..2020_02_10_09_41_46.891624651": MODIFY|ATTRIB
watchFileEvents: notifying
One of the leads I'm chasing is about default circuitBreakers values. Could that be related with this?
Thanks
The error you are seeing is because of a failure to establish a connection to istio-policy
Based on this github issue
Community members add two answers here which could help you with your issue
If mTLS is enabled globally make sure you set controlPlaneSecurityEnabled: true
I was facing the same issue, then I read about protocol selection. I realised the name of the port in the service definition should start with for example http-. This fixed the issue for me. And . if you face the issue still you might need to look at the tls-check for the pods and resolve it using destinationrules and policies.
istio-policy is running with the istio-proxy sidecar. Is that correct?
Yes, I just checked it and it's with sidecar.
Let me know if that help.

Getting ec2 instance ID suddenly stopped working

I've been getting an amazon instance ID from within the instance itself for over a year now by hitting this local web address http://169.254.169.254/latest/meta-data/instance-id. This is the appropriate method according to the AWS documentation. For some reason though, just this week that same call started throwing an error.
I tried pinging the 169.254.169.254 address from the command line and that fails, so it seems like something pretty basic has changed with the EC2 instances. I don't see any changes to the documentation on AWS. One thing I do notice is that I used to see the instance name in the upper right hand corner when loading up the instance and logging in remotely. That information doesn't appear anymore.
Here is the code I've been using to get the ID:
retID = New StreamReader(HttpWebRequest.Create("http://169.254.169.254/latest/meta-data/instance-id").GetResponse().GetResponseStream()).ReadToEnd()
Here is the full error stack:
at System.Net.HttpWebRequest.GetResponse()
at RunControllerInterface.NewRunControlCommunicate.getInstanceIDFromAmazon()
The error message itself says: Unable to connect to the remote server
Any help would be appreciated.
So I think I have at least a partial answer to this problem. When making this image, I was using a t3a.medium instance. As long as I use that same type of instance I am able to pull down the instance name.

Enabling HA namenodes on a secure cluster in Cloudera Manager fails

I am running a CDH4.1.2 secure cluster and it works fine with the single namenode+secondarynamenode configuration, but when I try to enable High Availability (quorum based) from the Cloudera Manager interface it dies at step 10 of 16, "Starting the NameNode that will be transitioned to active mode namenode ([my namenode's hostname])".
Digging into the role log file gives the following fatal error:
Exception in namenode joinjava.lang.IllegalArgumentException: Does not contain a valid host:port authority: [my namenode's fqhn]:[my namenode's fqhn]:0 at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:206) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158) at
org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:147) at
org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:143) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:547) at
org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:480) at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:443) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:608) at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:589) at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1140) at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1204)
How can I resolve this?
It looks like you have two problems:
The NameNode's IP address is resolving to "my namenode's fqhn" instead of a regular hostname. Check your /etc/hosts file to fix this.
You need to configure dfs.https.port. With Cloudera Manager free edition, you must have had to add the appropriate configs to the safety valves to enable security. As part of that, you need to configure the dfs.https.port.
Given that this code path is traversed even in the non-HA mode, I'm surprised that you were able to get your secure NameNode to start up correctly before enabling HA. In case you haven't already, I recommend that you first enable security, test that all HDFS roles start up correctly and then enable HA.