We have a Hyperledger Fabric application. The main application is hosted on AWS VMs, whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable; availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The ZooKeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and then comes back online a few seconds later. This disconnect/reconnect cycle occurs regularly, with a gap of 12 hours and 10 minutes between occurrences.
We have noticed two things.
First, in the orderer log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
The other thing we noticed is that the logs from before the VM failure event show 3 Kafka brokers, but only 2 Kafka brokers appear in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
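For context, the Kafka broker settings recommended in the Fabric documentation for the ordering service look roughly like the following (a sketch of server.properties values, not our exact configuration):

# Sketch of broker settings from the Fabric Kafka ordering-service guidance
unclean.leader.election.enable=false
min.insync.replicas=2
default.replication.factor=3
# keep channel data from being pruned by time-based retention
log.retention.ms=-1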
Additional information - we went through the Kafka logs from the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
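As an aside, the "size" value in that exception looks suspicious: interpreted as 4 big-endian ASCII bytes it spells "GET ", which often indicates plain HTTP traffic (for example a health probe or a browser) hitting the Kafka listener port, rather than a genuinely oversized Kafka request. A quick illustrative check in Python:

# Interpret the reported receive size as 4 big-endian bytes
print((1195725856).to_bytes(4, "big"))  # prints b'GET '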
It seems that we have a solution, but it needs to be validated. Once the solution is validated, I will post it here.
Related
I have an AWS MSK Kafka cluster with 2 brokers. From the logs I can see (on each broker) that they are constantly rebalancing. Every minute I see entries like this in the logs:
Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350887 (__consumer_offsets-21) (reason: Adding new member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
And 25 seconds later:
Preparing to rebalance group amazon.msk.canary.group.broker-1 in state PreparingRebalance with old generation 350888 (__consumer_offsets-21) (reason: removing member consumer-amazon.msk.canary.group.broker-1-27058-8aad596f-b00d-428a-abaa-f3a28d714f89 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
Why does this happen? What is causing it? And what is the amazon.msk.canary.group.broker-1 consumer group?
Could it be something to do with the configuration of Java's garbage collection on the brokers? I remember reading that a misconfigured garbage collector can cause a broker to pause for a few seconds and lose connectivity to ZooKeeper, hence the flapping behaviour. Could you check whether you are applying any custom configuration for garbage collection (e.g. via the KAFKA_JVM_PERFORMANCE_OPTS environment variable)?
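For reference, on a self-managed broker the stock value of that variable looks roughly like this (a sketch based on Kafka's defaults in kafka-run-class.sh; exact flags vary by version):

# Approximate Kafka default GC settings; heavier custom settings can cause long pauses
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true"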
My platform runs on GCP Cloud Run. The DB we use is Snowflake.
Once a week, we schedule (with Cloud Scheduler) a job that triggers up to 200 tasks (currently; this number will probably grow in the future). All tasks are added to a certain queue.
Each task is essentially a push POST call to a Cloud Run instance.
Each Cloud Run instance handles one request at a time (see the environment settings below), meaning one task at a time. Moreover, each Cloud Run instance holds 2 active sessions to 2 Snowflake databases (one for each): the first session is for "global_db" and the other is for a specific "person_id" DB. (Note: there might be 2 active sessions to the same person_id DB from different Cloud Run instances.)
Issues:
1 - When setting the task queue's "Max concurrent dispatches" to 1000, I get 503 errors ("The request failed because the instance failed the readiness check.").
This was probably down to GCP autoscaling capacity - SOLVED by decreasing "Max concurrent dispatches" to a reasonable number that GCP can handle.
2 - When setting the task queue's "Max concurrent dispatches" to more than 10, I get multiple ConnectTimeoutError & OperationalError exceptions, with the following messages (I removed the long IDs and just put {} to make the messages shorter):
sqlalchemy.exc.OperationalError: (snowflake.connector.errors. ) 250003: Failed to execute request: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={REQUEST_ID}&databaseName={DB_NAME}&warehouse={NAME}&request_guid={GUID} (Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3e583ff91550>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
(Background on this error at: http://sqlalche.me/e/13/e3q8)
snowflake.connector.vendored.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={ID}&databaseName={NAME}&warehouse={NAME}&request_guid={GUID}(Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3eab877b3ed0>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
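For illustration, a minimal retry-with-backoff wrapper around the Snowflake connection call would look roughly like this (a sketch only - the account, credentials, and helper name are placeholders, not our actual code):

import time
import snowflake.connector

def connect_with_retry(retries=3, backoff_seconds=5):
    # Retry transient connect timeouts with a simple linear backoff
    for attempt in range(1, retries + 1):
        try:
            return snowflake.connector.connect(
                account="xxxxx.us-central1.gcp",  # placeholder account locator
                user="APP_USER",                  # placeholder credentials
                password="***",
                warehouse="APP_WH",               # placeholder warehouse name
                database="GLOBAL_DB",             # placeholder database name
                login_timeout=60,                 # same connect timeout seen in the errors
            )
        except snowflake.connector.errors.OperationalError:
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)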
Any ideas how I can solve it?
Ask any questions you have, and I will elaborate.
Environment settings -
Cloud Tasks queue - checked multiple configurations for "Max concurrent dispatches", from 10 to 1000 concurrency. Max attempts is 1, max dispatches is 500.
Cloud Run - 5 hot instances, 1 request per instance; can autoscale to a max of 1000 instances.
Snowflake - ACCOUNT parameters were at their defaults (MAX_CONCURRENCY_LEVEL=8 and STATEMENT_QUEUED_TIMEOUT_IN_SECONDS=0) and were changed to the following (in order to handle those errors):
MAX_CONCURRENCY_LEVEL - 32
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS - 600
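For reference, those account-level parameters can be changed with ALTER ACCOUNT (run with a sufficiently privileged role, e.g. ACCOUNTADMIN):

ALTER ACCOUNT SET MAX_CONCURRENCY_LEVEL = 32;
ALTER ACCOUNT SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600;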
I want to report that we've found the problem - when the project was first set up, we created a VPC with a static IP for the Cloud Run instances.
Unfortunately, the maximum number of connections to a single VPC network is 25.
I deployed 2 instances of Eureka Server and a total of 12 microservice instances.
Renews (last min) is 24, as expected, but Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on? I am also seeing this warning - THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS. What is the expected behaviour in this case, and how do I resolve this if it is a problem?
As mentioned above, I deployed 2 instances of Eureka Server, but after running for a while, around 19-20 hours, one instance of Eureka Server always goes down. Why could that possibly be happening? I checked the running processes using the top command and found that Eureka Server is taking a lot of memory. What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
Below is the configuration in the application.properties file of Eureka Server:
spring.application.name=eureka-server
eureka.instance.appname=eureka-server
eureka.instance.instance-id=${spring.application.name}:${spring.application.instance_id:${random.int[1,999999]}}
eureka.server.enable-self-preservation=false
eureka.datacenter=AWS
eureka.environment=STAGE
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=false
Below is the script I am using to start the Eureka Server instances.
#!/bin/bash
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &
Is this approach to create multiple instances of Eureka Server correct?
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Spring Boot version: 2.3.4.RELEASE
I am new to all of this; any help or direction would be greatly appreciated.
Let me try to answer your questions one by one.
Renews (last min) is 24, as expected, but Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on?
What is the expected behaviour in this case, and how do I resolve this if it is a problem?
I can see eureka.server.enable-self-preservation=false in your configuration. This is only really needed if you want an already registered application to be removed from the registry as soon as it fails to renew its lease.
The self-preservation feature exists to prevent exactly that situation, because it can be triggered by nothing more than a network hiccup. Say you have two services, A and B, both registered with Eureka, and B suddenly fails to renew its lease because of a temporary network hiccup. If self-preservation is not there, B will be removed from the registry and A won't be able to reach B, even though B is actually available.
So we can say that self-preservation is a resiliency feature of Eureka.
Renews threshold is the expected number of renews per minute. The Eureka server enters self-preservation mode if the actual number of heartbeats in the last minute (Renews) is less than the expected number of renews per minute (Renews threshold).
Of course, you can control the Renews threshold: you can configure renewal-percent-threshold (by default it is 0.85).
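For example, in a Spring Cloud Eureka server's application.properties (the value shown is just the default):

eureka.server.renewal-percent-threshold=0.85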
So in your case,
Total number of application instances = 12
You don't have eureka.instance.leaseRenewalIntervalInSeconds set, so the default value of 30s applies,
and eureka.client.registerWithEureka=false
so Renews (last min) will be 24.
You don't have renewal-percent-threshold configured, so the default value is 0.85
Number of renewals per application instance per minute = 2 (30s each)
So if self-preservation were enabled, the Renews threshold would be calculated as 2 * 12 * 0.85 = 20.4, truncated to 20.
And since in your case self-preservation is turned off, Eureka won't calculate the Renews threshold.
One instance of Eureka Server always goes down. Why could that possibly be happening?
I'm not able to answer this one for the time being; it could be happening for multiple reasons.
You can usually find the reason in the logs - if you can post the logs here, that would be great.
What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
From the information you have provided, I cannot tell much about your memory issue. In addition to that, you have already specified -Xmx256m, and I haven't faced any memory issues with Eureka servers so far.
But I can say that top is not the right tool for checking the memory consumed by your Java process. When the JVM starts, it reserves memory from the operating system; this reserved amount is what you see in tools like ps and top. It is better to use jstat or jvmtop.
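For example, to watch heap and GC utilisation of the running server (illustrative; replace the placeholder with your Eureka Server process id):

# prints GC/heap utilisation every 5 seconds
jstat -gcutil <eureka-server-pid> 5000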
Is this approach to create multiple instances of Eureka Server correct?
It seems you are using the same hostname (eureka.instance.hostname) for both instances. Peer replication won't work if both instances use the same hostname.
Also make sure that you have the same application name in both instances.
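For illustration, a typical peer-aware setup gives each instance its own hostname and points it at the other peer, for example via two profiles (the hostnames below are placeholders and must resolve to the respective hosts):

# application-peer1.properties (hypothetical profile, instance running on port 9011)
eureka.instance.hostname=eureka1.example.co.za
eureka.client.serviceUrl.defaultZone=http://eureka2.example.co.za:9012/eureka/

# application-peer2.properties (hypothetical profile, instance running on port 9012)
eureka.instance.hostname=eureka2.example.co.za
eureka.client.serviceUrl.defaultZone=http://eureka1.example.co.za:9011/eureka/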
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Nothing specific to AWS as far as I know, other than making sure that the instances can communicate with each other.
I am planning to use throttling in WSO2 EI 6.4.0. I tested the scenario from my local system and ran into some problems; could anyone who knows please help? Thanks in advance.
If we restart the WSO2 EI node, the policy does not work as expected. It starts counting again from the beginning (suppose the request limit is 10 per hour: before restarting, the node had processed 5 requests, so after restarting it should accept only the remaining 5 requests, but it accepts 10).
Throttling works at the level of a single WSO2 EI node, but suppose a Linux server is running 10 nodes - how do we distribute the throttling policy across them at the server level?
How do we take the client IP into account in throttling? If the request comes through an F5 load balancer, I need to consider the requesting system's IP, not the F5 server's IP.
If we restart the WSO2 EI node, the policy does not work as expected. It starts counting again from the beginning (suppose the request limit is 10 per hour: before restarting, the node had processed 5 requests, so after restarting it should accept only the remaining 5 requests, but it accepts 10).
The throttle mediator does not persist the throttle count. Therefore, if you perform a server restart, the throttle count is reset and starts again from zero. In a production environment, frequent server restarts are not expected.
Throttling works at the level of a single WSO2 EI node, but suppose a Linux server is running 10 nodes - how do we distribute the throttling policy across them at the server level?
If you want to maintain the throttle count across all the nodes, you need to cluster the nodes. The throttle mediator uses Hazelcast cluster messages to maintain a global count across the cluster.
We are running into a memory issue on our RDS PostgreSQL instance, i.e. memory usage of the PostgreSQL server reaches almost 100%, resulting in stalled queries and subsequent downtime of the production app.
The memory usage of the RDS instance doesn't go up gradually, but suddenly, within a period of 30 minutes to 2 hours.
Most of the time when this happens, we see a lot of bot traffic going on, though there is no specific pattern in terms of frequency. It can happen anywhere from 1 week to 1 month after the previous occurrence.
Disconnecting all clients and then restarting the application doesn't help either, as the memory usage goes up again very rapidly.
Running VACUUM FULL is the only solution we have found that resolves the issue when it occurs.
What we have tried so far:
Periodic vacuuming (not full vacuuming) of some tables that get frequent updates.
Stopped storing web sessions in the DB, as they are highly volatile and result in a lot of dead tuples.
Neither of these has helped.
We have considered using tools like pgcompact / pg_repack, as they don't acquire an exclusive lock. However, these can't be used with RDS.
We now see a strong possibility that this has to do with the memory bloat that can happen on PostgreSQL with prepared statements in Rails 4, as discussed on the following pages:
Memory leaks on postgresql server after upgrade to Rails 4
https://github.com/rails/rails/issues/14645
As a quick trial, we have now disabled prepared statements in our Rails database configuration and are observing the system. If the issue recurs, this hypothesis will be proven wrong.
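For reference, disabling prepared statements in Rails is a one-line change in config/database.yml (sketch; all other settings omitted):

production:
  adapter: postgresql
  prepared_statements: false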
Setup details:
We run our production environment inside Amazon Elastic Beanstalk, with the following configuration:
App servers
OS : 64bit Amazon Linux 2016.03 v2.1.0 running Ruby 2.1 (Puma)
Instance type: r3.xlarge
Root volume size: 100 GiB
Number of app servers : 2
Rails workers running on each server : 4
Max number of threads in each worker : 8
Database pool size : 50 (applicable for each worker)
Database (RDS) Details:
PostgreSQL Version: PostgreSQL 9.3.10
RDS Instance type: db.m4.2xlarge
Rails Version: 4.2.5
Current size on disk: 2.2GB
Number of tables: 94
The environment is monitored with AWS CloudWatch and New Relic.
Periodic vacuuming should help to contain table bloat, but not index bloat.
1) Have you tried more aggressive autovacuum parameters? (See the sketch after this list.)
2) Have you tried routine reindexing? If locking is a concern, then consider:
DROP INDEX CONCURRENTLY ...
CREATE INDEX CONCURRENTLY ...
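A sketch of both approaches (table and index names below are hypothetical; tune the thresholds to your workload):

-- 1) More aggressive autovacuum on a heavily updated table
ALTER TABLE some_busy_table SET (
    autovacuum_vacuum_scale_factor = 0.05,   -- vacuum once ~5% of rows are dead
    autovacuum_analyze_scale_factor = 0.05
);

-- 2) Routine reindexing without an exclusive lock (works on PostgreSQL 9.3)
CREATE INDEX CONCURRENTLY idx_some_busy_table_user_id_new ON some_busy_table (user_id);
DROP INDEX CONCURRENTLY idx_some_busy_table_user_id;
ALTER INDEX idx_some_busy_table_user_id_new RENAME TO idx_some_busy_table_user_id;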