memcached-session-manager on AWS

I've got a website running on Amazon Web Services that is deployed using Elastic Beanstalk and runs on a minimum of 2 EC2 micro instances. An auto scaling policy is in place, so that it can scale up and scale down depending on the traffic to the website. Because of this auto scaling policy I wanted to avoid sticky sessions, so I am using memcached-session-manager. I am using Amazon ElastiCache (small instance) for the memcached server.
The configuration in the context.xml is as follows:
<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
memcachedNodes="sessions.myinstancecode.0001.use1.cache.amazonaws.com:11211"
sticky="false"
sessionBackupAsync="false"
lockingMode="none"
transcoderFactoryClass="de.javakaffee.web.msm.serializer.kryo.KryoTranscoderFactory" />
This works fine when traffic is low (i.e. fewer than 10 users online), but it sometimes causes an EC2 instance to restart. You can imagine that if the website is currently running on two instances and they both decide to restart at the same time, the website becomes unreachable, which is a big problem. These are the last lines in the tail_catalina.log that is rotated to Amazon S3 before the EC2 instance decides to restart:
Jun 13, 2012 12:32:27 AM de.javakaffee.web.msm.BackupSessionTask handleException
WARNING: Could not store session 42F9761AC24F826E1FC3F2A834FBF442 in memcached.
Note that this session was relocated to this node because the original node was not available.
net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: sessions.myinstancecode.0001.use1.cache.amazonaws.com/10.194.23.99:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:73)
at de.javakaffee.web.msm.BackupSessionTask.storeSessionInMemcached(BackupSessionTask.java:230)
at de.javakaffee.web.msm.BackupSessionTask.doBackupSession(BackupSessionTask.java:195)
at de.javakaffee.web.msm.BackupSessionTask.call(BackupSessionTask.java:120)
at de.javakaffee.web.msm.BackupSessionTask.call(BackupSessionTask.java:51)
at de.javakaffee.web.msm.BackupSessionService$SynchronousExecutorService.submit(BackupSessionService.java:339)
at de.javakaffee.web.msm.BackupSessionService.backupSession(BackupSessionService.java:198)
at de.javakaffee.web.msm.MemcachedSessionService.backupSession(MemcachedSessionService.java:967)
at de.javakaffee.web.msm.SessionTrackerValve.backupSession(SessionTrackerValve.java:226)
at de.javakaffee.web.msm.SessionTrackerValve.invoke(SessionTrackerValve.java:128)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Jun 13, 2012 12:32:28 AM de.javakaffee.web.msm.LockingStrategy onAfterBackupSession
WARNING: An error occurred during onAfterBackupSession.
net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: sessions.myinstancecode.0001.use1.cache.amazonaws.com/10.194.23.99:11211
at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:73)
at de.javakaffee.web.msm.LockingStrategy.onAfterBackupSession(LockingStrategy.java:287)
at de.javakaffee.web.msm.MemcachedSessionService.backupSession(MemcachedSessionService.java:970)
at de.javakaffee.web.msm.SessionTrackerValve.backupSession(SessionTrackerValve.java:226)
at de.javakaffee.web.msm.SessionTrackerValve.invoke(SessionTrackerValve.java:128)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:680)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:928)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:539)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:298)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
It seems like the Amazon ElastiCache node is failing, but the thing is that, checking Amazon CloudWatch, I can see that the CPU utilization never went over 8%. Is there any reason why the Amazon ElastiCache node fails even though it's not being stressed that much? Also, why does Amazon decide to restart (or rather: terminate and start a new instance) when the Amazon ElastiCache node fails?
Any help greatly appreciated.
Thanks!

You should increase the sessionBackupTimeout of memcached-session-manager. From the documentation:
sessionBackupTimeout (optional, default 100)
The timeout in milliseconds after which a session backup is considered to have failed. This property is only evaluated if sessions are stored synchronously (set via sessionBackupAsync). The default value
is 100 milliseconds.
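For example, a sketch of your Manager configuration with the timeout raised (the 2000 ms value is only illustrative; pick something comfortably above the latency you observe against ElastiCache):
<Manager className="de.javakaffee.web.msm.MemcachedBackupSessionManager"
memcachedNodes="sessions.myinstancecode.0001.use1.cache.amazonaws.com:11211"
sticky="false"
sessionBackupAsync="false"
lockingMode="none"
sessionBackupTimeout="2000"
transcoderFactoryClass="de.javakaffee.web.msm.serializer.kryo.KryoTranscoderFactory" />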

Related

Spring Data Neo4J - Unable to acquire connection from pool within configured maximum time

We have a reactive REST API using Spring Data Neo4j (Spring Boot v2.7.5) deployed to Kubernetes. When running a stress test to determine the breaking point, once the volume of requests the service can handle has been breached, the service does not auto-recover, even after the load has dropped to a level the service can handle.
After the service has fallen over, the Neo4j health indicator shows:
“org.neo4j.driver.exceptions.ClientException: Unable to acquire connection from the pool within configured maximum time of 60000ms”
With respect to connection/configuration settings we are using defaults configured by SDN.
Observations:
Up until the point at which the service breaks, only a small number of connections are utilised; at the breaking point, the number of connections in use jumps up to the max pool size and the above-mentioned error is observed. No matter how much time passes (even well beyond the max connection lifetime), the service is unable to acquire a connection from the pool. Upon manually shutting down and restarting the service/pod, the service returns to a healthy state.
As an interim solution we now check the Neo4J health indicator, if the mentioned error is present the liveness state is set to down which triggers Kubernetes to restart the service automatically. However, I’m wondering if there is an underlying issue with the connections in the pool not getting ‘cleaned up’?
You can take a look at this discussion https://github.com/spring-projects/spring-data-neo4j/issues/2632
I had the same issue. The problem is that either Spring Framework or the Neo4j reactive transaction manager doesn't close connections in a complex reactive flow, e.g. when there are a lot of inner calls/mappings and somewhere inside an exception is thrown.
So as a workaround you can add @Transactional in such places to avoid multiple transactions being created, as in the sketch below.
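A minimal sketch of that workaround (the service, repository, and view classes here are hypothetical stand-ins, not from the original project): annotating the outer reactive method starts one reactive transaction, so the inner repository calls and mappings run inside it rather than each opening its own connection.
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import reactor.core.publisher.Flux;

@Service
public class PersonService {

    // Hypothetical ReactiveNeo4jRepository-based repository; stands in for your own.
    private final PersonRepository repository;

    public PersonService(PersonRepository repository) {
        this.repository = repository;
    }

    // One transaction for the whole flow, so an exception thrown somewhere inside
    // the inner calls/mappings does not leave extra transactions/connections open.
    @Transactional
    public Flux<PersonView> searchAndEnrich(String namePart) {
        return repository.findByNameContaining(namePart)
                .flatMap(person -> repository.findFriends(person.getId())
                        .collectList()
                        .map(friends -> new PersonView(person, friends)));
    }
}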

Eureka Server memory, renew threshold is 0, self preservation issue - AWS

I deployed 2 instances of Eureka Server and a total of 12 instances of microservices.
Renews (last min) is 24 as expected, but the Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on? I am also seeing this error: THE SELF PRESERVATION MODE IS TURNED OFF. THIS MAY NOT PROTECT INSTANCE EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS. What's the expected behavior in this case, and how do I resolve this if it is a problem?
As mentioned above, I deployed 2 instances of Eureka Server, but after running for a while, around 19-20 hours, one instance of Eureka Server always goes down. Why could that possibly be happening? I checked the running processes using the top command and found that Eureka Server is taking a lot of memory. What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
Below is the configuration in the application.properties file of Eureka Server:
spring.application.name=eureka-server
eureka.instance.appname=eureka-server
eureka.instance.instance-id=${spring.application.name}:${spring.application.instance_id:${random.int[1,999999]}}
eureka.server.enable-self-preservation=false
eureka.datacenter=AWS
eureka.environment=STAGE
eureka.client.registerWithEureka=false
eureka.client.fetchRegistry=false
Below is the command that I am using to start the Eureka Server instances.
#!/bin/bash
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9011 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9012/eureka/ -jar eureka-server-1.0.jar &
java -Xms128m -Xmx256m -Xss256k -XX:+HeapDumpOnOutOfMemoryError -Dspring.profiles.active=stage -Dserver.port=9012 -Deureka.instance.prefer-ip-address=true -Deureka.instance.hostname=example.co.za -Deureka.client.serviceUrl.defaultZone=http://example.co.za:9011/eureka/ -jar eureka-server-1.0.jar &
Is this approach to create multiple instances of Eureka Server correct?
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Spring Boot version: 2.3.4.RELEASE
I am new to all of this; any help or direction would be greatly appreciated.
Let me try to answer your questions one by one.
Renews (last min) is 24 as expected, but the Renews threshold is always 0. Is this how it is supposed to be when self-preservation is turned on?
What's the expected behaviour in this case and how do I resolve this if it is a problem?
I can see eureka.server.enable-self-preservation=false in your configuration. This is really only needed if you want an already registered application to be removed as soon as it fails to renew its lease.
The self-preservation feature exists to prevent exactly that situation, because it can happen due to nothing more than a network hiccup. Say you have two services, A and B, both registered with Eureka, and suddenly B fails to renew its lease because of a temporary network hiccup. Without self-preservation, B would be removed from the registry and A wouldn't be able to reach B even though B is available.
So we can say that self-preservation is a resiliency feature of Eureka.
The Renews threshold is the expected number of renews per minute. The Eureka server enters self-preservation mode if the actual number of heartbeats in the last minute (Renews) is less than the expected number of renews per minute (Renews threshold).
You can control the Renews threshold by configuring renewal-percent-threshold (by default it is 0.85).
So in your case:
Total number of application instances = 12
You don't have eureka.instance.leaseRenewalIntervalInSeconds configured, so the default of 30s applies,
and eureka.client.registerWithEureka=false,
so Renews (last min) will be 24.
You don't have renewal-percent-threshold configured, so the default value of 0.85 applies.
Number of renewals per application instance per minute = 2 (one every 30s),
so with self-preservation enabled the Renews threshold would be calculated as 2 * 12 * 0.85 = 20.4, roughly 21 after rounding.
In your case self-preservation is turned off, so Eureka won't calculate the Renews threshold at all, which is why you see 0.
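If you decide to re-enable self-preservation, a sketch of the relevant Eureka server properties (the 0.85 value is just the default, shown for illustration, and the renewal-percent-threshold key is assumed to bind under eureka.server.* the same way as your other properties):
eureka.server.enable-self-preservation=true
eureka.server.renewal-percent-threshold=0.85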
One instance of Eureka Server always goes down. Why could that possibly be happening?
I'm not able to answer this for the time being; it could be happening for multiple reasons.
You can usually find the reason in the logs; if you can post the logs here, it would be great.
What needs to be configured on Eureka Server so that it doesn't take a lot of memory?
From the information you have provided I cannot tell what is causing your memory issue. In addition, you already specified -Xmx256m, and I haven't faced any memory issues with Eureka servers so far.
But I can say that top is not the right tool for checking the memory consumed by your Java process. When the JVM starts, it takes some memory from the operating system, and that is the amount you see in tools like ps and top. It is better to use jstat or jvmtop, as in the example below.
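For example, a quick way to watch the actual Java heap of a running Eureka Server process (the PID 12345 is illustrative):
jstat -gc 12345 5000
This prints heap and GC statistics every 5 seconds, which reflects what the JVM is really using rather than the resident size reported by top.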
Is this approach to create multiple instances of Eureka Server correct?
It seems you are using the same hostname (eureka.instance.hostname) for both instances. Replication won't work if you use the same hostname.
Also make sure that you use the same application name in both instances. A sketch of a peer-aware setup is shown below.
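A minimal sketch of a two-peer setup, assuming each instance runs with its own profile (the hostnames and ports are illustrative, and register-with-eureka/fetch-registry are commonly set back to true so the peers replicate from each other; check the behaviour against your Spring Cloud version):
# peer 1 (e.g. application-peer1.properties)
spring.application.name=eureka-server
eureka.instance.hostname=eureka-peer1.example.co.za
eureka.client.registerWithEureka=true
eureka.client.fetchRegistry=true
eureka.client.serviceUrl.defaultZone=http://eureka-peer2.example.co.za:9012/eureka/
# peer 2 (e.g. application-peer2.properties)
spring.application.name=eureka-server
eureka.instance.hostname=eureka-peer2.example.co.za
eureka.client.registerWithEureka=true
eureka.client.fetchRegistry=true
eureka.client.serviceUrl.defaultZone=http://eureka-peer1.example.co.za:9011/eureka/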
Deployment is on AWS. Is there any specific configuration needed for Eureka Server on AWS?
Nothing specific to AWS as far as I know, other than making sure that the instances can communicate with each other.

Datadog AWS EBS Monitoring - Drive Space - Exclude /dev/loop* devices

Okay, here is my setup:
Platform: AWS
Monitoring: DataDog
Metric: system.disk.in_use
Question: So I am running Ubuntu 18.04 LTS instances and, as time goes on, they seem to spawn additional devices periodically:
device:/dev/loop1, /dev/loop2 and so on.
When I first spun up these instances, there were only 3 /dev/loop(1-3) devices; however, over time, a /dev/loop4 showed up and our drive-space alert paged me, since these devices are 100% utilized when created.
So, I have to go into each of the monitors (one per environment) and add an exclusion for the new /dev/loop4, but I cannot set the exclusion until it has been created by at least one of the monitored instances.
Is there a way in DataDog that you can just add a blanket exclusion like:
device:/dev/loop*?
I have been combing through documentation and have not been able to find anything, so I thought I would ask here.
I solved this by adding 'squashfs' to the list of filesystem types to be ignored by the datadog agent.
Create a file /etc/datadog-agent/conf.d/disk.d/conf.yaml:
init_config:
  file_system_global_blacklist:
    - iso9660$
    - squashfs
instances:
  - use_mount: false
Restart the Datadog agent (systemctl restart datadog-agent).
You can use ! and *, like in:
avg:system.disk.in_use{!device:/dev/loop*} by {host,device}

How to provide Credentials to Micrometer for CloudWatch in Spring Boot 2

I'm using Spring Boot 2.0.9.RELEASE and am trying to figure out how to configure CloudWatch monitoring for my application running on an EC2-instance.
What I did can be seen in my answer to this question. But I'm stuck with the following exception:
ERROR Oct 23, 2019 12:20:06.881 [pool-2-thread-30] {} io.micrometer.cloudwatch.CloudWatchMeterRegistry:134 - error sending metric data.
com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.profile.ProfileCredentialsProvider#32ee6fee: profile file cannot be null]
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1225) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:801) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:751) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) ~[aws-java-sdk-core-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchClient.doInvoke(AmazonCloudWatchClient.java:2027) ~[aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchClient.invoke(AmazonCloudWatchClient.java:1994) ~[aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchClient.invoke(AmazonCloudWatchClient.java:1983) ~[aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchClient.executePutMetricData(AmazonCloudWatchClient.java:1754) ~[aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchAsyncClient$20.call(AmazonCloudWatchAsyncClient.java:972) [aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at com.amazonaws.services.cloudwatch.AmazonCloudWatchAsyncClient$20.call(AmazonCloudWatchAsyncClient.java:966) [aws-java-sdk-cloudwatch-1.11.641.jar!/:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
The strange thing is that I already managed to push metrics to CloudWatch in an earlier attempt, but I do not know what I broke. I did not dream it; I can still see the metrics ^^.
From what I read, I'd have to do something with the IAM role of my EC2 instance, but I'm lost here.
Turns out that the culprit was the following property:
cloud.aws.credentials.instanceProfile=false
I had it in my configuration because I read somewhere that it is needed to run the application locally. However, that wasn't true in my case.
Somehow it slipped into my production configuration. And as the name of the property suggests, setting it to false makes Spring Boot not use the instance profile of the EC2 instance the application is running on. So unless you provide credentials in another way, the application will not be able to push metrics to CloudWatch.
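A minimal sketch of the fix, assuming the application should pick up credentials from the EC2 instance profile (alternatively, simply remove the line and let the Spring Cloud AWS default apply):
# use the EC2 instance profile for AWS credentials
cloud.aws.credentials.instanceProfile=true
The instance's IAM role also needs permission to publish metrics (the cloudwatch:PutMetricData action) for Micrometer's CloudWatch registry to succeed.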

Orderer disconnections in a Hyperledger Fabric application

We have a Hyperledger Fabric application. The main application is hosted on AWS VMs whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable, and availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The Zookeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and immediately comes back online after a few seconds. This disconnection/reconnection occurs regularly after a gap of 12 hours and 10 minutes.
We have noticed two things.
In the log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
Another thing we noticed is that in the logs prior to the VM failure event there were 3 Kafka brokers, but we can see only 2 Kafka brokers in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
Additional information: we went through the Kafka logs of the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.