Spring Session using Lettuce connecting to AWS ElastiCache Redis

I am using Spring Session to externalize our session to Redis (AWS ElastiCache). Lettuce is being used as our client to Redis.
My AWS Redis configuration is the following:
Redis Cluster enabled
Two shards (i.e. two masters)
One slave per master
My Lettuce configuration is the following:
<!-- Lettuce Configuration -->
<bean class="org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory">
    <constructor-arg ref="redisClusterConfiguration"/>
</bean>

<!-- Redis Cluster Configuration -->
<bean id="redisClusterConfiguration" class="org.springframework.data.redis.connection.RedisClusterConfiguration">
    <constructor-arg>
        <list>
            <value><!-- AMAZON SINGLE ENDPOINT HERE --></value>
        </list>
    </constructor-arg>
</bean>
The issue appears when we trigger a failover of a master node. I get the following events logged (most recent first) during a test failover:
myserver-0001-001 | cache-cluster | Monday, July 6, 2020 at 8:25:32 PM UTC+3 | Finished recovery for cache nodes 0001
myserver-0001-001 | cache-cluster | Monday, July 6, 2020 at 8:20:38 PM UTC+3 | Recovering cache nodes 0001
myserver | replication-group | Monday, July 6, 2020 at 8:19:14 PM UTC+3 | Failover to replica node myserver-0001-002 completed
myserver | replication-group | Monday, July 6, 2020 at 8:17:59 PM UTC+3 | Test Failover API called for node group 0001
AWS customer support claims that, as long as the Redis client is Redis Cluster aware, it should be able to connect to the newly promoted master as soon as the Failover to replica node myserver-0001-002 completed event is fired (i.e. 1m and 15s after triggering the failover). Our client seems to reconnect only after the Finished recovery for cache nodes 0001 event is fired (i.e. 7m and 32s later). In the meantime we get errors like the following:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down
While the failover is taking place, the following can be seen from redis-cli:
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114396000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114396872 2 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114395000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114461262 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114460000 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 master - 0 1594114460256 0 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114458000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114509000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114510552 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114510000 9 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379@1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114508000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379@1122 master - 0 1594114548000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379@1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114548783 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379@1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114547776 9 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379@1122 myself,master - 0 1594114547000 2 connected 8192-16383
As far as I understand, the Lettuce client used by Spring Session is Redis Cluster aware, hence the RedisClusterConfiguration class in the XML configuration above. Checking the documentation, some similar questions here on SO, and Lettuce's GitHub issues page did not make it clear to me how Lettuce works in Redis Cluster mode, and specifically how it handles AWS's trickery of hiding the node IPs behind a single endpoint.
Shouldn't my configuration be enough for Lettuce to connect to the newly promoted master? Do I need to enable a different mode in Lettuce (e.g. topology refresh) for it to receive the notification from Redis and switch to the new master?
Also, how does Lettuce handle the single endpoint from AWS? Does it resolve the IPs and then use them? Are they cached?
If I want to enable reads to happen from all four nodes, is my configuration enough? In a Redis Cluster (i.e. even outside of AWS's context) when a slave is promoted to master is the client polling to get the information or is the cluster pushing it somehow to the client?
Any resources (even Lettuce source files) you have that could clarify the above as well as the different modes in the context of Lettuce, Redis, and AWS would be more than welcome.
As you can see, I am still a bit confused about all this.
Thanks a lot in advance for any help and information provided.
UPDATE
I enabled debugging and used breakpoints to intercept bean creation and configure topology refreshing that way. It seems that enabling the ClusterTopologyRefreshTask through the constructor of ClusterClientOptions does the trick:
protected ClusterClientOptions(Builder builder) {
    super(builder);
    this.validateClusterNodeMembership = builder.validateClusterNodeMembership;
    this.maxRedirects = builder.maxRedirects;
    ClusterTopologyRefreshOptions refreshOptions = builder.topologyRefreshOptions;
    if (refreshOptions == null) {
        refreshOptions = ClusterTopologyRefreshOptions.builder() //
                .enablePeriodicRefresh(DEFAULT_REFRESH_CLUSTER_VIEW) // Breakpoint here and enter to enable refreshing
                .refreshPeriod(DEFAULT_REFRESH_PERIOD_DURATION) // Breakpoint here and enter to set the refresh interval
                .closeStaleConnections(builder.closeStaleConnections) //
                .build();
    }
    this.topologyRefreshOptions = refreshOptions;
}
The topology seems to be refreshing OK, but the question now is: how do I configure this when Lettuce is used through Spring Session and not as a plain Redis client?

As I was going through my questions I realized I hadn't answered this one! So here it is, in case someone has the same issue.
What I ended up doing was creating a Java configuration class for Redis instead of using XML. The code is as follows:
@EnableRedisHttpSession
public class RedisConfig {

    private static final List<String> clusterNodes = Arrays.asList(System.getProperty("redis.endpoint"));

    // ElastiCache does not allow the CONFIG command, so tell Spring Session not to configure keyspace notifications
    @Bean
    public static ConfigureRedisAction configureRedisAction() {
        return ConfigureRedisAction.NO_OP;
    }

    @Bean(destroyMethod = "destroy")
    public LettuceConnectionFactory lettuceConnectionFactory() {
        RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
        return new LettuceConnectionFactory(redisClusterConfiguration, getLettuceClientConfiguration());
    }

    private LettuceClientConfiguration getLettuceClientConfiguration() {
        // Periodically refresh the cluster topology so a promoted replica is picked up after a failover
        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .build();
        ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
                .topologyRefreshOptions(topologyRefreshOptions)
                .build();
        return LettucePoolingClientConfiguration.builder().clientOptions(clusterClientOptions).build();
    }
}
Then, instead of registering my ContextLoaderListener through XML, I used an initializer like so:
public class Initializer extends AbstractHttpSessionApplicationInitializer {

    public Initializer() {
        super(RedisConfig.class);
    }
}
This seems to set up the topology refresh OK, but I don't know whether it is the proper way to do it! If anyone has a cleaner solution, please feel free to comment here.
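For completeness, and as a possible refinement that I have not verified in this exact setup, Lettuce can also refresh the topology adaptively (on MOVED/ASK redirects and persistent disconnects) and serve reads from replicas. A sketch of an alternative getLettuceClientConfiguration(), assuming a Lettuce version that has ReadFrom.REPLICA_PREFERRED (older versions call it SLAVE_PREFERRED) and the extra io.lettuce.core.ReadFrom import:

private LettuceClientConfiguration getLettuceClientConfiguration() {
    ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh(Duration.ofSeconds(30))   // keep the periodic CLUSTER NODES poll
            .enableAllAdaptiveRefreshTriggers()              // also refresh on MOVED/ASK and persistent disconnects
            .build();
    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
            .topologyRefreshOptions(topologyRefreshOptions)
            .build();
    return LettucePoolingClientConfiguration.builder()
            .clientOptions(clusterClientOptions)
            .readFrom(ReadFrom.REPLICA_PREFERRED)            // allow reads from replicas, falling back to masters
            .build();
}

This would also be one way to address the question above about reading from all four nodes, but treat it as a sketch rather than a tested configuration.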

Related

Kafka Multi broker setup with ec2 machine: Timed out waiting for a node assignment. Call: createTopics

I am trying to set up Kafka with 3 broker nodes and 1 ZooKeeper node on AWS EC2 instances. I have the following server.properties for each broker:
kafka-1:
broker.id=0
listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
advertised.listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
listener.security.protocol.map=,PLAINTEXT_1:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_1
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-2:
broker.id=1
listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
advertised.listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
listener.security.protocol.map=,PLAINTEXT_2:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_2
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-3:
broker.id=2
listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
advertised.listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
listener.security.protocol.map=,PLAINTEXT_3:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_3
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
zookeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
When I ran the following command in ZooKeeper, I could see that the brokers were connected.
I also telnetted from each broker to the others on the broker port, and they all connected.
However, when I try to create a topic with a replication factor of 2, I get Timed out waiting for a node assignment.
I cannot understand what is wrong with my setup: I see 3 nodes registered in ZooKeeper, but I have problems when creating a topic. By the way, when I set the replication factor to 1 I get the same error. How can I make sure that everything is alright with my cluster?
It's good that telnet checks that the port is open, but it doesn't verify that the Kafka protocol works; you could use the kcat utility for that. The fix includes the following (an example broker config is sketched after this list):
setting listeners to either PLAINTEXT://:9092 or PLAINTEXT://0.0.0.0:9092 on every broker, i.e. using the same port everywhere
removing the numeric suffix from the listener name in the listener mapping and the advertised listeners property, so that each broker uses the same listener name
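A sketch of what one broker's server.properties could look like after that change (the hostname is taken from the question; the other brokers would differ only in broker.id and their own advertised hostname, and with the default PLAINTEXT listener name the listener.security.protocol.map entry can be left out):

broker.id=0
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
inter.broker.listener.name=PLAINTEXT
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181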
I'd also recommend looking at using Ansible/Terraform/Cloudformation to ensure you consistently modify the cluster rather than edit individual settings manually
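As for the kcat check mentioned above, a quick protocol-level sanity test against each broker could look like this (hostname as in the question):

kcat -L -b ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092

This asks the broker for its metadata and lists the brokers, topics, and partitions it advertises, which exercises the Kafka protocol rather than just the TCP connection.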

How to use throttling in WSO2 EI using the client IP

I am planning to use throttling in WSO2 EI 6.4.0. I tested the scenario from my local system and ran into some problems; if anyone knows about these, please help me. Thanks in advance.
If we restart the WSO2 EI node, the policy does not work correctly: it starts counting from the beginning again (suppose the request limit is 10 per hour and the node has processed 5 requests before restarting; after restarting it should only accept the remaining 5 requests, but it accepts 10).
Throttling works at the level of a single WSO2 EI node, but suppose a Linux server is running 10 nodes; how do we distribute the throttling policy across the whole server?
How can the client IP be taken into account in throttling? If requests come through an F5 load balancer, I need to consider the originating system's IP, not the F5 server's IP.
If we restart the WSO2 EI node, the policy does not work correctly: it starts counting from the beginning again (suppose the request limit is 10 per hour and the node has processed 5 requests before restarting; after restarting it should only accept the remaining 5 requests, but it accepts 10).
The throttle mediator does not store the throttle count. Therefore, if you restart the server, the throttle count is reset and starts from zero. In a production environment frequent server restarts are not expected.
Throttling works at the level of a single WSO2 EI node, but suppose a Linux server is running 10 nodes; how do we distribute the throttling policy across the whole server?
If you want to maintain the throttle count across all the nodes, you need to cluster them. The throttle mediator uses Hazelcast cluster messages to maintain a global count across the cluster.

Orderer disconnections in a Hyperledger Fabric application

We have a Hyperledger Fabric application. The main application is hosted on AWS VMs, whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable and the availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The ZooKeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and immediately comes back online after a few seconds. This disconnection/reconnection occurs regularly, with a gap of 12 hours and 10 minutes.
We have noticed two things.
In the log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
Another thing we noticed is that in the logs prior to the VM failure event there were 3 Kafka brokers, but only 2 Kafka brokers appear in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
Additional information - we went through the Kafka logs from the day after the VM was redeployed and noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.

Kafka consumer get "Marking the coordinator dead" error when using group-ids

I have a Kafka cluster running on Kubernetes (on AWS). Each broker has a corresponding external loadbalancer (ELB) and afaict, Kafka's advertised.listeners have been set appropriately so that the ELB's DNS names get returned when clients query for broker information. Most of the setup is similar to the one mentioned here.
I created a kafka consumer without specifying any group-id. With this consumer, reading messages from a topic worked just fine. However, if I set a group-id when creating the kafka consumer, I get back the following error messages:
2018-01-30 22:04:16,763.763.313055038:kafka.cluster:140735643595584:INFO:74479:Group coordinator for my-group-id is BrokerMetadata(nodeId=2, host=u'a17ee9a8a032411e8a3c902beb474154-867008169.us-west-2.elb.amazonaws.com', port=32402, rack=None)
2018-01-30 22:04:16,763.763.804912567:kafka.coordinator:140735643595584:INFO:74479:Discovered coordinator 2 for group my-group-id
2018-01-30 22:04:16,764.764.270067215:kafka.coordinator.consumer:140735643595584:INFO:74479:Revoking previously assigned partitions set([]) for group my-group-id
2018-01-30 22:04:16,866.866.26291275:kafka.coordinator:140735643595584:INFO:74479:(Re-)joining group my-group-id
2018-01-30 22:04:16,898.898.787975311:kafka.coordinator:140735643595584:INFO:74479:Joined group 'my-group-id' (generation 1) with member_id kafka-python-1.3.5-e31607c2-45ec-4461-8691-260bb84c76ba
2018-01-30 22:04:16,899.899.425029755:kafka.coordinator:140735643595584:INFO:74479:Elected group leader -- performing partition assignments using range
2018-01-30 22:04:16,936.936.614990234:kafka.coordinator:140735643595584:WARNING:74479:Marking the coordinator dead (node 2) for group my-group-id: [Error 15] GroupCoordinatorNotAvailableError.
2018-01-30 22:04:17,069.69.8890686035:kafka.cluster:140735643595584:INFO:74479:Group coordinator for my-group-id is BrokerMetadata(nodeId=2, host=u'my-elb.us-west-2.elb.amazonaws.com', port=32402, rack=None)
my-elb.us-west-2.elb.amazonaws.com:32402 is accessible from the client. When I used kafkacat with my-elb.us-west-2.elb.amazonaws.com:32402 as the broker address, it was able to list topics, consume topics, etc.
Any ideas what might be wrong?
Marking the coordinator dead happens when there is a network communication error between the consumer client and the coordinator (it can also happen when the coordinator dies and the group needs to rebalance). A variety of situations (offset commit requests, offset fetches, etc.) can cause this issue, so to find the root cause you need to raise the logging level to TRACE/DEBUG:
logging.level.org.apache.kafka=TRACE
The problem was with 3 config settings in the server.properties that were set incorrectly.
The default minimum in-sync replicas was 2 (min.insync.replicas=2). However, the internal topic settings had a replication factor of 1 (offsets.topic.replication.factor=1).
When a consumer connected with a group-id, a corresponding entry had to be made in the __consumer_offsets topic. When this topic was updated, only a single replica was written, which threw errors that the number of in-sync replicas was below the required minimum.
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-42 is [1], below required minimum [2]
I changed the required number of in-sync replicas to 1 and things started working fine.
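A sketch of how the two broker settings could be made consistent in server.properties (values are illustrative; either lower the required in-sync replicas as described above, or replicate the internal offsets topic enough to satisfy the default):

# option used above: accept a single in-sync replica
min.insync.replicas=1
offsets.topic.replication.factor=1

# alternative: keep min.insync.replicas=2 and replicate the offsets topic accordingly
# min.insync.replicas=2
# offsets.topic.replication.factor=3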

DSE spark cluster on AWS worker and Executor ports

I am trying to set up a 6-node DSE 5.1 Spark cluster on AWS EC2 machines.
I have referred to the DSE documentation.
Just to start with, I have opened all TCP ports. When I checked the logs, I found that the worker, executor, and driver processes are using the ports below:
33xxx
33xxx
33xxx
34xxx
34xxx
34xxx
35xxx
35xxx
35xxx
36xxx
37xxx
37xxx
39xxx
40xxx
40xxx
41xxx
41xxx
43xxx
43xxx
43xxx
43xxx
45xxx
46xxx
The range here is from 33xxx to 46xxx. What is the suggested range of ports to open? Or is there any way to bind the worker and executor ports?
By default the port selection is random. See the Spark configuration docs, specifically:
spark.blockManager.port
spark.driver.port
While you can lock these down to a specific value by setting them in the SparkConf or on the CLI through Spark Submit, you need to make sure that every application has unique values so they do not collide.
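For example, a sketch of pinning these ports for a single application via spark-submit (the port numbers, class, and jar are placeholders; each concurrently running application would need its own distinct values):

dse spark-submit \
  --conf spark.driver.port=45000 \
  --conf spark.blockManager.port=45010 \
  --class com.example.MyApp my-app.jar

The security groups could then be narrowed toward those specific ports rather than the whole ephemeral range, bearing in mind the per-application uniqueness requirement above.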
In most cases it makes sense to keep the Driver in the same VPN as the Cluster.