WSO2 API Manager - Strange result of failover test. How come?

WSO2 AM version: 1.10.0
I set up API Manager after reviewing the deployment pattern document (https://docs.wso2.com/display/CLUSTER44x/API+Manager+Deployment+Patterns):
1 publisher
1 store
1 Gateway manager
3 Gateway workers (clustered)
2 Load Balancers
2 Key managers (HA)
4 JMeter slaves, 1 JMeter client
Then I ran a failover test with JMeter: I killed one gateway worker node while JMeter was generating HTTP requests (gateway workers: 3 -> 2).
The result was different from what I expected (a little strange): TPS dropped close to zero for about 5 seconds when the gateway worker process shut down (i.e., when I killed it).
Graph - performance breakdown when the killed API gateway drops out of the cluster
I wonder what happens at that point.
Even if the surviving gateway workers have to recover something, I didn't expect TPS to drop like that.

I personally haven't seen this behavior. Anyway, this could be an issue in either the gateway worker cluster or the load balancer. To figure out which, you can remove the load balancer and send requests to all 3 gateway workers directly, in parallel. Then kill one worker and see if there is any TPS drop on the other workers. If there is, it's an issue with the gateway cluster; if not, the problem is with the load balancer.
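A rough sketch of that direct test in Python (the worker URLs, API path, and token below are placeholders, not anything from the original setup) that prints per-worker TPS every second, so a drop on the surviving workers is easy to spot when one worker is killed:

# Rough sketch of the suggested test (not JMeter): hit each gateway worker
# directly and print per-worker throughput every second. Worker URLs, API
# path, and token are placeholders -- substitute your own.
import collections
import ssl
import threading
import time
import urllib.request

WORKERS = [
    "https://gw-worker1:8243/myapi/1.0.0/resource",
    "https://gw-worker2:8243/myapi/1.0.0/resource",
    "https://gw-worker3:8243/myapi/1.0.0/resource",
]
TOKEN = "<access-token>"            # placeholder OAuth2 bearer token

counts = collections.Counter()      # successful responses per worker
lock = threading.Lock()
ctx = ssl.create_default_context()  # gateways often use self-signed certs,
ctx.check_hostname = False          # so skip verification for this test only
ctx.verify_mode = ssl.CERT_NONE

def hammer(url):
    req = urllib.request.Request(url, headers={"Authorization": "Bearer " + TOKEN})
    while True:
        try:
            with urllib.request.urlopen(req, context=ctx, timeout=5):
                with lock:
                    counts[url] += 1
        except Exception:
            pass                    # errors are expected on the killed worker

for url in WORKERS:
    for _ in range(10):             # 10 client threads per worker
        threading.Thread(target=hammer, args=(url,), daemon=True).start()

prev = collections.Counter()
while True:
    time.sleep(1)
    with lock:
        snapshot = counts.copy()
    for url in WORKERS:
        print(url, "TPS:", snapshot[url] - prev[url])
    prev = snapshot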

Related

AWS ECS Fargate with an Application Load Balancer

I am using ECS Fargate to execute Java code in threads. Is it possible to have an ALB stop sending requests to a Fargate task once a certain number of Java threads are running in that task? Then, as my threads finish, I would allow the ALB to send more requests.
I see that an ALB performs health checks, and if a task fails a health check the task is temporarily taken out of the target group. But I don't see a way to use this, since the health checks run on an interval (they are not checked before every request) and there is no way to manually send a fail signal.
Does anyone know a way to force the ALB to stop sending requests to a task once my thread limit in that task is reached?
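One pattern that gets part of the way there, sketched here in Python rather than the Java the question describes (all names are illustrative), is to have the task's own /health endpoint return 503 once the thread limit is reached, so the ALB health check marks the target unhealthy and pulls it from the target group. The caveat above still applies: health checks run on an interval, they do not gate every individual request.

# Illustrative sketch only: a /health endpoint that reports unhealthy (503)
# while the task is at its worker-thread limit, so the ALB health check
# takes the task out of the target group until threads free up.
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_WORKERS = 8
active_workers = 0          # would be incremented/decremented around real work
lock = threading.Lock()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            with lock:
                busy = active_workers >= MAX_WORKERS
            # 503 -> ALB marks this target unhealthy after N failed checks
            self.send_response(503 if busy else 200)
            self.end_headers()
        else:
            # ... normal request handling would track active_workers here ...
            self.send_response(200)
            self.end_headers()

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()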

How does ALB distribute requests to Fargate service during rolling update deployment?

I deploy a Fargate service in a cluster and use rolling update deployment. I configured an ALB in front of my service, and it is doing a health check as well. During the upgrade, I can see that my current task is marked as INACTIVE and the new task is deployed. Both tasks are in the running state.
I understand that the ALB is doing a health check on the newly deployed tasks, but it keeps two tasks running for 5 minutes.
I have a few questions about this deployment period of time.
Does the ALB distribute user requests to my new tasks before they pass the health check?
If the answer to the first question is no, does the ALB distribute user requests to the new tasks after they pass the health check but before the old tasks are down?
If the second answer is yes, then there will be two versions of tasks running inside my service, serving user requests, for 5 minutes. Is this true? How can I make sure it only sends requests to one version at a time?
I don't want to change the deployment method to BLUE/GREEN. I want to keep the rolling update at the moment.
ALB will not send traffic to a task that is not yet passing health checks, so no to #1. ALB will send traffic to both old and new whilst deploying, so yes to #2. As soon as a replacement task is available, ALB will start to drain the task it is replacing. The default time for that is 5 minutes. During that time the draining task will not receive traffic, so sort of no to #3. The "sort of" part is that there will be some period during which versions A and B of your service are both deployed. How long that is depends on the number of tasks and how long it takes for them to start receiving traffic.
The only way I can think of to send all traffic to one version and then hard cut over to the other is to create a completely new target group each time, keeping the old one active. Then, once the new target group is healthy, switch to it. You'd have to change the listener's routing in the ALB as you do that.
By the way, what is happening now is what I would call a rolling deployment.
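A rough boto3 sketch of that hard cut-over (the ARNs are placeholders, and it assumes the new tasks are already registered in the new target group): wait for the new targets to pass health checks, then repoint the listener in one step.

# Rough sketch of the hard cut-over described above, using boto3.
# ARNs are placeholders; the new task set is assumed to already be
# registered in the new target group.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/my-alb/..."
NEW_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-service-v2/..."

# Wait until the new targets pass their health checks...
elbv2.get_waiter("target_in_service").wait(TargetGroupArn=NEW_TG_ARN)

# ...then switch all traffic to the new target group in a single step.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": NEW_TG_ARN}],
)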

ASP.NET SignalR not working on a load balancer with 2 instances (AWS ELB)

I have a SignalR application for real-time support. It is hosted on AWS.
It works fine if the environment has a load balancer with a single instance.
If I increase the instances to 2 under the load balancer, SignalR starts behaving intermittently: only some broadcast messages, say 4 out of 10, get delivered to the subscribers.
Session stickiness - 1 day
Timeout - 10 min.
The machine key is set to the same value on the API, Customer app, and Admin support app (the API hosts the Hub; Admin support and Customer are the clients).
Has anyone had a similar case, and how did you overcome it?

AWS classic LB changing IPs/dropping connections results in lost messages on RabbitMQ

I run a RabbitMQ HA cluster with 3 nodes and a classic AWS load balancer (LB) in front of them. There are two apps: one publishes and the other consumes through the LB.
When the publisher app starts sending 3 million messages, after a short period of time its connection is put into the Flow Control state. After publishing finishes, the publisher app logs show that all 3 million messages were sent. On the other hand, the consumer app log only shows 500K - 1M messages (it varies between runs), which means that a large number of messages are lost.
So what is happening is that in the middle of a run, the classic LB decides to change its IP address or drop connections, thus losing a lot of messages (see my update below for more details).
The issue does not occur if I skip the LB and hit the nodes directly, doing the load balancing on the app side. Of course, in this case I lose all the benefits of the ELB.
My questions are:
Why is the LB changing IP addresses and dropping connections? Is that related to the high message rate from the publisher or to the Flow Control state?
How can I configure the LB so that this issue doesn't occur?
UPDATE:
This is my understanding of what is happening:
I use AMQP 0-9-1 and publish without 'publish confirms', so a message is considered sent as soon as it's put on the wire. Also, the connection on the RabbitMQ node is between the LB and the node, not between the publisher app and the node.
Before the communication enters Flow Control, messages are passed from the LB to a node immediately.
Then the connection between the LB and the node enters Flow Control; the publisher app's connection is not blocked, so it continues to publish at the same rate. That causes messages to pile up on the LB.
Then the LB decides to change IP(s) or drop the connection for whatever reason and creates a new one, causing all the piled-up messages to be lost. This is clearly visible in the RabbitMQ logs:
=WARNING REPORT==== 6-Jan-2018::10:35:50 ===
closing AMQP connection <0.30342.375> (10.1.1.250:29564 -> 10.1.1.223:5672):
client unexpectedly closed TCP connection
=INFO REPORT==== 6-Jan-2018::10:35:51 ===
accepting AMQP connection <0.29123.375> (10.1.1.22:1886 -> 10.1.1.223:5672)
The solution is to use an AWS Network LB. The Network LB creates a connection between the publisher app and the RabbitMQ node, so if the connection is blocked or dropped, the publisher is aware of it and can act accordingly. I have run the same test with 3M messages and not a single message was lost.
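This was not the fix chosen above, but for reference, enabling publisher confirms in a Python (pika) publisher looks roughly like the following; with confirms on, a message swallowed by a dropped LB connection surfaces as an error at the publisher instead of being considered sent as soon as it hits the wire. Host and queue names are placeholders.

# Sketch (not the fix chosen above): publishing with publisher confirms in
# pika, so unconfirmed messages are reported instead of lost silently.
# Host and queue names are placeholders.
import pika
import pika.exceptions

conn = pika.BlockingConnection(pika.ConnectionParameters(host="rabbit-lb.example.com"))
channel = conn.channel()
channel.queue_declare(queue="events", durable=True)
channel.confirm_delivery()  # enable publisher confirms on this channel

for i in range(3_000_000):
    try:
        channel.basic_publish(
            exchange="",
            routing_key="events",
            body=("message %d" % i).encode(),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
            mandatory=True,
        )
    except (pika.exceptions.UnroutableError, pika.exceptions.NackError):
        # the broker did not confirm the message -- log/retry it instead of
        # silently losing it
        print("message", i, "was not confirmed")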
In the AWS docs, there's this line which explains the behaviour:
Preserve source IP address: Network Load Balancer preserves the client side source IP, allowing the back-end to see the IP address of the client. This can then be used by applications for further processing.
From: https://aws.amazon.com/elasticloadbalancing/details/
ELBs will change their addresses when they scale in reaction to traffic. New nodes come up and appear in DNS, and then the old nodes may eventually go away, or they may stay online.
It increases capacity by utilizing either larger resources (resources with higher performance characteristics) or more individual resources. The Elastic Load Balancing service will update the Domain Name System (DNS) record of the load balancer when it scales so that the new resources have their respective IP addresses registered in DNS. The DNS record that is created includes a Time-to-Live (TTL) setting of 60 seconds, with the expectation that clients will re-lookup the DNS at least every 60 seconds. (emphasis added)
— from “Best Practices in Evaluating Elastic Load Balancing”
You may find more useful information in that "best practices" guide, including the concept of pre-warming a balancer with the help of AWS support, and how to ramp up your test traffic in a way that lets the balancer's scaling keep up.
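To make the TTL point concrete: a client that resolves the balancer's hostname once and caches that IP forever will keep pointing at a node that may no longer exist. A small sketch of re-resolving on the 60-second TTL (the hostname and port are placeholders):

# Sketch: periodically re-resolve the ELB's DNS name (placeholder hostname)
# instead of caching a single IP forever, since ELB nodes and their IPs
# change as the balancer scales.
import socket
import time

ELB_HOSTNAME = "my-classic-elb-1234567890.us-east-1.elb.amazonaws.com"

while True:
    addrs = sorted({info[4][0] for info in
                    socket.getaddrinfo(ELB_HOSTNAME, 5672, type=socket.SOCK_STREAM)})
    print(time.strftime("%H:%M:%S"), "ELB currently resolves to:", addrs)
    time.sleep(60)  # matches the 60-second TTL mentioned above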
The behavior of a classic ELB is automatic, and not configurable by the user.
But it also sounds as if you have configuration issues with your queue, because it seems like it should be more resilient to dropped connections.
Note also that an AWS Network Load Balancer does not change its IP addresses and does not need to scale by replacing resources the way ELB does, because unlike ELB, it doesn't appear to run on hidden instances -- it's part of the network infrastructure, or at least appears that way. This might be a viable alternative.

Blue-green deployment on Web Service with WebSocket Implementation on AWS

I'm currently looking to implement WebSockets for a couple of web services, but I was wondering how these stateful HTTP connections will impact blue-green deployments and auto-scaling on AWS.
I googled around but haven't come across anything. I would appreciate any advice or input.
Use connection draining (sending all new requests to your desired environment, green for example) and give the blue clients time to fall off.
You can set the max lifetime of your WebSocket connections (the connection draining period should be longer than that max if that kind of reliability is needed).
Otherwise, I would just handle it client side: if the WebSocket drops, initiate a new connection through your AWS ELB to a healthy server. Do not keep any state on your ephemeral ELB backends. This also works when scaling down on AWS.
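A minimal sketch of that client-side handling using the Python websockets package (the URL is a placeholder): whenever the socket drops, reconnect through the ELB and let it route to a healthy backend.

# Sketch of the client-side approach: if the websocket drops (e.g. because
# its backend was drained during a blue/green swap or scale-in), reconnect
# through the load balancer, which routes to a healthy server.
# The URL is a placeholder; requires the third-party "websockets" package.
import asyncio
import websockets

WS_URL = "wss://my-elb.example.com/socket"

async def run():
    while True:
        try:
            async with websockets.connect(WS_URL) as ws:
                async for message in ws:
                    print("received:", message)
        except (websockets.ConnectionClosed, OSError):
            # connection dropped or refused -- back off briefly and reconnect
            await asyncio.sleep(2)

asyncio.run(run())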