wso2elb services mapping not working with all configured hostnames - wso2

I'm currently deploying some wso2as cluster, and am facing a strange problem with URL mapping.
I have setup two worker nodes (named was0 and was1), a manager node (named mgt) and an ELB (named elb).
The installation seems working fine, as I'm able to call URL mapped on load balancer like the following : http://was0.domain/services/... , was0.domain being mapped on the load balancer IP on the station accessing this address (outside the cluster).
When I call services on this endpoint, I'm able to load balance as I can notice that my wsdl has enpoints based on was0 and was1. The two worker nodes are pretty detected as application nodes on the ELB.
The problem I encounter however is that when I use was0 based URL, it works fine, but when I try to use the was1 one, the load balancer returns a blank page, and I don't notice any error in logs. I have both hosts was1 and was0 defined in my cluster configuration as application members for AS.
If I try from the ELB node to access the was1 based webservice directly on the WAS, I'm able to access it without problem (so the service is working on was1 node, and this node is also detected and registered inside the cluster, but not accessible through cluster).
Finally, this results in one call working when round robin targets was0, and one call not working when targetting was1.
So I'm currently wondering if I understood well the cluster behavior: should it work for both application servers mapped URL, or is it normal to have only the first was0 responding with success? How could I force generated WSDL to return a valid endpoint URL?
What I understood by reading documentation is that I need mapping WAS URLs on the ELB, and that this one will then balance on all WAS servers, but it doesn't seem working like that.
Please tell me if you need some configuration part, diagram or example, I didn't paste it here because it's quite big :)
For information, I had the same problem when balancing through 2 wso2esb worker nodes, but was able to solve it by forcing WSDL URLs prefix by the first node URL (esb0) with the WSDLEPRPrefix in ESB configuration. As I don't have a such setting in wso2as, I don't know how accessing the URL returned in WSDL.
Thank you by advance for your help,
BOUCNIAUX Benjamin

Related

Kubernetes: How to connect one pod to another on an arbitrary port - with or without services?

We are currently transitioning our apps to Kubernetes and I have two apps, appP and appH, that I need to communicate with each other over a port unknown at start up time.
Unlike most of our apps, we don't have a set port for them will to communicate over. Before Kubernetes, third party app (out of my control) would tell appP to start processing an item, itemA, identified with a unique id and it would also tell appH to handle the processed data produced by appP.
To coordinate communications between appP and appH, appH would generate a port based on the unique id and publish the host and port info to connect on to an intermediate app (IA). appP, once done with it's processing queries IA for the connection information based on the unique id and sends it over.
Now we have to adapt this to kubernetes. Each app runs in its own deployment, as does the IA. So how can I setup appH to accept the connection over a port without being able to specify it in the service definition?
Note: I've seen some posts say that pods should be able to communicate to any other pods in the cluster regardless of specifying the ports in the service definition but I can't seem to find a ton of confirming information on this and I don't have a ton of time on our cluster where it is free to bang my head against.
Would it would just fine as is regardless? My biggest worry is the ip resolution. Currently appH grabs its ip based on the host it's running on (using boost). Not sure how this resolves within a container.
If not, my next thought would be if I could setup a headless service with selector for appH in order to allow for ip resolution. What I am unsure of then is if I could have appP connect to <appH_Service>:<arbitrary_port>?
Would the service even have to be headless in this scenario? I mostly say headless w/ selector because I saw in one specific post that it is the only one you don't need a port in the spec for it. Also because I am unsure if the connection would go through unless it was the actual pod's ip it was connecting with, rather than the services.
Any info or clarification is appreciated. For the most part, I can't really change the architecture of these apps right now, I just have to get them talking to each other as is and haven't found a ton of clear information on this type of case.
Note: We use helm and coredns if anyone is curious.
The Kubernetes networking model is as follows: a Pod is a group of containers that share a single network identity (a cluster IP). Any port exposed by a container is thus automatically exposed on the Pod. The model demands that each Pods can communicate with other Pods.
This means that your current design can work without modifications.
What Services bring to the table is that you can bring a stable network identity to a group of Pods that is otherwise very volatile. It does not apply to your appP/appH coupling, I think.

Webpage resource request stalled for nearly a minute in Chrome

A resource on my webapp takes nearly a minute to load after a long stall. This happens consistently. As shown below, only 3 requests on this page actually hit the server itself, the rest hit the memory or disk cache. This problem only seems to occur on Chrome, both Safari and Firefox do not exhibit this behavior.
I have implemented the Cache-Control: no-store suggestion in this SO question but the problem persists. request stalled for a long time occasionally in chrome
Also included below is an example of what the response looks like once it finally does come in.
My app is hosted in AWS behind a Network Load Balancer which proxies to an EC2 instance running nginx and the app itself.
Any ideas what is causing this?
I encountered the exact same problem. We are using Elastic Beanstalk with Network Load Balancer (NLB) with TLS termination at NLB.
The feedback I got from AWS support is:
This problem can occur when a client connects to a TLS listener on a Network Load Balancer and does not send data immediately after completing the TLS handshake. The root cause is an edge case in the handling of new connections. Note that this only occurs if the Target Group for the TLS listener is configured to use the TCP protocol without Proxy Protocol v2 enabled
They are working on a fix for this issue now.
Somehow this problem can only be noticed when you are using Chrome browser.
In the meantime, you have these 2 options as workaround:
enable Proxy Protocol v2 on the Target Group OR
configure the Target Group to use TLS protocol for routing traffic to the targets
I know it's a late answer but I write it for someone seeking a solution.
TL;DR: In my case, enabling cross-zone load balancing attribute of NLB solved the problem.
With investigation using WireShark I figured out there were two different IPv4 addresses Chrome communicated with.
Sending packets to one of them always succeeded and to the other always failed.
Actually the two addresses delegated two Availability Zones.
By default, cross-zone load balancing is disabled if you choose NLB (on the contrary the same attribute of ALB is enabled by default).
Let's say there are two AZs; AZ-1 / AZ-2.
When you attach both AZs to a NLB, it has a node for each AZ.
The node belongs to AZ-1 just routes traffic to instances which also belong to AZ-1. AZ-2 instances are ignored.
My modest app (hosted on Fargate) has just one app server (ECS task) in AZ-2 so that the NLB node in AZ-1 cannot route traffic to anywhere.
I'm not familiar with TCP/IP or Browser implementation but in my understanding, your browser somehow selects the actual ip address after DNS lookup.
If the AZ-2 node is selected in the above case then everything goes fine, but if the AZ-1 is selected your browser starts stalling.
Maybe Chrome has a random strategy to select ip while Safari or Firefox has a sticky one, so that the problem only appears on Chrome.
After enabling cross-zone load balancing the ECS task on AZ-2 is visible from the AZ-1 NLB node, and it works fine with Chrome browser too.
(Please feel free to correct my poor English. Thank you!)
I see two things that could be responsible for delays:
1) Usage of CDNs
If the resources that load slow are loaded from CDNs (Content Delivery Networks) you should try to download them to the server and deliver directly.
Especially if you use http2 this can be a remarkable gain in speed, but also with http1. I've no experience with AWS, so I don't know how things are served there by default.
It's not shown clearly in your screenshot if the resources are loaded from CDN but as it's about scripts I think that's a reasonable assumption.
2) Chrome’s resource scheduler
General description: https://blog.chromium.org/2013/04/chrome-27-beta-speedier-web-and-new.html
It's possible or even probable that this scheduler has changed since the article was published but it's at least shown in your screenshot.
I think if you optimize the page with help of the https://www.webpagetest.org and the chrome web tools you can solve any problems with the scheduler but also other problems concerning speed and perhaps other issues too. Here is the link: https://developers.google.com/web/tools/
EDIT
3) Proxy-Issue
In general it's possible that chrome has either problems or reasons to delay because of the proxy-server. Details can't be known before locking at the log-files, perhaps you've to adjust that log-files are even produced and that the log-level is enough to tell you about any problems (Level Warning or even Info).
After monitoring the chrome net-export logs, it seems as though I was running into this issue: https://bugs.chromium.org/p/chromium/issues/detail?id=447463.
I still don't have a solution for how to fix the problem though.

AWS, Load Balancer 504 error after a few requests

I am repeating a question that I posted at https://forums.aws.amazon.com/thread.jspa?threadID=275855&tstart=0
to reach out more people.
Hi,
I am trying to deploy a REST service in AWS. The current architecture is:
Domain name (Route 53) -> Load Balancer -> Single EC2 instance (bound to an Elastic IP). And I use TLS/SSL certificate issued by a Certificate Manager.
The instance is Ubuntu 16.04 machine, and the service is implemented with (bare) Vert.X (==no proxy server).
However, 504 Error (gateway timeout) occurs after a few different requests (each of which takes <1s) in a series, and then it does not respond. The requests do not reach the server instance after a few requests. I checked that it happens in the same way when I access both the domain name and the load balancer directly. I have confirmed that the exact same scenario is working with direct URL.
I run up a dummy server returning "hello world" and it's working okay with the load balancer. The problem should be caused by something no coherent between the load balancer and the server code, but I can't get where to start.
I have checked several threads complaining the 504 errors, and followed some of the instructions, but they do not work. Especially I set keep-alive option in Vert.x and set the idle time longer than the balancer's. As the delays are not longer than the idel time with the direct communication, I believe it is not the problem anyway. I have checked the Security Groups also and confirmed the right ports are open. (The first few requests are working, so it must not be the problem also.)
Does any of you have a sense where I should start looking at? Even better, know the source of the problem?
Thanks in advance.
EDIT: I just found the issue in some of the code. I've answered myself below. Thanks for reading!
Found the issue in my code. Some of the APIs (implemented by my colleague...) was not flushing the buffer of HTTP responses in the server.
In Vert.X Java, it was resp.end().
It was somehow working with direct access probably the buffer was flushed at some point, but that flush seems not caught by the load balancer.
Hope nobody experiences this, but in case...

AWS Application Load Balancer (ALB) path based routing not functioning as expected

I am working on a POC to prove out AWS path based routing through an Application Load Balancer to a set of very basic "hello world" node.js applications using express. Without the path based routing in place and having multiple listeners, 1 listener for each application, each respective listener and application is working as expected. Therefore, the targets within the Target Groups have both passed health checks and are shown as healthy. However, when I switch to the path based routing implementation on 1 of the listeners (deleting the other unnecessary listener) I get the following error for both applications:
Cannot GET /expressapp
Cannot GET /expressapp2
I have gone through the following documentation to try to figure out the issue:
http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html#path-conditions
What am I missing? Any troubleshooting ideas?
I believe that you are getting this error because the services in question do not expect to receive paths prefixed with /expressapp and /expressapp2. When the ALB forwards traffic to your service, the path remains intact.
Stripping off the prefix cannot be handled by ALB. If you don't have access to the source code of the apps, you will need to use some kind of reverse-proxy like nginx to rewrite the urls before sending them onto the app.
If you have access to the source code, express supports changing the base url without modifying the code. You can read a value for the url prefix in as an environment variable and configure your respective service environments accordingly.
I would flip both rules from their respective positions I.e make expressapp2 rule #1 and express app rule #2 for it to work like you want it to.
The ALB evaluates these rules in order of priority and even though the context path is expressapp2 it still matches expressapp and the first rule is evaluated.

designing for failure when polling/retrying/queueing is not possible

I'm designing a website/web service to be hosted in the cloud (specifically AWS although that's mostly irrelevant), and I'm spending a lot of time thinking about "designing for failure". I want my system to seamlessly handle node failures, i.e. without any significant user impact or engineer intervention.
In most cases, it's easy to see how to handle sudden node failure. If my app has an API handled by 4 servers behind a load balancer, polled by AJAX or an iPhone app, the poller can simply detect the failed TCP/IP transmission and retry... assuming the load balancer behaves correctly, it will hit a healthy instance.
If the app is more processing-oriented, a queue service like SQS can be used to allow stateless nodes to pick up where the failed nodes left off.
The difficulty I see is with "points of entry", where no retry/polling is possible because the application hasn't been loaded yet, and where a failure means the app never starts. For example, the index.html on a webpage... if a node fails while transmitting that file, the user's browser will likely hang and not automatically retry (they will need to refresh).
The Load Balancer is also a single "point of entry/failure". However, in this case it appears we can solve the problem by creating multiple Load Balancers, and load balancing them using DNS Load Balancing as described here: http://blog.rightscale.com/2012/10/23/dns-load-balancing-and-using-multiple-load-balancers-in-the-cloud/
Is this a solution that would work for the simpler index.html case? Overall, how can we create redundancy where polling/retrying/queuing is not possible?
EDIT: Another idea is to have the index.html hosted statically on a CDN, S3, etc (where resource availability is more dependable), although that prevents using dynamic content. Dynamic content could be added if the page populates itself using JS, but that adds a dependency on JS as well as latency for the user.