Hadoop 3 balancer vs disk balancer

I read the Hadoop 3 documentation about the disk balancer, and it says:
"Diskbalancer is a command line tool that distributes data evenly on all disks of a datanode.
This tool is different from Balancer which takes care of cluster-wide data balancing."
I still don't understand the difference between 'balancer' and 'disk balancer'.
Could someone explain it?
Thank you!

Balancer deals with inter-node data balancing, i.e. moving blocks between the multiple datanodes of a cluster, whereas disk balancer deals with the data spread across the disks of a single datanode.
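To make the distinction concrete, these are the two command-line tools (a sketch; the exact options and the generated plan-file path vary by Hadoop release):

    # Cluster-wide: move blocks between datanodes until every node's
    # utilization is within 10% of the cluster average.
    hdfs balancer -threshold 10

    # Single node: generate a plan for one datanode's disks, then execute it
    # (datanode1.example.com is a placeholder hostname).
    hdfs diskbalancer -plan datanode1.example.com
    hdfs diskbalancer -execute <path-to-generated-plan.json>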

Related

Load balancer methodology (L4, L7) and AWS

Recently I have been trying to catch up on the load balancing internals I'm missing, and I found a great article on the topic.
But it left me with more questions than before ;)
So far I have understood that for L4 LBs we can differentiate:
Terminate type - which, for each incoming connection, creates a new connection to the backend.
Passthrough type - which can be further split into NAT or direct routing (possibly with tunneling).
One question that came to my mind is how this fits into the AWS world - what type of LB is the AWS Network Load Balancer in that scheme?
The next thing is about L7 LBs.
Does a layer 7 LB also rely on NAT or direct routing, or is it completely beyond that? While there is quite a lot of material about layer 4, layer 7 is poorly covered in terms of articles on internals - I know only the top products like HAProxy and nginx, but I still don't get the difference between them :(
I will be very thankful to anyone who can provide at least some advice on how to connect the dots :)
what type of LB is AWS Network Load Balancer in that case?
The Network Load Balancer is a layer 4 load balancer; see the information provided in the Amazon docs:
A Network Load Balancer functions at the fourth layer of the Open Systems Interconnection (OSI) model. It can handle millions of requests per second. After the load balancer receives a connection request, it selects a target from the target group for the default rule. It attempts to open a TCP connection to the selected target on the port specified in the listener configuration.
https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html
For the second question.
Does layer 7 LB also relies on NAT, or Direct routing? Or it's completely beyond that?
In the AWS world, the Application Load Balancer is the only one that supports layer 7 load balancing, and it works as described in the article:
"The client makes a single HTTP/2 TCP connection to the load balancer. The load balancer then proceeds to make two backend connections"
So the client has a direct connection to the load balancer, and that connection is terminated there and split into new connections served by the backend pools.
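To make the terminate behaviour concrete, here is a minimal sketch of a terminate-type proxy in PHP (the backend addresses are hypothetical, it serves one client at a time, and a real load balancer would be event-driven and far more robust):

    <?php
    // Terminate-type proxying: every client connection gets its own,
    // brand-new backend connection; the client's TCP connection ends here.
    $backends = ['10.0.0.10:8080', '10.0.0.11:8080']; // hypothetical
    $server = stream_socket_server('tcp://0.0.0.0:8000', $errno, $errstr);

    while ($client = stream_socket_accept($server, -1)) {
        // Open a second, independent connection toward a backend.
        $backend = stream_socket_client('tcp://' . $backends[array_rand($backends)]);

        // Pump bytes both ways until either side closes.
        while (!feof($client) && !feof($backend)) {
            $read = [$client, $backend];
            $write = $except = null;
            if (stream_select($read, $write, $except, 5) < 1) {
                continue;
            }
            foreach ($read as $sock) {
                $data = fread($sock, 8192);
                if ($data === false || $data === '') {
                    break 2;
                }
                fwrite($sock === $client ? $backend : $client, $data);
            }
        }
        fclose($backend);
        fclose($client);
    }

A passthrough LB, by contrast, never opens that second connection: it rewrites (NAT) or routes the client's own packets toward the backend.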

API Gateway/NLB/ECS Latency

I have a number of services deployed in ECS. They register with a Network Load Balancer (via a target group). The NLB is private, and is accessed via API Gateway + a VPC link.
Most of the time, requests to my services take ~4-5 seconds, but occasionally < 100ms. The latter should be the standard; the actual requests are served by my node instances in ~10ms or less. I'm starting to dig into this, but was wondering if there was a common bottleneck in setups similar to what I'm currently using.
Any insight would be greatly appreciated!
The answer to this was to enable Cross-Zone Load Balancing on my load balancers. This isn't immediately obvious, and it took two AWS support sessions to identify it as the root cause.
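For reference, cross-zone load balancing on an NLB can be enabled from the console or with a single CLI call (a sketch; substitute your own load balancer ARN):

    aws elbv2 modify-load-balancer-attributes \
        --load-balancer-arn <your-nlb-arn> \
        --attributes Key=load_balancing.cross_zone.enabled,Value=true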

Do load balancers flood?

I am reading about load balancing.
I understand the idea that load balancers spread the load among an app's several slave servers. However, very little of the literature I can find talks about what happens when the load balancers themselves start struggling with the huge number of requests, to the point that the "simple" task of load balancing (distributing requests among slaves) becomes an impossible undertaking.
Take for example this picture where you see 3 Load Balancers (LB) and some slave servers.
Figure 1: Clients know one IP to which they connect, one load balancer is behind that IP and will have to handle all those requests, thus that first load balancer is the bottleneck (and the internet connection).
What happens when the first load balancer starts struggling? If I add a new load balancer alongside the first one, I must add yet another in front of both so that clients only need to know one IP. So the dilemma continues: I still have only one load balancer receiving all my requests...!
Figure 2: I added one load balancer, but to keep clients knowing just one IP I had to add another in front to centralize the incoming connections, thus ending up with the same bottleneck.
Moreover, my internet connection will also reach the limit of clients it can handle, so I will probably want my load balancers in remote locations to avoid saturating that connection. However, if I distribute my load balancers and want clients to keep knowing just one single IP to connect to, I still need one central load balancer behind that IP carrying all the traffic once again...
How do real-world companies like Google and Facebook handle these issues? Can this be done without giving clients multiple IPs and expecting them to choose one at random, so that not every client connects to the same load balancer and floods it?
Your question doesn't sound AWS specific, so here's a generic answer (elastic LB in AWS auto-scales depending on traffic):
You're right, you can overwhelm a load balancer with the number of requests coming in. If you deploy a LB on a standard machine, you're likely to first exhaust or overload the network stack, including the maximum number of open connections and the handling rate of incoming connections.
As a first step, you would fine-tune the network stack of your LB machine. If that still does not provide the required throughput, there are special-purpose load balancer appliances on the market that are built from the ground up and highly optimized to handle a large number of incoming connections and route them to several servers. Examples of these are F5 and NetScaler.
You can also design your application in ways that help you split traffic across different subdomains, thereby reducing the number of requests one LB has to handle.
It is also possible to implement round-robin DNS, where you would have one DNS entry pointing to several client-facing LBs instead of just one as you've depicted.
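From the client side, round-robin DNS simply means that the one name resolves to several addresses. A quick way to see this (a sketch; service.example.com is a hypothetical round-robin name):

    <?php
    // List every A record behind a (hypothetical) round-robin name.
    // Repeated lookups typically return these in rotating order, which
    // spreads clients across the client-facing load balancers.
    foreach (dns_get_record('service.example.com', DNS_A) as $record) {
        echo $record['ip'], PHP_EOL;
    }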
Advanced load balancers like NetScaler and similar also do GSLB with DNS, not simple DNS round robin (to explain further scaling):
If clients are to connect to, e.g., service.domain.com, you let the load balancers become the authoritative DNS for the zone and add all the load balancers as valid name servers.
When a client looks up "service.domain.com", any of your load balancers will answer the DNS request and reply with the IP of the correct data center for that client. You can further make the load balancer answer the DNS request based on the geo location of the client, the latency between the client's DNS server and the NetScaler, or the load of the different data centers.
In each data center you typically set up one node, or several nodes in a cluster. You can scale quite high using such a design.
Since you tagged Amazon: they have load balancers built into their platform, so you don't need to run your own. Just use ELB and Amazon will direct the traffic to the correct system.
If you are doing it yourself, load balancers typically have a very light processing load. They typically do little more than redirect a connection from one machine to another based on a shallow inspection (or no inspection) of the data. It is possible for them to be overwhelmed, but typically that requires a load that would saturate most connections.
If you are running it yourself, and if your load balancer is doing more work or your connection is getting saturated, the next step is to use Round-Robin DNS for looking up your load balancers, generally using a combination of NS and CNAME records so different name lookups give different IP addresses.
If you plan to use the Amazon Elastic Load Balancer, they claim that:
Elastic Load Balancing automatically scales its request handling capacity to meet the demands of application traffic. Additionally, Elastic Load Balancing offers integration with Auto Scaling to ensure that you have back-end capacity to meet varying levels of traffic without requiring manual intervention.
so you can go with them and do not need to run the load balancer on your own instance/product.

Load balancer for PHP application

Questions about load balancers if you have time.
So I've been using AWS for some time now. Super basic instances, using them to do some tasks whenever I needed something done.
I have a task that needs to be load balanced now. It's not a public service though. It's pretty much a giant cron job that I don't want running on the same servers as my website.
I set up an AWS load balancer, but it doesn't do what I expected it to do.
It gets stuck on one server and doesn't load balance at all. I've read why it does this, and that's all fine and well, but I need it to be a serious round-robin load balancer.
edit:
I've set up the instances in different zones, but no matter how many instances I add to the ELB, it just uses one. If I take that instance down, it switches to a different one, so I know it's working. But I really would like it to always use a different one under every circumstance.
I know there are alternatives. Here's my question(s):
Would a custom PHP load balancer be an OK option for now?
I.e.: have a list of servers, and have PHP randomly select an EC2 instance (see the sketch after the structure below). It wouldn't be scalable at all, but at least I could set it up in 2 minutes and it would work for now.
or
Should I take the time to learn how HAProxy works, and set that up in place of the AWS ELB?
or
Am I doing it wrong, and AWS's ELB does do round robin and I just have something configured wrong?
edit:
Structure:
1) Web server finds a task to do.
2) If it's too large, it sends it off to AWS (to the load balancer).
3) Do the job on EC2
4) Report back via curl to an API
5) Rinse and repeat
Everything works great. But because the connection always comes from my server (one IP), it gets stuck to a single EC2 machine.
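For reference, the ad-hoc PHP approach from the first option above is only a few lines (a sketch; the hostnames and the endpoint are hypothetical):

    <?php
    // Pick a worker at random and hand the job off via curl.
    $workers = ['ec2-worker-1.example.com', 'ec2-worker-2.example.com'];
    $target  = $workers[array_rand($workers)];

    $ch = curl_init("http://$target/run-task");
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode(['task_id' => 42]));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

Because each dispatch is a fresh random pick, this never gets sticky - but it also never notices a dead worker, which is what a real load balancer's health checks buy you.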
ELB works well for sites whose load increases gradually. If you are expecting an uncommon and sudden increase in load, you can ask AWS to pre-warm it for you.
I can tell you I have used ELB in different scenarios and it has always worked well for me. As you didn't provide much information about your architecture, I would bet that ELB can work for you. Regarding all connections hitting only one server, I would ask you:
1) Did you check the ELB to see how many instances are behind it?
2) Are the instances behind the ELB all alive and healthy?
3) Are you accessing your application through the ELB DNS?
Anyway, here is an excerpt from an excellent article that makes a very good comparison between ELB and HAProxy: http://harish11g.blogspot.com.br/2012/11/amazon-elb-vs-haproxy-ec2-analysis.html
ELB provides Round Robin and Session Sticky algorithms based on EC2 instance health status. HAProxy provides variety of algorithms like Round Robin, Static-RR, Least connection, source, uri, url_param etc.
Hope this helps.
This point comes as a surprise to many users of Amazon ELB. Amazon ELB behaves a little strangely when incoming traffic originates from a single or specific IP range: it does not do round robin efficiently and sticks the requests. Under such conditions, Amazon ELB starts favoring a single EC2, or EC2s in a single Availability Zone in Multi-AZ deployments. For example: if you have Application A (customer company) and Application B, and Application B is deployed inside AWS infrastructure with an ELB front end, all the traffic generated from Application A (a single host) is sent to Application B in AWS. In this case the ELB of Application B will not efficiently round robin the traffic to the Web/App EC2 instances deployed under it. This is because the entire incoming traffic from Application A comes from a single firewall/NAT or a specific IP range of servers, and ELB will start unevenly sticking the requests to a single EC2 or to EC2s in a single AZ.
Note: users usually encounter this during load tests, so it is ideal to load test AWS infrastructure from multiple distributed agents.
More info at point 9 in the following article: http://harish11g.blogspot.in/2012/07/aws-elastic-load-balancing-elb-amazon.html
HAProxy is not hard to learn and is tremendously lightweight yet flexible. I actually use HAProxy behind ELB for the best of both worlds -- the hardened, managed, hands-off reliability of ELB facing the Internet and unwrapping SSL, and the flexible configuration of HAProxy to let me fine-tune how things hit my servers. I've never lost an HAProxy instance yet, but if I do, ELB will just take that one out of rotation... as I have seen happen when the back-end servers all became inaccessible, which (because of the way it's configured) makes ELB think the HAProxy is unhealthy - but that's by design in my setup.

Can I figure out which instance is currently used by an Elastic Load Balancer?

I have created two Amazon EC2 instances. After that I created an Elastic Load Balancer and registered the two instances in it.
Now what I would like to know is, when we use the DNS name of the load balancer, which instance will the load balancer use?
The idea of Load balancing is to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources [...].
While there are many algorithms conceivable, the general goal is to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload, which usually implies transparent distribution of the load between the load balanced resources. Therefore you usually won't know (and shouldn't need to know), which load balanced resource serves a particular request.
Accordingly, Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple Amazon EC2 instances.
How this is done specifically is a fairly complicated topic, mostly due to the ELB routing documentation falling just short of being non-existent, so one needs to assemble some pieces to draw a conclusion - see my answer to the related question Can Elastic Load Balancers correctly distribute traffic to different size instances? for a detailed analysis including all the references I'm aware of.
For the question at hand I think it boils down to the somewhat vague AWS team response from 2009 to ELB Strategy:
ELB loosely keeps track of how many requests (or connections in the case of TCP) are outstanding at each instance. It does not monitor resource usage (such as CPU or memory) at each instance. ELB currently will round-robin amongst those instances that it believes has the fewest outstanding requests. [emphasis mine]
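In code, that heuristic amounts to something like this (a sketch; the instance IDs and the outstanding-request counts are hypothetical bookkeeping, not an AWS API):

    <?php
    // Round-robin amongst the instances with the fewest outstanding
    // requests, per the description quoted above.
    function pick_instance(array $outstanding): string {
        $fewest = min($outstanding);
        $candidates = array_keys($outstanding, $fewest); // all tied at the minimum
        return $candidates[array_rand($candidates)];     // rotate among them
    }

    $outstanding = ['i-0aaa' => 3, 'i-0bbb' => 1, 'i-0ccc' => 1];
    echo pick_instance($outstanding); // i-0bbb or i-0ccc, never i-0aaa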
stf, you cannot find out which server the load is being distributed to through ELB; ELB internally takes care of the request distribution.
Of course you can figure out which server your request goes to!
On each server you are going to need something akin to a health_check.html file (it can be named anything; someone suggested index.htm, but that is a bad idea and another discussion entirely) so the load balancer can call it and determine how long it takes to get a response.
On server #1 put the following in the health_check.html file: <HTML><BODY>1</BODY></HTML>
On server #2 put this in the health_check.html file: <HTML><BODY>2</BODY></HTML>
Now when you navigate to www.YourDomain.com/health_check.html you will know exactly which server you are on.
Clear your cookies and navigate to the same URL again to see which server you get next. Good luck, cloud developer!
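If the servers run PHP anyway, you can skip maintaining a different file per server (a sketch; health_check.php is a hypothetical filename):

    <?php
    // health_check.php -- reports which server handled the request, so a
    // browser hit through the ELB shows which instance you landed on.
    echo gethostname();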