Redundancy without a central control point?

Is it possible to provide a service to multiple clients whereby, if the server providing this service goes down, another one takes its place, without some sort of centralised "control" that detects whether the main server has gone down and redirects the clients to the new server?
Is it possible to do this without having a centralised interface/gateway?
In other words, it's a bit like asking: can you design a load balancer without a centralised control that directs clients?

Well, you are not giving much information about the "service" you are asking about, so I'll answer in a generic way.
For the first part of my answer, I'll assume you are talking about a "centralized interface/gateway" involving IP addresses. For this, there is CARP (Common Address Redundancy Protocol). Quoting Wikipedia:
The Common Address Redundancy Protocol or CARP is a protocol which
allows multiple hosts on the same local network to share a set of IP
addresses. Its primary purpose is to provide failover redundancy,
especially when used with firewalls and routers. In some
configurations CARP can also provide load balancing functionality. It
is a free, non patent-encumbered alternative to Cisco's HSRP. CARP is
mostly implemented in BSD operating systems.
Quoting NetBSD's "Introduction to CARP":
CARP works by allowing a group of hosts on the same network segment to
share an IP address. This group of hosts is referred to as a
"redundancy group". The redundancy group is assigned an IP address
that is shared amongst the group members. Within the group, one host
is designated the "master" and the rest as "backups". The master host
is the one that currently "holds" the shared IP; it responds to any
traffic or ARP requests directed towards it. Each host may belong to
more than one redundancy group at a time.
This might solve your question at the network level: the backups take over the shared IP address in turn, without a single point of failure.
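To give a feel for the configuration, here is a hedged sketch in OpenBSD-style ifconfig syntax (the interface names, password, and shared address are all placeholders; consult your system's carp(4) documentation for the exact options):

# on the master: every group member uses the same vhid and password;
# the lowest advskew wins the election for the shared address
ifconfig carp0 create
ifconfig carp0 vhid 1 pass mysecret carpdev em0 advskew 0 192.168.1.100 netmask 255.255.255.0

# on the backup: higher advskew, so it stays passive until the master dies
ifconfig carp0 create
ifconfig carp0 vhid 1 pass mysecret carpdev em0 advskew 100 192.168.1.100 netmask 255.255.255.0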
Now, for the second part of the answer (the application level): with distributed Erlang, you can have several nodes (a cluster) that give you fault tolerance and redundancy (so you would not use IP addresses here, but "distributed Erlang", a cluster of Erlang nodes, instead).
You would have a number of nodes with your distributed application started, and your application resource file would contain an ordered list of nodes where the application can run.
Distributed Erlang controls which of the nodes is "the master" and automagically starts and stops your application on the different nodes as they go up and down.
Quoting (as little as possible) from http://www.erlang.org/doc/design_principles/distributed_applications.html:
In a distributed system with several Erlang nodes, there may be a need
to control applications in a distributed manner. If the node, where a
certain application is running, goes down, the application should be
restarted at another node.
The application will be started at the first node, specified by the
distributed configuration parameter, which is up and running. The
application is started as usual.
For distribution of application control to work properly, the nodes
where a distributed application may run must contact each other and
negotiate where to start the application.
When started, the node will wait for all nodes specified by
sync_nodes_mandatory and sync_nodes_optional to come up. When all
nodes have come up, or when all mandatory nodes have come up and the
time specified by sync_nodes_timeout has elapsed, all applications
will be started. If not all mandatory nodes have come up, the node
will terminate.
If the node where the application is running goes down, the
application is restarted (after the specified timeout) at the first
node, specified by the distributed configuration parameter, which is
up and running. This is called a failover.
distributed = [{Application, [Timeout,] NodeDesc}]
If a node is started, which has higher priority according to
distributed, than the node where a distributed application is
currently running, the application will be restarted at the new node
and stopped at the old node. This is called a takeover.
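To make that concrete, here is a minimal sketch of the kernel configuration as it could appear in a release's sys.config, for a hypothetical application myapp that prefers node a@host1 and fails over to b@host2 or c@host3 after a 5000 ms timeout (all node names and timeouts here are placeholders):

[{kernel,
  [{distributed, [{myapp, 5000, ['a@host1', {'b@host2', 'c@host3'}]}]},
   {sync_nodes_mandatory, ['b@host2']},
   {sync_nodes_optional, ['c@host3']},
   {sync_nodes_timeout, 30000}]}].

With this file on all three nodes, myapp starts on a@host1 when it is up; if a@host1 dies, the application fails over to b@host2 or c@host3 (equal priority, hence the tuple), and when a@host1 comes back a takeover moves it home again.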
Ok, that was meant as a general overview, since it can be a long topic :)
For the specific details, it is highly recommended to read the Distributed OTP Applications chapter of Learn You Some Erlang (and of course the previous link: http://www.erlang.org/doc/design_principles/distributed_applications.html).
Also, your "service" might depend on other external systems like databases, so you should consider fault tolerance and redundancy there, too. The whole architecture needs to be fault tolerant and distributed for "the service" to work in this way.
Hope it helps!

This answer is a general overview of high availability for networked applications, not specific to Erlang. I don't know much about what is available in the OTP framework yet because I am new to the language.
There are a few different problems here:
Client connection must be moved to the backup machine
The session may contain state data
How to detect a crash
Problem 1 - Moving client connection
This may be solved in many different ways and at different layers of the network architecture. The easiest approach is to code it right into the client, so that when a connection is lost it reconnects to another machine.
If you need network transparency, you may use some technology to sync TCP states between different machines and then reroute all traffic to the new machine, which can be entirely invisible to the client. This is much harder to do than the first suggestion.
I'm sure there are lots of things to do in-between these two.
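As an illustration of the first (client-side) approach, here is a minimal Erlang sketch, with hypothetical hostnames and port: the client simply walks an ordered list of known servers until one accepts the connection.

-module(failover_client).
-export([connect/0, connect/1]).

connect() ->
    connect([{"host1.example", 5000}, {"host2.example", 5000}]).

%% Try each known server in order, with a 2000 ms connect timeout each.
connect([]) ->
    {error, no_servers_available};
connect([{Host, Port} | Rest]) ->
    case gen_tcp:connect(Host, Port, [binary, {active, false}], 2000) of
        {ok, Socket} -> {ok, Socket};
        {error, _Reason} -> connect(Rest)
    end.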
Problem 2 - State data
You obviously need to transfer the session state from the crashed machine onto the backup machine. This is really hard to do reliably, and you may lose the last few transactions because the crashed machine may not be able to send the last state before the crash. You can use a synchronized call sequence like this (sketched in code below) to be really sure about not losing state:
Transaction/message comes from the client into the main machine.
Main machine updates some state.
New state is sent to backup machine.
Backup machine confirms arrival of the new state.
Main machine confirms success to the client.
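A rough Erlang sketch of that call sequence, assuming the backup machine runs a globally registered process named backup that answers ok once it has stored the state (all names here are hypothetical):

-module(sync_primary).
-behaviour(gen_server).
-export([start_link/0, submit/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Step 1: the client's transaction arrives at the main machine.
submit(Tx) -> gen_server:call(?MODULE, {tx, Tx}).

init([]) -> {ok, []}.

handle_call({tx, Tx}, _From, State0) ->
    State1 = [Tx | State0],  %% step 2: main machine updates its state
    %% steps 3 and 4: send the new state to the backup, block on its ack
    ok = gen_server:call({global, backup}, {replicate, State1}),
    {reply, ok, State1};     %% step 5: only now confirm success to the client
handle_call(_Other, _From, State) ->
    {reply, {error, unknown_call}, State}.

handle_cast(_Msg, State) -> {noreply, State}.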
This may potentially be expensive (or at least not responsive enough) in some scenarios, since before confirming anything to the client you depend on the backup machine and the connection to it, including its latency. To make it perform better, you can let the client check with the backup machine upon connection which transactions it received, and then resend the lost ones, making it the client's responsibility to queue the work.
Problem 3 - Detecting a crash
This is an interesting problem because a crash is not always well-defined. Did something really crash? Consider a network problem that severs the connection between the client and server while both are still up and connected to the network. Or worse, one that makes the client disconnect from the server without the server noticing. Here are some questions to think about:
Should the client connect to the backup machine?
What if the main server updates some state and sends it to the backup machine while the backup has the real client connected? Will there be a data race?
Can both the main and backup machine be up at the same time or do you need to shut down work on one of them and move all sessions?
Do you need some sort of authority on this matter, some protocol to decide which one is master and which one is slave? Who is that authority? How do you decentralise it?
What if your nodes lose the connection between them but both continue to work as expected (a situation called network partitioning)?

See Google's papers "The Chubby Lock Service" (PDF) and "Paxos Made Live" (PDF) to get an idea.
Briefly, this solution involves using a consensus protocol to elect a master among a group of servers; the master then handles all the requests. If the master fails, the protocol is used again to elect the next master.
Also, see gen_leader for an example of leader election that handles failure detection and transfer of service ownership.
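As a toy illustration only (this is not Paxos: it ignores partitions and nodes disagreeing about who is reachable), each node could sort the known node list and treat the lowest reachable node as the master:

-module(toy_election).
-export([elect_master/1]).

%% Ping the sorted candidate list and pick the first node that answers.
elect_master(Nodes) ->
    Alive = [N || N <- lists:sort(Nodes), net_adm:ping(N) =:= pong],
    case Alive of
        [Master | _] -> {ok, Master};
        [] -> {error, no_nodes_alive}
    end.

A real consensus protocol exists precisely to handle the cases this sketch gets wrong, such as two partitioned nodes each concluding that they are the master.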

Related

Blockchain - How do implementations do peer-to-peer discovery?

I'm currently learning how blockchain works, just out of personal curiosity. I'm going through this course and have now set up the peer-to-peer connectivity using web sockets. Multiple instances of the blockchain application can now run and communicate with each other using these sockets.
The one downside of the course implementation is how the instances know how to find each other: essentially, they need to be explicitly configured to communicate. In my current project, I have it set up with 3 instances. One opens a socket on port 5001 and connects to nothing else. The other opens a socket on port 5002 and connects to the instance on 5001. And the third opens a socket on port 5003 and connects to the instances on 5002 and 5001.
The point is, all three are explicitly configured like that, and all three must be started in exactly that order so they can properly connect to the others. While this is fine for a practice implementation, I know that's not how real blockchain implementations work out there in the wild. There must be some discovery mechanism that allows any of these instances to locate whichever others are currently running.
Networking is not my area of expertise, so I'm at a loss on how this could be done.
P2P cryptocurrency clients usually ship with a hardcoded list of peers, managed by the community. If you start your client for the first time, these peers are all you have.
When you connect to another node, it saves your IP in its internal list, and any node can request this list from another node. In this way your client can discover other nodes. The client application saves its list of nodes to disk, so on the next startup you have the hardcoded nodes plus the nodes you were connected to last time. Some of them may be offline, but that is OK.
A more detailed explanation for the case of Bitcoin: https://developer.bitcoin.org/devguide/p2p_network.html#peer-discovery
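A minimal Erlang sketch of that idea, with hypothetical seed addresses and a stubbed-out request function (a real client would also persist the merged list to disk):

-module(peer_discovery).
-export([discover/0]).

-define(SEEDS, [{"seed1.example.net", 5001}, {"seed2.example.net", 5001}]).

discover() -> discover(?SEEDS, sets:new()).

%% Walk the frontier: ask every newly learned peer for its own peer list.
discover([], Known) ->
    sets:to_list(Known);
discover([Peer | Rest], Known) ->
    case sets:is_element(Peer, Known) of
        true -> discover(Rest, Known);
        false ->
            NewPeers = ask_for_peers(Peer),   %% a "getaddr"-style request
            discover(Rest ++ NewPeers, sets:add_element(Peer, Known))
    end.

%% Stub: a real client would send the request over an open connection
%% and decode the peer addresses in the reply.
ask_for_peers(_Peer) -> [].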

Communication between Amazon Lambda and Windows Application

I am a newbie to AWS and cloud computing in general, so I apologize if this question is foolish.
I am currently working on developing an app for Amazon Echo that would allow it to remotely control a PC (i.e. change volume, pause a movie, etc.). My problem is that I do not know how to communicate between my Amazon Lambda service and my Windows Application.
Any ideas?
There are potentially some problems with the way you have posed the question -- how to communicate between a Lambda Function and a Windows machine could involve a number of different solutions, but what you are looking for (as far as I can tell) is a more specific -- yet simultaneously more generalizable -- solution.
Are you trying to actually make an Alexa skill that users could use, or just something for yourself? It makes a big difference, because for just yourself there are a number of hacky solutions you could implement, like port forwarding and dynamic DNS, which fail dramatically if you try to do them in the real world. You need another component -- some kind of real-time push messaging -- that bridges between an "agent" in your Windows app and requests emitted by your Lambda code.
Your actual problem to solve is not so much how to communicate between AWS Lambda and a Windows Application, but rather one of a need for understanding how a platform like Alexa needs to communicate with a "smart home" device, specifically an entertainment device.
It is a relatively complicated undertaking, because -- fundamentally -- there is no way of communicating directly between Lambda and an arbitrary device out on the Internet. Dynamic IP addresses, network address translation (NAT), firewalls, security considerations, and other factors make it impossible to reliably initiate a connection from a Lambda function (or indeed from any Internet connected device) to any other arbitrary destination device. Most devices (my phone, my Alexa-controlled light switch, my Windows laptop) are running behind a boundary that assumes requests are initiated behind the boundary. When I open web sites, stream video, etc., I initiate the request and the response returns on the channel (often a TCP connection) that I have created, from behind my boundary (e.g. the router in my cable modem) that doesn't allow external initiation of TCP connections. They are bidirectional once established, but must be initiated from inside.
Of course, you can statically "poke a hole" in your router configuration by forwarding a specific TCP port to a specific internal (usually private) IP address, which works as long as your Internet provider doesn't change your IP address and your internal device doesn't get a new IP address... and there's UPnP NAT traversal, which seems like a good solution until you realize that it is also terrible (though for a "hobbyist" application, it could work).
While this is a long and complex topic, the short answer is that Alexa, via Lambda code, is only capable of initiating connections, and your device, wherever it may be, is only capable of initiating connections -- not receiving them... and thus you need some kind of "meet in the middle" solution: something that allows the device to maintain its "connection" to a central "service" that can coordinate the interactions on demand.
For example:
AWS IoT Core is a managed cloud platform that lets connected devices easily and securely interact with cloud applications and other devices. AWS IoT Core can support billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely. With AWS IoT Core, your applications can keep track of and communicate with all your devices, all the time, even when they aren’t connected.
https://aws.amazon.com/iot-core/
The client initiates the connection (e.g. via a web socket) to the IoT platform and maintains it, so that when a message arrives at IoT, the service knows how to deliver that message to the client. ("Even when they aren't connected" refers to the "device shadow" capability, which allows you to programmatically interact with a proxy for the device, e.g. knowing the last temperature setting of a thermostat, and asking the thermostat to change its set point when the connection is re-established at some future point.)
Or, potentially something like this:
Firebase Cloud Messaging (FCM) is a cross-platform messaging solution that lets you reliably deliver messages at no cost.
Using FCM, you can notify a client app that new email or other data is available to sync.
https://firebase.google.com/docs/cloud-messaging/
Both of these potential solutions solve the problem by "knowing how to contact" arbitrary devices, wherever they may be... and I would suggest that this is the core of your actual need.
There are a lot of alternatives for such a "service," including roll-your-own websocket or HTML EventSource implementations with servers... the purpose of this is not product recommendations but rather to give you an idea of what you would need for such a scenario -- an intermediate platform that can be interacted with by the Lambda code, which also knows how to communicate with "agent" code running on the device... because both Lambda and the agent need to initiate the communication channels and thus additional components are required to bridge them together.
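To illustrate the "agent" side of that pattern, here is a bare-bones Erlang sketch (the relay host, port, and command handling are all placeholders): the device dials out to the relay and then sits waiting for commands pushed down that long-lived connection.

-module(agent).
-export([run/0]).

run() ->
    {ok, Sock} = gen_tcp:connect("relay.example.com", 443,
                                 [binary, {active, false}]),
    loop(Sock).

%% Block on the outbound connection; the relay pushes commands down it.
loop(Sock) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Command} ->
            handle(Command),    %% e.g. change volume, pause playback
            loop(Sock);
        {error, closed} ->
            timer:sleep(5000),  %% back off, then re-dial the relay
            run()
    end.

handle(Command) -> io:format("got command: ~p~n", [Command]).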

C++ frameworks for distributed computer applications

I have a C++/MFC application ("MyApp") that has always run standalone, but which now needs to be run simultaneously on at least two PCs, and eventually on perhaps up to 20 PCs. What is desirable is:
The system will be deployed with fixed PC names ("host1", "host2", etc) and IP addresses (192.168.1.[host number]) on an isolated network
Turn on the first PC, and it will start all the others using wake-on-lan
One or more instances of "MyApp" automatically start on each node, configure themselves, and do their stuff more-or-less independently (there is no critical inter-node communication once started)
The first node in the system also hosts a user interface that provides some access to other nodes. Communication to/from the other nodes only occurs at the user's request, which is generally sporadic for normal users, and occasionally intense for users who have to test that new features are working as designed across the system.
For simulation/test purposes, MyApp should also be able to be started on a specified subset of arbitrarily-named computers on a LAN.
The point of the last requirement is so that when trying to reproduce a problem occurring on a 20-PC system on the other side of the world, I don't have to find 20 PCs from somewhere and hook them up; I can instead start by (politely) stealing some spare CPU cycles from other PCs in my office.
I can imagine building this from the ground up in C++/MFC along these lines:
write a new service that automatically starts with Windows, and which can start/stop instances of MyApp when it receives the corresponding command via the network, or when configured to automatically start MyApp in non-test deployments.
have MyApp detect which node it is in the wider system, and if it is the first node, ensure all other nodes are turned on via wake-on-lan
design and implement a communications protocol between the various local and remote services, and MyApp instances
This really seems quite basic, but it also feels like reinventing wheels to build it from the ground up. So the central question is: are there frameworks with C++ classes that make this kind of thing easy? What options should I investigate? And is referring to this as "distributed computing" misleading? What else would you call it? (Please suggest additional or different tags.)

Debugging network applications and testing for synchronicity?

If I have a server running on my machine, and several clients running on other networks, what are some concepts of testing for synchronicity between them? How would I know when a client goes out-of-sync?
I'm particularly interested in how network programmers in the field of game design do this (or just any continuous network exchange application), where realtime synchronicity would be a commonly vital aspect of success.
I can see how this may be easily achieved on a LAN via side-by-side comparisons on separate machines... but once you branch out the scenario to include clients from foreign networks, I'm just not sure how it can be done without clogging up your messaging system with debug information, which would itself change the synchronisation behaviour you are trying to observe.
So what are some ways that people get around this issue?
For example, do they simply induce/simulate latency on the local network before launching to foreign networks, and then hope for the best? I'm hoping there are some more concrete solutions, but this is what I'm doing in the meantime...
When you say synchronized, I believe you are talking about network latency; meaning that a client on a local network may get its gaming information sooner than a client on the other side of the country. Correct?
If so, then I'm sure you can look for books or papers that cover this kind of topic, but I can give you at least one way to detect this latency and provide a way to manage it.
To detect latency, your server can use a traceroute-style program to determine how long it takes for data to reach each client. A common Linux example can be found here: http://linux.about.com/library/cmd/blcmdl8_traceroute.htm. While the server is handling client data, it can also continuously collect latency statistics and provide the data to the clients. For example, the server can update each client on its own network latency and on the longest latency among the group of clients that are playing each other in a game.
The clients can then use the latency differences to determine when they should process the data they receive from the server. For example, a client is told by the server that its network latency is 50 milliseconds and the maximum latency for its group is 300 milliseconds. The client then knows to wait 250 milliseconds before processing game data from the server. That way, each client processes game data at approximately the same time.
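The hold-time arithmetic itself is trivial; as an Erlang sketch:

-module(lockstep).
-export([hold_time/2]).

%% Own latency 50 ms, group maximum 300 ms -> hold for 250 ms, so that
%% every client processes the same update at roughly the same time.
hold_time(OwnLatencyMs, MaxGroupLatencyMs) ->
    max(0, MaxGroupLatencyMs - OwnLatencyMs).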
There are many other (and probably better) ways to handle this situation, but that should get you started in the right direction.

The most important basics of P2P

I've been reading around on the web but just can't get the most important basics of P2P.
The diagram is like this:
[peer1]<-->[dsl-router1]<-->[central server]<-->[dsl-router2]<-->[peer2]
I'm developing chat software running on the central server. Chat messages are transferred through the central server fine by now; however, I need to add a P2P file-sharing feature, because the server's bandwidth (the link bandwidth, not the transfer cap) is meant for transferring chat messages only.
The problem is that my software on the central server knows the IPs and ports of router1 and router2, but not those of peer1 and peer2, as these peers are behind the routers and don't have public IP addresses.
How do I actually transfer data from peer1 to peer2 and vice versa without having this data pass through the central server?
(And the worst case is that there is a wireless router between a peer and its DSL router.)
There are two basic ways of doing this. The newer way is to use the UPnP Internet Gateway Device protocol (opening a port via UPnP). This is described quite well here:
http://www.codeproject.com/Articles/13285/Using-UPnP-for-Programmatic-Port-Forwardings-and-N
If neither of the two nodes has a router supporting UPnP, then another alternative is TCP hole punching, which is not perfect but works quite well in practice. This is described here:
http://www.brynosaurus.com/pub/net/p2pnat/
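As a sketch of the simultaneous-open step in Erlang (the addresses are placeholders, and a complete implementation also needs the rendezvous exchange, retries, and timers): both peers reuse the local port they used to reach the rendezvous server and dial each other's public endpoint at the same time, hoping their NATs pair the flows up.

-module(holepunch).
-export([punch/3]).

%% Bind our known local port (reuseaddr) and connect to the peer's
%% public address; success depends on the NATs' behaviour.
punch(LocalPort, PeerIp, PeerPort) ->
    gen_tcp:connect(PeerIp, PeerPort,
                    [binary, {reuseaddr, true}, {port, LocalPort},
                     {active, false}], 3000).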
In some situations, "routers" supplied by the ISP may run in bridge mode, which directly exposes the peer computer on the Internet (the computer gets a public Internet address). If at least one side has this configuration (or is in a similar situation where the peer client is not behind another device), then things should be rather straightforward: simply assign the central server's job to whichever peer has this privilege.
In the other case, where both peers only have a local address (e.g. 192.168.0.2) assigned to their computers, it is rather difficult to get through the routers; clients behind routers are for the most part unreachable from the outside unless they originated the request. One solution to the problem is port forwarding. With port forwarding, whether through explicitly written rules or UPnP, some ports on the peer computer are exposed to the public Internet, much as in the bridge-mode situation, except that only some ports rather than the entire computer are exposed.
If you have neither of these, then there is no simple way to avoid sending through the central server, though you could potentially find other peers that are able to relay for others.