Erlang concurrency/distribution - then vs. now - concurrency

As to the question of how many nodes can be in an Erlang system on a practical (not theoretical) level, I've seen answers ranging from 100 in most cases to one answer which stated "150-200 max."
I was surprised to see this, because wasn't Erlang designed for massive concurrency and distribution in order to implement telecom networks, phone switches, etc.? If so, wouldn't you assume (I know I did) that this would entail more than 100 nodes in a system (I always assumed hundreds, possibly thousands)?
I guess my question is: What was considered "massive concurrency/distribution" back when these old telecoms used erlang? How many machines would they typically have connected together, running erlang and doing concurrency?
Just curious, and thanks for any answers.

You got the answer: for a cluster of nodes, with current technology, the practical limit is around 100 to 200 nodes, because we are talking about almost transparent distribution. The reasons for this limitation are explained in the documentation; in short, every node monitors every other node, and the default cluster is a full mesh, so with n nodes there are n(n-1)/2 connections to keep alive. As the cluster grows, the bandwidth and resources left over for your application shrink faster and faster.
To have more nodes, you must program the cooperation between clusters and/or individual nodes yourself. The libraries offer some facilities for this, but of course it is not transparent, and the problem is not Erlang-specific.
It is also recommended for security reasons to avoid huge clusters: today, within an Erlang cluster, any node can do whatever it wants on any other node, without restriction.

It depends. It depends on lots of things you didn't specify or define, and I suspect that if you specified enough that a "real" answer was possible you would be disappointed because it wouldn't be useful. That's why these sorts of questions are generally discouraged.
You don't say what date range you mean by "when these old telecoms used Erlang". They still use it (it's never had traction outside of Ericsson and there was never a time when Ericsson used it significantly more than the present). Here's a video of them talking about using Erlang on their SGSN-MME: http://vimeo.com/44718243
You don't say what you mean by "an Erlang system". Is that a single machine? Erlang did not have SMP support when it started (is that the time frame you're asking about?). Do you mean concurrent processes?
Is that a single cluster using net_kernel:connect_node/1? How are you defining a cluster? Erlang clusters, by default, are a complete mesh. That limits the maximum size based on the performance limits of the network and the machines' interfaces. But you can connect nodes in a chain instead, and then there's no limit. But if you count that as a cluster, why not also count it when you use your own TCP connections instead of just net_kernel's? There are lots of Ericsson routers in use on the Internet, so we could think of the Internet as one "system" where many of its component routers are using Erlang.
In the video I linked, you can see that in the early 2000s, Ericsson's SGSN product was a single box (containing multiple machines) that could serve maybe a few thousand mobile phones simultaneously. We might assume that each connected phone had one Erlang process managing it, plus a negligible number of system processes.

Related

AWS: Instances and Reliability

Short of creating ginormous instances, is there any way to either force instances to run on separate physical machines or detect how many physical machines are being used by multiple instances of the same image on Amazon Web Services (AWS)?
I'm thinking about reliability here. If I fool myself into thinking that I have three independent servers for fault tolerance purposes (think Paxos, Quicksilver, ZooKeeper, etc.) because I have three different instances running, but all three end up running on the same physical machine, I could be in for a very, very rude surprise.
I realize the issue may be forced by using separate regions, but it would be nice to know if there is an intra-region or even intra-availability-zone solution, as I'm not sure I've ever seen AWS actually give me more than one availability-zone choice in the supposedly multi-choice pulldown menu when creating an instance.
OK, I appreciate the advice from the first two answers, but I was trying to simplify the problem, without writing a novel, by positing 3 machines in one region. Let me try again: as I scale a hypothetical app stack up/outward, I'm going to both statically and dynamically ("elastically") add instances. Of course, any manner of failure/disaster can happen (including an entire data center burning to the ground due to, say, an unfortunate breakroom accident involving a microwave, a CD, and two idiots saying "oh yeah? well watch this!!!"), but by far the most likely is a hard machine failure of some sort, followed not too far behind by a dead port. Running multiple instances of the same type T on a single piece of virtualized hardware adds computation power, but not fault tolerance.

Obviously, if I'm scaling up/out, I'm most likely going to be using "larger" instances. Obviously, if AWS's largest machine has memory size M and C processors, and I choose an instance with memory size m such that m > M/2, or with a number of CPUs c such that c > C/2, then my instances are guaranteed to run on separate machines. However, I don't know what Mmax and Cmax are today; I certainly don't know what they will be a year from now, or two years from now, and so on, as Amazon buys Bigger Better Faster.

I know this sounds like nitpicking and belaboring the point, but not knowing how instances are distributed, or whether there is a mechanism to control instance distribution, means I can make genuine mistakes in my assumptions: in calculating the effective F+1 or 2F+1 for current distributed computing algorithms, in evaluating new algorithms for use in new applications, in sharding and locality decisions, in minimum reserved vs. elastic instance counts for portions of the app stack that see less traffic, etc.
You always have at least two availability zones per region, and that should work for high-availability scenarios. Intra-AZ separation would not buy you much reliability, as a whole AZ may go down (unlikely, but possible).
If you absolutely must force "intra-AZ, separate hardware", dedicated instances in different accounts would achieve that, but they would cost more and would not be much better.
Not only are there multiple availability zones (think separate data centers) within each region, you can also have servers split across different regions (west coast, east coast, Europe, etc.).
As far as redundancy and reliability are concerned, you're much better off spreading your work across AZs and regions than trying to figure out or control whether instances within a single AZ share the same piece of hardware.

C++ distributed computing of an executable program

I was wondering if it is possible to run an executable program without adding to its source code, like running any game across several computers. When I was programming in C# I noticed the Process class, which lets you launch or close any application or process; I was wondering if there was something similar in C++ that would let me transfer the processes of any executable file or game to other computers or servers, minimizing my computer's processor consumption.
Thanks.
Everything is possible, but this would require a huge amount of work and would almost certainly make your program painfully slow (I'm talking about a factor of millions or billions here). Essentially you would need to make sure every layer used in the program allows this, so you'd have to rewrite the OS to support it, and quite a few of the libraries it uses as well.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly easier if they were actual processes, but I'd be surprised if many applications worked like that.
To begin with, you need to synchronize the memory, more specifically all non-thread-local storage, which often means 'all memory' because not all languages have a thread-aware memory model. Of course, this can be optimized, for example by buffering everything until you encounter an 'atomic' read or write, if your system has such a concept. Now can you imagine every thread blocking for a few seconds of synchronization whenever a thread has to be locked/unlocked or an atomic variable has to be read/written?
Next to that there are the issues related to managing devices. Assume you need a network connection: which machine's network device will handle it, how will the IP address be chosen, ...? To solve this seamlessly you would probably need a virtual device shared amongst all machines. This has to happen for network devices, filesystems, printers, monitors, and so on. And since you mention games: this would have to happen for the GPU as well; just imagine how it would impact performance merely to send data to and from the GPU (hint: even 16x PCIe is often already a bottleneck).
In conclusion: this is not feasible. If you want a clustered application, you have to build it into the application from scratch.
I believe the closest thing you can do is MapReduce: it's a paradigm which will hopefully be part of the official Boost library soon. However, I don't think that you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does there seem to be a feasible way to do it. So what is the exact problem that you're trying to solve?
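If a MapReduce-style decomposition does fit your problem, here is a minimal single-machine sketch of the idea in plain C++ with std::async. It only illustrates the paradigm; the word-count workload and the names count_chunk and merge are made up for this example, and a real framework would ship the map tasks to other machines rather than other threads.

    // Minimal single-machine illustration of the map/reduce idea: "map" each
    // chunk of words to per-chunk counts in parallel, then "reduce" (merge)
    // the partial counts.
    #include <functional>
    #include <future>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Counts = std::map<std::string, int>;

    // Map step: count words in one chunk, independently of all other chunks.
    Counts count_chunk(const std::vector<std::string>& chunk) {
        Counts c;
        for (const auto& w : chunk) ++c[w];
        return c;
    }

    // Reduce step: merge two partial results.
    Counts merge(Counts a, const Counts& b) {
        for (const auto& kv : b) a[kv.first] += kv.second;
        return a;
    }

    int main() {
        const std::vector<std::vector<std::string>> chunks = {
            {"the", "quick", "brown", "fox"},
            {"the", "lazy", "dog"},
            {"the", "fox", "again"},
        };

        std::vector<std::future<Counts>> futures;
        for (const auto& chunk : chunks)
            futures.push_back(std::async(std::launch::async, count_chunk,
                                         std::cref(chunk)));

        Counts total;
        for (auto& f : futures) total = merge(std::move(total), f.get());

        for (const auto& kv : total)
            std::cout << kv.first << ": " << kv.second << "\n";
        return 0;
    }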
Current research is focused on practical means of distributing the work of a process across multiple CPU cores on a single computer. In that case, the processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.
The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into interoperating small tasks performed by highly specialised utilities, "everything is a file", and the client/server model, to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way ahead of its time, and Plan 9 remained largely a research project. It is still being developed as open-source software, though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code as all context saving and restoration are done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008) and LinuxPMI (continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted at data processing and High Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via a very fast network, into a single big NUMA machine with many CPUs and a huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides, most games are GPU-intensive rather than CPU-intensive - most games still don't even utilise the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...

C/C++ Framework for distributed computing in a dynamic cluster

I am looking for a framework to be used in a C++ distributed number crunching application.
The setup looks as follows:
There is a master node which divides the problem domain into small independent tasks. The tasks are distributed to worker nodes of different capability (e.g. CPU type / GPU-enabled).
Worker nodes are dynamically added to the compute grid as they become available. It may also happen that a worker node dies without saying goodbye.
I am searching for a fast C/C++ framework to accomplish this setup.
To summarize, my main requirements are:
Worker/Task-scheduling paradigm
Dynamically add/remove nodes
Target network: 1G - 10G ethernet (corporate network, good performance over internet not required)
Optional: Encrypted and authenticated communication
You can certainly do what you want with MPI. MPI-2 added dynamic process management features, and I think most of the currently widely-used implementations offer these.
One of the advantages of using C++ + MPI is that the combination is quite widely used in scientific and technical computing, though my impression is that within this niche dynamic process management is not used very much. Since MPI is used on the very largest supercomputers tackling the bleeding-edge problems of computational science, one might hazard a guess that it would be fast enough for your purposes.
One of the disadvantages of using C++ + MPI is that MPI was not designed to tolerate failure of processes during execution. There is debate on SO about whether or not the dynamic process management features allow you to program your own fault tolerance, but there is no debate that it might be difficult.
You would get the first 3 of your requirements 'out-of-the-box'. As for encrypted and authenticated communication, you'd have to do most of that yourself, MPI just passes messages around. I'd guess that for most MPI users, running parallel applications on clusters or supercomputers with private interconnects (often themselves isolated from corporate or enterprise networks), encryption and authentication are matters of little concern.
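To illustrate the worker/task-scheduling paradigm over MPI, here is a minimal sketch of a master that hands out tasks and collects results using plain point-to-point messages. It is only a sketch: the TAG_* names and the integer payloads stand in for real task descriptions, it does not use the MPI-2 dynamic process management features, and it has no fault tolerance, so a worker that dies without saying goodbye would stall it.

    // Minimal master/worker sketch over plain MPI point-to-point messages.
    // Build with an MPI wrapper (e.g. mpicxx) and run with at least one
    // worker process, e.g.: mpirun -np 4 ./a.out
    #include <mpi.h>
    #include <cstdio>

    static const int TAG_TASK   = 1;  // master -> worker: a task payload
    static const int TAG_RESULT = 2;  // worker -> master: a result payload
    static const int TAG_STOP   = 3;  // master -> worker: no more work

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                     // master: hand out tasks, collect results
            const int num_tasks = 100;
            int next_task = 0, received = 0, dummy = 0;

            // Prime every worker with one task (or stop it if there is no work).
            for (int w = 1; w < size; ++w) {
                if (next_task < num_tasks) {
                    MPI_Send(&next_task, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
                    ++next_task;
                } else {
                    MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            // Whenever a worker reports a result, hand it the next pending task.
            while (received < num_tasks) {
                int result = 0;
                MPI_Status st;
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_RESULT,
                         MPI_COMM_WORLD, &st);
                ++received;
                if (next_task < num_tasks) {
                    MPI_Send(&next_task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                             MPI_COMM_WORLD);
                    ++next_task;
                } else {
                    MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                }
            }
            std::printf("master: all %d tasks done\n", num_tasks);
        } else {                             // worker: process tasks until told to stop
            for (;;) {
                int task = 0;
                MPI_Status st;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                int result = task * task;    // placeholder for the real number crunching
                MPI_Send(&result, 1, MPI_INT, 0, TAG_RESULT, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Dynamically adding nodes would mean either restarting the job or building on MPI_Comm_spawn and friends, which is where the "not used very much" caveat above comes in.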

What challenges promote the use of parallel/concurrent architectures?

I am quite excited by the possibility of using languages which have parallelism/concurrency built in, such as Stackless Python and Erlang, and I have a firm belief that we'll all have to move in that direction before too long - or will want to, because it will be a good/easy way to achieve scalability and performance.
However, I am so used to thinking about solutions in a linear/serial/OOP/functional way that I am struggling to cast any of my domain problems in a way that merits using concurrency. I suspect I just need to unlearn a lot, but I thought I would ask the following:
Have you implemented anything reasonably large in stackless or erlang or other?
Why was it a good choice? Was it a good choice? Would you do it again?
What characteristics of your problem meant that concurrent/parallel was right?
Did you re-cast an existing problem to take advantage of concurrency/parallelism? and
if so, how?
Anyone any experience they are willing to share?
In the past, when desktop machines had a single CPU, parallelization only applied to "special" parallel hardware. But these days desktops usually have from 2 to 8 cores, so now parallel hardware is the standard. That's a big difference, and therefore it is not just about which problems suggest parallelism, but also about how to apply parallelism to a wider set of problems than before.
In order to take advantage of parallelism, you usually need to recast your problem in some way. Parallelism changes the playground in many ways:
You get data coherence and locking problems, so you need to try to organize your problem so that you have semi-independent data structures which can be handled by different threads, processes and computation nodes (see the sketch after these points).
Parallelism can also introduce nondeterminism into your computation, if the relative order in which the parallel components do their jobs affects the results. You may need to protect against that, and define a parallel version of your algorithm which is robust against different scheduling orders.
When you transcend intra-motherboard parallelism and get into networked / cluster / grid computing, you also get the issues of network bandwidth, network going down, and the proper management of failing computational nodes. You may need to modify your problem so that it becomes easier to handle the situations where part of the computation gets lost when a network node goes down.
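As a small illustration of the first point above, here is a sketch that recasts a loop as semi-independent chunks, one per thread. The input data and the sum-of-squares workload are just placeholders; the point is that each thread reads only its own slice and writes only its own slot, so no locks are needed and the single-threaded combine keeps the result deterministic.

    // Sketch: recasting a loop as semi-independent chunks, one per thread.
    // Each thread reads only its own slice and writes only its own partial
    // result, so no locking is needed until the final single-threaded combine.
    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(1000000, 1.5);        // placeholder input
        const unsigned nthreads =
            std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(nthreads, 0.0);    // one slot per thread
        std::vector<std::thread> workers;

        const std::size_t chunk = data.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            const std::size_t begin = t * chunk;
            const std::size_t end =
                (t + 1 == nthreads) ? data.size() : begin + chunk;
            workers.emplace_back([&, t, begin, end] {
                double sum = 0.0;
                for (std::size_t i = begin; i < end; ++i) sum += data[i] * data[i];
                partial[t] = sum;                      // each thread owns one slot
            });
        }
        for (auto& w : workers) w.join();

        // The only shared step is this deterministic, single-threaded combine.
        double total = std::accumulate(partial.begin(), partial.end(), 0.0);
        std::cout << "sum of squares: " << total << "\n";
        return 0;
    }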
Before we had operating systems, people building applications would sit down and discuss things like:
how will we store data on disks
what file system structure will we use
what hardware will our application work with
etc, etc
Operating systems emerged from collections of 'developer libraries'.
The beauty of an operating system is that your UNWRITTEN software has certain characteristics, it can:
talk to permanent storage
talk to the network
run in a command line
be used in batch
talk to a GUI
etc, etc
Once you have shifted to an operating system - you don't go back to the status quo ante...
Erlang/OTP (i.e. not just Erlang) is an application system - it runs on two or more computers.
The beauty of an APPLICATION SYSTEM is that your UNWRITTEN software has certain characteristics, it can:
fail over between two machines
work in a cluster
etc, etc...
Guess what, once you have shifted to an application system - you don't go back either...
You don't have to use Erlang/OTP; Google has a good application system in their App Engine, so don't get hung up about the language syntax.
There may well be good business reasons to build on the Erlang/OTP stack not the Google App Engine - the biz dev guys in your firm will make that call for you.
The problems will stay almost the same in the future, but the underlying hardware for realizing the solutions is changing. To make use of it, the way of communication between objects (components, processes, services, however you call them) will change. Messages will be sent asynchronously, without waiting for a direct response. Instead, after a job is done, the process will call the sender back with the answer. It's like people working together.
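As a small sketch of that messaging style (in C++ rather than Erlang, with made-up names like Job and Worker), the sender enqueues a job together with a callback and carries on with its own work; the worker calls the sender back with the answer when the job is done.

    // Sketch of "send a job, get called back with the answer later", using a
    // single worker thread and std::function callbacks.
    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <utility>

    struct Job {
        int input;                             // placeholder payload
        std::function<void(int)> on_done;      // called with the answer when finished
    };

    class Worker {
    public:
        Worker() : thread_([this] { run(); }) {}
        ~Worker() {
            { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
            cv_.notify_one();
            thread_.join();
        }
        // The sender does not wait for a response; it just enqueues the job.
        void submit(Job job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                Job job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                    if (stop_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                int answer = job.input * 2;    // placeholder for the real work
                job.on_done(answer);           // call the sender back with the answer
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Job> jobs_;
        bool stop_ = false;
        std::thread thread_;                   // declared last: the rest is ready first
    };

    int main() {
        Worker worker;
        worker.submit({21, [](int answer) { std::cout << "answer: " << answer << "\n"; }});
        // The sender is free to do other work here instead of blocking on a reply.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        return 0;
    }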
I'm currently designing a lightweight event-driven architecture based on Erlang/OTP. It's called Tideland EAS. I'm describing the ideas and principles here: http://code.google.com/p/tideland-eas/wiki/IdeasAndPrinciples. It's not ready yet, but maybe you'll understand what I mean.
mue
Erlang makes you think of the problem in parallel. You won't forget it for one second. After a while you adapt. Not a big problem. Except that the solution becomes parallel in every little corner. All other languages you have to tweak to be concurrent, and that doesn't feel natural. Then you end up hating your solution. Not fun.
The biggest advantage Erlang has is that it has no global garbage collection: collection happens per process, so it never stops the whole system for a pause. That is kind of important when you have 10,000 page views a second.

BerkeleyDB Concurrency

What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple: I'll be doing gets and puts of records that are about 1KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are general factors I found to be of major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of durability with which you configured BDB (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises durability); see the sketch at the end of this answer.
Does the working set fit in cache?
Number of Reads Vs. Writes.
How spread out your accesses are (remember that BTREE uses page-level locking, so accessing different pages from different threads is a big advantage).
Access pattern - meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
Scaling a solution based on BDB so that it offers greater concurrency has two key routes: either minimize the number of locks taken in your design, or add more hardware.
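To make some of those knobs concrete, here is a minimal sketch of opening a transactional Berkeley DB environment for multi-threaded use, with a BTREE database and the DB_TXN_WRITE_NOSYNC trade-off mentioned above. The home directory, cache size and flag choices are illustrative, not recommendations, and your threads' gets and puts would go where the comment indicates.

    // Sketch: a transactional Berkeley DB environment opened for multi-threaded
    // use (C++ API, link with -ldb_cxx).
    #include <db_cxx.h>

    int main() {
        DbEnv env(0);
        // 64 MB cache; whether the working set fits in cache dominates throughput.
        env.set_cachesize(0, 64 * 1024 * 1024, 1);
        // Trade durability for write throughput: commits are logged, but the log
        // is not flushed to disk on every commit. Drop this for full durability.
        env.set_flags(DB_TXN_WRITE_NOSYNC, 1);
        env.open("/tmp/bdb-env",               // example home directory (must exist)
                 DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK | DB_INIT_LOG |
                     DB_INIT_TXN | DB_THREAD | DB_RECOVER,
                 0);

        Db db(&env, 0);
        db.open(NULL, "data.db", NULL, DB_BTREE,   // BTREE => page-level locking
                DB_CREATE | DB_AUTO_COMMIT | DB_THREAD, 0644);

        // ... spawn threads here and have them do their 1KB gets and puts ...

        db.close(0);
        env.close(0);
        return 0;
    }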
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test, run it with increasing numbers of threads hammering away, and see what seems best.
What I did when working against a database of unknown performance was to measure the turnaround time on my queries. I kept upping the thread count as long as turnaround time improved, and dropped the thread count when turnaround time started to suffer (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
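Here is a rough sketch of that adapt-at-run-time idea: run a batch at the current concurrency level, measure average turnaround, and nudge the thread count up or down depending on whether the last change helped. The run_batch function is a dummy stand-in for real database work, and the step logic is deliberately naive.

    // Rough sketch of adapting the concurrency level at run time.
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Dummy stand-in for "hammer the database with `threads` workers and
    // return the mean per-operation turnaround time".
    double run_batch(int threads) {
        auto start = Clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < threads; ++t)
            workers.emplace_back([] {
                std::this_thread::sleep_for(std::chrono::milliseconds(5));  // fake work
            });
        for (auto& w : workers) w.join();
        double elapsed_ms =
            std::chrono::duration<double, std::milli>(Clock::now() - start).count();
        return elapsed_ms / threads;            // crude "turnaround per operation"
    }

    int main() {
        int threads = 2, step = 1;
        double best = run_batch(threads);
        for (int round = 0; round < 20; ++round) {
            int candidate = std::max(1, threads + step);
            double latency = run_batch(candidate);
            if (latency < best) {               // the change helped: keep it
                threads = candidate;
                best = latency;
            } else {                            // it hurt: back off, try the other way
                step = -step;
            }
            std::cout << "round " << round << ": threads=" << threads
                      << ", best=" << best << " ms/op\n";
        }
        return 0;
    }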
The way I understand things, Samba created tdb to allow "multiple concurrent writers" for any particular database file. So if your workload has multiple writers your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.