Distributed computing of an executable program - C++

I was wondering if it is possible to run an executable program without adding to its source code, like running any game across several computers. When I was programming in C# I noticed a Process class, which lets you start or close any application or process. I was wondering if there is something similar in C++ which would let me transfer the processes of any executable file or game to other computers or servers, minimizing my computer's processor load.
Thanks.

Everything is possible, but this would require a huge amount of work and would almost certainly make your program painfully slow (I'm talking about a factor of millions or billions here). Essentially you would need to make sure every layer used in the program allows this. So you'd have to rewrite the OS to support it, and quite a few of the libraries it uses as well.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly easier if they were actual processes, but I'd be surprised if many applications work like that.
To begin with, you need to synchronize the memory, more specifically all non-thread-local storage, which often means 'all memory' because not all languages have a thread-aware memory model. Of course, this can be optimized, for example by buffering everything until you encounter an 'atomic' read or write, if your system has such a concept. Now imagine every thread blocking for a few seconds of synchronization whenever a lock has to be taken or released, or an atomic variable has to be read or written.
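To make the cost concrete, here is a toy C++ sketch (my illustration, not part of the original answer): on a single machine the atomic increments below cost a few nanoseconds each because the cores share coherent RAM; if each thread lived on a different machine, every increment would need a network round trip of tens of microseconds or more, a slowdown of several orders of magnitude per operation.

    #include <atomic>
    #include <thread>
    #include <vector>

    int main() {
        std::atomic<long> counter{0};   // on a cluster, every access to this
                                        // would become a network round trip
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i)
            threads.emplace_back([&counter] {
                for (int j = 0; j < 1000000; ++j)
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& t : threads) t.join();
        return counter == 4000000 ? 0 : 1;
    }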
Next to that there are the issues of managing devices. Suppose you need a network connection: which machine opens it, how is the IP address chosen, and so on? To solve this seamlessly you would probably need a virtual device shared amongst all platforms, and that has to happen for network devices, filesystems, printers, monitors, and more. Since you kindly mention games: it would have to happen for the GPU as well. Just imagine the performance impact of shipping data to and from a remote GPU (hint: even 16x PCI-e is often already a bottleneck).
In conclusion: this is not feasible. If you want a clustered application, you have to build it into the application from scratch.

I believe the closest thing you can do is MapReduce: it's a paradigm which will hopefully be part of the official Boost library soon. However, I don't think you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does there seem to be a feasible way to do it. So what is the exact problem that you're trying to solve?

Current research focuses on practical means of distributing the work of a process across multiple CPU cores on a single computer. In that case, the processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.

The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into interoperating small tasks performed by highly specialised utilities, and "everything is a file", as well as the client/server model, to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way ahead of its time, and Plan 9 remained largely a research project. It is still being developed as open source software though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code, as all context saving and restoration is done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008), and LinuxPMI (a continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted at data processing and High Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via a very fast network, into a single big NUMA machine with many CPUs and a huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides, most games are GPU intensive rather than CPU intensive - most games still don't even utilise the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...

Related

Using IPC as a multithread substitute?

I was wondering if you could theoretically utilise more than one core of a processor to complete tasks quicker by using inter-process communication instead of multithreading.
Say for instance a game engine. You have one executable processing ai and physics, then another handling the sound and rendering.
Maybe there could be shared memory that the physics and ai results get written to, that the renderer then could use to output the graphics.
What do you think? Ridiculous idea or feasible?
Thank you for your time.
Edit: the engine does not exist, it's just an example. Basically I'm asking if two or more programs can work together if any tasks can be parallelized.
Basically I'm asking if two or more programs can work together if any tasks can be parallelized.
Yes, the idea isn't flawed in principle.
That being said, the overhead of creating a thread versus creating a process varies across OSes; Windows processes are very expensive to create, while Linux ones are pretty much free.
One advantage of separating the processes is that when one of them crashes, the others aren't impacted and it can be restarted. You can also more easily push the calculations to another machine across the network. However, synchronizing threads will probably be easier and more performant in most cases.
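To make the question's shared-memory idea concrete, here is a minimal POSIX sketch; the segment name and the PhysicsState layout are invented for illustration, and a real engine would also need a semaphore or similar so the renderer doesn't read a half-written frame.

    // Producer side: the physics/AI process creates the segment and writes it.
    // The renderer process would shm_open the same name and mmap it read-only.
    #include <fcntl.h>     // shm_open
    #include <sys/mman.h>  // mmap, munmap
    #include <unistd.h>    // ftruncate, close
    #include <cstdio>

    struct PhysicsState {              // hypothetical shared layout
        float positions[1024][3];
        unsigned frame;
    };

    int main() {
        int fd = shm_open("/engine_state", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, sizeof(PhysicsState)) != 0) { perror("ftruncate"); return 1; }

        void* mem = mmap(nullptr, sizeof(PhysicsState),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }
        auto* state = static_cast<PhysicsState*>(mem);

        state->frame = 42;             // physics results go here each tick

        munmap(mem, sizeof(PhysicsState));
        close(fd);
        shm_unlink("/engine_state");   // in real code, done once at shutdown
        return 0;
    }

(On Linux, link with -lrt on older toolchains.)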

C++, always running processes or invoked executable files?

I'm working on a project made of some separate processes (services). Some services are called every second, some other every minute and some services may not be called after days. (and there are some services that are called randomly and there is no exact information about their call times).
I have two approaches for developing the project: make the services always-running processes that communicate via interprocess messaging, or write separate C++ programs and run the executable files when I need them.
I have two questions that I couldn't find a suitable answer to.
Is there any way I could calculate an approximate threshold that would help me decide when to use which approach?
How much faster are always-running processes? (I mean compared with the OS initializing and running the executable files each time.)
Edit 1: As mentioned in the comments and in Mats Petersson's answer, the answer to my questions depends heavily on the environment, so here is more about those conditions:
OS: CentOS 6.3
services are small (normally fewer than 1000 lines of code) and use no additional resources (such as a database)
I don't think anyone can answer your two questions directly, as the answer depends on many factors, such as which OS, what secondary storage, how large the application is, and what your application does (loading the contents of a database with a million entries takes much longer than an application whose entire initialization outside main is int x = 73;).
There is overhead with both approaches, and assuming there isn't enough memory to hold EVERYTHING in RAM at all times (and modern OSes will try to use the RAM as disk cache or for other caching, rather than keep old crusty application code that doesn't run, so eventually your application code will be swapped out if it's not being run), you are going to have approximately the same disk I/O for both solutions.
For me, "having memory available" trumps other things, so executing a process when itäs needed is better than leaving it running in the expectation that in some time, it will need to be reused. The only exceptions are if the executable takes a long time to start (in other words, it's large and has a complex starting procedure) AND it's being run fairly frequently (at the very least several times per minute). Or you have high real-time requirements, so the extra delay of starting the process is significantly worse than "we're holding it in memory" penalty (but bear in mind that holding it in memory isn't REALLY holding it in memory, since the content will be swapped out to disk anyway, if it isn't being used).
Starting a process that was recently run is typically done from cache, so it's less of an issue. Also, if the application uses shared libraries (.so, .dll or .dylib depending on the OS) that are genuinely shared, then having that shared library already in memory will normally shorten the load time.
Both Linux and Windows (and I expect OS X) are optimised to load a program much faster the second time it executes in short succession - because it caches things, etc. So for the frequent calling of the executable, this will definitely work in your favour.
I would start by "execute every time", and if you find that this is causing a problem, redesign the programs to stay around.
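If you want a number for your own CentOS box rather than a rule of thumb, a crude measurement along these lines can help; /bin/true stands in for one of your services, and the spawn-plus-exit time is what you are measuring.

    #include <chrono>
    #include <cstdio>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
        using clock = std::chrono::steady_clock;
        auto t0 = clock::now();
        for (int i = 0; i < 100; ++i) {
            pid_t pid = fork();
            if (pid == 0) {                              // child
                execl("/bin/true", "true", (char*)nullptr);
                _exit(127);                              // exec failed
            }
            int status;
            waitpid(pid, &status, 0);                    // parent waits
        }
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      clock::now() - t0).count();
        std::printf("average spawn+exit: %lld us\n", (long long)us / 100);
        return 0;
    }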

Need help to choose real-time OS and Hardware

I heard and read that Windows/Linux OS machines are not real-time.
I have read this article. It lists Windows CE as one of the RTOSes. That's kind of confusing to me since I always thought Windows CE was for mobile or embedded devices.
I need a real-time application running 24/7 that processes signals from various sensors on each fast-moving object into a DB, and monitors them by running several machine learning algorithms.
What would be proper real-time hardware and OS for this kind of application? The development environment would be MFC or Qt C++. I really need opinions from experienced developers. Thanks
QNX has served me well in the past. I should warn you that it was only for training purposes (real-time industrial process control): I have implemented real-time control programs with this OS, but I've never really deployed one.
The first rule with real-time systems is to specify your real-time constraints, such as:
the system must be able to process up to 600 signals per minute; or
the system must spend no more than 1/10 second per signal.
The difference is subtle, but these are different constraints.
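As a toy illustration of the second kind of constraint, a check like the sketch below (process_signal is a placeholder for your handler) can at least detect missed deadlines; only a real RTOS can guarantee they are never missed in the first place.

    #include <chrono>
    #include <cstdio>

    void process_signal(int /*signal*/) { /* your actual work */ }

    int main() {
        using namespace std::chrono;
        constexpr auto deadline = milliseconds(100);   // 1/10 s per signal
        for (int s = 0; s < 600; ++s) {
            auto t0 = steady_clock::now();
            process_signal(s);
            if (steady_clock::now() - t0 > deadline)
                std::printf("signal %d missed its deadline\n", s);
        }
        return 0;
    }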
Just keep in mind that there is absolutely no way to decide if any hardware/OS/library combination is good enough for you unless you specify these constraints.
For that, you think QNX might be proper? What would be its advantages over Windows/Linux systems with high priority setting?
If you look at the QNX documentation for many POSIX system calls, you will notice that they specify extra constraints on performance, which may be required to guarantee your real-time constraints. The OS is specifically designed to meet these constraints; you won't get this on a system that is not officially an RTOS. If you are going to write real-time software, I recommend that you buy a good book on the subject. There is considerable literature, given how demanding the subject is.
Get yourself a good book on real-time system design to get a feel for what questions to ask, and then read the technical documentation of each product you will use to see if it can meet your constraints. An example of what to look for in software libraries like Qt is when they allocate memory: if this is not documented in each class interface, there is no way to guarantee meeting your constraints, since there is hidden algorithmic complexity.
Development environment would be MFC or Qt C++.
I would think that Qt compiles on QNX, but I'm not sure whether Qt provides the guarantees required to meet your real-time constraints. Libraries that abstract away too much are a risk, since it's difficult to determine whether they satisfy your requirements. Hidden memory management is often problematic, but there are other questions you should ask about too.
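As one concrete illustration of the hidden-allocation concern (my example, not the answerer's): pre-allocate on the non-critical path so the allocator, with its unbounded latency, never runs on the real-time path.

    #include <vector>

    struct Sample { double value; long timestamp; };

    int main() {
        std::vector<Sample> buffer;
        buffer.reserve(100000);        // the one allocation, done at startup

        // Real-time path: push_back cannot reallocate while
        // size() stays below capacity().
        for (long i = 0; i < 1000; ++i)
            buffer.push_back({i * 0.5, i});
        return 0;
    }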
It seems to me that people say Real-time systems == embedded systems. Am I wrong?
Real-time system definitely does not equal "embedded system", though many embedded systems have real-time constraints.
How real time do you need?
Remember, real time is about responsiveness, not speed. In fact, most RTOSes will be slower on average than a general-purpose OS.
Do you need to guarantee a certain average number of transactions per second, or do you need to always respond within n seconds of an event?
Do you have custom hardware or are you relying on inputs over ethernet, USB, etc?
Are drivers for the hardware available on the RTOS, or will you have to write them yourself?
Windows and Linux are possible RTOSes. Windows Embedded allows you to turn off services to get a much more reliable response rate, and there are both real-time kernels and real-time add-ons for Linux which give pretty much the same real-time performance as something like VxWorks.
It also depends on how many tasks you need to handle. A lot of the complexity of a true RTOS (like VxWorks) is that it can control many tasks at the same time while giving each a guaranteed latency and CPU share - important for a Mars rover, but not for a single data-collection PC.

How are you taking advantage of Multicore?

As someone in the world of HPC who came from the world of enterprise web development, I'm always curious to see how developers back in the "real world" are taking advantage of parallel computing. This is much more relevant now that all chips are going multicore, and it'll be even more relevant when there are thousands of cores on a chip instead of just a few.
My questions are:
How does this affect your software roadmap?
I'm particularly interested in real stories about how multicore is affecting different software domains, so specify what kind of development you do in your answer (e.g. server side, client-side apps, scientific computing, etc).
What are you doing with your existing code to take advantage of multicore machines, and what challenges have you faced? Are you using OpenMP, Erlang, Haskell, CUDA, TBB, UPC or something else?
What do you plan to do as concurrency levels continue to increase, and how will you deal with hundreds or thousands of cores?
If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
Finally, I've framed this as a multicore question, but feel free to talk about other types of parallel computing. If you're porting part of your app to use MapReduce, or if MPI on large clusters is the paradigm for you, then definitely mention that, too.
Update: If you do answer #5, mention whether you think things will change if there get to be more cores (100, 1000, etc.) than you can feed with available memory bandwidth (seeing how per-core memory bandwidth keeps shrinking). Can you still use the remaining cores for your application?
My research work includes work on compilers and on spam filtering. I also do a lot of 'personal productivity' Unix stuff. Plus I write and use software to administer classes that I teach, which includes grading, testing student code, tracking grades, and myriad other trivia.
Multicore affects me not at all except as a research problem for compilers to support other applications. But those problems lie primarily in the run-time system, not the compiler.
At great trouble and expense, Dave Wortman showed around 1990 that you could parallelize a compiler to keep four processors busy. Nobody I know has ever repeated the experiment. Most compilers are fast enough to run single-threaded. And it's much easier to run your sequential compiler on several different source files in parallel than it is to make your compiler itself parallel. For spam filtering, learning is an inherently sequential process. And even an older machine can learn hundreds of messages a second, so even a large corpus can be learned in under a minute. Again, training is fast enough.
The only significant way I have of exploiting parallel machines is using parallel make. It is a great boon, and big builds are easy to parallelize. Make does almost all the work automatically. The only other thing I can remember is using parallelism to time long-running student code by farming it out to a bunch of lab machines, which I could do in good conscience because I was only clobbering a single core per machine, so using only 1/4 of CPU resources. Oh, and I wrote a Lua script that will use all 4 cores when ripping MP3 files with lame. That script was a lot of work to get right.
I will ignore tens, hundreds, and thousands of cores. The first time I was told "parallel machines are coming; you must get ready" was 1984. It was true then and is true today that parallel programming is a domain for highly skilled specialists. The only thing that has changed is that today manufacturers are forcing us to pay for parallel hardware whether we want it or not. But just because the hardware is paid for doesn't mean it's free to use. The programming models are awful, and making the thread/mutex model work, let alone perform well, is an expensive job even if the hardware is free. I expect most programmers to ignore parallelism and quietly get on about their business. When a skilled specialist comes along with a parallel make or a great computer game, I will quietly applaud and make use of their efforts. If I want performance for my own apps I will concentrate on reducing memory allocations and ignore parallelism.
Parallelism is really hard. Most domains are hard to parallelize. A widely reusable exception like parallel make is cause for much rejoicing.
Summary (which I heard from a keynote speaker who works for a leading CPU manufacturer): the industry backed into multicore because they couldn't keep making machines run faster and hotter and they didn't know what to do with the extra transistors. Now they're desperate to find a way to make multicore profitable because if they don't have profits, they can't build the next generation of fab lines. The gravy train is over, and we might actually have to start paying attention to software costs.
Many people who are serious about parallelism are ignoring these toy 4-core or even 32-core machines in favor of GPUs with 128 processors or more. My guess is that the real action is going to be there.
For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy.
You usually have a lot more requests to handle at any given moment than you have cores. And since each one is handled in its own thread (or even process, depending on your technology), this is already working in parallel.
The only place you need to be careful is when accessing some kind of global state that requires synchronization. Keep that to a minimum to avoid introducing artificial bottlenecks to an otherwise (almost) perfectly scalable world.
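Sketching that advice in C++ for concreteness (the handler and the counter are invented): keep per-request work free of shared state and make the one global critical section as small as possible.

    #include <mutex>
    #include <string>

    std::mutex g_statsMutex;
    long g_requestCount = 0;           // the only shared state

    std::string handle_request(const std::string& body) {
        {   // tiny critical section: just the counter bump
            std::lock_guard<std::mutex> lock(g_statsMutex);
            ++g_requestCount;
        }
        // everything else touches only per-request data, so handlers
        // scale across cores without further synchronization
        return "echo: " + body;
    }

    int main() { return handle_request("hi").empty() ? 1 : 0; }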
So for me multi-core basically boils down to these items:
My servers have fewer "CPUs" while each one sports more cores (not much of a difference to me)
The same number of CPUs can sustain a larger number of concurrent users
When there seems to be a performance bottleneck that's not the result of the CPU being 100% loaded, that's an indication that I'm doing some bad synchronization somewhere.
At the moment it doesn't affect it that much, to be honest. I'm more in the 'preparation stage', learning about the technologies and language features that make this possible.
I don't have one particular domain, but I've encountered domains like math (where multi-core is essential), data sort/search (where divide & conquer on multi-core is helpful) and multi-computer requirements (e.g., a requirement that a back-up station's processing power is used for something).
This depends on what language I'm working in. Obviously in C#, my hands are tied with a not-yet-ready implementation of Parallel Extensions that does seem to boost performance, until you start comparing the same algorithms with OpenMP (perhaps not a fair comparison). So on .NET it's going to be an easy ride with some for → Parallel.For refactorings and the like.
Where things get really interesting is with C++, because the performance you can squeeze out of things like OpenMP is staggering compared to .NET. In fact, OpenMP surprised me a lot; I didn't expect it to work so efficiently. Well, I guess its developers have had a lot of time to polish it. I also like that it is available in Visual Studio out of the box, unlike TBB, for which you have to pay.
As for MPI, I use PureMPI.net for little home projects (I have a LAN) to fool around with computations that one machine can't quite handle. I've never used MPI commercially, but I do know that MKL has some MPI-optimized functions, which might be interesting to look at for anyone needing them.
I plan to do 'frivolous computing', i.e. use extra cores for precomputation of results that might or might not be needed - RAM permitting, of course. I also intend to delve into costly algorithms and approaches that most end users' machines cannot handle right now.
As for domains not benefitting from parallelization... well, one can always find something. One thing I am concerned about is decent support in .NET, though regrettably I have given up hope that speeds similar to C++ can be attained.
I work in medical imaging and image processing.
We're handling multiple cores in much the same way we handled single cores: we already have multiple threads in the applications we write in order to have a responsive UI.
However, because we now can, we're taking a strong look at implementing most of our image processing operations in either CUDA or OpenMP. The Intel compiler provides a lot of good sample code for OpenMP, which is just a much more mature product than CUDA and has a much larger installed base, so we're probably going to go with that.
What we tend to do for expensive (i.e., more than a second) operations is to fork that operation off into another process, if we can. That way, the main UI remains responsive. If we can't, or it's just far too inconvenient or slow to move that much memory around, the operation stays in a thread, and that operation can itself spawn multiple threads.
The key for us is to make sure that we don't hit concurrency bottlenecks. We develop in .NET, which means that UI updates have to be done from an Invoke call to the UI in order to have the main thread update the UI.
Maybe I'm lazy, but really, I don't want to have to spend too much time figuring a lot of this stuff out when it comes to parallelizing things like matrix inversions. A lot of really smart people have spent a lot of time making that stuff fast like nitrous, and I just want to take what they've done and call it. Something like CUDA has an interesting interface for image processing (of course, that's what it's designed for), but it's still too immature for that kind of plug-and-play programming. If I or another developer get a lot of spare time, we might give it a try. So instead, we'll just go with OpenMP to make our processing faster (and that's definitely on the development roadmap for the next few months).
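For a feel of what a first OpenMP step looks like on a per-pixel operation, here is a hedged sketch (the window/level transform and its values are illustrative, not the poster's code); compile with -fopenmp or /openmp.

    #include <cstdint>
    #include <vector>

    void window_level(std::vector<uint16_t>& img, int window, int level) {
        const int n = static_cast<int>(img.size());
        #pragma omp parallel for            // each thread gets a range of pixels
        for (int i = 0; i < n; ++i) {
            long v = (long)(img[i] - (level - window / 2)) * 65535 / window;
            img[i] = (uint16_t)(v < 0 ? 0 : (v > 65535 ? 65535 : v));
        }
    }

    int main() {
        std::vector<uint16_t> image(4096 * 4096, 1000);  // fake 16-bit image
        window_level(image, 400, 1050);                  // illustrative values
        return 0;
    }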
So far, nothing more than more efficient compilation with make:
gmake -j
The -j option allows tasks that don't depend on one another to run in parallel.
I'm developing ASP.NET web applications. There is little opportunity to use multicore directly in my code; however, IIS already scales well for multiple cores/CPUs by spawning multiple worker threads/processes under load.
We're having a lot of success with task parallelism in .NET 4 using F#. Our customers are crying out for multicore support because they don't want their n-1 cores idle!
I'm in image processing. We're taking advantage of multicore where possible by processing images in slices doled out to different threads.
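A minimal sketch of that slicing scheme with std::thread; process_rows and the invert filter are stand-ins for whatever per-slice work you actually do.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    void process_rows(std::vector<uint8_t>& img, int width, int r0, int r1) {
        for (int r = r0; r < r1; ++r)
            for (int c = 0; c < width; ++c)
                img[r * width + c] = 255 - img[r * width + c];  // example filter
    }

    int main() {
        const int width = 1920, height = 1080;
        std::vector<uint8_t> image(width * height, 128);

        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        int rowsPerSlice = (height + (int)n - 1) / (int)n;
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i) {
            int r0 = (int)i * rowsPerSlice;
            int r1 = std::min(height, r0 + rowsPerSlice);
            if (r0 >= r1) break;
            workers.emplace_back(process_rows, std::ref(image), width, r0, r1);
        }
        for (auto& t : workers) t.join();   // slices are disjoint: no locking
        return 0;
    }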
I said some of this in answer to a different question (hope this is OK!): there is a concept/methodology called Flow-Based Programming (FBP) that has been around for over 30 years, and is being used to handle most of the batch processing at a major Canadian bank. It has thread-based implementations in Java and C#, although earlier implementations were fiber-based (C++ and mainframe Assembler).
Most approaches to the problem of taking advantage of multicore involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black-box" components running asynchronously (think of a manufacturing assembly line). Since the interface between components is data streams, FBP is essentially language-independent, and therefore supports mixed-language applications and domain-specific languages.
Applications written this way have been found to be much more maintainable than conventional, single-threaded applications, and often take less elapsed time, even on single-core machines.
My graduate work is in developing concepts for doing bare-metal multicore work & teaching same in embedded systems.
I'm also working a bit with F# to bring my high-level multiprocessing language facilities up to speed.
We created the VivaMP code analyzer for detecting errors in parallel OpenMP programs.
VivaMP is a lint-like static C/C++ code analyzer meant to indicate errors in parallel programs based on OpenMP technology. The VivaMP static analyzer adds much to the abilities of existing compilers and diagnoses parallel code that contains errors or is a potential source of such errors. The analyzer is integrated into the Visual Studio 2005/2008 development environment.
VivaMP – a tool for OpenMP
32 OpenMP Traps For C++ Developers
I believe that "Cycles are an engineers' best friend".
My company provides a commercial tool for analyzing and transforming very large software systems in many computer languages. "Large" means 10-30 million lines of code. The tool is the DMS Software Reengineering Toolkit (DMS for short).
Analyses (and even transformations) on such huge systems take a long time: our points-to analyzer for C code takes 90 CPU hours on an x86-64 with 16 GB RAM. Engineers want answers faster than that.
Consequently, we implemented DMS in PARLANSE, a parallel programming language of our own design, intended to harness small-scale multicore shared-memory systems. The key ideas behind PARLANSE are: a) let the programmer expose parallelism, b) let the compiler choose which part it can realize, c) keep the context switching to an absolute minimum. Static partial orders over computations are an easy way to help achieve all three: easy to say, relatively easy to measure costs, easy for the compiler to schedule computations. (Writing parallel quicksort with this is trivial.)
Unfortunately, we did this in 1996 :-( The last few years have finally been a vindication; I can now get 8-core machines at Fry's for under $1K and 24-core machines for about the same price as a small car (and likely to drop rapidly).
The good news is that DMS is now fairly mature, and there are a number of key internal mechanisms in DMS which take advantage of this, notably an entire class of analyzers called "attribute grammars", which we write using a domain-specific language which is NOT PARLANSE. DMS compiles these attribute grammars into PARLANSE, and then they are executed in parallel. Our C++ front end uses attribute grammars and is about 100K SLOC; it is compiled into 800K SLOC of parallel PARLANSE code that actually works reliably.
Now (June 2009), we are pretty busy making DMS useful, and don't always have enough time to harness the parallelism well. Thus the 90-hour points-to analysis. We are working on parallelizing that, and have reasonable hope of a 10-20x speedup.
We believe that in the long run, harnessing SMP well will make workstations far more friendly to engineers asking hard questions. As well they should.
Our domain logic is based heavily on a workflow engine and each workflow instance runs off the ThreadPool.
That's good enough for us.
I can now separate my main operating system from the OS I use for development and for installing whatever I like, using virtualisation setups with Virtual PC or VMware.
Dual core means that one core runs my host OS and the other runs my development OS with a decent level of performance.
Learning a functional programming language might let you use multiple cores... but it's costly.
I think it's not really hard to use extra cores. There are some trivial cases, such as web apps, that don't need any extra care, since the web server does its work running the queries in parallel. The questions are about long-running algorithms (long is whatever you call long). These need to be split over smaller domains that don't depend on each other, or the dependencies need to be synchronized. A lot of algorithms can do this, but sometimes horribly different implementations are needed (costs again).
So, no silver bullet as long as you are using imperative programming languages, sorry. Either you need skilled programmers (costly) or you need to turn to another programming language (costly). Or you may simply be lucky (web).
I'm using and programming on a Mac. Grand Central Dispatch for the win. The Ars Technica review of Snow Leopard has a lot of interesting things to say about multicore programming and where people (or at least Apple) are going with it.
I've decided to take advantage of multiple cores in an implementation of the DEFLATE algorithm. Mark Adler did something similar in C code with PIGZ (parallel gzip). I've delivered the philosophical equivalent, but in a managed code library, in DotNetZip v1.9. This is not a port of PIGZ, but a similar idea, implemented independently.
The idea behind DEFLATE is to scan a block of data, look for repeated sequences, build a "dictionary" that maps a short "code" to each of those repeated sequences, then emit a byte stream where each instance of one of the repeated sequences is replaced by a "code" from the dictionary.
Because building the dictionary is CPU intensive, DEFLATE is a perfect candidate for parallelization. I've taken a Map+Reduce type approach, where I divide the incoming uncompressed byte stream into a set of smaller blocks (map), say 64k each, and then compress those independently. Then I concatenate the resulting blocks together (reduce). Each 64k block is compressed independently, on its own thread, without regard for the other blocks.
On a dual-core machine, this approach compresses in about 54% of the time of the traditional serial approach. On server-class machines, with more cores available, it could potentially deliver even better results; I don't have a server machine, so I haven't tested it personally, but people tell me it's fast.
There's runtime (CPU) overhead associated with the management of multiple threads, runtime memory overhead associated with the buffers for each thread, and data overhead associated with concatenating the blocks. So this approach pays off only for larger byte streams. In my tests, it can pay off above 512k. Below that, it is better to use a serial approach.
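The shape of that Map+Reduce approach, sketched in C++ rather than managed code: compress_block here is a placeholder (a real version would run DEFLATE on each block, e.g. via zlib, and would also have to handle block framing when concatenating).

    #include <algorithm>
    #include <cstdint>
    #include <future>
    #include <vector>

    using Block = std::vector<uint8_t>;

    Block compress_block(const Block& in) {   // placeholder for real DEFLATE
        return in;                            // "compresses" by doing nothing
    }

    Block parallel_compress(const Block& input, size_t blockSize = 64 * 1024) {
        std::vector<std::future<Block>> futures;
        for (size_t off = 0; off < input.size(); off += blockSize) {    // map
            size_t len = std::min(blockSize, input.size() - off);
            futures.push_back(std::async(std::launch::async, [&input, off, len] {
                return compress_block(Block(input.begin() + off,
                                            input.begin() + off + len));
            }));
        }
        Block out;                                                      // reduce
        for (auto& f : futures) {
            Block b = f.get();                // blocks come back in order
            out.insert(out.end(), b.begin(), b.end());
        }
        return out;
    }

    int main() {
        Block data(1 << 20, 0x42);            // 1 MB of test data
        return parallel_compress(data).empty() ? 1 : 0;
    }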
DotNetZip is delivered as a library. My goal was to make all of this transparent, so the library automatically uses the extra threads when the buffer is above 512kb. There's nothing the application has to do in order to use threads. It just works, and when threads are used, it's magically faster. I think this is a reasonable approach to take for most libraries being consumed by applications.
It would be nice for the computer to be smart about automatically and dynamically exploiting resources on parallelizable algorithms, but the reality today is that app designers have to explicitly code the parallelization in.
I work in C# with .NET threads.
You can combine object-oriented encapsulation with Thread management.
I've read some posts by Peter talking about a new book from Packt Publishing, and I found the following article on the Packt Publishing web page:
http://www.packtpub.com/article/simplifying-parallelism-complexity-c-sharp
I've read Concurrent Programming on Windows, Joe Duffy's book. Now I am waiting for "C# 2008 and 2005 Threaded Programming", Hillar's book - http://www.amazon.com/2008-2005-Threaded-Programming-Beginners/dp/1847197108/ref=pd_rhf_p_t_2
I agree with Szundi "No silver bullet"!
You say "For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy."
I am working with Web applications and I do need to take full advantage of parallelism.
I understand your point. However, we must prepare for the multicore revolution. Ignoring it is the same as ignoring the GUI revolution in the '90s.
We are not still developing for DOS, are we? We must tackle multicore or we'll be dead within a few years.
I think this trend will first persuade some developers, and then most of them will see that parallelization is a really complex task.
I expect some design patterns to emerge to take care of this complexity. Not low-level ones, but architectural patterns which will make it hard to do something wrong.
For example, I expect messaging patterns to gain popularity, because they're inherently asynchronous, but you don't have to think about deadlocks or mutexes or whatever.
How does this affect your software roadmap?
It doesn't. Our business-related apps (as with almost all others) run perfectly well on a single core. So long as adding more cores doesn't significantly reduce the performance of single-threaded apps, we're happy.
...real stories...
Like everyone else, parallel builds are the main benefit we get. The Visual Studio 2008 C# compiler doesn't seem to use more than one core, though, which really sucks.
What are you doing with your existing code to take advantage of multicore machines
We may look into using the .NET parallel extensions if we ever have a long-running algorithm that can be parallelized, but the odds of this actually occurring are slim. The most likely answer is that some of the developers will play around with it for interest's sake, but not much else
how will you deal with hundreds or thousands of cores?
Head -> Sand.
If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
The client app mostly pushes data around; the server app mostly relies on SQL Server to do the heavy lifting.
I'm taking advantage of multicore using C, PThreads, and a home-brew implementation of Communicating Sequential Processes on an OpenVPX platform with Linux, using the PREEMPT_RT patch set's scheduler. It all adds up to nearly 100% CPU utilisation across multiple OS instances, with no CPU time used for data exchange between processor cards in the OpenVPX chassis, and very low latency too. We also use sFPDP to join multiple OpenVPX chassis together into a single machine. I'm not using the Xeons' internal DMA, so as to relieve memory pressure inside the CPUs (DMA still uses memory bandwidth at the expense of the CPU cores). Instead we're leaving data in place and passing ownership of it around in a CSP way (not unlike the philosophy of .NET's task parallel data flow library).
1) Software Roadmap - we have pressure to maximise the use of real estate and available power. Making the very most of the latest hardware is essential.
2) Software domain - effectively Scientific Computing
3) What are we doing with existing code? Constantly breaking it apart and redistributing parts of it across threads so that each core is maxed out doing the most it possibly can without breaking our real-time requirement. New hardware means quite a lot of re-thinking (faster cores can do more in the given time; we don't want them to be under-utilised). Not as bad as it sounds - the core routines are very modular, so they are easily assembled into thread-sized lumps. Although we planned on taking control of thread affinity away from Linux, we've not yet managed to extract significant extra performance by doing so. Linux is pretty good at getting data and code in more or less the same place.
4) In effect already there - total machine already adds up to thousands of cores
5) Parallel computing is essential - it's a MISD system.
If that sounds like a lot of work, it is. Some jobs require going whole hog on making the absolute most of available hardware and eschewing almost everything that is high-level. We're finding that total machine performance is a function of CPU memory bandwidth, not of CPU core speed or L1/L2/L3 cache size.
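The "leave data in place and pass ownership around" idea, reduced to a single-process C++ miniature; the Channel and Buffer types are invented, and the real system does this across processor cards over sFPDP rather than between two threads.

    #include <condition_variable>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Buffer { std::vector<double> samples; };

    class Channel {                   // CSP-style: moves ownership, no copies
        std::queue<std::unique_ptr<Buffer>> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void send(std::unique_ptr<Buffer> b) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
            cv_.notify_one();
        }
        std::unique_ptr<Buffer> receive() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            auto b = std::move(q_.front());
            q_.pop();
            return b;
        }
    };

    int main() {
        Channel ch;
        std::thread consumer([&ch] {
            auto buf = ch.receive();  // takes ownership; data never copied
            buf->samples[0] += 1.0;
        });
        auto buf = std::unique_ptr<Buffer>(new Buffer);
        buf->samples.resize(1024, 0.0);
        ch.send(std::move(buf));      // producer gives ownership away
        consumer.join();
        return 0;
    }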

What challenges promote the use of parallel/concurrent architectures?

I am quite excited by the possibility of using languages which have parallelism/concurrency built in, such as Stackless Python and Erlang, and have a firm belief that we'll all have to move in that direction before too long - or will want to, because it will be a good/easy way to achieve scalability and performance.
However, I am so used to thinking about solutions in a linear/serial/OOP/functional way that I am struggling to cast any of my domain problems in a way that merits using concurrency. I suspect I just need to unlearn a lot, but I thought I would ask the following:
Have you implemented anything reasonably large in Stackless or Erlang or another such language?
Why was it a good choice? Was it a good choice? Would you do it again?
What characteristics of your problem meant that concurrent/parallel was right?
Did you re-cast an existing problem to take advantage of concurrency/parallelism? and
if so, how?
Anyone any experience they are willing to share?
In the past, when desktop machines had a single CPU, parallelization only applied to "special" parallel hardware. But these days desktops usually have from 2 to 8 cores, so parallel hardware is now the standard. That's a big difference, and therefore it is not just about which problems suggest parallelism, but also how to apply parallelism to a wider set of problems than before.
In order to take advantage of parallelism, you usually need to recast your problem in some way. Parallelism changes the playground in many ways:
You get data coherence and locking problems, so you need to organize your problem so that you have semi-independent data structures which can be handled by different threads, processes, and computation nodes (there is a sketch of this recast after this list).
Parallelism can also introduce nondeterminism into your computation, if the relative order in which the parallel components do their jobs affects the results. You may need to protect against that, and define a parallel version of your algorithm which is robust against different scheduling orders.
When you transcend intra-motherboard parallelism and get into networked/cluster/grid computing, you also get the issues of network bandwidth, the network going down, and the proper management of failing computational nodes. You may need to modify your problem so that it becomes easier to handle situations where part of the computation is lost when a network node goes down.
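A tiny C++ sketch of the first recast in the list above: each thread reduces its own chunk with no locks, and the final merge makes the result deterministic regardless of scheduling order.

    #include <algorithm>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(1000000, 0.5);
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(n, 0.0);   // one slot per thread: no sharing

        std::vector<std::thread> workers;
        size_t chunk = data.size() / n;
        for (unsigned i = 0; i < n; ++i) {
            size_t begin = i * chunk;
            size_t end = (i + 1 == n) ? data.size() : begin + chunk;
            workers.emplace_back([&data, &partial, i, begin, end] {
                partial[i] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0.0);
            });
        }
        for (auto& t : workers) t.join();
        double total = std::accumulate(partial.begin(), partial.end(), 0.0);
        return total > 0 ? 0 : 1;
    }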
Before we had operating systems people building applications would sit down and discuss things like:
how will we store data on disks
what file system structure will we use
what hardware will our application work with
etc, etc
Operating systems emerged from collections of 'developer libraries'.
The beauty of an operating system is that your UNWRITTEN software has certain characteristics, it can:
talk to permanent storage
talk to the network
run in a command line
be used in batch
talk to a GUI
etc, etc
Once you have shifted to an operating system - you don't go back to the status quo ante...
Erlang/OTP (i.e. not plain Erlang) is an application system - it runs on two or more computers.
The beauty of an APPLICATION SYSTEM is that your UNWRITTEN software has certain characteristics, it can:
fail over between two machines
work in a cluster
etc, etc...
Guess what, once you have shifted to an Application System - you don't go back either...
You don't have to use Erlang/OTP; Google has a good application system in their App Engine, so don't get hung up on the language syntax.
There may well be good business reasons to build on the Erlang/OTP stack rather than Google App Engine - the biz dev guys in your firm will make that call for you.
The problems will stay almost the same in the future, but the underlying hardware for solving them is changing. To make use of it, the way of communication between objects (components, processes, services, whatever you call them) will change. Messages will be sent asynchronously, without waiting for a direct response. Instead, after a job is done, the process will call the sender back with the answer. It's like people working together.
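A toy of that "send, don't wait, get called back" style in C++ (a single worker thread stands in for a remote process; all names are invented).

    #include <chrono>
    #include <functional>
    #include <string>
    #include <thread>

    // Hypothetical async call: ships the job off and returns immediately;
    // the worker invokes the callback when the answer is ready.
    void async_request(const std::string& job,
                       std::function<void(std::string)> on_done) {
        std::thread([job, on_done] {
            std::string answer = "done: " + job;   // pretend work
            on_done(answer);                       // "call the sender back"
        }).detach();
    }

    int main() {
        async_request("resize image", [](std::string answer) {
            (void)answer;                // runs later, on the worker's thread
        });
        // the sender is free to do other work instead of blocking
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return 0;
    }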
I'm currently designing a lightweight event-driven architecture based on Erlang/OTP. It's called Tideland EAS. I'm describing the ideas and principles here: http://code.google.com/p/tideland-eas/wiki/IdeasAndPrinciples. It's not ready, but maybe you'll understand what I mean.
mue
Erlang makes you think of the problem in parallel. You won't forget it for one second. After a while you adapt. Not a big problem. Except the solution becomes parallel in every little corner. All other languages you have to tweak to be concurrent. And that doesn't feel natural. Then you end up hating your solution. Not fun.
The biggest advantage Erlang has is that it has no global garbage collection. It will never take a break. That is kind of important when you have 10,000 page views a second.