As the title says, I was under the impression that using something like VirtualBox or VMware was actually parallel, but I'm told they are different. When I searched to learn more about this, I just got more lost. Can someone explain the difference to me?
"Parallel" relates to parallel or concurrent computing. Typical resources are multiple cores of the same machine, a cluster consisting of a number of nodes (each with multiple processors and multiple cores), or even large-scale systems running into thousands of cores. This is often linked to computationally intensive applications such as weather forecasting or large-scale simulations.
Virtual setups, using VirtualBox, Parallels, or VMware, are meant to let you run parallel installations of operating systems. For example, you can install Linux on a Mac machine or Windows on a Linux machine. If Windows is installed on a Linux machine using virtualisation software, Windows is known as the guest operating system and Linux is known as the host. On a large scale, a fairly powerful machine can host hundreds of virtual machines.
Finally, to tie things together nicely: if there is a machine with 64 cores and, say, 128GB of RAM, the virtualisation software (VMware for instance) will use the parallel nature of the machine to run the virtual machines. By the same token, you can simulate a cluster using a number of VM installations :)
Does this make sense?
Virtual is something that is not quite 'real'. It is emulated and works just like the real thing.
Parallel means two things running side by side. For example, sprinters are parallel to each other on the track.
In my lab, we have several servers used for simulation programs, but they work independently. Now I want to combine them into a cluster using MPICH so that they can communicate. But there is a problem: these servers have different OSs. Some of them run Red Hat, and some run Ubuntu. On the MPICH homepage, I saw that the downloads for these two operating systems are different, so is it possible to set up a cluster with different operating systems? And how would I do it?
The reason I don't want to reinstall these servers is that there is too much data on them and they are in use as I ask this question.
It is not feasible to get this working properly. You should be able to get the same version of an MPI implementation manually installed on the different distributions. They might even talk to each other properly. But as soon as you try to run actual applications with dynamic libraries, you will get into trouble with different versions of shared libraries, glibc, etc. You will be tempted to link everything statically or to build different binaries for the different distributions. At the end of the day, you will just chase one issue after another.
As a side note, combining some servers with MPI does not make a High Performance Computing cluster. For instance, an HPC system has sophisticated high-performance interconnects and a high-performance parallel file system.
Also note that your typical HPC application is going to run poorly on heterogeneous hardware (as in each node has different CPU / memory configurations).
From this https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory accesses are not well suited for the Intel Xeon Phi. This is because the architecture uses 61 cores and 8 memory controllers. In the case of L1 and L2 cache misses, it takes up to hundreds of cycles to fetch the line from memory and make it ready for use by the CPU. Such applications are called latency-bound.
Then, the tutorial mentions that many-core architectures (the Xeon Phi coprocessor only) are well suited for highly parallel homogeneous code. Two questions from there:
What is referred to as homogeneous code ?
What are real-world applications which can fully benefit from MIC architecture ?
I see the Intel MIC architecture as an "x86-based GPGPU", and if you are familiar with the concept of GPGPU you will find yourself familiar with the Intel MIC.
A homogeneous cluster is a system infrastructure with multiple execution units (i.e. CPUs) that all have the same features. For example, a multicore system that has four Intel Xeon processors is homogeneous.
A heterogeneous cluster is a system infrastructure with multiple execution units with different features (e.g. CPU and GPU). For example, my Lenovo Z510 with its Intel i7 Haswell (4 CPUs), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is a heterogeneous system.
An example of heterogeneous code could be a Video Game.
A video game has control code, executed by one core of one CPU, that controls what the other agents do; it sends shaders to execute on the GPUs, physics computations to be performed on other cores or GPUs, and so on.
In this example you need to write code that runs on the CPU (so it is "CPU aware") and code that runs on the GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!
Homogeneous code is code that doesn't need to be aware of n different programming models, one for each different kind of agent. It is just the same programming model, language and tool.
Take a look at this very simple sample code for the MPI library.
The code is all written in C; it is the same program that just takes a different flow depending on the rank.
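I can't reproduce the linked sample here, but a minimal sketch of the kind of program meant (assuming an MPI implementation such as MPICH or Open MPI is installed) looks like this — one C source, every rank runs the same binary, and only the flow differs by rank:

    /* minimal sketch: same program, different flow per rank */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* rank 0 takes the "coordinator" path */
            printf("coordinator: %d ranks in total\n", size);
        } else {
            /* every other rank takes the "worker" path */
            printf("worker %d reporting\n", rank);
        }

        MPI_Finalize();
        return 0;
    }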
About the applications: well, that's really a broad question...
As said above, I see the Intel MIC as a GPGPU based on the x86 ISA (part of it, at least).
An SDK that is particularly useful (and listed in the video you linked) for working with clustered systems is OpenCL. It can be used for fast processing of images and computer vision, and basically for anything that needs the same algorithm to be run billions of times with different inputs (like cryptography applications / brute forcing).
If you search for some OpenCL-based projects on the web you will get an idea.
To answer your second question, it is better to ask ourselves "What could not take advantage of the MIC architecture?", and we will soon find that the more an algorithm is distant from the concept of stream processing and the related topics, including that of a kernel, the less it is suitable for the MIC.
First, a straightforward answer to your direct question: to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (+/- depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core for many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512-bit) vectors are a big source of speed on the coprocessor.
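As a rough illustration of those two ingredients (many threads plus a vectorizable loop), here is a hedged OpenMP sketch in C; the array size and the use of "parallel for simd" are just for illustration, not a tuned example:

    /* sketch: one OpenMP thread per hardware thread, vectorizable inner loop */
    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 20)

    int main(void)
    {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0f; }

        /* "simd" asks the compiler to vectorize the loop body,
           which is where the wide vector units earn their keep */
        #pragma omp parallel for simd
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[1] = %f, max threads = %d\n", y[1], omp_get_max_threads());
        return 0;
    }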
Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.
In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)
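A rough sketch of what an offloaded section can look like; the exact directive depends on the compiler (Intel's compiler historically used "#pragma offload target(mic)", while OpenMP 4.0 and later use "#pragma omp target"), and this sketch uses the OpenMP form:

    /* sketch: marked region is offloaded to the coprocessor (if present);
       inside it, ordinary OpenMP parallelism uses the many cores */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        #pragma omp target map(to: a, b) map(from: c)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }

        printf("c[42] = %f\n", c[42]);
        return 0;
    }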
In the native model, you do not use any offload directives but, instead, compile with the -mmic option. Then you run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)
Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case the nodes in your MPI job will be the host nodes in your system. (Hybrid) Alternatively, you can use the native programming model, in which case you treat the coprocessor as just another node in your system. (Heterogeneous if both hosts and coprocessors are nodes; homogeneous if only coprocessors are used.)
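For the MPI-plus-OpenMP variant, a minimal hybrid sketch (where ranks go — host or coprocessor — is left entirely to how you launch it) might look like this:

    /* sketch: one MPI rank per node, OpenMP threads inside each rank */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }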
You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.
As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs
My understanding is that Windows is non-deterministic and can be troublesome when used for data acquisition. Using a 32-bit bus and a dual core, is it possible to use inline asm to work with interrupts in Visual Studio 2005, or at least to set some kind of flag to be consistent in time with little jitter?
Going in the direction of an RTOS (real-time operating system): Windows CE with programming in kernel mode may get too expensive for us.
Real time solutions for Windows such as LabVIEW Real-time or RTX are expensive; a stand-alone RTOS would often be less expensive (or even free), but if you need Windows functionality as well, you are perhaps no further forward.
If cost is critical, you might run a free or low-cost RTOS in a virtual machine. This can work, though there is no cooperation over hardware access between the RTOS and Windows, and no direct communication mechanism (you could use TCP/IP over a virtual (or real) network, I suppose).
Another alternative is to perform the real-time data acquisition on stand-alone hardware (a microcontroller development board or SBC, for example) and communicate with Windows via USB or TCP/IP. It is possible that way to get timing jitter down to the microsecond level or better.
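To give a feel for the Windows side of that split, here is a hedged Winsock sketch; the port number and record handling are made up for illustration, error checking is omitted for brevity, and the point is that all the timing-critical work stays on the external hardware:

    /* sketch: Windows only receives and logs data streamed by the
       acquisition hardware over TCP; jitter on the PC no longer matters */
    #include <winsock2.h>
    #include <stdio.h>
    #pragma comment(lib, "ws2_32.lib")

    int main(void)
    {
        WSADATA wsa;
        WSAStartup(MAKEWORD(2, 2), &wsa);

        SOCKET listener = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(5000);   /* example port, chosen arbitrarily */

        bind(listener, (struct sockaddr *)&addr, sizeof(addr));
        listen(listener, 1);

        SOCKET board = accept(listener, NULL, NULL);  /* hardware connects */

        char buf[512];
        int n;
        while ((n = recv(board, buf, sizeof(buf), 0)) > 0) {
            /* timestamps/buffering are done on the external hardware;
               here we just log or display the samples */
            fwrite(buf, 1, n, stdout);
        }

        closesocket(board);
        closesocket(listener);
        WSACleanup();
        return 0;
    }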
There are third-party real-time extensions to Windows. See, e.g., http://msdn.microsoft.com/en-us/library/ms838340(v=winembedded.5).aspx
Windows is not an RTOS, so there is no magic answer. However, there are some things you can do to make the system more "real time friendly".
Disable background processes that can steal system resources from you.
Use a multi-core processor to reduce the impact of context switching
If your program does any disk I/O, move that to its own spindle.
Look into process priority. Make sure your process is running as High or Realtime (a minimal Win32 sketch follows this list).
Pay attention to how your program manages memory. Avoid doing things that will lead to excessive disk paging.
Consider a real-time extension to Windows (already mentioned).
Consider moving to a real RTOS.
Consider dividing your system into two pieces: (1) a real-time component running on a microcontroller/DSP/FPGA, and (2) a user interface portion that runs on the Windows PC.
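For the process-priority item above, a minimal Win32 sketch; HIGH_PRIORITY_CLASS is used here because REALTIME_PRIORITY_CLASS typically needs administrator rights and can starve the rest of the system:

    /* sketch: raise process and thread priority before the acquisition loop */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        if (!SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS))
            printf("SetPriorityClass failed: %lu\n", GetLastError());

        if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL))
            printf("SetThreadPriority failed: %lu\n", GetLastError());

        /* ... time-critical acquisition loop would go here ... */
        return 0;
    }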
If I'm using Oracle's VirtualBox and I assign more than one virtual core to the virtual machine, how are the actual cores assigned? Does it use both real cores in the virtual machine, or does it use something that emulates cores?
Your question is almost like asking: How does an operating system determine which core to run a given process/thread on? Your computer is making that type of decision all the time - it has far more processes/threads running than you have cores available. This specific answer is similar in nature but also depends on how the guest machine is configured and what support your hardware has available to accelerate the virtualization process - so this answer is certainly not definitive, and I won't really touch on how the host schedules code to be executed, but let's examine two relatively simple cases:
The first would be a fully virtualized machine - this would be a machine with no or minimal acceleration enabled. The hardware presented to the guest is fully virtualized even though many CPU instructions are simply passed through and executed directly on the CPU. In cases like this, your guest VM more-or-less behaves like any process running on the host: The CPU resources are scheduled by the operating system (to be clear, the host in this case) and the processes/threads can be run on whatever cores they are allowed to. The default is typically any core that is available, though some optimizations may be present to try and keep a process on the same core to allow the L1/L2 caches to be more effective and minimize context switches. Typically you would only have a single CPU allocated to the guest operating system in these cases, and that would roughly translate to a single process running on the host.
In a slightly more complex scenario, a virtual machine is configured with all available CPU virtualization acceleration options. In Intel speak these are referred to as VT-x; for AMD it is AMD-V. These primarily support privileged instructions that would normally require some binary translation / trapping to keep the host and guest protected. As such, the host operating system loses a little bit of visibility. Add to that hardware-accelerated MMU support (such that memory page tables can be accessed directly without being shadowed by the virtualization software) and the visibility drops a little more. Ultimately, though, it still largely behaves as the first example: it is a process running on the host and is scheduled accordingly - only that you can think of a thread being allocated to run the instructions (or pass them through) for each virtual CPU.
It is worth noting that while you can (with the right hardware support) allocate more virtual cores to the guest than you have available, it isn't a good idea. Typically this will result in decreased performance as the guest potentially thrashes the CPU and can't properly schedule the resources that are being requested - even if the CPU is not fully taxed. I bring this up as a scenario that shares certain similarities with a multi-threaded program that spawns far more threads (that are actually busy) than there are idle CPU cores available to run them. Your performance will typically be worse than if you had used fewer threads to get the work done.
In the extreme case, VirtualBox even supports hot-plugging CPU resources - though only a few operating systems properly support it: Windows 2008 Data Center edition and certain Linux kernels. The same rules generally apply where one guest CPU core is treated as a process/thread on a logical core for the host, however it is really up to the host and hardware itself to decide which logical core will be used for the virtual core.
With all that being said - on your question of how VirtualBox actually assigns those resources: well, I haven't dug through the code so I certainly can't answer definitively, but it has been my experience that it generally behaves as described. If you are really curious, you could experiment with tagging the VirtualBox VBoxSvc.exe and associated processes in Task Manager, choosing the "Set Affinity" option, limiting their execution to a single CPU, and seeing if those settings are honored. Whether they are honored by the host probably depends on what level of hardware assist you have available, as the guest probably isn't really running as part of those processes.
I've written some MPI code which works flawlessly on large clusters. Each node in the cluster has the same CPU architecture and has access to a networked (i.e. 'common') file system (so that each node can execute the actual binary). But consider this scenario:
I have a machine in my office with a dual core processor (intel).
I have a machine at home with a dual core processor (amd).
Both machines run linux, and both machines can successfully compile and run the MPI code locally (i.e. using 2 cores).
Now, is it possible to link the two machines together via MPI, so that I can utilise all 4 cores, bearing in mind the different architectures, and bearing in mind the fact that there are no shared (networked) filesystems?
If so, how?
Thanks,
Ben.
It's possible to do this. Most MPI implementations allow you to specify the location of the binary to be run on different machines. Alternatively, make sure that it is in your path on both machines. Since both machines have the same byte order, that shouldn't be a problem. You will have to make sure that any input data that the individual processes read is available in both locations.
There are lots of complications with doing this. You need to make sure that the firewalls between the systems will allow process startup and communication. Communication between the machines is going to be much slower, so if your code is communication-heavy or latency-intolerant, it probably will be quite slow. Most likely your execution time running on all 4 cores will be longer than just running with 2 on a single machine.
There is no geographical limitation on where the processes are located. And as KeithB said, there is no need to have a common path or even the same binary on both machines. Depending on what MPI implementation you are using, you don't even need the same endianness.
You can specify exactly the path to the binary on each machine and have two independent binaries as well. However, you should note the program will run slowly if the communication infrastructure between the two nodes is not fast enough.
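As a sketch of the launch step (the exact flags differ between MPI implementations, and the hostnames and paths here are made up), the MPMD form of mpiexec lets each machine run its own locally built binary:

    # sketch only - check your implementation's mpiexec documentation
    mpiexec -host office -n 2 /home/ben/office/prog : -host home -n 2 /home/ben/home/prog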