I've been reading around about how to check how many CPUs or cores my machine has (MacBook, macOS Sierra 10.12, 2 GHz Intel Core i7), but there are many different answers, for example here:
How to discover number of *logical* cores on Mac OS X?
What I need, though, is to make sure that my C++ program runs on just one CPU (and, if possible, on only one core, i.e., without being moved around by the scheduler; in other words, my program should have a dedicated core to run on. I'm not sure how the Mac OS X architecture is actually organised in this respect).
I am not sure whether this should be done at the implementation, compilation or execution level. I've seen people talking about taskset on Ubuntu, but I'm not sure if that's the right tool for me (it may not even exist for Mac OS X).
Note: if you feel this question should be asked on another Stack Exchange site, just tell me and I will move it there. Actually, I would like my solution to be cross-platform, so maybe this is not the best place to ask.
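For reference, a minimal sketch of querying core counts, assuming C++11; std::thread::hardware_concurrency() is standard, and hw.physicalcpu / hw.logicalcpu are the documented macOS sysctl keys:

#include <iostream>
#include <thread>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

int main() {
    // Portable C++11: number of hardware threads (0 means "unknown").
    std::cout << "hardware_concurrency: "
              << std::thread::hardware_concurrency() << '\n';
#ifdef __APPLE__
    // macOS-specific: physical vs. logical core counts via sysctl.
    int physical = 0, logical = 0;
    size_t size = sizeof(int);
    sysctlbyname("hw.physicalcpu", &physical, &size, nullptr, 0);
    size = sizeof(int);
    sysctlbyname("hw.logicalcpu", &logical, &size, nullptr, 0);
    std::cout << "physical: " << physical << ", logical: " << logical << '\n';
#endif
}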
Processes are scheduled; the idea of a non-scheduled process is an oxymoron.
That said, restricting yourself to one CPU is pretty much the default in C++. main starts on one thread, and unless you create additional threads that's all you get.
You mention that you want a "dedicated" core. There is the idea of pinning a thread to a core, which sort of achieves that, but consider what happens if two programs pin themselves to the same core: another core might sit fully unused while the two programs share the pinned one. This is more of a feature for supercomputers, where cores do not have uniform access to memory (NUMA), and you want to match CPU cores to memory allocations.
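For completeness, here is what pinning looks like on Linux; this is a sketch only, since macOS has no pthread_setaffinity_np (its closest analogue is the Mach thread_policy_set() affinity hint, which the scheduler is free to ignore), and taskset is likewise Linux-only:

#define _GNU_SOURCE  // for CPU_* macros and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to the given core; returns 0 on success.
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    int rc = pin_to_core(0);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    // ... run the single-threaded workload here; the kernel will keep it
    // on core 0, but other processes may still be scheduled there too ...
    return 0;
}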
Related
I'm trying to find out whether I could use an Intel Xeon Phi coprocessor to "parallelize" the following problem:
Say I have 2000 files that need to be processed by a single-threaded executable. For each file, the executable reads it, does its thing, and outputs it to a corresponding output file, then exits.
For instance:
FILES=/path/to/*
for f in $FILES
do
  # take action on each file; "$f" is quoted in case of spaces in filenames
  ./executable "$f" outFileCorrespondingTo_f
done
The tools are not coded for multi-threaded execution or for looping through the files, nor do we wish to change anything in their code for now. They're written in C with some external libraries.
My questions are:
Could this kind of "script-looping" be run on the Xeon Phi's native OS in such a way that it parallelizes the calls to the executable, so they run concurrently on all of its cores? Is it "general-purpose" enough for that?
The files themselves are rather small, so its 8 GB of memory would be more than enough for storing the data at runtime, but not for keeping all of the output on the device, so I would need to output to the host. So my second question is: is this kind of memory exchange possible "externally"?
i.e. not coded into the tool, but managed by the host OS and the device, for every execution of the executable.
If this is possible, could it provide a performance boost in any way, or would the memory-transfer and thread-allocation bottlenecks be too severe? Basically, each execution takes a few seconds, depending on the length of the input file, but I'm pretty confident this is a few orders of magnitude longer than the time it would take to transfer the file.
Xeon Phi co-processors run a fairly full-featured version of the Linux operating system, so most of what you are used to on a Linux box is likely to work on the Xeon Phi as well.
Now, for your specific issue, I guess that GNU Parallel should let you do what you want in a breath. You'll simply have to have your file system mounted on the card so that you can access the files directly, but this is just standard stuff for a Xeon Phi node. Be aware that this will generate some traffic on the PCIe link between the host and the co-processor for the file transfers.
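If GNU Parallel turns out not to be available on the card, the same fan-out can be approximated with a small driver. This is only an illustrative sketch: the ./executable name and argument order are taken from the question, the file list is a placeholder, and the worker count is a guess at the Phi's hardware-thread count:

#include <cstdio>
#include <string>
#include <vector>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    std::vector<std::string> files = {"in1", "in2", "in3"};  // placeholder list
    const int max_workers = 240;  // guess: roughly the Phi's hardware threads
    int running = 0;

    for (const std::string& f : files) {
        if (running == max_workers) {  // throttle: wait for one child to exit
            wait(nullptr);
            --running;
        }
        pid_t pid = fork();
        if (pid == 0) {                // child: exec the single-threaded tool
            std::string out = "out_" + f;
            execl("./executable", "executable", f.c_str(), out.c_str(),
                  (char*)nullptr);
            _exit(127);                // only reached if exec failed
        } else if (pid > 0) {
            ++running;
        } else {
            std::perror("fork");
        }
    }
    while (wait(nullptr) > 0) {}       // reap the remaining children
    return 0;
}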
Regarding performance, this is hard to tell: the lower single-threaded performance of Xeon Phi cores, along with the transfer times, definitely suggests a big hit in this domain, but the level of parallelism you can extract from the device might very well overcome this, depending on how compute-intensive your workload is. The best answer is for you to give it a try...
This is an addition to the answer given by Gilles.
Yes, the Xeon Phi should be able to do what you want at a basic operational level.
Even so, I think it is the wrong platform for your purpose for a few reasons.
Each core on the Xeon Phi is a Pentium core. Though it is enhanced (4 threads/core, 512 bit vector engine, etc), it is still a Pentium. That means it runs scalar code as a Pentium. Your task sounds like a whole bunch of serial processes running in parallel. So each process will run as if it is running on a Pentium.
To achieve remarkable performance, you need code that parallelizes well (read that as OpenMP, lightweight threads, and thread pooling) and also vectorizes (takes advantage of the 512-bit vector engine). Without both of those enhancements, you are running on a Pentium, albeit a lot of Pentiums.
Moving data across the PCIe bus is slow. If you are transferring a lot of files, this can be even slower, though you can reduce the contention a little by hiding latency (depending upon your application). If you are hitting the PCIe bus with 244 file-read requests on start-up, that's quite a lot of contention. Even in a steady state, it sounds like you'll be reading more than 20 files at any given time (and I suspect even more, given that we are executing scalar code as on a Pentium).
Now the KNL architecture might be more appropriate for your needs, but that isn't out yet.
If you still think the Xeon Phi might be appropriate for what you want to do, you can ask the Xeon Phi Intel forum experts. If your application is proprietary/sensitive, you can ask the Intel experts as a private message.
We are developing a very low-level application that runs before the OS boots; in fact, a boot application.
The question is: how should we utilize CPU cores/threads?
And how many threads should we run?
Is it possible at all? Is there any link/tutorial?
Since you're talking about threading before booting the OS, I'm going to assume that no kernel is available to you yet. That means no system calls, so no fork() or clone(). For the purpose of this answer, however, I'm also going to assume that you have already set up the A20-gate, a GDT, either protected (for IA-32) or long (for x86-64) mode, and so on. If you don't know what these are, we probably shouldn't be talking about threads before booting to begin with.
There are opcodes and tricks you can use to let your processor use other cores, thus implementing threading quite directly. You can find all these things in the Intel x86 manuals (you are working on x86, aren't you? You obviously need a different set of manuals if you're on a different architecture) here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
The reason there are no tutorials for something like this is, quite frankly, that it's not very useful. The entire point of setting things up before loading the kernel into memory is to make it easier to load the kernel into memory. Threading does not exactly contribute to this goal. It would be advisable to simply let the kernel deal with such low-level implementation requirements, so that you can then use the fork() and clone() system calls for all your threading needs.
EDIT: Good correction by Sinn: fork() creates a new process, which of course isn't actually threading.
I've already posted this question here, but since it's maybe not that Qt-specific, I thought I might try my luck here as well. I hope it's not inappropriate to do that (just tell me if it is).
I’ve developed a small scientific program that performs some mathematical computations. I’ve tried to optimize it so that it’s as fast as possible. Now I’m almost done deploying it for Windows, Mac and Linux users. But I have not been able to test it on many different computers yet.
Here’s what troubles me: To deploy for Windows, I’ve used a laptop which has both Windows 7 and Ubuntu 12.04 installed on it (dual boot). I compared the speed of the app running on these two systems, and I was shocked to observe that it’s at least twice as slow on Windows! I wouldn’t have been surprised if there were a small difference, but how can one account for such a difference?
Here are a few details:
The computations that I make the program do are just some brutal and stupid mathematical calculations; basically, it computes products and cosines in a loop that is called a billion times. On the other hand, the computation is multi-threaded: I launch something like 6 QThreads. (A minimal reproduction sketch follows these details.)
The laptop has two cores @1.73 GHz. At first I thought that Windows was probably not using one of the cores, but then I looked at the processor activity; according to the small graph, both cores are running at 100%.
Then I thought the C++ compiler for Windows didn't use the optimization options (things like -O1, -O2) that the C++ compiler for Linux automatically uses (in a release build), but apparently it does.
I'm bothered that the app is so much slower (2 to 4 times) on Windows; it's really weird. On the other hand, I haven't tried it on other computers with Windows yet. Still, do you have any idea why the difference?
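For readers who want to compare the two systems themselves, here is a minimal reproduction sketch under stated assumptions: it uses std::thread rather than QThread, and the loop count is made up, since the original code isn't shown:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

// Burn CPU the way the question describes: products and cosines in a loop.
static double work(long iterations) {
    double acc = 1.0;
    for (long i = 0; i < iterations; ++i)
        acc *= std::cos(static_cast<double>(i) * 1e-9) + 1.000000001;
    return acc;
}

int main() {
    const int n_threads = 6;           // matches the 6 QThreads in the question
    const long iters = 200000000L;     // made-up count; tune so a run takes seconds
    std::vector<std::thread> pool;
    std::vector<double> results(n_threads);  // one slot per thread, no data race

    auto t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < n_threads; ++t)
        pool.emplace_back([&results, t, iters] { results[t] = work(iters); });
    for (auto& th : pool)
        th.join();
    auto t1 = std::chrono::steady_clock::now();

    double sum = 0;                    // use the results so the work isn't elided
    for (double r : results) sum += r;
    std::printf("elapsed: %.2f s (checksum %g)\n",
                std::chrono::duration<double>(t1 - t0).count(), sum);
    return 0;
}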
Additional info: some data…
Even though Windows seems to be using the two cores, I'm thinking this might have something to do with thread management; here's why:
Sample Computation n°1 (this one launches 2 QThreads):
PC1-windows: 7.33s
PC1-linux: 3.72s
PC2-linux: 1.36s
Sample Computation n°2 (this one launches 3 QThreads):
PC1-windows: 6.84s
PC1-linux: 3.24s
PC2-linux: 1.06s
Sample Computation n°3 (this one launches 6 QThreads):
PC1-windows: 8.35s
PC1-linux: 2.62s
PC2-linux: 0.47s
where:
PC1-windows = my 2-core laptop (@1.73 GHz) with Windows 7
PC1-linux = my 2-core laptop (@1.73 GHz) with Ubuntu 12.04
PC2-linux = my 8-core laptop (@2.20 GHz) with Ubuntu 12.04
(Of course, it's not shocking that PC2 is faster. What's incredible to me is the difference between PC1-windows and PC1-linux).
Note: I've also tried running the program on a recent PC (4 or 8 cores @ ~3 GHz, I don't remember exactly) under Mac OS; the speed was comparable to PC2-linux (or slightly faster).
EDIT: I'll answer here a few questions I was asked in the comments.
I just installed the Qt SDK on Windows, so I guess I have the latest version of everything (including MinGW?). The compiler is MinGW. The Qt version is 4.8.1.
I use no optimization flags because I noticed that they are automatically used when I build in release mode (with Qt Creator). It seems to me that if I write something like QMAKE_CXXFLAGS += -O1, this only has an effect in the debug build (presumably because in release mode the -O2 appended later from QMAKE_CXXFLAGS_RELEASE overrides it).
Lifetime of threads, etc.: this is pretty simple. When the user clicks the "Compute" button, 2 to 6 threads are launched simultaneously (depending on what he is computing), and they are terminated when the computation ends. Nothing too fancy. Every thread just does brutal computations (except one, actually, which makes a (not so) small "computation" every 30 ms, basically checking whether the error is small enough).
EDIT: latest developments and partial answers
Here are some new developments that provide answers about all this:
I wanted to determine whether the difference in speed really had something to do with threads or not. So I modified the program so that the computation only uses 1 thread; that way we are pretty much comparing the performance of "pure C++ code". It turned out that now Windows was only slightly slower than Linux (something like 15%). So I guess that a small (but not insignificant) part of the difference is intrinsic to the system, but the largest part is due to thread management.
As someone (Luca Carlon, thanks for that) suggested in the comments, I tried building the application with the Microsoft Visual Studio compiler (MSVC) instead of MinGW. And surprise: the computation (with all the threads and everything) was now "only" 20% to 50% slower than on Linux! I think I'm going to go ahead and be content with that. Weirdly, though, I noticed that the "pure C++" computation (with only one thread) was now even slower (than with MinGW), which must account for the overall difference. So as far as I can tell, MinGW is slightly better than MSVC, except that it handles threads like a moron.
So I'm thinking either there's something I can do to make MinGW (ideally I'd rather use it than MSVC) handle threads better, or it just can't. I would be amazed: how could this not be well known and documented? Although I guess I should be careful about drawing conclusions too quickly; I've only compared things on one computer (for the moment).
Another possibility: on Linux the Qt libraries may already be loaded (this can happen, e.g., if you use KDE), while on Windows they must be loaded from disk, which slows things down. To check how much library loading costs your application, you could write a dummy test with pure C++ code.
I have noticed exactly the same behavior on my PC.
I am running Windows 7 (64-bit), Ubuntu (64-bit) and OS X (Lion, 64-bit), and my program compares 2 XML files (more than 60 MB each). It uses multithreading too (2 threads):
- Windows: 40 s
- Linux: 14 s (!!!)
- OS X: 22 s
I use a personal class for threads (not the Qt one) which uses pthreads on Linux/OS X and Windows threads on Windows.
I use the Qt/MinGW compiler, as I need the XML classes from Qt.
I have found no way (for now) to get the 3 OSes to perform similarly... but I hope I will!
I think that another reason may be the memory: my program uses about 500 MB of RAM. So I think that Unix manages it better, because in mono-thread Windows is exactly 1.89 times slower, and I don't see what else could make it almost 2 times slower!
I have heard of one case where Windows was extremely slow at writing files if you do it wrongly. (This has nothing to do with Qt.)
The problem in that case was that the developer used a SQLite database, wrote some 10000 datasets, and did a SQL COMMIT after each insert. This caused Windows to write the whole DB file to disk each time, while Linux would only update the buffered version of the filesystem inode in the RAM. The speed difference was even worse in that case: 1 second on Linux vs. 1 minute on Windows. (After he changed SQLite to commit only once at the end, it was also 1 second on Windows.)
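The fix, with the SQLite C API, looks roughly like this; a sketch only, with a made-up table and no error handling:

#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("results.db", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t(v INTEGER)",
                 nullptr, nullptr, nullptr);

    // One transaction around all inserts: a single commit (and a single
    // flush to disk) instead of one per row.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    for (int i = 0; i < 10000; ++i) {
        char sql[64];
        std::snprintf(sql, sizeof sql, "INSERT INTO t VALUES (%d)", i);
        sqlite3_exec(db, sql, nullptr, nullptr, nullptr);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
    sqlite3_close(db);
    return 0;
}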
So if you're writing the results of your computation to disk, you might want to check whether you're calling fsync() or fflush() too often. If your writing code comes from a library, you can use strace to find out (Linux-only, but it should give you a basic idea).
You might be experiencing performance differences due to how mutexes work on Windows and Linux.
Pure mutex code on Windows can incur a ~15 ms wait every time there is contention for a resource when locking. A better-performing synchronization mechanism on Windows is the critical section; it doesn't suffer the locking penalty that regular mutexes experience in most cases.
I have found that on Linux, regular mutexes perform about the same as critical sections do on Windows.
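To make the difference concrete, here is a sketch of the two Windows primitives; error handling is omitted:

#include <windows.h>

// Kernel mutex: every lock/unlock is a kernel transition, and a contended
// wait can cost on the order of a scheduling quantum.
void with_kernel_mutex() {
    HANDLE m = CreateMutexW(nullptr, FALSE, nullptr);
    WaitForSingleObject(m, INFINITE);
    // ... critical work ...
    ReleaseMutex(m);
    CloseHandle(m);
}

// CRITICAL_SECTION: lives in user space; an uncontended lock is just an
// interlocked instruction, and it can spin briefly before sleeping.
void with_critical_section() {
    CRITICAL_SECTION cs;
    InitializeCriticalSectionAndSpinCount(&cs, 4000);  // a common spin count
    EnterCriticalSection(&cs);
    // ... critical work ...
    LeaveCriticalSection(&cs);
    DeleteCriticalSection(&cs);
}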
It's probably the memory allocator; try using jemalloc or tcmalloc from Google. Glibc's ptmalloc3 is significantly better than the old, crusty allocator in MSVC's CRT. The comparable option from Microsoft is the Concurrency CRT, but you cannot simply drop it in as a replacement.
I have a main process that will create 4 threads. If I simply run all 4 threads, will the kernel utilize all 4 cores, or will the program be multithreaded on a single core? If not, how is synchronization handled on a multicore machine? I have a 4-core Intel CPU and my program is in C++.
I'm running this on Linux in a virtual machine.
You don't really know.
For one thing, the C++03 Standard doesn't know anything about threads, cores or any of that kind of stuff, so this is all platform-dependent anyway.
But even from a platform point of view, you often still don't really know. The operating system schedules threads and jobs. The OS might (or might not) give you the means to specify a "processor affinity" for a particular thread, but using it typically takes some jumping through hoops.
One thing you should also keep in mind is that if your goal is to keep each core 100% utilized, you'll often need more than n threads (where n is the number of cores). Threads spend a lot of time sleeping, waiting on disk, and generally not doing anything on the core. The exact number of threads you'll need depends on your actual application and platform, but experimentation can help guide you towards fine-tuning it.
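A common starting point is to size the thread count from std::thread::hardware_concurrency() (C++11) and tune from there; a minimal sketch:

#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // hardware_concurrency() returns 0 when the count is unknown,
    // so fall back to a sane default in that case.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([i] {
            // ... per-thread work; the kernel decides which core runs it ...
            std::printf("worker %u running\n", i);
        });
    for (auto& t : workers)
        t.join();
    return 0;
}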
I wrote a monolithically designed program which is quite rough on the processor. As I have a dual-core machine, I figured that one CPU should therefore always be at 100%. But both my CPUs are at 100% all the time. Now I am guessing that my compiler somehow turned my monolithic application into a threaded one. What are the limits of such optimization features, and when is it still necessary to explicitly make something threaded?
I am using gcc on Ubuntu Linux, 64-bit.
It doesn't, at least not without using something like Cilk. You must be inadvertently using multiple threads (or processes) without realizing it. Perhaps you're using a third-party library that creates an extra thread or two in your process?
[EDIT]
As per the comments, use a program like top(1) to verify that it is in fact your program's process that is using both CPUs at 100%. In your case, the Xorg process is jumping to 100% because your program is producing a large amount of output.
Any calls to the OS or other libraries (the CRT, for instance) may use other threads as well. I would hardly be surprised if the console ran in its own thread, and if you're doing a lot of I/O of any sort, that could cause the other CPU to max out.