We are developing a very low-level application that runs before the OS boots, in fact a boot application.
The question is: how should we utilize the CPU cores/threads?
And how many threads should we run?
Is it possible at all? Is there any link/tutorial?
Since you're talking about threading before booting the OS, I'm going to assume that no kernel is available to you yet. That means no system calls, so no fork() or clone(). For the purpose of this answer, however, I'm also going to assume that you have already set up the A20-gate, a GDT, either protected (for IA-32) or long (for x86-64) mode, and so on. If you don't know what these are, we probably shouldn't be talking about threads before booting to begin with.
There are opcodes and tricks you can use to bring up the other cores, thus implementing threading quite directly. You can find all these things in the Intel x86 manuals (you are working on x86, aren't you? You obviously need a different set of manuals if you're on a different architecture) here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
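For concreteness: "using other cores" at this stage means the bootstrap processor (BSP) waking each application processor with an INIT-SIPI-SIPI sequence through the local APIC, as described in volume 3 of those manuals. A heavily simplified, freestanding sketch, not runnable from userspace; the identity-mapped APIC base, the trampoline address 0x8000 and `delay_us` are assumptions, and real code must also discover APIC IDs from the ACPI/MP tables:

```cpp
#include <cstdint>

// Freestanding sketch (BSP side): wake one application processor (AP) via the
// local APIC's Interrupt Command Register (ICR). Assumes the xAPIC at its
// default physical base 0xFEE00000, identity-mapped; a real-mode trampoline
// already copied to physical address 0x8000; and a delay_us() busy-wait.
extern void delay_us(unsigned us);                 // assumed helper

volatile uint32_t* const lapic = reinterpret_cast<uint32_t*>(0xFEE00000);

void send_ipi(uint8_t apic_id, uint32_t value) {
    lapic[0x310 / 4] = uint32_t(apic_id) << 24;    // ICR high: destination
    lapic[0x300 / 4] = value;                      // ICR low: write triggers the IPI
    while (lapic[0x300 / 4] & (1u << 12)) { }      // spin until delivery status clears
}

void start_ap(uint8_t apic_id) {
    send_ipi(apic_id, 0x00004500);                 // INIT IPI
    delay_us(10000);                               // ~10 ms, per the Intel MP spec
    for (int i = 0; i < 2; ++i) {                  // two STARTUP IPIs
        send_ipi(apic_id, 0x00004600 | (0x8000 >> 12));  // vector = trampoline page
        delay_us(200);
    }
}
```

Each AP then starts executing in real mode at the trampoline, which has to walk it through the same mode-switch setup the BSP already did.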
The reason there are no tutorials for something like this is, quite frankly, that it's not very useful. The entire point of setting things up before loading the kernel into memory is to make it easier to load the kernel into memory. Threading does not exactly contribute to this goal. It would be advisable to simply let the kernel deal with such low-level implementation requirements, so that you can then use the fork() and clone() system calls for all your threading needs.
EDIT: Good correction by Sinn: fork() creates a new process, which of course isn't actually threading.
I've been reading around about how to check how many CPUs or cores my machine has (MacBook OS X, Sierra, v. 10.12, 2GHz Intel Core i7), but there are many different answers, for example here:
How to discover number of *logical* cores on Mac OS X?
What I would need, though, is to make sure that my C++ program runs on just one CPU (and, if possible, on only one core, i.e., without scheduling; in other words, my program should have a dedicated core to run on. I'm not sure how the Mac OS X architecture is actually organised in this respect).
I am not sure if this should be done at implementation, compilation or execution level. I've seen people talking about taskset for Ubuntu, but I'm not sure if that's the right tool for me (maybe it does not even exist for Mac OS X).
Note: if you feel this question should be asked on another Stack Exchange site, just tell me and I will move it there. Actually, I would like my solution to be cross-platform, so maybe this is not the best place to ask.
Processes are scheduled, the idea of a non-scheduled process is an oxymoron.
That said, restricting yourself to one CPU is pretty much the default in C++. main starts on one thread, and unless you create additional threads that's all you get.
You mention that you want a "dedicated" core. There's the idea of pinning a thread to a core, which sort of achieves that, but you can figure out what happens if two programs pin themselves to the same core. Another core might be fully unused while the two programs share that pinned core. This is more of a feature for supercomputers, where cores do not have uniform access to memory, and you should match CPU core and memory allocations.
I am a newbie in this area and am writing C++/assembly code to benchmark (measure the execution time of) a section of code in clock cycles. I need to disable preemption and hard interrupts in my code. I know that Linux kernel development permits the use of preempt_disable() and raw_local_irq_save(flags) to do this.
My question is: since I am not writing a kernel module but a normal C/C++ program in user space, can I use these calls from my C++ code (i.e., from user space, with no kernel module)? If yes, which header files should I include? Can someone please give me reading references or examples?
Thanks!!
You can't do this from a userland application, especially disabling hardware interrupts, which provide the basis for many fundamental kernel functions like timekeeping.
What you can do instead is use sched_setscheduler(2) to set, say, SCHED_FIFO real-time priority, that is, ask the kernel not to preempt your app until it voluntarily releases the CPU (usually via a system call). Be careful though: you can easily lock up your system that way.
Usually that is impossible. The kernel will not let you block interrupts.
But assigning yourself a very high priority is usually good enough. Also, make sure the benchmarked code runs long enough, e.g. by running it 10000 times in a loop; that way, stray interrupts don't matter much in the overall cycle count. In my experience, a total run time of about 1 second is good enough for home-brewed benchmarking (provided your system is not under heavy stress).
I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing usage of CPU, and if so, how.
That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.
Assuming your application has the power (PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priortiy (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (you almost certainly shouldn't use this one at all).
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.
It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.
Creating a thread and giving it a higher priority might be one way.
If you use C++, consider using Intel Threading Building Blocks (TBB). You can find some examples here.
Some profilers give very nice indications of where the bottlenecks in your code are. For example, CodeAnalyst (for AMD chips only) shows the instructions-per-cycle ratio. I'm sure Intel's profilers are similar.
As Billy O'Neal says, though, if you're running on an 8-core machine, being stuck at 10 percent of CPU is about right. If this is your problem, then Windows MSVC++ has a parallel mode (the Parallel Patterns Library) for the standard algorithms. This can give you parallelisation for free if you have written your loops the C++ way (it's still your responsibility to make sure your loops are thread-safe). I've not used the MSVC version, but __gnu_parallel::for_each etc. work a treat.
What do I need to change in my code to use all the cores of a quad-core processor? Is it a matter of adding multi-threading support, or is that taken care of by the OS itself? I am on FreeBSD, and the language I am using is C++. I want to give my application at least 90% of the CPU cycles.
You need some form of parallelism. Multi-threading or multi-processing would be fine.
Usually, multiple threads are easier to handle (since they can access shared data) than multiple processes. However, usually, multiple threads are harder to handle (since they access shared data) than multiple processes.
And, yes, I wrote this deliberately.
If you have a SIMD scenario, Ninefingers' suggestion to look at OpenMP is also very good. (If you don't know what SIMD means, see Ninefingers' helpful comment below.)
For multi-threaded applications in C++ may I suggest Boost.Thread which should help you access the full potential of your quad-core machine.
As for changing your code, you might want to consider making things as immutable as possible. State transitions between threads are much more difficult to debug. There are a plethora of things that can happen in unexpected ways. See this SO thread.
Another option not mentioned here, threading aside, is the use of OpenMP available via the -fopenmp and the libgomp library, both of which I have installed on my FreeBSD 8 system.
These give you #pragma directives to parallelise certain loops, while statements, etc., i.e. the bits you can parallelise. It takes care of threading and CPU affinity for you. Note that it is a general solution and therefore might not be the optimum way to parallelise, but it will allow you to parallelise certain routines.
Take a look at this: https://computing.llnl.gov/tutorials/openMP/
As for using threads/processes directly, certain routines and ways of working lend themselves to it. Can you break your tasks up in such a way? Does it make sense to fork() your process or create a thread? If so, do so, but if not, don't try to force your application to be multi-threaded just because. An example I usually give is the greatest common divisor algorithm: in the traditional implementation, each step depends on the one before it, so it is difficult to make parallel.
Also note it is well known that for certain algorithms, parallelisation is actually slower for small values of whatever you are doing in parallel, because although the jobs complete more quickly, the associated time cost of forking and joining (be that threads or processes) actually pushes the time above that of a serial implementation.
I think your only option is to run several threads. If your application is single-threaded, then it will only run on one of the cores (at a time), but if you have more threads, they can run simultaneously.
You need to add support to your application for parallelism through the use of Threading.
Once you have support for parallelism, it's up to the OS to assign your threads to CPU cores.
The first thing I think you should look at is whether your application and its algorithms are suited to being executed in parallel (or possibly as a set of serial tasks that can be processed independently). If this is not the case, it will be difficult to multithread it or break it up into parallel processes, and you may need to look into modifying the way it works.
Once you have established that you will be able to benefit from parallel processing, you have the option of using either several processes or several threads. The choice depends a lot on the nature of your application and how independent the parallel parts can be. It is easier to coordinate and share data between threads, since they are in the same process, but that shared state also makes them quite a bit more challenging to develop and debug.
Boost.Thread is a good library if you decide to go down the multi-threaded route.
I want to give complete CPU cycles to my application at least 90%.
Why? Your chip's not hot enough?
Seriously, it takes world experts dozens if not hundreds of hours to parallelize and load-balance an application so that it uses 90% of all four cores. Your CPU is already paid for and it costs the same whether you use it or not. (Actually, it costs slightly less to run, electrically speaking, if you don't use it.) How much is your time worth? How many hours are you willing to invest in order to make more effective use of a resource that may have cost you $300 and is probably sitting idle most of the time anyway?
It's possible to get speedups through parallelism, but it's expensive in human time. You need a good reason to justify it. (Learning how is a good enough reason.)
All the good books I know on parallel programming are for languages other than C++, and for good reason. If you want interesting stuff on parallelism, check out Implicit Parallel Programming in pH or Concurrent Programming in ML or the Fortress Project.
I just stumbled upon Protothreads. They seem superior to native threads, since context switches are explicit.
My question is: does this make multi-threaded programming an easy task again?
(I think so. But have I missed something?)
They're not "superior" - they're just different and fit another purpose. Protothreads are simulated, and hence aren't real threads. They won't run on multiple cores, and they will all block together on a single blocking system call (socket recv() and the like). Hence you shouldn't see them as a "silver bullet" that solves all multithreading problems. Such threads have existed for Java, Ruby and Python for quite some time now.
On the other hand, they're very lightweight so they do make some tasks quicker and simpler. They're suitable for small embedded systems because of low code and memory footprint. If you design your whole system (including an "OS", as is customary on small embedded devices) from the ground up, protothreads can provide a simple way to achieve concurrency.
Also read up on green threads.