Higher core load with Intel TBB - C++

I am using Intel TBB parallel_for to speed up a for loop doing some calculations:
tbb::parallel_for(tbb::blocked_range<int>(0,ListSize,1000),Calc);
Calc is an object of the class DoCalc:
class DoCalc
{
    vector<string> FileList;
public:
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Do some calculations
        }
    }
    DoCalc(vector<string> ilist) : FileList(ilist) {}
};
It takes approx. 60 seconds when I use the standard serial form of the for loop and approx. 20 seconds when I use the parallel_for from TBB to get the job done. When using the standard for, the load on each core of my i5 CPU is at approx. 15% (according to the Windows Task Manager) and very inhomogeneous; when using parallel_for it is at approx. 50% and very homogeneous.
I wonder if it's possible to get an even higher core load when using parallel_for. Are there any other parameters besides grain_size? How can I boost the speed of parallel_for without changing the operations within the for loop (shown as //Do some calculations in the code sample above)?

The grainsize parameter is optional. When grainsize is not specified, a partitioner should be supplied to the algorithm template. A partitioner is an object that guides the chunking of a range. The auto_partitioner provides an alternative that heuristically chooses the grain size so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing.
See the TBB website for more information: www.threadingbuildingblocks.org

As @Eugene Roader already suggested, you might want to use the auto_partitioner (which is the default from TBB version 2.2) for automatic chunking of the range:
tbb::parallel_for(tbb::blocked_range<int>(0,ListSize),Calc,tbb::auto_partitioner());
I assume that your i5 CPU has 4 cores, so you get a speedup of 3 (60s => 20s), which is already "quite nice" as there is always some overhead in the parallelization. One problem could be that the maximum memory bandwidth of your CPU is already saturated by 3 threads - or you might have a lot of allocations/deallocations within your code, which must be synchronized between the threads by the standard memory manager. One trick to tackle this problem without many code changes in the inner loop is to use a thread-friendly allocator, e.g. for FileList:
vector<string, tbb::scalable_allocator<string>> FileList;
Note that you should try the tbb::scalable_allocator for all the other containers used in the loop too, in order to bring your parallelization speedup closer to the number of cores, 4.
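To make that concrete, here is a minimal, untested sketch of how the allocator change and the auto_partitioner might fit into the class from the question (the scratch container inside operator() is purely illustrative):
#include <string>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/partitioner.h"
#include "tbb/scalable_allocator.h"

class DoCalc
{
    std::vector<std::string, tbb::scalable_allocator<std::string>> FileList;
public:
    void operator()(const tbb::blocked_range<int>& range) const {
        // Temporaries allocated inside the loop body should also use the
        // scalable allocator, so the threads do not contend on the global heap.
        std::vector<double, tbb::scalable_allocator<double>> scratch;
        for (int i = range.begin(); i != range.end(); ++i) {
            // Do some calculations using scratch...
        }
    }
    DoCalc(const std::vector<std::string>& ilist)
        : FileList(ilist.begin(), ilist.end()) {}
};

// Called with the ListSize and file list from the question:
// tbb::parallel_for(tbb::blocked_range<int>(0, ListSize), DoCalc(ilist),
//                   tbb::auto_partitioner());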

The answer to your question also depends on the ratio between memory accesses and computation in your algorithm. If you do very few operations on a lot of data, your problem is memory bound and that will limit the core load. If on the other hand you compute a lot with little data, your chances of improving are better.
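For a rough sense of scale: a loop that merely sums a large array of doubles performs one addition per 8 bytes loaded, so at, say, 20 GB/s of memory bandwidth it cannot exceed about 2.5 GFLOP/s no matter how many cores run it, whereas a loop doing hundreds of operations per loaded element can scale almost linearly with the core count.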

Related

How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);

    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, 32 kB L1 cache, 1 MB L2 cache and 33 MB L3 cache. The total memory bandwidth is 115 GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, the same processor). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the BW and it turns out to be 105 GB/s.
Question: the meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clarify this?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
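As a rough roofline-style check (assuming 16-byte complex<double> elements): each element update in the kernel above does 2 flops (one real and one imaginary addition) and moves about 48 bytes (read both inputs, write one output), i.e. roughly 0.04 flop/byte. At the quoted 115 GB/s (or the measured 105 GB/s) that caps the kernel at about 4-5 GFLOP/s once the arrays no longer fit in cache, no matter how many of the 24 cores are used. Comparing your measured GFLOP/s against that ceiling is one way to tell whether the code is already using all the resources available to it.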

First method call takes 10 times longer than consecutive calls with the same data

I am performing some execution time benchmarks for my implementation of quicksort. Out of 100 successive measurements on exactly the same input data it seems like the first call to quicksort takes roughly 10 times longer than all consecutive calls. Is this a consequence of the operating system getting ready to execute the program, or is there some other explanation? Moreover, is it reasonable to discard the first measurement when computing an average runtime?
The bar chart below illustrates execution time (milliseconds) versus method call number. Each time the method is called it processes the exact same data.
To produce this particular graph the main method makes a call to quicksort_timer::time_fpi_quicksort(5, 100) whose implementation can be seen below.
static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);
    for (int i = 0; i < runs; i++)
    {
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer;
        quicksort(vector, ver::FixedPivotInsertion);
    }
}
The getRandomIntVectorWithConstantSeed is implemented as follows
std::vector<int> getRandomIntVectorWithConstantSeed(int size)
{
    std::vector<int> vector(size);
    srand(6475307);
    for (int i = 0; i < size; i++)
        vector[i] = rand();
    return vector;
}
CPU and Compilation
CPU: Broadwell 2.7 GHz Intel Core i5 (5257U)
Compiler Version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Compiler Options: -std=c++17 -O2 -march=native
Yes, it could be a page fault on the page holding the code for the sort function (and the timing code itself). The 10x could also include ramp-up to max turbo clock speed.
Caching is not plausible, though: you're writing the (tiny) array outside the timed region, unless the compiler somehow reordered the init with the constructor of your Timer. Memory allocation being much slower the first time would easily explain it, maybe having to make a system call to get a new page the first time, but later calls to new (to construct std::vector) just grabbing already-hot-in-cache memory from the free list.
Training the branch predictors could also be a big factor, but you'd expect it to take more than 1 run before the TAGE branch predictors in a modern Intel CPU, or the perceptron predictors in a modern AMD, have "learned" the full pattern of all the branching. But maybe they get close after the first run.
Note that you produce the same random array every time, by using srand() on every call. To test if branch prediction is the explanation, remove the srand so you get different arrays every time, and see if the time stays much higher.
What CPU, compiler version / options, etc. are you using?
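If the goal is to measure steady-state performance rather than explain the first call, one option (a sketch based on the code in the question, not a definitive fix) is to add an untimed warm-up call before the measured runs, so the one-time costs (first-touch page faults, turbo ramp-up, predictor training) are paid outside the timed region:
static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);

    // Untimed warm-up run: pays the one-time costs before measurement starts.
    vector = utilities::getRandomIntVectorWithConstantSeed(size);
    quicksort(vector, ver::FixedPivotInsertion);

    for (int i = 0; i < runs; i++)
    {
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer;
        quicksort(vector, ver::FixedPivotInsertion);
    }
}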
It is probably because of caching: the first time, the data needs to be fetched from DRAM and brought into the CPU's data cache, which takes much more latency than loads that hit in the cache.
On the following calls the instructions in the pipeline take the same branches, and since the data comes from the same memory (the same pointer), nothing needs to be invalidated.
It would be interesting if you implemented 4 methods with more or less the same functionality and then swapped between them to see what happens.

How to set number of threads in C++

I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program, grainSize is a parameter. Since grainSize, or the number of threads that can be spawned, is a system-dependent feature, I am not sure what the optimal value for grainSize should be. I work on Linux.
int compare(const char*, const char*)
{
    // some complex user defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if (len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);

        // No need to spawn another thread just to block the calling thread,
        // which would do nothing.
        multThreadedSort(data + len/2, len/2, grainsize);

        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting. So, that for sorting large vectors I can take advantage of multiple cores present in today's processor. If anyone is aware of an efficient algorithm then please do share.
I don't know why the result returned by multThreadedSort() is not sorted. If you see some logical error in it, please let me know.
This gives you the optimal number of threads (typically the number of logical cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not grainSize: it depends on the list size, and can potentially be much larger than intended.
Just replace grainSize with:
unsigned int grainSize = std::max<std::size_t>(items.size() / nThreads, 40);
The 40 is arbitrary, but it is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting the thread would be larger than the time spent sorting the few items). It may be tuned by trial and error, and the best value is potentially larger than 40.
You have at least one bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
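Putting the two fixes together with the hardware_concurrency-based grain size, the whole thing might look like the sketch below. The comparator is only a stand-in for the complex user-defined logic from the question, and std::launch::async is added so the helper half really runs concurrently rather than possibly being deferred:
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

bool compare(unsigned a, unsigned b)
{
    return a < b; // stand-in for the complex user-defined logic
}

void multThreadedSort(std::vector<unsigned>::iterator data, int len, int grainsize)
{
    if (len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        // Sort one half asynchronously, the other half on the calling thread.
        auto future = std::async(std::launch::async, multThreadedSort,
                                 data, len / 2, grainsize);
        multThreadedSort(data + len / 2, len - len / 2, grainsize); // handles odd len
        future.wait();
        std::inplace_merge(data, data + len / 2, data + len, compare);
    }
}

int main()
{
    std::vector<unsigned> items = {5, 3, 9, 1, 7}; // example data
    unsigned int nThreads = std::max(1u, std::thread::hardware_concurrency());
    int grainSize = static_cast<int>(std::max<std::size_t>(items.size() / nThreads, 40));
    multThreadedSort(items.begin(), static_cast<int>(items.size()), grainSize);
    return 0;
}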
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::async should already do the job for you, without you having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
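If you do rely on it anyway, one defensive pattern (just a sketch) is to treat zero as "unknown" and fall back to a fixed guess:
unsigned int nThreads = std::thread::hardware_concurrency();
if (nThreads == 0)
    nThreads = 4; // arbitrary fallback when the implementation reports "unknown"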
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you cannot assume more than n*2 threads running in your system, where n is the number of CPU cores.
Modern OSes use the concept of hyper-threading, so two threads can run on one CPU core at a time.
As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency();

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4, i.e. it's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores - user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the implementation of glibc rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops, turns out I didn't test this correctly, it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for computePixel:
class Camera {
    // constructors, destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;
    // simple code to construct the appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses the scene.
    _sensor[i*_scrWidth + j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
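Since C++11, a simpler alternative to the hand-rolled state array (not the code used for the results above, just a sketch) is a thread_local engine from <random>; every thread lazily gets its own generator, with no shared state and no locking:
#include <cstdlib>   // RAND_MAX
#include <random>

int threadrand() {
    // One engine and one distribution per thread, seeded once per thread.
    thread_local std::mt19937 engine(std::random_device{}());
    thread_local std::uniform_int_distribution<int> dist(0, RAND_MAX);
    return dist(engine);
}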
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler attempts to keep it on the same core. If more than one of your threads is running on a single core, they will have to share that core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end data that other threads need in order to do computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
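One way to try the row-blocking idea (a sketch, and whether it helps depends on how wide a row is relative to a cache line) is to let OpenMP hand each thread one contiguous block of rows via a static schedule:
void Raytracer::render( Camera& cam ) {
    cam.setSamplingFunc(getSamplingFunction());

    // schedule(static) with no chunk size gives each thread one contiguous
    // block of rows, so threads only share cache lines at block boundaries.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < cam.height(); ++i) {
        for (int j = 0; j < cam.width(); ++j) {
            cam.computePixel(i, j);
        }
    }
}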
I just looked, and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find that in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

Simple operation to waste time?

I'm looking for a simple operation / routine which can "waste" time if repeated continuously.
I'm researching how gprof profiles applications, so this "time waster" needs to waste time in user space and should not require external libraries. For example, calling sleep(20) will "waste" 20 seconds of time, but gprof will not record this time because it occurs within another library.
Any recommendations for simple tasks which can be repeated to waste time?
Another variant on Tomalak's solution is to set up an alarm; then your busy-wait loop doesn't need to keep issuing a system call, but can instead just check whether the signal has been sent.
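A sketch of that variant (POSIX-specific; the handler just sets a flag that the loop polls in user space):
#include <csignal>
#include <unistd.h>

volatile sig_atomic_t expired = 0;

void on_alarm(int) { expired = 1; }

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(20);              // one system call up front
    while (!expired) {}     // pure user-space spinning afterwards
    return 0;
}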
The simplest way to "waste" time without yielding CPU is a tight loop.
If you don't need to restrict the duration of your waste (say, you control it by simply terminating the process when done), then go C style*:
for (;;) {}
(Be aware, though, that the standard allows the implementation to assume that programs will eventually terminate, so technically speaking this loop — at least in C++0x — has Undefined Behaviour and could be optimised out!**)
Otherwise, you could time it manually:
time_t s = time(0);
while (time(0) - s < 20) {}
Or, instead of repeatedly issuing the time syscall (which will lead to some time spent in the kernel), if on a GNU-compatible system you could make use of signal.h "alarms" to end the loop:
alarm(20);
while (true) {}
There's even a very similar example on the documentation page for "Handler Returns".
(Of course, these approaches will all send you to 100% CPU for the intervening time and make fluffy unicorns fall out of your ears.)
* {} rather than trailing ; used deliberately, for clarity. Ultimately, there's no excuse for writing a semicolon in a context like this; it's a terrible habit to get into, and becomes a maintenance pitfall when you use it in "real" code.
** See [n3290: 1.10/2] and [n3290: 1.10/24].
A simple loop would do.
If you're researching how gprof works, I assume you've read the paper, slowly and carefully.
I also assume you're familiar with these issues.
Here's a busy loop which runs at one cycle per iteration on modern hardware, at least as compiled by clang or gcc or probably any reasonable compiler with at least some optimization flag:
#include <cstdint>

void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}
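For example, picking the iteration count for a target wall-clock time (assuming the roughly one-cycle-per-iteration estimate holds on your machine) might look like this:
int main() {
    const double ghz = 3.0;      // approximate clock speed of your CPU
    const double seconds = 2.0;  // how long to waste
    busy_loop(static_cast<uint64_t>(seconds * ghz * 1e9)); // ~1 iteration per cycle
    return 0;
}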
The idea is just to store to the volatile sink every iteration. This prevents the loop from being optimized away, and makes each iteration have a predictable amount of work (at least one store). Modern hardware can do one store per cycle, and the loop overhead generally can complete in parallel in that same cycle, so usually achieves one cycle per iteration. So you can ballpark the wall-clock time in nanoseconds a given number of iters will take by dividing by your CPU speed in GHz. For example, a 3 GHz CPU will take about 2 seconds (2 billion nanos) to busy_loop when iters == 6,000,000,000.