Part of my program reads a file containing two matrices and multiplies them.
But I'm confused: why does the version with concurrency take longer than the one without?
Are there any bugs in my code?
// without concurrency
// (elsewhere: use std::sync::{Arc, RwLock}; use std::thread; use std::time::Instant;)
let mut result = vec![];
let time1 = Instant::now();
for i in 0..n {
    let mut temp_vector = vec![];
    for j in 0..n {
        let mut temp_num = 0;
        for multiple_count in 0..m {
            temp_num = temp_num + arr1[i][multiple_count] * arr2[multiple_count][j];
        }
        temp_vector.push(temp_num);
    }
    result.push(temp_vector);
}
let time2 = Instant::now();

println!("normal solving result:\n");
for i in 0..n {
    for j in 0..n {
        print!("{:?} ", result[i][j]);
    }
    println!();
}
let pass = time2.duration_since(time1);
println!("{:?}\n", pass);

println!("concurrency solving solution:\n");
// start the concurrency: one thread per cell of the result matrix
let mut handles = vec![];
let arr1 = Arc::new(RwLock::new(arr1));
let arr2 = Arc::new(RwLock::new(arr2));
let count_time1 = Instant::now();
for i in 0..n {
    for j in 0..n {
        let arr1 = arr1.clone();
        let arr2 = arr2.clone();
        let handle = thread::spawn(move || {
            let mut count = 0;
            let arr1 = arr1.try_read().unwrap();
            let arr2 = arr2.try_read().unwrap();
            for k in 0..m {
                count = count + arr1[i][k] * arr2[k][j];
            }
            count
        });
        handles.push(handle);
    }
}
let count_time2 = Instant::now();
let pass_time = count_time2.duration_since(count_time1);
Without looking into too much detail: you are spawning n² threads, one for each cell in the result matrix. Spawning threads is expensive (note that Rust doesn't use "green threads", but system threads, by default).
Concurrency doesn't simply speed up everything; one has to be a bit smart about it. Usually you just want to utilize all CPU cores, hence you should only spawn roughly as many threads as there are cores. In your case, spawning the thread probably takes much more time than what the thread does, thus the slowdown.
There are several reasons, but the few that jump out:
1. You're cloning two Arcs for each thread, and each clone involves an atomic increment, which is slow. The clones happen outside the threads, but they still delay the START of each thread.
2. You're taking a read-write lock (RwLock) in every thread. Even a read lock almost certainly involves at least one atomic write to an internal counter, plus an atomic read of the lock's state.
3. Spawning OS threads is not free or instantaneous; it has a startup cost.
4. Make sure you're using sufficiently sized data! You usually need at least n*n = 10000 elements to see anything noticeable from parallelism, and often several orders of magnitude more. This is part of why point 3 hurts so much.
Rust doesn't use lightweight coroutines; these are full-on OS threads. You'd be better off spawning only as many threads as you have logical cores (logical means physical plus hyperthreaded; your OS reports them all as cores) and evenly distributing the work across them.
You could probably get a pretty significant speed-up by ditching the RwLock (you don't need it, since your data is read-only); the Arc is only delaying the startup time and the time it takes a thread to join (since it needs to drop the Arc). However, by far your biggest speedup is going to come from spawning only 4-8 threads, depending on your processor. I'll leave it to you how best to split the work into chunks, but it's fairly straightforward.
Edit: In fact, you can probably get rid of the Arc too, since the threads are joined immediately, but depending on Rust's thread-lifetime rules you may need the crossbeam::scoped functionality from the crossbeam crate to actually make it work.
As an aside, once you move to concurrent writes to the same data structures, I highly encourage you to read up on the processor cache, specifically false sharing. While mutexes are likely to be the higher cost in Rust, if you can somehow eschew them (e.g. by splitting a slice with split_at_mut), you'll likely still thrash the cache by constantly invalidating lines around the chunk boundaries.
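To make the chunking advice above concrete, here is a minimal sketch (written in C++ rather than Rust, with illustrative sizes and integer elements; none of this is the asker's code): spawn roughly one thread per core and give each thread a contiguous block of result rows instead of one thread per cell.
#include <thread>
#include <vector>

int main() {
    const int n = 512, m = 512;
    // Illustrative inputs; the asker's matrices come from a file instead.
    std::vector<std::vector<int>> a(n, std::vector<int>(m, 1));
    std::vector<std::vector<int>> b(m, std::vector<int>(n, 1));
    std::vector<std::vector<int>> result(n, std::vector<int>(n, 0));

    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 4;   // fallback if the runtime can't report core count

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        int begin = t * n / nthreads;
        int end   = (t + 1) * n / nthreads;
        workers.emplace_back([&, begin, end] {
            // This thread owns rows [begin, end) of the result; no locking needed,
            // since the inputs are read-only and the output rows are disjoint.
            for (int i = begin; i < end; ++i)
                for (int j = 0; j < n; ++j)
                    for (int k = 0; k < m; ++k)
                        result[i][j] += a[i][k] * b[k][j];
        });
    }
    for (auto& w : workers) w.join();
}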
Related
I have a loop that repeats 8 times, and I want to run each iteration in a different thread so it runs more quickly. I looked it up online, but I can't decide on a way to do this. There are no shared resources inside the loop. Any ideas?
Sorry for the bad English.
The best way is to analyze your program for how it is to be used, and determine the best cost-versus-performance trade-off you can make. Threads, even in languages like Go, have a non-trivial overhead; in languages like Java it can be a significant overhead.
You need a grasp of what it costs to dispatch an operation onto a thread versus the time it takes to perform the operation, and of what execution models you can apply to do so. For example, if you try:
for (i = 0; i < NTHREAD; i++) {
    t[i] = create_thread(PerformAction, ...);
}
for (i = 0; i < NTHREAD; i++) {
    join_thread(t[i]);
}
you might think you have done wonderfully, however the (NTHREAD-1)'th thread doesn't start until you have paid the creation overhead for all the others. In contrast, if you create the threads in a tree-like structure, and your OS handles it well, you can get a significantly lower latency (a sketch follows below).
So, best practice: measure, write for the generic case, and configure for the specific one.
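As a minimal sketch of that tree-like creation (spawn_tree and PerformAction are hypothetical names, not from any code above): each thread spawns its children before doing its own work, so creation overhead is paid in parallel rather than one thread at a time.
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical per-thread work; stands in for whatever each worker actually does.
void PerformAction(int id) { std::printf("worker %d\n", id); }

// Spawn workers as an implicit binary tree: node `id` creates children 2*id+1 and 2*id+2.
void spawn_tree(int id, int nthreads) {
    std::vector<std::thread> children;
    for (int child : {2 * id + 1, 2 * id + 2})
        if (child < nthreads)
            children.emplace_back(spawn_tree, child, nthreads);
    PerformAction(id);                 // this node's own work overlaps the children's creation
    for (auto& c : children) c.join();
}

int main() {
    spawn_tree(0, 8);                  // 8 workers created in roughly log2(8) serial spawn steps
}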
So I was doing a bit of thinking the other day about concurrency, and I was wondering whether it is faster to protect an array with an individual mutex for each element, or faster for the entire array to use one mutex protecting all of its data. Logically, I figured a program would execute faster with individual mutexes, so that each thread only needs to "check out" the element it needs, which sounds like it would allow better concurrency; if only one thread at a time could proceed while the others wait on a single mutex, surely there would be a lot of waiting going on. To test this theory, I created a set of tests like this (the full code was linked; a sketch follows). In both functions, all that is done is that a mutex is locked and a random value is written to a random location in the array; the only difference is that each element has its own mutex in the first function, while all elements share one mutex in the second. I left the number of runs at a constant 25 to get a good average at the end of each test.
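A minimal sketch matching that description (NUM_ELEMENTS, the RNG choice, and the function names are assumptions, not the linked code):
#include <mutex>
#include <random>
#include <vector>

constexpr int NUM_ELEMENTS = 100;                  // matches one of the configurations below

std::vector<int> data(NUM_ELEMENTS);
std::vector<std::mutex> elem_locks(NUM_ELEMENTS);  // variant 1: one mutex per element
std::mutex whole_array_lock;                       // variant 2: one mutex for the whole array

void write_per_element_lock(int writes, std::mt19937& rng) {
    std::uniform_int_distribution<int> idx(0, NUM_ELEMENTS - 1);
    for (int i = 0; i < writes; ++i) {
        int j = idx(rng);
        std::lock_guard<std::mutex> lk(elem_locks[j]);    // lock only element j
        data[j] = idx(rng);
    }
}

void write_single_lock(int writes, std::mt19937& rng) {
    std::uniform_int_distribution<int> idx(0, NUM_ELEMENTS - 1);
    for (int i = 0; i < writes; ++i) {
        int j = idx(rng);
        std::lock_guard<std::mutex> lk(whole_array_lock); // one lock guards everything
        data[j] = idx(rng);
    }
}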
I ran it with the following combinations (the per-run result files were linked rather than pasted):
NUM_ELEMENTS = 10, NUM_THREADS = 5
NUM_ELEMENTS = 100, NUM_THREADS = 5
NUM_ELEMENTS = 10, NUM_THREADS = 10
NUM_ELEMENTS = 100, NUM_THREADS = 10
NUM_ELEMENTS = 10, NUM_THREADS = 15
NUM_ELEMENTS = 100, NUM_THREADS = 15
For the set of runs using 10 elements in the array, and for the set using 100 elements, I graphed the averages for the two different methods (graphs linked, not reproduced here).
For the record, this was all done with MinGW, as I don't have a working Linux box (because reasons), with no flags besides -std=c++11. As you can see, my original theory was entirely incorrect: apparently, if an entire array of values shares one mutex for writing, it is quite a bit faster than each value having its own lock. This seems entirely counterintuitive. So my question to you clever people out there: what is going on in the system, or elsewhere, that causes this conundrum? Please correct my thinking!
EDIT: just noticed the graphs didn't import correctly, so y'all had no idea what was going on. Fixed.
EDIT: At the suggestion of @nanda, I implemented two more similar tests, this time using a thread pool of 4 threads that processes the same number of random assignments to the test vector as the other tests make (the updated tests and the output file were linked). On a whim, I also decreased the number of threads the original two tests use to 4 (the number of cores on my CPU), and as the output suggests, the two methods are now very similar in average elapsed time. This supports @nanda's reasoning: a large number of runnable threads (or simply more threads than you have cores) forces the system to queue up threads, which causes a large amount of delay. Also on a whim, I added a "control" group, so to speak, which is just a plain asynchronous (non-threaded) loop that makes the same number of random accesses to the array; it was considerably faster than the concurrent methods. And as you may notice, the thread-pool method, performing the same number of accesses as the original two methods, completes much more quickly than they do.
So here are my two new questions: why are the concurrent methods so incredibly slow compared to the asynchronous (non-threaded) method? And why is the thread-pool form of concurrency quite a bit faster than my original method?
The following algorithm is run iteratively in my program. Running it without the two lines indicated below takes 1.5X as long as not running the algorithm at all, which is surprising enough as it is. Worse, however, running it with those two lines increases completion time to 4.4X that of running without them (6.6X that of not running the whole algorithm). Additionally, it causes my program to stop scaling beyond ~8 cores. In fact, when run on a single core, the two lines only increase the time by 1.7X, which is still far too high considering what they do. I've ruled out an effect of the modified data elsewhere in my program.
So I'm wondering what could be causing this. Something to do with the cache, maybe?
void NetClass::Age_Increment(vector<synapse>& synapses, int k)
{
    int size = synapses.size();
    int target = -1;
    if (k > -1)
    {
        for (int q = 0, x = 0; q < size; q++)
        {
            if (synapses[q].active)
                synapses[q].age++;
            else
            {
                if (x == k) target = q;   // remember the k-th inactive synapse
                x++;
            }
        }
        /////////////////////////////////////Causing Bottleneck/////////////
        synapses[target].active = true;
        synapses[target].weight = .04 + (float(rand_r(seedp) % 17) / 100);
        ////////////////////////////////////////////////////////////////////
    }
    else
    {
        for (int q = 0; q < size; q++)
            if (synapses[q].active)
                synapses[q].age++;
    }
}
Update: Changing the two problem lines to:
bool x = true;
float y = .04 + (float (rand_r(seedp) % 17) / 100);
removes the problem, suggesting maybe that it has something to do with memory access?
Each thread modifies memory that all the other threads read:
for(int q=0, x=0 ; q < size; q++)
    if(synapses[q].active) ...        // ALL threads read EVERY synapse.active
        ...
synapses[target].active = true;       // EVERY thread writes at least one synapse.active
These kinds of reads and writes on the same addresses from different threads cause a great deal of cache invalidation, which results in exactly the symptoms you describe. The solution is to avoid the write inside the loop, and the fact that moving the write into local variables helps is, again, proof that the problem is cache invalidation. Note that even if you didn't write the same field being read (active), you would likely see the same symptoms due to false sharing, as I suspect that active, age and weight share a cache line.
For more details, see the talk "CPU Caches and Why You Care".
A final note: the assignments to active and weight, not to mention the age++ increment, all seem extremely thread-unsafe. Interlocked operations or lock/mutex protection for such updates would be mandatory.
Try re-introducing these two lines, but without rand_r, just to see if you get the same performance deterioration. If you don't, this is probably a sign that the rand_r is internally serialized (e.g. through a mutex), so you'd need to find a way to generate random numbers more concurrently.
The other potential area of concern is false sharing (if you have time, take a look at Herb Sutter's video and slides treating this subject, among others). Essentially, if your threads happen to modify different memory locations that are close enough to fall into the same cache line, the cache coherency hardware may effectively serialize the memory access and destroy the scalability. What makes this hard to diagnose is the fact that these memory locations may be logically independent and it may not be intuitively obvious they ended up close together at run-time. Try adding some padding to split such memory locations apart if you suspect false sharing.
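If false sharing turns out to be the culprit, a minimal sketch of the padding idea looks like this (the field names mirror the question's synapse struct, but the alignment trick and the 64-byte line size are the illustrative part; over-aligned vector storage needs C++17):
#include <vector>

// alignas(64) pads each element out to a full cache line on typical x86 hardware,
// so two threads touching neighbouring elements no longer ping-pong the same line.
struct alignas(64) padded_synapse {
    bool  active;
    int   age;
    float weight;
};

int main() {
    // Requires C++17 so the vector's allocation respects the over-alignment.
    std::vector<padded_synapse> synapses(1024);   // each element now occupies its own line
    (void)synapses;
}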
If size is relatively small, it doesn't surprise me at all that a call to a PRNG, an integer division, and a float division and addition would increase program execution time that much. You're doing a fair amount of work, so it seems logical that it would increase the runtime. Additionally, since you told the compiler to do the math in float rather than double, that could increase the time even further on some systems (where native floating point is double). Have you considered a fixed-point representation with ints?
I can't say why it would scale worse with more cores, unless you exceed the number of cores your program has been given by the OS (or your system's rand_r is implemented using locking or thread-specific data to maintain additional state).
Also note that you never check whether target is valid before using it as an array index; if it ever makes it out of the for loop still set to -1, all bets are off for your program.
I am still a beginner at multi-threading, so please bear with me:
I am currently writing an application that does some FVM calculations on a grid. It's a time-explicit model, so at every timestep I need to calculate new values for the whole grid. My idea was to distribute this calculation to 4 worker threads, which then deal with the cells of the grid (the first thread calculating cells 0, 4, 8, ..., the second thread 1, 5, 9, ..., and so forth).
I create those 4 threads at program start.
They look something like this:
void __fastcall TCalculationThread::Execute()
{
    bool alive = true;
    THREAD_SIGNAL ts;
    while (alive)
    {
        Sleep(1);
        if (TryEnterCriticalSection(&TMS))
        {
            ts = thread_signal;              // take a local copy of the shared signal
            LeaveCriticalSection(&TMS);
            alive = !ts.kill;
            if (ts.go && !ts.done.at(this->index))
            {
                double delta_t = ts.dt;
                for (unsigned int i = this->index; i < cells.size(); i += this->steps)
                {
                    calculate_one_cell();
                }
                EnterCriticalSection(&TMS);
                thread_signal.done.at(this->index) = true;   // report this worker as finished
                LeaveCriticalSection(&TMS);
            }
        }
    }
}
They use a global struct to communicate with the main thread (the main thread sets ts.go to true when the workers need to start).
Now I am sure this is not the way to do it! Not only does it feel wrong, it also doesn't perform very well...
I read, for example, here that a semaphore or an event would work better; the answer to that question talks about a lockless queue.
I am not very familiar with these concepts and would like some pointers on how to continue.
Could you outline any of the ways to do this better?
Thank you for your time. (And sorry for the formatting.)
I am using Borland C++ Builder and its thread object (TThread).
A definitely more effective algorithm would be to calculate the values for cells 0, 1, 2, 3 on one thread, 4, 5, 6, 7 on another, etc. Interleaving memory accesses like that is very bad, even if the variables are completely independent: you'll get false-sharing problems. This is the equivalent of the CPU locking every write.
Calling Sleep(1) in a calculation thread can't be a good solution to any problem. You want your threads to be doing useful work rather than blocking for no good reason.
I think your basic problem can be expressed as a serial algorithm of this basic form:
for (int i=0; i<N; i++)
    cells[i]->Calculate();
You are in the happy position that calls to Calculate() are independent of each other; what you have here is a parallel for. This means you can implement it without a mutex.
There are a variety of ways to achieve this. OpenMP would be one; a thread-pool class another. If you are going to roll your own thread-based solution, then use InterlockedIncrement() on a shared variable to iterate through the array.
You may hit some false-sharing problems, as @DeadMG suggests, but quite possibly not. If you do have false sharing, then yet another approach is to stride across larger sub-arrays; essentially, the increment (i.e. stride) added to the shared variable would be greater than one (InterlockedExchangeAdd() rather than InterlockedIncrement()).
The bottom line is that the way to make the code faster is to remove both the critical section (and hence the contention on it) and the Sleep(1).
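A minimal sketch of that shared-counter parallel for, using std::atomic as a portable stand-in for InterlockedIncrement() (Cell and Calculate() here are placeholders, not the asker's types):
#include <atomic>
#include <thread>
#include <vector>

struct Cell {
    double value = 0.0;
    void Calculate() { value += 1.0; }   // placeholder for the real per-cell FVM update
};

int main() {
    std::vector<Cell> cells(100000);
    std::atomic<size_t> next{0};          // shared counter handing out the next index

    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 4;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&] {
            // Each thread repeatedly claims the next unprocessed index; no mutex needed.
            for (size_t i = next.fetch_add(1); i < cells.size(); i = next.fetch_add(1))
                cells[i].Calculate();
        });
    }
    for (auto& w : workers) w.join();
}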
I have a small program that implements a Monte Carlo simulation of blackjack using various card-counting strategies. My main function basically does this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
for(int i = 0; i < simulations; ++i)
    runSimulation(bankroll, hands, tests, strategy);
The entire program, run in a single thread on my machine, takes about 10 seconds.
I wanted to take advantage of the 3 cores my processor has, so I decided to rewrite the program to simply execute the various strategies in separate threads, like this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
boost::thread threads[simulations];
for(int i = 0; i < simulations; ++i)
    threads[i] = boost::thread(boost::bind(runSimulation, bankroll, hands, tests, strategy));
for(int i = 0; i < simulations; ++i)
    threads[i].join();
However, when I ran this program, even though I got the same results, it took around 24 seconds to complete. Did I miss something here?
If the value of simulations is high, then you end up creating a lot of threads, and the overhead of doing so can end up destroying any possible performance gains.
EDIT: One approach to this might be to just start three threads and let them each run 1/3 of the desired simulations. Alternatively, using a thread pool of some kind could also help.
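A minimal sketch of that first approach (the runSimulation signature is simplified here and the counts are illustrative; the point is that the thread count matches the cores, not the number of simulations):
#include <thread>
#include <vector>

void runSimulation(int bankroll, int hands, int tests) { /* stand-in for the asker's work */ }

int main() {
    const int bankroll = 50000, hands = 100, tests = 10000;
    const int simulations = 999;
    const int nthreads = 3;                       // roughly one per core

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        int begin = t * simulations / nthreads;
        int end   = (t + 1) * simulations / nthreads;
        workers.emplace_back([=] {
            for (int i = begin; i < end; ++i)     // each thread owns a contiguous chunk
                runSimulation(bankroll, hands, tests);
        });
    }
    for (auto& w : workers) w.join();
}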
This is a good candidate for a work queue with a thread pool. I have used Intel Threading Building Blocks (TBB) for such requirements. Handcrafted thread pools work for quick hacks too. On Windows, the OS provides a nice thread-pool-backed work queue via QueueUserWorkItem().
Read these articles from Herb Sutter. You are probably a victim of "false sharing".
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=214100002
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=217500206
I agree with dlev. If your function runSimulation does not change anything that the next call to runSimulation needs in order to work properly, then you can do something like this:
1. Divide simulations by 3.
2. You now have 3 counter ranges: 0 to simulations/3, (simulations/3 + 1) to (2*simulations)/3, and ((2*simulations)/3 + 1) to simulations.
3. These 3 ranges can be processed in three different threads simultaneously.
NOTE: Your requirement might not be suitable for this kind of split at all if you have to lock shared data.
I'm late to this party, but wanted to note two things for others who come across this post:
1) Definitely see the second Herb Sutter link that David points out (http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206). It solved the problem that brought me to this question, outlining a struct data object wrapper that ensures separate parallel threads aren't competing for resources headquartered on the same memory cache-line (hardware controls will prevent multiple threads from accessing the same memory cache-line simultaneously).
2) Re the original question, dlev points out a large part of the problem, but since it's a simulation I bet there's a deeper issue slowing things down. While none of your program's high-level variables are shared you probably have one critical system variable that's shared: the system-level "last random number" that's stored under-the-hood and used to create the next random number. You might even be initializing dedicated generator objects for each simulation, but if they're making calls to a function like rand() then they, and by extension their threads, are making repeated calls to the same shared system resource and subsequently blocking one another.
Solutions to issue #2 would depend on the structure of the simulation program itself. For instance, if calls to a random generator are fragmented, then I'd probably batch them into one upfront call that retrieves and stores what the simulation will need. And this has me wondering now about more sophisticated approaches that would deal with the underlying shared-resource issue in random generation...
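As a minimal sketch of the per-thread generator idea (the engine choice, seeds and counts are illustrative, not from the original program): each thread owns its own std::mt19937, so no draw ever touches shared hidden RNG state the way rand() does.
#include <random>
#include <thread>
#include <vector>

void runOneSimulation(std::mt19937& rng) {
    std::uniform_int_distribution<int> card(1, 52);
    long long sum = 0;
    for (int i = 0; i < 1000; ++i)
        sum += card(rng);    // every draw uses only this thread's generator
    (void)sum;               // stand-in for real per-hand logic
}

int main() {
    const unsigned nthreads = 3;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([t] {
            std::mt19937 rng(12345u + t);   // thread-local engine with a distinct seed
            for (int s = 0; s < 100; ++s)
                runOneSimulation(rng);
        });
    }
    for (auto& w : workers) w.join();
}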