CPU speed and threads in C++

I have the following C++ program:
#include <boost/thread.hpp>
#include <boost/timer.hpp>

void testSpeed(int start, int end)
{
    int temp = 0;
    for (int i = start; i < end; i++)
    {
        temp++;
    }
}

int main()
{
    using namespace boost;
    timer aTimer;

    // start two new threads that call the "testSpeed" function
    boost::thread my_thread1(&testSpeed, 0, 500000000);
    boost::thread my_thread2(&testSpeed, 500000000, 1000000000);

    // wait for both threads to finish
    my_thread1.join();
    my_thread2.join();

    double elapsedSec = aTimer.elapsed();
    double IOPS = 1 / elapsedSec;
}
So the idea is to test the CPU speed in terms of integer operations per second (IOPS).
There are 1 billion iterations (operations), so on a 1 GHz CPU I would expect to see around a billion integer operations per second.
My assumption is that more threads = more integer operations per second. But the more threads I try, the fewer operations per second I see (I have more cores than threads).
What may be causing such behavior? Is it the thread overhead? Maybe I should try a much longer experiment to see if the threads actually help?
Thank you!
UPDATE:
So I changed the loop to run 18 billion times and declared temp as volatile. I also added another testSpeed method with a different name, so now a single-threaded run executes both methods one after the other, while a two-threaded run gets one method per thread; there shouldn't be any sync issues, etc. And... still no change in behavior! The single-threaded run is faster according to the timer. Ahhh! I found the sucker: apparently the timer is bluffing. The two threads take half the time to finish, but the timer tells me the single-threaded run was two seconds faster. I'm now trying to understand why... Thanks everyone!

I am almost certain that the compiler optimizes away your loops. Since you do not subtract the overhead of creating/synchronizing threads, you actually measure only that. So the more threads you have, the more overhead you create and the more time it takes.
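To illustrate that first point (a sketch, not the answerer's code): marking the accumulator volatile keeps the compiler from collapsing or removing the loop, and std::chrono::steady_clock measures wall-clock time. The thread creation/join overhead is still included in the measurement, but it is negligible next to a billion iterations.

#include <chrono>
#include <iostream>
#include <thread>

void testSpeed(long long start, long long end)
{
    volatile long long temp = 0;   // volatile: the loop cannot be optimized away
    for (long long i = start; i < end; ++i)
        temp = temp + 1;
}

int main()
{
    auto begin = std::chrono::steady_clock::now();

    std::thread t1(testSpeed, 0LL, 500000000LL);
    std::thread t2(testSpeed, 500000000LL, 1000000000LL);
    t1.join();
    t2.join();

    double elapsedSec = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - begin).count();
    std::cout << 1e9 / elapsedSec << " increments per second\n";
}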
Overall, you can refer to the documentation of your CPU to find out its frequency and how many cycles any given instruction takes. Testing it yourself using an approach like this is nearly impossible and is, well, useless. This is because of overhead like context switches, transferring execution from one CPU/core to another, scheduler swap-outs, and branch mispredictions. In real life you will also encounter cache misses and a lot of memory bus latency, since there are no programs that fit into ~15 registers. So you had better test a real program using a good profiler. For example, recent CPUs can report stall information, cache misses, branch mispredictions and a lot more. You can use a good profiler to decide when and how to parallelize your program as well.

As the number of threads increases beyond a certain point, it leads to an increase in the number of cache misses (the cache is being shared among the threads), but at the same time memory access latency is being masked by the large number of threads (while one thread is waiting for data to be fetched from memory, other threads are running). Hence there is a trade-off. Here is an interesting paper on this subject.
According to this paper, on a multi-core machine when the number of threads is very low (of the order of the number of cores), performance increases as you add threads, because the cores become fully utilized.
After that, a further increase in the number of threads leads to cache misses dominating, thus degrading performance.
If the number of threads becomes very large, such that the amount of cache storage per thread becomes almost zero, all memory accesses are made from main memory. But at the same time the increased number of threads also very effectively masks the increased memory access latency. This time the second effect dominates, leading to an increase in performance.
Thus the valley in the middle is the region with the worst performance.

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. Each thread doesn't have to interact with the others, and when a solve is found I add it to a common std::vector using a mutex this way:
if (CubeTest.IsSolved())
{   // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread safe code
    Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always running the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (if the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, things are as expected up to the sixth thread, but if I use an additional thread the performance is worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated the test several times and show the average times of all the runs.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is worse.
I also think that when you start to use the second thread on the same core, performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether, on the other hand, these numbers are telling me that I'm doing things wrong and should learn more about multithreading programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement one can hope for is a reduction of runtime by a factor of the number of cores¹. In most cases this is unachievable because of the need for threads to synchronise with one another.
In worst case, not only is there no improvement due to lack of parallelism, but also the overhead of synchronisation as well as cache contention can make the runtime much worse than the single threaded program.
Peak memory use often increases linearly with the number of threads, because each thread needs to operate on data of its own.
Total CPU time usage, and therefore energy use, also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
¹ Whether you get all of the performance out of "logical" cores, i.e. "hyper-threading" or "clustered multithreading", also depends on many factors. Often, one executes the same function in all threads, in which case they tend to use the same parts of the CPU, and sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But actually it doesn't. It just pretends to be able to do that. Internally it performs preemptive multitasking: Execute a bit of thread A, then switch to thread B, execute a bit of B, back to A and so on.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are faster than thread switches managed by the thread scheduler of the operating system. So the performance gains are mostly through avoiding overhead of thread switches. But it does not allow the CPU core to perform more operations than it did before.
Conclusion: The performance gain you can expect from concurrency depends on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive. So the less locking you can get away with the better. When you have multiple threads filling the same result set, it can sometimes be better to let each thread build its own result set and then merge those sets later when all threads are finished.
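For example, a minimal sketch of that idea (the names here are hypothetical stand-ins, not the asker's Solves/Alg/SearchMutex): each thread appends to its own vector, and the vectors are concatenated only after join().

#include <string>
#include <thread>
#include <vector>

int main()
{
    const unsigned numThreads = std::thread::hardware_concurrency();
    std::vector<std::vector<std::string>> perThreadSolves(numThreads);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < numThreads; ++t)
    {
        workers.emplace_back([t, &perThreadSolves]
        {
            // Each thread writes only to its own vector: no mutex needed here.
            perThreadSolves[t].push_back("solve found by thread " + std::to_string(t));
        });
    }
    for (auto& w : workers)
        w.join();

    // Merge once, after all threads have finished.
    std::vector<std::string> solves;
    for (auto& part : perThreadSolves)
        solves.insert(solves.end(), part.begin(), part.end());
}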

why does having more than one thread(parallel processing) in some specific cases degrade performance?

I noticed that having more than one thread running some code is much, much slower than having one thread, and I have really been pulling my hair out trying to figure out why. Can anyone help?
Code explanation:
I sometimes have a very large array that I need to process parts of in parallel for optimization. Each "part" of a row gets looped over and processed in a specific thread. Now I've noticed that having only one "part", i.e. the whole array with a single worker thread running through it, is noticeably faster than dividing the array and processing it as separate sub-arrays with different threads.
bool m_generate_row_worker(ull t_row_start, ull t_row_end)
{
    for (; t_row_start < t_row_end; t_row_start++)
    {
        m_current_row[t_row_start] = m_singularity_checker(m_previous_row[t_row_start],
                                                           m_shared_random_row[t_row_start]);
    }
    return true;
}
...
//code
...
for (unsigned short thread_indx = 0; thread_indx < noThreads - 1; thread_indx++)
{
    m_threads_array[thread_indx] = std::thread(
        m_generate_row_worker, this,
        thread_indx * (m_parts_per_thread), (thread_indx + 1) * (m_parts_per_thread));
}
m_threads_array[noThreads - 1] = std::thread(m_generate_row_worker, this,
    (noThreads - 1) * (m_parts_per_thread),
    std::max((noThreads) * (m_parts_per_thread), m_blocks_per_row));
//join
for (unsigned short thread_indx = 0; thread_indx < noThreads; thread_indx++)
{
    m_threads_array[thread_indx].join();
}

//EDIT
inline ull m_singularity_checker(ull t_to_be_ckecked_with, ull t_to_be_ckecked)
{
    return (t_to_be_ckecked & (t_to_be_ckecked_with << 1) & (t_to_be_ckecked_with >> 1))
           | (t_to_be_ckecked_with & t_to_be_ckecked);
}
why does having more than one thread(parallel processing) in some specific cases degrade performance?
Because thread creation has overhead. If the task to be performed has only small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads require synchronisation, which has overhead that can in some cases be more than the time saved by parallelism.
Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
Because (depending on CPU architecture) sometimes otherwise correctly implemented, seemingly independent tasks have an effective dependency because they operate on the same area of memory. More specifically, when a thread writes into a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing".
Because sometimes introduction of multi threading makes the program more complex, which makes it more difficult for the compiler / optimiser to make use of instruction level parallelism.
...
In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.
Regarding your program, we cannot rule out any of the above potential issues given the excerpt that you have shown.
Some tips on avoiding or finding above issues:
Don't create more threads than you have cores, discounting the number of threads that are expected to be blocking (waiting for input, disk, etc).
Only use multi-threading with problems that are computationally expensive, (or to do work while a thread is blocking, but this may be more efficiently solved using asynchronous I/O and coroutines).
Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
Keep the memory touched by distinct threads far from each other (not adjacent small elements of an array). If processing an array, divide it into consecutive blocks, rather than striping it one element every (number of threads)th element. In some extreme cases, extra copying into thread-specific data structures and then joining at the end may be efficient (see the sketch below).
If you've done all you can, and multi threading measures slower, consider whether perhaps it is not a good solution for your problem.
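As an illustration of the false-sharing point above (a hedged sketch, not part of the original answer): per-thread counters packed next to each other share a cache line, while padding each counter to its own cache line with alignas removes the contention. The 64-byte line size and the counter layout are assumptions.

#include <thread>

// Without padding, adjacent counters share a cache line and writes from
// different threads force the line to bounce between cores (false sharing).
// Padding each counter to its own (assumed 64-byte) cache line avoids that.
struct alignas(64) PaddedCounter
{
    long long value = 0;
};

int main()
{
    constexpr int numThreads = 4;
    PaddedCounter counters[numThreads];
    std::thread workers[numThreads];

    for (int t = 0; t < numThreads; ++t)
    {
        workers[t] = std::thread([t, &counters]
        {
            for (long long i = 0; i < 100000000; ++i)
                counters[t].value++;   // each thread touches only its own cache line
        });
    }
    for (auto& w : workers)
        w.join();
}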
Using threads does not always mean that you will get more work done. For example, using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up the threads and, depending on how many cores you have, the OS, etc., there is context switching occurring between threads (saving the thread stack/registers and loading the next one - it all adds up). At some point adding more threads will start to slow your program down, since more time is spent switching between threads and setting them up/tearing them down than doing actual work. So you may be a victim of this.
If you have 100 very small items (like 1 instruction) of work to do, then 100 threads will be guaranteed to be slower, since you now have ("many instructions" + 1) x 100 instructions of work to do, where the "many instructions" are the work of setting up the threads, clearing them up at the end, and switching between them.
So you may want to start profiling this for yourself: how much work is done processing each row, and how many threads in total are you setting up?
One very crude but quick/simple way to start measuring is to just take the time elapsed to process one row in isolation (e.g. use std::chrono functions to record the time at the start of processing one row and again at the end to see the total time spent). Then maybe do the same test over the entire table to get an idea of the total time, as in the sketch below.
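A hedged sketch of that crude measurement (processRow and rowCount are hypothetical stand-ins for the asker's per-row work):

#include <chrono>
#include <iostream>

void processRow(int row) { /* stand-in for the real per-row work */ }

int main()
{
    using clock = std::chrono::steady_clock;

    // Time a single row in isolation.
    auto t0 = clock::now();
    processRow(0);
    auto t1 = clock::now();
    std::cout << "one row: "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";

    // Then the same measurement over the whole table.
    const int rowCount = 1000;   // hypothetical
    auto t2 = clock::now();
    for (int row = 0; row < rowCount; ++row)
        processRow(row);
    auto t3 = clock::now();
    std::cout << "all rows: "
              << std::chrono::duration<double, std::milli>(t3 - t2).count()
              << " ms\n";
}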
If you find that an individual row is taking very little time then you may not be getting much benefit from the threads... You may be better off splitting the table into chunks of work equal in number to the cores your CPU has, then changing the number of threads (+/-) to find the sweet spot. Just creating threads based on the number of rows is a poor choice - you really want to design it to max out each core (for example).
So if you had 4 cores, maybe start by splitting the work into 4 threads. Then test it with 8; if it's better try 16, if it's worse try 12... etc. A minimal chunking sketch follows below.
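A rough sketch of splitting the work into per-core chunks rather than per-row threads (the names and row count are hypothetical; the real work would go in processRange):

#include <algorithm>
#include <thread>
#include <vector>

void processRange(std::size_t begin, std::size_t end)
{
    // stand-in for processing rows [begin, end)
}

int main()
{
    const std::size_t totalRows = 100000;                      // hypothetical
    const unsigned numThreads =
        std::max(1u, std::thread::hardware_concurrency());     // one chunk per core
    const std::size_t chunk = (totalRows + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end   = std::min(begin + chunk, totalRows);
        if (begin < end)
            workers.emplace_back(processRange, begin, end);
    }
    for (auto& w : workers)
        w.join();
}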
Also you might get different results on different PCs...

Determining Optimum Thread Count

So, as part of a school assignment, we are being asked to determine what our optimum thread count is for our personal computers by constructing a toy program.
To start, we are to create a task that takes between 20 and 30 seconds to run. I chose to do a coin toss simulation, where the total number of heads and tails are accumulated and then displayed. On my machine, 300,000,000 tosses on one thread ended up at 25 seconds. After that, I went to 2 threads, then 4, then 8, 16, 32, and, just for fun, 100.
Here are the results:
Threads    Tosses per thread    Time (seconds)
----------------------------------------------
1          300,000,000          25
2          150,000,000          13
4           75,000,000          13
8           37,500,000          13
16          18,750,000          14
32           9,375,000          14
100          3,000,000          14
And here is the code I'm using:
#include <ctime>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

using namespace std;

void toss()
{
    int heads = 0, tails = 0;
    default_random_engine gen;
    uniform_int_distribution<int> dist(0, 1);
    int max = 3000000; // tosses per thread
    for (int x = 0; x < max; ++x) { (dist(gen)) ? ++heads : ++tails; }
    cout << heads << " " << tails << endl;
}

int main()
{
    vector<thread> thr;
    time_t st, fin;
    st = time(0);
    for (int i = 0; i < 100; ++i) { thr.push_back(thread(toss)); } // thread count
    for (auto& thread : thr) { thread.join(); }
    fin = time(0);
    cout << fin - st << " seconds\n";
    return 0;
}
Now for the main question:
Past a certain point, I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.
Is there something fundamentally wrong with my code that would yield these sorts of results, or is this behavior considered normal? I'm very new to multi-threading, so I have a feeling it's the former....
Thanks!
EDIT: I am running this on a macbook with a 2.16 GHz Core 2 Duo (T7400) processor
Your results seem very normal to me. While thread creation has a cost, it's not that much (especially compared to the per-second granularity of your tests). An extra 100 thread creations, destructions, and possible context switches aren't going to change your timing by more than a few milliseconds, I bet.
Running on my Intel i7-4790 @ 3.60 GHz I get these numbers:
threads - seconds
-----------------
1 - 6.021
2 - 3.205
4 - 1.825
8 - 1.062
16 - 1.128
32 - 1.138
100 - 1.213
1000 - 2.312
10000 - 23.319
It takes many, many more threads to get to the point at which the extra threads make a noticeable difference. Only when I get to 1,000 threads do I see that the thread-management has made a significant difference and at 10,000 it dwarfs the loop (the loop is only doing 30,000 tosses at that point).
As for your assignment, it should be fairly straightforward to see that the optimal number of threads for your system is the same as the number of threads that can execute at once. There isn't any processing power left to execute another thread until one is either done or yields, which doesn't help you finish faster. And with any fewer threads you aren't using all available resources. My CPU has 8 hardware threads and the chart reflects that.
Edit 2 - To further elaborate on the "lack of performance penalty" part due to popular demand:
...I would've expected there to be a considerable decline in computing speed as more threads were added, but the results don't seem to show that.
I made this giant chart in order to better illustrate the scaling.
To explain the results:
The blue bar illustrates the total time to do all the tosses. Although that time decreases all the way up to 256 threads, the gains from doubling the thread count get smaller and smaller. The CPU I ran this test on had 4 physical and 8 logical cores. Scaling is pretty good all the way to 4 cores and decent to 8 cores, and then it plummets. Pipeline saturation allows for minor gains all the way to 256, but it is simply not worth it.
The red bar illustrates the time per toss. It is nearly identical for 1 and 2 threads, as the CPU pipeline hasn't reached full saturation yet. It takes a minor hit at 4 threads; it still runs fine, but now the pipeline is saturated. At 8 threads it really shows that logical threads are not the same thing as physical ones, and that gets progressively worse pushing above 8 threads.
The green bar illustrates the overhead, or how much lower actual performance is relative to the expected doubling from doubling the threads. Pushing above the available logical cores causes the overhead to skyrocket. Note that this is mostly thread synchronization; the actual thread-scheduling overhead is probably constant after a given point. There is a minimal window of activity time a thread must receive, which explains why the thread switching doesn't get to the point of overwhelming the work throughput. In fact there is no severe performance drop all the way to 4k threads, which is expected, as modern systems have to be able to run, and often do run, over a thousand threads in parallel. And again, most of that drop is due to thread synchronization, not thread switching.
The black outline bar illustrates the time difference relative to the lowest time. At 8 threads we only lose ~14% of absolute performance from not having the pipeline oversaturated, which is a good thing, because in most cases it is not really worth stressing the entire system over so little. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can pull off, which gives a figure of how good logical cores are compared to physical cores: 100% extra logical cores give a 50% boost in performance, so in this use case a logical thread is ~50% as good as a physical thread, which also correlates with the ~47% boost we see going from 4 to 8 threads. Note that this is a very simple workload though; in more demanding cases it is closer to 20-25% for this particular CPU, and in some edge cases there is actually a performance hit.
Edit 1 - I foolishly forgot to isolate the computational workload from the thread synchronization workload.
Running the test with little to no work reveals that for high thread counts the thread management part takes the bulk of the time. Thus the thread switching penalty is indeed very small and possibly after a certain point a constant.
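A hedged sketch of that kind of isolation test (the thread count is just an example; the answer's own test went much higher): create and join threads with empty bodies so the measured time is almost entirely thread-management overhead.

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    // Time thread creation + join with an empty body to isolate the
    // thread-management cost from the computational workload.
    const int numThreads = 1000;   // example count

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(numThreads);
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back([] { /* no work */ });
    for (auto& t : threads)
        t.join();
    double sec = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::cout << numThreads << " empty threads took " << sec << " seconds\n";
}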
And it would make a lot of sense if you put yourself in the shoes of a thread scheduler maker. The scheduler can easily be protected from being choked by an unreasonably high switching-to-working ratio, so there is likely a minimal window of time the scheduler will give to a thread before switching to another, while the rest are put on hold. This ensures that the switching-to-working ratio will never exceed the limits of what is reasonable. It would be much better to stall other threads than go crazy with thread switching, as the CPU would mostly be switching and doing very little actual work.
The optimal thread count is the number of available logical CPU cores. This achieves optimal pipeline saturation.
If you use more you will suffer performance degradation due to the cost of thread context switching. The more threads, the more penalty.
If you use less, you will not be utilizing the full hardware potential.
There is also the problem of workload granularity, which is very important when you use synchronization such as a mutex. If your concurrency is too finely grained you can experience performance drops even when going from 1 to 2 threads on an 8-thread machine. You want to reduce synchronization as much as possible, doing as much work as possible in between synchronizations; otherwise you can experience huge performance drops.
Note the difference between physical and logical CPU core. Processors with hyper-threading can have more than one logical core per physical core. "Secondary" logical cores do not have the same computational power as the "primary" as they are merely used to utilize vacancies in the processor pipeline usage.
For example, if you have a 4 core 8 thread CPU, in the case of a perfectly scaling workload you will see 4 times increase of performance going from 1 to 4 threads, but a lot less going from 4 to 8 threads, as evident from vu1p3n0x's answer.
You can look here for ways to determine the number of available CPU cores.
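For instance, a minimal sketch using the standard library (note that std::thread::hardware_concurrency() reports logical cores and may return 0 if the count is unknown):

#include <iostream>
#include <thread>

int main()
{
    unsigned n = std::thread::hardware_concurrency();   // logical cores; 0 if unknown
    if (n == 0)
        n = 1;   // fall back to a single thread when the hint is unavailable
    std::cout << "Suggested thread count: " << n << '\n';
}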

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete, which was to make a basic web crawler using 20 threads. Threading is pretty new to me, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i] = std::thread(SweepUrlList);
    }
    for (size_t i = 0; i < THREAD_COUNT; i++)
    {
        threads[i].join();
    }

    std::cout << std::endl;
    WriteToConsole();

    listUrl = listNewUrl;
    listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns each worker thread its job, which is the method SweepUrlList found below, and then joins all the threads.
while (1)
{
    mutextGetNextUrl.lock();
    std::set<std::string>::iterator it = listUrl.begin();
    if (it == listUrl.end())
    {
        mutextGetNextUrl.unlock();
        break;
    }
    std::string url(*it);
    listUrl.erase(*it);
    mutextGetNextUrl.unlock();

    ExtractEmail(url, listEmail);
    std::cout << ".";
}
So each worker thread loops until listUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parses it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if (email.length() != 0)
{
    mutextInsertNewEmail.lock();
    ListEmail.insert(email);
    mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.
This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread scheduling. The operating system works with the various threads and schedules work wherever there is an idle processor.
Assuming you have 4 cores with hyper-threading, you have 8 logical processors that can carry the load - but they also carry that of other applications (the operating system, the C++ debugger, and your application, to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. After you exceed the number of threads your processor can effectively use, threads begin to compete against each other for resources. This shows up (especially with intensive applications and tight loops) as poor performance.
Finally, this is a simplified answer but I suspect what you are seeing.
The simple answer is choke points. Something that you are doing is causing a choke point. When this occurs there is a slowdown. It could be in the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the answer below about cores being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is, and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not produce the same increase in output at the other end. In some cases it can decrease the output at the other end.
Take, for example, individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. Single file produces a rate of 1 person every second, so 100 seconds to clear the building. We may be able to halve that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out 8 abreast? The door is only 2 m wide, so with 8 abreast being equivalent to 4 m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row, and so on. Depending on the rate, this could cause temporary blockages and increase the time tenfold.
Threads are an operating system construct. Basically, each thread's state (which is basically all the CPU's registers and the virtual memory mapping [which is part of the process construct]) is saved by the operating system. Once the OS gives that specific thread "execution time" it restores this state and lets it run. Once this time is finished, it has to save this state again. The process of saving one thread's state and restoring another's is called context switching, and it takes a significant amount of time (usually between a couple of hundred and a few thousand CPU cycles).
There are also additional penalties to context switching. Some of the processor's caches (like the virtual memory translation cache, called the TLB) have to be flushed, pipelined instructions have to be discarded, and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, then 4 threads can run simultaneously. If you try to run 20 threads on a 4-core system, the OS has to manage time between those threads so that they appear to run in parallel. E.g., threads 1-4 will run for 50 milliseconds, then threads 5-8 will run for 50 milliseconds, etc.
Therefore, if all of your threads are running CPU-intensive operations, it is generally most efficient to make your program use the same number of threads as cores (sometimes called 'processors' in Windows). If you have more threads than cores, then context switching must happen, and that is overhead that can be minimized.
In general, more threads is not better. More threading provides value in two ways: higher parallelism and less blocking. More threading hurts through higher memory use, more context switching and more resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores that you have available. If your threads are already CPU bound the maximum value is generally 1x number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready then a larger number of threads could be beneficial.
However, if you have shared state between threads, or you are doing some form of message passing between threads, then you will run into synchronization and contention issues. As the number of threads increases, these types of overhead, as well as context switches, increasingly dominate the time spent doing your task.
Amdahl's law is a useful measure for determining whether higher parallelism will actually improve the total runtime of your job: if only a proportion p of the work can be parallelized, the speedup on n cores is at most 1 / ((1 - p) + p/n).
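A small sketch of that calculation (the 90% parallel fraction is just an example value, not from the question):

#include <iostream>

// Amdahl's law: with a fraction p of the work parallelizable,
// the best possible speedup on n cores is 1 / ((1 - p) + p / n).
double amdahlSpeedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main()
{
    const double p = 0.90;   // example: 90% of the runtime parallelizes
    const int cores[] = {1, 2, 4, 8, 16, 64};
    for (int n : cores)
        std::cout << n << " cores -> speedup " << amdahlSpeedup(p, n) << '\n';
    // Even with unlimited cores the speedup is capped at 1 / (1 - p) = 10x here.
}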
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

Multidimensional Array Initialization: Any benefit from Threading?

say I have the following code:
char array[5][5];
for (int i = 0; i < 5; ++i)
{
    for (int j = 0; j < 5; ++j)
    {
        array[i][j] = /* random char */;
    }
}
Would there be a benefit for initializing each row in this array in a separate thread?
Imagine instead of a 5 by 5 array, we have a 10 by 10?
n x n?
Also, this is done once, during application startup.
You're joking, right?
If not: The answer is certainly no!!!
You'd incur a lot of overhead for putting together enough synchronization to dispatch the work via a message queue, plus knowing all the threads had finished their rows and the arrays were ready. That would far outstrip the time it takes one CPU core to fill 25 bytes with a known value. So for almost any simple initialization like this you do not want to use threads.
Also bear in mind that threads provide concurrency but not speedup on a single core machine. If you have an operation which has to be completed synchronously--like an array initialization--then you'll only get value by adding a # of threads up to the # of CPU cores available. In theory.
So if you're on a multi-core system and if what you were putting in each cell took a long time to calculate... then sure, it may be worth exploiting some kind of parallelism. So I like genpfault's suggestion: write it multithreaded for a multi-core system and time it as an educational exercise just to get a feel for when the crossover of benefit happens...
Unless you're doing a significant amount of computation, no, there will not be any benefit. It's possible you might even see worse performance due to caching effects.
This type of initialization is memory-bound, not CPU bound. The time it takes to initialize the array depends on the speed of your memory; your CPU will just waste cycles spinning waiting for the memory operations to commit. Adding more threads will still have them all waiting for memory, and if they're all fighting over the same cache lines, the performance will be worse because now the caches of the separate CPUs have to synchronize with each other to avoid cache incoherency.
On modern hardware? Probably none, since you're not doing any significant computation. You'll most likely be limited by your memory bandwidth.
Pretty easy to test though. Whip up some OpenMP and give it a whirl!
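For instance, a hedged sketch of such a test (requires a compiler with OpenMP support, e.g. -fopenmp; the array size is just an example, and a cheap computed value is used instead of a true random char to keep the parallel loop thread-safe):

#include <cstdio>
#include <omp.h>
#include <vector>

int main()
{
    const int n = 10000;   // example size; vary it to find where threading starts to pay off
    std::vector<char> array(static_cast<std::size_t>(n) * n);

    double start = omp_get_wtime();
    #pragma omp parallel for                    // rows are split across the OpenMP threads
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            array[static_cast<std::size_t>(i) * n + j] = static_cast<char>((i + j) & 0x7F);
    double elapsed = omp_get_wtime() - start;

    std::printf("initialized %d x %d chars in %f s\n", n, n, elapsed);
}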
Doubtful, but for some value of n x n, maybe... though I'd imagine it's a really high n, and you'd probably already be multi-threading the processing of this data. Remember that these threads will be writing back to the same area, which may also lead to cache contention.
If you want to know for sure, try it and profile.
Also, this is done once, during application startup.
For this kind of thing, the cost of allocating the threads is probably greater than what you save by using them. Especially if you only need to do it once.
I did something similar, but in my case, the 2d array represented pixels on the screen. I was doing pretty expensive stuff, colour lerping, Perlin noise calculation... When launching it all in a single thread, I got around 40 fps, but when I added slave threads responsible for calculating rows of pixels, I managed to double that result. So yes, there might be situations where multithreading helps in speeding up whatever you do in the array, providing that what you do is expensive enough to justify using multiple threads.
You can download a live demo where you adjust the number of threads to watch the fps counter change: http://umbrarumregnum.110mb.com/download/mnd (the multithreading test is the "Noise Demo 3").