Goroutine execution time with different input data - concurrency

I am experimenting with goroutines to parallelize some computation. However, the execution times confuse me. My experiment setup is simple:
runtime.GOMAXPROCS(3)
datalen := 1000000000
data21 := make([]float64, datalen)
data22 := make([]float64, datalen)
data23 := make([]float64, datalen)
t := time.Now()
res := make(chan interface{}, 3)
go func() {
    for i := 0; i < datalen; i++ {
        data22[i] = math.Sqrt(13)
    }
    res <- true
}()
go func() {
    for i := 0; i < datalen; i++ {
        data22[i] = math.Sqrt(13)
    }
    res <- true
}()
go func() {
    for i := 0; i < datalen; i++ {
        data22[i] = math.Sqrt(13)
    }
    res <- true
}()
for i := 0; i < 3; i++ {
    <-res
}
fmt.Printf("The parallel for loop took %v to run.\n", time.Since(t))
Notice that all 3 goroutines write to the same slice (data22). The execution time for this program is:
The parallel for loop took 7.436060182s to run.
However, if I let each goroutine handle different data as follows:
runtime.GOMAXPROCS(3)
datalen := 1000000000
data21 := make([]float64, datalen)
data22 := make([]float64, datalen)
data23 := make([]float64, datalen)
t := time.Now()
res := make(chan interface{}, 3)
go func() {
    for i := 0; i < datalen; i++ {
        data21[i] = math.Sqrt(13)
    }
    res <- true
}()
go func() {
    for i := 0; i < datalen; i++ {
        data22[i] = math.Sqrt(13)
    }
    res <- true
}()
go func() {
    for i := 0; i < datalen; i++ {
        data23[i] = math.Sqrt(13)
    }
    res <- true
}()
for i := 0; i < 3; i++ {
    <-res
}
fmt.Printf("The parallel for loop took %v to run.\n", time.Since(t))
The execution time for this is almost 3 times longer than before, and is roughly equal to (or worse than) sequential execution without goroutines:
The parallel for loop took 20.744438468s to run.
I guess I am using goroutines the wrong way. What is the correct way to use multiple goroutines to handle different pieces of data?

Since your example program is not performing any substantial calculation, the bottleneck is going to be the speed at which data can be written to memory. With the settings in the example, we're talking about 22 GiB of writes (3 billion 8-byte stores), which is not insignificant.
Given the difference in run time between the two examples, one likely possibility is that the first version isn't actually writing as much to RAM. Since memory writes are cached by the CPU, its execution probably looks something like this:
the first goroutine writes out data to a cache line representing the start of the data22 array.
the second goroutine writes out data to a cache line representing the same location. The CPU running the first goroutine notices that the write invalidates its own cached write, so throws away its changes.
the third goroutine writes out data to a cache line representing the same location. The CPU running the second goroutine notices that the write invalidates its own cached write, so throws away its changes.
the cache line in the third CPU is evicted and the changes are written out to RAM.
This process continues as the goroutines progress through the data22 array. Since RAM is the bottleneck and we end up writing one third as much data in this scenario, it isn't that surprising that it runs approximately 3 times as fast as the second case.

You are using enormous amounts of memory. 1000000000 * 8 = 8GB in the first example and 3 * 1000000000 * 8 = 24GB in the second example. In the second example you are probably using lots of swap space. Disk I/O is very, very slow, even on an SSD.
Change datalen := 1000000000 to datalen := 100000000, a 10-fold decrease. What are your run times now? Average at least three runs of each example. How much memory does your computer have? Are you using an SSD?
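To address the question's closing line directly: rather than giving each goroutine its own full-size slice, the usual pattern is to split one slice into disjoint chunks, one per goroutine. The following is a minimal sketch, not code from either answer above; it deliberately uses a smaller datalen (as suggested) and a sync.WaitGroup instead of a result channel.
package main

import (
    "fmt"
    "math"
    "runtime"
    "sync"
    "time"
)

func main() {
    workers := runtime.NumCPU()
    datalen := 100000000 // smaller than the original example to avoid swapping
    data := make([]float64, datalen)

    t := time.Now()
    var wg sync.WaitGroup
    chunk := (datalen + workers - 1) / workers
    for w := 0; w < workers; w++ {
        start := w * chunk
        end := start + chunk
        if end > datalen {
            end = datalen
        }
        wg.Add(1)
        go func(start, end int) {
            defer wg.Done()
            // each goroutine writes only to its own region of the slice
            for i := start; i < end; i++ {
                data[i] = math.Sqrt(13)
            }
        }(start, end)
    }
    wg.Wait()
    fmt.Printf("The chunked parallel loop took %v to run.\n", time.Since(t))
}
Because this workload does almost no computation per element, it will still be limited by memory bandwidth, as the first answer explains; the chunked pattern pays off once each element involves more work.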

Related

Why would adding a delay improve data throughput in this multithreaded environment?

In my application, I have two threads, a producer (thread 1) and a consumer (thread 2). Each thread has an input and output interface (effectively a pointer to a list) that is connected to a third thread which serves as a router.
When the producer writes, it calls memcpy to copy data into a buffer and pushes the buffer into a list. Meanwhile, the router thread is round-robin searching through all the threads that are connected to it and monitoring their interfaces to see if any thread has data to send out. When it sees that thread 1's list is non-empty, it checks to determine which thread the data is intended for. The data is spliced into the destination thread's (in this case thread 2) input list, at which point thread 2 will malloc some memory, memcpy the data into it and return the pointer to this new region.
For my test, I'm measuring throughput to see how long it takes to send 100k messages of varying sizes. Thread 1 sends data of some size, thread 2 reads it and sends back a small reply message, which thread 1 reads. This would be one complete exchange. In the first test, thread 1 sends all 100k messages and then reads the 100k replies. In the second test, thread 1 alternates between sending a message and waiting for its reply, repeated 100k times. In both tests, thread 2 sits in a loop reading a message and sending a reply. I would expect test 1 to have higher throughput because the threads should spend less time waiting around. However, it has markedly worse throughput than test 2. I've measured how long the individual read/write calls take in the two test cases, and they invariably take longer in test 1 (based on the means and medians, with no delay added), though the numbers are of the same order of magnitude.
When I add a loop doing nothing into thread 1's sending loop in test 1, I see dramatically improved throughput for this case as opposed to not having the delay. My only guess is that adding a delay slows down the producer so the consumer can absorb the data which prevents its input list from growing very large. I'm wondering if there may be other explanations and if so, how I can test for them.
Edit
Unfortunately, my own code is just the test I described above which calls a library that actually performs the reads/writes, creates that third thread etc. It's difficult to make a minimal example out of it because the library is complex and not mine. I provide some pseudocode to illustrate the setup in more detail.
int NUM_ITERATIONS = 100000;
int msg_reply = 2;   // size of the reply message in words
int msg_size = 512;  // indicates 512 64 bit words

void generate(int iterations, int size, interface* out){
    std::vector<long long> vec(size);
    for(int i = 0; i < size; i++)
        vec[i] = (long long) i;
    for(int i = 0; i < iterations; i++)
        out->lib_write((char*) vec.data(), size);
}

void receive(int iterations, int size, interface* in){
    for(int i = 0; i < iterations; i++){
        char* data = in->lib_read(size);
    }
}

void producer(interface* in, interface* out){
    // test 1
    start = std::chrono::high_resolution_clock::now();
    // write data of size msg_size, NUM_ITERATIONS times to out
    generate(NUM_ITERATIONS, msg_size, out);
    // read data of size msg_reply, NUM_ITERATIONS times from in
    receive(NUM_ITERATIONS, msg_reply, in);
    end = std::chrono::high_resolution_clock::now();
    // using NUM_ITERATIONS, msg_size and time, compute and print throughput to stdout
    print_throughput(end-start, "throughput_0", msg_size);

    // test 2
    start = std::chrono::high_resolution_clock::now();
    for(int j = 0; j < NUM_ITERATIONS; j++){
        generate(1, msg_size, out);
        receive(1, msg_reply, in);
    }
    end = std::chrono::high_resolution_clock::now();
    print_throughput(end-start, "throughput_1", msg_size);
}

void consumer(interface* in, interface* out){
    for(int i = 0; i < 2; i++){
        for(int j = 0; j < NUM_ITERATIONS; j++){
            receive(1, msg_size, in);
            generate(1, msg_reply, out);
        }
    }
}
The calls to lib_write() and lib_read() become fairly complex. To elaborate on the description above, the data gets memcpy'd into a buffer and then moved into a list. The interface has a condition variable member and the write calls its notify_one() method. The third thread is looping through all the interface pointers it has and checking to see if their lists are non-empty. If so, the data is spliced from one output list to the destination's input list using the splice() method in std::list. Meanwhile, the consumer calls the lib_read() which waits on the condition variable while the interface is empty, and then memcpy's the data into a new region and returns it.
// note: these will not compile as is. Undefined variables are class members
char * interface::lib_read(size_t * _size){
    char * ret;
    {
        std::unique_lock<std::mutex> lock(mutex);
        // packets is an std::list containing the incoming data
        while (packets.empty()) {
            cv.wait(lock);
        }
        curr_read_it = packets.begin();
    }
    size_t buff_size = curr_read_it->size;
    ret = (char *)malloc(buff_size);
    memcpy((char *)ret, (char *)curr_read_it->data, buff_size);
    {
        std::unique_lock<std::mutex> lock(mutex);
        packets.erase(curr_read_it);
        curr_read_it = packets.end();
    }
    return ret;
}

void interface::lib_write(char * data, int size){
    // indicates the destination thread id
    long long header = 1;
    // buffer is just an array that's max packet sized
    memcpy((char *)buffer.data, &header, sizeof(long long));
    memcpy((char *)buffer.data + sizeof(long long), (char *)data, size * sizeof(long long));
    std::lock_guard<std::mutex> guard(mutex);
    packets.push_back(std::move(buffer));
    cv.notify_one();
}
// this is on thread 3
void route(){
    do{
        // this is a vector containing all the "out" interfaces
        for(int i = 0; i < out_ptrs.size(); i++){
            interface <long long> * _out = out_ptrs[i];
            if(!_out->empty()){
                // this just returns the header id (also locks the mutex)
                long long dest = _out->get_dest();
                // looks up the correct interface based on the id and splices
                // a packet from _out into the appropriate one. Locks mutex
                in_ptrs[dest_map[dest]]->splice(_out);
            }
        }
    } while(!done());
}
I was looking for general advice on what factors may influence multithreading performance and what to test for in order to better understand what was going on.
I talked to some other people, and the advice that proved helpful was to determine whether OS scheduling was the issue (which is what I suspected but was unsure how to test). Essentially, I used taskset and sched_setaffinity() to force the application to run on one core or on a subset of cores, and compared those runs to each other and to the unrestricted case.
Based on the restrictions, I got dramatically different results and could see some trends, so I'm fairly confident in saying it's an OS scheduling issue; different scheduling choices can yield better performance under different workloads.

parallel for is slower than sequential for

My program is supposed to compute all distinct rotations of words and texts, in parallel.
If you do not know what this means: Rotations of "BANANA" are
BANANA
ANANAB
NANABA
ANABAN
NABANA
ABANAN
(simply put the first letter to the end.)
vector<string> rotate_sequentiell( string* word )
{
    vector<string> all_rotations;
    for ( unsigned int i = 0; i < word->size(); i++ )
    {
        string rotated = word->substr( i ) + word->substr( 0, i );
        all_rotations.push_back( rotated );
    }
    if ( verbose ) { printVec(&all_rotations, "Rotations"); }
    return all_rotations;
}
We should be able to make this parallel. Instead of moving just one letter to the end, I want to move two letters at once to the end, so for example, we take BANANA
Take te "BA" to the end and get NANA BA, which is the third entry in the list above.
I implemented it like this
vector<string> rotate_parallel( string* word )
{
    vector<string> all_rotations( word->size() );
    #pragma omp parallel for
    for ( unsigned int i = 0; i < word->size(); i++ )
    {
        string rotated = word->substr( i ) + word->substr( 0, i );
        all_rotations[i] = rotated;
    }
    if ( verbose ) { printVec(&all_rotations, "Rotations"); }
    return all_rotations;
}
I pre-calculated the number of possible rotations and used the #pragma omp parallel for, so it should do what I think it does.
To test these functions, I have a 40 KB text file which is meant to be "rotated". I want all the distinct rotations of a large text.
What happens is that the sequential version takes about 4.3 seconds and the parallel version about 6.5 seconds.
Why is that so? What am I doing wrong?
This is how I measure time:
clock_t start, finish;
start = clock();
bwt_encode_parallel( &glob_word, &seperator );
finish = clock();
cout << "Time (seconds): "
<< ((double)(finish - start))/CLOCKS_PER_SEC;
I compile my code with
g++ -O3 -g -Wall -lboost_regex -fopenmp -fmessage-length=0
The parallel version has 2 sources of additional work compared to the sequential version:
(1) overhead of starting the threads, and
(2) coordination and locking between the threads.
The impact of (1) should diminish as the data set grows larger, and it probably can't account for 2 seconds anyway, but it does set a limit on how small a job is worth parallelizing.
(2) is in your case probably mostly caused by OpenMP assigning tasks to the threads, and by the different threads doing memory allocation for the two intermediate substrings and the final string "rotated" - the memory allocation routine probably has to take a global lock before it can reserve a piece of the heap for you.
Preallocating the final storage in a single thread and guiding OpenMP to run the parallel loop in large blocks of 2048 iterations per thread tilts the result in favor of the parallel execution. I get about 700 ms for the single-threaded and 330 ms for the multithreaded version with the code below:
enum {SZ = 40960};
std::string word;
word.resize(SZ);
for (int i = 0; i < SZ; i++) {
    word[i] = (i & 127) + 1; // put stuff into the word
}
std::vector<std::string> all_rotations(SZ);
clock_t start, finish;
start = clock();
for (int i = 0; i < (int)word.size(); i++) {
    all_rotations[i].reserve(SZ);
}
#pragma omp parallel for schedule (static, 2048)
for (int i = 0; i < (int)word.size(); i++) {
    std::string rotated = word.substr(i) + word.substr(0, i);
    all_rotations[i] = rotated;
}
finish = clock();
printf("Time (seconds): %0.3lf\n", ((double)(finish - start))/CLOCKS_PER_SEC);
Lastly, when you need the results of the Burrows-Wheeler transform, you don't necessarily want N copies of a string that contains N characters. It would save space and processing to treat the string as a ring buffer and read each rotation from a different offset in the buffer.
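To illustrate the ring-buffer idea: character j of rotation i is simply the character at position (i + j) mod N of the original string, so no rotated copies need to be materialized. A tiny sketch follows, written in Go (the other language in this collection) since the indexing is language-independent; rotationChar is a made-up helper name, and the C++ version is analogous.
package main

import "fmt"

// rotationChar returns character j of rotation i of word
// without materializing any rotated copy of the string.
func rotationChar(word string, i, j int) byte {
    return word[(i+j)%len(word)]
}

func main() {
    word := "BANANA"
    // print rotation 2 ("NANABA") one character at a time
    for j := 0; j < len(word); j++ {
        fmt.Printf("%c", rotationChar(word, 2, j))
    }
    fmt.Println()
}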

For loop performance and multithreaded performance questions

I was kind of bored, so I wanted to try std::thread and eventually measure the performance of a single-threaded versus multithreaded console application. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
    sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
    sum += val;

for (auto it = ints.begin(); it != ints.end(); ++it)
    sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug version, so I changed to release and the difference was ~1ms in favor of index based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference in 1ms in release mode?
I moved on to the thread creation, and tried to do a parallel sum of the array with this lambda (captured everything by reference so I could use vector of ints and a mutex previously declared) using index based for.
auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;
    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;
    m.lock();
    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
    sum += partial_sum;
    m.unlock();
};

for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads ended up at ~30 ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 hardware threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler (or whatever it's called).
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but actually called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, change into turbo mode, change into heat protection mode. And rounding the times to milliseconds isn't really helpful. One thread could take 6.49 milliseconds and another 6.51, and they got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by high granularity of the "high_resolution_clock" in the used implementation of the standard library and by differences in process scheduling during the execution. I measured the index based for being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and to context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the /openmp compiler switch. If you enable it and apply the OpenMP #pragmas to your for-loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, as a rule of thumb, going parallel typically starts to pay off when the computation time is well beyond the length of a time slice.
Here is my "native" Windows implementation, which - for a large enough array - finally makes the threads win over a single-threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here the output on my machine of this code:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O benchmark than a CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into the CPU caches, and so on.
Maybe it would be interesting to experiment with how the data is split into ranges. What if each thread were "jumping" over the data from start to end with a stride of ncores? Thread 1: 0, 8, 16, ...; thread 2: 1, 9, 17, ...; etc. Maybe then the "locality" of the memory accesses could gain extra speed.

Assembly Line in Golang using concurrency

New to Go. I'm attempting to code an "assembly line" where multiple functions act like workers and pass some data structure to each other down the line, each doing something to the data structure.
type orderStruct struct {
    orderNum, capacity int
    orderCode          uint64
    box                [9]int
}

func position0(in chan orderStruct) {
    order := <-in
    if (order.orderCode<<63)>>63 == 1 {
        order.box[order.capacity] = 1
        order.capacity += 1
    }
    fmt.Println(" filling box {", order.orderNum, order.orderCode, order.box, order.capacity, "} at position 0")
}

func startOrder(in chan orderStruct) {
    order := <-in
    fmt.Printf("\nStart an empty box for customer order number %d , request number %d\n", order.orderNum, order.orderCode)
    fmt.Println(" starting box {", order.orderNum, order.orderCode, order.box, order.capacity, "}")
    d := make(chan orderStruct, 1)
    go position0(d)
    d <- order
}

func main() {
    var orders [10]orderStruct
    numOrders := len(os.Args) - 1
    var x int
    for i := 0; i < numOrders; i++ {
        x, _ = strconv.Atoi(os.Args[i+1])
        orders[i].orderCode = uint64(x)
        orders[i].orderNum = i + 1
        orders[i].capacity = 0
        for j := 0; j < 9; j++ {
            orders[i].box[j] = 0
        }
        c := make(chan orderStruct)
        go startOrder(c)
        c <- orders[i]
    }
}
So basically the issue I'm having is that the print statements in startOrder() execute fine, but when I try to pass the struct to position0(), nothing is printed. Am I misunderstanding how channels work?
Pipelines are a great place to start when learning to program concurrently in Go. Nick Craig-Wood's answer provides a working solution to this specific challenge.
There is a whole range of other ways to use concurrency in Go. Broadly, there are three categories divided according to what is being treated as concurrent:
Functional decomposition - Creating pipelines of several functions is a good way to get started - and is your question's topic. It's quite easy to think about and quite productive. However, if it progresses to truly parallel hardware, it's quite hard to balance the load well. Everything goes at the speed of the slowest pipeline stage.
Geometric decomposition - Dividing the data up into separate regions that can be processed independently (or without too much communication). Grid-based systems are popularly used in certain domains of scientific high-performance computing, such as weather-forecasting.
Farming - Identifying how the work to be done can be chopped into (a large number of) tasks that can be handed to 'workers' one by one until all are completed. Often, the number of tasks far exceeds the number of workers. This category includes all the so-called 'embarrassingly parallel' problems (embarrassing because if you fail to get your high-performance system to give linear speed-up, you look a bit daft); a small worker-pool sketch follows below.
I could add a fourth category of hybrids of several of the above.
There is quite a lot of literature about this, including much from the days of Occam programming in the '80s and '90s. Go and Occam both use CSP message passing so the issues are similar. I would single out the helpful book Practical Parallel Processing: An introduction to problem solving in parallel (Chalmers and Tidmus 1996).
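For the farming category mentioned above, here is a minimal, hypothetical worker-pool sketch in Go; it is not from the original post, and the task (doubling integers) and worker count are made up purely for illustration.
package main

import (
    "fmt"
    "sync"
)

func main() {
    const numWorkers = 3
    tasks := make(chan int)
    results := make(chan int)

    var wg sync.WaitGroup
    for w := 0; w < numWorkers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // each worker pulls tasks until the channel is closed
            for t := range tasks {
                results <- t * 2 // stand-in for real work
            }
        }()
    }

    // close results once every worker has finished
    go func() {
        wg.Wait()
        close(results)
    }()

    // feed the tasks and then signal that there are no more
    go func() {
        for i := 1; i <= 10; i++ {
            tasks <- i
        }
        close(tasks)
    }()

    for r := range results {
        fmt.Println(r)
    }
}
The same close-the-channel and WaitGroup conventions appear in the rewritten pipeline code below.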
I've attempted to re-write what you've written to work properly. You can run it on the playground
The main differences are
only two go routines are started - these act as the two workers on the production line - one taking orders and the other filling boxes
use of sync.WaitGroup to find out when they end
use of for x := range channel
use of close(c) to signal end of channel
you could start multiple copies of each worker and the code would still work fine (repeat the wg.Add(1); go startOrders(c, wg) code)
Here is the code
package main

import (
    "fmt"
    "sync"
)

type orderStruct struct {
    orderNum, capacity int
    orderCode          uint64
    box                [9]int
}

func position0s(in chan orderStruct, wg *sync.WaitGroup) {
    defer wg.Done()
    for order := range in {
        if (order.orderCode<<63)>>63 == 1 {
            order.box[order.capacity] = 1
            order.capacity += 1
        }
        fmt.Println(" filling box {", order.orderNum, order.orderCode, order.box, order.capacity, "} at position 0")
    }
}

func startOrders(in chan orderStruct, wg *sync.WaitGroup) {
    defer wg.Done()
    d := make(chan orderStruct)
    wg.Add(1)
    go position0s(d, wg)
    for order := range in {
        fmt.Printf("\nStart an empty box for customer order number %d , request number %d\n", order.orderNum, order.orderCode)
        fmt.Println(" starting box {", order.orderNum, order.orderCode, order.box, order.capacity, "}")
        d <- order
    }
    close(d)
}

func main() {
    var orders [10]orderStruct
    numOrders := 4
    var x int = 10
    wg := new(sync.WaitGroup)
    c := make(chan orderStruct)
    wg.Add(1)
    go startOrders(c, wg)
    for i := 0; i < numOrders; i++ {
        orders[i].orderCode = uint64(x)
        orders[i].orderNum = i + 1
        orders[i].capacity = 0
        for j := 0; j < 9; j++ {
            orders[i].box[j] = 0
        }
        c <- orders[i]
    }
    close(c)
    wg.Wait()
}

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting the code to run about 2 times as fast with two threads. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: Now using the clock() function on the computer to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);

int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;

    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if(data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D, &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(uint32_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    register double holding = *result;
    for(uint32_t i = 0; i < length; i ++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds. 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
    *result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
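If you want to see this effect in isolation, here is a self-contained sketch, written in Go (the other language in this collection) since the original C++ program is not shown in full. It times two threads incrementing adjacent counters versus counters padded onto separate cache lines; the 64-byte line size is an assumption, and the iteration count is arbitrary.
package main

import (
    "fmt"
    "sync"
    "time"
)

const iters = 50000000

// shared: both counters live on the same cache line.
type shared struct {
    a int64
    b int64
}

// padded: the padding pushes b onto a different cache line
// (assumes 64-byte cache lines, which is typical but not universal).
type padded struct {
    a int64
    _ [56]byte
    b int64
}

func bump(n *int64, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := 0; i < iters; i++ {
        *n += 1
    }
}

func run(a, b *int64) time.Duration {
    var wg sync.WaitGroup
    wg.Add(2)
    start := time.Now()
    go bump(a, &wg)
    go bump(b, &wg)
    wg.Wait()
    return time.Since(start)
}

func main() {
    var s shared
    var p padded
    fmt.Println("adjacent counters:", run(&s.a, &s.b))
    fmt.Println("padded counters:  ", run(&p.a, &p.b))
}
On most machines the adjacent-counter run is noticeably slower, which is the same kind of penalty the result[0]/result[1] pair pays in the question's code.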
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data is 128M doubles (roughly 1 GB), which is not a lot for modern processors to process in a single loop.