I have a matrix, call it small_matrix, made up of about 100000 rows and 128 columns, stored as a single flat array (I'm using it for CUDA calculations, so the saved space is needed). I have a larger matrix, call it large_matrix, with 10x the number of rows and the same row length as small_matrix, and I want to populate its rows with rows from small_matrix. However, the population process isn't 1:1. There is a map array that maps every row in large_matrix to a row in small_matrix; a single row in small_matrix can be mapped to by multiple rows in large_matrix. We can assume that the map array is randomly generated. Also, there is a small chance (assume it to be 1%) that a row in large_matrix will have random values instead of actual values.
I'm trying to optimize this process by means of parallelism using OpenMP in C++, but everything I've tried so far has only led to increasing the runtime with more threads instead of decreasing it. Here is the problem's code; I'm trying to optimize expand_matrix:
#include <stdio.h>
#include <omp.h>
#include <random>
#include <stdlib.h>
#include <cstddef>
#include <ctime>
#include <cstring>
using namespace std;
inline void* aligned_malloc(size_t size, size_t align){
    void *result;
#ifdef _MSC_VER
    result = _aligned_malloc(size, align);
#else
    if (posix_memalign(&result, align, size)) result = 0;
#endif
    return result;
}

inline void aligned_free(void *ptr) {
#ifdef _MSC_VER
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map, float*small_matrix, float* large_matrix, const int num_threads);
int main(){
    int row_length = 128;
    long long small_matrix_rows = 100000;
    long long large_matrix_rows = 1000000;
    long long *map = new long long [large_matrix_rows];
    float *small_matrix = (float*)aligned_malloc(small_matrix_rows*128*sizeof(float), 128);
    float *large_matrix = (float*)aligned_malloc(large_matrix_rows*128*sizeof(float), 128);
    minstd_rand gen(std::random_device{}()); //NOTE: Valgrind will give an error saying: vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1 :: look: https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/
    uniform_real_distribution<double> values_dist(0, 1);
    uniform_int_distribution<long long> map_dist(0, small_matrix_rows - 1); // upper bound is inclusive
    for (long long i = 0; i < small_matrix_rows*row_length; i++){
        small_matrix[i] = values_dist(gen) - 0.5;
    }
    for (long long i = 0; i < large_matrix_rows; i++){
        if (values_dist(gen) < 0.99)
            map[i] = map_dist(gen);
        else
            map[i] = -1; // ~1% of rows are unmapped; expand_matrix fills them with a constant
    }
    clock_t start, end;
    int num_threads = 4;
    printf("Populated matrix and generated map\n");
    start = clock();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = clock();
    printf("Time to expand using %d threads = %f\n", num_threads, double(end-start)/CLOCKS_PER_SEC);
    return 0;
}
void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map, float* small_matrix, float* large_matrix, const int num_threads){
    #pragma omp parallel num_threads(num_threads)
    {
        #pragma omp for schedule(guided, 4)
        for (unsigned int i = 0; i < num_rows_in_large_matrix; i++){
            long long sml = map[i];
            if (sml == -1){
                for (int j = 0; j < row_length; j++)
                    large_matrix[i * row_length + j] = 0.5;
            }
            else {
                memcpy(large_matrix + i*row_length, small_matrix + sml*row_length, row_length*sizeof(float));
            }
        }
    }
}
Here are some runtimes:
Time to expand using 1 threads = 0.402949
Time to expand using 2 threads = 0.530361
Time to expand using 4 threads = 0.608085
Time to expand using 8 threads = 0.667806
Time to expand using 16 threads = 0.999886
I've made sure the matrices are aligned in memory, and I've tried using non-temporal instructions for the copy, but I am stumped. I don't know where to look anymore. Any help is much appreciated.
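(For context, a non-temporal row copy of the kind I mean looks roughly like this; a sketch, not my exact code, assuming SSE, 16-byte-aligned rows and row_length being a multiple of 4:)
#include <xmmintrin.h> // _mm_load_ps, _mm_stream_ps, _mm_sfence

// Sketch: copy one row with streaming (non-temporal) stores that bypass the cache.
inline void stream_copy_row(float* dst, const float* src, int row_length){
    for (int j = 0; j < row_length; j += 4){
        __m128 v = _mm_load_ps(src + j); // aligned load
        _mm_stream_ps(dst + j, v);       // non-temporal store
    }
    _mm_sfence(); // make the streaming stores globally visible
}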
Some hardware information:
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
Using Ubuntu 16.04 and gcc version 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04).
Thanks to @Gilles and @Zulan for pointing out the error. I will post it as an answer so others can see the issue. I was using the wrong time measurement method; my method does not work with multithreaded applications. In other words, I misused the clock() function. Here is @Gilles's answer:
clock() measures CPU time which increases with the number of CPU you add. omp_get_wtime() measures the wallclock time which is the one you want to see decreasing
The function I was using to measure the execution time is clock(). This function counts the number of processor ticks consumed by all the processors involved in running the code. When I run my code in parallel on multiple processors, the ticks returned by clock() are the total over all of them, so the number only increases as I add processors. When I switched the time measurement to omp_get_wtime(), the returned timing was correct and I got the following results:
1 thread = 0.423516
4 threads = 0.152680
8 threads = 0.090841
16 threads = 0.064748
So, instead of measuring the runtime like this:
clock_t start, end;
start = clock();
expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
end = clock();
printf("Total time %f\n", double(end-start)/CLOCKS_PER_SEC);
I do it like this:
double start, end;
start = omp_get_wtime();
expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
end = omp_get_wtime();
printf("Total time %f\n", end-start);
I'm trying to call a function every n seconds. Below is a simple representation of what I'm trying to achieve. I wanted to know if this is the only way to do it; I would love it if the "if" condition could be avoided.
#include <stdio.h>
#include <time.h>
void print_hello(int i) {
    printf("hello\n");
    printf("%d\n", i);
}

int main () {
    time_t start_t, end_t;
    double diff_t;
    time(&start_t);
    int i = 0;
    while (1) {
        time(&end_t);
        // printf("here in main");
        i = i + 1;
        diff_t = difftime(end_t, start_t);
        if (diff_t == 5) {
            // printf("Execution time = %f\n", diff_t);
            print_hello(i);
            time(&start_t);
        }
    }
    return(0);
}
The usage of time in the OP's program can be reduced to something like:
// get tStart;
// set tEnd = tStart + x;
do {
// get t;
} while (t < tEnd);
This is what is called busy-wait.
It can be used where the most precise timing is needed, as well as in other special cases. The drawback is that the waiting consumes a full CPU core. (You might even be able to hear this in the rising fan noise.)
In general, however, spinning is considered an anti-pattern and should be avoided, as processor time that could be used to execute a different task is instead wasted on useless activity.
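To make the pattern above concrete, such a busy-wait could be written with <chrono> like this (a sketch, not code to put into production):
#include <chrono>

// Sketch: spin until tEnd, burning a full core while checking the clock.
void spin_until(std::chrono::steady_clock::time_point tEnd)
{
    while (std::chrono::steady_clock::now() < tEnd) {
        // busy-wait: do nothing
    }
}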
Another option is to delegate the wake-up to the system, which reduces the load of the process/thread to a minimum while waiting:
#include <chrono>
#include <iostream>
#include <thread>
void print_hello(int i)
{
    std::cout << "hello\n"
              << i << '\n';
}

int main ()
{
    using namespace std::chrono_literals; // to support e.g. 5s for 5 seconds
    auto tStart = std::chrono::system_clock::now();
    for (int i = 1; i <= 3; ++i) {
        auto tEnd = tStart + 2s;
        std::this_thread::sleep_until(tEnd);
        print_hello(i);
        tStart = tEnd;
    }
}
Output:
hello
1
hello
2
hello
3
Live Demo on coliru
(I had to reduce the number of iterations and the waiting time to avoid the TLE in the online compiler.)
std::this_thread::sleep_until
Blocks the execution of the current thread until specified sleep_time has been reached.
The clock tied to sleep_time is used, which means that adjustments of the clock are taken into account. Thus, the duration of the block might, but might not, be less or more than sleep_time - Clock::now() at the time of the call, depending on the direction of the adjustment. The function also may block for longer than until after sleep_time has been reached due to scheduling or resource contention delays.
The last sentence mentions the drawback of this solution: the OS may decide to wake the thread/process up later than requested. That may happen e.g. if the OS is under high load. In the “normal” case, the latency shouldn't be more than a few milliseconds, so it is usually tolerable.
Please note how tEnd and tStart are updated in the loop: the next deadline is computed from the previous deadline rather than from the actual wake-up time, to prevent latencies from accumulating.
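To make that explicit, here is a sketch of the same periodic loop; sleeping a fixed duration from "now" (sleep_for) would let every late wake-up push all later ones back, while sleep_until(target) does not:
#include <chrono>
#include <thread>

// Sketch: periodic loop that does not accumulate latency.
void run_periodic(int iterations)
{
    using namespace std::chrono_literals;
    auto target = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        target += 2s;                          // next deadline, based on the previous deadline
        std::this_thread::sleep_until(target); // a late wake-up does not shift later deadlines
        // ... periodic work here ...
    }
}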
I was kind of bored, so I wanted to try out std::thread and eventually measure the performance of single-threaded and multithreaded console applications. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug version, so I changed to release and the difference was ~1ms in favor of index based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference in 1ms in release mode?
I moved on to the thread creation, and tried to do a parallel sum of the array with this lambda (captured everything by reference so I could use vector of ints and a mutex previously declared) using index based for.
auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;
    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;
    m.lock();
    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
    sum += partial_sum;
    m.unlock();
};

for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? The solution with 2 threads ended up at ~30ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and its v120 compiler toolset.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that happen behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator==, and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, switch into heat-protection mode. And rounding the times to milliseconds isn't really helpful: one thread could take 6.49 milliseconds and another 6.51, and they get rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
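To keep the release-mode comparison meaningful, make the result observable after the timed region; a sketch of one way to do it (using a volatile sink, which is an illustrative choice, not the only one):
#include <chrono>
#include <iostream>
#include <vector>

// Sketch: make the sum observable so the optimizer cannot drop the loop.
volatile long long sink; // writes to a volatile count as observable behaviour

void timed_sum(const std::vector<int>& ints)
{
    auto start = std::chrono::high_resolution_clock::now();
    long long sum = 0;
    for (int v : ints)
        sum += v;
    auto end = std::chrono::high_resolution_clock::now();
    sink = sum; // forces the computation to be kept
    std::cout << std::chrono::duration<double, std::milli>(end - start).count() << " ms\n";
}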
The difference of 1 ms may be produced by the granularity of high_resolution_clock in the standard library implementation used, and by differences in process scheduling during execution. I measured the index-based for loop as 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and to context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS (admittedly, most of them asleep) that also need time, and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check whether Visual Studio has iterator checking enabled in debug builds. When it's enabled, every use of an iterator is bounds-checked and so on. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the /openmp compiler switch. If you enable it and apply the OpenMP #pragmas to your for-loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
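As a sketch of what that looks like (assuming your vector<int> and a compiler with OpenMP enabled, /openmp on MSVC or -fopenmp on GCC), the whole parallel sum collapses to:
#include <vector>

// Sketch: OpenMP handles thread creation and the final reduction.
long long parallel_sum(const std::vector<int>& ints)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < (long long)ints.size(); ++i)
        sum += ints[i];
    return sum;
}
OpenMP creates the threads, splits the iteration range, and combines the per-thread partial sums through the reduction clause; no mutex or manual partitioning is needed.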
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
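If you want Debug-build timings that are closer to Release, the checking can be disabled; a hedged sketch (assuming VS2010 or later, and that the macro is defined identically for every translation unit and linked library):
// Hypothetical measurement-build tweak: disable checked/debug iterators.
// Must appear before any standard header (or be set project-wide via /D),
// and must match across all object files, or the linker will complain.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>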
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than what I'm seeing and by the 8-thread totals being so much greater, which, as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this demonstrates just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, during the time the first thread instance becomes active, the main thread might (or might not) be preempted. There is no guarantee that the threads are spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, it starts to pay off to go parallel when the computation time is well beyond the length of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>

struct Range
{
    Range(const int32_t *p, size_t l)
        : data(p)
        , length(l)
        , result(0)
    {}
    const int32_t *data;
    size_t length;
    int32_t result;
};

static int32_t Sum(const int32_t *data, size_t length)
{
    int32_t sum = 0;
    const int32_t *end = data + length;
    for (; data != end; data++)
    {
        sum += *data;
    }
    return sum;
}

static int32_t TestSingleThreaded(const Range& range)
{
    return Sum(range.data, range.length);
}

DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
    Range *myRange = reinterpret_cast<Range*>(lpParameter);
    myRange->result = Sum(myRange->data, myRange->length);
    return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    while (nextStart < range.length)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);

    std::vector<HANDLE> threadHandles;
    threadHandles.reserve(ncores);
    for (size_t i = 0; i < ncores; ++i)
    {
        threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
    }
    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);

    DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
    if (WAIT_OBJECT_0 == waitResult)
    {
        for (auto& r : ranges)
        {
            result += r.result;
        }
        result += remainderResult;
    }
    else
    {
        throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
    }
    for (auto& h : threadHandles)
    {
        ::CloseHandle(h);
    }
    return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    while (nextStart < range.length)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);

    std::vector<std::thread> threads;
    for (size_t i = 0; i < ncores; ++i)
    {
        threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
    }
    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
    for (auto& t : threads)
    {
        t.join();
    }
    for (auto& r : ranges)
    {
        result += r.result;
    }
    result += remainderResult;
    return result;
}
void TestNativeThreads()
{
    const size_t DATA_SIZE = 800000000ULL;
    typedef std::vector<int32_t> DataVector;
    DataVector data;
    data.reserve(DATA_SIZE);
    for (size_t i = 0; i < DATA_SIZE; ++i)
    {
        data.push_back(static_cast<int32_t>(i));
    }
    Range r = { data.data(), data.size() };

    std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
    int32_t result = TestSingleThreaded(r);
    std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "Single threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;

    std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithNCores(r, 8);
    std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "Multi threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;

    std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithSTLThreads(r, 8);
    std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "std::thread sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;
}
Here is the output of this code on my machine:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data which is local, fits into CPU caches etc.
Maybe it would be interesting to experiment with how the data is split into ranges. What if each thread "jumped" through the data from start to end with a stride of ncores? Thread 1: 0, 8, 16, ...; thread 2: 1, 9, 17, ...; etc. Maybe the memory "locality" could then gain extra speed; a sketch of that variant follows below.
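For reference, the strided variant I have in mind would look roughly like this (a hypothetical sketch, not benchmarked here):
#include <cstdint>
#include <cstddef>

// Hypothetical strided partitioning: thread k sums elements k, k+ncores, k+2*ncores, ...
static int32_t SumStrided(const int32_t* data, size_t length, size_t offset, size_t stride)
{
    int32_t sum = 0;
    for (size_t i = offset; i < length; i += stride)
        sum += data[i];
    return sum;
}
Each thread i would then call SumStrided(range.data, range.length, i, ncores) instead of receiving a contiguous chunk.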
I wanted to see which would be faster to access, a struct or a tuple, so I wrote a small program. However, when it finishes running, the time recorded is 0.000000 for both. I'm pretty sure the program isn't finishing that fast (I'm running on an online compiler since I am away from home).
#include <iostream>
#include <cstdio>
#include <time.h>
#include <tuple>
#include <cstdlib>
using namespace std;

struct Item
{
    int x;
    int y;
};

typedef tuple<int, int> dTuple;

int main() {
    printf("Let's see which is faster...\n");
    //Timers
    time_t startTime;
    time_t currentTimeTuple;
    time_t currentTimeItem;
    //Delta times
    double deltaTimeTuple;
    double deltaTimeItem;
    //Collections
    dTuple tupleArray[100000];
    struct Item itemArray[100000];
    //Seed random number
    srand(time(NULL));
    printf("Generating tuple array...\n");
    //Initialize an array of tuples with random ints
    for(int i = 0; i < 100000; ++i)
    {
        tupleArray[i] = dTuple(rand() % 1000, rand() % 1000);
    }
    printf("Generating Item array...\n");
    //Initialize an array of Items
    for(int i = 0; i < 100000; ++i)
    {
        itemArray[i].x = rand() % 1000;
        itemArray[i].y = rand() % 1000;
    }
    //Begin timer for tuple array
    time(&startTime);
    //Iterate through the array of tuples and print out each value, timing how long it takes
    for(int i = 0; i < 100000; ++i)
    {
        printf("%d: %d", get<0>(tupleArray[i]), get<1>(tupleArray[i]));
    }
    //Get the time it took to go through the tuple array
    time(&currentTimeTuple);
    deltaTimeTuple = difftime(currentTimeTuple, startTime);
    //Start the timer for the array of Items
    time(&startTime);
    //Iterate through the array of Items and print out each value, timing how long it takes
    for(int i = 0; i < 100000; ++i)
    {
        printf("%d: %d", itemArray[i].x, itemArray[i].y);
    }
    //Get the time it took to go through the item array
    time(&currentTimeItem);
    deltaTimeItem = difftime(currentTimeItem, startTime);
    printf("\n\n");
    printf("It took %f seconds to go through the tuple array\nIt took %f seconds to go through the struct Item array\n", deltaTimeTuple, deltaTimeItem);
    return 0;
}
According to www.cplusplus.com/reference/ctime/time/, difftime should return the difference between two time_t.
time() generally returns the number of seconds since 00:00 hours, Jan 1, 1970 UTC (i.e., the current unix timestamp). So depending on your library implementation, anything below 1 second might appear as 0.
You should prefer <chrono> for benchmarking purposes:
chrono::high_resolution_clock::time_point t = chrono::high_resolution_clock::now();
// do something to benchmark
chrono::high_resolution_clock::time_point t2 = chrono::high_resolution_clock::now();
cout << "Exec in ms: " << chrono::duration_cast<chrono::milliseconds>(t2 - t).count() << endl;
You nevertheless have to consider the clock resolution. For example, on Windows it's about 15 ms, so if you are anywhere near 15 ms, or below, you should increase the number of iterations.
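For example, a sketch of averaging over many repetitions so the measured interval stays well above the clock resolution (the helper name is illustrative, not from your code):
#include <chrono>

// Sketch: repeat the measured work many times and report the average per call,
// so the measured interval is much larger than the clock's ~15 ms resolution.
template <typename F>
double time_per_call_ms(F&& work, int iterations)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    auto t2 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t2 - t1).count() / iterations;
}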
I recommend checking the assembly language listing before you start performance timings.
The first thing to check is that the assembly language is significantly different between accessing a tuple versus accessing your structure. You can get a rough estimate of the timing difference by counting the number of different instructions.
Secondly, I recommend looking at the definition of Tuple. Something tells me that it is a structure declared similarly to yours. Thus I predict your timings should be near equal.
Thirdly, you should also compare std::pair since you only have 2 items in your tuple and a std::pair has two elements. Again, it should be the same as your structure.
Lastly, be prepared for "noise" in your data. The noise may come from the data cache misses, other programs using the data cache, delegation of code between cores, and any I/O. The closer you can get to running your program on a single core, free of interruptions, the better quality your data will be.
time_t is a count of seconds, and 100,000 trivial operations on a fast, modern computer could indeed finish in less than a second.
I have a piece of code that I use to test various containers (e.g. deque and a circular buffer) when passing data from a producer (thread 1) to a consumer (thread 2). Each data item is represented by a struct with a pair of timestamps. The first timestamp is taken before the push in the producer, and the second one is taken when the data is popped by the consumer.
The container is protected with a pthread spinlock.
The machine runs Red Hat 5.5 with a 2.6.18 kernel (old!); it is a 4-core system with hyperthreading disabled. gcc 4.7 with the -std=c++11 flag was used in all tests.
The producer acquires the lock, timestamps the data, pushes it into the queue, unlocks, and sleeps in a busy loop for 2 microseconds (the only reliable way I found to sleep for precisely 2 microseconds on that system).
The consumer locks, pops the data, timestamps it and updates some statistics (running mean delay and standard deviation). The stats are printed every 5 seconds (M is the mean, M2 is the standard deviation) and then reset. I used gettimeofday() to obtain the timestamps, which means that the mean delay number can be thought of as the fraction of delays that exceed 1 microsecond.
Most of the time the output looks like this:
CNT=2500000 M=0.00935 M2=0.910238
CNT=2500000 M=0.0204112 M2=1.57601
CNT=2500000 M=0.0045016 M2=0.372065
but sometimes (probably 1 trial out of 20) like this:
CNT=2500000 M=0.523413 M2=4.83898
CNT=2500000 M=0.558525 M2=4.98872
CNT=2500000 M=0.581157 M2=5.05889
(note the mean number is much worse than in the first case, and it never recovers as the program runs).
I would appreciate thoughts on why this could happen. Thanks.
#include <iostream>
#include <string.h>
#include <stdexcept>
#include <sys/time.h>
#include <deque>
#include <thread>
#include <cstdint>
#include <cmath>
#include <unistd.h>
#include <pthread.h>
#include <xmmintrin.h> // _mm_pause()
int64_t timestamp() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return 1000000L * tv.tv_sec + tv.tv_usec;
}
// running mean and a second moment
struct StatsM2 {
    StatsM2() {}
    double m = 0;
    double m2 = 0;
    long count = 0;
    inline void update(long x, long c) {
        count = c;
        double delta = x - m;
        m += delta / count;
        m2 += delta * (x - m);
    }
    inline void reset() {
        m = m2 = 0;
        count = 0;
    }
    inline double getM2() { // running second moment
        return (count > 1) ? m2 / (count - 1) : 0.;
    }
    inline double getDeviation() {
        return std::sqrt(getM2());
    }
    inline double getM() { // running mean
        return m;
    }
};
// pause for usec microseconds using busy loop
int64_t busyloop_microsec_sleep(unsigned long usec) {
    int64_t t, tend;
    tend = t = timestamp();
    tend += usec;
    while (t < tend) {
        t = timestamp();
    }
    return t;
}

struct Data {
    Data() : time_produced(timestamp()) {}
    int64_t time_produced;
    int64_t time_consumed;
};
int64_t sleep_interval = 2;
StatsM2 statsm2;
std::deque<Data> queue;
bool producer_running = true;
bool consumer_running = true;
pthread_spinlock_t spin;
void producer() {
    producer_running = true;
    while (producer_running) {
        pthread_spin_lock(&spin);
        queue.push_back(Data());
        pthread_spin_unlock(&spin);
        busyloop_microsec_sleep(sleep_interval);
    }
}

void consumer() {
    int64_t count = 0;
    int64_t print_at = 1000000 / sleep_interval * 5;
    Data data;
    consumer_running = true;
    while (consumer_running) {
        pthread_spin_lock(&spin);
        if (queue.empty()) {
            pthread_spin_unlock(&spin);
            // _mm_pause();
            continue;
        }
        data = queue.front();
        queue.pop_front();
        pthread_spin_unlock(&spin);
        ++count;
        data.time_consumed = timestamp();
        statsm2.update(data.time_consumed - data.time_produced, count);
        if (count >= print_at) {
            std::cerr << "CNT=" << count << " M=" << statsm2.getM() << " M2=" << statsm2.getDeviation() << "\n";
            statsm2.reset();
            count = 0;
        }
    }
}
int main(void) {
    if (pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE) < 0)
        exit(2);
    std::thread consumer_thread(consumer);
    std::thread producer_thread(producer);
    sleep(40);
    consumer_running = false;
    producer_running = false;
    consumer_thread.join();
    producer_thread.join();
    return 0;
}
EDIT:
I believe that point 5 below is the only thing that can explain a 1/2 second latency. When both threads run on the same core, each would run for a long time and only then switch to the other.
The rest of the things on the list are too small to cause a 1/2 second delay.
You can use pthread_setaffinity_np to pin your threads to specific cores. You can try different combinations and see how performance changes.
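A sketch of what that pinning can look like on Linux with glibc (a hypothetical helper; std::thread::native_handle() is a pthread_t there):
#include <pthread.h>
#include <sched.h>
#include <thread>

// Sketch: pin a std::thread to one specific core (Linux/glibc; needs _GNU_SOURCE,
// which g++ defines by default when compiling C++).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}
For example, call pin_to_core(consumer_thread, 0) and pin_to_core(producer_thread, 1) right after the threads are created in main().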
EDIT #2:
More things you should take care of: (who said testing was simple...)
1. Make sure the consumer is already running when the producer starts producing. Not too important in your case as the producer is not really producing in a tight loop.
2. This is very important: you divide by count every time, which is not the right thing to do for your stats. This means that the first measurement in every stats window weighs a lot more than the last. To measure the median you would have to collect all the values; measuring the average and min/max, without collecting all the numbers, should give you a good enough picture of the latency.
It's not surprising, really.
1. The time is taken in Data(), but then the container spends time calling malloc.
2. Are you running 64 bit or 32? In 32 bit, gettimeofday is a system call, while in 64 bit it goes through the vDSO and doesn't enter the kernel... you may want to time gettimeofday itself and record the variance. Or roll your own timestamps using rdtsc.
The best would be to use cycles instead of microseconds, because microseconds are really too coarse for this scenario... the rounding to microseconds alone skews things considerably at such a small scale (see the sketch after this list).
3. Are you guaranteed to not get preempted between producer and consumer? I guess not. But this should not happen very frequently on a box dedicated to testing...
4. Is it 4 cores on a single socket or 2? If it's a 2-socket box, you want to have the 2 threads on the same socket, or you pay (at least) double for data transfer.
5. Make sure the threads are not running on the same core.
6. If the Data you transfer and the additional data (container node) are sharing cache lines (kind of likely) with other Data+node, the producer would be delayed by the consumer when it writes to the consumed timestamp. This is called false sharing. You can eliminate this by padding/aligning to 64 bytes and using an intrusive container.
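For point 6, a sketch of the padding/alignment (it reuses the timestamp() helper from your code; with alignas(64) every Data occupies its own cache line, so the consumer's write to time_consumed cannot invalidate a line the producer is still using):
#include <cstdint>

// Sketch: pad/align Data to a full 64-byte cache line to avoid false sharing.
// Assumes the timestamp() helper from the question is visible here.
struct alignas(64) Data {
    Data() : time_produced(timestamp()), time_consumed(0) {}
    int64_t time_produced;
    int64_t time_consumed;
    // alignas(64) rounds sizeof(Data) up to 64, so no explicit padding array is needed.
};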
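And for point 2, a sketch of a cycle-based timestamp using the TSC (assumes x86 and GCC/Clang; the raw cycle counts still need to be converted via the TSC frequency if you want wall time):
#include <cstdint>
#include <x86intrin.h> // __rdtsc

// Sketch: read the time-stamp counter; much cheaper and finer-grained than
// gettimeofday(), but the unit is CPU cycles, not microseconds.
static inline uint64_t rdtsc_now() {
    return __rdtsc();
}

// Usage: uint64_t t0 = rdtsc_now(); /* work */ uint64_t cycles = rdtsc_now() - t0;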
gettimeofday is not a good way to profile computation overhead. It is a wall clock, and your computer is multiprocessing. Even if you think you are not running anything else, the OS scheduler always has some other activities to keep the system running. To profile your process's overhead, you have to at least raise the priority of the process you are profiling. Also use a high-resolution timer or CPU ticks to do the timing measurement.
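For example, a monotonic nanosecond timestamp can serve as a drop-in replacement for gettimeofday() (a sketch; on older glibc you have to link with -lrt, and CLOCK_MONOTONIC is immune to wall-clock adjustments):
#include <cstdint>
#include <time.h>

// Sketch: nanosecond timestamps from the monotonic clock instead of gettimeofday().
static inline int64_t timestamp_ns() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return 1000000000LL * ts.tv_sec + ts.tv_nsec;
}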