pthread works fine only if the number of threads is small - c++

I am trying to encrypt a BMP image, pixel by pixel, and I use pthread to encrypt them in parallel.
My code is kinda like this:
struct arg_struct {
    int arg1;
    int arg2;
    ...
};

void* print_message(void* arguments) {
    struct arg_struct* args = (struct arg_struct*)arguments;
    // Encryption...
    // exit from current thread
    pthread_exit(NULL);
}

void multiThreadExample() {
    cout << "Start..." << endl;
    const int NUMBER_OF_THREADS = 50000; // number of pixels
    pthread_t thread[NUMBER_OF_THREADS];
    arg_struct arg[NUMBER_OF_THREADS];
    for (int i = 0; i < NUMBER_OF_THREADS; i++) {
        arg[i].arg1 = i;
        arg[i].arg2 = ... // give values to the arguments in arg_struct
        pthread_create(&thread[i], NULL, print_message, static_cast<void*>(&arg[i]));
    }
    for (int i = 0; i < NUMBER_OF_THREADS; i++) {
        pthread_join(thread[i], NULL);
    }
    cout << "Complete..." << endl;
    // streaming results to file using filebuf
}

int main() {
    multiThreadExample();
    return 0;
}
It works fine if the image is as small as 3*3.
If the image gets bigger, such as 240*164, the program freezes AFTER printing out "Complete...".
After several minutes, it says Segmentation fault (core dumped).
I'm not sure what makes the program freeze after the heaviest part (the encryption) has already completed. Is it because so many threads have occupied all my memory? The process used more than 10 GB at its peak.
Actually, I have tried to do this WITHOUT multithreading, and the program still freezes.

50000 threads is insane. You are most likely running out of memory allocating stacks for the threads: each thread needs its own stack, typically several megabytes of address space by default, which is consistent with the 10 GB+ you observed. In general, to get the most out of your CPU through parallelism you only need as many threads as there are CPU cores - that is the limit on the amount of actual concurrency the hardware can give you. Any more threads, and you are just burning resources on thread-creation overhead and context switches.
Instead, create a pool of threads equal to the number of cores you have, break your image up into chunks, and schedule those chunks onto the threads you created.
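A sketch of that chunked approach (a sketch only: encryptPixel is a hypothetical stand-in for your per-pixel encryption, and the chunking assumes pixels can be processed independently):

#include <pthread.h>
#include <unistd.h>
#include <algorithm>
#include <vector>

void encryptPixel(int pixelIndex); // hypothetical: your per-pixel encryption

struct ChunkArgs {
    int begin; // first pixel index in this chunk
    int end;   // one past the last pixel index
};

void* encryptChunk(void* p) {
    ChunkArgs* c = static_cast<ChunkArgs*>(p);
    for (int i = c->begin; i < c->end; i++)
        encryptPixel(i);
    return NULL;
}

void encryptImage(int numPixels) {
    // one thread per core instead of one per pixel
    int numThreads = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
    std::vector<pthread_t> threads(numThreads);
    std::vector<ChunkArgs> args(numThreads);
    int chunk = (numPixels + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; t++) {
        args[t].begin = std::min(numPixels, t * chunk);
        args[t].end = std::min(numPixels, (t + 1) * chunk);
        pthread_create(&threads[t], NULL, encryptChunk, &args[t]);
    }
    for (int t = 0; t < numThreads; t++)
        pthread_join(threads[t], NULL);
}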


Why would adding a delay improve data throughput in this multithreaded environment?

In my application, I have two threads, a producer (thread 1) and a consumer (thread 2). Each thread has an input and output interface (effectively a pointer to a list) that is connected to a third thread which serves as a router.
When the producer writes, it calls memcpy to copy data into a buffer and pushes the buffer into a list. Meanwhile, the router thread is round-robin searching through all the threads that are connected to it and monitoring their interfaces to see if any thread has data to send out. When it sees that thread 1's list is non-empty, it checks to determine which thread the data is intended for. The data is spliced into the destination thread's (in this case thread 2) input list, at which point thread 2 will malloc some memory, memcpy the data into it and return the pointer to this new region.
For my test, I'm measuring throughput to see how long it takes to send 100k messages of varying sizes. Thread 1 sends data of some size, thread 2 reads it and sends back a small reply message, which thread 1 reads; that is one complete exchange. In the first test, thread 1 sends all 100k messages and then reads the 100k replies. In the second test, thread 1 alternates between sending a message and waiting for its reply, 100k times. In both tests, thread 2 loops reading a message and sending a reply. I would expect test 1 to have higher throughput because the threads should spend less time waiting around; however, it has markedly worse throughput than test 2. I've measured how long individual read/write calls take in the two test cases, and they invariably take longer in test 1 (based on the means and medians, with no delay added), though the numbers are of the same order of magnitude.
When I add a loop doing nothing into thread 1's sending loop in test 1, I see dramatically improved throughput for this case as opposed to not having the delay. My only guess is that adding a delay slows down the producer so the consumer can absorb the data which prevents its input list from growing very large. I'm wondering if there may be other explanations and if so, how I can test for them.
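One test I'm considering is to add the back-pressure myself: if the consumer's input list growing without bound is what hurts test 1, then capping the number of in-flight messages should make test 1 behave like test 2. A rough sketch of the kind of bounded queue I mean (hypothetical; the real library's lists would have to be wrapped like this):

#include <condition_variable>
#include <deque>
#include <mutex>

// Bounded queue: push() blocks once `limit` items are in flight,
// which emulates the back-pressure that the do-nothing loop provides.
template <typename T>
class BoundedQueue {
    std::deque<T> items;
    std::mutex m;
    std::condition_variable notFull, notEmpty;
    const size_t limit;
public:
    explicit BoundedQueue(size_t limit) : limit(limit) {}
    void push(T v) {
        std::unique_lock<std::mutex> lock(m);
        notFull.wait(lock, [this] { return items.size() < limit; });
        items.push_back(std::move(v));
        notEmpty.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m);
        notEmpty.wait(lock, [this] { return !items.empty(); });
        T v = std::move(items.front());
        items.pop_front();
        notFull.notify_one();
        return v;
    }
};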
Edit
Unfortunately, my own code is just the test I described above which calls a library that actually performs the reads/writes, creates that third thread etc. It's difficult to make a minimal example out of it because the library is complex and not mine. I provide some pseudocode to illustrate the setup in more detail.
int NUM_ITERATIONS = 100000;
int msg_reply = 2;  // size of the reply message in words
int msg_size = 512; // indicates 512 64-bit words

void generate(int iterations, int size, interface* out){
    std::vector<long long> vec(size);
    for(int i = 0; i < size; i++)
        vec[i] = (long long) i;
    for(int i = 0; i < iterations; i++)
        out->lib_write((char*) vec.data(), size);
}

void receive(int iterations, int size, interface* in){
    for(int i = 0; i < iterations; i++){
        char* data = in->lib_read(size);
    }
}

void producer(interface* in, interface* out){
    // test 1
    start = std::chrono::high_resolution_clock::now();
    // write data of size msg_size, NUM_ITERATIONS times to out
    generate(NUM_ITERATIONS, msg_size, out);
    // read data of size msg_reply, NUM_ITERATIONS times from in
    receive(NUM_ITERATIONS, msg_reply, in);
    end = std::chrono::high_resolution_clock::now();
    // using NUM_ITERATIONS, msg_size and time, compute and print throughput to stdout
    print_throughput(end - start, "throughput_0", msg_size);

    // test 2
    start = std::chrono::high_resolution_clock::now();
    for(int j = 0; j < NUM_ITERATIONS; j++){
        generate(1, msg_size, out);
        receive(1, msg_reply, in);
    }
    end = std::chrono::high_resolution_clock::now();
    print_throughput(end - start, "throughput_1", msg_size);
}

void consumer(interface* in, interface* out){
    for(int i = 0; i < 2; i++){
        for(int j = 0; j < NUM_ITERATIONS; j++){
            receive(1, msg_size, in);
            generate(1, msg_reply, out);
        }
    }
}
The calls to lib_write() and lib_read() become fairly complex. To elaborate on the description above, the data gets memcpy'd into a buffer and then moved into a list. The interface has a condition variable member and the write calls its notify_one() method. The third thread is looping through all the interface pointers it has and checking to see if their lists are non-empty. If so, the data is spliced from one output list to the destination's input list using the splice() method in std::list. Meanwhile, the consumer calls the lib_read() which waits on the condition variable while the interface is empty, and then memcpy's the data into a new region and returns it.
// note: these will not compile as is. Undefined variables are class members
char* interface::lib_read(size_t* _size){
    char* ret;
    {
        std::unique_lock<std::mutex> lock(mutex);
        // packets is an std::list containing the incoming data
        while (packets.empty()) {
            cv.wait(lock);
        }
        curr_read_it = packets.begin();
    }
    size_t buff_size = curr_read_it->size;
    ret = (char*)malloc(buff_size);
    memcpy((char*)ret, (char*)curr_read_it->data, buff_size);
    {
        std::unique_lock<std::mutex> lock(mutex);
        packets.erase(curr_read_it);
        curr_read_it = packets.end();
    }
    return ret;
}

void interface::lib_write(char* data, int size){
    // indicates the destination thread id
    long long header = 1;
    // buffer is just an array that's max packet sized
    memcpy((char*)buffer.data, &header, sizeof(long long));
    memcpy((char*)buffer.data + sizeof(long long), (char*)data, size * sizeof(long long));
    std::lock_guard<std::mutex> guard(mutex);
    packets.push_back(std::move(buffer));
    cv.notify_one();
}
// this is on thread 3
void route(){
    do{
        // this is a vector containing all the "out" interfaces
        for(int i = 0; i < out_ptrs.size(); i++){
            interface<long long>* _out = out_ptrs[i];
            if(!_out->empty()){
                // this just returns the header id (also locks the mutex)
                long long dest = _out->get_dest();
                // looks up the correct interface based on the id and splices
                // a packet from _out into the appropriate one. Locks the mutex
                in_ptrs[dest_map[dest]]->splice(_out);
            }
        }
    } while(!done());
}
I was looking for general advice on what factors can influence multithreading performance and what to test in order to better understand what was going on.
I talked to some other people, and the most helpful advice was to determine whether OS scheduling was the issue (which is what I suspected but was unsure how to test). Essentially, I used taskset and sched_setaffinity() to force the application to run on one core or on a subset of cores, and compared those runs to each other and to the unrestricted case.
Based on the restrictions, I got dramatically different results and could see some trends, so I'm fairly confident in saying that it's an OS scheduling issue; different schedules can yield better performance under different workloads.
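For anyone trying the same thing: taskset -c 0 ./app pins the whole process to core 0 from the shell, and individual threads can be pinned programmatically. A minimal sketch of the per-thread variant (Linux-specific; needs _GNU_SOURCE with glibc):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core (cpu is the zero-based core index).
// Returns 0 on success, an error number otherwise.
int pinToCore(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}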

recursive threading with C++ gives a Resource temporarily unavailable

So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, counting the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>

using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;

    // Generate random value n
    srand(time(NULL));
    double n = rand() % 6 + 1;

    // initialize the thread info with n value
    info->randomN = n;
    info->maxOfVals = n;
    info->minOfVals = n;

    // the depth of recursion should not be more than 3
    if(info->level > 3)
    {
        info->run = false;
    }

    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for(int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]);
        tt.detach();
    }

    // checks if any child threads are working
    bool anyRun = true;
    while(anyRun)
    {
        anyRun = false;
        for(int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }

    // once all child threads are done, we find their max and min value
    double maximum = 1, minimum = 6;
    for(int i = 0; i < n; i++)
    {
        // cout << childInfo[i].maxOfVals << endl;
        if(childInfo[i].maxOfVals >= maximum)
            maximum = childInfo[i].maxOfVals;
        if(childInfo[i].minOfVals < minimum)
            minimum = childInfo[i].minOfVals;
    }
    info->maxOfVals = maximum;
    info->minOfVals = minimum;

    // we set the info->run value to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;
    srand(time(NULL));
    double n = rand() % 6 + 1;
    cout << "n is: " << n << endl;

    // initializing thread info
    info.randomN = n;
    info.maxOfVals = n;
    info.minOfVals = n;
    info.level = 1;
    info.run = true;

    std::thread t(ChildWork, &info);
    t.join();
    while(info.run);

    info.maxOfVals = max<unsigned long>(info.randomN, info.maxOfVals);
    info.minOfVals = min<unsigned long>(info.randomN, info.minOfVals);
    cout << "Max is: " << info.maxOfVals << " and Min is: " << info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: thread constructor failed: Resource
temporarily unavailable Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the thread's info level and then go on to create more threads regardless of the result of that check.
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However, you increment it on a freshly created instance of ThreadInfo here: ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0; basically, the level of each thread you spawn is 1.
In general, avoid using threads to achieve concurrency for I/O-bound operations (*). Just use threads to achieve concurrency for independent CPU-bound operations. As a rule of thumb, you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event-based system to run pseudo-concurrent I/O operations. You do not need any threading to do so. For example, a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is OK to have some threads that could theoretically be avoided.
Multithreading is still rocket science in 2019, especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.
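For illustration, here is a minimal sketch of the corrected structure (a sketch, not a drop-in replacement: it returns right after the depth check, propagates the level, and joins children instead of detaching them and spinning on a flag):

#include <algorithm>
#include <random>
#include <thread>
#include <vector>

struct Result { double minVal, maxVal; };

// Each call rolls its own n in [1,6], spawns n children one level deeper,
// joins them, and folds their min/max into its own result.
Result childWork(int level) {
    thread_local std::mt19937 gen(std::random_device{}());
    int n = std::uniform_int_distribution<int>(1, 6)(gen);
    Result r = { (double)n, (double)n };
    if (level >= 3) return r; // the early return that was missing
    std::vector<std::thread> kids;
    std::vector<Result> results(n);
    for (int i = 0; i < n; i++)
        kids.emplace_back([&results, i, level] { results[i] = childWork(level + 1); });
    for (auto& t : kids) t.join(); // join instead of busy-waiting on a flag
    for (const Result& c : results) {
        r.minVal = std::min(r.minVal, c.minVal);
        r.maxVal = std::max(r.maxVal, c.maxVal);
    }
    return r;
}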

c++ concurrency issue with scaling thread runtime

I have a program that performs the same function on a large array. I break the array into equal chunks and pass them to threads. Currently the threads perform the function and return what they are supposed to, BUT the more threads I add, the longer each thread takes to run, which totally negates the purpose of concurrency. I have tried std::thread and std::async, both with the same result. The amount of data processed by all child threads and the main thread is the same size (main has 6 more points), but what main runs in ~12 seconds takes each child thread ~12 seconds times the number of threads, as if they were running sequentially. Yet they all start at the same time, and if I output from each thread they are running concurrently. Does this have something to do with how they are being joined? I have tried everything I can think of; any help/advice is much appreciated! In the sample code, main doesn't run the function until after the child threads finish; if I put the join after main's work, it still doesn't run until the child threads finish. I measured the runtimes with 3 and 5 threads, on a downscaled dataset for testing.
void foo(char* arg1, long arg2, std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>& ftrV) {
    std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>> Grid;
    // does stuff....
    // fills in "Grid"
    ftrV.set_value(Grid);
}

int main(){
    int thnmb = 3; // # of threads
    std::vector<long> buffers;   // fill in buffers
    std::vector<char*> pointers; // fill in pointers
    std::vector<std::promise<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> PV(thnmb); // vector of promised grids
    std::vector<std::future<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>>> FV(thnmb);  // vector of future grids
    std::vector<std::thread> th(thnmb); // vector of threads
    std::vector<std::vector<std::vector<std::vector<std::vector<std::vector<long>>>>>> vt1(thnmb); // vector to store thread grids
    for (int i = 0; i < thnmb; i++) {
        th[i] = std::thread(&foo, pointers[i], buffers[i], std::ref(PV[i]));
    }
    for (int i = 0; i < thnmb; i++) {
        FV[i] = PV[i].get_future();
    }
    for (int i = 0; i < thnmb; i++) {
        vt1[i] = FV[i].get();
    }
    for (int i = 0; i < thnmb; i++) {
        th[i].join();
    }
    // main performs the same function as foo here
    // combine data
    // do other stuff..
    return(0);
}
It's hard to give a definitive answer without knowing what foo does, but you're probably running into memory access issues. Each access to your five-dimensional array requires five dependent memory lookups, and it only takes 2 or 3 memory-bound threads to saturate the bandwidth a typical system can deliver.
main should perform its foo work after creating the threads but before getting the values of the promises.
And foo should probably end with ftrV.set_value(std::move(Grid)) so that a copy of that array won't have to be made.
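If the five chained lookups are indeed the bottleneck, flattening the grid into one contiguous allocation is worth trying: every access becomes one index computation plus a single dereference. A sketch (Grid5D is hypothetical and assumes the five extents are known up front):

#include <cstddef>
#include <vector>

// Flat 5-D grid: one contiguous buffer, one dereference per access,
// instead of five dependent pointer chases through nested vectors.
struct Grid5D {
    std::vector<long> data;
    size_t d2, d3, d4, d5; // inner extents (the outermost extent is implied)
    Grid5D(size_t n1, size_t n2, size_t n3, size_t n4, size_t n5)
        : data(n1 * n2 * n3 * n4 * n5), d2(n2), d3(n3), d4(n4), d5(n5) {}
    long& at(size_t i, size_t j, size_t k, size_t l, size_t m) {
        return data[(((i * d2 + j) * d3 + k) * d4 + l) * d5 + m];
    }
};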

C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Question:
The three while loops below contain code that has been commented out; I tagged them ("TAG1", "TAG2", and "TAG3") for easy searching. I simply want each while loop to wait for its tested condition to become true before proceeding, while using as little CPU as possible. I first tried Boost condition variables, but there's a race condition. Putting the thread to sleep for 'x' microseconds is inefficient because there is no way to precisely time the wakeup. Finally, boost::this_thread::yield() does not seem to do anything, probably because I only have 2 active threads on a dual-core system. Specifically, how can I make the three tagged areas below run more efficiently while introducing as little unnecessary blocking as possible?
BACKGROUND
Objective:
I have an application that logs a lot of data. After profiling, I found that much of the time is consumed by the logging operations (logging text or binary data to a file on the local hard disk). My objective is to reduce the latency of logData calls by replacing the non-threaded direct write calls with calls to a threaded, buffered stream logger.
Options Explored:
Upgrade 2005-era slow hard disk to SSD...possible. Cost is not prohibitive...but involves a lot of work... more than 200 computers would have to be upgraded...
Boost ASIO...I don't need all the proactor / networking overhead, looking for something simpler and more light-weight.
Design:
Producer and consumer thread pattern: the application writes data into a buffer, and a background thread then writes it to disk sometime later. So the ultimate goal is to have the writeMessage function called by the application layer return as fast as possible while the data is correctly and completely logged to the log file, in FIFO order, sometime later.
Only one application thread, only one writer thread.
Based on ring buffer. The reason for this decision is to use as few locks as possible and ideally...and please correct me if I'm wrong...I don't think I need any.
Buffer is a statically-allocated character array, but could move it to the heap if needed / desired for performance reasons.
Buffer has a start pointer that points to the next character that should be written to the file. Buffer has an end pointer that points to the array index after the last character to be written to the file. The end pointer NEVER passes the start pointer. If a message comes in that is larger than the buffer, then the writer waits until the buffer is emptied and writes the new message to the file directly without putting the over-sized message in the buffer (once the buffer is emptied, the worker thread won't be writing anything so no contention).
The writer (worker thread) only updates the ring buffer's start pointer.
The main (application thread) only updates the ring buffer's end pointer, and again, it only inserts new data into the buffer when there is available space...otherwise it either waits for space in the buffer to become available or writes directly as described above.
The worker thread continuously checks to see if there is data to be written (indicated by the case when the buffer start pointer != buffer end pointer). If there is no data to be written, the worker thread should ideally go to sleep and wake up once the application thread has inserted something into the buffer (and changed the buffer's end pointer such that it no longer points to the same index as the start pointer). What I have below involves while loops continuously checking that condition. It is a very bad / inefficient way of waiting on the buffer.
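What I think the wait should look like is a timed predicate wait on the condition variable, re-checking the condition on each wakeup; the short timeout bounds the damage if a notification is lost (the pointers change outside the lock). A sketch using the members defined below (the update further down adopts essentially this):

// worker-side wait: sleep until there is data or shutdown is requested
boost::unique_lock<boost::mutex> u(workmtx);
while (start == end && workerthreadkeepalive) {
    waitforwork.wait_for(u, boost::chrono::microseconds(WORKER_LOOP_WAIT_MICROSEC));
}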
Results:
On my 2009-era dual-core laptop with an SSD, I see that the total write time of the threaded / buffered benchmark vs. direct write is about 6 : 1 (0.609 sec vs. 0.095 sec), but highly variable; often the buffered write benchmark is actually slower than the direct write. I believe that the variability is due to the poor implementation of waiting for space to free up in the buffer, waiting for the buffer to empty, and having the worker thread wait for work to become available. I have measured that some of the while loops consume over 10000 cycles, and I suspect that those cycles are actually competing for hardware resources that the other thread (worker or application) needs to finish the computation being waited on.
Output seems to check out. With TEST mode enabled and a small buffer size of 10 as a stress test, I diffed hundreds of MBs of output and found it to equal the input.
Compiles with current version of Boost (1.55)
Header
#ifndef BufferedLogStream_h
#define BufferedLogStream_h

#include <stdio.h>
#include <iostream>
#include <cstdlib>
#include "boost/chrono/chrono.hpp"
#include "boost/thread/thread.hpp"
#include "boost/thread/locks.hpp"
#include "boost/thread/mutex.hpp"
#include "boost/thread/condition_variable.hpp"
#include <time.h>

using namespace std;

#define BENCHMARK_STR_SIZE 128
#define NUM_BENCHMARK_WRITES 524288
#define TEST 0
#define BENCHMARK 1
#define WORKER_LOOP_WAIT_MICROSEC 20
#define MAIN_LOOP_WAIT_MICROSEC 10

#if(TEST)
    #define BUFFER_SIZE 10
#else
    #define BUFFER_SIZE 33554432 //32 MB
#endif

class BufferedLogStream {
public:
    BufferedLogStream();
    void openFile(char* filename);
    void flush();
    void close();
    inline void writeMessage(const char* message, unsigned int length);
    void writeMessage(string message);
    bool operator() () { return start != end; }

private:
    void threadedWriter();
    inline bool hasSomethingToWrite();
    inline unsigned int getFreeSpaceInBuffer();
    void appendStringToBuffer(const char* message, unsigned int length);

    FILE* fp;
    char* start;
    char* end;
    char* endofringbuffer;
    char ringbuffer[BUFFER_SIZE];
    bool workerthreadkeepalive;
    boost::mutex mtx;
    boost::condition_variable waitforempty;
    boost::mutex workmtx;
    boost::condition_variable waitforwork;

#if(TEST)
    struct testbuffer {
        int length;
        char message[BUFFER_SIZE * 2];
    };
public:
    void test();
private:
    void getNextRandomTest(testbuffer &tb);
    FILE* datatowrite;
#endif

#if(BENCHMARK)
public:
    void runBenchmark();
private:
    void initBenchmarkString();
    void runDirectWriteBaseline();
    void runBufferedWriteBenchmark();
    char benchmarkstr[BENCHMARK_STR_SIZE];
#endif
};

#if(TEST)
int main() {
    BufferedLogStream* bl = new BufferedLogStream();
    bl->openFile("replicated.txt");
    bl->test();
    bl->close();
    cout << "Done" << endl;
    cin.get();
    return 0;
}
#endif

#if(BENCHMARK)
int main() {
    BufferedLogStream* bl = new BufferedLogStream();
    bl->runBenchmark();
    cout << "Done" << endl;
    cin.get();
    return 0;
}
#endif //for benchmark

#endif
Implementation
#include "BufferedLogStream.h"
BufferedLogStream::BufferedLogStream() {
fp = NULL;
start = ringbuffer;
end = ringbuffer;
endofringbuffer = ringbuffer + BUFFER_SIZE;
workerthreadkeepalive = true;
}
void BufferedLogStream::openFile(char* filename) {
if(fp) close();
workerthreadkeepalive = true;
boost::thread t2(&BufferedLogStream::threadedWriter, this);
fp = fopen(filename, "w+b");
}
void BufferedLogStream::flush() {
fflush(fp);
}
void BufferedLogStream::close() {
workerthreadkeepalive = false;
if(!fp) return;
while(hasSomethingToWrite()) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
flush();
fclose(fp);
fp = NULL;
}
void BufferedLogStream::threadedWriter() {
    while(true) {
        if(start != end) {
            char* currentend = end;
            if(start < currentend) {
                fwrite(start, 1, currentend - start, fp);
            }
            else if(start > currentend) {
                if(start != endofringbuffer) fwrite(start, 1, endofringbuffer - start, fp);
                fwrite(ringbuffer, 1, currentend - ringbuffer, fp);
            }
            start = currentend;
            waitforempty.notify_one();
        }
        else { //start == end...no work to do
            if(!workerthreadkeepalive) return;
            boost::unique_lock<boost::mutex> u(workmtx);
            waitforwork.wait_for(u, boost::chrono::microseconds(WORKER_LOOP_WAIT_MICROSEC));
        }
    }
}

bool BufferedLogStream::hasSomethingToWrite() {
    return start != end;
}
void BufferedLogStream::writeMessage(string message) {
    writeMessage(message.c_str(), message.length());
}

unsigned int BufferedLogStream::getFreeSpaceInBuffer() {
    if(end > start) return (start - ringbuffer) + (endofringbuffer - end) - 1;
    if(end == start) return BUFFER_SIZE - 1;
    return start - end - 1; //case where start > end
}

void BufferedLogStream::appendStringToBuffer(const char* message, unsigned int length) {
    if(end + length <= endofringbuffer) { //most common case for an appropriately-sized buffer
        memcpy(end, message, length);
        end += length;
    }
    else {
        int lengthtoendofbuffer = endofringbuffer - end;
        if(lengthtoendofbuffer > 0) memcpy(end, message, lengthtoendofbuffer);
        int remainderlength = length - lengthtoendofbuffer;
        memcpy(ringbuffer, message + lengthtoendofbuffer, remainderlength);
        end = ringbuffer + remainderlength;
    }
}
void BufferedLogStream::writeMessage(const char* message, unsigned int length) {
    if(length > BUFFER_SIZE - 1) { //if the string is too large for the buffer, wait for the buffer to empty and bypass it, writing directly to the file
        while(hasSomethingToWrite()) {
            boost::unique_lock<boost::mutex> u(mtx);
            waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
        }
        fwrite(message, 1, length, fp);
    }
    else {
        //wait until there is enough free space to insert the new string
        while(getFreeSpaceInBuffer() < length) {
            boost::unique_lock<boost::mutex> u(mtx);
            waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
        }
        appendStringToBuffer(message, length);
    }
    waitforwork.notify_one();
}
#if(TEST)
void BufferedLogStream::getNextRandomTest(testbuffer &tb) {
    tb.length = 1 + (rand() % (int)(BUFFER_SIZE * 1.05));
    for(int i = 0; i < tb.length; i++) {
        tb.message[i] = rand() % 26 + 65;
    }
    tb.message[tb.length] = '\n';
    tb.length++;
    tb.message[tb.length] = '\0';
}

void BufferedLogStream::test() {
    cout << "Buffer size is: " << BUFFER_SIZE << endl;
    testbuffer tb;
    datatowrite = fopen("orig.txt", "w+b");
    for(unsigned int i = 0; i < 7000000; i++) {
        if(i % 1000000 == 0) cout << i << endl;
        getNextRandomTest(tb);
        writeMessage(tb.message, tb.length);
        fwrite(tb.message, 1, tb.length, datatowrite);
    }
    fflush(datatowrite);
    fclose(datatowrite);
}
#endif
#if(BENCHMARK)
void BufferedLogStream::initBenchmarkString() {
    for(unsigned int i = 0; i < BENCHMARK_STR_SIZE - 1; i++) {
        benchmarkstr[i] = rand() % 26 + 65;
    }
    benchmarkstr[BENCHMARK_STR_SIZE - 1] = '\n';
}

void BufferedLogStream::runDirectWriteBaseline() {
    clock_t starttime = clock();
    fp = fopen("BenchMarkBaseline.txt", "w+b");
    for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
        fwrite(benchmarkstr, 1, BENCHMARK_STR_SIZE, fp);
    }
    fflush(fp);
    fclose(fp);
    clock_t elapsedtime = clock() - starttime;
    cout << "Direct write baseline took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}

void BufferedLogStream::runBufferedWriteBenchmark() {
    clock_t starttime = clock();
    openFile("BufferedBenchmark.txt");
    cout << "Opened file" << endl;
    for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
        writeMessage(benchmarkstr, BENCHMARK_STR_SIZE);
    }
    cout << "Wrote" << endl;
    close();
    cout << "Close" << endl;
    clock_t elapsedtime = clock() - starttime;
    cout << "Buffered write took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}

void BufferedLogStream::runBenchmark() {
    cout << "Buffer size is: " << BUFFER_SIZE << endl;
    initBenchmarkString();
    runDirectWriteBaseline();
    runBufferedWriteBenchmark();
}
#endif
Update: November 25, 2013
I updated the code to use boost::condition_variable, specifically the wait_for() method, as recommended by Evgeny Panasyuk. This avoids unnecessarily checking the same condition over and over again. I am currently seeing the buffered version run in about 1/6th the time of the unbuffered / direct-write version. This is not the ideal case because both cases are limited by the hard disk (in my case a 2010-era SSD). I plan to use the code in an environment where the hard disk will not be the bottleneck, and most if not all the time the buffer should have space available to accommodate the writeMessage requests. That brings me to my next question. How big should I make the buffer? I don't mind allocating 32 MB or 64 MB to ensure that it never fills up. The code will be running on systems that can spare that. Intuitively, I feel that it's a bad idea to statically allocate a 32 MB character array. Is it? Anyhow, I expect that when I run this code for my intended application, the latency of logData() calls will be greatly reduced, which will yield a significant reduction in overall processing time.
If anyone sees any way to make the code better (faster, more robust, leaner, etc.), please let me know. I appreciate the feedback. Lazin, how would your approach be faster or more efficient than what I have posted? I kinda like the idea of just having one buffer and making it large enough that it practically never fills up; then I don't have to worry about reading from different buffers. Evgeny Panasyuk, I like the approach of using existing code whenever possible, especially if it's an existing Boost library. However, I also don't see how the spsc_queue is more efficient than what I have: I'd rather deal with one large buffer than many smaller ones and have to worry about splitting my input stream on the way in and splicing it back together on the way out. Your approach would allow me to offload the formatting from the main thread onto the worker thread, which is a clever approach. But I'm not sure yet whether it will save a lot of time, and to realize the full benefit I would have to modify code that I do not own.
//End Update
General solution.
I think you should look at the Nagle algorithm. For one producer and one consumer, it would look like this:
At the beginning the buffer is empty, and the worker thread is idle, waiting for events.
The producer writes data into the buffer and notifies the worker thread.
The worker thread wakes up and starts the write operation.
The producer tries to write another message, but the buffer is in use by the worker, so the producer allocates another buffer and writes the message into it.
The producer tries to write another message; the I/O is still in progress, so the producer writes the message into the previously allocated buffer.
The worker thread finishes writing the buffer to the file and sees that there is another buffer with data, so it grabs it and starts writing.
The very first buffer is then used by the producer for all consecutive messages, while the second write operation is in progress.
This scheme helps meet the low-latency requirement: a single message is written to disk immediately, while a large burst of events is written in large batches for greater throughput.
If your log messages have levels, you can improve this scheme a little. Error messages have high priority (level) and must be saved to disk immediately (they are rare but very valuable), while debug and trace messages have low priority and can be buffered to save bandwidth (they are very frequent but not as valuable as error and info messages). So when you write an error message, you must wait until the worker thread has finished writing your message (and all messages in the same buffer) and only then continue, but debug and trace messages can simply be written to the buffer.
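A sketch of that front end (Severity and the three primitives are hypothetical names for the pieces described above):

enum class Severity { Trace, Debug, Info, Error };

// assumed primitives provided by the buffered stream:
void appendToBuffer(const char* msg, unsigned int len); // enqueue into the ring buffer
void notifyWorker();                                    // wake the writer thread
void waitUntilFlushed();                                // block until the worker has drained the buffer

void log(Severity level, const char* msg, unsigned int len) {
    appendToBuffer(msg, len);
    notifyWorker();
    if (level >= Severity::Error)
        waitUntilFlushed(); // rare, valuable messages reach disk before we return
}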
Threading.
Spawning a worker thread for each application thread is too costly. Use a single writer thread per log file. Write buffers must be shared between threads. Each buffer has two pointers, commit_pointer and prepare_pointer. All buffer space between the beginning of the buffer and commit_pointer is available to the worker thread. The space between commit_pointer and prepare_pointer is currently being updated by application threads. Invariant: commit_pointer <= prepare_pointer.
A write operation is performed in two steps:
Prepare write. This operation reserves space in the buffer.
The producer calculates len(message) and atomically updates prepare_pointer;
the old prepare_pointer value and len are saved locally by the producer.
Commit write.
The producer writes the message at the beginning of the reserved buffer space (the old prepare_pointer value).
The producer busy-waits until commit_pointer is equal to the old prepare_pointer value it saved in its local variable.
The producer commits the write operation by doing commit_pointer = commit_pointer + len atomically.
To prevent false sharing, len(message) can be rounded up to the cache line size, and the extra space filled with spaces.
// pseudocode
void write(const char* message) {
    int len = strlen(message); // TODO: round to cache line size
    const char* old_prepare_ptr;
    // Prepare step
    while(1)
    {
        old_prepare_ptr = prepare_ptr;
        if (CAS(&prepare_ptr, old_prepare_ptr, prepare_ptr + len) == old_prepare_ptr)
            break;
        // retry if another thread performed a prepare op
    }
    // Write message
    memcpy((void*)old_prepare_ptr, (void*)message, len);
    // Commit step
    while(1)
    {
        const char* old_commit_ptr = commit_ptr;
        if (CAS(&commit_ptr, old_commit_ptr, old_commit_ptr + len) == old_commit_ptr)
            break;
        // retry if another thread commits
    }
    notify_worker_thread();
}
concurrent_queue<T, Size>
The question that I have is how to make the worker thread go to work as soon as there is work to do and sleep when there is no work.
There is boost::lockfree::spsc_queue - a wait-free single-producer single-consumer queue. It can be configured to have a compile-time capacity (the size of the internal ring buffer).
From what I understand, you want something similar to the following configuration:
template<typename T, size_t N>
class concurrent_queue
{
    // T can be wrapped into a struct with padding in order to avoid false sharing
    mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
    mutable mutex m;
    mutable condition_variable c;

    void wait() const
    {
        unique_lock<mutex> u(m);
        c.wait_for(u, chrono::microseconds(1)); // Or whatever period you need.
        // Timeout is required, because modification happens not under mutex
        // and notification can be lost.
        // Another option is just to use sleep/yield, without notifications.
    }
    void notify() const
    {
        c.notify_one();
    }
public:
    void push(const T &t)
    {
        while(!q.push(t))
            wait();
        notify();
    }
    void pop(T &result)
    {
        while(!q.pop(result))
            wait();
        notify();
    }
};
When there are elements in the queue, pop does not block; and when there is enough space in the internal buffer, push does not block.
concurrent<T>
I want to reduce both formatting and write times as much as possible so I plan to reduce both.
Check out Herb Sutter's talk at C++ and Beyond 2012: C++ Concurrency. On page 14 he shows an example of concurrent<T>. Basically it is a wrapper around an object of type T which starts a separate thread for performing all operations on that object. Usage is:
concurrent<ostream*> x(&cout); // starts thread internally
// ...
// x acts as a function object.
// Its function call operator accepts an action
// which is performed on the wrapped object in a separate thread.
int i = 42;
x([i](ostream *out){ *out << "i=" << i; }); // passing lambda as action
You can use a similar pattern in order to offload all formatting work to the consumer thread.
Small Object Optimization
Otherwise, new buffers are allocated and I want to avoid memory allocation after the buffer stream is constructed.
The concurrent_queue<T, Size> example above uses a fixed-size buffer which is fully contained within the queue and does not imply additional allocations.
However, Herb's concurrent<T> example uses std::function to pass actions into the worker thread, and that may incur a costly allocation.
std::function implementations may use the Small Object Optimization (and most implementations do): small function objects are copy-constructed in place in an internal buffer. But there is no guarantee, and for function objects bigger than the threshold a heap allocation will happen.
There are several options to avoid this allocation:
Implement a std::function analog with an internal buffer large enough to hold the target function objects (for example, you can try to modify boost::function or this version).
Use your own function object which would represent all types of log messages. Basically it would contain just the values required to format the message. As there are potentially different types of messages, consider using boost::variant (which is literally a union coupled with a type tag) to represent them.
Putting it all together, here is a proof-of-concept (using the second option):
LIVE DEMO
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/optional.hpp>
#include <boost/variant.hpp>
#include <condition_variable>
#include <iostream>
#include <cstddef>
#include <thread>
#include <chrono>
#include <mutex>

using namespace std;

/*********************************************/
template<typename T, size_t N>
class concurrent_queue
{
    mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
    mutable mutex m;
    mutable condition_variable c;

    void wait() const
    {
        unique_lock<mutex> u(m);
        c.wait_for(u, chrono::microseconds(1));
    }
    void notify() const
    {
        c.notify_one();
    }
public:
    void push(const T &t)
    {
        while(!q.push(t))
            wait();
        notify();
    }
    void pop(T &result)
    {
        while(!q.pop(result))
            wait();
        notify();
    }
};
/*********************************************/
template<typename T, typename F>
class concurrent
{
    typedef boost::optional<F> Job;

    mutable concurrent_queue<Job, 16> q; // use custom size
    mutable T x;
    thread worker;
public:
    concurrent(T x)
        : x{x}, worker{[this]
        {
            Job j;
            while(true)
            {
                q.pop(j);
                if(!j) break;
                (*j)(this->x); // you may need to handle exceptions in some way
            }
        }}
    {}
    void operator()(const F &f)
    {
        q.push(Job{f});
    }
    ~concurrent()
    {
        q.push(Job{});
        worker.join();
    }
};
/*********************************************/
struct LogEntry
{
    struct Formatter
    {
        typedef void result_type;
        ostream *out;

        void operator()(double x) const
        {
            *out << "floating point: " << x << endl;
        }
        void operator()(int x) const
        {
            *out << "integer: " << x << endl;
        }
    };

    boost::variant<int, double> data;

    void operator()(ostream *out)
    {
        boost::apply_visitor(Formatter{out}, data);
    }
};
/*********************************************/
int main()
{
    concurrent<ostream*, LogEntry> log{&cout};
    for(int i = 0; i != 1024; ++i)
    {
        log({i});
        log({i / 10.});
    }
}

c++ pthread multithreading for 2 x Intel Xeon X5570, quad-core CPUs on Amazon EC2 HPC Ubuntu instance

I wrote a program that employs multithreading for parallel computing. I have verified that on my system (OS X) it maxes out both cores simultaneously. I just ported it to Ubuntu with no modifications needed, because I coded it with that platform in mind. In particular, I am running the Canonical HVM Oneiric image on an Amazon EC2 cluster-compute 4xlarge instance. Those machines feature two Intel Xeon X5570 quad-core CPUs.
Unfortunately, my program does not achieve any parallel speedup on the EC2 machine: running more than one thread actually slows the computation marginally for each additional thread. Running top while my program is running shows that when more than one thread is initialized, the system% of CPU consumption is roughly proportional to the number of threads. With only one thread, %sy is ~0.1. In either case, user% never goes above ~9%.
The following are the threading-relevant sections of my code
const int NUM_THREADS = N; //where changing N is how I set the # of threads

void Threading::Setup_Threading()
{
    sem_unlink("producer_gate");
    sem_unlink("consumer_gate");
    producer_gate = sem_open("producer_gate", O_CREAT, 0700, 0);
    consumer_gate = sem_open("consumer_gate", O_CREAT, 0700, 0);
    completed = 0;
    queued = 0;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
}

void Threading::Init_Threads(vector <NetClass> * p_Pop)
{
    thread_list.assign(NUM_THREADS, pthread_t());
    for(int q = 0; q < NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, (void*) p_Pop);
}

void* Consumer(void* argument)
{
    std::vector <NetClass>* p_v_Pop = (std::vector <NetClass>*) argument;
    while(1)
    {
        sem_wait(consumer_gate);
        pthread_mutex_lock(&access_queued);
        int index = queued;
        queued--;
        pthread_mutex_unlock(&access_queued);
        Run_Gen( (*p_v_Pop)[index-1] );
        completed--;
        if(!completed)
            sem_post(producer_gate);
    }
}

int main()
{
    ...
    t1 = time(NULL);
    threads.Init_Threads(p_Pop_m);
    for(int w = 0; w < MONTC_NUM_TRIALS; w++)
    {
        queued = MONTC_POP;
        completed = MONTC_POP;
        for(int q = MONTC_POP-1; q > -1; q--)
            sem_post(consumer_gate);
        sem_wait(producer_gate);
    }
    threads.Close_Threads();
    t2 = time(NULL);
    cout << difftime(t2, t1);
    ...
}
OK, just a guess. There is a simple way to transform your parallel code into sequential code. For example:
thread_func:
    while (1) {
        pthread_mutex_lock(m1);
        // do something
        pthread_mutex_unlock(m1);
        ...
        pthread_mutex_lock(mN);
        pthread_mutex_unlock(mN);
    }
If you run such code in several threads, you will not see any speedup, because of the mutex usage. The code will run sequentially, not in parallel: only one thread does work at any moment.
The nasty part is that you may not use any mutex explicitly in your program and still end up in this situation. For example, a call to "malloc" may take a mutex somewhere in the C runtime, a call to "write" may take a mutex somewhere in the Linux kernel, and even a call to gettimeofday may involve a mutex lock/unlock (and on Linux/glibc, it does).
You may have only one mutex but spend a lot of time under it, and that alone can cause this behaviour.
And because mutexes may be taken somewhere in the kernel and in the C/C++ runtime, you can see different behaviour with different OSes.
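You can demonstrate the effect with a toy program: the same amount of work shows no speedup when every iteration takes a shared mutex, while independent per-thread work scales with the cores. A minimal sketch:

#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
long long sharedSum = 0;

// all useful work happens under one mutex: adding threads cannot help
void serialized(long long iters) {
    for (long long i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> lock(m);
        sharedSum += i;
    }
}

// independent per-thread work: this is what actually runs in parallel
void independent(long long iters, long long* out) {
    long long acc = 0;
    for (long long i = 0; i < iters; ++i)
        acc += i;
    *out = acc;
}

template <typename F>
double timeThreads(int n, F f) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < n; ++t)
        ts.emplace_back(f, t);
    for (auto& t : ts)
        t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 4;
    const long long iters = 10000000;
    std::vector<long long> results(n);
    std::cout << "serialized:  " << timeThreads(n, [&](int) { serialized(iters); }) << " s\n";
    std::cout << "independent: " << timeThreads(n, [&](int t) { independent(iters, &results[t]); }) << " s\n";
}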