Atomic Variables Accessed Multiple Times From One Function

Atomic Variables Accessed Multiple Times From One Function - c++

I have the following code:
The header:
class Counter
{
public:
Conuter(const std::string& fileName);
boost::uint16_t getCounter();
private:
tbb::atomic<boost::uint32_t> counter;
std::string counterFileName;
};
The cpp:
Counter::Counter(const std::string& fileName) : counter(), counterFileName(fileName)
{
std::string line;
std::ifstream counterFile (fileName.c_str());
if (counterFile.is_open())
{
getline (counterFile, line);
counterFile.close();
}
boost::uint32_t temp = std::stoul (line,nullptr,0);
counter = temp;
}
boost::uint32_t Counter::getCounter()
{
if (counter > 1000)
{
counter = 0;
}
assert( counter < 1000);
const boost::uint32_t ret = counter++;
if ((counter % 10) == 0)
{
// write the counter back to the file.
std::ofstream file (counterFileName.c_str());
if (file.is_open())
{
myfile << counter;
myfile.close();
}
}
return ret;
}
And elsewhere lets say we have two threads:
boost::thread t1(&Counter::getCounter, counter);
boost::thread t2(&Counter::getCounter, counter);
My question is around the atomic variable. Since the getCounter function can access the counter value up to 3 times per call can the atomic variable change from one call to the next. For example, if the call to if (counter > 1000) fails to pass is there any sort of guarantee that the assert will also pass? Maybe more concretely does the atomic variable block until the end of the function call? Or just as long as the value is being read/written? My second question is, how does the operating system handle the atomic? As in if the atomic doesn't cause the function to block until it is finished, what happens if one thread is updating the variable and one thread is attempting to write it out? Sorry for the abundance of questions this is my first attempt at a lock free data structure.

First of all, even in a single-threaded app, sequence of
if (counter > 1000) ...
assert(counter < 1000)
will fail when counter is 1000.
Second, yes, atomic variable can change between reads easily. The whole point of it is that the single read is atomic, and if another thread is updating the variable exactly when it is read, you are guaranteed to have a proper memory-ordered read (you also have some guarantees regarding arithmetics on it - your increment is guaranteed to increment). But it says nothing about next read!
If you need to lock the variable, you need to use locking mechansims, such as mutexes.

Related

One more releasing and acquiring locks make performance worse unexpectedly

My program has 8 writing threads and one persistence thread. The following code is the core of the persistence thread
std::string longLine;
myMutex.lock();
while (!myQueue.empty()) {
std::string& head = myQueue.front();
const int hSize = head.size();
if(hSize < blockMaxSize)
break;
longLine += head;
myQueue.pop_front();
}
myMutex.unlock();
flushToFile(longLine);
The performance is acceptable (millions of writings finished in hundreds of milliseconds). I still hope to improve the code by avoiding string copying so that I change the code as followed:
myMutex.lock();
while (!myQueue.empty()) {
const int hsize = myQueu.front().size();
if(hsize < blockMaxSize)
break;
std::string head{std::move(myQueue.front())};
myQueue.pop_front();
myMutex.unlock();
flushToFile(head);
myMutex.lock();
}
myMutex.unlock();
It is surprising that the performance drops sharply to millions of writings finished in quite a few seconds. Debugging shows most of time was spent on waiting for the lock after flushing the file.
But I don't understand why. Any one could help?
Not understand more time spent on wait for the lock

Possibly faster. Do all your string concatenations inside the flush function. That way your string concatenation won't block the writer threads trying to append to the queue. This is possibly a micro-optimization.
While we're at it. Let's establish that myQueue is a vector and not a queue or list class. This will be faster since the only operations on the collection are an append or total erase.
std::string longLine;
std::vector<std::string> tempQueue;
myMutex.lock();
if (myQueue.size() >= blockMaxSize) {
tempQueue = std::move(myQueue);
myQueue = {}; // not sure if this is needed
}
myMutex.unlock();
flushToFileWithQueue(tempQueue);
Where flushToFileWithQueue is this:
void flushToFileWithQueue(std::vector<std::string>& queue) {
string longLine;
for (size_t i = 0; i < queue.size(); i++) {
longline += queue[i];
}
queue.resize(0); // faster than calling .pop() N times
flushToFile(longLine);
}
You didn't show what wakes up the persistence thread. If it's polling instead of using a proper condition variable, let me know and I'll show you how to use that.
Also make use of the .reserve() method on these instances of the vector collection such that all the queue has all the memory it needs to grow. Again, possibly a micro-optimization.

Multi-thread crawler doesn't speed up with threading (on local files)

I have a task - to write a multithreaded webcrawler (actually I have a local set.html files that I need to parse and move to another directory). The main condition for this task is to make it possible to enter an arbitrary number of threads and determine at what number the program will stop adding in performance.
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <queue>
#include <ctime>
#include <set>
#include <chrono>
#include <atomic>
using namespace std;
class WebCrawler{
private:
const string start_path = "/";
const string end_path = "/";
int thread_counts;
string home_page;
queue<string> to_visit;
set<string> visited;
vector<thread> threads;
mutex mt1;
int count;
public:
WebCrawler(int thread_counts_, string root_)
:thread_counts(thread_counts_), home_page(root_) {
to_visit.push(root_);
visited.insert(root_);
count = 0;
}
void crawler(){
for(int i = 0; i<thread_counts; i++)
threads.push_back(thread(&WebCrawler::start_crawl, this));
for(auto &th: threads)
th.join();
cout<<"Count: "<<count<<endl;
}
void parse_html(string page_){
cout<<"Thread: "<<this_thread::get_id()<<" page: "<<page_<< endl;
ifstream page;
page.open(start_path+page_, ios::in);
string tmp;
getline(page, tmp);
page.close();
for(int i = 0; i<tmp.size(); i++){
if( tmp[i] == '<'){
string tmp_num ="";
while(tmp[i]!= '>'){
if(isdigit(tmp[i]))
tmp_num+=tmp[i];
i++;
}
tmp_num+= ".html";
if((visited.find(tmp_num) == visited.end())){
mt1.lock();
to_visit.push(tmp_num);
visited.insert(tmp_num);
mt1.unlock();
}
}
}
}
void move(string page_){
mt1.lock();
count++;
ofstream page;
page.open(end_path+page_, ios::out);
page.close();
mt1.unlock();
}
void start_crawl(){
cout<<"Thread started: "<<this_thread::get_id()<< endl;
string page;
while(!to_visit.empty()){
mt1.lock();
page = to_visit.front();
to_visit.pop();
mt1.unlock();
parse_html(page);
move(page);
}
}
};
int main(int argc, char const *argv\[])
{
int start_time = clock();
WebCrawler crawler(7, "0.html");
crawler.crawler();
int end_time = clock();
cout<<"Time: "<<(float)(end_time -start_time)/CLOCKS_PER_SEC<<endl;
cout<<thread::hardware_concurrency()<<endl;
return 0;
}
1 thread = Time: 0.709504
2 thread = Time: 0.668037
4 thread = Time: 0.762967
7 thread = Time: 0.781821
I've been trying to figure out for a week why my program is running slower even on two threads. I probably don't fully understand how mutex works, or perhaps the speed is lost during the joining of threads. Do you have any ideas how to fix it?

There are many ways to protect things in multithreading, implicit or explicit.
In addition to the totally untested code, there are also some implicit assumptions, for example of that int is large enough for your task, that must be considered.
Lets make a short analysis of what is needing protection.
Variables that are accessed from multiple threads
things that are const can be excluded
unless you const cast them
part of them are mutable
global objects like files or cout
could be overwritten
written from multiple threads
streams have their own internal locks
so you can write to a stream from multiple threads to cout
but you don't want it for the files in this case.
if multiple threads want to open the same file, you will get an error.
std::endl forces an synchronization, so change it to "\n" like a commenter noted.
So this boils down to:
queue<string> to_visit;
set<string> visited; // should be renamed visiting
int count;
<streams and files>
count is easy
std::atomic<int> count;
The files are implicit protected by your visited/visiting check, so they are good too. So the mutex in move can be removed.
The remaining needs an mutex each as they could be independently updated.
mutex mutTovisit, // formerly known as mut1.
mutVisiting.
Now we have the problem that we could deadlock with two mutexes, if we try to lock in different order in two places. You need to read up on all the lock stuff if you add more locks, scoped_lock and lock are good places to start.
Changing the code to
{
scoped_lock visitLock(mutVisiting); // unlocks at end of } even if stuff throws
if((visited.find(tmp_num) == visited.end())){
scoped_lock toLock(mutTo);
to_visit.push(tmp_num);
visited.insert(tmp_num);
}
}
And in this code there are multiple errors, that are hidden by the not thread safe access to to_visit and the randomness of the thread starts.
while(!to_visit.empty()){ // 2. now the next thread starts and sees its empty and stops
// 3. or worse it starts then hang at lock
mt1.lock();
page = to_visit.front(); // 4. and does things that are not good here with an empty to_visit
to_visit.pop(); // 1. is now empty after reading root
mt1.unlock();
parse_html(page);
move(page);
}
To solve this you need an (atomic?) counter, found(Pages) of current known unvisited pages so we know if are done. Then to start threads when there is new work that needs to be done we can use std::condition_variable(_any)
The general idea of the plan is to have the threads wait until work is available, then each time a new page is discovered notify_one to start work.
To Startup, set the found to 1 and notify_one once the threads have started, when a thread is done with the work decrease found. To stop when found is zero, the thread that decrease it to zero notify_all so they all can stop.
What you will find is that if the data is on a single slow disk, it is unlikely you will see much effect from more than 2 threads, if all files are currently cached in ram, you might see more effect.

I think there's a bottle neck on your move function. Each thread takes the same amount of time to go through that function. You could start with that

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

I'd like, instead of having my threads wait, doing nothing, for other threads to finish using data, to do something else in the meantime (like checking for input, or re-rendering the previous frame in the queue, and then returning to check to see if the other thread is done with its task).
I think this code that I've written does that, and it "seems" to work in the tests I've performed, but I don't really understand how std::memory_order_acquire and std::memory_order_clear work exactly, so I'd like some expert advice on if I'm using those correctly to achieve the behaviour I want.
Also, I've never seen multithreading done this way before, which makes me a bit worried. Are there good reasons not to have a thread do other tasks instead of waiting?
/*test program
intended to test if atomic flags can be used to perform other tasks while shared
data is in use, instead of blocking
each thread enters the flag protected part of the loop 20 times before quitting
if the flag indicates that the if block is already in use, the thread is intended to
execute the code in the else block (only up to 5 times to avoid cluttering the output)
debug note: this doesn't work with std::cout because all the threads are using it at once
and it's not thread safe so it all gets garbled. at least it didn't crash
real world usage
one thread renders and draws to the screen, while the other checks for input and
provides frameData for the renderer to use. neither thread should ever block*/
#include <fstream>
#include <atomic>
#include <thread>
#include <string>
struct ThreadData {
int numTimesToWriteToDebugIfBlockFile;
int numTimesToWriteToDebugElseBlockFile;
};
class SharedData {
public:
SharedData() {
threadData = new ThreadData[10];
for (int a = 0; a < 10; ++a) {
threadData[a] = { 20, 5 };
}
flag.clear();
}
~SharedData() {
delete[] threadData;
}
void runThread(int threadID) {
while (this->threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
if (this->flag.test_and_set(std::memory_order_acquire)) {
std::string fileName = "debugIfBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", running, output #" << this->threadData[threadID].numTimesToWriteToDebugIfBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugIfBlockFile -= 1;
this->flag.clear(std::memory_order_release);
}
else {
if (this->threadData[threadID].numTimesToWriteToDebugElseBlockFile > 0) {
std::string fileName = "debugElseBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", standing by, output #" << this->threadData[threadID].numTimesToWriteToDebugElseBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugElseBlockFile -= 1;
}
}
}
}
private:
ThreadData* threadData;
std::atomic_flag flag;
};
void runThread(int threadID, SharedData* sharedData) {
sharedData->runThread(threadID);
}
int main() {
SharedData sharedData;
std::thread thread[10];
for (int a = 0; a < 10; ++a) {
thread[a] = std::thread(runThread, a, &sharedData);
}
thread[0].join();
thread[1].join();
thread[2].join();
thread[3].join();
thread[4].join();
thread[5].join();
thread[6].join();
thread[7].join();
thread[8].join();
thread[9].join();
return 0;
}```

The memory ordering you're using here is correct.
The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.
However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.
class SharedData {
// ...
private:
std::mutex my_lock;
}
// ...
if (my_lock.try_lock()) {
// lock was taken, proceed with critical section
my_lock.unlock();
} else {
// lock not taken, do non-critical work
}
This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.
It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.

Multithreaded OpenSSL

I am using libcurl with OpenSSL in a multithreaded-program. Based on the kind response from Daniel Stenberg and the examples
opensslthreadlock.c
threaded-ssl.c
I tried the following code:
static std::mutex* aMutex;
void locking_function(int mode, int n, const char* file, int line)
{
if(mode & CRYPTO_LOCK){
std::cout << "Mutex locking\n";
aMutex[n].lock();
}
else{
std::cout << "Mutex unlocking\n";
aMutex[n].unlock();
}
}
unsigned long id_function()
{
return (unsigned long)std::hash<std::thread::id>() (std::this_thread::get_id());
}
int thread_setup()
{
aMutex = new std::mutex[CRYPTO_num_locks()];
if(!aMutex)
return 0;
else{
CRYPTO_set_id_callback(id_function);
CRYPTO_set_locking_callback(locking_function);
}
return 1;
}
int thread_cleanup()
{
if(!aMutex)
return 0;
CRYPTO_set_id_callback(NULL);
CRYPTO_set_locking_callback(NULL);
delete[] aMutex;
aMutex = NULL;
return 1;
}
So I provide the locking function, the thread-identifier function and let the locking_function use a global array of mutexes.
The sentences "Mutex locking" and "Mutex unlocking" (which are printed in the locking_function) gets printed over and over really fast. When I destroy the CURL handle it stops. Is this normal/correct?
I call
thread_setup
when my application starts and
thread_cleanup
when the application ends.
Thanks for any help.

The sentences "Mutex locking" and "Mutex unlocking" (which are printed in the locking_function) gets printed over and over really fast.
Mutex lock/unlock are used to ensure only one thread accesses the critical section at one time (that's why it's called mutual exclusion). In a multi-threaded program, where you use an array of mutexes and use the same methods for locking/unlocking them, these two I expect would get called many times. Plus the execution of the critical section is usually fast(a few lines of code).
So to give you a straight answer, yes, what you are seeing is normal

Please note that on some platforms this code could have an error:
(unsigned long)std::hash<std::thread::id>() (std::this_thread::get_id());
std::hash() returns size_t and you cast it to unsigned long. On Win-64 platform, size_t is 64 bit and unsigned long is 32 bit.

C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Question:
3 while loops below contain code that has been commented out. I search for ("TAG1", "TAG2", and "TAG3") for easy identification. I simply want the while loops to wait on the condition tested to become true before proceeding while minimizing CPU resources as much as possible. I first tried using Boost condition variables, but there's a race condition. Putting the thread to sleep for 'x' microseconds is inefficient because there is no way to precisely time the wakeup. Finally, boost::this_thread::yield() does not seem to do anything. Probably because I only have 2 active threads on a dual-core system. Specifically, how can I make the three tagged areas below run more efficiently while introducing as little unnecessary blocking as possible.
BACKGROUND
Objective:
I have an application that logs a lot of data. After profiling, I found that much time is consumed on the logging operations (logging text or binary to a file on the local hard disk). My objective is to reduce the latency on logData calls by replacing non-threaded direct write calls with calls to a threaded buffered stream logger.
Options Explored:
Upgrade 2005-era slow hard disk to SSD...possible. Cost is not prohibitive...but involves a lot of work... more than 200 computers would have to be upgraded...
Boost ASIO...I don't need all the proactor / networking overhead, looking for something simpler and more light-weight.
Design:
Producer and consumer thread pattern, the application writes data into a buffer and a background thread then writes it to disk sometime later. So the ultimate goal is to have the writeMessage function called by the application layer return as fast as possible while data is correctly / completely logged to the log file in a FIFO order sometime later.
Only one application thread, only one writer thread.
Based on ring buffer. The reason for this decision is to use as few locks as possible and ideally...and please correct me if I'm wrong...I don't think I need any.
Buffer is a statically-allocated character array, but could move it to the heap if needed / desired for performance reasons.
Buffer has a start pointer that points to the next character that should be written to the file. Buffer has an end pointer that points to the array index after the last character to be written to the file. The end pointer NEVER passes the start pointer. If a message comes in that is larger than the buffer, then the writer waits until the buffer is emptied and writes the new message to the file directly without putting the over-sized message in the buffer (once the buffer is emptied, the worker thread won't be writing anything so no contention).
The writer (worker thread) only updates the ring buffer's start pointer.
The main (application thread) only updates the ring buffer's end pointer, and again, it only inserts new data into the buffer when there is available space...otherwise it either waits for space in the buffer to become available or writes directly as described above.
The worker thread continuously checks to see if there is data to be written (indicated by the case when the buffer start pointer != buffer end pointer). If there is no data to be written, the worker thread should ideally go to sleep and wake up once the application thread has inserted something into the buffer (and changed the buffer's end pointer such that it no longer points to the same index as the start pointer). What I have below involves while loops continuously checking that condition. It is a very bad / inefficient way of waiting on the buffer.
Results:
On my 2009-era dual-core laptop with SSD, I see that the total write time of the threaded / buffered benchmark vs. direct write is about 1 : 6 (0.609 sec vs. 0.095 sec), but highly variable. Often the buffered write benchmark is actually slower than direct write. I believe that the variability is due to the poor implementation of waiting for space to free up in the buffer, waiting for the buffer to empty, and having the worker-thread wait for work to become available. I have measured that some of the while loops consume over 10000 cycles and I suspect that those cycles are actually competing for hardware resources that the other thread (worker or application) requires to finish the computation being waited on.
Output seems to check out. With TEST mode enabled and a small buffer size of 10 as a stress test, I diffed hundreds of MBs of output and found it to equal the input.
Compiles with current version of Boost (1.55)
Header
#ifndef BufferedLogStream_h
#define BufferedLogStream_h
#include <stdio.h>
#include <iostream>
#include <iostream>
#include <cstdlib>
#include "boost\chrono\chrono.hpp"
#include "boost\thread\thread.hpp"
#include "boost\thread\locks.hpp"
#include "boost\thread\mutex.hpp"
#include "boost\thread\condition_variable.hpp"
#include <time.h>
using namespace std;
#define BENCHMARK_STR_SIZE 128
#define NUM_BENCHMARK_WRITES 524288
#define TEST 0
#define BENCHMARK 1
#define WORKER_LOOP_WAIT_MICROSEC 20
#define MAIN_LOOP_WAIT_MICROSEC 10
#if(TEST)
#define BUFFER_SIZE 10
#else
#define BUFFER_SIZE 33554432 //4 MB
#endif
class BufferedLogStream {
public:
BufferedLogStream();
void openFile(char* filename);
void flush();
void close();
inline void writeMessage(const char* message, unsigned int length);
void writeMessage(string message);
bool operator() () { return start != end; }
private:
void threadedWriter();
inline bool hasSomethingToWrite();
inline unsigned int getFreeSpaceInBuffer();
void appendStringToBuffer(const char* message, unsigned int length);
FILE* fp;
char* start;
char* end;
char* endofringbuffer;
char ringbuffer[BUFFER_SIZE];
bool workerthreadkeepalive;
boost::mutex mtx;
boost::condition_variable waitforempty;
boost::mutex workmtx;
boost::condition_variable waitforwork;
#if(TEST)
struct testbuffer {
int length;
char message[BUFFER_SIZE * 2];
};
public:
void test();
private:
void getNextRandomTest(testbuffer &tb);
FILE* datatowrite;
#endif
#if(BENCHMARK)
public:
void runBenchmark();
private:
void initBenchmarkString();
void runDirectWriteBaseline();
void runBufferedWriteBenchmark();
char benchmarkstr[BENCHMARK_STR_SIZE];
#endif
};
#if(TEST)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->openFile("replicated.txt");
bl->test();
bl->close();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif
#if(BENCHMARK)
int main() {
BufferedLogStream* bl = new BufferedLogStream();
bl->runBenchmark();
cout << "Done" << endl;
cin.get();
return 0;
}
#endif //for benchmark
#endif
Implementation
#include "BufferedLogStream.h"
BufferedLogStream::BufferedLogStream() {
fp = NULL;
start = ringbuffer;
end = ringbuffer;
endofringbuffer = ringbuffer + BUFFER_SIZE;
workerthreadkeepalive = true;
}
void BufferedLogStream::openFile(char* filename) {
if(fp) close();
workerthreadkeepalive = true;
boost::thread t2(&BufferedLogStream::threadedWriter, this);
fp = fopen(filename, "w+b");
}
void BufferedLogStream::flush() {
fflush(fp);
}
void BufferedLogStream::close() {
workerthreadkeepalive = false;
if(!fp) return;
while(hasSomethingToWrite()) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
flush();
fclose(fp);
fp = NULL;
}
void BufferedLogStream::threadedWriter() {
while(true) {
if(start != end) {
char* currentend = end;
if(start < currentend) {
fwrite(start, 1, currentend - start, fp);
}
else if(start > currentend) {
if(start != endofringbuffer) fwrite(start, 1, endofringbuffer - start, fp);
fwrite(ringbuffer, 1, currentend - ringbuffer, fp);
}
start = currentend;
waitforempty.notify_one();
}
else { //start == end...no work to do
if(!workerthreadkeepalive) return;
boost::unique_lock<boost::mutex> u(workmtx);
waitforwork.wait_for(u, boost::chrono::microseconds(WORKER_LOOP_WAIT_MICROSEC));
}
}
}
bool BufferedLogStream::hasSomethingToWrite() {
return start != end;
}
void BufferedLogStream::writeMessage(string message) {
writeMessage(message.c_str(), message.length());
}
unsigned int BufferedLogStream::getFreeSpaceInBuffer() {
if(end > start) return (start - ringbuffer) + (endofringbuffer - end) - 1;
if(end == start) return BUFFER_SIZE-1;
return start - end - 1; //case where start > end
}
void BufferedLogStream::appendStringToBuffer(const char* message, unsigned int length) {
if(end + length <= endofringbuffer) { //most common case for appropriately-sized buffer
memcpy(end, message, length);
end += length;
}
else {
int lengthtoendofbuffer = endofringbuffer - end;
if(lengthtoendofbuffer > 0) memcpy(end, message, lengthtoendofbuffer);
int remainderlength = length - lengthtoendofbuffer;
memcpy(ringbuffer, message + lengthtoendofbuffer, remainderlength);
end = ringbuffer + remainderlength;
}
}
void BufferedLogStream::writeMessage(const char* message, unsigned int length) {
if(length > BUFFER_SIZE - 1) { //if string is too large for buffer, wait for buffer to empty and bypass buffer, write directly to file
while(hasSomethingToWrite()); {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
fwrite(message, 1, length, fp);
}
else {
//wait until there is enough free space to insert new string
while(getFreeSpaceInBuffer() < length) {
boost::unique_lock<boost::mutex> u(mtx);
waitforempty.wait_for(u, boost::chrono::microseconds(MAIN_LOOP_WAIT_MICROSEC));
}
appendStringToBuffer(message, length);
}
waitforwork.notify_one();
}
#if(TEST)
void BufferedLogStream::getNextRandomTest(testbuffer &tb) {
tb.length = 1 + (rand() % (int)(BUFFER_SIZE * 1.05));
for(int i = 0; i < tb.length; i++) {
tb.message[i] = rand() % 26 + 65;
}
tb.message[tb.length] = '\n';
tb.length++;
tb.message[tb.length] = '\0';
}
void BufferedLogStream::test() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
testbuffer tb;
datatowrite = fopen("orig.txt", "w+b");
for(unsigned int i = 0; i < 7000000; i++) {
if(i % 1000000 == 0) cout << i << endl;
getNextRandomTest(tb);
writeMessage(tb.message, tb.length);
fwrite(tb.message, 1, tb.length, datatowrite);
}
fflush(datatowrite);
fclose(datatowrite);
}
#endif
#if(BENCHMARK)
void BufferedLogStream::initBenchmarkString() {
for(unsigned int i = 0; i < BENCHMARK_STR_SIZE - 1; i++) {
benchmarkstr[i] = rand() % 26 + 65;
}
benchmarkstr[BENCHMARK_STR_SIZE - 1] = '\n';
}
void BufferedLogStream::runDirectWriteBaseline() {
clock_t starttime = clock();
fp = fopen("BenchMarkBaseline.txt", "w+b");
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
fwrite(benchmarkstr, 1, BENCHMARK_STR_SIZE, fp);
}
fflush(fp);
fclose(fp);
clock_t elapsedtime = clock() - starttime;
cout << "Direct write baseline took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBufferedWriteBenchmark() {
clock_t starttime = clock();
openFile("BufferedBenchmark.txt");
cout << "Opend file" << endl;
for(unsigned int i = 0; i < NUM_BENCHMARK_WRITES; i++) {
writeMessage(benchmarkstr, BENCHMARK_STR_SIZE);
}
cout << "Wrote" << endl;
close();
cout << "Close" << endl;
clock_t elapsedtime = clock() - starttime;
cout << "Buffered write took " << ((double) elapsedtime) / CLOCKS_PER_SEC << " seconds." << endl;
}
void BufferedLogStream::runBenchmark() {
cout << "Buffer size is: " << BUFFER_SIZE << endl;
initBenchmarkString();
runDirectWriteBaseline();
runBufferedWriteBenchmark();
}
#endif
Update: November 25, 2013
I updated the code below use boost::condition_variables, specifically the wait_for() method as recommended by Evgeny Panasyuk. This avoids unnecessarily checking the same condition over and over again. I am currently seeing the buffered version run in about 1/6th the time as the unbuffered / direct-write version. This is not the ideal case because both cases are limited by the hard disk (in my case a 2010 era SSD). I plan to use the code below in an environment where the hard disk will not be the bottleneck and most if not all the time, the buffer should have space available to accommodate the writeMessage requests. That brings me to my next question. How big should I make the buffer? I don't mind allocating 32 MBs or 64 MB to ensure that it never fills up. The code will be running on systems that can spare that. Intuitively, I feel that it's a bad idea to statically allocate a 32 MB character array. Is it? Anyhow, I expect that when I run the code below for my intended application, the latency of logData() calls will be greatly reduced which will yield a significant reduction in overall processing time.
If anyone sees any way to make the code below better (faster, more robust, leaner, etc), please let me know. I appreciate the feedback. Lazin, how would your approach be faster or more efficient than what I have posted below? I kinda like the idea of just having one buffer and making it large enough so that it practically never fills up. Then I don't have to worry about reading from different buffers. Evgeny Panasyuk, I like the approach of using existing code whenever possible, especially if it's an existing boost library. However, I also don't see how the spcs_queue is more efficient than what I have below. I'd rather deal with one large buffer than many smaller ones and have to worry about splitting splitting my input stream on the input and splicing it back together on the output. Your approach would allow me to offload the formatting from the main thread onto the worker thread. That is a cleaver approach. But I'm not sure yet whether it will save a lot of time and to realize the full benefit, I would have to modify code that I do not own.
//End Update

General solution.
I think you must look at the Naggle algorithm. For one producer and one consumer this would look like this:
At the beginning buffer is empty, worker thread is idle and waiting for the events.
Producer writes data to the buffer and notifies worker thread.
Worker thread woke up and start the write operation.
Producer tries to write another message, but buffer is used by worker, so producer allocates another buffer and writes message to it.
Producer tries to write another message, I/O still in progress so producer writes message to previously allocated buffer.
Worker thread done writing buffer to file and sees that there is another buffer with data so it grabs it and starts to write.
The very first buffer is used by producer to write all consecutive messages, until second write operation in progress.
This schema will help achieve low latency requirement, single message will be written to disc instantaneously, but large amount of events will be written by large batches for greather throughput.
If your log messages have levels - you can improve this schema a little bit. All error messages have high priority(level) and must be saved on disc immediately (because they are rare but very valuable) but debug and trace messages have low priority and can be buffered to save bandwidth (because they are very frequent but not as valuable as error and info messages). So when you write error message, you must wait until worker thread is done writing your message (and all messages that are in the same buffer) and then continue, but debug and trace messages can be just written to buffer.
Threading.
Spawning worker thread for each application thread is to costly. You must use single writer thread for each log file. Write buffers must be shared between threads. Each buffer must have two pointers - commit_pointer and prepare_pointer. All buffer space between beginning of the buffer and commit_pointer are available for worker thread. Buffer space between commit_pointer and prepare_pointer are currently updated by application threads. Invariant: commit_pointer <= prepare_pointer.
Write operations can be performed in two steps.
Prepare write. This operation reserves space in a buffer.
Producer calculates len(message) and atomically updates prepare_pointer;
Old prepare_pointer value and len is saved by consumer;
Commit write.
Producer writes message at the beginning of the reserved buffer space (old prepare_pointer value).
Producer busy-waits until commit_pointer is equal to old prepare_pointer value that its save in local variable.
Producer commit write operation by doing commit_pointer = commit_pointer + len atomically.
To prevent false sharing, len(message) can be rounded to cache line size and all extra space can be filled with spaces.
// pseudocode
void write(const char* message) {
int len = strlen(message); // TODO: round to cache line size
const char* old_prepare_ptr;
// Prepare step
while(1)
{
old_prepare_ptr = prepare_ptr;
if (
CAS(&prepare_ptr,
old_prepare_ptr,
prepare_ptr + len) == old_prepare_ptr
)
break;
// retry if another thread perform prepare op.
}
// Write message
memcpy((void*)old_prepare_ptr, (void*)message, len);
// Commit step
while(1)
{
const char* old_commit_ptr = commit_ptr;
if (
CAS(&commit_ptr,
old_commit_ptr,
old_commit_ptr + len) == old_commit_ptr
)
break;
// retry if another thread commits
}
notify_worker_thread();
}

concurrent_queue<T, Size>
The question that I have is how to make the worker thread go to work as soon as there is work to do and sleep when there is no work.
There is boost::lockfree::spsc_queue - wait-free single-producer single-consumer queue. It can be configured to have compile-time capacity (the size of the internal ringbuffer).
From what I understand, you want something similar to following configuration:
template<typename T, size_t N>
class concurrent_queue
{
// T can be wrapped into struct with padding in order to avoid false sharing
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1)); // Or whatever period you need.
// Timeout is required, because modification happens not under mutex
// and notification can be lost.
// Another option is just to use sleep/yield, without notifications.
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
When there are elements in queue - pop does not block. And when there is enough space in internal buffer - push does not block.
concurrent<T>
I want to reduce both formatting and write times as much as possible so I plan to reduce both.
Check out Herb Sutter talk at C++ and Beyond 2012: C++ Concurrency. At page 14 he shows example of concurrent<T>. Basically it is wrapper around object of type T which starts separate thread for performing all operations on that object. Usage is:
concurrent<ostream*> x(&cout); // starts thread internally
// ...
// x acts as function object.
// It's function call operator accepts action
// which is performed on wrapped object in separate thread.
int i = 42;
x([i](ostream *out){ *out << "i=" << i; }); // passing lambda as action
You can use similar pattern in order to offload all formatting work to consumer thread.
Small Object Optimization
Otherwise, new buffers are allocated and I want to avoid memory allocation after the buffer stream is constructed.
Above concurrent_queue<T, Size> example uses fixed-size buffer which is fully contained within queue, and does not imply additional allocations.
However, Herb's concurrent<T> example uses std::function to pass action into worker thread. That may incur costly allocation.
std::function implementations may use Small Object Optimization (and most implementations do) - small function objects are in-place copy-constructed in internal buffer, but there is no guarantee, and for function objects bigger than threshold - heap allocation would happen.
There are several options to avoid this allocation:
Implement std::function analog with internal buffer large enough to hold target function objects (for example, you can try to modify boost::function or this version).
Use your own function object which would represent all type of log messages. Basically it would contain just values required to format message. As potentially there are different types of messages, consider to use boost::variant (which is literary union coupled with type tag) to represent them.
Putting it all together, here is proof-of-concept (using second option):
LIVE DEMO
#include <boost/lockfree/spsc_queue.hpp>
#include <boost/optional.hpp>
#include <boost/variant.hpp>
#include <condition_variable>
#include <iostream>
#include <cstddef>
#include <thread>
#include <chrono>
#include <mutex>
using namespace std;
/*********************************************/
template<typename T, size_t N>
class concurrent_queue
{
mutable boost::lockfree::spsc_queue<T, boost::lockfree::capacity<N>> q;
mutable mutex m;
mutable condition_variable c;
void wait() const
{
unique_lock<mutex> u(m);
c.wait_for(u, chrono::microseconds(1));
}
void notify() const
{
c.notify_one();
}
public:
void push(const T &t)
{
while(!q.push(t))
wait();
notify();
}
void pop(T &result)
{
while(!q.pop(result))
wait();
notify();
}
};
/*********************************************/
template<typename T, typename F>
class concurrent
{
typedef boost::optional<F> Job;
mutable concurrent_queue<Job, 16> q; // use custom size
mutable T x;
thread worker;
public:
concurrent(T x)
: x{x}, worker{[this]
{
Job j;
while(true)
{
q.pop(j);
if(!j) break;
(*j)(this->x); // you may need to handle exceptions in some way
}
}}
{}
void operator()(const F &f)
{
q.push(Job{f});
}
~concurrent()
{
q.push(Job{});
worker.join();
}
};
/*********************************************/
struct LogEntry
{
struct Formatter
{
typedef void result_type;
ostream *out;
void operator()(double x) const
{
*out << "floating point: " << x << endl;
}
void operator()(int x) const
{
*out << "integer: " << x << endl;
}
};
boost::variant<int, double> data;
void operator()(ostream *out)
{
boost::apply_visitor(Formatter{out}, data);
}
};
/*********************************************/
int main()
{
concurrent<ostream*, LogEntry> log{&cout};
for(int i=0; i!=1024; ++i)
{
log({i});
log({i/10.});
}
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Atomic Variables Accessed Multiple Times From One Function - c++

Related

One more releasing and acquiring locks make performance worse unexpectedly

Multi-thread crawler doesn't speed up with threading (on local files)

Which types of memory_order should be used for non-blocking behaviour with an atomic_flag?

Multithreaded OpenSSL

C++ Low-Latency Threaded Asynchronous Buffered Stream (intended for logging) – Boost

Categories

Resources