Multithreading for image processing at GPU using CUDA

Multithreading for image processing at GPU using CUDA - c++

Problem Statement:
I have to continuously process 8 megapixel images captured from a camera . There have to be several image processing algorithms on it like color interpolation, color transformation etc. These operations will take a long time at CPU. So, I decided to do these operations at GPU using CUDA kernel. I have already written a working CUDA kernel for color transformation. But still I need some more boost in the performance.
There are basically two computational times:
Copying the source image from CPU to GPU and vice-versa
Processing of the source image at GPU
when the image is getting copied from CPU to GPU....nothing else happens. And similarly, when the processing of image at GPU working...nothing else happens.
MY IDEA: I want to do multi-threading so that I can save some time. I want to capture the next image while the processing of previous image is going on at GPU. And, when the GPU finishes the processing of previous image then, the next image is already there for it to get transferred from CPU to GPU.
What I need: I am completely new to the world of Multi-threading. I am watching some tutorials and some other stuff to know more about it. So, I am looking up for some suggestions about the proper steps and proper logic.

I'm not sure you really need threads for this. CUDA has the ability to allow for asynchronous concurrent execution between host and device (without the necessity to use multiple CPU threads.) What you're asking for is a pretty standard "pipelined" algorithm. It would look something like this:
$ cat t832.cu
#include <stdio.h>
#define IMGSZ 8000000
// for this example, NUM_FRAMES must be less than 255
#define NUM_FRAMES 128
#define nTPB 256
#define nBLK 64
unsigned char cur_frame = 0;
unsigned char validated_frame = 0;
bool validate_image(unsigned char *img) {
validated_frame++;
for (int i = 0; i < IMGSZ; i++) if (img[i] != validated_frame) {printf("image validation failed at %d, was: %d, should be: %d\n",i, img[i], validated_frame); return false;}
return true;
}
void CUDART_CB my_callback(cudaStream_t stream, cudaError_t status, void* data) {
validate_image((unsigned char *)data);
}
bool capture_image(unsigned char *img){
for (int i = 0; i < IMGSZ; i++) img[i] = cur_frame;
if (++cur_frame == NUM_FRAMES) {cur_frame--; return true;}
return false;
}
__global__ void img_proc_kernel(unsigned char *img){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
while(idx < IMGSZ){
img[idx]++;
idx += gridDim.x*blockDim.x;}
}
int main(){
// setup
bool done = false;
unsigned char *h_imgA, *h_imgB, *d_imgA, *d_imgB;
size_t dsize = IMGSZ*sizeof(unsigned char);
cudaHostAlloc(&h_imgA, dsize, cudaHostAllocDefault);
cudaHostAlloc(&h_imgB, dsize, cudaHostAllocDefault);
cudaMalloc(&d_imgA, dsize);
cudaMalloc(&d_imgB, dsize);
cudaStream_t st1, st2;
cudaStreamCreate(&st1); cudaStreamCreate(&st2);
unsigned char *cur = h_imgA;
unsigned char *d_cur = d_imgA;
unsigned char *nxt = h_imgB;
unsigned char *d_nxt = d_imgB;
cudaStream_t *curst = &st1;
cudaStream_t *nxtst = &st2;
done = capture_image(cur); // grabs a frame and puts it in cur
// enter main loop
while (!done){
cudaMemcpyAsync(d_cur, cur, dsize, cudaMemcpyHostToDevice, *curst); // send frame to device
img_proc_kernel<<<nBLK, nTPB, 0, *curst>>>(d_cur); // process frame
cudaMemcpyAsync(cur, d_cur, dsize, cudaMemcpyDeviceToHost, *curst);
// insert a cuda stream callback here to copy the cur frame to output
cudaStreamAddCallback(*curst, &my_callback, (void *)cur, 0);
cudaStreamSynchronize(*nxtst); // prevent overrun
done = capture_image(nxt); // capture nxt image while GPU is processing cur
unsigned char *tmp = cur;
cur = nxt;
nxt = tmp; // ping - pong
tmp = d_cur;
d_cur = d_nxt;
d_nxt = tmp;
cudaStream_t *st_tmp = curst;
curst = nxtst;
nxtst = st_tmp;
}
}
$ nvcc -o t832 t832.cu
$ cuda-memcheck ./t832
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
There are many cuda sample codes which may be helpful also, such as simpleStreams, asyncAPI, and simpleCallbacks

Since your question is very wide, I can only think of the following advice:
1) Use CUDA streams
When using more than one CUDA stream, the memory transfer between CPU->GPU, the GPU processing and the memory transfer between GPU->CPU can overlap. This way the image processing of the next image can already begin while the result is transferred back.
You can also decompose each frame. Use n streams per frame and launch the image processing kernels n times with an offset.
2) Apply the producer-consumer scheme
The producer thread captures the frames from the camera and stores them in a thread-safe container. The consumer thread(s) fetch(es) a frame from this source container, upload(s) it to the GPU using its/their own CUDA stream(s), launches the kernel and copies the result back to the host.
Each consumer thread would synchronize with its stream(s) before trying to get a new image from the source container.
A simple implementation could look like this:
#include <vector>
#include <thread>
#include <memory>
struct ThreadSafeContainer{ /*...*/ };
struct Producer
{
Producer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
{
}
void run()
{
while(true)
{
// grab image from camera
// store image in container
}
}
std::shared_ptr<ThreadSafeContainer> container;
};
struct Consumer
{
Consumer(std::shared_ptr<ThreadSafeContainer> c) : container(c)
{
cudaStreamCreate(&stream);
}
~Consumer()
{
cudaStreamDestroy(stream);
}
void run()
{
while(true)
{
// read next image from container
// upload to GPU
cudaMemcpyAsync(...,...,...,stream);
// run kernel
kernel<<<..., ..., ..., stream>>>(...);
// copy results back
cudaMemcpyAsync(...,...,...,stream);
// wait for results
cudaStreamSynchronize(stream);
// do something with the results
}
}
std::shared_ptr<ThreadSafeContainer> container;
cudaStream_t stream; // or multiple streams per consumer
};
int main()
{
// create an instance of ThreadSafeContainer which whill be shared between Producer and Consumer instances
auto container = std::make_shared<ThreadSafeContainer>();
// create one instance of Producer, pass the shared container as an argument to the constructor
auto p = std::make_shared<Producer>(container);
// create a separate thread which executes Producer::run
std::thread producer_thread(&Producer::run, p);
const int consumer_count = 2;
std::vector<std::thread> consumer_threads;
std::vector<std::shared_ptr<Consumer>> consumers;
// create as many consumers as specified
for (int i=0; i<consumer_count;++i)
{
// create one instance of Consumer, pass the shared container as an argument to the constructor
auto c = std::make_shared<Consumer>(container);
// create a separate thread which executes Consumer::run
consumer_threads.push_back(std::thread(&Consumer::run, c));
}
// wait for the threads to finish, otherwise the program will just exit here and the threads will be killed
// in this example, the program will never exit since the infinite loop in the run() methods never end
producer_thread.join();
for (auto& t : consumer_threads)
{
t.join();
}
return 0;
}

Related

Thread with expensive operations slows down UI thread - Windows 10, C++

The Problem: I have two threads in a Windows 10 application I'm working on, a UI thread (called the render thread in the code) and a worker thread in the background (called the simulate thread in the code). Ever couple of seconds or so, the background thread has to perform a very expensive operation that involves allocating a large amount of memory. For some reason, when this operation happens, the UI thread lags for a split second and becomes unresponsive (this is seen in the application as a camera not moving for a second while the camera movement input is being given).
Maybe I'm misunderstanding something about how threads work on Windows, but I wasn't aware that this was something that should happen. I was under the impression that you use a separate UI thread for this very reason: to keep it responsive while other threads do more time intensive operations.
Things I've tried: I've removed all communication between the two threads, so there are no mutexes or anything of that sort (unless there's something implicit that Windows does that I'm not aware of). I have also tried setting the UI thread to be a higher priority than the background thread. Neither of these helped.
Some things I've noted: While the UI thread lags for a moment, other applications running on my machine are just as responsive as ever. The heavy operation seems to only affect this one process. Also, if I decrease the amount of memory being allocated, it alleviates the issue (however, for the application to work as I want it to, it needs to be able to do this allocation).
The question: My question is two-fold. First, I'd like to understand why this is happening, as it seems to go against my understanding of how multi-threading should work. Second, do you have any recommendations or ideas on how to fix this and get it so the UI doesn't lag.
Abbreviated code: Note the comment about epochs in timeline.h
main.cpp
#include "Renderer/Headers/Renderer.h"
#include "Shared/Headers/Timeline.h"
#include "Simulator/Simulator.h"
#include <iostream>
#include <Windows.h>
unsigned int __stdcall renderThread(void* timelinePtr);
unsigned int __stdcall simulateThread(void* timelinePtr);
int main() {
Timeline timeline;
HANDLE renderHandle = (HANDLE)_beginthreadex(0, 0, &renderThread, &timeline, 0, 0);
if (renderHandle == 0) {
std::cerr << "There was an error creating the render thread" << std::endl;
return -1;
}
SetThreadPriority(renderHandle, THREAD_PRIORITY_HIGHEST);
HANDLE simulateHandle = (HANDLE)_beginthreadex(0, 0, &simulateThread, &timeline, 0, 0);
if (simulateHandle == 0) {
std::cerr << "There was an error creating the simulate thread" << std::endl;
return -1;
}
SetThreadPriority(simulateHandle, THREAD_PRIORITY_IDLE);
WaitForSingleObject(renderHandle, INFINITE);
WaitForSingleObject(simulateHandle, INFINITE);
return 0;
}
unsigned int __stdcall renderThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Renderer renderer = Renderer(timeline);
renderer.run();
return 0;
}
unsigned int __stdcall simulateThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Simulator simulator(timeline);
simulator.run();
return 0;
}
simulator.cpp
// abbreviated
void Simulator::run() {
while (true) {
// abbreviated
timeline->push(latestState);
}
}
// abbreviated
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>
class Timeline {
public:
Timeline();
bool tryGetStateAtFrame(int frame, WorldState*& worldState);
void push(WorldState* worldState);
private:
// The concept of an Epoch was introduced to help reduce mutex conflicts, but right now since the threads are disconnected, there should be no mutex locks at all on the UI thread. However, every 1024 pushes onto the timeline, a new Epoch must be created. The amount of slowdown largely depends on how much memory the WorldState class takes. If I make WorldState small, there isn't a noticable hiccup, but when it is large, it becomes noticeable.
class Epoch {
public:
static const int MAX_SIZE = 1024;
void push(WorldState* worldstate);
int getSize();
WorldState* getAt(int index);
private:
int size = 0;
WorldState states[MAX_SIZE];
};
Epoch* pushEpoch;
std::mutex lock;
std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
#include "../Headers/Timeline.h"
#include <iostream>
Timeline::Timeline() {
pushEpoch = new Epoch();
}
bool Timeline::tryGetStateAtFrame(int frame, WorldState*& worldState) {
if (!lock.try_lock()) {
return false;
}
if (frame >= epochs.size() * Epoch::MAX_SIZE) {
lock.unlock();
return false;
}
worldState = epochs.at(frame / Epoch::MAX_SIZE)->getAt(frame % Epoch::MAX_SIZE);
lock.unlock();
return true;
}
void Timeline::push(WorldState* worldState) {
pushEpoch->push(worldState);
if (pushEpoch->getSize() == Epoch::MAX_SIZE) {
lock.lock();
epochs.push_back(pushEpoch);
lock.unlock();
pushEpoch = new Epoch();
}
}
void Timeline::Epoch::push(WorldState* worldState) {
if (this->size == this->MAX_SIZE) {
throw std::out_of_range("Pushed too many items to Epoch without clearing");
}
this->states[this->size] = *worldState;
this->size++;
}
int Timeline::Epoch::getSize() {
return this->size;
}
WorldState* Timeline::Epoch::getAt(int index) {
if (index >= this->size) {
throw std::out_of_range("Tried accessing nonexistent element of epoch");
}
return &(this->states[index]);
}
Renderer.cpp: loops to call Presenter::update() and some OpenGL rendering tasks.
Presenter.cpp
// abbreviated
void Presenter::update() {
camera->update();
// timeline->tryGetStateAtFrame(Time::getFrames(), worldState); // Normally this would cause a potential mutex conflict, but for now I have it commented out. This is the only place that anything on the UI thread accesses timeline.
}
// abbreviated
Any help/suggestions?

I ended up figuring this out!
So as it turns out, the new operator in C++ is threadsafe, which means that once it starts, it has to finish before any other threads can do anything. Why was that a problem in my case? Well, when an Epoch was being initialized, it had to initialize an array of 1024 WorldStates, each of which has 10,000 CellStates that need to be initialized, and each of those had an array of 16 items that needed to be initalized, so we ended up with over 100,000,000 objects needing to be initialized before the new operator could return. That was taking long enough that it caused the UI to hiccup while it was waiting.
The solution was to create a factory function that would build the pieces of the Epoch piecemeal, one constructor at a time and then combine them together and return a pointer to the new epoch.
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>
class Timeline {
public:
Timeline();
bool tryGetStateAtFrame(int frame, WorldState*& worldState);
void push(WorldState* worldState);
private:
class Epoch {
public:
static const int MAX_SIZE = 1024;
static Epoch* createNew();
void push(WorldState* worldstate);
int getSize();
WorldState* getAt(int index);
private:
Epoch();
int size = 0;
WorldState* states[MAX_SIZE];
};
Epoch* pushEpoch;
std::mutex lock;
std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
Timeline::Epoch* Timeline::Epoch::createNew() {
Epoch* epoch = new Epoch();
for (unsigned int i = 0; i < MAX_SIZE; i++) {
epoch->states[i] = new WorldState();
}
return epoch;
}

How to solve the problem that multithreaded drawing is not smooth?

I wrote a data acquisition program with Qt. I collect data using the child threads of the dual cache region written by QSemphore.
void QThreadShow::run() {
m_stop=false; // when start thread,m_stop=false
int n=fullBufs.available();
if (n>0)
fullBufs.acquire(n);
while (!m_stop) {
fullBufs.acquire(); // wait fo full buffer
QVector<double> dataPackage(BufferSize);
double seq=bufNo;
if (curBuf==1)
for (int i=0;i<BufferSize;i++){
dataPackage[i]=buffer2[i]; // copy data from full buffer
}
else
for (int i=0;i<BufferSize;i++){
dataPackage[i]=buffer1[i];
}
for (int k=0;k<BufferSize;k++) {
vectorQpointFbufferData[k]=QPointF(x,dataPackage[k]);
}
emptyBufs.release(); // release a buffer
QVariant variantBufferData;
variantBufferData.setValue(vectorQpointFbufferData);
emit newValue(variantBufferData,seq); // send data to main thread
}
quit();
}
When a cache of sub-threads has collected 500 data, the data is input into a QVector and sent to the main thread and is directly assigned to a lineseries in qchartview every 20ms for drawing. I use QtChart to chart the data.
void MainWindow::onthreadB_newValue(QVariant bufferData, double bufNo) {
// Analysis of QVariant data
CH1.hardSoftDataPointPackage = bufferData.value<QVector<QPointF>>();
if (ui->CH1_Source->currentIndex()==0) {
for (int p = 0;p<CH1.hardSoftDataPointPackage.size();p++) {
series_CH3->append(CH1.hardSoftDataPointPackage[p]);
}
}
}
There is a timer in the main thread.The interval is 20ms and there is a double time (time = time +1), which controls the X-axis.
void MainWindow::drawAxis(double time) {
// dynamic draw x axis
if (time<100) {
axisX->setRange(0, TimeBase/(1000/FrameRate) * 10);
// FrameRate=50
} else {
axisX->setRange(time-TimeBase/(1000/FrameRate) * 10, time);
}
}
But when I run my program, there is a problem that every time the subthread sends data to the main thread, the main thread gets stuck for a few seconds and the plot also gets stuck for a few seconds. I added a curve in the main thread getting data from the main thread, and found that both two curves will be stuck at the same time. I don't know how to solve this problem.
Besides, I want the main thread to draw the data from the child thread evenly within 20ms, instead of drawing all the points at once.

Your main thread stucks because you copy (add to series) a lot of data at one time. Instead this you can collect all your data inside your thread instance without emitting a signal. And from main thread just take little pieces of collected data every 20 ms.
Something like this:
while(!m_stop)
{
...
//QVariant variantBufferData;
//variantBufferData.setValue(vectorQpointFbufferData);
//emit newValue(variantBufferData,seq);//send data to main thread
//instead this just store in internal buffer
m_mutex.lock();
m_internalBuffer.append(vectorQpointFbufferData);
m_mutex.unlock();
}
Read method
QVector<QPointF> QThreadShow::takeDataPiece()
{
int n = 4;
QVector<QPointF> piece;
piece.reserve(n);
m_mutex.lock();
for (int i = 0; i < n; i++)
{
QPointF point = m_internalBuffer.takeFirst();
piece.append(point);
}
m_mutex.unlock();
return piece;
}
And in Main thread read in timeout slot
void MainWindow::OnDrawTimer()
{
QVector<QPointF> piece = m_childThread.takeDataPiece();
//add to series
...
//drawAxis
...
}

OpenCL vs CUDA: Pinned memory

I have been porting my RabbitCT CUDA implementation to OpenCL and I'm running into issues with pinned memory.
For CUDA a host buffer is created that buffers the input images to be processed in pinned memory. This allows the host to catch the next batch of input images while the GPU processes the current batch. A simplified mockup of my CUDA implementation is as follows:
// globals
float** hostProjBuffer = new float*[BUFFER_SIZE];
float* devProjection[STREAMS_MAX];
cudaStream_t stream[STREAMS_MAX];
void initialize()
{
// initiate streams
for( uint s = 0; s < STREAMS_MAX; s++ ){
cudaStreamCreateWithFlags (&stream[s], cudaStreamNonBlocking);
cudaMalloc( (void**)&devProjection[s], imgSize);
}
// initiate buffers
for( uint b = 0; b < BUFFER_SIZE; b++ ){
cudaMallocHost((void **)&hostProjBuffer[b], imgSize);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
uint projNr = r->imgnr % BUFFER_SIZE;
uint streamNr = r->imgnr % STREAMS_MAX;
// When buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
cudaStreamSynchronize(stream[streamNr]);
}
// copy received image data to buffer (maps double precision to float)
std::copy(r->I_n, r->I_n+(imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// copy image and matrix to device
cudaMemcpyAsync( devProjection[streamNr], hostProjBuffer[projNr], imgSize, cudaMemcpyHostToDevice, stream[streamNr] );
// call kernel
backproject<<<numBlocks, threadsPerBlock, 0 , stream[streamNr]>>>(devProjection[streamNr]);
}
So, for CUDA, I create a pinned host pointer for each buffer item and copy the data to the device before executing kernel of each stream.
For OpenCL I initially did something similar when following the Nvidia OpenCL Best Practices Guide. Here they recommend creating two buffers, one for copying the kernel data to and one for the pinned memory. However, this leads to the implementation using double the device memory as both the kernel and pinned memory buffers are allocated on the device.
To get around this memory issue, I created an implementation where only a mapping is made to the device as it is needed. This can be seen in the following implementation:
// globals
float** hostProjBuffer = new float* [BUFFER_SIZE];
cl_mem devProjection[STREAMS_MAX], devMatrix[STREAMS_MAX];
cl_command_queue queue[STREAMS_MAX];
// initiate streams
void initialize()
{
for( uint s = 0; s < STREAMS_MAX; s++ ){
queue[s] = clCreateCommandQueueWithProperties(context, device, NULL, &status);
devProjection[s] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, imgSize, NULL, &status);
}
}
// main function called for all input images
void backproject(imgdata* r)
{
const uint projNr = r->imgnr % BUFFER_SIZE;
const uint streamNr = r->imgnr % STREAMS_MAX;
// when buffer is filled, wait until work in current stream has finished
if(projNr == 0) {
status = clFinish(queue[streamNr]);
}
// map host memory region to device buffer
hostProjBuffer[projNr] = (float*) clEnqueueMapBuffer(queue[streamNr], devProjection[streamNr], CL_FALSE, CL_MAP_WRITE_INVALIDATE_REGION, 0, imgSize, 0, NULL, NULL, &status);
// copy received image data to hostbuffers
std::copy(imgPtr, imgPtr + (imgSizeX * imgSizeY), hostProjBuffer[projNr]);
// unmap the allocated pinned host memory
clEnqueueUnmapMemObject(queue[streamNr], devProjection[streamNr], hostProjBuffer[projNr], 0, NULL, NULL);
// set stream specific arguments
clSetKernelArg(kernel, 0, sizeof(devProjection[streamNr]), (void *) &devProjection[streamNr]);
// launch kernel
clEnqueueNDRangeKernel(queue[streamNr], kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, NULL);
clFlush(queue[streamNr]);
clFinish(queue[streamNr]); //should be removed!
}
This implementation does use a similar amount of device memory as the CUDA implementation. However, I have been unable to get this last code example working without a clFinish after each loop, which significantly hampers the performance of the application. This indicates data is lost as the host moves ahead of the kernel. I tried increasing my buffer size to the number of input images, but this did not work either. So somehow during execution, the hostBuffer data gets lost.
So, with the goal to write OpenCL code similar to CUDA, I have three questions:
What is the recommended implementation for OpenCL pinned memory?
Is my OpenCL implementation similar to how CUDA handles pinned memory?
What causes the wrong data to be used in the OpenCL example?
Thanks in advance!
Kind regards,
Remy
PS: Question initially asked at the Nvidia developer forums

Figuring out a race condition

I am building a screen recorder, I am using ffmpeg to make the video out from frames I get from Google Chrome. I get green screen in the output video. I think there is a race condition in the threads since I am not allowed to use main thread to do the processing. here how the code look like
This function works each time I get a new frame, I suspect the functions avpicture_fill & vpx_codec_get_cx_data are being rewritten before write_ivf_frame_header & WriteFile are done.
I am thinking of creating a queue where this function push the object pp::VideoFrame then another thread with mutex will dequeue and do the processing below.
What is the best solution for this problem? and what is the optimal way of debugging it
void EncoderInstance::OnGetFrame(int32_t result, pp::VideoFrame frame) {
if (result != PP_OK)
return;
const uint8_t* data = static_cast<const uint8_t*>(frame.GetDataBuffer());
pp::Size size;
frame.GetSize(&size);
uint32_t buffersize = frame.GetDataBufferSize();
if (is_recording_) {
vpx_codec_iter_t iter = NULL;
const vpx_codec_cx_pkt_t *pkt;
// copy the pixels into our "raw input" container.
int bytes_filled = avpicture_fill(&pic_raw, data, AV_PIX_FMT_YUV420P, out_width, out_height);
if(!bytes_filled) {
Logger::Log("Cannot fill the raw input buffer");
return;
}
if(vpx_codec_encode(&codec, &raw, frame_cnt, 1, flags, VPX_DL_REALTIME))
die_codec(&codec, "Failed to encode frame");
while( (pkt = vpx_codec_get_cx_data(&codec, &iter)) ) {
switch(pkt->kind) {
case VPX_CODEC_CX_FRAME_PKT:
glb_app_thread.message_loop().PostWork(callback_factory_.NewCallback(&EncoderInstance::write_ivf_frame_header, pkt));
glb_app_thread.message_loop().PostWork(callback_factory_.NewCallback(&EncoderInstance::WriteFile, pkt));
break;
default:break;
}
}
frame_cnt++;
}
video_track_.RecycleFrame(frame);
if (need_config_) {
ConfigureTrack();
need_config_ = false;
} else {
video_track_.GetFrame(
callback_factory_.NewCallbackWithOutput(
&EncoderInstance::OnGetFrame));
}
}

Does the use of an anonymous pipe introduce a memory barrier for interthread communication?

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
Also of of interest is whether the results of socketpair() on unix & self connecting over tcp loopback on windows have the same guarantees.
The context is a server design which centralizes event dispatch with select/epoll

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
No. There is no guarantee that the writing CPU will have flushed the write out of its cache and made it visible to the other CPU that might do the read.
Also of of interest is whether the results of socketpair() on unix & self connecting over tcp loopback on windows have the same guarantees.
No.

In practice, calling write(), which is a system call, will end up locking one or more data structures in the kernel, which should take care of the reordering issue. For example, POSIX requires subsequent reads to see data written before their call, which implies a lock (or some kind of acquire/release) by itself.
As for whether that's part of the formal spec of the calls, probably it's not.

A pointer is just a memory address, so provided you are on the same process the pointer will be valid on the receiving thread and will point to the same struct. If you are on different processes, at best you will get immediately a memory error, at worse you will read (or write) to a random memory which is essentially Undefined Behaviour.
Will you read the correct content? Neither better nor worse than if your pointer was in a static variable shared by both threads: you still have to do some synchronization if you want consistency.
Will the kind of transfer address matter between static memory (shared by threads), anonymous pipes, socket pairs, tcp loopback, etc.? No: all those channels transfers bytes, so if you pass a memory address, you will get your memory address. What is left you then is synchronization, because here you are just sharing a memory address.
If you do not use any other synchronization, anything can happen (did I already spoke of Undefined Behaviour?):
reading thread can access memory before it has been written by writing one giving stale data
if you forgot to declare the struct members as volatile, reading thread can keep using cached values, here again getting stale data
reading thread can read partially written data meaning incoherent data

Interesting question with, so far, only one correct answer from Cornstalks.
Within the same (multi-threaded) process there are no guarantees since pointer and data follow different paths to reach their destination.
Implicit acquire/release guarantees do not apply since the struct data cannot piggyback on the pointer through the cache and formally you are dealing with a data race.
However, looking at how the pointer and the struct data itself reach the second thread (through the pipe and memory cache respectively), there is a real chance that this mechanism is not going to cause any harm.
Sending the pointer to a peer thread takes 3 system calls (write() in the sending thread, select() and read() in the receiving thread) which is (relatively) expensive and by the time the pointer value is available
in the receiving thread, the struct data probably has arrived long before.
Note that this is just an observation, the mechanism is still incorrect.

I believe, your case might be reduced to this 2 threads model:
int data = 0;
std::atomic<int*> atomicPtr{nullptr};
//...
void thread1()
{
data = 42;
atomicPtr.store(&integer, std::memory_order_release);
}
void thread2()
{
int* ptr = nullptr;
while(!ptr)
ptr = atomicPtr.load(std::memory_order_consume);
assert(*ptr == 42);
}
Since you have 2 processes you can't use one atomic variable across them but since you listed windows you can omit atomicPtr.load(std::memory_order_consume) from the consuming part because, AFAIK, all the architectures Windows is running on guarantee this load to be correct without any barrier on the loading side. In fact, I think there are not much architectures out there where that instruction would not be a NO-OP(I heard only about DEC Alpha)

I agree with Serge Ballesta's answer. Within the same process, it's feasible to send and receive object address via anonymous pipe.
Since the write system call is guaranteed to be atomic when message size is below PIPE_BUF (normally 4096 bytes), so multi-producer threads will not mess up each other's object address (8 bytes for 64 bit applications).
Talk is cheap, here is the demo code for Linux (defensive code and error handlers are omitted for simplicity). Just copy & paste to pipe_ipc_demo.cc then compile & run the test.
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <string>
#include <list>
template<class T> class MPSCQ { // pipe based Multi Producer Single Consumer Queue
public:
MPSCQ();
~MPSCQ();
int producerPush(const T* t);
T* consumerPoll(double timeout = 1.0);
private:
void _consumeFd();
int _selectFdConsumer(double timeout);
T* _popFront();
private:
int _fdProducer;
int _fdConsumer;
char* _consumerBuf;
std::string* _partial;
std::list<T*>* _list;
static const int _PTR_SIZE;
static const int _CONSUMER_BUF_SIZE;
};
template<class T> const int MPSCQ<T>::_PTR_SIZE = sizeof(void*);
template<class T> const int MPSCQ<T>::_CONSUMER_BUF_SIZE = 1024;
template<class T> MPSCQ<T>::MPSCQ() :
_fdProducer(-1),
_fdConsumer(-1) {
_consumerBuf = new char[_CONSUMER_BUF_SIZE];
_partial = new std::string; // for holding partial pointer address
_list = new std::list<T*>; // unconsumed T* cache
int fd_[2];
int r = pipe(fd_);
_fdConsumer = fd_[0];
_fdProducer = fd_[1];
}
template<class T> MPSCQ<T>::~MPSCQ() { /* omitted */ }
template<class T> int MPSCQ<T>::producerPush(const T* t) {
return t == NULL ? 0 : write(_fdProducer, &t, _PTR_SIZE);
}
template<class T> T* MPSCQ<T>::consumerPoll(double timeout) {
T* t = _popFront();
if (t != NULL) {
return t;
}
if (_selectFdConsumer(timeout) <= 0) { // timeout or error
return NULL;
}
_consumeFd();
return _popFront();
}
template<class T> void MPSCQ<T>::_consumeFd() {
memcpy(_consumerBuf, _partial->data(), _partial->length());
ssize_t r = read(_fdConsumer, _consumerBuf, _CONSUMER_BUF_SIZE - _partial->length());
if (r <= 0) { // EOF or error, error handler omitted
return;
}
const char* p = _consumerBuf;
int remaining_len_ = _partial->length() + r;
T* t;
while (remaining_len_ >= _PTR_SIZE) {
memcpy(&t, p, _PTR_SIZE);
_list->push_back(t);
remaining_len_ -= _PTR_SIZE;
p += _PTR_SIZE;
}
*_partial = std::string(p, remaining_len_);
}
template<class T> int MPSCQ<T>::_selectFdConsumer(double timeout) {
int r;
int nfds_ = _fdConsumer + 1;
fd_set readfds_;
struct timeval timeout_;
int64_t usec_ = timeout * 1000000.0;
while (true) {
timeout_.tv_sec = usec_ / 1000000;
timeout_.tv_usec = usec_ % 1000000;
FD_ZERO(&readfds_);
FD_SET(_fdConsumer, &readfds_);
r = select(nfds_, &readfds_, NULL, NULL, &timeout_);
if (r < 0 && errno == EINTR) {
continue;
}
return r;
}
}
template<class T> T* MPSCQ<T>::_popFront() {
if (!_list->empty()) {
T* t = _list->front();
_list->pop_front();
return t;
} else {
return NULL;
}
}
// = = = = = test code below = = = = =
#define _LOOP_CNT 5000000
#define _ONE_MILLION 1000000
#define _PRODUCER_THREAD_NUM 2
struct TestMsg { // all public
int _threadId;
int _msgId;
int64_t _val;
TestMsg(int thread_id, int msg_id, int64_t val) :
_threadId(thread_id),
_msgId(msg_id),
_val(val) { };
};
static MPSCQ<TestMsg> _QUEUE;
static int64_t _SUM = 0;
void* functor_producer(void* arg) {
int my_thr_id_ = pthread_self();
TestMsg* msg_;
for (int i = 0; i <= _LOOP_CNT; ++ i) {
if (i == _LOOP_CNT) {
msg_ = new TestMsg(my_thr_id_, i, -1);
} else {
msg_ = new TestMsg(my_thr_id_, i, i + 1);
}
_QUEUE.producerPush(msg_);
}
return NULL;
}
void* functor_consumer(void* arg) {
int msg_cnt_ = 0;
int stop_cnt_ = 0;
TestMsg* msg_;
while (true) {
if ((msg_ = _QUEUE.consumerPoll()) == NULL) {
continue;
}
int64_t val_ = msg_->_val;
delete msg_;
if (val_ <= 0) {
if ((++ stop_cnt_) >= _PRODUCER_THREAD_NUM) {
printf("All done, _SUM=%ld\n", _SUM);
break;
}
} else {
_SUM += val_;
if ((++ msg_cnt_) % _ONE_MILLION == 0) {
printf("msg_cnt_=%d, _SUM=%ld\n", msg_cnt_, _SUM);
}
}
}
return NULL;
}
int main(int argc, char* const* argv) {
pthread_t consumer_;
pthread_create(&consumer_, NULL, functor_consumer, NULL);
pthread_t producers_[_PRODUCER_THREAD_NUM];
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_create(&producers_[i], NULL, functor_producer, NULL);
}
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_join(producers_[i], NULL);
}
pthread_join(consumer_, NULL);
return 0;
}
And here is test result ( 2 * sum(1..5000000) == (1 + 5000000) * 5000000 == 25000005000000 ):
$ g++ -o pipe_ipc_demo pipe_ipc_demo.cc -lpthread
$ ./pipe_ipc_demo ## output may vary except for the final _SUM
msg_cnt_=1000000, _SUM=251244261289
msg_cnt_=2000000, _SUM=1000708879236
msg_cnt_=3000000, _SUM=2250159002500
msg_cnt_=4000000, _SUM=4000785160225
msg_cnt_=5000000, _SUM=6251640644676
msg_cnt_=6000000, _SUM=9003167062500
msg_cnt_=7000000, _SUM=12252615629881
msg_cnt_=8000000, _SUM=16002380952516
msg_cnt_=9000000, _SUM=20252025092401
msg_cnt_=10000000, _SUM=25000005000000
All done, _SUM=25000005000000
The technique showed here is used in our production applications. One typical usage is the consumer thread acts as a log writer, and worker threads can write log messages almost asynchronously. Yes, almost means sometimes writer threads may be blocked in write() when pipe is full, and this is a reliable congestion control feature provided by OS.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Multithreading for image processing at GPU using CUDA - c++

Related

Thread with expensive operations slows down UI thread - Windows 10, C++

How to solve the problem that multithreaded drawing is not smooth?

OpenCL vs CUDA: Pinned memory

Figuring out a race condition

Does the use of an anonymous pipe introduce a memory barrier for interthread communication?

Categories

Resources