Does the use of an anonymous pipe introduce a memory barrier for interthread communication? - c++

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
Also of interest is whether the results of socketpair() on Unix and self-connecting over TCP loopback on Windows have the same guarantees.
The context is a server design which centralizes event dispatch with select/epoll.

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.
If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?
No. There is no guarantee that the writing CPU will have flushed the write out of its cache and made it visible to the other CPU that might do the read.
Also of interest is whether the results of socketpair() on Unix and self-connecting over TCP loopback on Windows have the same guarantees.
No.

In practice, calling write(), which is a system call, will end up locking one or more data structures in the kernel, which should take care of the reordering issue. For example, POSIX requires reads issued after a write to see the data written before that call, which implies a lock (or some kind of acquire/release) by itself.
As for whether that is part of the formal specification of the calls, it probably is not.
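To make that concrete, here is a minimal sketch (my own, assuming POSIX pipe() and pthreads) of the hand-off the question describes; it works in practice for the reasons above, though the C++ standard itself does not formally promise it:
#include <unistd.h>
#include <pthread.h>
#include <cstdio>

struct Payload { int value; };
static int fds[2];                       // fds[0] = read end, fds[1] = write end

void* producer(void*) {
    Payload* p = new Payload{42};        // plain stores into the struct...
    write(fds[1], &p, sizeof p);         // ...then the pointer crosses the pipe
    return nullptr;
}

void* consumer(void*) {
    Payload* p = nullptr;
    read(fds[0], &p, sizeof p);          // blocks until the pointer arrives
    printf("%d\n", p->value);            // in practice prints 42
    delete p;
    return nullptr;
}

int main() {
    pipe(fds);
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, producer, nullptr);
    pthread_create(&t2, nullptr, consumer, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    return 0;
}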

A pointer is just a memory address, so provided you are in the same process, the pointer will be valid in the receiving thread and will point to the same struct. If you are in different processes, at best you will immediately get a memory error; at worst you will read (or write) random memory, which is essentially Undefined Behaviour.
Will you read the correct content? Neither better nor worse than if your pointer were in a static variable shared by both threads: you still have to do some synchronization if you want consistency.
Does the kind of transfer channel matter between static memory (shared by threads), anonymous pipes, socket pairs, TCP loopback, etc.? No: all those channels transfer bytes, so if you pass a memory address, you will get your memory address. What is left to you then is synchronization, because here you are just sharing a memory address.
If you do not use any other synchronization, anything can happen (did I already speak of Undefined Behaviour?):
the reading thread can access the memory before it has been written by the writing one, getting stale data
if you forgot to declare the struct members as volatile, the reading thread can keep using cached values, here again getting stale data
the reading thread can read partially written data, meaning incoherent data

Interesting question with, so far, only one correct answer from Cornstalks.
Within the same (multi-threaded) process there are no guarantees since pointer and data follow different paths to reach their destination.
Implicit acquire/release guarantees do not apply, since the struct data cannot piggyback on the pointer through the cache, and formally you are dealing with a data race.
However, looking at how the pointer and the struct data itself reach the second thread (through the pipe and the memory cache respectively), there is a real chance that this mechanism is not going to cause any harm.
Sending the pointer to a peer thread takes 3 system calls (write() in the sending thread, select() and read() in the receiving thread), which is (relatively) expensive, and by the time the pointer value is available in the receiving thread, the struct data has probably arrived long before.
Note that this is just an observation; the mechanism is still incorrect.

I believe your case might be reduced to this 2-thread model:
#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int*> atomicPtr{nullptr};
//...
void thread1()
{
    data = 42;                                           // plain store, ordered by the release below
    atomicPtr.store(&data, std::memory_order_release);
}
void thread2()
{
    int* ptr = nullptr;
    while (!ptr)
        ptr = atomicPtr.load(std::memory_order_consume);
    assert(*ptr == 42);                                  // the dependent load sees the store to data
}
Since you have 2 processes you can't use one atomic variable across them, but since you listed Windows you can omit the barrier on the consuming side (the atomicPtr.load(std::memory_order_consume) part), because, AFAIK, all the architectures Windows runs on guarantee this load to be correct without any barrier on the loading side. In fact, there are not many architectures out there where that instruction would not be a no-op (the only one I have heard of is DEC Alpha).
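If portability matters, a simple variant (my sketch) strengthens the consuming side to an acquire load, which pairs with the release store and costs nothing extra on x86:
void thread2()
{
    int* ptr = nullptr;
    while (!ptr)
        ptr = atomicPtr.load(std::memory_order_acquire); // synchronizes-with the release store in thread1
    assert(*ptr == 42);                                  // now guaranteed on every architecture
}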

I agree with Serge Ballesta's answer. Within the same process, it's feasible to send and receive an object's address via an anonymous pipe.
Since the write() system call is guaranteed to be atomic when the message size is below PIPE_BUF (normally 4096 bytes), multi-producer threads will not mess up each other's object addresses (8 bytes for 64-bit applications).
Talk is cheap, so here is the demo code for Linux (defensive code and error handlers are omitted for simplicity). Just copy & paste it into pipe_ipc_demo.cc, then compile & run the test.
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <stdint.h>
#include <sys/select.h>
#include <pthread.h>
#include <string>
#include <list>
template<class T> class MPSCQ { // pipe based Multi Producer Single Consumer Queue
public:
MPSCQ();
~MPSCQ();
int producerPush(const T* t);
T* consumerPoll(double timeout = 1.0);
private:
void _consumeFd();
int _selectFdConsumer(double timeout);
T* _popFront();
private:
int _fdProducer;
int _fdConsumer;
char* _consumerBuf;
std::string* _partial;
std::list<T*>* _list;
static const int _PTR_SIZE;
static const int _CONSUMER_BUF_SIZE;
};
template<class T> const int MPSCQ<T>::_PTR_SIZE = sizeof(void*);
template<class T> const int MPSCQ<T>::_CONSUMER_BUF_SIZE = 1024;
template<class T> MPSCQ<T>::MPSCQ() :
_fdProducer(-1),
_fdConsumer(-1) {
_consumerBuf = new char[_CONSUMER_BUF_SIZE];
_partial = new std::string; // for holding partial pointer address
_list = new std::list<T*>; // unconsumed T* cache
int fd_[2];
int r = pipe(fd_);
_fdConsumer = fd_[0];
_fdProducer = fd_[1];
}
template<class T> MPSCQ<T>::~MPSCQ() { /* omitted */ }
template<class T> int MPSCQ<T>::producerPush(const T* t) {
return t == NULL ? 0 : write(_fdProducer, &t, _PTR_SIZE);
}
template<class T> T* MPSCQ<T>::consumerPoll(double timeout) {
T* t = _popFront();
if (t != NULL) {
return t;
}
if (_selectFdConsumer(timeout) <= 0) { // timeout or error
return NULL;
}
_consumeFd();
return _popFront();
}
template<class T> void MPSCQ<T>::_consumeFd() {
memcpy(_consumerBuf, _partial->data(), _partial->length());
ssize_t r = read(_fdConsumer, _consumerBuf + _partial->length(), _CONSUMER_BUF_SIZE - _partial->length()); // append after the partial bytes carried over
if (r <= 0) { // EOF or error, error handler omitted
return;
}
const char* p = _consumerBuf;
int remaining_len_ = _partial->length() + r;
T* t;
while (remaining_len_ >= _PTR_SIZE) {
memcpy(&t, p, _PTR_SIZE);
_list->push_back(t);
remaining_len_ -= _PTR_SIZE;
p += _PTR_SIZE;
}
*_partial = std::string(p, remaining_len_);
}
template<class T> int MPSCQ<T>::_selectFdConsumer(double timeout) {
int r;
int nfds_ = _fdConsumer + 1;
fd_set readfds_;
struct timeval timeout_;
int64_t usec_ = timeout * 1000000.0;
while (true) {
timeout_.tv_sec = usec_ / 1000000;
timeout_.tv_usec = usec_ % 1000000;
FD_ZERO(&readfds_);
FD_SET(_fdConsumer, &readfds_);
r = select(nfds_, &readfds_, NULL, NULL, &timeout_);
if (r < 0 && errno == EINTR) {
continue;
}
return r;
}
}
template<class T> T* MPSCQ<T>::_popFront() {
if (!_list->empty()) {
T* t = _list->front();
_list->pop_front();
return t;
} else {
return NULL;
}
}
// = = = = = test code below = = = = =
#define _LOOP_CNT 5000000
#define _ONE_MILLION 1000000
#define _PRODUCER_THREAD_NUM 2
struct TestMsg { // all public
int _threadId;
int _msgId;
int64_t _val;
TestMsg(int thread_id, int msg_id, int64_t val) :
_threadId(thread_id),
_msgId(msg_id),
_val(val) { };
};
static MPSCQ<TestMsg> _QUEUE;
static int64_t _SUM = 0;
void* functor_producer(void* arg) {
int my_thr_id_ = (int)pthread_self(); // truncated thread id, used only as a label
TestMsg* msg_;
for (int i = 0; i <= _LOOP_CNT; ++ i) {
if (i == _LOOP_CNT) {
msg_ = new TestMsg(my_thr_id_, i, -1);
} else {
msg_ = new TestMsg(my_thr_id_, i, i + 1);
}
_QUEUE.producerPush(msg_);
}
return NULL;
}
void* functor_consumer(void* arg) {
int msg_cnt_ = 0;
int stop_cnt_ = 0;
TestMsg* msg_;
while (true) {
if ((msg_ = _QUEUE.consumerPoll()) == NULL) {
continue;
}
int64_t val_ = msg_->_val;
delete msg_;
if (val_ <= 0) {
if ((++ stop_cnt_) >= _PRODUCER_THREAD_NUM) {
printf("All done, _SUM=%ld\n", _SUM);
break;
}
} else {
_SUM += val_;
if ((++ msg_cnt_) % _ONE_MILLION == 0) {
printf("msg_cnt_=%d, _SUM=%ld\n", msg_cnt_, _SUM);
}
}
}
return NULL;
}
int main(int argc, char* const* argv) {
pthread_t consumer_;
pthread_create(&consumer_, NULL, functor_consumer, NULL);
pthread_t producers_[_PRODUCER_THREAD_NUM];
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_create(&producers_[i], NULL, functor_producer, NULL);
}
for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
pthread_join(producers_[i], NULL);
}
pthread_join(consumer_, NULL);
return 0;
}
And here is the test result (2 * sum(1..5000000) == (1 + 5000000) * 5000000 == 25000005000000):
$ g++ -o pipe_ipc_demo pipe_ipc_demo.cc -lpthread
$ ./pipe_ipc_demo ## output may vary except for the final _SUM
msg_cnt_=1000000, _SUM=251244261289
msg_cnt_=2000000, _SUM=1000708879236
msg_cnt_=3000000, _SUM=2250159002500
msg_cnt_=4000000, _SUM=4000785160225
msg_cnt_=5000000, _SUM=6251640644676
msg_cnt_=6000000, _SUM=9003167062500
msg_cnt_=7000000, _SUM=12252615629881
msg_cnt_=8000000, _SUM=16002380952516
msg_cnt_=9000000, _SUM=20252025092401
msg_cnt_=10000000, _SUM=25000005000000
All done, _SUM=25000005000000
The technique shown here is used in our production applications. One typical usage is a consumer thread acting as a log writer, with worker threads writing log messages almost asynchronously. Yes, almost: sometimes writer threads may be blocked in write() when the pipe is full, and this is a reliable congestion-control feature provided by the OS.
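As an illustration of that log-writer usage, here is a hedged sketch built on the MPSCQ above (LogMsg, worker_log and functor_log_writer are hypothetical names for this example, not our production code):
struct LogMsg {
    std::string _text;
};
static MPSCQ<LogMsg> _LOG_QUEUE;

void worker_log(const std::string& text) {          // called from any worker thread
    _LOG_QUEUE.producerPush(new LogMsg{text});      // blocks only when the pipe is full
}

void* functor_log_writer(void*) {                   // the single consumer thread
    while (true) {
        LogMsg* msg_ = _LOG_QUEUE.consumerPoll();   // 1 second timeout by default
        if (msg_ == NULL) {
            continue;                               // timeout: a chance to check a stop flag
        }
        fputs(msg_->_text.c_str(), stderr);         // the actual (slow) I/O happens only here
        delete msg_;
    }
    return NULL;
}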

Related

Thread with expensive operations slows down UI thread - Windows 10, C++

The Problem: I have two threads in a Windows 10 application I'm working on, a UI thread (called the render thread in the code) and a worker thread in the background (called the simulate thread in the code). Every couple of seconds or so, the background thread has to perform a very expensive operation that involves allocating a large amount of memory. For some reason, when this operation happens, the UI thread lags for a split second and becomes unresponsive (this is seen in the application as the camera not moving for a second while camera movement input is being given).
Maybe I'm misunderstanding something about how threads work on Windows, but I wasn't aware that this was something that should happen. I was under the impression that you use a separate UI thread for this very reason: to keep it responsive while other threads do more time-intensive operations.
Things I've tried: I've removed all communication between the two threads, so there are no mutexes or anything of that sort (unless there's something implicit that Windows does that I'm not aware of). I have also tried setting the UI thread to be a higher priority than the background thread. Neither of these helped.
Some things I've noted: While the UI thread lags for a moment, other applications running on my machine are just as responsive as ever. The heavy operation seems to only affect this one process. Also, if I decrease the amount of memory being allocated, it alleviates the issue (however, for the application to work as I want it to, it needs to be able to do this allocation).
The question: My question is two-fold. First, I'd like to understand why this is happening, as it seems to go against my understanding of how multi-threading should work. Second, do you have any recommendations or ideas on how to fix this and get it so the UI doesn't lag.
Abbreviated code: Note the comment about epochs in timeline.h
main.cpp
#include "Renderer/Headers/Renderer.h"
#include "Shared/Headers/Timeline.h"
#include "Simulator/Simulator.h"
#include <iostream>
#include <Windows.h>
#include <process.h> // _beginthreadex
unsigned int __stdcall renderThread(void* timelinePtr);
unsigned int __stdcall simulateThread(void* timelinePtr);
int main() {
Timeline timeline;
HANDLE renderHandle = (HANDLE)_beginthreadex(0, 0, &renderThread, &timeline, 0, 0);
if (renderHandle == 0) {
std::cerr << "There was an error creating the render thread" << std::endl;
return -1;
}
SetThreadPriority(renderHandle, THREAD_PRIORITY_HIGHEST);
HANDLE simulateHandle = (HANDLE)_beginthreadex(0, 0, &simulateThread, &timeline, 0, 0);
if (simulateHandle == 0) {
std::cerr << "There was an error creating the simulate thread" << std::endl;
return -1;
}
SetThreadPriority(simulateHandle, THREAD_PRIORITY_IDLE);
WaitForSingleObject(renderHandle, INFINITE);
WaitForSingleObject(simulateHandle, INFINITE);
return 0;
}
unsigned int __stdcall renderThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Renderer renderer = Renderer(timeline);
renderer.run();
return 0;
}
unsigned int __stdcall simulateThread(void* timelinePtr) {
Timeline& timeline = *((Timeline*)timelinePtr);
Simulator simulator(timeline);
simulator.run();
return 0;
}
simulator.cpp
// abbreviated
void Simulator::run() {
while (true) {
// abbreviated
timeline->push(latestState);
}
}
// abbreviated
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>
class Timeline {
public:
Timeline();
bool tryGetStateAtFrame(int frame, WorldState*& worldState);
void push(WorldState* worldState);
private:
// The concept of an Epoch was introduced to help reduce mutex conflicts, but right now since the threads are disconnected, there should be no mutex locks at all on the UI thread. However, every 1024 pushes onto the timeline, a new Epoch must be created. The amount of slowdown largely depends on how much memory the WorldState class takes. If I make WorldState small, there isn't a noticeable hiccup, but when it is large, it becomes noticeable.
class Epoch {
public:
static const int MAX_SIZE = 1024;
void push(WorldState* worldstate);
int getSize();
WorldState* getAt(int index);
private:
int size = 0;
WorldState states[MAX_SIZE];
};
Epoch* pushEpoch;
std::mutex lock;
std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
#include "../Headers/Timeline.h"
#include <iostream>
Timeline::Timeline() {
pushEpoch = new Epoch();
}
bool Timeline::tryGetStateAtFrame(int frame, WorldState*& worldState) {
if (!lock.try_lock()) {
return false;
}
if (frame >= epochs.size() * Epoch::MAX_SIZE) {
lock.unlock();
return false;
}
worldState = epochs.at(frame / Epoch::MAX_SIZE)->getAt(frame % Epoch::MAX_SIZE);
lock.unlock();
return true;
}
void Timeline::push(WorldState* worldState) {
pushEpoch->push(worldState);
if (pushEpoch->getSize() == Epoch::MAX_SIZE) {
lock.lock();
epochs.push_back(pushEpoch);
lock.unlock();
pushEpoch = new Epoch();
}
}
void Timeline::Epoch::push(WorldState* worldState) {
if (this->size == this->MAX_SIZE) {
throw std::out_of_range("Pushed too many items to Epoch without clearing");
}
this->states[this->size] = *worldState;
this->size++;
}
int Timeline::Epoch::getSize() {
return this->size;
}
WorldState* Timeline::Epoch::getAt(int index) {
if (index >= this->size) {
throw std::out_of_range("Tried accessing nonexistent element of epoch");
}
return &(this->states[index]);
}
Renderer.cpp: loops to call Presenter::update() and some OpenGL rendering tasks.
Presenter.cpp
// abbreviated
void Presenter::update() {
camera->update();
// timeline->tryGetStateAtFrame(Time::getFrames(), worldState); // Normally this would cause a potential mutex conflict, but for now I have it commented out. This is the only place that anything on the UI thread accesses timeline.
}
// abbreviated
Any help/suggestions?
I ended up figuring this out!
So as it turns out, the new operator in C++ is thread-safe, which means a single allocation must run to completion before other threads can get through the allocator. Why was that a problem in my case? Well, when an Epoch was being initialized, it had to initialize an array of 1024 WorldStates, each of which has 10,000 CellStates that need to be initialized, and each of those had an array of 16 items that needed to be initialized, so we ended up with over 100,000,000 objects needing to be initialized before the new operator could return. That was taking long enough that it caused the UI to hiccup while it was waiting.
The solution was to create a factory function that builds the pieces of the Epoch piecemeal, one constructor at a time, then combines them and returns a pointer to the new Epoch.
timeline.h
#ifndef TIMELINE_H
#define TIMELINE_H
#include "WorldState.h"
#include <mutex>
#include <vector>
class Timeline {
public:
Timeline();
bool tryGetStateAtFrame(int frame, WorldState*& worldState);
void push(WorldState* worldState);
private:
class Epoch {
public:
static const int MAX_SIZE = 1024;
static Epoch* createNew();
void push(WorldState* worldstate);
int getSize();
WorldState* getAt(int index);
private:
Epoch();
int size = 0;
WorldState* states[MAX_SIZE];
};
Epoch* pushEpoch;
std::mutex lock;
std::vector<Epoch*> epochs;
};
#endif // !TIMELINE_H
timeline.cpp
Timeline::Epoch* Timeline::Epoch::createNew() {
Epoch* epoch = new Epoch();
for (unsigned int i = 0; i < MAX_SIZE; i++) {
epoch->states[i] = new WorldState();
}
return epoch;
}

How to get a reliable memory usage information for a 64-bit process from a 32-bit process?

My goal is to get memory usage information for an arbitrary process. I do the following from my 32-bit process:
HANDLE hProc = ::OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION | PROCESS_VM_READ, 0, pid);
if(hProc)
{
PROCESS_MEMORY_COUNTERS_EX pmx = {0};
if(::GetProcessMemoryInfo(hProc, (PROCESS_MEMORY_COUNTERS*)&pmx, sizeof(pmx)))
{
wprintf(L"Working set: %.02f MB\n", pmx.WorkingSetSize / (1024.0 * 1024.0));
wprintf(L"Private bytes: %.02f MB\n", pmx.PrivateUsage / (1024.0 * 1024.0));
}
::CloseHandle(hProc);
}
The issue is that if the pid process is a 64-bit process, it may have allocated more than 4GB of memory, which will overflow both pmx.WorkingSetSize and pmx.PrivateUsage, which are both 32-bit variables in a 32-bit process. So in that case, instead of failing, GetProcessMemoryInfo succeeds with both metrics returned as UINT_MAX -- which is wrong!
So I was wondering if there is a reliable API to retrieve memory usage from an arbitrary process in a 32-bit application?
There's a reliable API called Performance Data Helper (PDH).
Windows' stock perfmon utility is the classic example of a Windows Performance Counter application. Process Explorer also uses it for collecting process statistics.
Its advantage is that you don't even need the SeDebugPrivilege to gain PROCESS_VM_READ access to other processes.
Note though that access is limited to users being part of the Performance Monitoring Users group.
The idea behind PDH is:
A Query object
One or multiple counters
Create samples upon request or periodically
Fetch the data you have asked for
It's a bit more work to get started, but still easy in the end. What I do is set up a permanent PDH query, such that I can reuse it throughout the lifetime of my application.
There's one drawback: by default, the operating system creates numbered entries for processes having the same name, and those numbered entries even change as processes terminate or new ones get created. So you have to account for this and cross-check the process ID (PID), ideally while holding a handle open to the process(es) you want to obtain memory usage for.
Below you find a simple PDH wrapper alternative to GetProcessMemoryInfo().
Of course there's plenty of room to tweak the following code or adjust it to your needs. I have also seen people who created more generic C++ wrappers already.
Declaration
#include <tuple>
#include <array>
#include <vector>
#include <stdint.h>
#include <Windows.h>
#include <Pdh.h>
#pragma comment(lib, "Pdh.lib")
class process_memory_info
{
private:
using pd_t = std::tuple<DWORD, ULONGLONG, ULONGLONG>; // performance data type
static constexpr size_t pidIdx = 0;
static constexpr size_t wsIdx = 1;
static constexpr size_t pbIdx = 2;
struct less_pd
{
bool operator ()(const pd_t& left, const pd_t& right) const
{
return std::get<pidIdx>(left) < std::get<pidIdx>(right);
}
};
public:
~process_memory_info();
bool setup_query();
bool take_sample();
std::pair<uintmax_t, uintmax_t> get_memory_info(DWORD pid) const;
private:
PDH_HQUERY pdhQuery_ = nullptr;
std::array<PDH_HCOUNTER, std::tuple_size_v<pd_t>> pdhCounters_ = {};
std::vector<pd_t> perfData_;
};
Implementation
#include <memory>
#include <execution>
#include <algorithm>
#include <stdlib.h>
using std::unique_ptr;
using std::pair;
using std::array;
using std::make_unique;
using std::get;
process_memory_info::~process_memory_info()
{
PdhCloseQuery(pdhQuery_);
}
bool process_memory_info::setup_query()
{
if (pdhQuery_)
return true;
if (PdhOpenQuery(nullptr, 0, &pdhQuery_))
return false;
size_t i = 0;
for (auto& counterPath : array<PDH_COUNTER_PATH_ELEMENTS, std::tuple_size_v<pd_t>>{ {
{ nullptr, L"Process", L"*", nullptr, 0, L"ID Process" },
{ nullptr, L"Process", L"*", nullptr, 0, L"Working Set" },
{ nullptr, L"Process", L"*", nullptr, 0, L"Private Bytes" }
}})
{
wchar_t pathStr[PDH_MAX_COUNTER_PATH] = {};
DWORD size;
PdhMakeCounterPath(&counterPath, pathStr, &(size = _countof(pathStr)), 0);
PdhAddEnglishCounter(pdhQuery_, pathStr, 0, &pdhCounters_[i++]);
}
return true;
}
bool process_memory_info::take_sample()
{
if (PdhCollectQueryData(pdhQuery_))
return false;
DWORD nItems = 0;
DWORD size;
PdhGetFormattedCounterArray(pdhCounters_[0], PDH_FMT_LONG, &(size = 0), &nItems, nullptr);
auto valuesBuf = make_unique<BYTE[]>(size);
PdhGetFormattedCounterArray(pdhCounters_[0], PDH_FMT_LONG, &size, &nItems, PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.get()));
unique_ptr<PDH_FMT_COUNTERVALUE_ITEM[]> pidValues{ PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.release()) };
valuesBuf = make_unique<BYTE[]>(size);
PdhGetFormattedCounterArray(pdhCounters_[1], PDH_FMT_LARGE, &size, &nItems, PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.get()));
unique_ptr<PDH_FMT_COUNTERVALUE_ITEM[]> wsValues{ PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.release()) };
valuesBuf = make_unique<BYTE[]>(size);
PdhGetFormattedCounterArray(pdhCounters_[2], PDH_FMT_LARGE, &size, &nItems, PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.get()));
unique_ptr<PDH_FMT_COUNTERVALUE_ITEM[]> pbValues{ PPDH_FMT_COUNTERVALUE_ITEM(valuesBuf.release()) };
perfData_.clear();
perfData_.reserve(nItems);
for (size_t i = 0, n = nItems; i < n; ++i)
{
perfData_.emplace_back(pidValues[i].FmtValue.longValue, wsValues[i].FmtValue.largeValue, pbValues[i].FmtValue.largeValue);
}
std::sort(std::execution::par_unseq, perfData_.begin(), perfData_.end(), less_pd{});
return true;
}
pair<uintmax_t, uintmax_t> process_memory_info::get_memory_info(DWORD pid) const
{
auto it = std::lower_bound(perfData_.cbegin(), perfData_.cend(), pd_t{ pid, 0, 0 }, less_pd{});
if (it != perfData_.cend() && get<pidIdx>(*it) == pid)
return { get<wsIdx>(*it), get<pbIdx>(*it) };
else
return {};
}
int main()
{
process_memory_info pmi;
pmi.setup_query();
DWORD pid = 4;
pmi.take_sample();
auto[workingSet, privateBytes] = pmi.get_memory_info(pid);
return 0;
}
Why not compile this application as 64-bit? Then you should be able to collect the memory usage of both 32-bit and 64-bit processes.
The WMI Win32_Process provider has quite a few 64-bit memory numbers. Not sure if everything you are after is there or not.
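For what it's worth, here is a minimal sketch of reading those Win32_Process counters from C++ via WMI (error handling and CoSetProxyBlanket are omitted for brevity, and pid 4 is hard-coded to mirror the PDH example; note that WMI hands uint64 properties back as BSTR strings):
#include <Windows.h>
#include <comdef.h>
#include <Wbemidl.h>
#include <cstdio>
#pragma comment(lib, "wbemuuid.lib")

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    CoInitializeSecurity(nullptr, -1, nullptr, nullptr, RPC_C_AUTHN_LEVEL_DEFAULT,
                         RPC_C_IMP_LEVEL_IMPERSONATE, nullptr, EOAC_NONE, nullptr);
    IWbemLocator* locator = nullptr;
    CoCreateInstance(CLSID_WbemLocator, nullptr, CLSCTX_INPROC_SERVER,
                     IID_IWbemLocator, (LPVOID*)&locator);
    IWbemServices* services = nullptr;
    locator->ConnectServer(_bstr_t(L"ROOT\\CIMV2"), nullptr, nullptr, nullptr,
                           0, nullptr, nullptr, &services);
    IEnumWbemClassObject* results = nullptr;
    services->ExecQuery(_bstr_t(L"WQL"),
        _bstr_t(L"SELECT WorkingSetSize, PrivatePageCount FROM Win32_Process WHERE ProcessId = 4"),
        WBEM_FLAG_FORWARD_ONLY | WBEM_FLAG_RETURN_IMMEDIATELY, nullptr, &results);
    IWbemClassObject* obj = nullptr;
    ULONG returned = 0;
    while (results->Next(WBEM_INFINITE, 1, &obj, &returned) == S_OK) {
        VARIANT ws, pb;
        obj->Get(L"WorkingSetSize", 0, &ws, nullptr, nullptr);   // uint64, delivered as BSTR
        obj->Get(L"PrivatePageCount", 0, &pb, nullptr, nullptr); // uint64, delivered as BSTR
        wprintf(L"Working set: %s bytes, private: %s bytes\n", ws.bstrVal, pb.bstrVal);
        VariantClear(&ws);
        VariantClear(&pb);
        obj->Release();
    }
    results->Release();
    services->Release();
    locator->Release();
    CoUninitialize();
    return 0;
}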

Reading all of a struct or nothing from a pipe/socket in Linux?

I've got a subprocess that I've popened that outputs fixed-sized structs containing some status information. My plan is to have a separate thread that reads from the stdout of that process to pull in the data as it comes.
I've got to check a flag periodically to make sure the program is still running so I can shut down cleanly, so I have to set the pipe to non-blocking and just have to run a loop piecing together the status message.
Is there a canonical way to tell Linux "either read this entire amount or nothing before a timeout"? That way I'll be able to check my flag, but I won't have to handle the boilerplate of reading the structure piecemeal.
Alternatively, is there a way to push data back into a pipe? I could try to read the whole thing, and if it times out before it's all ready, push what I have back in and try again in a bit.
I've also written my own popen (so I can grab both stdin and stdout), so I'm totally OK with using a socket rather than a pipe if that helps.
Here's what I ended up doing for anyone that's curious. I just wrote a class that wraps up the file descriptor and message size and gives me the "all-or-none" behavior I want.
// requires <unistd.h> for ::read() and <cerrno> for errno
struct aonreader {
    aonreader(int fd, ssize_t size) {
        fd_ = fd;
        size_ = size;      // was 'size_ = size_;', a self-assignment bug
        nread_ = 0;
        nremain_ = size_;
    }
    ssize_t read(void *dst) {
        // ::read calls the system call rather than recursing into this method
        ssize_t ngot = ::read(fd_, (char*)dst + nread_, nremain_);
        if (ngot < 0) {
            if (errno != EAGAIN && errno != EWOULDBLOCK) {
                return -1; // error
            }
        } else {
            nread_ += ngot;
            nremain_ -= ngot;
            // if we read a whole struct
            if (nremain_ == 0) {
                nread_ = 0;
                nremain_ = size_;
                return size_;
            }
        }
        return 0;
    }
private:
    int fd_;
    ssize_t size_;
    ssize_t nread_;
    ssize_t nremain_;
};
Which can then be used something like this:
thing_you_want thing;
aonreader buffer(fd, sizeof(thing_you_want));
while (running) {
ssize_t ngot = buffer.read(&thing);
if (ngot == sizeof(thing_you_want)) {
<handle thing>
} else if (ngot < 0) {
<error, handle errno>
}
<otherwise loop and check running flag>
}

thread_specific_ptr multithread confusion

// code snippet 1
static boost::thread_specific_ptr<StreamX> StreamThreadSpecificPtr;
void thread_proc() {
StreamX * stream = NULL;
stream = StreamThreadSpecificPtr.get();
if (NULL == stream) {
stream = new StreamX();
StreamThreadSpecificPtr.reset(stream);
}
printf("%p\n", stream);
}
int run() {
boost::thread_group threads;
for(int i = 0; i < 5; i ++) {
threads.create_thread(&thread_proc);
}
threads.join_all();
}
// the result is
0x50d560 -- SAME POINTER
0x50d540
0x50bfc0
0x50bef0
0x50d560 -- SAME POINTER
// code snippet 2
static boost::thread_specific_ptr<StreamX> StreamThreadSpecificPtr(NULL); // DIFF from code snippet 1
void thread_proc() {
StreamX * stream = NULL;
stream = StreamThreadSpecificPtr.get();
if (NULL == stream) {
stream = new StreamX();
StreamThreadSpecificPtr.reset(stream);
}
printf("%p\n", stream);
}
int run() {
boost::thread_group threads;
for(int i = 0; i < 5; i ++) {
threads.create_thread(&thread_proc);
}
threads.join_all();
}
// the result is
0x50d510
0x50d4f0
0x50bf70
0x50ca70
0x50be50
In code snippet 1, two pointers are the same, which is not expected.
In code snippet 2, with StreamThreadSpecificPtr initialized to NULL, everything seems good.
Could you please help me figure out the answer to this confusion? Thanks a lot.
The joy is that your threads are actually terminating asynchronously, destructing the StreamX instances.
Using a detector:
struct StreamX
{
StreamX() { puts(__FUNCTION__); }
~StreamX() { puts(__FUNCTION__); }
};
I get the following output:
StreamX
0x7f258c0008c0
~StreamX
StreamX
0x7f25740008c0
~StreamX
StreamX
0x7f25840008c0
~StreamX
StreamX
0x7f25780008c0
StreamX
~StreamX
0x7f257c0008c0
~StreamX
real 0m0.002s
user 0m0.000s
sys 0m0.004s
It makes sense for subsequent allocations to reuse the same heap addresses, since there isn't much fragmentation involved. In other words, you can't just compare pointers to see whether they alias the same object in a concurrent application.
The difference with the second example is spurious. There are many factors that can - and will - influence the result; e.g., adding a tiny delay at the end of each thread will remove all opportunity for a thread to terminate before the other instances have been created.
See it Live On Coliru
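For instance, the tiny delay mentioned above could look like this sketch (assuming a Boost version with sleep_for, 1.50 or later); it keeps each StreamX alive long enough that no later thread can recycle its heap address before all five have printed:
#include <boost/thread.hpp>

void thread_proc() {
    StreamX* stream = StreamThreadSpecificPtr.get();
    if (NULL == stream) {
        stream = new StreamX();
        StreamThreadSpecificPtr.reset(stream);
    }
    printf("%p\n", stream);
    // hold this thread (and its StreamX) alive briefly so its heap address
    // cannot be reused by a thread that starts after this one would have exited
    boost::this_thread::sleep_for(boost::chrono::milliseconds(50));
}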

memory leak with sockets and map

I have a socket server; every time a new connection is made, an XClient class is instantiated and inserted into a map. I am watching the memory usage through Task Manager. Every time a new connection is made, let's assume the memory usage of my program increases by 800kb, for example. Inside that class there is a connected variable, which tells me whether this client is active or not. I created a thread that runs endlessly, iterates through all the elements of my map, and checks whether the connected variable is true or false; if it is false, I am (at least I think I am...) releasing the memory used by the previously instantiated XClient class. BUT the memory usage decreases by only half of the 800kb (for example; no precise values). So, when a client connects: +800kb; when a client disconnects: -400kb. I think I have a memory leak? If I had 100 clients connected, the 400kb per client that is not being released would turn into 40000kb of unreleased memory, and that would be a problem.
So, here is my code.
The thread to iterate through all elements:
DWORD Update(XSockets *sockets)
{
while(true)
{
for(sockets->it = sockets->clients.begin(); sockets->it != sockets->clients.end(); sockets->it++)
{
int key = (*sockets->it).first;
if(sockets->clients[key]->connected == false) // remove the client, releasing memory
{
delete sockets->clients[key];
}
}
Sleep(100);
}
return true;
}
The code that is adding new XClients instances to my map:
bool XSockets::AcceptConnections()
{
struct sockaddr_in from;
while(true)
{
try
{
int fromLen = sizeof(from);
SOCKET client = accept(this->loginSocket,(struct sockaddr*)&from,&fromLen);
if(client != INVALID_SOCKET)
{
srand(time(NULL));
int clientKey = rand();
XClient* clientClass = new XClient(inet_ntoa(from.sin_addr),clientKey,client);
this->clients.insert(make_pair(clientKey,clientClass));
}
Sleep(100);
}
catch(...)
{
printf("error accepting incoming connection!\r\n");
break;
}
}
closesocket(this->loginSocket);
WSACleanup();
return true;
}
And the declarations:
map<int,XClient*> clients;
map<int,XClient*>::iterator it;
You've got several problems, but the chief one is that you appear to be sharing a map between threads without any synchronization at all. That can lead to all kinds of trouble.
Are you using C++11 or Boost? To avoid memory-leak nightmares like this, you could create a map of shared pointers. This way, you can let the structure clean itself up.
This is how I would do it:
#include <memory>
#include <map>
#include <algorithm>
#include <functional>
#include <mutex>
typedef std::shared_ptr<XClient> XClientPtr;
std::map<int, XClientPtr> clients;
std::mutex the_lock;
bool XSockets::AcceptConnections()
{
/* snip */
auto clientClass = std::make_shared<XClient>(/*... params ...*/);
the_lock.lock();
clients[clientKey] = clientClass;
the_lock.unlock();
/* snip */
}
bool client_is_connected(const std::pair<int, XClientPtr> &p) {
return p.second->connected;
}
DWORD Update(XSockets *sockets) {
while(true) { /* You should probably have some kind of
exit condition here. Like a global "running" bool
so that the thread will eventually stop. */
the_lock.lock();
auto it = sockets->clients.begin(), end = sockets->clients.end();
for(; it != end; ) {
if (!it->second->connected)
//Clients will be destructed here if their refcount goes to 0
sockets->clients.erase(it++);
else
++it;
}
the_lock.unlock();
Sleep(100);
}
return 1;
}
Note: Above code is untested. I haven't even tried to compile it.
See What happens to an STL iterator after erasing it in VS, UNIX/Linux?. In your case, you are not deleting every element, so you can't use a plain for loop.
sockets->it = sockets->clients.begin();
while (sockets->it != sockets->clients.end())
{
int key = (*sockets->it).first;
if(sockets->clients[key]->connected == false) // remove the client, releasing memory
{
delete sockets->clients[key];
sockets->clients.erase(sockets->it++);
}
else
{
sockets->it++;
}
}