Streaming images with ZMQ, message_t allocation takes too much time - c++

I've been trying to find out how to stream images with ZeroMQ (I'm using the cppzmq wrapper, but raw API answers are fine). Naively, I set up:
zmq::context_t ctx(4);
zmq::socket_t pub_image_socket(ctx, zmq::socket_type::pub);
pub_image_socket.bind("tcp://127.0.0.1:8001");
...
while(true){
//render to image...
zmq::message_t image_message(image_size.x*image_size.y*element_size);
copy_image_to(image, image_message);
pub_image_socket.send(image_message, zmq::send_flags::none);
}
I thought that maybe some other part of the zmq chain was taking a bunch of time, so I did the following (in a debug build):
//zmq::context_t ctx(4);
//zmq::socket_t pub_image_socket(ctx, zmq::socket_type::pub);
//pub_image_socket.bind("tcp://127.0.0.1:8001");
...
while(true){
//render to image...
zmq::message_t image_message(image_size.x*image_size.y*element_size);
//commented out so only message creation left
//copy_image_to(image, image_message);
//pub_image_socket.send(image_message, zmq::send_flags::none);
}
There was no change in the runtime (and the Visual Studio profiler could not actually tell me what the slowdown was).
So then I decided to do this:
zmq::context_t ctx(4);
zmq::socket_t pub_image_socket(ctx, zmq::socket_type::pub);
pub_image_socket.bind("tcp://127.0.0.1:8001");
...
while(true){
//render to image...
//zmq::message_t image_message(image_size.x*image_size.y*element_size);
//commented out so only message creation left
//copy_image_to(image, image_message);
//pub_image_socket.send(image_message, zmq::send_flags::none);
}
Commenting out the image_message construction made the loop run more than twice as fast. I'm not sure what I can do about this, though. Theoretically I could avoid the cost by reusing the allocated memory in a message_t, but zmq prohibits this. Allocating megabytes per frame just takes too much time.
I'm starting to think data streaming is impossible for ZeroMQ due to this limitation, and I should just use asio and tcp/udp sockets instead. Is there a way to avoid this massive reallocation cost in ZMQ?

Use zmq_msg_init_data
http://api.zeromq.org/master:zmq-msg-init-data
You can provide the pointer and size of memory you have already allocated, and ZeroMQ will take ownership of it (skipping the extra allocation). Once the message has been processed and is no longer needed, ZeroMQ calls the associated free function, where your own code can clean up.
I have used this approach in the past with a memory pool / circular buffer and it worked well.
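A minimal sketch of the pool side of this approach, kept free of ZeroMQ headers so it compiles on its own. FramePool and release_frame are hypothetical names, not part of ZeroMQ; the only thing taken from the library is the free-function signature, void (void* data, void* hint), which is what zmq_msg_init_data and the zero-copy zmq::message_t constructor expect. The trailing comment shows how the pieces would plug into the loop from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of equally sized frame buffers (not part of ZeroMQ).
class FramePool {
public:
    FramePool(std::size_t count, std::size_t frame_bytes)
        : storage_(count, std::vector<unsigned char>(frame_bytes)) {
        for (auto& frame : storage_) free_.push_back(frame.data());
    }

    // Returns a free buffer, or nullptr if every buffer is still in flight.
    void* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        unsigned char* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Puts a buffer back on the free list.
    void release(void* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(static_cast<unsigned char*>(p));
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::vector<unsigned char>> storage_;  // owns the memory
    std::vector<unsigned char*> free_;                 // buffers not in flight
};

// Same signature as zmq_free_fn. ZeroMQ may call it from one of its I/O
// threads, which is why the pool is protected by a mutex.
void release_frame(void* data, void* hint) {
    assert(hint != nullptr);
    static_cast<FramePool*>(hint)->release(data);
}

// Per frame, with a pool that outlives the socket and the context:
//   void* buf = pool.acquire();          // skip the frame if this is nullptr
//   /* render or copy the image into buf */
//   zmq::message_t msg(buf, frame_bytes, release_frame, &pool);
//   pub_image_socket.send(msg, zmq::send_flags::none);
```

The pool has to hold at least as many buffers as can be in flight at once; when acquire() returns nullptr the sender is outrunning the socket and has to drop or delay the frame.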

Related

I need help figuring out tcp sockets (clsocket)

I am having trouble figuring out sockets. I am just asking the server for data at a position (a glm::i64vec4) and expecting a response, but the position is way off when I get the response, and the data for that position reflects that (my voxel game makes a kind of cool-looking but useless mess).
It's probably just me not understanding sockets at all, or maybe something weird with this library.
One thought I had was that it might have something to do with mismatched blocking and non-blocking modes on the server and client,
but when I switched the server to blocking (and put each client in a separate thread from each other and from the accepting process) it changed nothing.
If I'm doing something really stupid, please tell me; I know next to nothing about sockets.
Here is some code that probably looks horrible:
Server Code
std::deque <CActiveSocket*> clients;
CPassiveSocket socket;
socket.Initialize();
socket.SetNonblocking();//I'm doing this so i don't need multiple threads for clients
socket.Listen("0.0.0.0",port);
while (1){
{
CActiveSocket* c;
if ((c = socket.Accept()) != NULL){
clients.emplace_back(c);
}
}
for (CActiveSocket*& c : clients){
c->Receive(sizeof(glm::i64vec4));
if (c->GetBytesReceived() == sizeof(glm::i64vec4)){
chkpkt chk;
chk.pos = *(glm::i64vec4*)c->GetData();
LOOP3D(chksize+2){
chk.data(i,j,k).val = chk.pos.y*chksize+j;
chk.data(i,j,k).id=0;
}
while (c->Send((uint8*)&chk,sizeof(chkpkt)) != sizeof(chkpkt)){}
}
}
}
Client Code
//v is a glm::i64vec4
//fsock is set to Blocking
if(fsock.Send((uint8*)&v,sizeof(glm::i64vec4)))
if (fsock.Receive(sizeof(chkpkt))){
tthread::lock_guard<tthread::fast_mutex> lock(wld->filemut);
wld->ichks[v]=(*(chkpkt*)fsock.GetData()).data;//i tried using the position i get back from the server to set this (instead of v) but that made it to where nothing loaded
//i checked it and the chunks position never lines up with what i sent
}
Without your complete application code, it's unrealistic to suggest corrections to particular lines.
It seems you are using this library. Even if not, it doesn't matter much: in network programming, the quirks of socket behavior make some problems more or less universal. So here are a few suggestions for the socket portion of your project:
Blocking sockets are sufficient.
A socket read often behaves in a surprising way: it may not deliver the requested number of bytes in one call. Because of this, you need to call read repeatedly until the whole expected message has been received. For a complete and robust solution, see Stevens's readn routine ([Ref.1], page 122).
If you are using exactly the library mentioned above, you can see that fsock.Receive eventually calls recv. And recv is just a variant of read [Ref.2], so the same solution applies to both. This pattern might help:
while(fsock.Receive(sizeof(chkpkt))>0)
{
// ...
}
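The "keep reading until the whole message has arrived" idea can be sketched independently of any socket library. In the sketch below, recv_some is a hypothetical stand-in for whatever single-read call the library offers (it is not a clsocket function); it is assumed to return the number of bytes it wrote, 0 when the peer closed the connection, and a negative value on error. This is an illustration in the spirit of readn, not Stevens's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical single-read callback: writes up to `max` bytes to `dst` and
// returns the count, 0 if the peer closed, or a negative value on error.
using RecvSome = std::function<long(unsigned char* dst, std::size_t max)>;

// Keeps reading until exactly `size` bytes have arrived.
// Returns false if the connection closed or failed before that.
bool recv_exact(const RecvSome& recv_some, void* out, std::size_t size) {
    assert(out != nullptr || size == 0);
    auto* dst = static_cast<unsigned char*>(out);
    std::size_t got = 0;
    while (got < size) {
        long n = recv_some(dst + got, size - got);
        if (n <= 0) return false;            // closed or error: give up
        got += static_cast<std::size_t>(n);  // partial read: ask for the rest
    }
    return true;
}
```

Both sides need this: the server would use it to collect a full glm::i64vec4 request, and the client to collect a full chkpkt, before interpreting the bytes.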
Ref.1: https://mathcs.clarku.edu/~jbreecher/cs280/UNIX%20Network%20Programming(Volume1,3rd).pdf
Ref.2: https://man7.org/linux/man-pages/man2/recv.2.html#DESCRIPTION

AT command response parser

I am working on my own implementation to read AT command responses from a modem using a microcontroller and C/C++.
But (there is always a but) I have two "threads" in my program. In the first one I compare the possible reply from the modem using strcmp, which I believe is terribly slow.
Comparing function:
if (strcmp(reply, m_buffer) == 0)
{
memset(buffer, 0, buffer_size);
buffer_size = 0;
memset(m_buffer, 0, m_buffer_size);
m_buffer_size = 0;
return 0;
}
else
return 1;
This works fine for AT commands like AT or AT+CPIN?, where the last response from the modem is "OK" with nothing in the middle, but it does not work with commands like AT+CREG?, which responds with:
+REG: n,n
OK
I am expecting "+REG: n,n", but I believe strncpy is very slow and my buffer data gets replaced by "OK".
The second "thread" enables a UART RX interrupt and replaces my buffer data every time it receives new data.
Interrupt handler:
m_buffer_size = buffer_size;
strncpy(m_buffer, buffer, buffer_size + m_buffer_size);
Do you know anything out there faster than strcmp, or something to improve reading the AT command responses?
This has the scent of an XY Problem.
If you have seen the buffer contents being overwritten, you might want to look into a thread-safe queue to deliver messages from the RX thread to the parsing thread. That way even if a second message arrives while you're processing the first, you won't run into "buffer overwrite" problems.
Move the data out of the receive buffer and place it in another buffer. Two buffers is rarely enough, so create a pool of buffers. In the past I have used linked lists of pre-allocated buffers to keep fragmentation down, but depending on the memory management and caching smarts in your microcontroller, and the language you elect to use, something along the lines of std::deque may be a better choice.
So:
Make a list of free buffers.
The UART handling thread's loop looks something like this:
  1. Get a buffer from the free list.
  2. Read into the buffer until full or timeout.
  3. Pass the buffer to the parser.
  4. The parser puts the buffer in its own receive list.
  5. The parser sends a signal to wake up its thread.
Repeat until terminated. If the free list is emptied, your program is probably still too slow to keep up. Perhaps adding more buffers will allow the program to get through a busy period, but if the data flow is relatively constant and the free list empties out... Well, you have a problem.
The parser loop, which also repeats until terminated, looks like this:
  If the receive list is not empty:
    Get a buffer from the receive list.
    Process the buffer.
    Return the buffer to the free list.
  Otherwise:
    Sleep.
Remember to protect the lists from concurrent access by the different threads. C11 and C++11 have a number of useful tools to assist you here.
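The two lists described above can be sketched with C++11 primitives. All names here (Buffer, BufferList, deliver, take_line) are hypothetical, the 64-byte buffer size is arbitrary, and the sketch assumes the receiver runs in a real thread. Code that runs inside an actual interrupt handler usually cannot take a mutex, so there an interrupt-safe or lock-free variant would be needed instead.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

struct Buffer {
    std::array<char, 64> data{};
    std::size_t length = 0;
};

// One mutex-protected list of buffers; used for both the free list
// and the parser's receive list.
class BufferList {
public:
    void push(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(b);
        }
        ready_.notify_one();  // wake a thread sleeping in wait_pop()
    }
    // Non-blocking: returns nullptr when the list is empty.
    Buffer* try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return nullptr;
        Buffer* b = items_.front();
        items_.pop_front();
        return b;
    }
    // Blocking: sleeps until a buffer is available.
    Buffer* wait_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        Buffer* b = items_.front();
        items_.pop_front();
        return b;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Buffer*> items_;
};

// Receiver side: take a free buffer, fill it, hand it to the parser.
// Returns false when the free list is exhausted (the parser is too slow).
bool deliver(BufferList& free_list, BufferList& received,
             const char* text, std::size_t n) {
    Buffer* b = free_list.try_pop();
    if (b == nullptr) return false;
    assert(n <= b->data.size());
    std::copy_n(text, n, b->data.begin());
    b->length = n;
    received.push(b);
    return true;
}

// Parser side: wait for a buffer, copy out its contents, recycle the buffer.
std::string take_line(BufferList& free_list, BufferList& received) {
    Buffer* b = received.wait_pop();
    std::string line(b->data.data(), b->length);
    free_list.push(b);
    return line;
}
```

Because each response line travels in its own buffer, a later "OK" can no longer overwrite an earlier line that the parser has not looked at yet.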

waveOutWrite buffers are never returned to application

I have a problem with Microsoft's WaveOut API:
Edit 1: added a link to a sample project.
Edit 2: removed the link; it's not representative of the issue.
After playing some audio, when I want to terminate a given playback stream, I call the function:
waveOutClose(hWaveOut_);
However, even after waveOutClose() is called, sometimes the library will still access memory previously passed to it by waveOutWrite(), causing an invalid memory access.
I then tried to ensure all the buffers are marked as done before freeing the buffer:
PcmPlayback::~PcmPlayback()
{
if(hWaveOut_ == nullptr)
return;
waveOutReset(hWaveOut_); // infinite-loops, never returns
for(auto it = buffers_.begin(); it != buffers_.end(); ++it)
waveOutUnprepareHeader(hWaveOut_, &it->wavehdr_, sizeof(WAVEHDR));
while( buffers_.empty() == false ) // infinite loops
removeCompletedBuffers();
waveOutClose(hWaveOut_);
//Unhandled exception at 0x75629E80 (msvcrt.dll) in app.exe:
// 0xC0000005: Access violation reading location 0xFEEEFEEE.
}
void PcmPlayback::removeCompletedBuffers()
{
for(auto it = buffers_.begin(); it != buffers_.end();)
{
if( it->wavehdr_.dwFlags & WHDR_DONE )
{
waveOutUnprepareHeader(hWaveOut_, &it->wavehdr_, sizeof(WAVEHDR));
it = buffers_.erase(it);
}
else
++it;
}
}
However, this situation never happens - the buffer never becomes empty. There will be 4-5 blocks remaining with wavehdr_.dwFlags == 18 (I believe this means the blocks are still marked as in playback)
How can I resolve this issue?
@Martin Schlott ("Can you provide the loop where you write the buffer to waveOutWrite?")
It's not quite a loop; instead I have a function that is called whenever I receive an audio packet over the network:
void PcmPlayback::addData(const std::vector<short> &rhs)
{
removeCompletedBuffers();
if(rhs.empty())
return;
// add new data
buffers_.push_back(Buffer());
Buffer & buffer = buffers_.back();
buffer.data_ = rhs;
ZeroMemory(&buffers_.back().wavehdr_, sizeof(WAVEHDR));
buffer.wavehdr_.dwBufferLength = buffer.data_.size() * sizeof(short);
buffer.wavehdr_.lpData = (char *)(buffer.data_.data());
waveOutPrepareHeader(hWaveOut_, &buffer.wavehdr_, sizeof(WAVEHDR)); // prepare block for playback
waveOutWrite(hWaveOut_, &buffer.wavehdr_, sizeof(WAVEHDR));
}
The described behavior can happen if you do not call waveOutUnprepareHeader on every buffer you used before you call waveOutClose.
The dwFlags field seems to indicate that the buffers are still enqueued (WHDR_INQUEUE | WHDR_PREPARED). Try calling waveOutReset before unpreparing the buffers.
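The value 18 reported in the question can be decoded bit by bit. The sketch below repeats the WHDR_* flag values from the Windows SDK header mmsystem.h under local names (the k-prefixed constants and describe_wavehdr_flags are hypothetical helpers), so that it compiles without the Windows headers.

```cpp
#include <cassert>
#include <string>

// Values of the WHDR_* flags from mmsystem.h, repeated under local names.
constexpr unsigned long kWhdrDone      = 0x01;
constexpr unsigned long kWhdrPrepared  = 0x02;
constexpr unsigned long kWhdrBeginLoop = 0x04;
constexpr unsigned long kWhdrEndLoop   = 0x08;
constexpr unsigned long kWhdrInQueue   = 0x10;

// 18 decimal is 0x12, i.e. prepared and still queued, but not done.
static_assert((kWhdrPrepared | kWhdrInQueue) == 18, "0x02 | 0x10 == 18");

// Turns a dwFlags value into a readable list of flag names.
std::string describe_wavehdr_flags(unsigned long flags) {
    std::string out;
    auto add = [&](unsigned long bit, const char* name) {
        if (flags & bit) {
            if (!out.empty()) out += " | ";
            out += name;
        }
    };
    add(kWhdrDone, "WHDR_DONE");
    add(kWhdrPrepared, "WHDR_PREPARED");
    add(kWhdrBeginLoop, "WHDR_BEGINLOOP");
    add(kWhdrEndLoop, "WHDR_ENDLOOP");
    add(kWhdrInQueue, "WHDR_INQUEUE");
    return out.empty() ? "0" : out;
}
```

Logging dwFlags this way for each remaining block makes it easy to tell a block that was never played (still in the queue) from one that finished but was never unprepared.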
After analyzing your code, I found two problems/bugs which are not related to waveOut (funny, you use C++11 but the oldest media interface). You use a vector as a buffer, and during some calls the vector is copied! One bug I found is:
typedef std::function<void(std::vector<short>)> CALLBACK_FN;
instead of:
typedef std::function<void(std::vector<short>&)> CALLBACK_FN;
which forces a copy of the vector.
Try to avoid using vectors if you expect to use them mostly as raw buffers. Better to use std::unique_ptr as the buffer pointer.
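To illustrate why the by-value signature matters whenever a raw pointer into the vector is kept (as lpData is in a WAVEHDR), the sketch below compares the storage seen inside a by-value callback with the caller's storage. The alias and function names are hypothetical, and the const reference is a variant of the non-const reference suggested above.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using ByValue    = std::function<void(std::vector<short>)>;
using ByConstRef = std::function<void(const std::vector<short>&)>;

// True if a by-value callback sees the caller's own storage.
bool by_value_shares_storage(const std::vector<short>& samples) {
    assert(!samples.empty());  // empty vectors have no storage to compare
    bool same = true;
    ByValue cb = [&](std::vector<short> v) {
        same = (v.data() == samples.data());  // v is a fresh copy
    };
    cb(samples);
    return same;
}

// True if a by-reference callback sees the caller's own storage.
bool by_ref_shares_storage(const std::vector<short>& samples) {
    assert(!samples.empty());
    bool same = false;
    ByConstRef cb = [&](const std::vector<short>& v) {
        same = (v.data() == samples.data());  // v aliases the argument
    };
    cb(samples);
    return same;
}
```

For a non-empty vector the first function returns false and the second returns true: a pointer taken from the by-value parameter refers to a temporary copy that is destroyed when the callback returns.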
Your callback in the recorder is not protected by a mutex, nor does it check whether a destructor has already been called. The destruction (mostly) happens during the callback, which leads to an exception.
For your test program, go back to raw pointers and static callbacks before blaming waveOut. Your code is not bad, but the first bug already shows that a small bug will lead to unpredictable errors. As you also organize your buffers in a std::array, I would search for bugs there. I guess you make an unintentional copy of your whole buffer array and unprepare the wrong buffers.
I did not have the time to dig deeper, but I guess those are the problems.
I managed to find my problem in the end, it was caused by multiple bugs and a deadlock. I will document what happened here so people can learn from this in the future
I was clued in to what was happening when I fixed the bugs in the sample:
call waveInStop() before waveInClose() in ~Recorder.cpp
wait for all buffers to have the WHDR_DONE flag before calling waveOutClose() in ~PcmPlayback.
After doing this, the sample worked fine and did not display the behavior of the WHDR_DONE flag never being marked.
In my main program, that behavior was caused by a deadlock that occurs in the following situation:
I have a vector of objects representing each peer I am streaming audio with
Each Object owns a Playback class
This vector is protected by a mutex
Recorder callback:
  mutex.lock()
  send audio packet to each peer
Remove Peer:
  mutex.lock()
  ~PcmPlayback
    wait for WHDR_DONE flags to be marked
A deadlock occurs when I remove a peer, which locks the mutex, and the recorder callback then tries to acquire the lock too.
Note that this will happen often, because the playback buffer is usually about 4 * 20 ms long while the recorder has a cadence of 20 ms.
In ~PcmPlayback, the buffers will never be marked as WHDR_DONE and any calls to the WaveOut API will never return because the WaveOut API is waiting for the Recorder callback to complete, which is in turn waiting on mutex.lock(), causing a deadlock.
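The answer does not say how the deadlock was finally removed. One common way to avoid this kind of lock-then-wait cycle, shown here only as a sketch under assumed names (Peer, PeerList and on_destroy are hypothetical), is to detach the object from the shared container while holding the mutex and let its slow destructor run only after the mutex has been released.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

// Stand-in for a peer owning a playback object. on_destroy represents the
// slow wait that happens in the real destructor.
struct Peer {
    int id = 0;
    std::function<void()> on_destroy;
    ~Peer() { if (on_destroy) on_destroy(); }
};

class PeerList {
public:
    void add(std::unique_ptr<Peer> p) {
        assert(p != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        peers_.push_back(std::move(p));
    }

    // Takes the lock, like the recorder callback that visits every peer.
    std::size_t count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return peers_.size();
    }

    bool remove(int id) {
        std::unique_ptr<Peer> doomed;  // destroyed when remove() returns
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = std::find_if(peers_.begin(), peers_.end(),
                [id](const std::unique_ptr<Peer>& p) { return p->id == id; });
            if (it == peers_.end()) return false;
            doomed = std::move(*it);   // detach only; no destructor runs here
            peers_.erase(it);
        }
        // The mutex is released at this point, so the destructor of `doomed`
        // can wait as long as it needs without blocking other lock holders.
        return true;
    }

private:
    std::mutex mutex_;
    std::vector<std::unique_ptr<Peer>> peers_;
};
```

With this ordering, a callback that needs the mutex while the destructor is waiting can still acquire it, so the wait inside the destructor can eventually finish.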

Restarting Streaming OpenAL Source?

Why does my streaming OpenAL source sometimes go to the AL_STOPPED state, forcing me to call alSourcePlay? This usually happens when I do not call send fast enough, e.g. in debug mode. Does the OpenAL source automatically stop when it doesn't have enough queued buffers? How do I avoid that?
void send(audio_buffer audio) override
{
ALenum state;
alGetSourcei(source_, AL_SOURCE_STATE,&state);
if(state != AL_PLAYING)
alSourcePlay(source_); // This happens sometimes, usually when "send" is not called fast enough.
ALuint buffer = 0;
alSourceUnqueueBuffers(source_, 1, &buffer);
if(buffer)
{
alBufferData(buffer, AL_FORMAT_STEREO16, audio.data(), static_cast<ALsizei>(audio.size()*sizeof(int16_t)), 48000);
alSourceQueueBuffers(source_, 1, &buffer);
}
else
LOG << "Dropped audio.";
}
It sounds like your basic problem is that your audio stream is starved: an OpenAL source stops once it has played all of its queued buffers. There are a few options you can use to mitigate this, but they all have their own side effects:
(1) You can configure it to play from a looping buffer, to which you are supplying the relevant data. The downside to this is that it will audibly repeat itself if you starve the buffer too long, but it will have some better performance characteristics (fragmentation, etc).
(2) You can increase the send buffer size. This will only cover up small problems, and potentially increases the latency in dynamic content.
(3) Finally, you can thread the audio send operation, that way so long as the audio thread isn't starved, it can continue to send data in the background.
A high-quality production solution probably involves all three of these. Sorry for the lack of OpenAL-specific terminology, but every audio system I've seen has these capabilities.

Pooling PBOs and textures?

I have an application which does a lot of GPGPU using OpenGL and Pixel Buffer Objects to transfer and process data.
Currently I pool these resources: basically, I have a pool for every buffer dimension and usage that my application uses. When a resource is no longer in use, it returns to its respective pool for re-use. However, I'm starting to have second thoughts about whether there is any merit in this, since I need to "orphan" the PBOs before re-use so as not to interfere with ongoing transfers.
My question is whether there is any merit in pooling resources such as PBOs and textures, or would it be just as good to simply allocate from OpenGL directly when needed?
Here is an example of what I am doing. Vice versa with textures.
std::shared_ptr<pbo> create_pbo(int size, bool write)
{
auto pool = pbo_pools[write][size];
std::shared_ptr<pbo> buffer;
if(!pool->try_pop(buffer))
buffer = ogl_thread_.invoke([=]{return new pbo(size, write);});
return spl::shared_ptr<pbo>(buffer.get(), [=](pbo*) mutable
{
ogl_thread_.begin_invoke([=]() mutable
{
if(write)
buffer->map();
else // read
buffer->unmap();
pool->push(buffer);
});
});
}
I'm starting to have second thoughts about whether there is any merit in this, since I need to "orphan" the PBOs before re-use so as not to interfere with ongoing transfers.
No, you don't have to. That's the nice thing about PBOs: You can submit new data into them, while a call to glTex(Sub)Image may still be reading from them, without the read operation being corrupted.