This is probably impossible, but I'm going to ask anyway. I have a multi-threaded server that receives requests on a thread dedicated to IP communications and passes them on to worker threads, and it has to send a reply back to the client the moment the work is finished, with as little delay as possible. Currently I'm using a producer/consumer pattern, placing replies on a queue for the IP thread to take off and send back to the client. However, this gives me no guarantee about WHEN the reply will actually go out: the IP thread might not get scheduled any time soon, and I cannot know when it will be. This makes my client, which is blocking on the call, think the request has failed, which is obviously not the point.
Since I'm unable to make changes to the client, I need to solve this sending issue on my side. The problem is that I don't want to start sharing my IP object (currently owned by a single thread) with the worker threads, as things then get overly complicated. Is there some thread-synchronization mechanism I can use to ensure that the moment a worker thread finishes, the IP thread sends the reply back to the client?
Will manual- or auto-reset events do this for me, or are they not guaranteed to wake the thread immediately?
If you need it sent immediately, your best bet is to bite the bullet and start sharing the connection object. Lock it before accessing it, of course, and be sure to think about what you'll do if the send buffer is already full (the connection thread will need to deal with sending the portion of the message that didn't fit the first time, or the worker thread will be blocked until the client accepts some of the data you've sent). This may not be too difficult if your clients only have one request running at a time; if that's the case you can simply pass ownership of the client object to the worker thread when it begins processing, and pass it back when you're done.
Another option is using real-time threads. The details vary between operating systems, but in most cases, if your thread has a high enough priority, it will be scheduled immediately when it becomes ready to run, and it will preempt all lower-priority threads until done. On Linux this can be done with the SCHED_RR scheduling class, for example. However, this can hurt overall performance in many cases, and can even effectively hang the system if the real-time thread gets stuck in an infinite loop. Using these scheduling classes also usually requires administrative rights.
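For illustration, a minimal sketch of raising a thread into SCHED_RR on Linux (the function name make_realtime is mine, not from the question; this usually needs root or the CAP_SYS_NICE capability):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Promote the given thread to the real-time round-robin scheduling class.
bool make_realtime(pthread_t thread, int priority)
{
    sched_param param{};
    param.sched_priority = priority;   // 1..99 on Linux for SCHED_RR
    int rc = pthread_setschedparam(thread, SCHED_RR, &param);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setschedparam failed: %d\n", rc);
        return false;                  // typically EPERM without privileges
    }
    return true;
}
```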
That said, if scheduling takes long enough that the client times out, you might have other problems with load. You should also put a number on how fast the response needs to be: there's no end of things you can do to speed up the response, but there comes a point where it no longer matters (do you need a response in tens of milliseconds? single-digit milliseconds? hundreds of microseconds? single-digit microseconds?).
There is no synchronization mechanism that will wake a thread immediately. When a synchronization mechanism for which a thread is waiting is signaled, the thread is placed in a ready queue for its priority class. It can be starved there for several seconds before it's scheduled (Windows does have mechanisms that deal with starvation over 3-4 second intervals).
I think that for out-of-band, critical communications you can have a higher-priority thread to which you enqueue the reply message and which you then wake up (with a condition variable, a manual-reset event, or any other synchronization mechanism). If that thread has a higher priority than the rest of your application's threads, waking it up will immediately effect a context switch.
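A minimal sketch of that pattern with a condition variable (the struct and its names are mine, not from the question; the priority boost itself is done separately, e.g. as in the SCHED_RR sketch above):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// A dedicated high-priority sender thread sleeps on the condition variable
// and wakes as soon as a worker enqueues a reply.
struct ReplyQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> replies;

    void push(std::string reply) {             // called by worker threads
        {
            std::lock_guard<std::mutex> lock(m);
            replies.push(std::move(reply));
        }
        cv.notify_one();                       // with elevated priority, the
    }                                          // sender runs almost at once

    std::string pop() {                        // called by the sender thread
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !replies.empty(); });
        std::string r = std::move(replies.front());
        replies.pop();
        return r;
    }
};
```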
I am writing an audio streamer (client-server) as a project of mine (C/C++), and I decided to make a multi-threaded UDP server for it. The logic behind this is that each client will be handled in its own thread. The problem I'm having is threads interfering with one another.
The first thing my server does is create a sort of thread pool: it creates 5 threads that all immediately block in a recvfrom() call. However, most of the time when I connect another device to the server, more than one thread responds, and later on that causes the server to block entirely and stop operating.
It's pretty difficult to debug as well, so I'm writing here to get some advice on how multi-threaded UDP servers are usually implemented. Should I use a mutex or semaphore somewhere in the code? If so, where? Any ideas would be extremely helpful.
Take a step back: you say
each client will be handled in his own thread
but UDP isn't connection-oriented. All clients send datagrams to the same server socket, so there is no natural way to decide which thread should handle a given packet.
If you're wedded to the idea that each client gets its own thread (which I would generally counsel against, but it may make sense here), you need some way to figure out which client each packet came from.
That means either
using TCP (since you seem to be trying for connection-oriented behaviour anyway)
reading each packet, figuring out which logical client connection it belongs to, and sending it to the right thread. Note that since the routing information is global/shared state, these two are equivalent:
keep a source IP -> thread mapping, protected by a mutex, read and updated from all threads
do all the reads in a single thread, use a local source IP -> thread mapping
The first seems to be what you're angling for, but it's poor design. When a packet comes in, you wake up one thread, which locks the mutex, does the lookup, and then potentially wakes another thread. The thread you want to handle the connection may also be blocked reading, so you need some mechanism to wake it.
The second at least gives a separation of concerns (read/dispatch vs. processing).
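A sketch of that second option, a single read/dispatch thread keeping a local mapping (ClientQueue and spawn_client_thread are hypothetical placeholders for your per-client handoff mechanism):

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <cstdint>
#include <map>

struct ClientQueue;                        // hypothetical per-client queue
ClientQueue* spawn_client_thread();        // hypothetical: starts a client thread
void enqueue(ClientQueue*, const char*, ssize_t);

// Key each logical "connection" by its source address.
struct AddrKey {
    uint32_t ip;
    uint16_t port;
    bool operator<(const AddrKey& o) const {
        return ip != o.ip ? ip < o.ip : port < o.port;
    }
};

void read_dispatch_loop(int sock)
{
    std::map<AddrKey, ClientQueue*> clients;   // local to this thread: no mutex
    char buf[1500];
    for (;;) {
        sockaddr_in peer{};
        socklen_t len = sizeof(peer);
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                             reinterpret_cast<sockaddr*>(&peer), &len);
        if (n <= 0)
            continue;
        AddrKey key{peer.sin_addr.s_addr, peer.sin_port};
        auto it = clients.find(key);
        if (it == clients.end())               // first packet from this client
            it = clients.emplace(key, spawn_client_thread()).first;
        enqueue(it->second, buf, n);           // hand the packet to its thread
    }
}
```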
Sensibly, your design should depend on
number of clients
I/O load
amount of non-I/O processing (or IO:CPU ratio, or ...)
The first thing my server does is create a sort of thread pool: it creates 5 threads that all immediately block in a recvfrom() call, though most of the time when I connect another device to the server, more than one thread responds, and later on that causes the server to block entirely and stop operating.
Rather than having all your threads sit in recvfrom() on the same socket, protect the socket with a semaphore and have your worker threads wait on the semaphore. When a thread acquires the semaphore, it can call recvfrom(); when that returns with a packet, the thread releases the semaphore (for another thread to acquire) and handles the packet itself. When it's done servicing the packet, it returns to waiting on the semaphore. This way you avoid having to transfer data between threads.
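A sketch of that worker loop, assuming a POSIX semaphore initialised to 1 and a handle_packet function of your own:

```cpp
#include <semaphore.h>
#include <sys/socket.h>
#include <netinet/in.h>

sem_t recv_sem;   // initialise once with sem_init(&recv_sem, 0, 1)

// Hypothetical: your per-packet processing.
void handle_packet(const char* data, ssize_t len, const sockaddr_in& peer);

void* worker(void* arg)
{
    int sock = *static_cast<int*>(arg);
    char buf[1500];
    for (;;) {
        sockaddr_in peer{};
        socklen_t peer_len = sizeof(peer);
        sem_wait(&recv_sem);                       // only one thread reads
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0,
                             reinterpret_cast<sockaddr*>(&peer), &peer_len);
        sem_post(&recv_sem);                       // let the next thread read
        if (n > 0)
            handle_packet(buf, n, peer);           // service outside the lock
    }
    return nullptr;
}
```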
Your recvfrom() should be in the master thread; when it gets data, pass the client's IP:port and the data on to the helper threads.
Passing the IP:port and data can be done by spawning a new thread every time the master thread receives a UDP packet, or by handing them to the helper threads through a message queue.
I think your main problem is that UDP is connectionless: it does not keep your "connections" alive, since each datagram stands on its own. Depending on your application, in the worst case you will have several threads concurrently trying to read the first available datagram, i.e. recvfrom() will unblock in whichever thread the scheduler picks, even when it is not that thread's "turn".
I think the way to go is to use select() in a single main thread and, with a concurrent buffer, manage which thread does what.
In this solution you can have one thread per client, or one thread per file, assuming you keep the necessary client state to make sure you're sending the right part of the file.
TCP is another way to do it, since it keeps a connection alive for every thread you run, but it is not the best transport for loss-tolerant applications like audio streaming.
My project has a queue, a server, and a timer. The server receives data and puts it in the queue, and the timer processes the queue. When the queue is processed, external processes are opened with popen(), which means popen() will block the timer until a process has ended.
Correct me if I'm wrong, but since both the server and the timer are linked to the same io_service, if the server receives data it will block the io_service from proceeding to the next event, and vice versa: the timer blocks the io_service while a queued process is being executed.
I'm thinking of a solution based on boost::thread, but I'm not sure what architecture I should use, as I have never used threads. My options are:
Two threads - one for the timer and one for the server, each one using its own io_service
One thread - one for the timer with its own io_service; the server remains in the main thread
Either way the queue (a simple map) must be shared, so I think I'll have some trouble with mutexes and other things.
If someone wants to take a look at the code, it is at https://github.com/MendelGusmao/CGI-for-LCD-Smartie
Thanks!
I don't see why you can't have your server listening for connections, processing data, and placing that data in the queue in one thread while your timer takes those items out of the queue in another thread and then spawns processes via popen() to process the queue data. Unless there is a detail here that I've missed, the socket that the server will be listening on (or pipe, FIFO, etc.), is separate from the pipe that will be internally opened by the libc runtime via popen(), so your server and timer threads won't be blocking each other. You'll simply have to make sure that you have enough space in the queue to store the data coming in from the server without overflowing memory (i.e., if this is a high-data-rate application, and data is coming in much faster than it's being processed, you'll eventually run out of memory).
Finally, while guarding a shared queue with a mutex is a good thing, it's actually unnecessary in a single-producer/single-consumer situation like the one you're describing, provided you use a bounded queue (i.e., a ring buffer). If you decide on an unbounded queue, there are some lockless algorithms out there, but they're pretty complex, so guarding an unbounded queue like std::queue<T> with a mutex is an absolute must.
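For illustration, a minimal lock-free single-producer/single-consumer ring buffer along those lines (a sketch, not production code; it is only safe with exactly one producer thread and one consumer thread, and N should be a power of two so the indices wrap cleanly):

```cpp
#include <atomic>
#include <array>
#include <cstddef>

template <typename T, size_t N>   // N must be a power of two
class SpscRing {
    std::array<T, N> buf_;
    std::atomic<size_t> head_{0};  // advanced only by the consumer
    std::atomic<size_t> tail_{0};  // advanced only by the producer
public:
    bool push(const T& v) {        // producer side
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                          // full: caller must retry
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {             // consumer side
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                          // empty
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```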
I have implemented almost exactly what you describe using Windows threads. I had my consumer wait on an event HANDLE that is fired by the producer when the queue gets too long. There was a timeout on the wait as well, so that if the queue was not filled fast enough the consumer would still wake up and process the queue. It was a Windows service, so the main thread was used for that. And yes, mutexes will be required to access the shared object.
So I used two threads (not including the main one), one mutex, and one shared object. I think two threads is your better option as well, as it keeps the logic cleaner: the main thread just starts the two threads and then waits (or can be used for signalling, control, and output), and the other two threads just do their own jobs.
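A sketch of that consumer wait (drain_queue stands in for your own queue processing, and the 100 ms timeout is illustrative, not from the answer):

```cpp
#include <windows.h>

// Auto-reset event from CreateEvent(nullptr, FALSE, FALSE, nullptr),
// signalled by the producer when the queue gets too long.
HANDLE queue_event;

void drain_queue();   // hypothetical: locks the mutex, processes the queue

DWORD WINAPI consumer(LPVOID)
{
    for (;;) {
        // Wake when the producer signals, or every 100 ms regardless, so a
        // slowly filling queue still gets processed.
        DWORD rc = WaitForSingleObject(queue_event, 100);
        if (rc == WAIT_OBJECT_0 || rc == WAIT_TIMEOUT)
            drain_queue();
    }
    return 0;   // unreachable; a real service would have an exit condition
}
```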
I'm writing a POSIX-compatible multi-threaded server in C/C++ that must be able to accept, read from, and write to a large number of connections asynchronously. The server has several worker threads which perform tasks and occasionally (and unpredictably) queue data to be written to the sockets. Data is also occasionally (and unpredictably) written to the sockets by the clients, so the server must also read asynchronously. One obvious way of doing this is to give each connection a thread which reads from and writes to its socket; this is ugly, though, since each connection may persist for a long time, and the server would then have to hold hundreds or thousands of threads just to keep track of the connections.
A better approach would be to have a single thread that handled all communications using the select()/pselect() functions. I.e., a single thread waits on any socket to be readable, then spawns a job to process the input that will be handled by a pool of other threads whenever input is available. Whenever the other worker threads produce output for a connection, it gets queued, and the communication thread waits for that socket to be writable before writing it.
The problem with this is that the communication thread may be waiting in select() or pselect() when output is queued by the server's worker threads. If no input arrives for several seconds or minutes, a queued chunk of output will just sit there waiting for the communication thread to finish select()ing. This shouldn't happen; data should be written as soon as possible.
Right now I see a few thread-safe solutions to this.
One is to have the communication thread busy-wait on input, updating the list of sockets it waits on for writing every tenth of a second or so. This isn't optimal, since it involves busy-waiting, but it will work.
Another option is to use pselect() and send the USR1 signal (or something equivalent) whenever new output has been queued, allowing the communication thread to update its list of sockets waiting for writable status immediately. I prefer this one, but I still dislike using a signal for something that should be a condition (pthread_cond_t).
Yet another option would be to include, in the list of file descriptors select() waits on, a dummy file that we write a single byte to whenever a socket needs to be added to the writable fd_set; that byte makes the dummy file readable and wakes the communication thread, which can then immediately update its writable fd_set.
Intuitively, I feel that the second approach (with the signal) is the "most correct" way to program the server, but I'm curious whether anyone knows which of the above is most efficient, whether any of them cause race conditions I'm not aware of, or whether there is a more general solution to this problem. What I really want is a pthread_cond_wait_and_select() function that lets the comm thread wait on both a change in the socket set and a signal from a condition.
Thanks in advance.
This is a fairly common problem.
One often used solution is to have pipes as a communication mechanism from worker threads back to the I/O thread. Having completed its task a worker thread writes the pointer to the result into the pipe. The I/O thread waits on the read end of the pipe along with other sockets and file descriptors and once the pipe is ready for read it wakes up, retrieves the pointer to the result and proceeds with pushing the result into the client connection in non-blocking mode.
Note that since pipe reads and writes of less than or equal to PIPE_BUF bytes are atomic, the pointers get written and read in one shot. You can even have multiple worker threads writing pointers into the same pipe because of this atomicity guarantee.
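A minimal sketch of that pipe mechanism (Result and send_to_client are stand-ins for your own types; wake_pipe is created once at startup with pipe()):

```cpp
#include <poll.h>
#include <unistd.h>

struct Result;                    // your own result type
void send_to_client(Result*);     // hypothetical: non-blocking push to client

int wake_pipe[2];                 // created once with pipe(wake_pipe)

void worker_publish(Result* r)    // called by worker threads
{
    // Writes of <= PIPE_BUF bytes are atomic, so the pointer arrives intact
    // even with several workers writing to the same pipe concurrently.
    write(wake_pipe[1], &r, sizeof(r));
}

void io_thread_loop(int listen_fd)
{
    pollfd fds[2] = {
        {listen_fd,    POLLIN, 0},
        {wake_pipe[0], POLLIN, 0},
    };
    for (;;) {
        poll(fds, 2, -1);                       // block until either is ready
        if (fds[1].revents & POLLIN) {
            Result* r;
            read(wake_pipe[0], &r, sizeof(r));  // retrieve the pointer
            send_to_client(r);
        }
        // ... handle listen_fd and the client sockets as usual ...
    }
}
```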
Unfortunately, the best way to do this is different for each platform. The canonical, portable way to do it is to have your I/O thread block in poll. If you need to get the I/O thread to leave poll, you send a single byte on a pipe that the thread is polling. That will cause the thread to exit from poll immediately.
On Linux, epoll is the best way. On BSD-derived operating systems (including OSX, I think), kqueue. On Solaris, it used to be /dev/poll and there's something else now whose name I forget.
You may just want to consider using a library like libevent or Boost.Asio. They give you the best I/O model on each platform they support.
Your second approach is the cleaner way to go. It's totally normal to have things like select or epoll include custom events in your list. This is what we do on my current project to handle such events. We also use timers (on Linux timerfd_create) for periodic events.
On Linux, eventfd() lets you create arbitrary user events for exactly this purpose, so I'd say it is quite accepted practice. If you are restricted to POSIX-only functions, a pipe or socketpair() will do the same job.
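A sketch of the eventfd variant (the integration points, such as rebuilding the writable fd set from your output queues, are assumed):

```cpp
#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>
#include <cstdint>

int evfd;   // created once at startup: evfd = eventfd(0, EFD_NONBLOCK);

void notify_output_queued()            // called by worker threads
{
    uint64_t one = 1;
    write(evfd, &one, sizeof(one));    // bumps the counter, wakes poll()
}

// In the I/O thread, add evfd to the poll set. When it becomes readable,
// drain it and rebuild the writable fd set from the output queues.
void drain_event()
{
    uint64_t count;
    read(evfd, &count, sizeof(count)); // resets the counter to zero
}
```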
Busy-polling is not a good option. First, you'll be scanning memory that is also used by other threads, causing cache contention. Second, you'll keep returning to your select() call, creating a huge number of system calls and context switches that will hurt overall system performance.
I have a thread-pool system which uses message passing to organize events, and I am also using the Windows API, which does a bit of message passing of its own. So essentially I need to check for the presence of messages on both queues without blocking: if I block while checking either queue (as GetMessage does, I think), I may miss incoming messages on the other.
The first solution I know of is to Sleep for a couple of milliseconds somewhere in my loop while peeking at both queues.
Another way I can think of is to add a thread, so that I have one for each loop I'm listening to. I'd make it responsible for nothing but running the Windows message loop, then have it process and forward any events to my own message queue to be handled there. But this won't work if Windows specifically sends the messages I'm interested in to the original thread.
Are there other good solutions?
Your requirement is a bit unclear, but I agree that Windows message queues are awkward in that only one thread can wait on them. Windows ties windows to threads: only the thread that creates a window can interact with it.
If you have user-defined messages that contain work to be processed by your thread pool, I suggest you do exactly what you propose in your question: use one thread to process all the Windows messages (a GetMessage() loop), requeue any work that turns up onto your thread pool's input queue, and handle "normal" Windows messages with the usual Translate/Dispatch mechanism.
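Something like this sketch, where WM_WORK_ITEM, WorkItem, and enqueue_to_pool are hypothetical names for your own message and pool queue:

```cpp
#include <windows.h>

const UINT WM_WORK_ITEM = WM_APP + 1;   // hypothetical user-defined message

struct WorkItem;                         // your work object
void enqueue_to_pool(WorkItem*);         // hypothetical thread-pool input queue

void message_pump()
{
    MSG msg;
    while (GetMessage(&msg, nullptr, 0, 0) > 0) {
        if (msg.message == WM_WORK_ITEM) {
            // The reference to the work object travels in lParam, as described.
            enqueue_to_pool(reinterpret_cast<WorkItem*>(msg.lParam));
        } else {
            TranslateMessage(&msg);      // normal Windows message handling
            DispatchMessage(&msg);
        }
    }
}
```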
If you need more help, could you describe more clearly the flow of Windows messages and/or work objects through your system? It is not obvious where the work for the thread pool comes from and how it is transported (when forced to use a Windows message queue, I usually PostMessage() a reference in wParam/lParam; what does your system do?).
Rgds,
Martin
Normally, a thread pool would not be involved in the Windows message loop, and blocking indefinitely when there is no work is not only allowable for a worker thread, but even desirable.
The most elegant way of implementing a thread pool that can receive messages via some kind of queue, which automatically keeps all CPU cores busy, and which as a bonus is very efficient, is using a completion port.
CreateIoCompletionPort, called with INVALID_HANDLE_VALUE and no existing port, will create a completion port and return its handle. Passing zero as NumberOfConcurrentThreads tells the operating system to keep as many threads running as there are cores available.
Create any number of worker threads (a few more than you have cores). In each worker, call GetQueuedCompletionStatus on that port handle with an INFINITE timeout; this blocks the workers indefinitely and associates them with the completion port.
Make a struct that has an OVERLAPPED as its first member, plus any data you want to hand over as a task (pointers to data, or anything).
For every task, set up one of your message structs and PostQueuedCompletionStatus it to the completion port. At application exit, post a null OVERLAPPED pointer. You can use the dwNumberOfBytesTransferred field (and the completion key) to pass additional information.
Now Windows will wake one thread for every message you posted, in last-in-first-out order, up to the number of cores available. If one of the workers blocks on I/O, Windows wakes another one for the next task (keeping the CPU busy as long as there is work to do).
After finishing a task, go back to GetQueuedCompletionStatus.
A way to gracefully terminate all workers is to post a sentinel, such as "zero bytes transferred" with a null OVERLAPPED, and have each worker re-post it and then exit when it encounters it.
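Putting the above together, a minimal sketch (the Task struct and its payload field are assumptions; error handling and task cleanup are omitted):

```cpp
#include <windows.h>

struct Task {
    OVERLAPPED ov;       // must be the first member, as described above
    void*      payload;  // whatever the task needs (assumed)
};

HANDLE port;             // created once at startup, see init_pool()

void init_pool()
{
    // Standalone port; 0 concurrency = one running thread per core.
    port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
}

DWORD WINAPI worker(LPVOID)
{
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* ov = nullptr;
        GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
        if (ov == nullptr) {                                  // shutdown sentinel
            PostQueuedCompletionStatus(port, 0, 0, nullptr);  // pass it on
            return 0;
        }
        Task* task = CONTAINING_RECORD(ov, Task, ov);
        // ... process task->payload, then free or recycle the task ...
    }
}

void submit(Task* task)   // called by the producer for each task
{
    PostQueuedCompletionStatus(port, 0, 0, &task->ov);
}
```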
I am not an expert on Windows message queues, but I am nearly certain there has to be an asynchronous, event-driven mechanism for message passing.
I have a UDP network application that reads packets sent to it and then processes them (same thread). The reads are non-blocking so I'm not using poll or select.
Packets received are grouped by sessions.
Work is governed by whether there is a session in progress. If there is no work to be done (there are no sessions, or there are no packets to process), then I need to spin.
I've been looking at the hybrid algorithm found here:
http://www.1024cores.net/home/lock-free-algorithms/tricks/spinning
I've been playing with it, but I'm told it's intended more for busy waits. What methods do you use to prevent unnecessary processing and needlessly high CPU usage?
EDIT:
Thanks for all the answers and comments.
I'm now doing the following: when it comes to reading from the network, I check whether there is other work to be done. If there is, I call poll() with a timeout of zero, then read as many packets as I can and place them in an in-memory queue for processing. If there is no other work, I poll indefinitely (i.e., with a timeout of -1). It seems to work well; CPU is only high when things are busy, otherwise it drops to zero.
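For reference, a sketch of that loop (have_pending_work, do_pending_work, and read_all_packets are placeholders for the logic described above):

```cpp
#include <poll.h>

bool have_pending_work();        // hypothetical: any session with work queued?
void do_pending_work();          // hypothetical: advance in-progress sessions
void read_all_packets(int sock); // hypothetical: drain socket into a queue

void event_loop(int sock)
{
    for (;;) {
        bool busy = have_pending_work();
        pollfd pfd = {sock, POLLIN, 0};
        // Timeout 0: just peek when there is other work to get back to.
        // Timeout -1: block until a packet arrives when we are idle.
        int ready = poll(&pfd, 1, busy ? 0 : -1);
        if (ready > 0 && (pfd.revents & POLLIN))
            read_all_packets(sock);
        if (busy)
            do_pending_work();
    }
}
```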
If you have nothing to do, you should be blocking - if not on the socket itself (i.e. if it's an event loop that processes more than one network socket or event type), then on a gate that gets signaled when something happens (the design depends on how your OS does async I/O).
Spinning is something you should only be doing when you're waiting for a very short period of time (usually only in kernel mode).
How many packets per second are you processing? How long does it take to process those packets? If you use blocking threads, what is the average CPU usage you get?
Unless CPU usage with blocking waits is close to 100%, where shaving a bit of overhead off the blocking itself can help, spinning will not improve performance but rather worsen it. By spinning you tie up a core that is then unavailable to run other code, possibly including the very code that feeds you work (i.e., the kernel code that reads from the network and passes packets up to your app); you burn resources without performing any work at all.
Note that when the article says it is harder to write blocking code than non-blocking spin waits, the author is not talking about operations for which the blocking version is implemented by the system, but rather about situations where one thread must wait on a condition triggered by other threads (a shared variable going above/below a limit, a flag being changed...).
Also, if the cost of checking the condition is high, then spinning incurs that cost on each and every iteration of the loop, and that may well exceed the cost of checking once and performing an expensive wait.
Remember that spinning is an active wait; it makes no sense to ask how to actively wait without consuming processor time, because the active-wait approach consumes processor time by definition. What can you do to avoid needless CPU usage? Use a blocking call to get the next packet. In the particular case of reading a UDP packet, I doubt that repeated non-blocking reads are cheaper in processing time than a single blocking read.
Again, think about the questions at the beginning, which can be summed up as: Is blocking proven to be the bottleneck? Is this a scenario where active waits can actually help?
Since you have to read from a socket, you can just do a blocking read. Without a packet, you have no reason to be running, right?
If there is more than one socket, then the blocking read won't work, so you need pselect() to monitor multiple descriptors.
Am I missing something obvious?
It occurs to me that you may have some long-running processing to do after you receive a datagram. If the reason you are going with non-blocking I/O is to avoid ignoring incoming traffic while working on a session, then the obvious thing to do is to fork() off each session. (Hmm, so I still think I must be missing something...)