Allowing connections given the number of threads in server - c++

Every connection requires its own thread, and for now we allow only a certain number of connections per period. So every time a user connects, we increment the counter if we are still within a certain period of the last time we set the check time:
1. Get current_time = time(0).
2. If current_time is OUTSIDE a certain period from check_time, set counter = 0 and check_time = current_time.
3. (Otherwise, leave both as they are.)
4. If counter < LIMIT, do counter++ and return TRUE.
5. Otherwise return FALSE.
But this is independent of how many threads are actually running in the server, so I'm looking for a way to allow connections depending on that number.
The problem is that we're using a third-party API for the connections, and we don't know exactly how long a connection will last. At first I thought of creating a child thread that runs ps and passes the result to the parent thread, but that seems likely to take more time, since I'd have to parse the output to get the total number of threads, etc. I'm not sure if I'm making any sense. I'm using C++, by the way. Do you have any suggestions as to how I could implement the new checking method? It would be very much appreciated.
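For reference, the five steps of the current check can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the class name, `limit` and `period_seconds` are hypothetical, and the current time is passed in as a parameter so the logic can be exercised without waiting.

```cpp
#include <ctime>

// Fixed-window connection limiter following the five steps above.
// ConnectionLimiter, limit and period_seconds are illustrative names.
class ConnectionLimiter {
public:
    ConnectionLimiter(int limit, std::time_t period_seconds)
        : limit_(limit), period_(period_seconds) {}

    // Returns true if a new connection may be accepted at time `now`.
    // Real code would typically pass std::time(nullptr).
    bool allow(std::time_t now) {
        if (now - check_time_ >= period_) {  // outside the current period
            counter_ = 0;
            check_time_ = now;
        }
        if (counter_ < limit_) {
            ++counter_;
            return true;
        }
        return false;
    }

private:
    int limit_;
    std::time_t period_;
    std::time_t check_time_ = 0;
    int counter_ = 0;
};
```

Note that this sketch is not thread-safe; if several threads accept connections, allow() would need a mutex around it.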

There will be a /proc/[pid]/task (since Linux 2.6.0-test6) directory for every thread belonging to process [pid]. Look at man proc for documentation. Assuming you know the pid of your thread pool you could just count those directories.
You could use boost::filesystem to do that from c++, as described here:
How do I count the number of files in a directory using boost::filesystem?
I assumed you are using Linux.
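A minimal sketch of the directory-counting idea, assuming Linux and a C++17 compiler. It uses std::filesystem rather than boost::filesystem (the two interfaces are very similar); the function name count_threads is made up for illustration.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <system_error>

// Counts the entries of /proc/<pid>/task, one per thread (Linux only).
// `pid` is a decimal process ID, or "self" for the calling process.
// Returns 0 if the directory cannot be read (e.g. the process has exited).
std::size_t count_threads(const std::string& pid) {
    namespace fs = std::filesystem;
    std::error_code ec;
    fs::directory_iterator it("/proc/" + pid + "/task", ec);
    if (ec) return 0;
    std::size_t n = 0;
    for (const fs::directory_iterator end; !ec && it != end; it.increment(ec)) {
        ++n;
    }
    return n;
}
```

Calling count_threads("self") from inside the server avoids having to look up its own PID. The count is only a snapshot: threads may start or exit while the directory is being read.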

Okay, if you know the TID of the thread in use by the connection then you can wait on that object in a separate thread which can then decrement the counter.
At least I know that you can do it with MSVC...
bool createConnection()
{
    if( ConnectionMonitor::connectionsMaxed() )
    {
        LOG( "Connection Request failed due to over-subscription" );
        return false;
    }
    ConnectionThread& connectionThread = ThreadFactory::createNewConnectionThread();
    connectionThread.startConnection();
    ThreadMonitorThread& monitor = ThreadFactory::createThreadMonitor(connectionThread);
    monitor.monitor();
    return true;
}
and in ThreadMonitorThread
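ThreadMonitorThread( const Thread& thread )
{
    this->thread = thread;
}
void monitor()
{
    // WaitForSingleObject needs a HANDLE and a timeout, so open a handle from the TID first
    HANDLE hThread = OpenThread( SYNCHRONIZE, FALSE, thread.getTid() );
    if( hThread != NULL )
    {
        WaitForSingleObject( hThread, INFINITE ); // returns when the connection thread exits
        CloseHandle( hThread );
    }
    ConnectionMonitor::decrementThreadCounter();
}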
Of course ThreadMonitorThread will require some special privileges to call the decrement and the ThreadFactory will probably need the same to increment it.
You also need to worry about properly coding this up... who owns the objects and what about exceptions and errors etc...
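If you start the connection threads yourself, a portable variation on the same idea is to let each connection thread release its slot when it finishes, instead of using a separate monitor thread. The following is only a sketch of that variation, not the code above: g_active_connections, kMaxConnections, ConnectionSlot and try_start_connection are all hypothetical names, and it assumes the connection work can be wrapped in a callable.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical global state, for illustration only.
std::atomic<int> g_active_connections{0};
constexpr int kMaxConnections = 4;

// Releases the slot when the connection thread finishes,
// including when the work throws.
struct ConnectionSlot {
    ~ConnectionSlot() { g_active_connections.fetch_sub(1); }
};

// Starts `work` on a new thread if a slot is free.
// `out` must not already hold a joinable thread.
bool try_start_connection(std::function<void()> work, std::thread& out) {
    // Reserve a slot first; back out if that took us over the limit.
    if (g_active_connections.fetch_add(1) >= kMaxConnections) {
        g_active_connections.fetch_sub(1);
        return false;
    }
    try {
        out = std::thread([work = std::move(work)] {
            ConnectionSlot slot;
            try { work(); } catch (...) { /* swallow: slot is still released */ }
        });
    } catch (...) {
        g_active_connections.fetch_sub(1);  // thread creation failed
        throw;
    }
    return true;
}
```

Reserving the slot with fetch_add before checking the limit avoids the race where two callers both see one free slot and both start a thread.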


Light event in WinAPI / C++

Is there a lightweight (and thus fast) event in WinAPI / C++? In particular, I'm interested in minimizing the time spent waiting for the event (as with WaitForSingleObject()) when the event is already set. Here is a code example to clarify what I mean:
#include <Windows.h>
#include <chrono>
#include <cstdint>
#include <stdio.h>
int main()
{
const int64_t nIterations = 10 * 1000 * 1000;
HANDLE hEvent = CreateEvent(nullptr, true, true, nullptr);
auto start = std::chrono::high_resolution_clock::now();
for (int64_t i = 0; i < nIterations; i++) {
WaitForSingleObject(hEvent, INFINITE);
}
auto elapsed = std::chrono::high_resolution_clock::now() - start;
double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
printf("%.3lf Ops/sec\n", nIterations / nSec);
return 0;
}
On 3.85GHz Ryzen 1800X I'm getting 7209623.405 operations per second, meaning 534 CPU clocks (or 138.7 nanoseconds) are spent on average for a check whether the event is set.
However, I want to use the event in performance-critical code where most of the time the event is actually set, so it's just a check for a special case and in that case the control flow goes to code which is not performance-critical (because this situation is seldom).
WinAPI events which I know (created with CreateEvent) are heavy-weight because of security attributes and names. They are intended for inter-process communication. Perhaps WaitForSingleObject() is so slow because it switches from user to kernel mode and back, even when the event is set. Furthermore, this function has to behave differently for manual- and auto-reset events, and a check for the type of the event takes time too.
I know that a fast user-mode mutex (spin lock) can be implemented with atomic_flag . Its spinning loop can be extended with a std::this_thread::yield() in order to let other threads run while spinning.
With the event I wouldn't like a complete equivalent of a spin-lock, because when the event is not set, it may take substantial time till it becomes set again. If every thread that needs the event set start spinning till it's set again, that would be an epic waste of CPU electricity (though shouldn't affect system performance if they call std::this_thread::yield)
So I would rather like an analogy of a critical section, which usually just does the work in user mode and when it realizes it needs to wait (out of spins), it switches to kernel mode and waits on a heavy synchronization object like a mutex.
UPDATE1: I've found that .NET has ManualResetEventSlim , but couldn't find an equivalent in WinAPI / C++.
UPDATE2: because there were details of event usage requested, here they are. I'm implementing a knowledge base that can be switched between regular and maintenance mode. Some operations are maintenance-only, some operations are regular-only, some can work in both modes, but of them some are faster in maintenance and some are faster in regular mode. Upon its start each operation needs to know whether it is in maintenance or regular mode, as the logic changes (or the operation refuses to execute at all). From time to time user can request a switch between maintenance and regular mode. This is rare. When this request arrives, no new operations in the old mode can start (a request to do so fails) and the app waits for the current operations in the old mode to finish, then it switches mode. So light event is a part of this data structure: the operations except mode switching have to be fast, so they need to set/reset/wait event quickly.
Starting with Windows 8, the best solution is to use WaitOnAddress (in place of WaitForSingleObject), WakeByAddressAll (which works like SetEvent for a NotificationEvent, i.e. a manual-reset event) and WakeByAddressSingle (which works like SetEvent for a SynchronizationEvent, i.e. an auto-reset event). For more, read "WaitOnAddress lets you create a synchronization object".
An implementation could look like this:
class LightEvent
{
BOOLEAN _Signaled;
public:
LightEvent(BOOLEAN Signaled)
{
_Signaled = Signaled;
}
void Reset()
{
_Signaled = FALSE;
}
void Set(BOOLEAN bWakeAll)
{
_Signaled = TRUE;
(bWakeAll ? WakeByAddressAll : WakeByAddressSingle)(&_Signaled);
}
BOOL Wait(DWORD dwMilliseconds = INFINITE)
{
BOOLEAN Signaled = FALSE;
while (!_Signaled)
{
if (!WaitOnAddress(&_Signaled, &Signaled, sizeof(BOOLEAN), dwMilliseconds))
{
return FALSE;
}
}
return TRUE;
}
};
Don't forget to add Synchronization.lib to the linker input.
This new API is very efficient: it does not create internal kernel objects (such as an event) for waiting, but uses the new ZwAlertThreadByThreadId and ZwWaitForAlertByThreadId calls, which were designed specifically for this purpose.
How would you implement this yourself before Windows 8? At first glance it looks trivial: a boolean variable plus an event handle, looking something like this:
void Set()
{
    SetEvent(_hEvent);
    // Sleep(1000); // simulate the thread being interrupted here
    _Signaled = true;
}
void Reset()
{
    _Signaled = false;
    // Sleep(1000); // simulate the thread being interrupted here
    ResetEvent(_hEvent);
}
void Wait(DWORD dwMilliseconds = INFINITE)
{
    if(!_Signaled) WaitForSingleObject(_hEvent, dwMilliseconds);
}
But this code is actually incorrect. The problem is that Set (and Reset) performs two operations: it changes the state of _Signaled and of _hEvent. There is no way to do both from user mode as a single atomic/interlocked operation, which means a thread can be interrupted between the two. Assume two different threads call Set and Reset concurrently. In most cases the operations will execute in an order such as:
SetEvent(_hEvent);
_Signaled = true;
_Signaled = false;
ResetEvent(_hEvent);
Here everything is fine. But the following order is also possible (uncomment one of the Sleep calls to test this):
SetEvent(_hEvent);
_Signaled = false;
ResetEvent(_hEvent);
_Signaled = true;
As a result, _hEvent will be in the reset state while _Signaled is true.
Implementing this atomically yourself, without OS support, is not simple, although it is possible. But I would first look at how this is going to be used: is event-like behavior exactly what you need for the task?
The other answer is very good if you can drop support of Windows 7.
However on Win7, if you set/reset the event many times from multiple threads, but only need to sleep rarely, the proposed method is quite slow.
Instead, I use a boolean guarded by a critical section, with condition variable to wake / sleep.
The wait method will go to the kernel to sleep in the SleepConditionVariableCS API; that's expected and what you want.
However, the set and reset methods will work entirely in user mode: setting a single boolean variable is very fast, i.e. in 99% of cases the critical section will do its user-mode lock-free magic.
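A portable analogue of this design can be sketched with the C++ standard library, using std::mutex and std::condition_variable in place of CRITICAL_SECTION and SleepConditionVariableCS. This is an illustration under assumptions, not the answerer's actual code: the class name LightEventCv is made up, and the atomic fast-path check in wait_for is an addition aimed at the "event is usually set" case from the question.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Manual-reset event: a boolean guarded by a mutex, with a condition
// variable used only when a waiter actually has to sleep.
class LightEventCv {
public:
    explicit LightEventCv(bool signaled) : signaled_(signaled) {}

    void set() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_.store(true, std::memory_order_release);
        }
        cv_.notify_all();  // wake every sleeping waiter
    }

    void reset() {
        std::lock_guard<std::mutex> lock(mutex_);
        signaled_.store(false, std::memory_order_release);
    }

    // Returns true if the event is set, false if the timeout expired.
    bool wait_for(std::chrono::milliseconds timeout) {
        // Fast path: no lock is taken when the event is already set.
        if (signaled_.load(std::memory_order_acquire)) return true;
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] {
            return signaled_.load(std::memory_order_acquire);
        });
    }

private:
    std::atomic<bool> signaled_;
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

Because the flag is changed only while the mutex is held, a waiter that has checked the flag under the lock cannot miss the wake-up from a concurrent set().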

How to call a method/function 50 times in a second

How can I call a method/function 50 times in a second and then calculate the time spent? If the time spent is less than one second, the program should sleep for (1 - timeSpent) seconds.
Below is the pseudo code:
start_time = // find current time
int msg_count = 0;
while(1)
{
    send_msg();
    msg_count++;
    // Check time after sending 50 messages
    if(msg_count % 50 == 0)
    {
        curr_time = // find current time
        timeSpent = curr_time - start_time;
        waitingTime = (timeSpent < 1 sec) ? (1 sec - timeSpent) : 0;
        wait for waitingTime;
        start_time = // find current time (start of the next batch)
    }
}
I am new to timer APIs. Can anyone tell me which timer APIs I should use to achieve this? I want portable code.
First, read the time(7) man page.
Then you may want to call timer_create(2) to set up a timer. To query about time, use clock_gettime(2)
You probably may want to wait and multiplex on some input and output. poll(2) is useful for this. To sleep for a small amount of time without using the CPU consider nanosleep(2)
If you use a timer that delivers signals, read signal(7) and be careful, because signal handlers are restricted to async-signal-safe functions (consider having a signal handler which just sets some global volatile sig_atomic_t flag). You may also be interested in the Linux-specific timerfd_create(2) (whose file descriptor you could poll or pass to your event loop).
You might want to use some existing event loop library, like libevent or libev (or those from GTK/Glib, Qt, etc...), which are often using poll (or fancier things). The linux specific eventfd(2) and signalfd(2) might be very helpful.
Advanced Linux Programming is also useful to read.
If send_msg is doing network I/O, you probably need to redesign your program around some event loop (perhaps your own, based on poll): you'll need to multiplex (i.e. poll) on both network sends and network receives. Continuation-passing style is then a useful paradigm.
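Since the question asks for portable code, here is a minimal sketch using only the C++11 standard library (std::chrono and std::this_thread) instead of the POSIX calls above. The function name run_paced and its parameters are hypothetical; send_msg stands in for the poster's function.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls send_msg per_batch times, then sleeps until the end of the
// batch's period; repeats for the given number of batches.
void run_paced(const std::function<void()>& send_msg, int batches,
               int per_batch, std::chrono::milliseconds period) {
    using clock = std::chrono::steady_clock;  // monotonic, immune to clock changes
    auto deadline = clock::now() + period;
    for (int b = 0; b < batches; ++b) {
        for (int i = 0; i < per_batch; ++i) send_msg();
        // Returns immediately if the batch already overran its period.
        std::this_thread::sleep_until(deadline);
        deadline += period;
    }
}
```

For the case in the question this would be called as run_paced(send_msg, n, 50, std::chrono::milliseconds(1000)). Sleeping until an absolute deadline, rather than for a computed duration, keeps the batches from drifting; the trade-off is that after an overrun the following batches run back to back until the schedule catches up.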

C++ threads & infinite loop

I have a little problem: I wrote a program, acting as a server, that runs an infinite loop waiting for client requests.
But I would like this program to also return its PID.
Thus, I think I should use multithreading.
Here's my main :
int main(int argc, char **argv) {
int pid = (int) getpid();
int port = 5555;
ServerSoap *servsoap;
servsoap = new ServerSoap(port, false);
servsoap->StartServer(); //Here starts the infinite loop
return pid; //so it never executes this
}
If it was bash scripting I would add & to run it in background.
Shall I use pthread ? And how to do it please ?
Thanks.
When a program returns (exits), all running threads terminate, so you can't have a background thread continue to run.
In addition, the int return value of main is (usually) truncated to an 8-bit value (0 to 255 on POSIX systems), so you don't have enough space to return a full pid.
It'd be better just to print the pid to stdout using printf.
If you put the infinite loop in a separate thread, and then return from main it will kill the whole process including your new thread. One solution, keeping to threads, is to make a detached thread. A better solution is probably to create a new process:
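#include <stdio.h>    // perror
#include <unistd.h>   // fork
int main()
{
    int pid = fork();
    if (pid == -1)
        perror("fork");
    else if (pid == 0)
    {
        // Child process: run the server loop here
        ServerSoap serversoap(5555, false);
        serversoap.StartServer();
    }
    // Parent process: pid is the child's PID (or -1 if fork failed)
    return pid;
}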
Edit: Also note the limit to the return value from main as noted in the answer from ecatmur.
I have a feeling that you're trying to implement daemon.
To add to @ecatmur's answer: if no error has occurred, the program should always return 0 on termination.
PID is usually saved in some file, often times in /var/run/ directory. Some programs use /tmp/ directory.
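A minimal sketch of writing a PID file, assuming a POSIX system for getpid(). The function name write_pid_file is hypothetical, and the path is left to the caller, since writing under /var/run/ usually requires elevated privileges.

```cpp
#include <fstream>
#include <string>
#include <unistd.h>  // getpid (POSIX)

// Writes the current process ID to `path`, replacing any previous content.
// Returns false if the file could not be opened or written.
bool write_pid_file(const std::string& path) {
    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    out << static_cast<long>(getpid()) << '\n';
    return static_cast<bool>(out.flush());
}
```

The server would call this once at startup, before entering its request loop, so that other programs can read the PID from the file instead of from the exit status.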
Your main is attempting to do what your server should do. You're confusing a couple patterns here.
Pattern #1: Daemon
Think of the main as the program that, when on, accepts client requests and performs operations with them. The main has to wait for requests if this is the structure of the program. When a request is received, only then do you perform the requested operation. The main serves only to turn on or off this service. Normally this type of behavior is handled by default with threads. The listener activates a thread calling specific methods with information regarding the request, for instance. Unless you require threads for the work you need done, you shouldn't require threads for this.
Pattern #2: Tool
Alternatively, you could simply call this program as a tool. You'd still need a web service, but this program could be separate from that. Apart from what your tool should do, you shouldn't require threads for this.
In either case, I don't think what you're looking for is to implement threading. You're simply activating a server which does nothing. You should probably look into adding request handlers instead.

Reliable way to count running instances of a process on Windows using c++/WinAPIs

I need to know how many instances of my process are running on a local Windows system. I need to be able to do it using C++/MFC/WinAPIs. So what is a reliable method to do this?
I was thinking to use process IDs for that, stored as a list in a shared memory array that can be accessed by the process. But the question is, when a process is closed or crashes how soon will its process ID be reused?
The process and thread identifiers may be reused any time after closure of all handles. See When does a process ID become available for reuse? for more information on this.
However if you are going to store a pair of { identifier, process start time } you can resolve these ambiguities and detect identifier reuse. You can create a named file mapping to share information between the processes, and use IPC to synchronize access to this shared data.
You can snag the process handles by the name of the process using the method described in this question. It's called Process Walking. That'll be more reliable than process id's or file paths.
A variation of this answer is what you're looking for. Just loop through the processes with Process32Next, and look for processes with the same name using MatchProcessName. Unlike the example in the link I provided, you'll be looking to count or create a list of the processes with the same name, but that's a trivial addition.
If you are trying to limit the number of instances of your process to some number you can use a Semaphore.
You can read in detail here:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946(v=vs.85).aspx
In a nutshell, the semaphore is initialized with a current count and max count. Each instance of your process will decrement the count when it acquires the semaphore. When the nth process tries to acquire it but the count has reached zero that process will fail to acquire it and can terminate or take appropriate action.
The following code should give you the gist of what you have to do:
#include <windows.h>
#include <stdio.h>
// maximum number of instances of your process
#define MAX_INSTANCES 10
// name shared by all your processes. See http://msdn.microsoft.com/en-us/library/windows/desktop/aa382954(v=vs.85).aspx
#define SEMAPHORE_NAME "Global\\MyProcess"
// access rights for semaphore, see http://msdn.microsoft.com/en-us/library/windows/desktop/ms686670(v=vs.85).aspx
#define MY_SEMAPHORE_ACCESS SEMAPHORE_ALL_ACCESS
DWORD WINAPI ThreadProc( LPVOID );
int main( void )
{
HANDLE semaphore;
// Create a semaphore with initial and max counts of MAX_INSTANCES
semaphore = CreateSemaphore(
NULL, // default security attributes
MAX_INSTANCES, // initial count
MAX_INSTANCES, // maximum count
SEMAPHORE_NAME );
if (semaphore == NULL)
{
semaphore = OpenSemaphore(
MY_SEMAPHORE_ACCESS,
FALSE, // don't inherit the handle for child processes
SEMAPHORE_NAME );
if (semaphore == NULL)
{
printf("Error creating/opening semaphore: %lu\n", GetLastError());
return 1;
}
}
// acquire semaphore and decrement count
DWORD acquireResult = 0;
acquireResult = WaitForSingleObject(
semaphore,
0L); // timeout after 0 seconds trying to acquire
if(acquireResult == WAIT_TIMEOUT)
{
printf("Too many processes have the semaphore. Exiting.");
CloseHandle(semaphore);
return 1;
}
// do your application's business here
// now that you're done release the semaphore
LONG prevCount = 0;
BOOL releaseResult = ReleaseSemaphore(
semaphore,
1, // increment count by 1
&prevCount );
if(!releaseResult)
{
printf("Error releasing semaphore");
CloseHandle(semaphore);
return 1;
}
printf("Semaphore released, prev count is %ld", prevCount);
CloseHandle(semaphore);
return 0;
}
Well, your solution is not very reliable. PIDs can be reused by the OS at any later time.
I did it once by going through all the processes and comparing their command line string (the path of the executable) with the one for my process. Works pretty well.
Extra care should be taken for programs that are started via batch files (like some java apps/servers).
Other solutions involve IPC, maybe through named pipes, sockets, shared memory (as you mentioned). But none of them are that easy to implement and maintain.

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday, it appears that the change in update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
                  ^    ^
             client    server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
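A rough sketch of what such padding could look like in C++11. The 64-byte line size is an assumption (common on x86 but not guaranteed), and PaddedUpdateId with its two fields is a hypothetical stand-in for the update-list node, not the poster's actual UpdateID type.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// alignas rounds sizeof up to a multiple of kCacheLine, so adjacent
// array elements never share a cache line.
struct alignas(kCacheLine) PaddedUpdateId {
    std::atomic<std::uint32_t> state{0};  // e.g. NOT_FILLED_YET / FILLED
    std::uint32_t list_id{0};
};

static_assert(sizeof(PaddedUpdateId) == kCacheLine,
              "each entry occupies exactly one cache line");
static_assert(alignof(PaddedUpdateId) == kCacheLine,
              "each entry starts on a cache-line boundary");
```

The cost is memory: every 8-byte entry now takes 64 bytes, which matters if the update list grows large.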
Check out the other articles in the series while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer, I think, that could help with your update list :)
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.
The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.