VerySleepy Profiling C++ code

While profiling my code to find out what is slow, I found 3 functions that apparently take forever; at least, that's what Very Sleepy says.
These functions are:
ZwDelayExecution 20.460813 20.460813 19.987685 19.987685
MsgWaitForMultipleObjects 20.460813 20.460813 19.987685 19.987685
WaitForSingleObject 20.361805 20.361805 19.890967 19.890967
Can anybody tell me what these functions are, why they are taking so long, and how to fix them?
Thanks

These functions are probably used to put a thread to sleep in the Win32 API. They may also be used for thread synchronization, so check for that too.
They account for so much time because blocking is what they are designed to do. A thread that is blocked in them spends wall-clock time, not CPU time.
The WaitForSingleObject function can wait for the following objects:
Change notification
Console input
Event
Memory resource notification
Mutex
Process
Semaphore
Thread
Waitable timer
So another place where it may be used is waiting for console user input.
ZwDelayExecution is an internal Windows function. It is used to implement the Sleep function, as this call stack for Sleep shows:
0 ntdll.dll ZwDelayExecution
1 kernel32.dll SleepEx
2 kernel32.dll Sleep
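It probably relies on low-level kernel facilities to do that. Its delay interval is specified in units of 100 ns, although the precision you actually get depends on the system timer resolution.
MsgWaitForMultipleObjects has a goal similar to that of WaitForSingleObject: it waits for one or more objects, and it can also return when a message arrives in the thread's input queue.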

Judging by the names, all 3 functions seem to block, so they take a long time because they are designed to do so, but they shouldn't use any CPU while waiting.
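To make the difference between waiting time and CPU time concrete, here is a minimal sketch in portable C++. It uses std::this_thread::sleep_for as a stand-in for a blocking call such as WaitForSingleObject, and the names WaitCost and measure_blocking_wait are made up for this illustration. It assumes std::clock reports CPU time used by the process, which holds on POSIX systems; the Microsoft C runtime reports elapsed time instead, so on Windows you would need something like GetProcessTimes to see the same effect.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical helper type: how long a wait took versus how much CPU it used.
struct WaitCost {
    double wall_ms;  // elapsed (wall-clock) time
    double cpu_ms;   // processor time charged to the process
};

WaitCost measure_blocking_wait(std::chrono::milliseconds duration) {
    assert(duration.count() >= 0);
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    // Blocks the calling thread; the scheduler is free to run other work.
    std::this_thread::sleep_for(duration);

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    WaitCost cost;
    cost.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    cost.cpu_ms =
        1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return cost;
}
```

Calling measure_blocking_wait(std::chrono::milliseconds(200)) should report a wall time of at least 200 ms and a CPU time close to zero, which is the pattern a wall-clock profiler shows for the wait functions in the question.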

One of the first steps should always be to check the documentation:
WaitForSingleObject:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032.aspx
Waits for an object like a thread, process, mutex.
MsgWaitForMultipleObjects:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684242.aspx
Simply waits for multiple objects, just like WaitForSingleObject.
ZwDelayExecution:
There doesn't seem to be any documentation for ZwDelayExecution, but I think it is an internal function which gets called when you call Sleep.
Anyway, the name already reveals part of it. "Wait" and "Delay"-functions are supposed to take time. If you want to reduce the waiting time you have to find out what is calling these functions.
To give you an example:
If you start a new thread and then wait for it to finish in your main thread, you will call WaitForSingleObject one way or another in WINAPI-programming. It doesn't even have to be you who is starting the thread - it could be the runtime itself. The function will wait until the thread finishes. Therefore it will take time and block the program in WaitForSingleObject until thread is done or a timeout occurs. This is nothing bad, this is intended behaviour.
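The thread example can be sketched in standard C++ like this. std::thread::join is what blocks here; on Windows it presumably ends up in one of the wait functions from the question, but that is an implementation detail I am assuming, not something the standard promises. The function name time_spent_in_join and the sleeping worker are invented for the illustration.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Starts a worker that runs for roughly `work`, then reports how long the
// calling thread sat blocked inside join().
std::chrono::milliseconds time_spent_in_join(std::chrono::milliseconds work) {
    assert(work.count() >= 0);
    std::thread worker([work] {
        std::this_thread::sleep_for(work);  // stand-in for real work
    });

    const auto start = std::chrono::steady_clock::now();
    worker.join();  // the caller blocks here until the worker finishes
    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
}
```

With a 150 ms worker, nearly all of those 150 ms are attributed to the join call in the waiting thread, even though that thread is doing nothing.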

Before you start zooming in on these functions, you might first want to determine what kind of slowness your program is suffering from. It is pretty normal for a Windows program to have one or more threads spending most of their time in blocking functions.
You would first need to determine whether your actual critical thread is CPU bound. In that case you don't want to zoom in on the functions that take a lot of wall clock time, you want to find those functions that take CPU time.
I don't have much experience with Very Sleepy, but IIRC it is a sampling profiler, and those are typically not so good at measuring CPU usage.
Only after you've determined that your program is not CPU bound, then you should zoom in on the functions that wait a lot.

Related

Block a thread with sleep vs block without sleep

I've created a multi-threaded application using C++ and POSIX threads. In it, I now need to block a thread (the main thread) until a boolean flag is set (becomes true).
I've found two ways to get this done.
Spinning through a loop without sleep.
while(!flag);
Spinning through a loop with sleep.
while(!flag){
sleep(some_int);
}
If I should follow the first way, why do some people write code the second way? If the second way should be used, why should we make the current thread sleep? And what are the disadvantages of this way?
The first option (a "busy wait") wastes an entire core for the duration of the wait, preventing other useful work being done and/or wasting energy.
The second option is less wasteful - your waiting thread uses very little CPU and allows other threads to run. But it is still wasteful to keep switching back to the thread to check the flag.
Far better than either would be to use a condition variable, which allows the waiting thread to block without consuming any resources until it is able to proceed.
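A minimal sketch of the condition-variable approach follows. It uses std::condition_variable from the C++11 standard library instead of the raw pthread_cond_* calls, which is a substitution on my part; the class name Flag and the helper wait_for_worker are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot flag: wait() blocks without spinning until set().
class Flag {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value_ = true;
        }
        changed_.notify_all();  // wake every thread blocked in wait()
    }

    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups and against set()
        // having been called before wait() started.
        changed_.wait(lock, [this] { return value_; });
    }

    bool is_set() {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    bool value_ = false;
};

// Usage sketch: the calling thread blocks until the worker sets the flag.
bool wait_for_worker() {
    Flag flag;
    assert(!flag.is_set());
    std::thread worker([&flag] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        flag.set();
    });
    flag.wait();  // no polling, no sleep interval to tune
    worker.join();
    return flag.is_set();
}
```

The waiting thread is woken as soon as the flag changes, so there is neither the wasted core of the busy loop nor the added latency of a fixed sleep interval.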
while(!flag); will cause your thread to use all of its allocated time checking the condition. This wastes a lot of CPU cycles checking something which has likely not changed.
Sleeping for a bit causes the thread to pause and give up the CPU to programs that actually need it.
You shouldn't do either though; you should use a threading library to create a flag object and call its wait function, so that the kernel will pause the thread until the flag is set.
The first way (just the plain while) is wasting resources, specifically the processor time of your process.
When a thread is put to sleep, the OS may decide to use the processor for other tasks (on systems with preemptive multitasking). In theory, if you had as many processors / cores as threads, there would be no difference.
Whether a solution is good or not depends on the operating system used, and sometimes on the architecture the program is running on. You should consult your syscall reference to find out more about this.

Relinquish Processor Time in C++ (Windows)

I've looked around a fair amount and can't seem to find what I'm looking for, but let me first stress that I'm not looking for a high-precision sleep function.
Here's the background for the problem I'm trying to solve:
I've made a memory mapping library that operates a lot like a named pipe. You can put bytes into it, get bytes out of it, and query how many bytes are available to read/write, all that good stuff.
It's fast (mostly): processes communicating through it average 4 GB/s if they pass chunks of 8 KB or larger. Performance goes down to around 300 MB/s as you approach a 512 B chunk size.
The problem:
Very occasionally, on heavily loaded servers, very large lag times will occur (Upwards of 5s). My running theory for the cause of this issue is that when large transfers are taking place (larger than the size of the mapped memory), the process that's writing data will tight poll to wait for more space to be available in the circular buffer that's implemented on top of the memory map. There are no calls to sleep, so the polling process could be hogging the CPU for no good reason! The issue is that even the smallest call to sleep (1ms) would absolutely demolish performance. The memmap size is 16KB, so if it slept for 1ms every 16KB, performance would drop to a best-case scenario of 16MB/s.
The solution:
I want a function that I can call that will relinquish the CPU, but makes no limitations on when it gets rescheduled by the operating system (Windows 7 in this case).
Has anyone got any bright alternatives?/Does anyone know if such a function exists?
Thanks.
According to the MSDN documentation, on XP or newer, calling Sleep with a timeout of 0 yields to other threads of equal priority that are ready to run:
A value of zero causes the thread to relinquish the remainder of its
time slice to any other thread of equal priority that is ready to
run. If there are no other threads of equal priority ready to run, the
function returns immediately, and the thread continues execution.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx
Another option that will require more work but that will work more reliably would be to share an event handle between the producer and consumer process. You can use CreateEvent to create your event and DuplicateHandle to get it into your other process. As the producer fills the buffer, it will call ResetEvent on the event handle and call WaitForSingleObject with it. When the consumer has removed some data from the full shared buffer, it will call SetEvent, which will wake the producer which was waiting in WaitForSingleObject.
std::this_thread::yield() probably does what you want. I believe it just calls Sleep with 0 in most implementations.
You need the SwitchToThread() function (which will only relinquish its time slice if something else can run), not Sleep(0) (which would relinquish its time slice even if nothing else can run).
If you're writing code that's designed to take advantage of hyperthreading, YieldProcessor might do something for you too, but I doubt that'll be helpful.
You're incorrectly assuming a binary choice. Right now you are always busy-waiting because you assume sleeping would always be a bad idea.
The better solution is to try a few times without sleeping. If that still fails (because the map is full, and the other thread isn't running), then you can issue a true sleep. This will be sufficiently rare that on average you'll be sleeping for microseconds. You could even check the processor's timestamp counter (RDTSC) to determine how long you've spent busy-waiting before surrendering your timeslice.
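The spin-first, sleep-later idea might look roughly like this. It is a sketch in portable C++, not the poster's code: the name spin_then_sleep, the spin limit and the nap length are all placeholders to tune, and std::this_thread::yield stands in for Sleep(0) or SwitchToThread. In the memory-map case, the predicate would be whatever check reports free space in the circular buffer.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>

// Polls `ready` up to `spin_limit` times, yielding between attempts, and only
// then falls back to a real sleep. Returns how many times it had to sleep.
template <typename Predicate>
std::size_t spin_then_sleep(
    Predicate ready,
    std::size_t spin_limit = 1000,
    std::chrono::microseconds nap = std::chrono::microseconds(100)) {
    assert(spin_limit > 0);
    std::size_t naps = 0;
    for (;;) {
        for (std::size_t i = 0; i < spin_limit; ++i) {
            if (ready()) {
                return naps;  // common case: no sleep was needed
            }
            std::this_thread::yield();  // let another ready thread run
        }
        // The condition is still false after a full round of polling, so the
        // other side is probably not running; stop burning CPU for a while.
        std::this_thread::sleep_for(nap);
        ++naps;
    }
}
```

When the other side keeps up, the function returns during the first round and the sleep cost is never paid; the sleep only kicks in when the wait is already long.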
If you're operating under .Net, you can look into the Thread::Yield() method.
It may or may not help with your specific scenario, but it's the correct way to notify the scheduler that you want to relinquish the remainder of your timeslice.
If you're running in a pre-.Net environment (seems unlikely if you're on Windows 7), you can look into the Win32 SwitchToThread() function instead.

Boost threads: is it possible to limit the run time of a thread before moving to another thread

I have a program with a main thread and a diagnostics thread. The main thread is basically a while(1) loop that performs various tasks. One of these tasks is to provide a diagnostics engine with information about the system and then check back later (i.e. in the next loop) to see if there are any problems that should be dealt with. An iteration of the main loop should take no longer than 0.1 seconds. If all is well, then the diagnostic engine takes almost no time to come back with an answer. However, if there is a problem, the diagnostic engine can take seconds to isolate the problem. For this reason each time the diagnostic engine receives new information it spins up a new diagnostics thread.
The problem we're having is that the diagnostics thread is stealing time away from the main thread. Effectively, even though we have two threads, the main thread is not able to run as often as I would like because the diagnostic thread is still spinning.
Using Boost threads, is it possible to limit the amount of time that a thread can run before moving on to another thread? Also of importance here is that the diagnostic algorithm we are using is blackbox, so we can't put any threading code inside of it. Thanks!
If you run multiple threads they will indeed consume CPU time. If you only have a single processor, and one thread is doing processor intensive work then that thread will slow down the work done on other threads. If you use OS-specific facilities to change the thread priority then you can make the diagnostic thread have a lower priority than the main thread. Also, you mention that the diagnostic thread is "spinning". Do you mean it literally has the equivalent of a spin-wait like this:
while(!check_done()) ; // loop until done
If so, I would strongly suggest that you try and avoid such a busy-wait, as it will consume CPU time without achieving anything.
However, though multiple threads can cause each other to slow-down, if you are seeing an actual delay of several seconds this would suggest there is another problem, and that the main thread is actually waiting for the diagnostic thread to complete. Check that the call to join() for the diagnostic thread is outside the main loop.
Another possibility is that the diagnostic thread is locking a mutex needed by the main thread loop. Check which mutexes are locked and where.
To really help, I'd need to see some code.
It looks like your threads are interlocked, so your main thread waits until the background thread has finished its work. Check any multithreading synchronization that could cause this.
To check that it has nothing to do with OS scheduling, run your program on a dual-core system, so both threads can really be executed in parallel.
From the way you've worded your question, it appears that you're not quite sure how threads work. I assume by "the amount of time that a thread can run before moving on to another thread" you mean the number of cpu cycles spent per thread. This happens hundreds of thousands of times per second.
Boost.Thread does not have support for thread priorities, although your OS-specific thread API will. However, your problem seems to indicate the necessity for a fundamental redesign -- or at least heavy profiling to find bottlenecks.
You can't do this generally at the OS level, so I doubt boost has anything specific for limiting execution time. You can kinda fake it with small-block operations and waits, but it's not clean.
I would suggest looking into processor affinity, either at a thread or process level (this will be OS-specific). If you can isolate your diagnostic processing to a limited subset of [logical] processors on a multi-core machine, it will give you a very coarse mechanism to control the maximum execution amount relative to the main process. That's the best solution I have found when trying to do a similar type of thing.
Hope that helps.

pthread sleep linux

I am creating a program with multiple threads using pthreads.
Is sleep() causing the process (all the threads) to stop executing or just the thread where I am calling sleep?
Just the thread. The POSIX documentation for sleep() says:
The sleep() function shall cause the calling thread to be suspended from execution...
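You can check this yourself with a small experiment. The sketch below uses std::thread and std::this_thread::sleep_for rather than pthreads and sleep() so that it stays portable; I am assuming the standard library maps these onto the POSIX calls on Linux. The function name iterations_while_caller_sleeps is made up.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Returns how many loop iterations a worker thread completed while the
// calling thread was asleep. A nonzero result shows that the sleep did not
// suspend the whole process.
unsigned long iterations_while_caller_sleeps(std::chrono::milliseconds nap) {
    assert(nap.count() > 0);
    std::atomic<bool> stop{false};
    std::atomic<unsigned long> iterations{0};

    std::thread worker([&stop, &iterations] {
        while (!stop.load()) {
            ++iterations;
            std::this_thread::yield();
        }
    });

    std::this_thread::sleep_for(nap);  // only this thread is suspended
    const unsigned long during_sleep = iterations.load();

    stop.store(true);
    worker.join();
    return during_sleep;
}
```

If sleeping stopped every thread, the counter would still be zero when the caller wakes up; in practice it is far above zero.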
Try this,
#include <unistd.h>
usleep(microseconds);
I usually use nanosleep and it works fine.
Nanosleep suspends the execution of the calling thread. I have had the same doubt because in some man pages sleep refers to the entire process.
In practice, there are few cases where you just want to sleep for a small delay (milliseconds). For Linux, read time(7), and see also this answer. For a delay of more than a second, see sleep(3); for a small delay, see nanosleep(2). (A counterexample might be a Raspberry Pi running some embedded Linux and driving a robot; in such a case you might indeed read from some hardware device every tenth of a second.) Of course, what is sleeping is just a single kernel-scheduled task (so a process or thread).
It is likely that you want to code some event loop. In such a case, you probably want something like poll(2) or select(2), or you want to use condition variables (read a Pthread tutorial about pthread_cond_init etc...) associated with mutexes.
Threads are expensive resources (since each needs a call stack, often of a megabyte at least). You should prefer having one or a few event loops instead of having thousands of threads.
If you are coding for Linux, read also Advanced Linux Programming and syscalls(2) and pthreads(7).
Posix sleep function is not thread safe.
https://clang.llvm.org/extra/clang-tidy/checks/concurrency/mt-unsafe.html
On a POSIX-conforming system, sleep() suspends only the calling thread, not the whole process, although some older man pages describe it in terms of the process. If you want a timed wait that another thread can cut short, you can give the thread a pthread condition object and call pthread_cond_timedwait() on it to make the thread wait for a specific amount of time. If each thread has its own condition object, it will never be woken by a signal meant for another thread.
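The timed wait on a condition object mentioned above can be sketched with std::condition_variable::wait_for, which is the C++ standard library counterpart of pthread_cond_timedwait (the substitution is mine). The class name InterruptibleSleep and its member names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: a per-thread sleep that another thread can cut short.
class InterruptibleSleep {
public:
    // Blocks the calling thread for up to `timeout`.
    // Returns true if wake() was called, false if the timeout expired.
    bool sleep_up_to(std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lock(mutex_);
        // wait_for with a predicate returns the predicate's final value.
        return wake_.wait_for(lock, timeout, [this] { return woken_; });
    }

    void wake() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            woken_ = true;
        }
        wake_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable wake_;
    bool woken_ = false;
};
```

Giving each thread its own InterruptibleSleep object means a wake() aimed at one thread cannot disturb the others.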

C++ - how does Sleep() and cin work?

Just curious. How does actually the function Sleep() work (declared in windows.h)? Maybe not just that implementation, but anyone. With that I mean - how is it implemented? How can it make the code "stop" for a specific time? Also curious about how cin >> and those actually work. What do they do exactly?
The only way I know how to "block" something from continuing to run is with a while loop, but considering that that takes a huge amount of processing power in comparison to what's happening when you're invoking methods to read from stdin (just compare a while (true) to a read from stdin), I'm guessing that isn't what they do.
The OS uses a mechanism called a scheduler to keep all of the threads or processes it's managing behaving nicely together.
Many times per second, the computer's hardware clock interrupts the CPU, which causes the OS's scheduler to be activated. The scheduler then looks at all the processes that are trying to run and decides which one gets to run for the next time slice.
The things it uses to decide depend on each process's state and on how much time it has had before. So if the current process has been using the CPU heavily, preventing other processes from making progress, the scheduler makes the current process wait and swaps in another process so that it can do some work.
More often, though, most processes are going to be in a wait state. For instance, if a process is waiting for input from the console, the OS can look at the process's information and see which I/O ports it's waiting for. It can check those ports to see if they have any data for the process to work on. If they do, it can start the process up again, but if there is no data, then that process gets skipped over for the current time slice.
As for sleep(), any process can notify the OS that it would like to wait for a while. The scheduler is then activated even before a hardware interrupt (which is also what happens when a process tries to do a blocking read from a stream that has no data ready to be read), and the OS makes a note of what the process is waiting for. For a sleep, the process is waiting for an alarm to go off, or it may just yield again each time it's restarted until the timer is up.
Since the OS only resumes processes after something causes it to preempt a running process, such as the process yielding or the hardware timer interrupt I mentioned, sleep() is not very accurate. How accurate it is depends on the OS and hardware, but it's usually on the order of one or more milliseconds.
If more accuracy is needed, or very short waits, the only option is to use the busy loop construct you mentioned.
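For completeness, this is roughly what such a busy loop looks like when it is used for a very short delay. It is only a sketch: spin_for is a made-up name, and std::chrono::steady_clock is my choice of timer.

```cpp
#include <cassert>
#include <chrono>

// Spins until at least `duration` has elapsed on the steady clock and returns
// the time actually waited. This keeps one core busy the whole time.
std::chrono::nanoseconds spin_for(std::chrono::nanoseconds duration) {
    assert(duration.count() >= 0);
    const auto start = std::chrono::steady_clock::now();
    auto now = start;
    while (now - start < duration) {
        now = std::chrono::steady_clock::now();  // no yield: pure polling
    }
    return std::chrono::duration_cast<std::chrono::nanoseconds>(now - start);
}
```

The thread never gives up the CPU voluntarily, so it comes back close to the requested time, but it can still be preempted, and it is only reasonable for delays well below a scheduler time slice.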
The operating system schedules how processes run (which processes are eligible to run, in what order, ...).
Sleep() probably issues a system call which tells the kernel “don't let me use the processor for x milliseconds”.
In short, Sleep() tells the OS to ignore the process/thread for a while.
'cin' uses a ton of overloaded operators. The '>>', which is usually the right bit-shift, is overloaded for pretty much every type of right-hand operand in C++. A separate function is provided for each one, which reads from the console and converts the input into whichever variable type you have given. For example:
std::istream& std::istream::operator>>(int& value);
That is the overload for int, slightly simplified (std::istream is really a typedef for the template std::basic_istream<char>). It returns the stream itself, which is what lets you chain several extractions. This function is called when you run cin >> an integer variable.
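To show how >> picks a function based on the type of its right-hand operand, here is a small sketch that adds one more overload for a user-defined type. The Point type and parse_point are invented for the example, and std::istringstream stands in for std::cin so that the input is fixed.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical type used only for this illustration.
struct Point {
    int x = 0;
    int y = 0;
};

// A user-defined extraction operator. It reuses the built-in overload for int
// and returns the stream so that extractions can be chained.
std::istream& operator>>(std::istream& in, Point& p) {
    return in >> p.x >> p.y;
}

// Parses "x y" from text; the same operator would work on std::cin.
Point parse_point(const std::string& text) {
    std::istringstream in(text);
    Point p;
    in >> p;             // overload resolution selects the Point version
    assert(!in.fail());  // the text must contain two integers
    return p;
}
```

parse_point("3 4") yields a Point with x equal to 3 and y equal to 4; writing std::cin >> p would run the same code against console input.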
The exact underlying implementation depends on the operating system.
The answer depends on the operating system, but generally speaking, the operating system either schedules some other code to run elsewhere in another thread, or if it literally has nothing to do, it gets the CPU to wait until a hardware event occurs, which causes the CPU to jump to some code called an interrupt handler, which can then decide what code to run.
If you are looking for a more controlled way of blocking a thread/process in a multi-threaded program, have a look at Semaphores, Mutexes, CriticalSections and Events. These are all techniques used to block a process or thread (without loading the CPU via a while construct).
They essentially work off of a Wait/Signal idiom where the blocked thread is waiting and another process signals it to tell it to start again. These (at least in windows) can also have timeouts, thus providing a similar functionality to Sleep().
At a low level, the system has a routine called the "scheduler" that dispatches the instructions from all the running programs to the CPU(s), which actually run them. System calls like "Sleep" and "usleep" match to instructions that tell the scheduler to IGNORE that thread or process for a fixed amount of time.
As for C++ streams, "cin" hides the actual file handle you're accessing (stdin and stdout are such handles), and the ">>" operator for it hides the underlying calls to read and write. Since it's an interface, the implementation can be OS-specific, but conceptually it is still doing things like printf and scanf under the hood.