XP app won't increase CPU utilization - C++

I am trying to fix a problem with a legacy Visual Studio Win32 unmanaged C++ app which is not keeping up with its input. As part of my solution, I am exploring bumping up the class and thread priorities.
My PC has four Xeon processors, running 64-bit XP. I wrote a short Win32 test app which creates four background looping threads, each one running on its own processor. Some code samples are shown below. The problem is that even when I bump the priorities to the extreme, the CPU utilization is still less than 1%.
My test app is 32-bit, running on WOW64. The same test app also shows less than 1% CPU utilization on a 32-bit XP machine. I am an administrator on both machines. What else do I need to do to get this to work?
DWORD __stdcall ThreadProc4(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    while (true)
    {
        for (int i = 0; i < 1000; i++)
        {
            int p = i;
            int red = p * 5;
            theClassPrior4 = GetPriorityClass(theProcessHandle);
        }
        Sleep(1);
    }
}
int APIENTRY _tWinMain(...)
{
    ...
    theProcessHandle = GetCurrentProcess();
    BOOL theAffinity = GetProcessAffinityMask(
        theProcessHandle, &theProcessMask, &theSystemMask);
    SetPriorityClass(theProcessHandle, REALTIME_PRIORITY_CLASS);
    DWORD threadid4 = 0;
    HANDLE thread4 = CreateThread((LPSECURITY_ATTRIBUTES)NULL,
                                  0,
                                  (LPTHREAD_START_ROUTINE)ThreadProc4,
                                  NULL,
                                  0,
                                  &threadid4);
    DWORD_PTR theAff4 = 8; // affinity mask: fourth processor only
    DWORD_PTR theAf4 = SetThreadAffinityMask(thread4, theAff4);
    SetThreadPriority(thread4, THREAD_PRIORITY_TIME_CRITICAL);
    ResumeThread(thread4);

Well, if you want it to actually eat CPU time, you'll want to remove that 'Sleep' call - your 'processing' takes no significant amount of time, so the thread spends most of its time sleeping.
You'll also want to look at what the optimizer is doing to your code. I wouldn't be totally surprised if it completely removed 'p' and 'red' (and the multiply) in your loop, because the results are never used. You could try marking 'red' as volatile; that should force the compiler not to remove the calculation.
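For illustration, here is a minimal sketch of the thread procedure with both changes applied (the name BusyProc is mine); with the Sleep removed and the result marked volatile, each such thread should pin its processor:
#include <windows.h>

DWORD __stdcall BusyProc(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    volatile int red = 0; // volatile: the store cannot be optimized away
    while (true)
    {
        for (int i = 0; i < 1000000; i++)
            red = i * 5; // the multiply now survives optimization
        // no Sleep() here - the thread never yields, so its core stays busy
    }
}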

Related

Measuring CPU cycles, or another unit that doesn't depend on CPU frequency and doesn't count time/cycles spent in Sleep? WinAPI C++

I need a profiling feature that works in two ways. The first is the total time the code spends; the second is a unit that doesn't depend on CPU frequency and doesn't count sleeps in the code. (The profiling is needed for our software, which has its own language/interpreter and runs on Windows.)
My problem is with the second one.
Results from GetThreadTimes depend on CPU frequency and are not accurate (10-15 ms); for more, see: Why GetThreadTimes is wrong? (Kalmbachnet)
QueryThreadCycleTime also depends on the implementation (and also counts during sleep, as I tested); for more, see: What does QueryThreadCycleTime actually count? (OldNewThing)
QueryPerformanceCounter is an accurate counter, but CPU frequency changes affect the result, and sleeps are also included.
Is what I want to do possible, or is there another way? How does Visual Studio's profiler do it?
Note: I know that my question looks like a duplicate. I tried to comment on some old answers to the same questions (like: Another question) to get better answers, but my comments were deleted after 1-2 days. (see: meta for comments deleted)
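For context, here is a minimal sketch of the GetThreadTimes approach mentioned above (the helper names are mine); the returned FILETIMEs only advance at scheduler-tick granularity, which is where the 10-15 ms inaccuracy comes from:
#include <windows.h>

static ULONG64 toULong64(const FILETIME &ft)
{
    return ((ULONG64)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
}

// CPU time (kernel + user) consumed by a thread, in 100 ns units.
static ULONG64 threadCpuTime100ns(HANDLE hThread)
{
    FILETIME ftCreation, ftExit, ftKernel, ftUser;
    ::GetThreadTimes(hThread, &ftCreation, &ftExit, &ftKernel, &ftUser);
    return toULong64(ftKernel) + toULong64(ftUser);
}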
EDIT: (My test code for QueryThreadCycleTime)
static void foo()
{
    for (int i = 0; i < 3; i++)
    {
        Sleep(20);
        for (int x = 0; x < 1000; x++)
            x = x + 1 - 1;
    }
}
static void testCycles()
{
    HANDLE hThread = nullptr;
    ::DuplicateHandle(::GetCurrentProcess(), ::GetCurrentThread(),
                      ::GetCurrentProcess(), &hThread, 0, false,
                      DUPLICATE_SAME_ACCESS);
    std::vector<ULONG64> results;
    results.resize(7);
    for (auto &tElapsed : results)
    {
        ULONG64 tStart = 0;
        ::QueryThreadCycleTime(hThread, &tStart);
        foo();
        ULONG64 tEnd = 0;
        ::QueryThreadCycleTime(hThread, &tEnd);
        tElapsed = tEnd - tStart;
    }
    ::CloseHandle(hThread);
}
And here are the results:
with Sleep(20)
in Thread
123383
192271
128028
208208
277983
223377
155222
in Main-Thread
191616
120002
126258
125267
141934
204753
125243
with Sleep(1000)
in Thread
121595
143863
182068
307464
388448
342315
468244
in Main-Thread
289568
256256
348599
359328
234065
167849
299888

Resource-intensive multithreading killing other processes

I have some very resource-intensive code that I wrote so I could split the workload over multiple pthreads. Everything works, and the computation is done faster, but my guess at what happens is that other processes on the machine get so slow that they crash after a few seconds of runtime.
I have already managed to kill random processes like Chrome tabs, the Cinnamon DE, or even the entire OS (kernel?).
Code: (It's late, and I'm too tired to write pseudocode, or even comments..)
-- But it's brute-force code, not so much for cracking as for testing passwords and/or CPU instructions per second.
Any ideas how to fix this, while still keeping as much performance as possible?
static unsigned int NTHREADS = std::thread::hardware_concurrency();
static int THREAD_COMPLETE = -1;
static std::string PASSWORD = "";
static std::string CHARS;
static std::mutex MUTEX;

void *find_seq(void *arg_0)
{
    unsigned int _arg_0 = *((unsigned int *) arg_0);
    std::string *str_CURRENT = new std::string(" ");
    while (true)
    {
        for (unsigned int loop_0 = _arg_0; loop_0 < CHARS.length() - 1; loop_0 += NTHREADS)
        {
            str_CURRENT->back() = CHARS[loop_0];
            if (*str_CURRENT == PASSWORD)
            {
                THREAD_COMPLETE = _arg_0;
                return (void *) str_CURRENT;
            }
        }
        str_CURRENT->back() = CHARS.back();
        for (int loop_1 = (str_CURRENT->length() - 1); loop_1 >= 0; loop_1--)
        {
            if (str_CURRENT->at(loop_1) == CHARS.back())
            {
                if (loop_1 == 0)
                    str_CURRENT->assign(str_CURRENT->length() + 1, CHARS.front());
                else
                {
                    str_CURRENT->at(loop_1) = CHARS.front();
                    str_CURRENT->at(loop_1 - 1) = CHARS[CHARS.find(str_CURRENT->at(loop_1 - 1)) + 1];
                }
            }
        }
    };
}
Areuz,
Can you post the full code? I suspect the issue is the NTHREADS value. On my Ubuntu box, the value is set to 8, which is the number of cores in the /proc/cpuinfo file. Kicking off 8 'hot' threads on my box hogs 100% of the CPU. The kernel will time-slice for its own critical processes, but in general all other processes will starve for CPU.
Check out the max processor value in /proc/cpuinfo and go at least one lower than that. The CPUs are numbered 0-7 on my box, so 7 would be the max for me. The actual max might be 3, since 4 of my cores are hyper-threads. For completely CPU-bound processes, hyper-threading generally doesn't help.
Bottom line, don't hog all the CPU; it will destabilize the system.
--Matt
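A minimal sketch of that suggestion against the question's NTHREADS (the lambda initializer and fallback are my choice; hardware_concurrency() may return 0 when it cannot detect the core count):
#include <thread>

// Leave one hardware thread free for the rest of the system.
static unsigned int NTHREADS = [] {
    unsigned int hw = std::thread::hardware_concurrency();
    return hw > 1 ? hw - 1 : 1u;
}();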
Thank you for your answers, and especially Matthew Fisher for his suggestion to try it on another system.
After some trial and error I decided to pull back my CPU overclock, which I had thought was stable (I'd had it for over a year), and that solved this weird behaviour. I guess I had never run such a CPU-intensive and (I'm guessing) efficient (in regards to not throttling the full CPU by yielding) script to see this happen.
As Matthew suggested, I need to come up with a better way than to just constantly check the THREAD_COMPLETE variable with a while-true loop, but I hope to resolve that in the comments.
Full and updated code for future visitors is here: pastebin.com/jbiYyKBu
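For future visitors, here is one way to replace that busy wait, assuming the main thread currently spins on THREAD_COMPLETE in a while-true loop (the helper names are mine): a condition variable lets the main thread sleep until a worker signals.
#include <mutex>
#include <condition_variable>

static std::mutex MUTEX;                 // already present in the question
static std::condition_variable DONE_CV;  // new
static int THREAD_COMPLETE = -1;         // already present in the question

// Called by the worker that finds the password:
void signal_done(int thread_id)
{
    {
        std::lock_guard<std::mutex> lock(MUTEX);
        THREAD_COMPLETE = thread_id;
    }
    DONE_CV.notify_one();
}

// Called by the main thread instead of spinning on THREAD_COMPLETE:
int wait_for_result()
{
    std::unique_lock<std::mutex> lock(MUTEX);
    DONE_CV.wait(lock, [] { return THREAD_COMPLETE != -1; });
    return THREAD_COMPLETE;
}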

Correctly measure CPU usage with HyperThreading?

I know there are several answers here on SO on how to measure CPU usage with either of two approaches:
By using the performance counters (PDH API)
By using GetProcessTimes() and dividing that against either wall time or times from GetSystemTimes()
For some days now I have been miserably failing to perform CPU usage measurements of my program with either of these - with both mechanisms I get a CPU usage that is smaller than the one displayed in Task Manager or Process Explorer. Is there some magic these tools use, and is it related to HyperThreading being enabled? I will perform my tests on a CPU without HyperThreading, but if anyone can point out what I am missing here I would be very thankful.
To illustrate what I have tried, here is the code that does the PDH-based measurements:
class CCpuUsageMonitor
{
public:
    CCpuUsageMonitor(const wchar_t* pProcessName)
    {
        GetSystemInfo(&m_SystemInfo);
        auto nStatus = PdhOpenQuery(NULL, NULL, &m_hPdhQuery);
        _ASSERT(nStatus == ERROR_SUCCESS);
        nStatus = PdhAddCounter(m_hPdhQuery, L"\\Processor(_Total)\\% Processor Time", NULL, &m_hPdhCpuUsageCounter);
        _ASSERT(nStatus == ERROR_SUCCESS);
        wchar_t pCounterPath[PDH_MAX_COUNTER_PATH];
        StringCbPrintf(pCounterPath, PDH_MAX_COUNTER_PATH, L"\\Process(%s)\\%% Processor Time", pProcessName);
        nStatus = PdhAddCounter(m_hPdhQuery, pCounterPath, NULL, &m_hPhdProcessCpuUsageCounter);
        _ASSERT(nStatus == ERROR_SUCCESS);
    }

    ~CCpuUsageMonitor()
    {
        // PdhCloseQuery takes the query handle by value
        PdhCloseQuery(m_hPdhQuery);
    }

    void CollectSample()
    {
        auto nStatus = PdhCollectQueryData(m_hPdhQuery);
        _ASSERT(nStatus == ERROR_SUCCESS);
    }

    double GetCpuUsage()
    {
        DWORD nType;
        PDH_FMT_COUNTERVALUE CounterValue;
        auto nStatus = PdhGetFormattedCounterValue(m_hPdhCpuUsageCounter, PDH_FMT_DOUBLE | PDH_FMT_NOCAP100, &nType, &CounterValue);
        _ASSERT(nStatus == ERROR_SUCCESS);
        return CounterValue.doubleValue;
    }

    double GetProcessCpuUsage()
    {
        DWORD nType;
        PDH_FMT_COUNTERVALUE CounterValue;
        auto nStatus = PdhGetFormattedCounterValue(m_hPhdProcessCpuUsageCounter, PDH_FMT_DOUBLE | PDH_FMT_NOCAP100, &nType, &CounterValue);
        _ASSERT(nStatus == ERROR_SUCCESS);
        return CounterValue.doubleValue / m_SystemInfo.dwNumberOfProcessors;
    }

private:
    SYSTEM_INFO m_SystemInfo;
    PDH_HQUERY m_hPdhQuery;             // PDH handle types rather than plain HANDLE
    PDH_HCOUNTER m_hPdhCpuUsageCounter;
    PDH_HCOUNTER m_hPhdProcessCpuUsageCounter;
};
With the second approach I basically take two snapshots of the process times via GetProcessTimes() before and after my code runs, subtract, and divide by the wall time multiplied by the number of processors.
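Concretely, the second approach looks roughly like this (a sketch; the helper names are mine):
#include <windows.h>

static ULONGLONG toQuad(const FILETIME &ft)
{
    return ((ULONGLONG)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
}

// CPU usage of the current process since the previous call, in percent.
double ProcessCpuUsageSince(ULONGLONG &prevCpu, ULONGLONG &prevWall)
{
    FILETIME ftCreation, ftExit, ftKernel, ftUser, ftWall;
    GetProcessTimes(GetCurrentProcess(), &ftCreation, &ftExit, &ftKernel, &ftUser);
    GetSystemTimeAsFileTime(&ftWall);

    SYSTEM_INFO si;
    GetSystemInfo(&si);

    ULONGLONG cpu = toQuad(ftKernel) + toQuad(ftUser); // 100 ns units
    ULONGLONG wall = toQuad(ftWall);
    double usage = 100.0 * (double)(cpu - prevCpu) /
                   ((double)(wall - prevWall) * si.dwNumberOfProcessors);
    prevCpu = cpu;
    prevWall = wall;
    return usage;
}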
Here are a few links I've used in the past and a good article on why GetThreadTimes is wrong (I wouldn't use it as a reliable source of data):
http://blog.kalmbachnet.de/?postid=28
https://msdn.microsoft.com/en-us/library/aa392397(VS.85).aspx
http://www.drdobbs.com/windows/win32-performance-measurement-options/184416651
https://msdn.microsoft.com/en-us/library/aa394279(VS.85).aspx
You seem well on your way and knowledgeable; those links should at least get you going in the right direction.
From this link:
Starting with Windows 8, a change was made to the way that Task Manager and Performance Monitor report CPU utilization...
This change affects the way that CPU utilization is computed. The values in Task Manager now correspond to the Processor Information\% Processor Utility and Processor Information\% Privileged Utility performance counters, not to the Processor Information\% Processor Time and Processor Information\% Privileged Time counters as in Windows 7.
Your code will work as written, other than the change in which counters you are querying. You are using the Processor counters; you should switch to the Processor Information counters added in Windows 8, and also use the "Utility" versions of the counters.
If you query the formatted value as you currently do, you'll get the same number displayed in Task Manager with 1-second polling.
If you want to do calculations over longer intervals, you can query the raw values into a PDH_RAW_COUNTER structure instead of your current PDH_FMT_COUNTERVALUE. The values used to calculate the usage's numerator are in the PDH_RAW_COUNTER structure's FirstValue, and the "base" values for the denominator are in SecondValue.
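A minimal sketch of that change against the question's PDH code (error handling omitted; Windows 8 or later):
#include <windows.h>
#include <pdh.h>
#pragma comment(lib, "pdh.lib")

double TaskManagerStyleCpuUsage()
{
    PDH_HQUERY hQuery = NULL;
    PDH_HCOUNTER hUtility = NULL;
    PdhOpenQuery(NULL, 0, &hQuery);
    // "Processor Information" plus the "Utility" counter, as used by Task Manager:
    PdhAddCounter(hQuery, L"\\Processor Information(_Total)\\% Processor Utility",
                  0, &hUtility);
    PdhCollectQueryData(hQuery);
    Sleep(1000); // a rate counter needs two samples
    PdhCollectQueryData(hQuery);
    PDH_FMT_COUNTERVALUE value;
    PdhGetFormattedCounterValue(hUtility, PDH_FMT_DOUBLE | PDH_FMT_NOCAP100,
                                NULL, &value);
    PdhCloseQuery(hQuery);
    return value.doubleValue; // should match Task Manager's CPU figure
}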

Multithreading: Why are two programs better than one?

Briefly, my problem:
I have a computer with 2 sockets of AMD Opteron 6272 and 64 GB RAM.
I run one multithreaded program on all 32 cores and get 15% less speed compared to the case where I run 2 programs, each on one 16-core socket.
How do I make the one-program version as fast as the two-program version?
More details:
I have a big number of tasks and want to fully load all 32 cores of the system.
So I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal I copy these groups 32 times and, using Intel TBB's parallel_for loop, distribute the tasks between the 32 cores.
I use pthread_setaffinity_np to ensure that the system does not make my threads jump between cores, and to ensure that all cores are used consecutively.
I use mlockall(MCL_FUTURE) to ensure that the system does not move my memory between sockets.
So the code looks like this:
void operator()(const blocked_range<size_t> &range) const
{
    for (unsigned int i = range.begin(); i != range.end(); ++i) {
        pthread_t I = pthread_self();
        int s;
        cpu_set_t cpuset;
        pthread_t thread = I;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        mlockall(MCL_FUTURE); // lock virtual memory to stay at the physical address where it was allocated

        TaskManager manager;
        for (int j = 0; j < fNTasksPerThr; j++) {
            manager.SetData(&(InpData->fInput[j]));
            manager.Run();
        }
    }
}
Only the computing time is important to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the time measurements.
void operator()(const blocked_range<size_t> &range) const
{
    for (unsigned int i = range.begin(); i != range.end(); ++i) {
        pthread_t I = pthread_self();
        int s;
        cpu_set_t cpuset;
        pthread_t thread = I;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        mlockall(MCL_FUTURE); // lock virtual memory to stay at the physical address where it was allocated

        InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
        for (int j = 0; j < fNTasksPerThr; j++) {
            InpData[i].fInput[j] = InpDataPerThread.fInput[j];
        }
    }
}
Now I run all of this on 32 cores and see a speed of ~1600 tasks per second.
Then I create two versions of the program and, with taskset and pthread, ensure that the first runs on the 16 cores of the first socket and the second on the second socket. I run them next to each other simply using the & command in the shell:
program1 & program2 &
Each of these programs achieves a speed of ~900 tasks/s. In total this is >1800 tasks/s, which is 15% more than the one-program version.
What am I missing?
I suspect the problem may be in the libraries, which I load into the memory of the master thread only. Can this be a problem? Can I copy the library data so it would be available independently on both sockets?
I would guess that it's STL/boost memory allocation that's spreading the memory for your collections, etc. across NUMA nodes, due to the fact that they're not NUMA-aware and you have threads in the program running on each node.
Custom allocators for all of the STL/boost things that you use might help (but is likely a huge job).
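If the per-thread input buffers are the main culprit, one lighter-weight option than full custom allocators is to place just those big buffers explicitly, e.g. with libnuma. A sketch, assuming the calling thread is already pinned and ProgramInputData (from the question) is trivially copyable:
#include <numa.h>  // libnuma; link with -lnuma

// Allocate the per-thread input buffer on the NUMA node of the calling
// (already pinned) thread, instead of wherever the master thread's heap is.
ProgramInputData *allocLocalInput(size_t nTasks)
{
    void *p = numa_alloc_local(nTasks * sizeof(ProgramInputData));
    return static_cast<ProgramInputData *>(p);  // release with numa_free()
}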
You might be suffering a bad case of false sharing of cache: http://en.wikipedia.org/wiki/False_sharing
Your threads probably share access to the same data structure through the blocked_range reference. If speed is all you need, you might want to pass a copy to each thread. If your data is too huge to fit on the call stack, you could dynamically allocate a copy of each range in different cache segments (i.e. just make sure they are far enough apart).
Or maybe I need to see the rest of the code to understand what you are doing better.
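To illustrate the false-sharing point: if several threads write results that sit within the same 64-byte cache line, every write invalidates the line for the other cores. Giving each thread's slot its own line avoids that (a sketch, assuming a 64-byte line size):
// Each thread writes only to its own element; alignas(64) keeps the
// elements on separate cache lines so the writes do not contend.
struct alignas(64) PerThreadResult
{
    unsigned long tasksDone;
};

static PerThreadResult results[32]; // one slot per core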

How can I measure CPU time in C++ on Windows, including calls to system()?

I want to run some benchmarks on a C++ algorithm and want to get the CPU time it takes, depending on inputs. I use Visual Studio 2012 on Windows 7. I already discovered one way to calculate the CPU time in Windows: How can I measure CPU time and wall clock time on both Linux/Windows?
However, I use the system() command in my algorithm, which is not measured that way. So, how can I measure CPU time and include the times of my script calls via system()?
I should add a small example. This is my get_cpu_time function (from the link described above):
double get_cpu_time()
{
    FILETIME a, b, c, d;
    if (GetProcessTimes(GetCurrentProcess(), &a, &b, &c, &d) != 0) {
        // Returns total user time (d).
        // Can be tweaked to include kernel time (c) as well.
        return
            (double)(d.dwLowDateTime |
            ((unsigned long long)d.dwHighDateTime << 32)) * 0.0000001;
    }
    else {
        // Handle error
        return 0;
    }
}
That works fine so far: when I write a program that sorts some array (or does some other work that takes a while), the time is measured correctly. However, when I use the system() command, as in this case, it isn't:
int main(int argc, const char* argv[])
{
    double start = get_cpu_time();
    double end;
    system("Bla.exe");
    end = get_cpu_time();
    printf("Everything took %f seconds of CPU time", end - start);
    std::cin.get();
}
The execution of the given exe-file is measured in the same way and takes about 5 seconds. When I run it via system(), the whole thing takes a CPU time of 0 seconds, which obviously does not include the execution of the exe-file.
One possibility would be to get a HANDLE on the system call; is that possible somehow?
Linux:
For the wall clock time, use gettimeofday() or clock_gettime()
For the CPU time, use getrusage() or times()
It will actually print the CPU time that your program takes. But if you use threads in your program, it will not work properly. You should wait for each thread to finish its job before taking the final CPU time. So basically you should write this:
WaitForSingleObject(threadhandle, INFINITE);
If you don't know exactly what you use in your program (whether it's multithreaded or not), you can create a thread to do that job, wait for the thread's termination, and measure the time.
DWORD WINAPI MyThreadFunction(LPVOID lpParam);

int main()
{
    DWORD dwThreadId;
    HANDLE hThread;
    double startcputime, endcputime;

    startcputime = get_cpu_time(); // the question's GetProcessTimes-based helper
    hThread = CreateThread(
        NULL,              // default security attributes
        0,                 // use default stack size
        MyThreadFunction,  // thread function name
        NULL,              // argument to thread function
        0,                 // use default creation flags
        &dwThreadId);      // receives the thread identifier
    WaitForSingleObject(hThread, INFINITE);
    endcputime = get_cpu_time();
    std::cout << "it took " << endcputime - startcputime << " s of CPU to execute this\n";
    return 0;
}

DWORD WINAPI MyThreadFunction(LPVOID lpParam)
{
    // do your job here
    return 0;
}
If you're using C++11 (or have access to it), std::chrono has all of the functions you need to calculate how long a program has run.
You'll need to add your process to a Job object before creating any child processes. Child processes will then automatically run in the same job, and the information you want can be found in the TotalUserTime and TotalKernelTime members of the JOBOBJECT_BASIC_ACCOUNTING_INFORMATION structure, available through the QueryInformationJobObject function.
Further information:
Resource Accounting for Jobs
JOBOBJECT_BASIC_ACCOUNTING_INFORMATION structure
Beginning with Windows 8, nested jobs are supported, so you can use this method even if some of the programs already rely on job objects.
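A minimal sketch of that approach, adapted to the question's example (error handling omitted):
#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Put this process in a job before spawning children; the child that
    // system() creates is then accounted to the same job.
    HANDLE hJob = CreateJobObject(NULL, NULL);
    AssignProcessToJobObject(hJob, GetCurrentProcess());

    system("Bla.exe");

    JOBOBJECT_BASIC_ACCOUNTING_INFORMATION info = {};
    QueryInformationJobObject(hJob, JobObjectBasicAccountingInformation,
                              &info, sizeof(info), NULL);
    // TotalUserTime and TotalKernelTime are in 100-nanosecond units.
    double cpuSeconds =
        (info.TotalUserTime.QuadPart + info.TotalKernelTime.QuadPart) * 1e-7;
    printf("Everything took %f seconds of CPU time", cpuSeconds);
    CloseHandle(hJob);
    return 0;
}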
I don't think there is a cross-platform mechanism. Using CreateProcess to launch the application, with a WaitForSingleObject for the application to finish, would allow you to get the direct descendant's times. After that you would need job objects for complete accounting (if you need to time grandchildren).
You might also give external sampling profilers a shot. I've used the freebie "Sleepy" [http://sleepy.sourceforge.net/] and the even better "Very Sleepy" [http://www.codersnotes.com/sleepy/] profilers under Windows and been very happy with the results -- nicely formatted info in a few minutes with virtually no effort.
There is a similar project called "Shiny" [http://sourceforge.net/projects/shinyprofiler/] that is supposed to work on both Windows and *nix.
You can try using Boost.Timer. It is cross-platform. Sample code from the Boost website:
#include <boost/timer/timer.hpp>
#include <cmath>

int main()
{
    boost::timer::auto_cpu_timer t;
    for (long i = 0; i < 100000000; ++i)
        std::sqrt(123.456L); // burn some time
    return 0;
}