C++ parallel programming bug

I am trying to do some parallel programming. I have been following a guide and I have this code:
void main()
{
    CPUs = GetNumCPUs();
    HANDLE *threads = new HANDLE[CPUs];
    queues = new queue<functionPointer>[CPUs];
    DWORD_PTR threadID = 0;
    DWORD_PTR threadCore = 1 << 0;
    threads[0] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)loop, (LPVOID)&queues, NULL, &threadID);
    SetThreadAffinityMask(threads[0], threadCore);
    for (DWORD_PTR i = 1; i < CPUs; i++)
    {
        threadID = i;
        threadCore = 1 << i;
        threads[i] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)Coroutine, (LPVOID)&queues[i], NULL, &threadID);
        SetThreadAffinityMask(threads[i], threadCore);
        wprintf(L"Creating Thread %d (0x%08x) Assigning to CPU 0x%08x\r\n", i, (LONG_PTR)threads[i], threadCore);
    }
    while(true) Sleep(1000);
}
The function the threads run just adds 1 to a variable. I have seen that this code is not faster than the code without the threads. I think I did something wrong and it is not actually using multiple cores. What is it?
Here is the guide: http://www.dreamincode.net/forums/topic/52380-multi-threading-on-multi-processors/
Adding 1 to a variable was just an example. I have a very complicated program that takes 8-9 seconds to finish. That is why I need multiprocessing.

If you are not running your code on a multi-processor/multi-core system, then you will not see any performance gain.
If you are, but your threads are only doing trivial work (adding 1 to a variable?), it may cost more processor cycles to create and shut down each thread than the thread spends doing its work. In that case you'd be better off doing all the work in a single thread.
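To illustrate the point about thread overhead, here is a minimal sketch in portable C++ (std::thread rather than the CreateThread calls from the question) that gives each worker one large contiguous chunk of work and joins the workers at the end. The name parallel_sum and the workers parameter are made up for this example and are not taken from the guide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sums `values` using `workers` threads, each handling one contiguous chunk.
// `parallel_sum` and `workers` are illustrative names (assumptions).
std::uint64_t parallel_sum(const std::vector<std::uint32_t>& values, unsigned workers)
{
    assert(workers > 0);
    std::vector<std::uint64_t> partial(workers, 0); // one result slot per thread
    std::vector<std::thread> pool;
    const std::size_t chunk = (values.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(values.size(), w * chunk);
        const std::size_t end = std::min(values.size(), begin + chunk);
        pool.emplace_back([&values, &partial, w, begin, end] {
            // Accumulate locally and publish once, so threads do not
            // fight over the same memory on every addition.
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += values[i];
            partial[w] = local;
        });
    }
    for (std::thread& t : pool) t.join(); // wait for the workers to finish

    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

On a real workload you might pass std::thread::hardware_concurrency() as workers. For a handful of elements the plain single-threaded loop will very likely win, because creating and joining the threads costs more than the additions themselves.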

Related

Problem with multi-threading and waiting on events

I have a problem with my code:
#define _CRT_SECURE_NO_WARNINGS
#include <iostream>
#include <windows.h>
#include <string.h>
#include <math.h>
HANDLE event;
HANDLE mutex;
int runner = 0;
DWORD WINAPI thread_fun(LPVOID lpParam) {
    int* data = (int*)lpParam;
    for (int j = 0; j < 4; j++) { // this loop is necessary in order to reproduce the issue
        if ((data[2] + 1) == data[0]) { // if it is the last thread
            while (1) {
                WaitForSingleObject(mutex, INFINITE);
                if (runner == data[0] - 1) { // if all other threads have reached the event, break
                    ReleaseMutex(mutex);
                    break;
                }
                printf("Run:%d\n", runner);
                ReleaseMutex(mutex);
                Sleep(10);
            }
            printf("Check Done:<<%d>>\n", data[2]);
            runner = 0;
            PulseEvent(event); // let all other threads continue
        }
        else { // if it is not the last thread
            WaitForSingleObject(mutex, INFINITE);
            runner++;
            ReleaseMutex(mutex);
            printf("Wait:<<%d>>\n", data[2]);
            WaitForSingleObject(event, INFINITE); // wait till all other threads reach this stage
            printf("Exit:<<%d>>\n", data[2]);
        }
    }
    return 0;
}
int main()
{
    event = CreateEvent(NULL, TRUE, FALSE, NULL);
    mutex = CreateMutex(NULL, FALSE, NULL);
    SetEvent(event);
    int data[3] = {2,8}; // [0] number of threads, [1] amount of numbers
    HANDLE t[10000];
    int ThreadData[1000][3];
    for (int i = 0; i < data[0]; i++) {
        memcpy(ThreadData[i], data, sizeof(int) * 2); // copy the number of threads and amount of numbers to the thread's data
        ThreadData[i][2] = i; // create thread id
        LPVOID ThreadsData = (LPVOID)(&ThreadData[i]);
        t[i] = CreateThread(0, 0, thread_fun, ThreadsData, 0, NULL);
        if (t[i] == NULL)return 0;
    }
    while (1) {
        DWORD res = WaitForMultipleObjects(data[0], t, true, 1000);
        if (res != WAIT_TIMEOUT) break;
    }
    for (int i = 0; i < data[0]; i++)CloseHandle(t[i]); // close all threads
    CloseHandle(event); // close event
    CloseHandle(mutex); // close mutex
    printf("Done");
}
The main idea is to wait until all threads except one reach the event and wait there, meanwhile the last thread must release them from waiting.
But the code doesn't work reliably. About 1 run in 10 ends correctly; the other 9 get stuck in while (1). Across runs, the printf inside that loop (printf("Run:%d\n", runner);) prints different values of runner (0 and 3).
What can be the problem?

As we found out in the comments section, the problem was that although the event was created in the non-signalled initial state
event = CreateEvent(NULL, TRUE, FALSE, NULL);
it was being set to the signalled state immediately afterwards:
SetEvent(event);
Due to this, at least on the first iteration of the loop, when j == 0, the first worker thread wouldn't wait for the second worker thread, which caused a race condition.
Also, the following issues with your code are worth mentioning (although these issues were not the reason for your problem):
According to the Microsoft documentation on PulseEvent, that function should not be used, as it is unreliable and exists mainly for backward compatibility. According to the documentation, you should use condition variables instead.
In your function thread_fun, the last thread is locking and releasing the mutex in a loop. This can be bad, because mutexes are not guaranteed to be fair and it is possible that this will cause other threads to never be able to acquire the mutex. Although this possibility is mitigated by you calling Sleep(10); once in every loop iteration, it is still not the ideal solution. A better solution would be to use a condition variable, so that the thread only checks for changes of the variable runner when another thread actually signals a possible change. Such a solution would also be better for performance reasons.
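As a rough sketch of the condition-variable approach suggested above, the following uses the C++ standard library (std::mutex and std::condition_variable) instead of the Win32 primitives from the question; that substitution is an assumption made to keep the example portable. The names Barrier, arrive_and_wait and run_rounds are hypothetical. Each thread sleeps until the last thread of the round arrives, so nobody polls runner in a loop, and a generation counter keeps a fast thread from slipping through the next round early.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: every caller blocks in arrive_and_wait() until all
// `count` threads have arrived, then all of them are released together.
class Barrier {
public:
    explicit Barrier(int count) : threshold_(count), waiting_(0), generation_(0) {
        assert(count > 0);
    }

    void arrive_and_wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long my_generation = generation_;
        if (++waiting_ == threshold_) {
            // Last arrival: reset for the next round and wake everyone.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            // The predicate protects against spurious wakeups.
            cv_.wait(lock, [&] { return generation_ != my_generation; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int threshold_;
    int waiting_;
    unsigned long generation_;
};

// Runs `threads` workers through `rounds` barrier rounds. Returns how often a
// thread left the barrier before a peer had reached that round (should be 0).
int run_rounds(int threads, int rounds) {
    Barrier barrier(threads);
    std::vector<int> progress(threads, 0);
    std::mutex progress_mutex;
    int violations = 0;

    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&, id] {
            for (int r = 1; r <= rounds; ++r) {
                {
                    std::lock_guard<std::mutex> guard(progress_mutex);
                    progress[id] = r; // this thread has reached round r
                }
                barrier.arrive_and_wait();
                std::lock_guard<std::mutex> guard(progress_mutex);
                for (int p : progress)
                    if (p < r) ++violations; // a peer was left behind
            }
        });
    }
    for (std::thread& t : pool) t.join();
    return violations;
}
```

Calling run_rounds(2, 4) mirrors the two threads and four loop iterations of the question. Unlike the PulseEvent version, a thread that arrives late cannot miss the wakeup, because the wait re-checks the generation counter under the mutex.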

ResumeThread takes over a minute to resume

I'm using SuspendThread / ResumeThread to modify the RIP register between the calls through GetThreadContext / SetThreadContext. It allows me to execute arbitrary code in a thread in another process.
So this works, but sometimes ResumeThread takes about 60 seconds to resume the target thread.
I understand that I'm somewhat abusing the API through this usage, but is there any way to speed this up? Or something I should look at that might indicate a bad usage?
The target thread is a sample program that loops over itself.
uint64_t blarg = 1;
while (true) {
    Sleep(100);
    std::cout << blarg << std::endl;
    blarg++;
    if (blarg == std::numeric_limits<uint64_t>::max()) {
        blarg = 0;
    }
}
The Suspend / Resume sequence is very simple as well:
void hijackRip(uint64_t targetAddress, DWORD threadId){
    HANDLE targetThread = OpenThread(THREAD_ALL_ACCESS, FALSE, threadId);
    DWORD suspendResult = SuspendThread(targetThread); // SuspendThread returns a DWORD, not an NTSTATUS
    CONTEXT threadContext;
    memset(&threadContext, 0, sizeof(threadContext));
    threadContext.ContextFlags = CONTEXT_ALL;
    BOOL getThreadContextResult = GetThreadContext(targetThread, &threadContext);
    threadContext.Rip = targetAddress;
    BOOL setThreadContextResult = SetThreadContext(targetThread, &threadContext);
    DWORD resumeThreadResult = ResumeThread(targetThread);
}
Again, this works, and I can redirect execution correctly, but only 30 to 60 seconds after executing this function.

Windows creates events on thread shutdown

I am attempting to add handle leak detection to the unit test framework on my code. (Windows 7, x64 VS2010)
I basically call GetProcessHandleCount() before and after each unit test.
This works fine except when threads are created/destroyed as part of the test.
It seems that Windows occasionally creates 1-3 events on thread shutdown. Running the same test in a loop does not increase the event creation count (e.g. running the test 5000 times in a loop still results in only 1-3 extra events being created).
I do not create events manually in my own code.
It seems that this is similar to this problem:
boost::thread causing small event handle leak?
but I am doing manual thread creation/shutdown.
I followed this code:
http://blogs.technet.com/b/yongrhee/archive/2011/12/19/how-to-troubleshoot-a-handle-leak.aspx
And got this callstack from WinDbg:
Outstanding handles opened since the previous snapshot:
--------------------------------------
Handle = 0x0000000000000108 - OPEN
Thread ID = 0x00000000000030dc, Process ID = 0x0000000000000c90
0x000000007715173a: ntdll!NtCreateEvent+0x000000000000000a
0x0000000077133f26: ntdll!RtlpCreateCriticalSectionSem+0x0000000000000026
0x0000000077133ee3: ntdll!RtlpWaitOnCriticalSection+0x000000000000014e
0x000000007714e40b: ntdll!RtlEnterCriticalSection+0x00000000000000d1
0x0000000077146ad2: ntdll!LdrShutdownThread+0x0000000000000072
0x0000000077146978: ntdll!RtlExitUserThread+0x0000000000000038
0x0000000076ef59f5: kernel32!BaseThreadInitThunk+0x0000000000000015
0x000000007712c541: ntdll!RtlUserThreadStart+0x000000000000001d
--------------------------------------
As you can see, this is an event created on the thread shutdown.
Is there a better way of doing this handle leak detection in unit tests? My only current options are:
Forget trying to do this handle leak detection
Spin up some dummy tasks to attempt to create these spurious events.
Allow some small tolerance value in leaks and run each test hundreds of times (so actual leaks will show up as a large number)
Get the handle count excluding events (takes a lot of difficult code)
I have also tried switching to using std::thread in VS2013, but it seems that it creates a lot of background threads and handles when used. (makes the count difference much worse)
Here is a self-contained example where 99+% of the time (on my computer) an event is created behind the scenes (the handle count differs). Putting the startup/shutdown code in a loop indicates that it does not directly leak, but accumulates the occasional events:
#include <stdio.h>
#include <Windows.h>
#include <process.h>
#define THREADCOUNT 3
static HANDLE s_semCommand, s_semRender;
static unsigned __stdcall ExecutiveThread(void *)
{
    WaitForSingleObject(s_semCommand, INFINITE);
    ReleaseSemaphore(s_semRender, THREADCOUNT - 1, NULL);
    return 0;
}
static unsigned __stdcall WorkerThread(void *)
{
    WaitForSingleObject(s_semRender, INFINITE);
    return 0;
}
int main(int argc, char* argv[])
{
    DWORD oldHandleCount = 0;
    GetProcessHandleCount(GetCurrentProcess(), &oldHandleCount);
    s_semCommand = CreateSemaphoreA(NULL, 0, 0xFFFF, NULL);
    s_semRender = CreateSemaphoreA(NULL, 0, 0xFFFF, NULL);
    // Spool threads up
    HANDLE threads[THREADCOUNT];
    for (int i = 0; i < THREADCOUNT; i++)
    {
        threads[i] = (HANDLE)_beginthreadex(NULL, 4096, (i==0) ? ExecutiveThread : WorkerThread, NULL, 0, NULL);
    }
    // Signal shutdown - Wait for threads and close semaphores
    ReleaseSemaphore(s_semCommand, 1, NULL);
    for (int i = 0; i < THREADCOUNT; i++)
    {
        WaitForSingleObject(threads[i], INFINITE);
        CloseHandle(threads[i]);
    }
    CloseHandle(s_semCommand);
    CloseHandle(s_semRender);
    DWORD newHandleCount = 0;
    GetProcessHandleCount(GetCurrentProcess(), &newHandleCount);
    printf("Handle %lu -> %lu", oldHandleCount, newHandleCount); // DWORD is an unsigned long
    return 0;
}

Testing the need of synchronization lock on primitive data types with C++

I'm seeing a lot of threads on this forum dealing with a question whether or not we need to use synchronization when accessing primitive data types from multiple threads: Question 1, question 2, question 3, question 4...
So I wrote a small test to verify this.
I ran it for over an hour on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz, which has 4 physical cores:
#define MULTIPLIC_VAL 17
DWORD gdwSharedVal01 = MULTIPLIC_VAL;
DWORD WINAPI thread001(LPVOID lpParameter);
//Begin threads
for(int i = 0; i < 30; i++)
{
    DWORD dwThreadId;
    HANDLE hThread = ::CreateThread(NULL, 0, thread001, NULL, 0, &dwThreadId);
    if(hThread)
    {
        ::CloseHandle(hThread);
    }
    else
    {
        _tprintf(L"ERROR: CreateThread error %d\n", ::GetLastError());
    }
}
//Wait
getchar();
BOOL checkSharedValue()
{
    //RETURN:
    // = TRUE if value is OK
    if((gdwSharedVal01 % MULTIPLIC_VAL) == 0)
    {
        return TRUE;
    }
    return FALSE;
}
DWORD WINAPI thread001(LPVOID lpParameter)
{
    srand((UINT)time(NULL));
    DWORD dwThreadID = ::GetCurrentThreadId();
    _tprintf(L"Thread %d began...\n", dwThreadID);
    for(;;)
    {
        //Set value
        DWORD v = rand();
        v <<= 16;
        v ^= rand();
        v = v / MULTIPLIC_VAL;
        gdwSharedVal01 = v * MULTIPLIC_VAL;
        //Check value
        if(!checkSharedValue())
        {
            //Failed
            _tprintf(L"FAILED thread %d\n", dwThreadID);
        }
    }
    return 0;
}
and I got no fails. So how would you explain it?

On Intel processors, reads and writes of aligned words are atomic operations (atomic in the sense that other processors will see either the original or the new value).
Note that this does not mean you can skip synchronization. In this test case the threads only write new values to, and read them back from, a single variable, and there are no other variables in play (where compiler/CPU reordering could cause other issues). If the update involved a read followed by a write, it could fail: with, say, 10 threads each incrementing the variable by 100, the variable might not end up incremented by 1000 in total.
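The increment case can be made safe with an atomic read-modify-write. The sketch below uses std::atomic from the C++ standard library rather than the Win32 calls in the question (an assumption made to keep it portable); atomic_total and its parameters are illustrative names. Ten threads each add 1 to the counter 100 times, and because fetch_add is a single indivisible operation, no increment is lost.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each of `threads` workers adds 1 to a shared counter `per_thread` times.
// fetch_add performs the read, the addition and the write as one atomic step.
unsigned long atomic_total(unsigned threads, unsigned per_thread)
{
    assert(threads > 0);
    std::atomic<unsigned long> counter{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (std::thread& t : pool) t.join(); // join makes all increments visible
    return counter.load();
}
```

With a plain unsigned long and counter += 1 in place of fetch_add, the same program would contain a data race and could report a smaller total, even on hardware where each individual load and store is atomic.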

SetThreadAffinityMask() not seeming to take effect more than once

I'm trying to set the affinity of my thread to a certain mask each time I run a thread by pressing a button. It will work the first time I do it after opening the window, but not after that. However, my OutputDebugString code produces output that suggests it has been changed. I've tried using CloseHandle() but that didn't seem to have an effect. Is there something else it could be?
void CSMPDemoDlg::OnBnClickedButton1()
{
    // Start thread
    DWORD_PTR affinityMask = (static_cast<DWORD_PTR>(1) << NumberOfCores ) - 1;
    HANDLE WorkThreadHandle = CreateThread(NULL, 0, WorkThread, &tp, 0, NULL);
    DWORD_PTR z = SetThreadAffinityMask(WorkThreadHandle, affinityMask);
    if (z!=0) {
        char bb[100];
        sprintf_s(bb, 100, "Affinity changed from %d to %d", z, affinityMask);
        OutputDebugString(bb);
    }
}

So, you want something like this:
static int count = 0;
DWORD_PTR affinityMask = (static_cast<DWORD_PTR>(1) << NumberOfCores) - 1;
affinityMask <<= ((count * NumberOfCores) % totalCores);
count++; // move on to the next group of cores for the next run
That means that each run will use the next set of cores, so if you run on, say, 4 cores, the first time it will run on cores 0..3, then 4..7, then 8..11.
It does assume that totalCores is a multiple of NumberOfCores, so if you have 16 cores and NumberOfCores = 3, you'll get weird results.
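One possible way around that caveat is to rotate over whole groups of cores only, so the mask never runs past the last core. This is a variation on the formula above, not the same calculation, and it is written as a pure function on a 64-bit integer so it can be tried without any Windows headers. The name affinity_mask_for_run and its parameters are hypothetical; on Windows the result would be cast to DWORD_PTR before being passed to SetThreadAffinityMask.

```cpp
#include <cassert>
#include <cstdint>

// Returns a mask with `coresPerRun` consecutive bits set, starting at the
// group selected by `runIndex` and wrapping around after the last whole group.
// Assumes 0 < coresPerRun <= totalCores <= 64. Cores left over after the last
// whole group are never selected.
std::uint64_t affinity_mask_for_run(unsigned runIndex, unsigned coresPerRun, unsigned totalCores)
{
    assert(coresPerRun > 0 && coresPerRun <= totalCores && totalCores <= 64);
    const unsigned groups = totalCores / coresPerRun;        // whole groups only
    const unsigned shift = (runIndex % groups) * coresPerRun;
    const std::uint64_t base = (coresPerRun == 64)
        ? ~std::uint64_t{0}                                   // avoid a 64-bit shift
        : (std::uint64_t{1} << coresPerRun) - 1;
    return base << shift;
}
```

With 16 cores and 4 cores per run this cycles through cores 0..3, 4..7, 8..11, 12..15 and then starts over. With 16 cores and 3 cores per run it cycles through five groups and leaves core 15 unused.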