How to set thread priority in privately managed pools in Windows? - c++

I am following the examples given here. While I am able to create threads successfully, these threads have default affinity to all the processors.
How do I set the affinity? Can someone please provide an example of how I can use SetThreadAffinityMask with the examples given in the above link?

Ok, I'm going to assume you want affinity. The second parameter of SetThreadAffinityMask is a bit mask representing the processors on which the thread is allowed to run: bit n is set to 1 if the thread may run on processor n. For example:
// binary 01, so it allows this thread to run on CPU 0
SetThreadAffinityMask(hThread, 0x01);
// binary 10, so it allows this thread to run on CPU 1
SetThreadAffinityMask(hThread, 0x02);
// binary 11, so it allows this thread to run on CPU 0 or CPU 1
SetThreadAffinityMask(hThread, 0x03);
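Putting that together with thread creation, a minimal sketch (not tied to the example code in the linked article; the Worker function and the use of CREATE_SUSPENDED are just illustrative choices) could look like this:

#include <windows.h>
#include <cstdio>

DWORD WINAPI Worker(LPVOID)
{
    // Report which CPU the thread actually ended up on.
    printf("worker running on CPU %lu\n", GetCurrentProcessorNumber());
    return 0;
}

int main()
{
    // Create the thread suspended so the affinity mask is in place
    // before it executes its first instruction.
    HANDLE hThread = CreateThread(nullptr, 0, Worker, nullptr, CREATE_SUSPENDED, nullptr);
    if (hThread == nullptr)
        return 1;

    // Binary 01: allow the thread to run on CPU 0 only.
    if (SetThreadAffinityMask(hThread, 0x01) == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    ResumeThread(hThread);
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    return 0;
}

The same call works on any thread handle you obtain elsewhere, as long as it has the required access rights (THREAD_SET_INFORMATION and THREAD_QUERY_INFORMATION).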

Related

Strange behaviors of cuda kernel with infinite loop on different NVIDIA GPU

#include <cstdio>
__global__ void loop(void) {
    int smid = -1;
    if (threadIdx.x == 0) {
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        printf("smid: %d\n", smid);
    }
    while (1);
}
int main() {
    loop<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
This is my source code: the kernel just prints the smid when the thread index is 0 and then enters an infinite loop, and the host just launches the kernel and waits for it. I ran some experiments under two different configurations:
1. GPU(Geforce 940M) OS(Ubuntu 18.04) MPS(Enable) CUDA(v11.0)
2. GPU(Geforce RTX 3050Ti Mobile) OS(Ubuntu 20.04) MPS(Enable) CUDA(v11.4)
Experiment 1: When I run this code under configuration 1, the GUI seems to freeze, because no graphical response can be observed anymore; but as soon as I press Ctrl+C and the CUDA process is killed, this phenomenon disappears.
Experiment 2: When I run this code under configuration 2, the system works well without any abnormal behavior, and the smid output, such as smid: 2\n, is displayed.
Experiment 3: When I change the block configuration to loop<<<1, 1024>>> and run this new code twice under configuration 2, I get the same smid output both times, such as smid: 2\nsmid: 2\n. (For the GeForce RTX 3050 Ti Mobile, the number of SMs is 20, the maximum number of threads per multiprocessor is 1536, and the maximum number of threads per block is 1024.)
I'm confused by these results, and here are my questions:
1. Why doesn't the system output smid under configuration 1?
2. Why does the GUI seem to freeze under configuration 1?
3. Unlike experiment 1, why does experiment 2 output smid normally?
4. In the third experiment, the block configuration reaches 1024 threads, which means that two different blocks cannot be scheduled onto the same SM. Under MPS, all CUDA contexts are merged into one context and share the GPU resources without time-slicing anymore, so why do I still get the same smid in the third experiment? (Furthermore, when I change the grid configuration to 10 and run it twice, the smid varies from 0 to 19 and each smid appears exactly once!)
Why doesn't the system output smid under configuration 1?
A safe rule of thumb is that, unlike host code, in-kernel printf output is not printed to the console at the moment the statement is encountered, but at the point of completion of the kernel and device synchronization with the host. This is the regime in effect in configuration 1, which uses a Maxwell GPU. So no printf output is observed in configuration 1, because the kernel never ends.
Why does the GUI seem to freeze under configuration 1?
For the purpose of this discussion, there are two possible regimes: a pre-Pascal regime in which compute preemption is not possible, and a post-Pascal regime in which it is. Your configuration 1 is a Maxwell device, which is pre-Pascal. Your configuration 2 is an Ampere device, which is post-Pascal. So in configuration 2, compute preemption is working. This has a variety of impacts, one of which is that the GPU will service both GUI needs and compute-kernel needs "simultaneously" (the low-level behavior is not thoroughly documented, but it is a form of time-slicing, alternating attention between the compute kernel and the GUI). Therefore in configuration 1, pre-Pascal, any kernel running for a noticeable time will "freeze" the GUI for the duration of its execution. In configuration 2, the GPU services both, to some degree.
Unlike experiment 1, why does experiment 2 output smid normally?
Although it's not well documented, the compute-preemption process appears to introduce an additional synchronization point, allowing the printf buffer to be flushed, as mentioned in point 1. If you read the documentation linked there, you will see that "synchronization point" covers a number of possibilities, and compute preemption seems to introduce a new one.
Sorry, I won't be able to answer your 4th question at this time. A best practice on SO is to ask one question per question. However, I would consider usage of MPS with a GPU that is also servicing a display to be "unusual". Since we've established that compute preemption is in effect here, it may be that, due to compute preemption as well as the need to service a display, the GPU services clients in a round-robin, time-sliced fashion (since it must do so anyway to service the display). In that case the behavior under MPS may be different: compute preemption allows the usual limitations you are describing to be voided, since one kernel can completely replace another.

Set Process Affinity of System process to reserve core for own Application

I am working on a CPU-intensive real-time application, and therefore I am trying to reserve a whole core for it.
To accomplish this on Windows, I am trying to set the CPU affinity of all running processes to the other cores, and then set the affinity of my real-time application to the "free" core. Additionally, I am setting its priority to high.
Unfortunately, the following code (using 129 for testing, as that means the first and last core on my system) is not changing the affinity of all running processes:
while (Process32Next(hSnapShot, processInfo) != FALSE)
{
    hProcess = OpenProcess(PROCESS_ALL_ACCESS, TRUE, processInfo->th32ProcessID);
    SetProcessAffinityMask(hProcess, 129);
}
Some system processes, like svchost.exe or csrss.exe, have the affinity 0xCCCCCCCC (it looks as if it is not initialized and not used at all). And, of course, they keep it after a failed SetProcessAffinityMask().
Also, using Task Manager is not possible, as it denies access when trying to change the affinity of those system processes.
Is it possible to change the affinity for those processes as well?
Additional Information:
Windows 7 64bit
Real-time app has only one thread, therefore one core is "enough".
(Screenshots of the non-working and working affinity settings were attached to the original post to show the difference.)
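For reference, here is a minimal sketch of the enumeration loop described above with the missing pieces filled in (Process32First before the loop, handle cleanup, and PROCESS_SET_INFORMATION instead of PROCESS_ALL_ACCESS). The mask values are assumptions for an 8-core machine rather than the 129 used for testing, and protected system processes can still be expected to fail at OpenProcess or SetProcessAffinityMask:

#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>

int main()
{
    const DWORD_PTR otherCoresMask = 0xFE;  // cores 1-7: where everything else goes
    const DWORD_PTR reservedMask   = 0x01;  // core 0: reserved for this application

    HANDLE hSnapShot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (hSnapShot == INVALID_HANDLE_VALUE)
        return 1;

    PROCESSENTRY32 pe = { sizeof(pe) };
    for (BOOL ok = Process32First(hSnapShot, &pe); ok; ok = Process32Next(hSnapShot, &pe))
    {
        HANDLE hProcess = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pe.th32ProcessID);
        if (hProcess == nullptr)
        {
            // Typically ERROR_ACCESS_DENIED for protected system processes.
            printf("skipping PID %lu (error %lu)\n", pe.th32ProcessID, GetLastError());
            continue;
        }
        if (!SetProcessAffinityMask(hProcess, otherCoresMask))
            printf("could not move PID %lu (error %lu)\n", pe.th32ProcessID, GetLastError());
        CloseHandle(hProcess);
    }
    CloseHandle(hSnapShot);

    // Finally, pin this process to the reserved core and raise its priority.
    SetProcessAffinityMask(GetCurrentProcess(), reservedMask);
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    return 0;
}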

Linux - make sure a core is exclusively saved for critical tasks

I have a process that is launched on a Linux-based machine with exactly two cores.
Let's assume my process is the only process in the system (I will ignore other processes, even the system ones).
My process is divided to two parts:
Critical performance code
Low priority code
Also let's assume my main process was launched on Core 0, and I want to exclusively reserve Core 1 for the critical performance code.
I'd like to divide the question into two:
How can I make sure that every thread in my process (including threads created by 3rd-party libraries I have linked against, which might call pthread_create etc.) is always created on Core 0?
How can I write a test that verifies that Core 1 is doing absolutely nothing besides the performance-critical path?
I am familiar with APIs such as:
pthread_setaffinity_np
that can set a specific thread's affinity, but I want to know whether there is a lower-level way to make sure that even threads created by 3rd-party libraries (from inside the process) are also pinned to Core 0.
Perhaps I can set the default affinity for the process to be Core 0 and for a specific thread - pin it to Core 1?
You have already described the solution you want:
Perhaps I can set the default affinity for the process to be Core 0 and for a specific thread - pin it to Core 1?
But perhaps the question is that you are not sure how to achieve this.
Linux provides sched_setaffinity to set the affinity of the calling thread; if you call it early in main, every thread created afterwards, including threads created by 3rd-party libraries, inherits that mask.
To get newly created threads to run on a specific core, the easiest way is to initialize a pthread_attr_t, and set the desired core affinity with pthread_attr_setaffinity_np.
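A minimal sketch of that combination, assuming a two-core machine (the critical_work function and the specific core numbers are just placeholders):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Placeholder for the performance-critical code.
static void* critical_work(void*)
{
    printf("critical thread running on CPU %d\n", sched_getcpu());
    return nullptr;
}

int main()
{
    // 1. Confine the calling thread to Core 0 early; every thread created
    //    afterwards (including by 3rd-party libraries) inherits this mask.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    // 2. Create the critical thread with an attribute that pins it to Core 1,
    //    so it never runs on Core 0 at all.
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t tid;
    pthread_create(&tid, &attr, critical_work, nullptr);
    pthread_attr_destroy(&attr);

    pthread_join(tid, nullptr);
    return 0;
}

Using the attribute (rather than calling pthread_setaffinity_np after pthread_create) avoids the brief window in which the new thread could be scheduled on Core 0.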
Another solution is to install (if you do not have it already) and run the cpuset utility. Details can be found here.

MPI fortran code on top of another one

I wrote an MPI Fortran program that I need to run multiple times (for clarity, let's call this program P1). The minimum number of cores that I can use to run a program is 512. The problem is that P1 scales best with 128 cores.
What I want to do is create another program (P2) on top of P1 that calls P1 four times simultaneously, each call running on 128 cores.
Basically, I need to run 4 instances of the call simultaneously, each with a number of processes equal to the total number of processors divided by 4.
Do you think this is possible? My problem is that I don't know where to look to do this.
I am currently looking at MPI groups and communicators; am I on the right path to reach my goal?
EDIT:
The system scheduler is LoadLeveler. When I submit a job I need to specify how many nodes I need. There are 16 cores per node and the minimum number of nodes I can use is 32. In the batch script we also specify -np NBCORES, but if we do so, e.g. -np 128, the consumed time will be counted as if we were using 512 cores (32 nodes), even if the job only ran on 128 cores.
I was able to do it thanks to your answers.
As I mentioned later (sorry for that), the scheduler is LoadLeveler.
If you have access to the sub-block module, follow this, as Hristo Iliev mentioned: http://www.hpc.cineca.it/content/batch-scheduler-loadleveler-0#sub-block
If you don't, you can do a multi-step job with no dependencies between the steps, so that they are executed simultaneously. It is a classic multi-step job; you just have to remove any #dependency flags (in the case of LoadLeveler).
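For completeness, if you instead go down the MPI-communicator route mentioned in the question, the core of it is a single MPI_Comm_split call. The sketch below is written against the C bindings, and run_p1 is a hypothetical entry point: P1 would have to be refactored so that its solver accepts a communicator instead of hard-coding MPI_COMM_WORLD:

#include <mpi.h>
#include <cstdio>

// Placeholder standing in for P1's solver. In reality P1 (Fortran) would have
// to be callable with an arbitrary communicator instead of MPI_COMM_WORLD.
static void run_p1(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0)
        printf("P1 instance started with %d ranks\n", size);
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Split the global ranks into 4 equally sized groups (512 -> 4 x 128).
    int color = world_rank / (world_size / 4);

    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);

    run_p1(sub);   // each group of ranks runs its own independent P1 instance

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}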

How do i find the Physical Socket ID / number a process is running on?

I would like to know if there is any way to find the actual physical processor / socket number that the current process is running on, or the mapping for the same given a logical processor number.
I have an 8-socket system, resulting in a total of 128 (0-127) logical processors.
From what I have read on MSDN, they would be divided into 2 processor groups of 64 logical processors each.
http://msdn.microsoft.com/en-us/library/dd405503
I have tried looking at CPUID and GetNumaProcessorNodeEx.
From CPUID, the APIC ID helps identify the logical processor ID, and from GetNumaProcessorNodeEx I get the NUMA node (which I found to be useful only if there are 64 or fewer logical processors).
Is it also possible to tell whether a logical processor is a hyper-thread?
I am trying to create a tool like this:
processor no - socket/core id/HT
processor 0 - 0/0/0
processor 1 - 0/1/1
processor 2 - 0/2/0
...
processor 8 - 1/0/0
processor 9 - 1/1/1
...
Any help or links to figure this out would be great.
Thank you
The socket/core/thread hierarchy is encoded in the bits of the APIC ID.
The lowest N bits are the thread ID within a core, the next M bits are the core ID within a socket, and the remaining high bits are the socket ID.
To find N and M, you need to use some CPUID leaves:
Leaf 1 gives you the number of logical processors (threads) per socket (in bits 23:16 of EBX).
Leaf 4 gives you the number of cores per socket (in bits 31:26 of EAX, plus one).
If, for example, you have 12 threads and 6 cores per socket, then there are 2 threads per core, so the lowest bit is the thread ID; 6 cores need 3 bits, so the next 3 bits are the core ID, and the rest are the socket ID.
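A hedged sketch of that decoding using the MSVC __cpuid/__cpuidex intrinsics, assuming an Intel CPU that reports its topology through the classic leaves 1 and 4 (on newer CPUs the extended topology leaf 0x0B is the more reliable source; the bitsNeeded helper is my own):

#include <intrin.h>
#include <cstdio>

// Smallest number of bits needed to address 'count' distinct IDs.
static unsigned bitsNeeded(unsigned count)
{
    unsigned bits = 0;
    while ((1u << bits) < count)
        ++bits;
    return bits;
}

int main()
{
    int regs[4];  // EAX, EBX, ECX, EDX

    __cpuid(regs, 1);                                        // leaf 1
    unsigned ebx = static_cast<unsigned>(regs[1]);
    unsigned apicId            = (ebx >> 24) & 0xFF;         // EBX[31:24]: initial APIC ID
    unsigned logicalPerPackage = (ebx >> 16) & 0xFF;         // EBX[23:16]: logical CPUs per package

    __cpuidex(regs, 4, 0);                                   // leaf 4, sub-leaf 0
    unsigned coresPerPackage = ((static_cast<unsigned>(regs[0]) >> 26) & 0x3F) + 1;  // EAX[31:26] + 1

    unsigned smtBits  = bitsNeeded(logicalPerPackage / coresPerPackage);
    unsigned coreBits = bitsNeeded(coresPerPackage);

    unsigned thread =  apicId & ((1u << smtBits) - 1);
    unsigned core   = (apicId >> smtBits) & ((1u << coreBits) - 1);
    unsigned socket =  apicId >> (smtBits + coreBits);

    printf("APIC ID %u -> socket %u / core %u / thread %u\n",
           apicId, socket, core, thread);
    return 0;
}

To build the full processor-number to socket/core/thread table, pin the current thread to each logical processor in turn (SetThreadAffinityMask, or SetThreadGroupAffinity when there are more than 64 logical processors) and repeat the query.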