Consider this code for setting thread affinity on a specific processor core:
pthread_attr_t attr;
cpu_set_t cpu;
CPU_ZERO(&cpu);
CPU_SET(CoreNumber, &cpu);
pthread_attr_init(&attr);
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpu);
pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
pthread_create(Thread, &attr, func, param);
My system has 4 physical cores, and each core has 2 logical cores. With this code, when my core count is 4, every thread runs on a separate physical core: for instance, thread 1 runs on core 0, thread 2 runs on core 2, etc.
I want to change the affinity so that two threads run on each physical core: for instance, threads 1 and 2 run on core 1's two logical cores, and threads 3 and 4 run on core 2's two logical cores.
Is that possible? How should I change the above code?
I am running an EMR Spark cluster with this setup:
Master: 1 of m5.xlarge
Core: 4 of m5.xlarge
spark.executor.instances 4
spark.executor.cores 4
spark.driver.memory 11171M
spark.executor.memory 10356M
spark.emr.maximizeResourceAllocation true
spark.emr.default.executor.memory 10356M
spark.emr.default.executor.cores 4
spark.emr.default.executor.instances 4
where m5.xlarge is an instance type with 4 vCPU cores and 16 GB of memory.
Since I am using Spark to migrate a database, the workload is very I/O intensive but not very CPU intensive. I notice each executor node only spawns 4 threads (seemingly 1 thread per vCPU core), while the CPU still has plenty of headroom.
Is there a way to force a higher thread allocation per executor node so that I can fully utilize my resources? Thanks.
One vCPU can hold only one thread.
If you have assigned 4 vCPUs to your executor, it will never spawn more than 4 threads.
For more detail:
Calculation of vCPU & Cores
First, we need to select a virtual server and CPU. For this example, we’ll select Intel Xeon E-2288G as the underlying CPU. Key stats for the Intel Xeon E-2288G include 8 cores / 16 threads with a 3.7GHz base clock and 5.0GHz turbo boost. There is 16MB of onboard cache.
(2 Threads per core x 8 Cores) x 1 CPU = 16 vCPU
reference
We have a very weird problem, and we managed to create a very small, somewhat reproducible (on some PCs) example:
main.cpp Dockerfile
This code does absolutely nothing useful; it only logs a few lines, but those few lines prove it does not work with more than 1 thread.
We use boost 1.70 (Asio 1.14.0)
I tested on 3 computers so far:
my desktop Ryzen 1950X 64 GB RAM mostly ends up with:
started for 1 threads
Listener started on thread: 140593064609536
started for 2 threads
started for 3 threads
started for 4 threads
started for 5 threads
started for 6 threads
started for 7 threads
(sometimes 2, or 3 also works, but mostly not)
On this machine I also tested a build with MSVC on Windows, and it ran fine, so the problem is somehow Linux + core-count specific.
main server 2x E5-2630 v3 128 GB RAM mostly ends up with:
root@cf8c892390ce:/app/test/bin# ./test
started for 1 threads
Listener started on thread: 140062574507776
started for 2 threads
started for 3 threads
started for 4 threads
started for 5 threads
started for 6 threads
started for 7 threads
(once in over 100 tests 2 also worked, but never more)
old test server 2x old Intel 2-core CPUs 4 GB RAM; most results look like this:
root@f06821a4cbc8:/app/test/bin# ./test
started for 1 threads
Listener started on thread: 140650646316800
started for 2 threads
Listener started on thread: 140650621138688
started for 3 threads
Listener started on thread: 140650246829824
started for 4 threads
Listener started on thread: 140650213259008
started for 5 threads
Listener started on thread: 140649944823552
started for 6 threads
Listener started on thread: 140649743496960
started for 7 threads
Listener started on thread: 140649726711552
(sometimes 5, 6, or 7 threads do not work)
We tested on a few other systems, and only 1 thread works reliably.
Can you please look at the code and tell me if we have some stupid mistake there, or if this is a bug in Asio?
And most importantly, can we fix it somehow?
You probably start the event loops (run() and related) before the actual work is posted.
This would allow the services to complete before your listener is started (a race condition), and that explains the symptoms.
The usual way to avoid this is to use a work<> object.
Having a look at your code now.
I have recently set up an instance (m4.4xlarge).
When I execute the 'lscpu' command, the output looks something like the following:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
CPU socket(s): 1
...
Does this mean that only 8 cores can be utilized?
If so, what are the rest of the CPUs for?
m4.4xlarge instances have 16 logical CPUs, so it looks like your EC2 instance is reporting it as having a single socket, with 1 physical CPU that has 8 cores. Each core can execute two threads simultaneously (Intel Hyperthreading technology) so each core is presented as 2 logical CPUs.
CPU(s): 16 <- logical CPUs (Threads per core * Cores per socket)
On-line CPU(s) list: 0-15
Thread(s) per core: 2 <- Each core has hyperthreading and presents itself as two logical CPUs
Core(s) per socket: 8 <- Instance sees it has 8-core physical CPU per socket
CPU socket(s): 1 <- Instance sees it has 1 physical CPU
I am trying to run a highly multi-threaded application and want to measure its performance with different numbers of cores (0, 1, 2, 3, 4, 5, 6 ... 12). I found taskset when I googled:
taskset 0x00000003 ./my_app
but when I look at Fedora's system monitor, it only shows one core at 100% and the others at 12%, 0%, etc.
Is there any way to tell the process to run on certain cores? I also heard of an option like -t <number of cores>, like
./my_app -t2
for cores 0 and 1, but this also has no effect.
What am I doing wrong? Can anyone please point me in the right direction?
taskset 0x00000003 ./my_app sets the affinity of the my_app process to CPUs 0 and 1 (the two lowest bits of the mask). If your application is multithreaded, the threads inherit the affinity, but their distribution between those two CPUs is not set.
To set the affinity of each thread within your process, you can either use taskset after the process is running (i.e. run my_app, examine the thread ids, and call taskset -pc <core> <tid> for each) or set the affinity at thread creation (with sched_setaffinity, pthread_setaffinity_np if you are using pthreads, etc.).
Whatever ./my_app -t2 does is specific to your application.
I am following the examples given here. While I am able to successfully create threads, these threads have default affinity to all the processors.
How do I set affinity? Can someone please provide an example of how I can use SetThreadAffinityMask with the examples given at the above link?
Ok, I'm going to assume you want affinity. The second parameter of SetThreadAffinityMask is a bit mask representing on which processors the thread is allowed to run. The bits are set to 1 on the corresponding processors. For example:
// binary 01, so it allows this thread to run on CPU 0
SetThreadAffinityMask(hThread, 0x01);
// binary 10, so it allows this thread to run on CPU 1
SetThreadAffinityMask(hThread, 0x02);
// binary 11, so it allows this thread to run on CPU 0 or CPU 1
SetThreadAffinityMask(hThread, 0x03);