OS task scheduling emulator [closed] - c++

I'm trying to find a C or C++ library which can work with tasks (or threads) in a preemptive way. I need a mechanism that can manage tasks one by one, like in an RTOS: creating tasks (a function as the entry point to a task), time slicing, switching, etc.
Is it possible to write it in user space?

The simplest solution is perhaps to run a real RTOS in a virtual machine or processor emulator. Any RTOS with an x86 port might be persuaded to run in a PC VM, but you could also use QEMU.
For example, you can run RTEMS on QEMU, and QEMU can also emulate ARM targets - though that may not matter, and the i386 emulation may suit your needs and will be faster.

RTOS scheduling/dispatching needs hardware interrupts to handle threads efficiently and to communicate with peripheral hardware (keyboard, mouse, disk, NIC, timer, etc.). Standard C has no means of handling interrupts, so you cannot do it.
If you have memory-management hardware that defines separate user and kernel memory access rights, then no - a hardware interrupt changes the processor state in hardware, so you will leave user space whether you want to or not.
You should also be aware that preemptive schedulers are not primarily designed to switch between CPU-hungry tasks on a timer interrupt - they are designed first to provide efficient, high-performance I/O by taking the CPU away from tasks that don't need it because their I/O requests cannot be met immediately.
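That said, the context-switching mechanics themselves can be exercised in user space with POSIX facilities (not standard C). Below is a minimal cooperative round-robin sketch using ucontext; it is illustrative only - real preemption would additionally need a timer signal (e.g. SIGALRM via setitimer), and calling swapcontext from a signal handler is not guaranteed by POSIX.

    // Minimal cooperative round-robin task switcher using POSIX ucontext.
    // Illustrative only: tasks yield voluntarily; a preemptive variant would
    // need a timer signal, which this sketch deliberately avoids.
    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t main_ctx, task_ctx[2];
    static char stacks[2][64 * 1024];                 // one private stack per task

    static void task_body(int id)
    {
        for (int step = 0; step < 3; ++step) {
            std::printf("task %d, step %d\n", id, step);
            swapcontext(&task_ctx[id], &main_ctx);    // yield back to the scheduler
        }
    }

    int main()
    {
        for (int id = 0; id < 2; ++id) {
            getcontext(&task_ctx[id]);
            task_ctx[id].uc_stack.ss_sp   = stacks[id];
            task_ctx[id].uc_stack.ss_size = sizeof stacks[id];
            task_ctx[id].uc_link          = &main_ctx;        // where to go when the task returns
            makecontext(&task_ctx[id], (void (*)())task_body, 1, id);
        }
        for (int slice = 0; slice < 3; ++slice)               // crude round-robin "time slices"
            for (int id = 0; id < 2; ++id)
                swapcontext(&main_ctx, &task_ctx[id]);        // dispatch the next task
        return 0;
    }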


How to identify the bottlenecks preventing my program from scaling well on a 32-core CPU? [closed]

I have written a program that works just fine. I now want to run 32 independent instances of it, in parallel, on our 32-core machine (AMD Threadripper 2990WX, 128GB DDR4 RAM, Ubuntu 18.04). However, the performance gains are almost nil after about 12 processes running concurrently on the same machine. I now need to optimize this. Here is a plot of the average speedup:
I want to identify the source of this scaling bottleneck.
I would like to know the available techniques for seeing, in my code, whether there are any "hot" parts that prevent 32 processes from yielding significant gains compared to 12.
My guess is that it has to do with memory access and the NUMA architecture. I tried experimenting with numactl and assigning a core to each process, without noticeable improvement.
Each instance of the application uses at most about 1GB of memory. It is written in C++, and there is no "parallel code" (no threads, no mutexes, no atomic operations), each instance is totally independent, there is no interprocess communication (I just start them with nohup, through a bash script). The core of this application is an agent-based simulation: a lot of objects are progressively created, interact with each other and are regularly updated, which is probably not very cache friendly.
I have tried to use Linux perf, but I am not sure what I should look for; also, the mem module of perf doesn't work on AMD CPUs.
I have also tried using AMD uProf, but again I am not sure where this system-wide bottleneck would appear.
Any help would be greatly appreciated.
The problem may be the Threadripper architecture. It is a 32-core CPU, but those cores are distributed among 4 NUMA nodes, with half of them not directly connected to the memory. So you may need to
set processor affinity for all your processes to ensure that they never jump between cores (see the sketch after this list)
ensure that processes running on the normal NUMA nodes only access memory directly attached to that node
put less load on the cores situated on the crippled NUMA nodes
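As an illustration of the first point, a process can pin itself to a single core on Linux with sched_setaffinity() (numactl --physcpubind / --membind gives similar control, plus memory binding, from a launch script). This is only a sketch, assuming Linux/glibc; the core index taken from the command line is purely illustrative.

    // Sketch (Linux/glibc assumed): pin the calling process to a single core so
    // the scheduler never migrates it. The core index comes from argv and is
    // purely illustrative.
    #include <sched.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv)
    {
        int core = (argc > 1) ? std::atoi(argv[1]) : 0;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);

        // pid 0 means "the calling process"; from now on it only runs on 'core'.
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        std::printf("pinned to core %d\n", core);
        // ... run the simulation workload here ...
        return 0;
    }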

How to decide whether to use hyperthreading or not? [closed]

Hyperthreading can hurt the performance of some applications, and in those cases it should not be used. From the Microsoft website (https://msdn.microsoft.com/en-us/library/cc615012%28BTS.10%29.aspx):
It is critical hyper-threading be turned off for BizTalk Server computers. Hyper-threading makes the server appear to have more processors/processor cores than it actually does; however hyper-threaded processors typically provide between 20 and 30% of the performance of a physical processor/processor core. When BizTalk Server counts the number of processors to adjust its self-tuning algorithms, the hyper-threaded processors cause these adjustments to be skewed, which is detrimental to overall performance.
The Process Lasso program allows disabling hyperthreading for some processes:
You can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to logical cores. We call this feature HyperThreaded Core Avoidance.
I've got some older programs which perform a lot of mathematical computations. It is frustrating to see them use one core when they could use 4. I want to rewrite them to use many threads. They use large contiguous memory blocks, so the number of cache misses is minimal. My questions are the following:
How to decide whether to use hyperthreading or not in your application? (general guidance, with some technical details if necessary)
Does it come down to performing experiments to make the final decision?
How to avoid hyperthreading in your application if it is not advantageous? (examples in C++ and C)
I don't know how Process Lasso works with respect to "disabling HyperThreading". To do what that app claims, the best you can do is inject a DLL into every process in the system, call SetProcessAffinityMask with something that only amounts to a guess, and disable every other core, in the hope that the OS will avoid scheduling onto the hyperthreaded logical cores.
Guesses and hopes - there's nothing in the Windows API that will do this for certain. This answers your third bullet point.
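A rough sketch of that guess on Windows, assuming sibling hyperthreads are numbered adjacently (0/1, 2/3, ...) - an assumption that is common but not guaranteed; GetLogicalProcessorInformation would be needed to map siblings reliably.

    // Sketch: restrict the current process to every other logical core with
    // SetProcessAffinityMask, hoping that odd-numbered cores are the HT siblings.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        DWORD_PTR processMask = 0, systemMask = 0;
        GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);

        DWORD_PTR everyOther = 0;
        for (int bit = 0; bit < static_cast<int>(8 * sizeof(DWORD_PTR)); bit += 2)
            everyOther |= static_cast<DWORD_PTR>(1) << bit;     // keep cores 0, 2, 4, ...

        const DWORD_PTR newMask = systemMask & everyOther;
        if (SetProcessAffinityMask(GetCurrentProcess(), newMask))
            std::printf("affinity mask set to 0x%llx\n", static_cast<unsigned long long>(newMask));
        else
            std::printf("SetProcessAffinityMask failed (%lu)\n", GetLastError());
        return 0;
    }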
You can disable HyperThreading at the BIOS level (usually).
I can't comment on the Microsoft advice to disable HT for BizTalk (your linked article), since I can't find a date for that article. The only interesting bit was about "Assigning interrupt affinity to logical processors...", which was new to me. The only other advice in that article regarding HT is rather weak.
On a larger note: I don't know why you're asking about HyperThreading, when you should be concerned with multithreading in general. If you're concerned about multiple threads contending for the same shared resource... then don't use threads in your app.
A humorous aside: the same company also sells a product called SmartTrim, reminiscent of the RAM-doublers that were popular in the '90s.
Basically, it comes down to configuring the number of concurrent threads executing CPU workloads. The OS is aware of hyperthreading and will assign threads to physical cores until it runs out; only when there are more threads than physical cores will it start assigning work to logical cores.
To decide whether the optimal number of threads is the number of physical or logical cores, measuring performance of your real tasks is the best approach. Synthetic benchmarks can teach you something about how hyperthreading works, but won't tell you what is best for your particular mix of instructions.
The exact way to control the number of threads depends on the multithreading construct you use - if you create threads yourself, it is obvious, but thread pools and automated parallelism frameworks such as OpenMP also provide ways to tune the thread count.
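As a sketch of that measurement approach: time the same CPU-bound job once with the physical core count and once with the logical core count and compare. std::thread::hardware_concurrency() reports logical cores; the 2-way-SMT assumption used below to derive the physical count is a stand-in for a real platform query (GetLogicalProcessorInformation, /proc/cpuinfo, hwloc, ...).

    // Sketch: run the same CPU-bound work with the "physical" and the "logical"
    // thread count and compare wall-clock time. The 2-way-SMT assumption below
    // is a stand-in for a real platform query of the physical core count.
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static void cpu_work(std::size_t iters)
    {
        volatile double x = 1.0;
        for (std::size_t i = 0; i < iters; ++i)
            x = x * 1.0000001 + 0.5;                  // pure arithmetic, no memory traffic
    }

    static double seconds_with_threads(unsigned n_threads, std::size_t iters)
    {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n_threads; ++i)
            pool.emplace_back(cpu_work, iters);
        for (auto& t : pool)
            t.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main()
    {
        unsigned logical = std::thread::hardware_concurrency();
        if (logical < 2) logical = 2;                 // hardware_concurrency() may return 0
        const unsigned physical = logical / 2;        // assumption: 2-way SMT
        const std::size_t iters = 200000000;

        std::printf("%u threads (physical cores): %.2f s\n", physical, seconds_with_threads(physical, iters));
        std::printf("%u threads (logical cores):  %.2f s\n", logical, seconds_with_threads(logical, iters));
        return 0;
    }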

How is a process able to acquire many CPU cycles? [closed]

I noticed that VLC media player at times acquired up to 98% of the CPU when performing a file conversion from MP4 to MP3. My understanding is that the OS tries to balance the time each process gets, so this caught my attention. I have a feeling that programs like disk defragmenters and antivirus software may also require processor cycles on such a magnitude. How is this achieved in code (C, C++)?
It depends on the OS, but "the OS tries to balance the time each process gets" is usually not the prime objective.
A smart scheduler will instead utilise the available CPU(s) while still being responsive to higher-priority things like user input and hardware events. A nicely behaved thread will also give up its time slice before its CPU quota expires if there is no more work to do (e.g. blocking on an event); otherwise, when the quota expires, the scheduler may take over the CPU (preempt) and give another thread a chance to execute.
You may set the thread priority as a hint to the scheduler; that may affect when it is preempted, but it all depends on the scheduler and OS internals.
Simply put, you don't need to do anything special to utilise a CPU core: if you have an intensive calculation, the OS will give most of the CPU to you.
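To illustrate that last point, here is a sketch of a purely CPU-bound program: with no scheduler-related API calls at all, it shows up at roughly 100% of one core per thread in top or Task Manager, simply because the scheduler keeps handing it time slices whenever nothing of higher priority is runnable. The thread count and the 10-second duration are arbitrary.

    // Sketch: saturate every logical core with plain arithmetic for ~10 seconds;
    // no scheduler-related API is needed, the OS gives the CPU to whoever wants it.
    #include <chrono>
    #include <thread>
    #include <vector>

    int main()
    {
        unsigned n = std::thread::hardware_concurrency();   // one busy thread per logical core
        if (n == 0) n = 1;                                   // hardware_concurrency() may return 0
        const auto stop = std::chrono::steady_clock::now() + std::chrono::seconds(10);

        std::vector<std::thread> busy;
        for (unsigned i = 0; i < n; ++i)
            busy.emplace_back([stop] {
                volatile unsigned long long counter = 0;
                while (std::chrono::steady_clock::now() < stop)
                    counter = counter + 1;                   // pure CPU work, never blocks
            });
        for (auto& t : busy)
            t.join();
        return 0;
    }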

How to get the current Windows system-wide timer resolution [closed]

I know that the default is 15.6 ms per tick, but some loser may change it and then change it back and forth again and again, and I need to poll what the current value is in order to perform valid QueryPerformanceCounter synchronization.
So is there an API way to get the timer resolution?
I'm on C++ BTW.
Windows timer resolution is provided by the hidden API call:
NTSTATUS NtQueryTimerResolution(OUT PULONG MinimumResolution,
                                OUT PULONG MaximumResolution,
                                OUT PULONG ActualResolution);
NtQueryTimerResolution is exported by the native Windows NT library NTDLL.DLL.
Common hardware platforms report 156,250 or 100,144 for ActualResolution (the values are in 100 ns units); older platforms may report even larger numbers; newer systems, particularly when HPET (High Precision Event Timer) or constant/invariant TSC is supported, may return 156,001 for ActualResolution.
Calls to timeBeginPeriod(n) are reflected in ActualResolution.
More details in this answer.
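For reference, a sketch of calling it from C++ by resolving the export from ntdll.dll at run time (the reported values are in 100 ns units):

    // Sketch (Windows): resolve NtQueryTimerResolution from ntdll.dll at run time
    // and print the current resolution; 156250 (in 100 ns units) is the default
    // 15.625 ms tick.
    #include <windows.h>
    #include <cstdio>

    typedef LONG (NTAPI *NtQueryTimerResolution_t)(PULONG MinimumResolution,
                                                   PULONG MaximumResolution,
                                                   PULONG ActualResolution);

    int main()
    {
        HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");           // ntdll is always loaded
        auto query = reinterpret_cast<NtQueryTimerResolution_t>(
            GetProcAddress(ntdll, "NtQueryTimerResolution"));
        if (!query) {
            std::puts("NtQueryTimerResolution not found");
            return 1;
        }

        ULONG minRes = 0, maxRes = 0, actualRes = 0;
        if (query(&minRes, &maxRes, &actualRes) == 0) {           // 0 == STATUS_SUCCESS
            std::printf("min=%lu max=%lu actual=%lu (100 ns units)\n", minRes, maxRes, actualRes);
            std::printf("current tick = %.4f ms\n", actualRes / 10000.0);
        }
        return 0;
    }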
This won't be helpful; another process can change it while you are calibrating.
This falls in the "if you can't beat them, join them" category. Call timeBeginPeriod(1) before you start calibrating. This ensures that you have a known rate that nobody can change. Getting the improved timer accuracy surely doesn't hurt either.
Do note that it is pretty unlikely that you can do better than QueryPerformanceFrequency(). Unless you calibrate for a very long time, the clock rate just isn't high enough to give you extra accuracy since you can never measure better than +/- 0.5 msec. And the timer event isn't delivered with millisecond accuracy, it can be arbitrarily delayed. If you calibrate over long periods then GetTickCount64() is plenty good enough.
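A sketch of that approach (the Sleep stands in for the actual calibration code; link against winmm.lib):

    // Sketch: force a known 1 ms tick for the duration of the calibration, then
    // restore it; timeBeginPeriod/timeEndPeriod calls must always be paired.
    #include <windows.h>
    #include <cstdio>
    #pragma comment(lib, "winmm.lib")                 // MSVC; otherwise link winmm explicitly

    int main()
    {
        if (timeBeginPeriod(1) != TIMERR_NOERROR) {   // request a 1 ms timer tick
            std::puts("1 ms timer resolution not available");
            return 1;
        }

        Sleep(50);   // placeholder for the calibration against QueryPerformanceCounter

        timeEndPeriod(1);                             // restore the previous resolution
        return 0;
    }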
The RDTSC instruction may be used to read the CPU time-stamp counter.
In most cases (if not all), this counter will change at the CPU clock rate.
If you want to be picky, you can also use an instruction like CPUID to serialize instructions.
Refer to the Intel manuals for more details.
You can use RDTSC alongside APIs like QueryPerformanceCounter, et al.
In other words, use RDTSC before and after a call to make measurements.
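For instance, a sketch assuming an x86/x64 CPU with an invariant TSC and the __rdtsc compiler intrinsic:

    // Sketch (MSVC assumed, <intrin.h> for __rdtsc): measure the same interval
    // with RDTSC and with QueryPerformanceCounter and compare.
    #include <windows.h>
    #include <intrin.h>
    #include <cstdio>

    int main()
    {
        LARGE_INTEGER freq, qpc0, qpc1;
        QueryPerformanceFrequency(&freq);

        unsigned long long tsc0 = __rdtsc();
        QueryPerformanceCounter(&qpc0);

        Sleep(100);                                   // the interval being measured

        unsigned long long tsc1 = __rdtsc();
        QueryPerformanceCounter(&qpc1);

        double qpc_ms = 1000.0 * (qpc1.QuadPart - qpc0.QuadPart) / freq.QuadPart;
        std::printf("QPC: %.3f ms, TSC delta: %llu cycles\n", qpc_ms, tsc1 - tsc0);
        return 0;
    }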
The WinAPI function GetSystemTimeAdjustment.
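A quick sketch; its lpTimeIncrement output is the periodic time-adjustment interval (the clock tick), in 100 ns units:

    // Sketch: GetSystemTimeAdjustment reports the periodic time-adjustment
    // interval (the clock tick) in lpTimeIncrement, in 100 ns units.
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        DWORD adjustment = 0, increment = 0;
        BOOL  disabled   = FALSE;
        if (GetSystemTimeAdjustment(&adjustment, &increment, &disabled))
            std::printf("tick increment = %lu (100 ns units) = %.4f ms\n",
                        increment, increment / 10000.0);
        return 0;
    }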

System not reaching 100% CPU, how to troubleshoot [closed]

I have an application (basically a C++ application) which has the below properties:
Multi-threaded
Each thread has its own thread attributes (like stack size, etc.).
Multi-process (i.e. it will run multiple processes).
Runs on an 8-core processor.
Uses shared memory/IPC/extensive heap management (allocation/deallocation), system sleep, etc.
So now, I am supposed to find the system CAPS at max CPU. The ideal way is to load the system to 100% CPU and then check the CAPS (successful calls) the system supports.
I know that in complex systems, CPU will be "lost" to context switches, page swaps, I/O, etc.
But my system is only able to reach 95% CPU at most (no more than that, irrespective of the load). So the idea here is to find out the points that are really contributing to this "CPU eating" and then see if I can engineer them to reduce/eliminate the unused CPU.
Question
How do we find out what I/O, context switching, etc. is causing the unreachable 5% of CPU? Is there any tool for this? I am aware of OProfile/Quantify and vmstat reports, but none of them give this information.
There may be some operations which I am not aware of that restrict the max CPU utilization. Any link/document that helps me understand the detailed set of operations that can reduce CPU usage would be very helpful.
Edit 1:
Added some more information
a. The OS in question is SUSE 10 Linux server.
b. CAPS - the average calls your system can run per second. It is basically a telecommunications term, but it can be considered generic: assume your application provides a protocol implementation - how many protocol calls can you make per second?
"100% CPU" is a convenient engineering concept, not a mathematical absolute. There's no objective definition of what it means. For instance, time spent waiting on DRAM is often counted as CPU time, but time spent waiting on Flash is counted as I/O time. With my hardware hat on, I'd say that both Flash and DRAM are solid-state cell-organized memories, and could be treated the same.
So, in this case, your system is running at "100% CPU" for engineering purposes. The load is CPU-limited, and you can measure the Calls Per Second in this state.
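On the measurement side (the system described is Linux), the aggregate counters behind vmstat and mpstat can also be read directly from /proc/stat; sampling them twice and diffing the values shows how the remaining ~5% splits between iowait, irq, softirq and steal, along with the total context-switch count. A minimal single-sample sketch:

    // Sketch (Linux): read the aggregate "cpu" counters and the context-switch
    // count from /proc/stat. Diffing two samples taken a few seconds apart gives
    // the user/system/idle/iowait/irq/softirq breakdown over that interval.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    int main()
    {
        std::ifstream stat("/proc/stat");
        std::string line;
        while (std::getline(stat, line)) {
            if (line.rfind("cpu ", 0) == 0) {          // aggregate CPU counters, in jiffies
                std::istringstream in(line);
                std::string tag;
                unsigned long long usr, nice, sys, idle, iowait, irq, softirq;
                in >> tag >> usr >> nice >> sys >> idle >> iowait >> irq >> softirq;
                std::cout << "user=" << usr << " system=" << sys
                          << " idle=" << idle << " iowait=" << iowait
                          << " irq=" << irq << " softirq=" << softirq << '\n';
            } else if (line.rfind("ctxt ", 0) == 0) {  // total context switches since boot
                std::cout << line << '\n';
            }
        }
        return 0;
    }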