Right now I basically have a program that uses clock to test the amount of time my program takes to do certain operations and usually it is accurate to a couple milliseconds. My question is this: If the CPU is under heavy load will I still get the same results?
Does clock only count when the CPU is working on my process?
Let's assume a multi-core CPU, but a process that does not take advantage of multithreading.
What clock measures depends on the OS. On Windows, because of a decision made long ago, clock gives the elapsed time; on most other OSes (certainly Linux, macOS and other Unix-related OSes) it gives the CPU time used by your process.
Depending on what you actually want to achieve, elapsed time or CPU time may be what you want to measure.
In a system where there are other processes running, the difference between elapsed time and CPU usage may be huge. And of course, if your CPU is NOT busy running your application (e.g. it is waiting for network packets to go down the wire or for file data from the hard disk), then that elapsed time is "available" for other applications.
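To make the elapsed-time versus CPU-time distinction concrete, here is a minimal sketch that records both around the same piece of work. The names Timing, measure and sleep_briefly are made up for this illustration, and the comments assume a POSIX system where std::clock() reports CPU time.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical helper type: CPU time and elapsed (wall-clock) time, in seconds.
struct Timing {
    double cpu_seconds;
    double wall_seconds;
};

// Runs `work` once and reports both measurements.
// std::clock() reports CPU time on POSIX systems; on Windows it reports
// elapsed time, so the two fields would be roughly equal there.
template <typename F>
Timing measure(F&& work) {
    const std::clock_t cpu_begin = std::clock();
    const auto wall_begin = std::chrono::steady_clock::now();

    work();

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();
    // clock() returns -1 when processor time is not available.
    assert(cpu_begin != static_cast<std::clock_t>(-1));

    return Timing{
        static_cast<double>(cpu_end - cpu_begin) / CLOCKS_PER_SEC,
        std::chrono::duration<double>(wall_end - wall_begin).count()};
}

// A workload that waits instead of computing: it uses almost no CPU time.
inline void sleep_briefly() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

Calling measure(sleep_briefly) on Linux should give a wall time of at least 50 ms and a CPU time close to zero, which is exactly the gap described above.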
There are also a huge number of error factors/interference factors when there are other processes running in the same system:
If we assume that your OS supports clock as a measure of CPU-time, the precision here is not always that great - for example, it may well be accounted in terms of CPU-timer ticks, and your process may not run for the "full tick" if it's doing I/O for example.
Other processes may use "your" CPU for parts of the interrupt handling, before the OS has switched to "account for this as interrupt time", when dealing with packets over the network or hard-disk I/O for some percentage of the time [typically not huge amounts, but in a very busy system, it can be several percent of the total time]. If other processes run on "your" CPU, the time to reload the cache(s) with "your" process' data after the other process loaded its data will be accounted on "your time". This sort of "interference" may very well affect your measurements; how much depends very much on "what else" is going on in the system.
If your process shares data [via shared memory] with another process, there will also be (again, typically a minute amount, but in extreme cases, it can be significant) some time spent on dealing with "cache-snoop requests" between your process and the other process, when your process doesn't get to execute.
If the OS is switching tasks, "half" of the time spent switching to/from your task will be accounted in your process, and half in the other process being switched in/out. Again, this is usually tiny amounts, but if you have a very busy system with lots of process switches, it can add up.
Some processor types, e.g. Intel's HyperThreading also share resources with your actual core, so only SOME of the time on that core is spent in your process, and the cache-content of your process is now shared with some other process' data and instructions - meaning your process MAY get "evicted" from the cache by the other thread running on the same CPU-core.
Likewise, multicore CPUs often have a shared L3 cache that gets affected by other processes running on the other cores of the CPU.
File-caching and other "system caches" will also be affected by other processes - so if your process is reading some file(s), and other processes also access file(s), the cache-content will be "less yours" than if the system wasn't so busy.
For accurate measurements of how much your process uses of the system resources, you need processor performance counters (and a reproducible test-case, because you probably need to run the same setup several times to ensure that you get the "right" combination of performance counters). Of course, most of these counters are ALSO system-wide, and some kinds of processing, for example in interrupts, and other random interference will affect the measurement, so the most accurate results will be if you DON'T have many other (busy) processes running in the system.
Of course, in MANY cases, just measuring the overall time of your application is perfectly adequate. Again, as long as you have a reproducible test-case that gives the same (or at least similar) timing each time it's run in a particular scenario.
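One way to check whether a test-case really is reproducible is to time it several times and look at the spread. The sketch below is only an illustration: time_runs and median are hypothetical helper names, and the choice of the median as the summary is my assumption, not something the measurement itself requires.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times `work` `runs` times and returns each elapsed time in microseconds.
template <typename F>
std::vector<double> time_runs(F&& work, int runs) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto begin = std::chrono::steady_clock::now();
        work();
        const auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(end - begin).count());
    }
    return samples;
}

// Median of the samples; less sensitive to a few slow outliers than the mean.
inline double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
               ? samples[mid]
               : (samples[mid - 1] + samples[mid]) / 2.0;
}
```

If the median is stable from one invocation of the program to the next while the maximum jumps around, the slow samples are more likely interference from the rest of the system than a property of your code.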
Each application is different, each system is different. Performance measurement is a HUGE subject, and it's very hard to cover EVERYTHING - and of course, we're not here to answer extremely specific questions about "how do I get my PI-with-a-million-decimals to run faster when there are other processes running in the same system" or whatever it may be.
In addition to agreeing with the responses indicating that timings depend on many factors, I would like to bring to your attention the std::chrono library available since C++11:
#include <chrono>
#include <iostream>

int main() {
    auto beg = std::chrono::high_resolution_clock::now();
    std::cout << "*** Displaying Some Stuff ***" << std::endl;
    auto end = std::chrono::high_resolution_clock::now();
    auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - beg);
    std::cout << "Elapsed: " << dur.count() << " microseconds" << std::endl;
}
As per the standard, high_resolution_clock is the clock with the shortest tick period your implementation provides (it may be an alias for system_clock or steady_clock), and the duration_cast above reports the elapsed time in microseconds (other units are available; see the docs).
Sample run:
$ g++ example.cpp -std=c++14 -Wall -Wextra -O3
$ ./a.out
*** Displaying Some Stuff ***
Elapsed: 29 microseconds
While it is much more verbose than relying on the C-style std::clock(), I feel it gives you much more expressiveness, and you can hide the verbosity behind a nice interface (for example, see my answer to a previous post where I use std::chrono to build a function timer).
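As an illustration of hiding the verbosity behind a small interface, here is one possible function timer. It is a sketch and not the timer from the linked answer; time_call is a hypothetical name, and using steady_clock instead of high_resolution_clock is my own choice because steady_clock is guaranteed not to go backwards.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <utility>

// Calls `fn(args...)` once and returns the elapsed time in microseconds.
// steady_clock is an assumption of this sketch, chosen because it is
// monotonic; the example program above uses high_resolution_clock.
template <typename Fn, typename... Args>
std::chrono::microseconds time_call(Fn&& fn, Args&&... args) {
    const auto begin = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)(std::forward<Args>(args)...);
    const auto end = std::chrono::steady_clock::now();
    assert(end >= begin);  // holds for any monotonic clock
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin);
}
```

The call site then shrinks to something like time_call(process_image, img).count(), where process_image and img stand for whatever function and argument you want to time.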
There are shared components in a CPU, like the last-level cache and the execution units (shared between hardware threads within one core), so under heavy load you will get jitter: even if your application executes exactly the same number of instructions, each instruction may take more cycles (waiting for memory because data was evicted from cache, or waiting for an available execution unit), and more cycles means more time to execute (assuming that Turbo Boost won't compensate).
If you want a precise instrument, look at hardware counters.
It is also important to consider factors like the number of cores available on the physical CPU, hyper-threading and other BIOS settings like Turbo Boost on Intel CPUs, and threading techniques used when coding when looking at timing metrics for CPU intensive tasks.
Parallelization tools like OpenMP provide built-in functions for measuring wall time, such as omp_get_wtime(), which are often more useful than clock() in programs that make use of this type of parallelization, because where clock() reports CPU time it adds up the time of all threads.
I have a set of C++ functions which does some image processing related operation. Generally I see that the final output is delivered in 5-6ms time range. I am measuring the time taken using QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the performance spikes up to 20ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused due to some delay in this code or whether some other task started running inside the CPU because of which this operation took time. I have tried using GetThreadTimes API to see how much time my thread spent inside CPU but am unable to conclude based on those numbers. What is the standard way to go about troubleshooting these types of issues?
The reason behind sudden spikes during processing could be any of I/O, interrupts, scheduled processes, etc.
It is very common to see such spikes with operations that have such low latency/processing time. IMO you can attribute them to any of the above-mentioned reasons (there could be more). The simplest solution is to run the same experiment with more inputs multiple times and take the average for final consideration.
To answer your question about checking/confirming the source of the spike, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check whether any resource is choking (% utilization is the simplest thing to check; the SAR/NMON utilities on Linux do this with minimal overhead)
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program, so that no OS task will run on them. taskset is the simplest utility to try out. More details are here.
Run the experiment with this setting and check the behavior.
That's a nasty thing you are trying to figure out. I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small, I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check if "some other task started running inside the CPU" would be to run your program once and mark the images that produce that spike. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
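The comparison between runs is easy to automate. A minimal sketch, assuming you already have one processing time per image for each run; spike_indices, repeated_spikes and the 10 ms threshold in the usage note are made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Indices of the images whose processing time exceeded `threshold_ms`.
inline std::set<std::size_t> spike_indices(const std::vector<double>& times_ms,
                                           double threshold_ms) {
    assert(threshold_ms > 0.0);
    std::set<std::size_t> spikes;
    for (std::size_t i = 0; i < times_ms.size(); ++i) {
        if (times_ms[i] > threshold_ms) spikes.insert(i);
    }
    return spikes;
}

// Indices that spiked in both runs. Images listed here are slow because of
// the input or the code; spikes that move between runs point to outside
// interference.
inline std::set<std::size_t> repeated_spikes(const std::set<std::size_t>& run_a,
                                             const std::set<std::size_t>& run_b) {
    std::set<std::size_t> both;
    for (std::size_t index : run_a) {
        if (run_b.count(index) != 0) both.insert(index);
    }
    return both;
}
```

With a threshold of 10 ms, two runs whose spikes land on images {1, 3} and {1, 2} leave only image 1 as a suspect; the other two spikes moved, so they were probably caused by something outside your code.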
What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which guarantee those kinds of delays. They are a totally different class of operating systems than Windows or Linux.
But still, there is something you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk -- there are no guarantees whatsoever about delays. So, avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, use tools like strace on Linux or Dr. Memory on Windows to trace system calls.
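For the "pass data via a shared buffer" part, one common shape is a single-producer/single-consumer ring buffer, sketched below. This is only one possible design under my own assumptions (fixed capacity, exactly one I/O thread and one time-critical thread); SpscQueue is a hypothetical name, not an existing library class.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer / single-consumer queue: exactly one thread may call
// push() and exactly one other thread may call pop(). Neither call blocks
// or enters the kernel.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always kept empty");

public:
    // Producer side. Returns false when the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        assert(tail < Capacity);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = value;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

The I/O thread does the blocking reads and calls push(); the time-critical thread only ever calls pop(), so it never waits on the disk or on a lock.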
2. Avoid context switches
Multithreading on Windows is preemptive. This means there is a system scheduler, which might stop your thread at any time and schedule another thread on your CPU. As noted above, there are RTOSes that let you avoid such context switches, but there is something you can do about it here too:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints the system scheduler to avoid scheduling other threads on this CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread to CPU 1+;
increase your thread priority, so the scheduler is less likely to switch your thread with another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm there are context switches.
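On Linux, the binding step from the list above might look roughly like the sketch below. It assumes glibc and g++ (which defines _GNU_SOURCE, needed for the CPU_* macros); first_allowed_cpu and pin_current_thread_to_cpu are hypothetical helper names.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity (Linux, glibc)

// First CPU the calling thread is currently allowed to run on, or -1.
inline int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
inline bool pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t target;
    CPU_ZERO(&target);
    CPU_SET(cpu, &target);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(target), &target) == 0;
}
```

Note that this only restricts where your own thread runs. Other threads may still be scheduled on the same CPU unless you also keep them away from it.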
3. Avoid other non-deterministic features
A few more sources of unexpected delays:
disable swap, so you know for sure your thread's memory will not be swapped out to a slow and unpredictable disk drive;
disable CPU turbo boost -- after a high-performance CPU boost there is always a slowdown, so that the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view those are independent CPUs, but in fact the performance of each hyper-thread CPU depends on what the other thread is doing at the moment.
Hope this helps.
I encountered a problem where my game loop stuttered approximately once a second (at variable intervals). A single frame then takes over 60ms whereas all others require less than 1ms.
After simplifying a lot I ended with the following program which reproduces the bug. It only measures the frame time and reports it.
#include <iostream>
#include "windows.h"

int main()
{
    unsigned long long frequency, tic, toc;
    QueryPerformanceFrequency((LARGE_INTEGER*)&frequency);
    QueryPerformanceCounter((LARGE_INTEGER*)&tic);
    double deltaTime = 0.0;
    while( true )
    {
        //if(deltaTime > 0.01)
        std::cerr << deltaTime << std::endl;
        QueryPerformanceCounter((LARGE_INTEGER*)&toc);
        deltaTime = (toc - tic) / double(frequency);
        tic = toc;
        if(deltaTime < 0.01) deltaTime = 0.01;
    }
}
Again, one frame in many is much slower than the others. Adding the if makes the error vanish (cerr is never called then). My original problem didn't contain any cerr/cout. However, I consider this a reproduction of the same error.
cerr is flushed in every iteration, so this is not what happens to create single slow frames. I know from a profiler (Very Sleepy) that the stream internally uses a lock/critical section, but this shouldn't change anything because the program is single-threaded.
What causes single iterations to stall that much?
Edit: I did some more tests:
Adding std::this_thread::sleep_for( std::chrono::milliseconds(7) ); and therefore reducing the process CPU utilization does not change anything.
With printf("%f\n", deltaTime); the problem vanishes (maybe because it doesn't use a mutex and memory allocation in contrast to the stream)
The design of windows does not guarantee an upper limit on any execution time, since it dynamically allocates runtime resources to all programs using some logic - for example, the scheduler will allocate resources to a process with high priority, and starve out lower priority processes in some circumstances. Programs are statistically more likely to - eventually - be affected by such things if they run tight loops and consume a lot of CPU resources. Because - again eventually - the scheduler will temporarily boost the priority of programs that are being starved and/or reduce the priority of programs that are starving others (in your case, by running a tight loop).
Making the output to std::cerr conditional doesn't change the fact of this happening - it just changes the likelihood that it will happen in a specified time interval, because it changes how the program uses system resources in the loop, and therefore changes how it interacts with the system scheduler, policies, etc.
This sort of thing affects programs running in all non-realtime operating systems, although the precise impact depends on how each OS is implemented (e.g. scheduling strategies, other policies controlling access by programs to resources, etc). There is always a non-zero probability (even if it is small) of such stalls occurring.
If you want absolute guarantees of no stalls on such things, you will need a realtime operating system. These systems are designed to do things more predictably in a timing sense, but that comes with trade-offs, since it also requires your programs to be designed with the knowledge that they MUST complete execution of specified functions within specified time intervals. Realtime operating systems use different strategies, but their enforcement of timing constraints can cause the program to malfunction if the program is not designed with such things in mind.
I'm not sure about it, but it could be that the system is interrupting your main thread to let others run, and since that takes some time (I remember that on my Windows XP PC the quantum was 10ms), it will stall a frame.
This is very visible because it is a single-threaded application. If you use several threads, they are usually dispatched on several cores of the processor (if available), and the stalls will still be there but matter less (if you implemented your application logic right).
Edit: here you can find more information about the Windows and Linux schedulers. Basically, Windows uses quantums (varying from a handful of milliseconds to 120 ms on Windows Server).
Edit 2: you can see a more detailed explanation of the Windows scheduler here.
Could you please tell me why the value of timediff printed by the following program is often 4 microseconds (in the range 90 to 1000 times for different runs), but sometimes 70 or more microseconds for a few cases (in the range of 2 to 10 times for different runs):
#include <iostream>
using namespace std;
#include<sys/time.h>
#define MAXQ 1000000
#define THRDS 3
double GetMicroSecond()
{
    timeval tv;
    gettimeofday (&tv, NULL);
    return (double) (((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}
int main()
{
    double timew, timer, timediff;
    bool flagarray[MAXQ];
    int x=0, y=0;
    for(int i=0; i<MAXQ; ++i)
        flagarray[i] = false;
    while(y <MAXQ)
    {
        x++;
        if(x%1000 == 0)
        {
            timew = GetMicroSecond();
            flagarray[y++]=true;
            timer = GetMicroSecond();
            timediff = timer - timew;
            if(timediff > THRDS) cout << timer-timew << endl;
        }
    }
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three microseconds (the THRDS threshold) is when your program yields to the OS. Cache misses can result in a yield. So can system calls. The function gettimeofday is a system call. As a general rule, you should expect any system call to yield.
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
This is not true. There are always many other programs, and many, many other threads running on your 12 core computer. These include the operating system itself (which comprises many threads in and of itself), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend your program so that one of the myriad other threads that are temporarily suspended but are asking for use of the CPU can run.
One of those daemons is the Network Time Protocol daemon (ntpd). This does all kinds of funky little things to your system clock to keep it close to in sync with atomic clocks. With a tiny little instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, it's a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than it is to time individual executions of the loop body. Even then, you should expect some variability in timing because of cache misses and because the OS temporarily suspends execution of your program.
Modern Unix-based systems have better timers (e.g., clock_gettime) than gettimeofday that are not subject to changes made by the Network Time Protocol daemon. You should use one of these rather than gettimeofday.
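As a sketch of both suggestions, the code below reads CLOCK_MONOTONIC through clock_gettime and times the whole loop instead of each store. The helper names monotonic_ns and average_store_ns are made up here. One caveat I am adding: CLOCK_MONOTONIC never jumps, but its rate may still be slewed by NTP; Linux also offers CLOCK_MONOTONIC_RAW, which is not adjusted at all.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>  // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Nanoseconds from a clock that is not set back when the wall clock is
// adjusted. The starting point is unspecified; only differences are meaningful.
inline std::int64_t monotonic_ns() {
    timespec ts{};
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  // silences the unused-variable warning when NDEBUG is defined
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Times the whole loop and returns the average cost per iteration, instead
// of reading the clock around each individual store.
inline double average_store_ns(volatile bool* flags, std::int64_t count) {
    assert(count > 0);
    const std::int64_t begin = monotonic_ns();
    for (std::int64_t i = 0; i < count; ++i) {
        flags[i] = true;  // volatile keeps the compiler from removing the store
    }
    const std::int64_t end = monotonic_ns();
    return static_cast<double>(end - begin) / static_cast<double>(count);
}
```

The average hides individual outliers, which is the point: a single preemption is spread over all the iterations instead of showing up as one 70-microsecond store.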
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt your thread between the two GetMicroSecond() calls, and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
There are many things happening inside a modern OS at the same time as your program executes. Some of them may "steal" CPU from your program, as stated in NPE's answer. A few more examples of what can influence timing:
interrupts from devices (timer, HDD, network interfaces, to mention a few);
access to RAM (caching)
None of these are easily predictable.
You can expect consistency if you run your code on a microcontroller, or maybe on a real-time OS.
There are a lot of variables that might explain different time values seen. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ to 1000000, let's assume that flagarray takes up 1MB of space.
You can compute how many cache-misses can occur, based on your L1/L2 D-cache sizes. Then you can correlate with how many iterations it takes to fill all of L1 and start missing and same with L2. OS may deschedule your process and reschedule it - but, that I am hoping is less likely due to the number of cores you have. Same is the case with interrupts. An idle system is never completely idle. You may choose to affine your process to a core number, say N by doing
taskset 0x<MASK> ./exe and control its execution.
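The cache arithmetic mentioned above can be sketched as follows. The 64-byte line size and 32 KiB L1 data cache in the usage note are assumptions about a typical x86 machine, not values taken from the question, and the helper names are made up; check your own CPU's numbers before relying on them.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct cache lines covered by `bytes` contiguous bytes that
// start on a line boundary. Each line is fetched at most once when the
// array is written front to back.
inline std::size_t lines_touched(std::size_t bytes, std::size_t line_size) {
    assert(line_size > 0);
    return (bytes + line_size - 1) / line_size;
}

// How many elements of `element_size` bytes fit in a cache of `cache_bytes`
// before further writes must start displacing earlier lines.
inline std::size_t elements_until_full(std::size_t cache_bytes,
                                       std::size_t element_size) {
    assert(element_size > 0);
    return cache_bytes / element_size;
}
```

Under those assumptions, lines_touched(1000000, 64) gives 15625 lines for flagarray, and elements_until_full(32 * 1024, 1) gives 32768 one-byte flags before the array outgrows L1. Hardware prefetching can hide many of these misses for a sequential access pattern, so treat the numbers as an upper bound.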
If you are really curious, I would suggest that you use "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-load-misses
or
perf stat -e LLC-load-misses
Once you have these numbers and the number of iterations you start building a picture of the activity that causes the noticed lag. You may also monitor OS scheduler events using "perf stat".
My application contains several latency-critical threads that "spin", i.e. never block.
Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:
void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}
I do not see a "100% occupied" core in Task Manager; overall system load is 36-40%.
But if I change it to this:
void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}
Then I do see that one of the CPU cores is 100% occupied, also overall system load is reduced to 34-36%.
Does this mean that I should use SetThreadAffinityMask for "spin" threads? Did I improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?
I'm in the middle of porting my application to Linux, so this question is more about Linux if this matters.
Update: I found this slide, which shows that binding a busy-waiting thread to a CPU may help:
Running a thread locked to a single core gives the best latency for that thread in most circumstances if this is the most important thing in your code.
The reasons(R) are
your code is likely to be in your iCache
the branch predictors are tuned to your code
your data is likely to be ready in your dCache
the TLB points to your code and data.
Unless
You're running an SMT system (e.g. hyperthreaded), in which case the evil twin will "help" you by causing your code to be washed out of the iCache, your branch predictors to be tuned to its code, its data to push yours out of the dCache, and your TLB to be impacted by its use.
Cost unknown; each cache miss costs ~4ns, ~15ns and ~75ns for data, and this quickly runs up to several 1000ns.
You still keep the savings from each reason R mentioned above that still applies.
If the evil twin also just spins the costs should be much lower.
Or you're allowing interrupts on your core, in which case you get the same problems and
your TLB is flushed
you take a 1000ns-20000ns hit on the context switch, most should be in the low end if the drivers are well programmed.
Or you allow the OS to switch your process out, in which case you have the same problems as with the interrupt, just at the high end of the range.
switching out could also cause the thread to pause for the entire slice as it can only be run on one (or two) hardware threads.
Or you use any system calls that cause context switches.
No disk IO at all.
Only async IO otherwise.
Having more active (non-paused) threads than cores increases the likelihood of problems.
So if you need less than 100ns latency to keep your application from exploding you need to prevent or lessen the impact of SMT, interrupts and task switching on your core.
The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
It will cost some total throughput,
as some threads might have run if the context could have been switched,
but the latency is more important in this case.
If the thread gets context switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16ms, which is unacceptable in this application.
Locking it to a core and its SMT will lessen this problem, but not eliminate it. Each added core will lessen the problem.
Setting its priority higher will lessen the problem, but not eliminate it.
Scheduling with SCHED_FIFO and the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
If you have a multi-CPU setup, you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.
Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as superuser, and locking the thread to the core and its evil twin should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.
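A rough sketch of the SCHED_FIFO request on Linux follows. It uses sched_setscheduler with a pid of 0 (the calling thread) rather than pthread_setschedparam, purely to keep the example free of pthread linkage; request_fifo_max_priority is a hypothetical name. Expect it to fail with EPERM unless the process is privileged.

```cpp
#include <cassert>
#include <cerrno>
#include <sched.h>  // sched_setscheduler, SCHED_FIFO (Linux)

// Asks the kernel to run the calling thread under SCHED_FIFO at the highest
// priority. Returns 0 on success or the errno value on failure; EPERM is
// expected unless the process is privileged (root or CAP_SYS_NICE).
inline int request_fifo_max_priority() {
    const int max_priority = sched_get_priority_max(SCHED_FIFO);
    assert(max_priority != -1);
    sched_param param{};
    param.sched_priority = max_priority;
    // A pid of 0 refers to the calling thread.
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) return errno;
    return 0;
}
```

Be careful when combining this with a spinning thread: a SCHED_FIFO thread that never blocks can starve everything else on its core, which is one more reason to keep at least one core free for the OS.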
Other links:
Discussion on interrupts.
Your Linux might accept a call to sched_setscheduler using SCHED_FIFO, but this demands that you have your own PID, not just a TID, or that your threads use cooperative multitasking.
This might not be ideal, as all your threads would only be switched "voluntarily", thereby removing the kernel's flexibility to schedule them.
Interprocess communication in 100ns
Pinning a task to a specific processor will generally give better performance for the task. But there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase cpu contention for the remaining tasks. So EVERYTHING else on the system is impacted including the operating system itself. You also need to consider that if tasks need to communicate across memory, and affinities are set to cpus that don't share cache, you can drastically increase latency for communication across tasks.
One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can switch it to a CPU that doesn't have access to the last CPU's cache or TLB. This can increase cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate across higher-level caches and, worst of all, memory. To measure cache statistics on Linux (and performance in general) I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
#include <stdint.h>

static inline uint64_t rdtsc() {
    uint32_t eax, edx;
    // rdtsc puts the low 32 bits of the counter in EAX and the high 32 bits in EDX
    asm volatile ("rdtsc" : "=a"(eax), "=d"(edx));
    return ((uint64_t) edx << 32) | (uint64_t) eax;
}
note - the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp)
also note - if rdtsc is used without an invariant time source (on Linux, grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches CPU (time source)
So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
Intel 64 Architecture Processor Topology Enumeration
What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
Intel Software Developer Reference (Vol. 2A/2B)
Acquire and Release Fences
TCMalloc
I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where each nanosecond counts.
After reading all the answers, I decided to implement and benchmark 4 different approaches
busy wait with no affinity set
busy wait with affinity set
observer pattern
signals
The unbeatable winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free in order to allow the OS to run freely.
My only concern at this point is whether there is some physical harm to those cores that are running at 100% for hours.
Binding a thread to a specific core is probably not the best way to get the job done. You can do that, it will not harm a multi core CPU.
The best way to reduce latency is to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while. Your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost.
There are some details in the docs you need to understand.
This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?
At my company, we often test the performance of our USB and FireWire devices under CPU strain.
There is a test code we run that loads the CPU, and it is often used in really simple informal tests to see what happens to our device's performance.
I took a look at the code for this, and it's a simple loop that increments a counter and does a calculation based on the new value, storing this result in another variable.
Running a single instance will use 1/X of the CPU, where X is the number of cores.
So, for instance, if we're on an 8-core PC and we want to see how our device runs under 50% CPU usage, we can open four instances of this at once, and so forth...
I'm wondering:
What decides how much of the CPU gets used up? does it just run everything as fast as it can on a single thread in a single threaded application?
Is there a way to voluntarily limit the maximum CPU usage your program can use? I can think of some "sloppy" ways (add sleep commands or something), but is there a way to limit to say, some specified percent of available CPU or something?
CPU quotas on Windows 7 and on Linux.
Also on QNX (i.e. Blackberry Tablet OS) and LynuxWorks
In case of broken links, the articles are named:
Windows -- "CPU rate limits in Windows Server 2008 R2 and Windows 7"
Linux -- "CPU Usage Limiter for Linux"
QNX -- "Adaptive Partitioning"
LynuxWorks - "Partitioning Operating Systems" and "ARINC 653"
The OS usually decides how to schedule processes and on which CPUs they should run. It basically keeps a ready queue for processes which are ready to run (not marked for termination and not blocked waiting for some I/O, event etc.). Whenever a process has used up its timeslice or blocks, it frees a processing core and the OS selects another process to run. Now if you have a process which is always ready to run and never blocks, then this process essentially runs whenever it can, thus pushing the utilization of a processing unit to 100%. Of course this is a somewhat simplified description (there are things like process priorities, for example).
There is usually no generic way to achieve this. The OS you are using might offer some mechanism to do this (some kind of CPU quota). You could try and measure how much time has passed vs. how much cpu time your process used up and then put your process to sleep for certain periods to achieve an approximation of desired CPU utilization.
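A simple variant of the sleep-based approximation is a fixed duty cycle: spin for a fraction of each period and sleep for the rest. The sketch below is that simpler variant, not a feedback loop on measured CPU time; busy_slice, run_throttled and the Clock alias are names I made up for it.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Length of the busy part of each period for a target utilization in (0, 1].
inline Clock::duration busy_slice(Clock::duration period, double target) {
    assert(target > 0.0 && target <= 1.0);
    return std::chrono::duration_cast<Clock::duration>(period * target);
}

// Spins for `target` of every period and sleeps for the rest, for `periods`
// periods, approximating `target` utilization of one core.
inline void run_throttled(double target, Clock::duration period, int periods) {
    const Clock::duration busy = busy_slice(period, target);
    for (int i = 0; i < periods; ++i) {
        const Clock::time_point start = Clock::now();
        while (Clock::now() - start < busy) {
            // busy-wait: this is the "work" being rationed
        }
        std::this_thread::sleep_until(start + period);
    }
}
```

For example, run_throttled(0.5, std::chrono::milliseconds(100), 50) should keep one core roughly half busy for about five seconds. The achieved figure is only approximate, because the OS may wake the thread late or preempt it during the busy part.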
You've essentially answered your own questions!
The key trait of code that burns a lot of CPU is that it never does anything that blocks (e.g. waiting for network or file I/O), and never voluntarily yields its time slice (e.g. sleep(), etc.).
The other trick is that the code must do something that the compiler cannot optimize away. So, most likely your CPU burn code outputs something based on the loop calculation at the end, or is simply compiled without optimization so that the optimizer isn't tempted to remove the useless loop. Since you're trying to load the CPU, there's no sense in optimizing anyways.
As you hypothesized, single threaded code that matches this description will saturate a CPU core unless the OS has more of these processes than it has cores to run them--then it will round-robin schedule them and the utilization of each will be some fraction of 100%.
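A CPU-burn loop of the kind described might look like the sketch below. This is my guess at the shape of such code, not the actual test program from the question; the volatile variable is one way to keep the optimizer from removing the loop even when optimization is turned on.

```cpp
#include <cassert>
#include <cstdint>

// Increments a counter and stores a value derived from it. The volatile
// qualifier forces every store to happen, so an optimizing compiler cannot
// delete the loop as dead code.
inline std::uint64_t burn(std::uint64_t iterations) {
    assert(iterations > 0);
    volatile std::uint64_t sink = 0;
    for (std::uint64_t counter = 0; counter < iterations; ++counter) {
        sink = sink + counter * counter;
    }
    return sink;
}
```

Called in an endless loop from main, this never blocks and never yields, so it should saturate one core as long as there are no more such processes than cores.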
The issue isn't how much time the CPU spends idle, but rather how long it takes for your code to start executing. Who cares if it's idle or doing low-priority busywork, as long as the latency is low?
Your problem is fundamentally a consequence of using a synthetic benchmark, presumably in an attempt to obtain reproducible results. But synthetic benchmarks tend to produce meaningless results, so reproducibility is moot.
Look at your bug database, find actual customer complaints, and use actual software and test hardware to reproduce a situation that actually made someone dissatisfied. Develop the performance test in parallel with hard, meaningful performance requirements.