Total process time PROC SQL, understanding the log - sas

I'd like to understand the time log better to improve performance.
My real time is typically much bigger than the CPU time. Is this expected?
Is the difference due to disk access? Are in-memory operations included in CPU time or in real time?
Is there useful information regarding performance optimization in the other lines of the log ?
An example:
NOTE: PROCEDURE SQL used (Total process time):
real time 9:06.00
user cpu time 1:36.79
system cpu time 19.11 seconds
memory 7463.31k
OS Memory 24628.00k
Timestamp 06/07/2018 12:45:31 PM
Page Faults 7
Page Reclaims 1566
Page Swaps 0
Voluntary Context Switches 370694
Involuntary Context Switches 36835
Block Input Operations 0
Block Output Operations 0

CPU time is how much CPU processing was going on - how many CPU clock cycles were used. It accumulates per core (i.e. it is summed across cores), so it can be greater than real time if multiple cores (virtual or physical) are in use. For example, if something that's CPU-intensive and parallelized keeps 4 cores on your machine busy for 2 minutes, real time will be 2 minutes while CPU time could be as high as 8 minutes.
Real time is literally the clock time between job start and job end. In the case of proc sql, it's until quit is reached - so if you leave proc sql open without a quit, the real time could be until you next run a proc.
Much of the time, the difference between CPU time and real time is disk I/O, as you say - if you're reading over a network or from a spinning disk, it's likely that your disk I/O will take longer than your CPU time.
Reading a "file" from memory is not included in CPU time except insofar as it involves the CPU.
The other fields are helpful for diagnosing performance issues at varying levels. Amount of memory used could indicate particular issues with your code (if it's using huge amounts of memory, maybe you're doing something inefficiently).
You might want to read some papers on SQL efficiency and optimization, like Kirk Lafler's - see in particular the _METHOD and _TREE discussion at the end. And of course the various ways to get more information are mentioned in the documentation, such as STIMER and FEEDBACK.

Related

Multithreaded Cache Miss Exploiting

When I, e.g., iterate over a linked list and am really unlucky, I will have a ~0% cache hit rate (let's assume this anyway). Let's also assume, for simplicity, that I have a CPU that can only run one instruction at a time (no multicore / hyperthreads). Cool. Now with my 0% hit rate the CPU / program is spending 99% of the time waiting for data.
Question: If a thread is waiting for data from RAM / disk, is that core blocked? Or can I exploit the low cache hit rate by running other threads (or some other way that does not involve increasing the hit rate) so the CPU does not exclusively wait for data and does other work instead?
If you run SMT, the other thread can grab all the core's resources and hence cover over the cache miss (at least partially).
I know of no processor that switches tasks on a cache miss, but I know several architectures that use SMT-2/4/8 (yes, some Power CPUs have SMT-8) to cover such cases.

Is adding user time + system time (from the shell's time command) a reliable measure even when multitasking?

I know user time deals with CPU time, so it isn't affected by time-sharing, since it only measures the CPU operations performed while our process had the CPU to itself.
But can the system time be distorted if, say, multiple processes are doing I/O?
There are three measures which are easily available:
wall time - how much time has elapsed.
kernel time - how much work the system did on your behalf
user time - how much time your code ran.
These help tell a story, but not the whole story. In general, the system accounts for these in a special way: it allows your program to run on a CPU core for a fixed length of time (a quantum).
When that time elapses, it stops your program and checks if there is anything else to do. If there is nothing else, your program starts again.
The values of user and system time you get from the common operating systems (Windows, Unix) are basically a count of the number of times the system reached the end of the quantum and found your program in user-mode code (user time) or kernel code (kernel time).
Thus the time measured as user/system is an estimate of the time spent, not an accurate measurement.
If you perform I/O, or yield your quantum (e.g. by calling sleep), then the system will not account for any of the time you spent; it will assume you were idle for almost all of the quantum, and no time is accounted to your application (user or system).
The wall time is usually delivered by "reliable" timers. They may not give an accurate time-of-day (PCs can drift by seconds per day), but they will give an accurate measure of elapsed time - they measure the number of clock ticks which have occurred between two points, and that is unaffected by load on the system (VMs can be impacted by load on the host machine).
But can the system time be distorted if, say, multiple processes are doing I/O?
The system time is stable, and (excluding VMs on a machine under load), will provide accurate and repeatable time measurements.
I know user time deals with CPU time
User time (and kernel time) are guesses at how much time is spent by an application, and can be fooled.
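A minimal sketch of how these three measures can be read from inside a program, assuming a POSIX system where getrusage() reports the user and system CPU time of the calling process: the sleep (standing in for I/O-style waiting) shows up in the wall time but in neither the user nor the system time.

// POSIX-only illustration of wall vs. user vs. system time.
#include <sys/resource.h>
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    auto wall_start = std::chrono::steady_clock::now();

    std::this_thread::sleep_for(std::chrono::seconds(2));  // "I/O-like" waiting

    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);                            // CPU time used so far
    auto wall_end = std::chrono::steady_clock::now();

    double user_s = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys_s  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    double wall_s = std::chrono::duration<double>(wall_end - wall_start).count();

    std::cout << "wall: " << wall_s << " s, user: " << user_s
              << " s, system: " << sys_s << " s\n";
}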

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefits of helping your CPU prefetch, but also that prefetching speeds up the process during runtime. After about 100 program cycles, the CPU seems to have figured it out and has optimized the cache accordingly. This saves me up to 200,000 ticks per cycle; the number drops from around 750,000 to 550,000. I got these numbers using QTestLib.
Now to the question: Is there a safe way to use this runtime speedup, letting it warm up, so to speak? Or should one not count on this at all and just build faster code from the start?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: that would only speed up the first 100 program cycles in your case, gaining a total of less than 20,000,000 ticks (100 cycles at up to 200,000 ticks each). That's much less than the roughly 75,000,000 ticks you would have to invest in the warming up (100 cycles at around 750,000 ticks each).
Second, all these gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect, and you generally do not control them. Mostly they come from your process not being alone on the system: a process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce reliable results. Apart from that, these effects are mostly ignored.
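As an illustration of controlling for warm-up in a benchmark, here is a minimal C++ sketch (the iteration counts and the work() function are made up): a batch of untimed warm-up iterations runs first so caches and predictors settle, and only the subsequent iterations are timed.

// Benchmarking pattern: untimed warm-up, then timed measurement.
#include <chrono>
#include <cstdint>
#include <iostream>

static std::uint64_t work(std::uint64_t seed) {
    // Stand-in for the real computation being benchmarked.
    for (int i = 0; i < 100000; ++i) seed = seed * 6364136223846793005ULL + 1;
    return seed;
}

int main() {
    std::uint64_t sink = 1;

    for (int i = 0; i < 100; ++i) sink = work(sink);      // warm-up, not timed

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) sink = work(sink);     // measured iterations
    auto stop = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "avg per iteration: " << us.count() / 1000.0
              << " us (sink=" << sink << ")\n";
}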
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

Is clock() in c++ consistent with heavy CPU loads

Right now I basically have a program that uses clock to test the amount of time my program takes to do certain operations and usually it is accurate to a couple milliseconds. My question is this: If the CPU is under heavy load will I still get the same results?
Does clock only count when the CPU is working on my process?
Let's assume a multi-core CPU, but a process that does not take advantage of multithreading.
The behaviour of clock depends on the OS. On Windows, owing to a long-distant design decision, clock gives the elapsed (wall-clock) time; on most other OSes (certainly Linux, MacOS and other Unix-related OSes), clock gives the CPU time used by your process.
Depending on what you actually want to achieve, elapsed time or CPU time may be what you want to measure.
In a system where there are other processes running, the difference between elapsed time and CPU usage may be huge. And of course, if your CPU is NOT busy running your application (e.g. it is waiting for network packets to go down the wire or for file data from the hard disk), then the elapsed time is "available" for other applications.
There are also a huge number of error factors/interference factors when there are other processes running in the same system:
If we assume that your OS supports clock as a measure of CPU time, the precision here is not always that great - for example, it may well be accounted in terms of CPU-timer ticks, and your process may not run for the "full tick" if it's doing I/O, for example.
Other processes may use "your" CPU for parts of interrupt handling, before the OS has switched to "account for this as interrupt time", when dealing with packets over the network or hard-disk I/O [typically not huge amounts, but in a very busy system it can be several percent of the total time]. And if other processes run on "your" CPU, the time to reload the cache(s) with "your" process's data after the other process loaded its data will be accounted on "your" time. This sort of "interference" may very well affect your measurements - how much depends very much on "what else" is going on in the system.
If your process shares data [via shared memory] with another process, there will also be some time spent dealing with "cache-snoop requests" between your process and the other process, when your process doesn't get to execute (again, typically a minute amount, but in extreme cases it can be significant).
If the OS is switching tasks, "half" of the time spent switching to/from your task will be accounted to your process, and half to the other process being switched in/out. Again, this is usually a tiny amount, but if you have a very busy system with lots of process switches, it can add up.
Some processor types, e.g. Intel's HyperThreading, also share resources with your actual core, so only SOME of the time on that core is spent in your process, and the cache content of your process is now shared with some other process's data and instructions - meaning your process MAY get "evicted" from the cache by the other thread running on the same CPU core.
Likewise, multicore CPUs often have a shared L3 cache that gets affected by other processes running on the other cores of the CPU.
File-caching and other "system caches" will also be affected by other processes - so if your process is reading some file(s), and other processes also access file(s), the cache-content will be "less yours" than if the system wasn't so busy.
For accurate measurements of how much your process uses of the system resources, you need processor performance counters (and a reproducible test case, because you probably need to run the same setup several times to ensure you get the "right" combination of performance counters). Of course, most of these counters are ALSO system-wide, and some kinds of processing, for example interrupts and other random interference, will affect the measurement, so the most accurate results will be obtained if you DON'T have many other (busy) processes running on the system.
Of course, in MANY cases, just measuring the overall time of your application is perfectly adequate. Again, as long as you have a reproducible test case that gives the same (or at least similar) timing each time it's run in a particular scenario.
Each application is different, each system is different. Performance measurement is a HUGE subject, and it's very hard to cover EVERYTHING - and of course, we're not here to answer extremely specific questions about "how do I get my PI-with-a-million-decimals to run faster when there are other processes running in the same system" or whatever it may be.
In addition to agreeing with the responses indicating that timings depend on many factors, I would like to bring to your attention the std::chrono library available since C++11:
#include <chrono>
#include <iostream>

int main() {
    // Take a time point before and after the work being measured.
    auto beg = std::chrono::high_resolution_clock::now();
    std::cout << "*** Displaying Some Stuff ***" << std::endl;
    auto end = std::chrono::high_resolution_clock::now();
    // Report the difference, cast to microseconds.
    auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - beg);
    std::cout << "Elapsed: " << dur.count() << " microseconds" << std::endl;
}
As per the standard, this program will use the clock with the shortest tick period your implementation provides, and here the elapsed duration is reported in microseconds (other units are available; see the docs).
Sample run:
$ g++ example.cpp -std=c++14 -Wall -Wextra -O3
$ ./a.out
*** Displaying Some Stuff ***
Elapsed: 29 microseconds
While it is much more verbose than relying on the C-style std::clock(), I feel it gives you much more expressiveness, and you can hide the verbosity behind a nice interface (for example, see my answer to a previous post where I use std::chrono to build a function timer).
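For example, one possible way to hide that verbosity (a sketch only; ScopedTimer is a made-up name, not a standard class) is a small RAII wrapper that records the start time in its constructor and prints the elapsed time when it goes out of scope.

// Sketch of a scope-based timer built on std::chrono.
#include <chrono>
#include <iostream>
#include <string>

class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)),
          start_(std::chrono::high_resolution_clock::now()) {}
    ~ScopedTimer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start_);
        std::cout << label_ << ": " << us.count() << " microseconds\n";
    }
private:
    std::string label_;
    std::chrono::high_resolution_clock::time_point start_;
};

int main() {
    ScopedTimer t("Displaying some stuff");   // prints elapsed time when main exits
    std::cout << "*** Displaying Some Stuff ***" << std::endl;
}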
There are shared components in the CPU, like the last-level cache and execution units (shared between hardware threads within one core), so under heavy loads you will get jitter: even if your application executed exactly the same number of instructions, each instruction may take more cycles (waiting for memory because data was evicted from the cache, or waiting for an available execution unit), and more cycles means more time to execute (assuming that Turbo Boost won't compensate).
If you are looking for a precise instrument, look at hardware counters.
When looking at timing metrics for CPU-intensive tasks, it is also important to consider factors like the number of cores available on the physical CPU, hyper-threading, BIOS settings such as Turbo Boost on Intel CPUs, and the threading techniques used in your code.
Parallelization tools like OpenMP provide built-in functions for measuring computation and wall time, such as omp_get_wtime(), which are often more accurate than clock() in programs that make use of this type of parallelization.
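A minimal OpenMP sketch of that approach (assuming OpenMP is available and the code is compiled with, e.g., g++ -fopenmp): omp_get_wtime() returns wall-clock seconds as a double, which is convenient in parallel regions where per-process CPU time would sum the work of all threads.

// Timing a parallel loop with omp_get_wtime().
#include <omp.h>
#include <cstdio>

int main() {
    const long n = 100000000L;
    double sum = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) sum += 1.0 / (i + 1.0);
    double t1 = omp_get_wtime();

    std::printf("sum=%f elapsed=%f s using up to %d threads\n",
                sum, t1 - t0, omp_get_max_threads());
}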

How to limit CPU usage of a specific process?

How can I limit CPU usage to, for example, 10% for a specific process in Windows C++?
You could use Sleep(x) - it will slow down your program's execution but will free up CPU cycles,
where x is the time in milliseconds.
This is rarely needed, and maybe thread priorities are a better solution, but since you asked, here is what you should do:
do a small fraction of your "solid" work i.e. calculations
measure how much time step 1) took, let's say it's twork milliseconds
Sleep() for (100/percent - 1)*twork milliseconds where percent is your desired load
go back to 1.
In order for this to work well, you have to be really careful in selecting how big a "fraction" of the calculation is, and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so; if it takes less, the overhead from sleeping and measuring might become significant, and if it takes more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0 and 100% CPU (which is what happens anyway, but if you do it fast enough it looks like you only take whatever percentage you chose). Two additional things to note: first, as mentioned before me, this works at the thread level, not the process level; second, your work must be real CPU work - disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU.
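A minimal Windows C++ sketch of the approach described above (the work_slice() function and the 10% target are placeholders): do a slice of CPU-bound work, measure how long it took, then Sleep() for (100/percent - 1) times that long, so the work makes up roughly percent of each work-plus-sleep period.

// Duty-cycle throttling of a single worker thread.
#include <windows.h>
#include <chrono>

static void work_slice() {
    // Stand-in for roughly 40-250 ms of real CPU-bound work.
    volatile double x = 0.0;
    for (long i = 0; i < 20000000L; ++i) x += 1.0;
}

int main() {
    const double kPercent = 10.0;  // target CPU load for this thread

    for (;;) {
        auto start = std::chrono::steady_clock::now();
        work_slice();
        auto stop = std::chrono::steady_clock::now();

        double twork_ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        double sleep_ms = (100.0 / kPercent - 1.0) * twork_ms;
        Sleep(static_cast<DWORD>(sleep_ms));   // idle so the average load stays near kPercent
    }
}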
That is the job of the OS; you cannot control it.
You can't limit it to exactly 10%, but you can reduce its priority and restrict it to use only one CPU core.