How to Adjust Processor Bus Multiplier - c++

I am looking for a Windows function, structure, or API that controls the multiplier of the processor's bus speed. In other words, I am trying to adjust the frequency of the CPU by varying the multiplier. Currently, I am adjusting the CPU speed by modifying the power scheme using the following function:
PowerWriteDCValueIndex(…,…,…,…)
and by adjusting
THROTTLE_MAXIMUM and THROTTLE_MINIMUM
However, this only lets me vary the processor speed as a percentage, which is not precise enough.
Hope my question is clear and you can help.
Thanks.

The handling of power states in the OS is done by a kernel driver module, which will be specific at least to the particular CPU vendor, and sometimes also to the CPU model (e.g. the operation is done differently on a 64-bit AMD processor than on a 32-bit AMD processor; I once worked on Linux drivers to set the clock speeds of AMD processors).
This driver will be controlled by a "governor" process that takes as input the configuration settings (the policy you're already adjusting), the current load on the CPU (often with some "load history" too, to avoid switching too often), and other sources such as CPU temperature and remaining battery power (if applicable). [In mobile devices, the temperature of the CPU is definitely an input into the equation, since most modern CPUs and GPUs can draw much more power than the device can dissipate, overheating the chip if a high power setting is left on for too long.]
Unfortunately, you need to know a lot more detail than "I want to run this fast" before you can do this. There are BIOS tables (ACPI and/or other vendor-specific tables) that define what voltage to use at what frequency, and you need to raise the voltage before the clock speed when going up in speed, and lower the clock speed before the voltage when going down. The tables will most often not contain ALL the speeds the CPU can run at, but something like "full speed", "medium speed" and "slow speed" settings. [And there will be multiple tables for different types of CPUs, since the BIOS doesn't know whether the person building the system will use a high-power, high-speed CPU or a low-speed, low-power one.]
There are also registers that need to be programmed to determine how long the CPU should "sleep" before it switches to the new speed, to allow the PLLs (which control the clock multipliers) to stabilize. This also means that you don't want to switch too often.
The system also needs to know that the clock frequency has changed so that any processing that depends on the CPU speed can be adjusted (e.g. things that use the RDTSC instruction on x86 to measure short times will need to adjust their timings based on the new setting).
If you don't get ALL of these things perfect, you will have an unstable system (and in a mobile device, you could even "fry" the chip - or the user!).
It is not clear what you intend to do, but in general, it's better to leave these things to the governor that already is in the system than to try to make a better system - almost all attempts to make this "better" will fail.

Related

OpenCL Profiling timestamps are not consistent in duration compared to CPU clock

I am creating a custom tool interface with my application to profile the performance of OpenCL kernels while also integrating CPU profiling points. I'm currently working with this code on Linux using Ubuntu, and am testing using the 3 OpenCL devices in my machine: Intel CPU, Intel IGP, and Nvidia Quadro.
I am using this code std::chrono::high_resolution_clock::now().time_since_epoch().count() to produce a timestamp on the CPU, and of course for the OpenCL profiling time points, they are 64-bit nanoseconds provided from the OpenCL profiling events API. The purpose of the tool I made is to consume log output (specially formatted and generated so as not to impact performance much) from the program and generate a timeline chart to aid performance analysis.
So far in my visualization interface I had made the assumption that nanoseconds are uniform. I've realized now after getting my visual interface working and checking a few assumptions that this condition more or less does hold to a standard deviation of 0.4 microsecond for the CPU OpenCL device (which indicates that the CPU device could be implemented using the same time counter, as it has no drift), but does not hold for the two GPU devices! This is perhaps not the most surprising thing in the world, but it affects the core design of my tool, so this was an unforeseen risk.
I'll provide some eye candy since it is very interesting and it does prove to me that this is indeed happening.
This is zoomed into the beginning of the profile where the GPU has the corresponding mapBuffer for poses happening around a millisecond before the CPU calls it (impossible!)
Toward the end of the profile we see the same shapes but reversed relationship, clearly showing that GPU seconds seem to count for a little bit less compared to CPU seconds.
The way this visualization currently works, since I had assumed a GPU nanosecond is indeed a CPU nanosecond, is that I have been computing the average of the delta between the values given to me by the CPU and GPU... Since I implemented this initially, perhaps it indicates that I was at least subconsciously expecting an issue like this one. Anyway, what I did was establish a sync point at the kernel dispatch by recording a CPU timestamp immediately before calling clEnqueueNDRangeKernel and then comparing it against the CL_PROFILING_COMMAND_QUEUED OpenCL profiling event time. On further inspection, this delta showed the time drift:
This screenshot from the chrome console shows me logging the array of delta values I collected from these two GPU devices; they are showing BigInts to avoid losing integer precision: in both cases the GPU reported timestamp deltas are trending down.
Compare with the numbers from the CPU:
My questions:
What might be a practical way to deal with this issue? I am currently leaning toward the use of sync points when dispatching OpenCL kernels, and these sync points could be used to either locally piecewise stretch the OpenCL Profiling timestamps, or to locally sync at the beginning of, say, a kernel dispatch and just ignore the discrepancy we have, assuming it will be insignificant during the period. In particular it is unclear whether it'd be a good idea to maximize granularity by implementing a sync point for every single profiling event I want to use.
What might be some other time measuring systems I can or should use on the CPU-side to see if maybe they will end up aligning better? I don't really have much hope in this at this point because I can imagine that the profiling times being provided to me are actually generated and timed on the GPU device itself. The fluctuations would then be affected by such things as dynamic GPU clock scaling, and there would be no hope of stumbling upon a different better timekeeping scheme on the CPU.

What strategies and practices are used, when running very intense and long calculations, to ensure that hardware isn't damaged?

I have many large Fortran programs to run at work. I have access to several desktop computers, and each run of the Fortran code takes several consecutive days. It's essentially running the same master module many times (let's say N times) with different parameters, something akin to Monte Carlo on steroids. In that sense the code is parallelizable; however, I don't have access to a cluster.
Within the scientific computing community, what practices and strategies are used to minimise hardware damage from heat? The machines of course have their own cooling systems (fans and heat sinks), but even so, surely running intense calculations non-stop for half a week cannot be healthy for the life of the machines? Though maybe I'm overthinking this?
I'm not aware of any intrinsic functions in Fortran that can pause the code to give components a break. Currently I've written a small module that keeps an eye on the system clock, with a do-while loop that "wastes time" in between consecutive runs of the master module in order to shed heat. Is this an acceptable way of doing this? The processor is, after all, still running a while loop.
Another way would be to use a shell script or Python code to drive the Fortran runs. Alternatively, are there any intrinsic routines in the compiler (gfortran) that could achieve this? What are the standard, effective, and accepted practices for dealing with this?
Edit: I should mention that all machines run on Linux, specifically Ubuntu 12.04.
For an MS-DOS application I would consider the following:
-Reduce I/O operations with the HDD as much as possible, that is, keep data in memory as much as you can, or keep data on a RAM disk. A RAM disk driver is available on Microsoft's website; let me know if you can't find it and I'll look through my CD archives.
-Try to use extended memory by using a DPMI driver (DPMI - DOS Protected Mode Interface).
-Set CPU affinity for a second CPU.
-Boost the priority to High, but I wouldn't recommend boosting to Real-Time.
I think you need a hardware solution here, not a software solution. You need to increase the rate of heat exchange in the computers (new fans, water cooling, etc) and in the room (turn the thermostat way down, get some fans running, etc).
To answer the post more directly, you can use the Fortran SLEEP intrinsic (a GNU extension) to pause a computation for a given number of seconds. You could use some system calls in Fortran to set the argument on the fly. But I wouldn't recommend it - you might as well just run your simulations on fewer computers.
To keep the advantages of the multiple computers, you need better heat exchange.
As long as the hardware is adequately dissipating heat and components are not operating at or beyond their "safe" temperature limits, they* should be fine.
*Some video cards were known to run very hot; i.e. 65-105°C. Typically, electronic components have a maximum temperature rating of exactly this. Beyond it, reliability degrades very quickly. Even though the manufacturer made these cards this way, they ended up with a reputation for failing (i.e. older nVidia FX, Quadro series.)
*Ubuntu likely has a "Critical temperature reached" feature where the entire system will power off if it overheats, as explained here. Windows is "blissfully ignorant." :)
*Thermal stress (large, repeated temperature variations) may contribute to failure of ICs, capacitors, and hard disks. In my experience, over three decades of computing have taught me that adequate cooling and leaving the PC on 24/7 may actually save wear and tear. (A typical PC will cost around $200 USD/year in electricity, so it's more of a trade-off in terms of cost.)
*PC's must be cleaned twice a year (depending on airborne particulate constituency and concentration.) Compressed air is nice for removing dust. Dust traps heat and causes failures. Operate a shop-vac while "dusting" to prevent the dust from going everywhere. Wanna see a really dusty computer?
*The CPU should be "ok" with its stock cooler. Check its temperature at cold system boot-up, then again after running code for an hour or so. The fan is speed-controlled to limit temperature rise. The CPU temperature rise shouldn't be much more than about 40°C, and less is better. But an aftermarket, better-performing CPU cooler never hurts, such as these. CPUs rarely fail unless there is a manufacturing flaw or they operate near or beyond their rated temperatures for too long, so as long as they stay cool, long calculations are fine. Typically, they stop functioning and/or reset the PC if too hot.
*Capacitors tend to fail very rapidly when overheated. It is a known issue that some cap vendors are "junk" and will fail prematurely, regardless of other factors. "Re-capping" is the art of fixing these components. For a full run-down on this topic, see badcaps.net. It used to be possible to re-cap a motherboard, but today's 12+ layer and ROHS (no lead) motherboards make it very difficult without specialty hot-air tools.

Finding CPU utilization and CPU cycles

My program is in C++ and I have one server listening to a number of clients. Clients send small packets to server. I'm running my code on Ubuntu.
I want to measure the CPU utilization and possibly total number of CPU cycles on both sides, ideally with a breakdown on cycles/utilization spent on networking (all the way from NIC to the user space and vice versa), kernel space, user space, context switches, etc.
I did some search, but I couldn't figure out whether it should be done inside my C++ code or an external profiler should be used, or perhaps some other way.
Your best friend/helper in this case is the /proc file system in Linux. In /proc you will find CPU usage, memory usage, power usage etc. Have a look at this link
http://www.linuxhowtos.org/System/procstat.htm
You can even check each process's CPU usage by looking at the file /proc/process_id/stat.
Take a look at the RDTSCP instruction and some other ways to measure performance metrics. System simulators like SniperSim, Gem5, etc. can also give the total cycle count of your running program (however, they may not be very accurate; some conditions need to be met, e.g. core frequencies being the same).
As I commented, you probably should consider using oprofile. I am not very familiar with it (and it may be complex to use, and require system-wide configuration)

Linux Timing across Kernel & User Space

I'm writing a kernel module for a special camera, working through V4L2 to handle transfer of frames to userspace code. Then I do lots of userspace stuff in the app.
Timing is very critical here, so I've been doing lots of performance profiling and plain old std::chrono::steady_clock stuff to track timing, but I've reached the point where I need to also collect timing data from the Kernel side of things so that I can analyze the entire path from hardware interrupt through V4L DQBuf to userspace...
Can anyone recommend a good way to get high-resolution timing data that would be consistent with application userspace data, which I could use for such comparisons? Right now I'm measuring activity in microseconds.
Ubuntu 12.04 LTS
At the lowest level, there are the rdtsc and rdtscp instructions if you're on an x86/x86-64 processor. That should provide the lowest overhead, highest possible resolution across the kernel/userspace boundary.
However, there are things you need to worry about. You need to make sure you're executing across the same core/cpu, the process isn't being context switched, and the frequency isn't changing across invocations. If the cpu supports an invariant tsc, (constant_tsc in /proc/cpuinfo) it's a little more reliable across cpus/cores and frequencies.
This should provide roughly nanosecond accuracy.
There are a lot of kernel-level utilities available that can collect timing-related traces for you, e.g. ptrace, ftrace, LTTng, Kprobes. Check out this link for more information.
http://elinux.org/Kernel_Trace_Systems

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on an x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine), but I'm wondering how one would measure instructions per cycle or reads/writes per cycle with respect to the working set.
How should I proceed to measure these values?
Modern processors (i.e., anything not severely constrained and less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given suitable instruction ordering). The latest x86 processors translate the CISC instructions into internal RISC-like micro-operations, reorder them and execute the result, and even have several register banks so instructions using "the same registers" can run in parallel. There isn't any reasonable way to define "the time an instruction's execution takes" today.
Current CPUs are much faster than memory (a few hundred instructions is the typical cost of accessing memory), so they are all heavily dependent on cache for performance. And then you have all kinds of funny effects of cores sharing (or not sharing) parts of the cache...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (a modicum of cache/virtual-memory awareness is useful here too), careful programming and, as the most extreme measure to squeeze out the last 2% of performance, considerations like the ones you mention (and the other favorite, "rewrite in assembly"). The ordering is that one because the earlier levels give more performance for the same cost. Measure before digging in; programmers are notoriously unreliable at finding bottlenecks. And consider the cost of reorganizing code for performance: the work itself, convincing yourself the more complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes sense (perhaps for heavily travelled code paths in popular operating systems, or in common code paths generated by a compiler, but almost nowhere else).
If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There are tools like VTune from Intel, CodeAnalyst from AMD, and oprofile for Linux (which works with both AMD and Intel processors).
There is a whole range of different registers that count the number of instructions actually completed, the number of cycles the processor spends waiting, and so on. You can also get counts of things like "number of memory reads", "number of cache misses", "number of TLB misses", "number of FPU instructions".
The next, more tricky part is of course trying to fix any of these sorts of issues, and as mentioned in another answer, programmers aren't always good at tweaking these sorts of things - it's certainly time-consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y. (There were some tuning tricks for the early Pentium 4 that work VERY badly on AMD processors - if, on the other hand, you tune that code for AMD processors of that age, you get code that runs well on same-generation Intel processors too!)
You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
However, I'd suggest simply measuring using QueryPerformanceCounter. It is rare that the actual number of cycles is important, to tune code you typically only need to be able to compare relative time measurements, and rdtsc has many pitfalls (though probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust the frequency, changing the rate of change in time with respect to the rate of change in cycles.