I have created a Verilog FPU and I was wondering how I would go about running SPECfp benchmarks on it, or whether that is even possible?
Coordination between e.g. memory access and floating point arithmetic is critical to real-world floating point performance, which is what SPECfp is intended to measure. Often, the limiting factor is getting the operands to the floating point ALU, and the results back to memory, not doing the arithmetic once the right operands are in the right registers.
Do you have a model of the whole processor and memory? If so, how fast does it run? Processor designers do run benchmarks on models, but it takes a lot of compute power.
The SPEC benchmarks are software (I believe SPECfp is entirely in C and Fortran, but I'm not sure). You don't run them on arbitrary hardware components, you run them on a complete system with a compiler and runtime environment. If all you have is an FPU, you'll need to mate it with something that can run general code, and then come up with a compiler back end to target your custom architecture.
Related
Hi everyone! I have been trying to create my own big integer class for an RSA implementation in C++ (for practice purposes only). The only way I can see such a thing being implemented well in terms of performance is by using C++'s built-in bitwise operations (&, |, ^), meaning implementing custom full-adders for addition, binary multipliers for multiplication, etc. The thing I am interested in can be formulated as follows: would custom-made emulations of hardware circuits for number arithmetic (like full-adders and multipliers) built from bitwise C++ operations be slower in terms of performance? In other words, if I make my own unsigned integer class of 64 bits and make it "ideal" in terms of the number of bitwise operations needed to perform addition, multiplication, and division on it, can it have the same performance as the built-in unsigned long long? Can it be implemented to be this fast in any programming language at all, or will you never be able to make any operation faster than those in the CPU's intrinsic instruction set?
Please note that I am not interested in answers regarding implementations of RSA, but only in performance comparison of native and hand-made arithmetic.
Thank you in advance!
Software implementations cannot even come close to arithmetic circuits that are typically used in hardware. It's not even a question of benchmarking or of different results depending on the system (assuming we're talking about hardware that isn't prehistoric), it's a hands-down win for hardware circuits, guaranteed, every time, by a huge factor. A big difference between hardware and software is this: hardware has almost unlimited parallelism, so the number of bit-operations doesn't matter as much, speed depends primarily on the "depth" of a circuit. Software also has parallelism, but it's very limited.
Consider a typical fast multiplier, as used in modern hardware. It's based on some parallel reduction scheme, such as a Dadda multiplier (the actual circuit doesn't necessarily follow Dadda's algorithm to the letter, but it's going to use a similar parallel reduction). Hardware has almost unlimited parallelism to do that parallel reduction with. As a result, 64-bit multiplication takes 3 cycles on many modern machines (not all of them, but for example both the Apple M1 and all current Intel and AMD x64 processors). Granted, in 3 cycles you could squeeze in more than 3 bitwise operations, but it's just not even a contest: you cannot implement multiplication in just a handful of bitwise operations.
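For a sense of scale, here is a minimal software sketch (illustrative only, not code from the question) of a 64-bit shift-and-add multiply. Even using the native + for the partial sums, it needs up to 64 shift/test/add iterations, against roughly 3 cycles for the hardware multiplier described above:

    #include <cstdint>

    // Shift-and-add multiplication: examines one bit of b per iteration,
    // so up to 64 iterations of shift/test/add for a 64-bit operand,
    // versus a ~3-cycle latency for a single native 64-bit multiply.
    uint64_t shift_add_mul(uint64_t a, uint64_t b) {
        uint64_t result = 0;
        while (b != 0) {
            if (b & 1)
                result += a;   // add the current partial product
            a <<= 1;           // next partial product
            b >>= 1;           // next bit of the multiplier
        }
        return result;
    }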
Even just addition is already unbeatable. It's already as fast as bitwise operations are, or perhaps more accurately, bitwise operations are as slow as addition is. The time a bitwise operation takes at the software level has little to do with the latency of the corresponding gate; it's more a property of how the processor was designed in general. By the way, you may also be interested in this other question: Why is addition as fast as bit-wise operations in modern processors?
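As a concrete illustration (again a sketch, not from the question): emulating a single addition with only bitwise operators needs a carry-propagation loop, while the native ADD it competes against completes in one cycle:

    #include <cstdint>

    // "Full adder" style addition using only &, ^ and <<.
    // The loop runs once per carry-propagation step (up to 64 times for
    // 64-bit operands), while a native ADD finishes in a single cycle.
    uint64_t bitwise_add(uint64_t a, uint64_t b) {
        while (b != 0) {
            uint64_t carry = (a & b) << 1;  // positions that produce a carry
            a ^= b;                          // sum ignoring carries
            b = carry;                       // carries become the next addend
        }
        return a;
    }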
As P Kramer said in his comment, it totally depends on your system, instruction set, and compiler. Modern compilers are pretty good at finding optimizations for your specific CPU when you ask them to, but it's impossible to know from theory alone whether they'll do as well as, or better than, the native instruction.
As usual in this case, I suggest A/B testing (don't forget to use -march and -mtune if using gcc/clang) to check which implementation is the fastest on your machine and by how much.
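A minimal sketch of such an A/B test, with a hypothetical my_uint64 struct standing in for the hand-made class (compile with e.g. -O2 -march=native; be aware the compiler may still optimize away work whose result it can prove unused):

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    // Hypothetical stand-in for the hand-made class; replace with your own.
    struct my_uint64 { uint64_t v; };
    inline my_uint64 operator*(my_uint64 a, my_uint64 b) { return { a.v * b.v }; }

    template <typename T>
    double time_multiplies(T x, T y, long iterations, T& out) {
        auto start = std::chrono::steady_clock::now();
        T acc = x;
        for (long i = 0; i < iterations; ++i)
            acc = acc * y;                    // the operation under test
        auto stop = std::chrono::steady_clock::now();
        out = acc;                            // keep the result observable
        return std::chrono::duration<double>(stop - start).count();
    }

    int main() {
        const long n = 100000000;
        uint64_t r1 = 0;
        my_uint64 r2{0};
        std::cout << "native:    " << time_multiplies<uint64_t>(3, 7, n, r1) << " s\n";
        std::cout << "hand-made: " << time_multiplies<my_uint64>({3}, {7}, n, r2) << " s\n";
        std::cout << r1 << " " << r2.v << "\n"; // prevent dead-code elimination
    }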
I want to know if someone has experience with C++ programs that are built on one x86 system and then, after being released for other x86 systems (i.e. different processor hardware, e.g. AMD vs. Intel), produce results that differ.
So the only thing that changed is the hardware.
The two things I have in mind are:
The IEEE floating point standard (I don't know how strictly the processor manufacturers comply with it)
(Especially for iterative solvers, like FEM solvers, where each result is based on the previous one, so small differences could accumulate over, e.g., 10,000 iterations.)
Multi-threading
I have heard such things several times now. I'm just interested in whether there are any proven facts related to this topic.
There's always the Pentium FDIV bug for starters, although some compilers can take this into account.
Some compilers can generate code that takes advantage of SIMD instructions, so the SIMD and non-SIMD versions could give different results when doing floating point.
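A hedged illustration of the effect (the exact outcome depends entirely on your compiler and flags, e.g. whether it vectorizes or reorders under -ffast-math): floating point addition is not associative, so evaluating the same sum with a different grouping, which is what a reordered or vectorized reduction effectively does, can change the result.

    #include <cstdio>

    int main() {
        // The same three values summed with two different groupings.
        // A vectorized or reordered reduction effectively changes the
        // grouping, which is enough to change the result.
        double a = 1e16, b = -1e16, c = 1.0;
        double left  = (a + b) + c;   // 1.0
        double right = a + (b + c);   // 0.0 -- c is absorbed into b first
        std::printf("%.17g vs %.17g\n", left, right);
    }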
There are also some instructions that behave differently on different CPUs, pushf/popf for example.
So, yes, programs can behave differently on different hardware (it is, after all, how programs that identify CPUs work).
I'm not going to discuss your first (1) case, but I'm pretty sure there are a lot of instructions (usually internal/undocumented) that behave in different ways.
But concerning multithreading (2): I can tell you that even the same processor running the same process can produce different results. In general, multithreading (especially on hyper-threaded and multicore processors) is a random matter. It depends on a lot of factors - not only the manufacturer, but also the load on the process, the kind of DMA controller... There are even special threading techniques (random thread boost) that invoke a random generator to improve the responsiveness of the multithreading engine.
It's called a race condition.
( http://en.wikipedia.org/wiki/Race_condition )
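A minimal sketch of such a race (illustrative only): two threads updating a shared counter without synchronization lose a different number of increments on every run and on every machine.

    #include <iostream>
    #include <thread>

    int counter = 0;  // shared, unsynchronized -- this is the bug

    void work() {
        for (int i = 0; i < 1000000; ++i)
            ++counter;  // read-modify-write race between the two threads
    }

    int main() {
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // Expected 2000000; the actual value differs between runs and
        // between machines because increments are lost in the race.
        std::cout << counter << "\n";
    }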
How do I measure FLOPS or IOPS? If I measure the time for ordinary floating point additions/multiplications, is that equivalent to FLOPS?
FLOPS is floating point operations per second. To measure FLOPS you first need code that performs such operations. If you have such code, what you can measure is its execution time. You also need to sum up or estimate (not measure!) all floating point operations and divide that by the measured wall time. You should count all ordinary operations like additions, subtractions, multiplications, and divisions (yes, even though they are slower and better avoided, they are still FLOPs). Be careful how you count: what you see in your source code is most likely not what the compiler produces after all the optimisations. To be sure, you will likely have to look at the assembly.
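A rough sketch of that procedure (the 2-FLOPs-per-iteration count is estimated from the source, which is only valid if you have checked that the compiler did not transform the loop):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 10000000;
        std::vector<double> x(n, 1.0), y(n, 2.0);
        double acc = 0.0;

        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            acc += x[i] * y[i];           // 1 multiply + 1 add = 2 FLOPs
        auto stop = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(stop - start).count();
        double flops = 2.0 * n / seconds; // estimated, not measured, FLOP count
        std::printf("acc=%g  ~%.3g FLOPS\n", acc, flops);
    }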
FLOPs are not the same as instructions per second. So even though some architectures have a single MAD (multiply-and-add) instruction, it still counts as two FLOPs. Similarly for SSE instructions: you count each as one instruction, though it performs more than one FLOP.
FLOPS are not entirely meaningless, but you need to be careful when comparing your FLOPS to somebody else's FLOPS, especially the hardware vendors'. E.g. NVIDIA gives the peak FLOPS performance for their cards assuming MAD operations. So unless your code has those, you will not ever get this performance. Either rethink the algorithm, or modify the peak hardware FLOPS by the correct factor, which you need to figure out for your own algorithm! E.g., if your code only performs multiplication, you would divide it by 2. Counting right might take your code from suboptimal to quite efficient without changing a single line of code.
You can use the CPU performance counters to get the CPU itself to count the number of floating point operations it uses for your particular program. Then it is a simple matter of dividing this by the run time. On Linux the perf tools allow this to be done very easily; I have a writeup on the details of this on my blog here:
http://www.bnikolic.co.uk/blog/hpc-howto-measure-flops.html
FLOPs are not well defined. A mul FLOP is different from an add FLOP. You have to either come up with your own definition or take the definition from a well-known benchmark.
Usually you use some well-known benchmark. Things like MIPS and megaFLOPS don't mean much to start with, and if you don't restrict them to specific benchmarks, even that tiny bit of meaning is lost.
Typically, for example, integer speed will be quoted in "Dhrystone MIPS" and floating point in "Linpack megaFLOPS". Here, "Dhrystone" and "Linpack" are the names of the benchmarks used to do the measurements.
IOPS are I/O operations. They're much the same, though in this case, there's not quite as much agreement about which benchmark(s) to use (though SPC-1 seems fairly popular).
This is a highly architecture-specific question. For a naive/basic start, I would recommend finding out how many operations one multiplication takes on your specific hardware, then doing a large matrix multiplication and seeing how long it takes. Then you can easily estimate the FLOPS of your particular hardware.
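A sketch of that estimate, assuming a naive triple loop and the usual 2*N^3 FLOP count for an N x N matrix multiplication:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 512;
        std::vector<double> A(N * N, 1.0), B(N * N, 2.0), C(N * N, 0.0);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = 0; k < N; ++k)
                    C[i * N + j] += A[i * N + k] * B[k * N + j]; // 2 FLOPs
        auto stop = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(stop - start).count();
        double flops = 2.0 * N * N * N / seconds; // standard 2*N^3 estimate
        std::printf("C[0]=%g  ~%.3g FLOPS (naive loop, not peak)\n", C[0], flops);
    }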
The industry standard for measuring FLOPS is the well-known Linpack benchmark, or HPL (High-Performance Linpack); try looking at the source or running it yourself.
I would also refer to this answer as an excellent reference.
I'm working on a physics engine and feel it would help to have a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be a check on whether they're near each other before going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take?
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =
You'll have to do profiling.
Basic operations, like additions or multiplications, should only take one asm instruction.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Inequality checks are also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will result in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only then, if you're not satisfied with the performance, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as #Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables with instruction latencies can be found at http://www.agner.org/optimize/instruction_tables.pdf
1. It depends on the code a lot. Also, don't forget that it doesn't depend only on the computations, but also on how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, and multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it; see the sketch after this list). Library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example, sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozen cycles. Current processors predict conditions. If the prediction is right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have code where a condition is met most of the time, it will be fast. Similarly, if you have code where the condition is not met most of the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
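Here is the sketch referred to in point 2 above (illustrative only): replacing repeated division by a constant with multiplication by its precomputed reciprocal. Note the two versions may differ in the last bit of each result.

    #include <vector>

    // Divide every element by a constant c. Division is much slower than
    // multiplication, so precompute 1/c once and multiply instead.
    void scale_slow(std::vector<float>& v, float c) {
        for (float& x : v)
            x = x / c;          // one division per element
    }

    void scale_fast(std::vector<float>& v, float c) {
        float inv = 1.0f / c;   // one division, done once
        for (float& x : v)
            x = x * inv;        // cheap multiply per element
    }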
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around a lot, or are they static? You will have to run tests. Possible data structures for fast proximity checks are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application; you'd have to check whether the maintenance of the data structure and the lookup cost are OK for you.
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
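A sketch of that kind of quick rejection, using a hypothetical Point type (the full intersection test is left as a placeholder): compare the X and Y ranges first and only fall through to the expensive math when the boxes overlap.

    #include <algorithm>

    struct Point { double x, y; };   // hypothetical minimal point type

    // Cheap early-out: if the bounding boxes of the two segments do not
    // overlap, the segments cannot intersect and the full test is skipped.
    bool boxes_overlap(Point a1, Point a2, Point b1, Point b2) {
        return std::max(a1.x, a2.x) >= std::min(b1.x, b2.x) &&
               std::max(b1.x, b2.x) >= std::min(a1.x, a2.x) &&
               std::max(a1.y, a2.y) >= std::min(b1.y, b2.y) &&
               std::max(b1.y, b2.y) >= std::min(a1.y, a2.y);
    }

    bool segments_intersect(Point a1, Point a2, Point b1, Point b2) {
        if (!boxes_overlap(a1, a2, b1, b2))
            return false;            // rejected without the expensive math
        // ... full orientation-based intersection test goes here ...
        return true;                 // placeholder for the real test
    }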
As far as I know, all "inequality checks" take the same time.
Regarding the rest of the calculations, I would advise you to run some tests like this:
take time stamp A
make 1,000,000 "+" calculations (or any other operation).
take time stamp B
calculate the diff between A and B.
then you can compare the calculations.
Keep in mind:
using a different math library may change the results (some math libraries are more performance-oriented and some more precision-oriented).
compiler optimizations may change it.
each processor does it differently.
I haven't found any resources that exactly answer what I am trying to understand with an issue I saw in a piece of software I am working on, so I'll ask the geniuses here!
For starters, I'm running with VxWorks on a PowerPC processor.
In trying to debug a separate issue, I tried throwing some quick and dirty debug code into an interrupt handling routine. It involved a double precision floating point operation to store a value of interest (namely, how long it had been since I saw the last interrupt come in) which I used later outside the handler in my running thread. I didn't see a problem with this (sure, it takes longer, but time-wise I had plenty; the interrupts aren't coming in too quickly), however VxWorks sure didn't like it. It consistently crashes when it reaches that code - one of the bad crashes that reboots the system. It took me a bit to track down the double operation as the source of the issue, and I realized it's not even double "operations": even returning a constant double from a routine called in the interrupt failed miserably.
On PowerPC (or other architectures in general) are there generally issues doing floating point operations in interrupt handlers and returning floating point (or other type) values in functions called by an interrupt handler? I'm at a loss for why this would cause a program to crash.
(The workaround was to delay the conversion of "ticks" since the last interrupt to "time" since the last interrupt until the code is out of the handler, since it seems to handle long integer operations just fine.)
In VxWorks, each task that utilises floating point has to be specified as such in the task creation so that the FP registers are saved during context switches, but only when switching from tasks that use floating point. This allows non-floating point tasks to have faster context switch times.
When an interrupt pre-empts a floating point task however, it is most likely the case that FP registers are not saved. To do so, the interrupt handler would need to determine what task was pre-empted and whether it had been specified as a floating point task; this would make the interrupt latency both higher and variable, which is generally undesirable in a real-time system.
So to make it work, any interrupt routine using floating point must explicitly save and restore the FP registers itself. Any task that uses floating point must be specified as such in any case, though you can get away with it if you only have one such task.
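For the task side, a minimal sketch of what "specified as such" means, assuming the standard taskSpawn() option VX_FP_TASK (the name, priority and stack size below are placeholders, not recommendations):

    #include <vxWorks.h>
    #include <taskLib.h>   /* taskSpawn(), VX_FP_TASK */

    void fpWork(void);     /* task entry that uses floating point */

    /* Spawn the task with VX_FP_TASK so the kernel saves and restores the
       floating point registers when this task is switched out. */
    int spawnFpTask(void)
    {
        return taskSpawn("tFpWork",        /* task name            */
                         100,              /* priority (example)   */
                         VX_FP_TASK,       /* save/restore FP regs */
                         8192,             /* stack size (example) */
                         (FUNCPTR)fpWork,
                         0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    }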
If a floating-point task is pre-empted, your interrupt will modify floating point register values in use by that task. The result of this when the FP task resumes is non-deterministic, but includes causing a floating point exception - if, for example, a previously non-zero register becomes zero and is subsequently used as the right-hand side of a division operation.
It seems to me however that in this case the floating point operation is probably entirely unnecessary. Your "workaround" is in fact the conventional, safest and most deterministic method, and should probably be regarded as a correction of your design rather than a workaround.
Does your ISR call the fppSave()/fppRestore() functions?
If it doesn't, then the ISR is stomping on FP registers that might be in use by existing tasks.
Specifically, FP registers are used by the C++ compiler on the PPC architecture (I think dealing with throw/catch).
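A minimal sketch of that, using the fppSave()/fppRestore() calls named above and assuming the FP unit is actually enabled in interrupt context (on PPC the MSR[FP] bit may also need re-enabling first, as the answer below explains; that step is omitted here):

    #include <vxWorks.h>
    #include <fppLib.h>    /* fppSave(), fppRestore(), FP_CONTEXT */

    /* Sketch of an ISR that needs floating point: save the FP context of
       whatever task was preempted, do the FP work, then restore it. */
    void myIsr(void)
    {
        FP_CONTEXT fpContext;

        fppSave(&fpContext);       /* preserve the preempted task's FP regs */

        /* ... floating point work goes here ... */

        fppRestore(&fpContext);    /* put the task's FP regs back */
    }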
In VxWorks, at least for the PPC architectures, a floating point operation will cause an FP Unavailable exception. This is because when an interrupt occurs, the FP bit in the MSR is cleared, since VxWorks assumes that there will be no FP operations. This speeds up ISR/task context switching because the FP registers do not have to be saved/restored.
That being said, there was a time when we had some debug code that needed FP operations in the interrupt context. We changed the VxWorks code that calls the specific ISR to set MSR[FP], do an fpsave call, call the ISR, do an fprestore call, then clear MSR[FP]. This got us around the problem.
That being said, I agree with the rest of the folks here that FP operations should not be used in an ISR context, because ISRs should be fast and FP operations typically are not.
I have worked with the e300 core while developing bare-metal applications, and I can say that when an interrupt occurs, the core disables the FPU, which you can observe by checking the FP bit of the MSR. Before doing anything with the floating point registers, you must re-enable the FPU by writing 1 to the FP bit of the MSR. Then you can perform operations on the FPU registers as you want in the ISR.
The general assumption in VxWorks is that Floating Point registers don't need to be saved and restored by ISRs. Primarily because ISRs usually don't mess with them. Historically, most real-time tasks didn't do FP either, but that's obviously changed. What's not obvious is that many tasks that don't explicitly use floating point nevertheless use the floating point registers. I believe that any task with code written in C++ uses the floating point registers (at least on some processors/compilers), even though no floating point operations are obvious. Such tasks should be given the FP_? (I forget the exact spelling) task attribute, causing their FP regs to be saved during context switches.
I think you will find this article interesting. Maybe you are running into a floating point exception.
I never used PowerPC, but I'm good with Google :P