Bitwise operators for normalisation of floating-point numbers and identification of interrupts - bit-manipulation

After having spent a significant amount of time googling this topic, I am still struggling to confidently answer the following two questions:
1. Identify the process using logical operators to normalise a floating-point number.
2. Interrupts from various sources are stored as bits in a binary value. How can logical operations be used to identify whether a specific interrupt has been generated?
When it comes to question 1, let's take the number 17 stored in 8 bits as an example (00010001).
My initial assumption was that we simply shift it to the left by 2 places. But then what happens to the exponent? Do we just use XOR to increase/decrease it by the right amount? I'm not sure how to approach this question.
As for the second one, I'm just lost. At first I thought they might be referring to interrupt masking, but as far as I understand it, that feature is used for temporarily suspending an interrupt, not for identifying it.
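To make my thinking concrete, here is a rough sketch of what I guess is being asked; the bit layout, the "shift until the top bit of the mantissa is 1" convention, and the interrupt bit positions are all assumptions I made up for illustration:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Question 1 (my guess): normalise an 8-bit mantissa by shifting it
        // left until the most significant bit is 1, decrementing the exponent
        // once per shift so the overall value stays the same.
        uint8_t mantissa = 0x11;          // 17 = 00010001
        int exponent = 0;                 // assumed starting exponent
        while ((mantissa & 0x80) == 0) {
            mantissa <<= 1;               // mantissa doubles...
            exponent -= 1;                // ...so halve it back via the exponent
        }
        printf("mantissa=%02X exponent=%d\n", mantissa, exponent);

        // Question 2 (my guess): each interrupt source is one bit in a status
        // register; ANDing with a mask isolates the bit you care about.
        uint8_t interrupt_status = 0x24;  // hypothetical register value
        uint8_t timer_irq_mask   = 0x04;  // hypothetical timer-interrupt bit
        if (interrupt_status & timer_irq_mask) {
            printf("timer interrupt pending\n");
        }
        return 0;
    }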
Could someone please point me in the right direction? Thank you in advance for your time and help.

Related

Where is the integer addition and subtraction event count in Intel VTune?

I am using Intel VTune to profile my program.
The CPU I am using is Ivy Bridge.
All the hardware instruction events can be found here:
https://software.intel.com/en-us/node/589933
FP_COMP_OPS_EXE.X87
Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTS, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s…
FP_COMP_OPS_EXE.X87 seems to include integer multiplication and integer division; however, there is no integer addition or integer subtraction there. I cannot find those two kinds of instructions on the above website either.
Can anyone tell me which event counts integer addition and integer subtraction instructions?
I'm reading a lot into your question, but here goes:
It's possible that if your code is computationally bound you could find ways to infer the significance of integer adds and subs without measuring them directly. For example, UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL would give you a very rough estimate of adds and subs, assuming that you've already done something to establish that your code is compute bound.
Have you? If not, it might help to start with VTune's basic analysis and then eliminate memory, cache and front end bottlenecks. If you've already done this, you have a few more options:
Cross-reference UOPS_DISPATCHED_PORT with an Ivy Bridge block diagram, or even better, a list of which specific types of arithmetic can execute on which ports (which I can't find).
Modify your program source, compiler flags or assembly, rerun a coarser-grained profile like basic analysis, and see whether you see an impact at the level of a measure like INST_RETIRED.ANY / CPU_CLK_UNHALTED.
Sorry there doesn't seem to be a more direct answer.
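To make the rough UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL estimate above concrete, here is a back-of-the-envelope sketch; the counter values are invented for illustration, not taken from a real VTune run:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Hypothetical event counts copied out of a VTune summary by hand.
        uint64_t uops_retired_all = 12500000000ULL; // UOPS_RETIRED.ALL
        uint64_t fp_comp_ops_all  =  3100000000ULL; // sum of FP_COMP_OPS_EXE sub-events

        // Very rough ceiling on "everything that isn't FP", which includes
        // integer add/sub but also loads, stores, branches, address math, etc.
        uint64_t non_fp_uops = uops_retired_all - fp_comp_ops_all;
        printf("non-FP uops (rough ceiling on int add/sub): %llu\n",
               (unsigned long long)non_fp_uops);
        return 0;
    }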

How is FLOPS/IOPS calculated and what is its use?

I have been following some tutorials for OpenCL, and a lot of the time people speak in terms of FLOPS. Wikipedia does explain the formula but does not say what it actually means. For example, 1 light year = 9.4605284 × 10^15 meters, but what it means is the distance traveled by light in a year. Similarly, what does FLOP mean?
An answer to a similar question says 100 IOPS for the code
for(int i = 0; i < 100; ++i)
Ignoring the initialisation, I see 100 increment operations, so that's 100 IOPS. But I also see 100 comparison operations, so why isn't it 200 IOPS? What types of operations are included in a FLOPS/IOPS calculation?
Secondly, I want to know what you would actually do with the FLOPS value you calculate for your algorithm. I am asking because the value is specific to a CPU's clock speed and number of cores.
Any guidance in this arena would be very helpful.
"FLOPS" stands for "Floating Point Operations Per Second" and it's exactly that. It's used as a measure of the computing speed of large, number based (usually scientific) operations. Measuring it is a matter of knowing two things:
1.) The precise execution time of your algorithm
2.) The precise number of floating point operations involved in your algorithm
You can get pretty good approximations of the first one from profiling tools, and the second one from...well you might be on your own there. You can look through the source for floating point operations like "1.0 + 2.0" or look at the generated assembly code, but those can both be misleading. There's probably a debugger out there that will give you FLOPS directly.
It's important to understand that there is a theoretical maximum FLOPS value for the system you're running on, and then there is the actual achieved FLOPS of your algorithm. The ratio of these two can give you a sense of the efficiency of your algorithm. Hope this helps.
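As a concrete (entirely made-up) illustration of that ratio, assume a hypothetical 4-core CPU at 3 GHz that can execute 8 double-precision floating-point operations per core per cycle:

    #include <cstdio>

    int main() {
        // Hypothetical machine: the numbers below are assumptions, not a real spec.
        double cores = 4, ghz = 3.0, flops_per_cycle = 8;
        double peak_gflops = cores * ghz * flops_per_cycle;   // theoretical maximum

        // Hypothetical measurement of your algorithm.
        double fp_ops_done = 2.0e10;   // counted floating-point operations
        double seconds = 0.5;          // measured wall-clock execution time
        double achieved_gflops = fp_ops_done / seconds / 1e9;

        printf("peak: %.1f GFLOPS, achieved: %.1f GFLOPS, efficiency: %.0f%%\n",
               peak_gflops, achieved_gflops,
               100.0 * achieved_gflops / peak_gflops);
        return 0;
    }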

Convert all doubles to integers for better performance, is it just a rumor?

I have a very complicated and sophisticated data fitting program which uses the Levenberg-Marquardt algorithm to do fitting in double precision (basically the fitting class is templatized, but I instantiate it with doubles). The fitting process involves:
Calculating an error function (chi-square)
Solving a system of linear equations (I use LAPACK for that)
Calculating the derivatives of a function with respect to the parameters, which I want to fit to the data (usually 20+ parameters)
Calculating the function value continuously: the function is a complicated combination of sinusoidal and exponential functions with a few harmonics.
A colleague of mine has suggested that I use integers instead, claiming it would be at least 10 times faster. My questions are:
Is it true that I will get that kind of improvement?
Is it safe to convert everything to integers? And what are the drawbacks to this?
What advice would you have for this whole issue? What would you do?
The program is developed to calculate some parameters from the signal online, which means that the program must be as fast as possible, but I'm wondering whether it's worth it to start the project of converting everything to integers.
The amount of improvement depends on your platform. For example, if your platform has a fast floating point coprocessor, performing arithmetic in floating point may be faster than integral arithmetic.
You may be able to get more performance gain by optimizing your algorithms rather than switching to integer arithmetic.
Another method for boosting performance is to reduce data cache misses, and also to reduce branches and loops.
I would measure the performance of the program to find out where the bottlenecks are and then review the sections where most of the time is spent. For example, in my embedded system, micro-optimizations like what you are suggesting saved 3 microseconds. That gain is not worth the effort to retest the entire system. If it works, don't fix it. Concentrate on correctness and robustness first.
The bottom line here is that you have to test it and decide for yourself. Profile a release build using real data.
1- Is it true that I will get that kind of improvement?
Maybe yes, maybe no. It depends on a number of factors, such as
How long it takes to convert from double to int
How big a word is on your machine
What platform/toolset you're using and what optimizations you have enabled
(Maybe) how big a cache line is on your platform
How fast your memory is
How fast your platform computes floating-point versus integer.
And who knows what else. In short, too many complex variables for anyone to be able to say for sure if you will or will not improve performance.
But I would be highly skeptical of your friend's claim of "at least 10 times faster."
2- Is it safe to convert everything to integers? And what are the
drawbacks to this?
It depends on what you're converting and how. Obviously converting a value like 123.456 to an integer is decidedly unsafe.
Drawbacks include loss of precision, loss of accuracy, and the expense in terms of space and time to actually do the conversions. Another significant drawback is the fact that you have to write a substantial amount of code, and every line of code you write is a probable source of new bugs.
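A tiny sketch of the precision-loss drawback (the values here are picked arbitrarily):

    #include <cstdio>

    int main() {
        double d = 123.456;
        int i = static_cast<int>(d);      // truncates: the fractional part is gone
        printf("%f -> %d\n", d, i);       // prints 123.456000 -> 123

        // The error compounds in real arithmetic: a chi-square style squared
        // residual computed on truncated values drifts from the double result.
        double residual = 0.4;                              // hypothetical residual
        double chi2_double = residual * residual;           // 0.16
        int    chi2_int    = (int)residual * (int)residual; // 0 * 0 = 0
        printf("double: %f, integer: %d\n", chi2_double, chi2_int);
        return 0;
    }

(People who really do go integer usually use scaled fixed-point rather than raw truncation, which mitigates this but does not remove it.)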
3- What advice would you have for this whole issue? What would you do?
I would step back & take a deep breath. Profile your code under real-world conditions. Identify the sources of the bottlenecks. Find out what the real problems are, and if there even are any.
Identify inefficiencies in your algorithms, and fix them.
Throw hardware at the problem.
Then you can endeavor to start micro-optimizing. This would be my last resort, especially if the optimization technique you are considering would require writing a lot of code.
First, this reeks of attempting to optimize unnecessarily.
Second, doubles are a minimum of 64 bits. ints on most systems are 32 bits. So you have a few choices: truncate the double (which roughly cuts your precision to that of a single), store it in the space of 2 integers, or store it as an unsigned long long (which is at least 64 bits as well). For the first two options, you face a performance hit because you must convert the numbers back and forth between the doubles you are operating on and the integers you are storing them as. For the third option, you gain nothing (in terms of memory usage) as they are basically the same size - so you'd just be converting them to integers for no reason.
So, to get to your questions:
1) Doubtful, but you can try it to see for yourself.
2) The problem isn't storage as the bits are just bits when they get into memory. The problem is the arithmetic. Since you stated you need double precision, attempting to do those operations on an integer type will not give you the results you are looking for.
3) Don't optimize until it has been proven something needs to have a performance improvement. And always remember Amdahl's Law: Make the common case fast and the rare case correct.
What I would do is:
First tune it in single-thread mode (by the random-pausing method) until you can't find any way to reduce cycles. The kinds of things I've found are:
a large fraction of time spent in library functions like sin, cos, exp, and log, where the arguments were often unchanged from call to call, so the answers would be the same. The solution for that is called "memoizing": you figure out a place to store old values of arguments and results, and check there first before calling the function (there is a small sketch of this after these two examples).
In calling library functions like DGEMM (LAPACK matrix-multiply) that one would assume are optimized to the teeth, they are actually spending a large fraction of time calling a function to determine whether the matrices are upper or lower triangular, square, symmetric, or whatever, rather than actually doing the multiplication. If so, the answer is obvious - write a special routine just for your situation.
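A minimal sketch of the memoizing idea from the first example above; the one-entry cache and the choice of exp are just for illustration:

    #include <cmath>
    #include <cstdio>

    // Remember the last argument/result pair; good enough when the same
    // argument tends to repeat on consecutive calls.
    double memo_exp(double x) {
        static double last_x = 0.0;
        static double last_result = std::exp(0.0);  // cache primed for x = 0
        if (x != last_x) {
            last_x = x;
            last_result = std::exp(x);              // only recompute on a miss
        }
        return last_result;
    }

    int main() {
        // Hypothetical inner loop where the exponential's argument rarely changes.
        double sum = 0.0;
        for (int i = 0; i < 1000000; ++i) {
            double arg = (i < 500000) ? 1.5 : 2.5;  // argument changes only once
            sum += memo_exp(arg);
        }
        printf("%f\n", sum);
        return 0;
    }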
Don't say "but I don't have those problems". Of course - you probably have different problems - but the process of finding them is the same.
Once you've made it as fast as possible in single-thread, then figure out how to parallelize it. Multi-threading can have high overhead, so it's best not to tightly-couple the threads.
Regarding your question about converting from doubles to integers, the other answers are right on the money. It only makes sense in very particular situations.

Calculating FLOPS (Floating-point Operations per Seconds)

How can I calculate FLOPS of my application?
If I have the total number of executed instructions, I can divide it by the execution time. But how do I count the number of executed instructions?
My question is general, and an answer for any language is highly appreciated. But I am looking for a solution for my application, which is developed in C/C++ and CUDA.
I do not know whether the tags are proper, please correct me if I am wrong.
What I do if the number of floating point operations is not easily modeled is to produce two executables: One that is the production version and gives me the execution time, and an instrumented one that counts all floating point operations while performing them (surely that will be slow, but that doesn't matter for our purpose). Then I can compute the FLOP/s value by dividing the number of floating point ops from the second executable by the time from the first one.
This could probably even be automated, but I haven't had a need for this so far.
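One way to hand-roll that instrumented build is a counting macro that compiles away in the production version; the macro and counter names here are made up for the sketch:

    #include <cstdint>
    #include <cstdio>

    static uint64_t g_flop_count = 0;     // global tally, used by the instrumented build

    #ifdef COUNT_FLOPS
      #define FLOPS(n) (g_flop_count += (n))
    #else
      #define FLOPS(n) ((void)0)          // production build: counting compiled away
    #endif

    double dot(const double* a, const double* b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            s += a[i] * b[i];
            FLOPS(2);                     // one multiply + one add per iteration
        }
        return s;
    }

    int main() {
        double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
        printf("dot = %f\n", dot(a, b, 4));
    #ifdef COUNT_FLOPS
        printf("flops counted = %llu\n", (unsigned long long)g_flop_count);
    #endif
        return 0;
    }

Build once without the define to get the timing, once with -DCOUNT_FLOPS to get the operation count, and divide one by the other.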
You should mathematically model what's done with your data. Isolate one loop iteration. Then count all simple floating-point additions, multiplications, divisions, etc. For example,
y = x * 2 * (y + z*w) is 4 floating-point operations. Multiply the resulting number by the number of iterations. The result will be the number of floating-point operations you're searching for.
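Putting that model together with a timer gives the FLOPS figure directly; the iteration count below is arbitrary and the expression is just the example above, timed with std::chrono:

    #include <chrono>
    #include <cstdio>

    int main() {
        const long iterations = 100000000L;       // hypothetical loop count
        double x = 0.2, y = 2.2, z = 3.3, w = 4.4;

        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iterations; ++i) {
            y = x * 2 * (y + z * w);              // 4 floating-point ops per iteration
        }
        auto t1 = std::chrono::steady_clock::now();
        double seconds = std::chrono::duration<double>(t1 - t0).count();

        double flops = 4.0 * iterations / seconds;
        // Print y so the loop isn't optimized away entirely.
        printf("y=%g  ~%.2f GFLOPS\n", y, flops / 1e9);
        return 0;
    }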

How to Determine Which CRC to Use?

If I have a certain number of bytes to transfer serially, how do I determine which CRC to use (CRC8, CRC16, etc. - basically, how many bits of CRC?) and still have a high error-detection percentage? Is there a formula for this?
From the standpoint of the length of the CRC, normal statistics apply. For a CRC of bit-width n, you have a 1/(2^n) chance of an error going undetected (a false positive). So for an 8-bit CRC, you have a 1/256 chance, etc.
However, the polynomial chosen also makes a big impact. The math is highly dependent on the data being transferred and isn't an easy answer.
You should evaluate more than just CRC depending on your communication mechanism (FEC with systems such as turbo codes is very useful and common).
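For reference, here is a generic bit-by-bit CRC-8 of the kind being discussed; the polynomial 0x07 and zero initial value are just one common parameter set, not a recommendation for your particular link:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // MSB-first CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), initial value 0.
    uint8_t crc8(const uint8_t* data, size_t len) {
        uint8_t crc = 0x00;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];                       // fold the next byte in
            for (int bit = 0; bit < 8; ++bit) {
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                                   : (uint8_t)(crc << 1);
            }
        }
        return crc;
    }

    int main() {
        uint8_t msg[] = {0x01, 0x02, 0x03, 0x04};
        printf("CRC-8 = 0x%02X\n", crc8(msg, sizeof msg));  // append this byte to the frame
        return 0;
    }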
To answer this question, you need to know the bit error rate of your channel which can only be determined empirically. And then once you have the measured BER, you have to decide what detection rate is "high" enough for your purposes.
Sending each message, for example, 5 times will give you pretty good detection even on a very noisy channel, but it does crimp your throughput a bit. However, if you are sending commands to a deep-space probe you may need that redundancy.