Assuming x is an 8-bit unsigned integer, what is the most efficient operation to set the last two bits to 01?
So regardless of the initial value, the final state should be x = ******01.
In order to set
the last bit to 1, one can use OR, as in x |= 0b00000001, and
the second-to-last bit to 0, one can use AND, as in x &= 0b11111101, which is x &= ~(1<<1).
Is there an arithmetic or logical operation that can be used to apply both operations at the same time?
Can this be answered independently of any implementation-specific details, in terms of pure logical operations?
Of course you can put both operations into a single statement.
x = (x & 0b11111100) | 1;
That at least saves one assignment and one memory read/write. However, the compiler will most likely optimize this anyway, even if it is written as two statements.
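For reference, here is a sketch of that two-statement version (the helper name is mine, not from the answer; binary literals are standard in C23 and a common extension elsewhere). An optimizer will typically fold it into the same single AND/OR sequence:
#include <stdint.h>

/* set_low_bits_01 is a hypothetical helper name */
uint8_t set_low_bits_01(uint8_t x) {
    x |= 0b00000001;  /* set the last bit to 1        */
    x &= 0b11111101;  /* clear the second-to-last bit */
    return x;         /* result: ******01             */
}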
Depending on the target CPU, the compiler may even turn this into bit-manipulation instructions that can set or clear single bits directly. And if the variable is heavily used locally, it is most likely kept in a register.
So in the end the generated code may look as simple as this (pseudo asm):
load x
setBit 0
clearBit 1
store x
However, it may also be compiled into something like:
load x to R1
load immediate 0b11111100 to R2
and R1, R2
load immediate 1 to R2
or R1, R2
store R1 to x
For playing with things like this you can take a look at Compiler Explorer: https://godbolt.org/z/sMhG3YTx9
Try removing the -O2 compiler option and see the difference between optimized and non-optimized code. You can also switch between different CPU architectures and compilers.
This is not possible with a dyadic (two-operand) bitwise operator, because the second operand would have to distinguish between "reset to 0", "set to 1", and "leave unchanged", and that cannot be encoded in a single binary mask.
Instead, you can use the bitwise ternary operator and form the expression
x = 0b11111100 ??? x : 0b00000001.
A word of caution, though: this operator does not exist.
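For completeness, the effect of that hypothetical operator can be emulated with two real ones. A minimal sketch (the helper name bit_mux is mine): where a mask bit is 1 the result bit comes from the first value, where it is 0 it comes from the second, so for this mask it reduces to x = (x & 0b11111100) | 0b00000001.
#include <stdint.h>

/* Bitwise multiplexer: mask selects between a (mask bit 1) and b (mask bit 0). */
static uint8_t bit_mux(uint8_t mask, uint8_t a, uint8_t b) {
    return (mask & a) | (~mask & b);
}

/* usage: x = bit_mux(0b11111100, x, 0b00000001); */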
Related
I want to know whether the compiled code of an int-to-bool conversion contains a branch (jump) operation.
For example, given void func(bool b) and int i:
Is the compiled code of calling func(i) equivalent to the compiled code of func(i? 1:0)?
Or is there a more elaborate way for the compiler to perform this without the branch operation?
Update:
In other words, what code does the compiler generate in order to push 1 or 0 onto the stack before jumping to the function's address?
I assume that it really comes down to the architecture of the CPU at hand, and that some specific processors (certain DSPs, for example) may support this. So my question refers to "conventional" general-purpose CPUs (assuming that this definition is acceptable).
In terms of pure software, the question can also be phrased as: is there an efficient way to convert an integer value to 1 when it is non-zero, and to 0 otherwise, without using a conditional statement?
Thanks
It's not your job (as the compiler's user) to make built-in type conversions efficient. If the compiler is not dumb, it will map such things as closely as possible to the CPU representation.
On most commercial CPUs, bool and int are the exact same thing, and if(x) { ... }
translates into bit-ANDing (or bit-ORing, whichever is faster: they are normally immediate instructions) x with itself, followed by a conditional jump past the } if the zero flag is set. (Note that the AND is just a trick to force the zero-flag computation; the flag itself is an immediate consequence of the arithmetic unit's electronics.)
Variants are much more a matter of CPU electronics than of code, so don't worry about them. ifs are not triggered by a bool, but by the result of the last arithmetic operation.
Every arithmetic operation a CPU performs produces a result and sets some flags that represent certain attributes of that result: whether it is zero, whether it produced a carry or borrow, whether it has an odd or even number of bits set to 1, etc. Result and flags are two registers, and can be loaded and stored from/to memory.
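As a concrete illustration (a sketch of mine, not part of the answer above): compilers typically turn the comparison below into a flag check with no branch at all, e.g. TEST followed by SETNE on x86.
/* Normalize any int to 0 or 1 without a conditional statement. */
int to_bool(int i) {
    return i != 0;   /* equivalently: !!i */
}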
Does using bitwise operations in normal flow or in conditional statements like for, if, and so on increase overall performance, and would it be better to use them where possible? For example:
if(i++ & 1) {
}
vs.
if(i % 2) {
}
Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).
In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
TL;DR Write for semantics first, optimize measured hot-spots second.
At the CPU level, integer modulus and divisions are among the slowest operations. But you are not writing at the CPU level, instead you write in C++, which your compiler translates to an Intermediate Representation, which finally is translated into assembly according to the model of CPU for which you are compiling.
In this process, the compiler will apply peephole optimizations, among which are strength reduction optimizations such as these (courtesy of Wikipedia):
Original calculation      Replacement calculation
y = x / 8                 y = x >> 3
y = x * 64                y = x << 6
y = x * 2                 y = x << 1
y = x * 15                y = (x << 4) - x
The last example is perhaps the most interesting one. While multiplying or dividing by powers of 2 is easily converted (manually) into bit-shift operations, the compiler is generally taught to perform even smarter transformations that you would probably not think of on your own and that are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
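A quick self-contained check of that last identity (my sketch, not from the answer): since 16x - x = 15x, the two forms agree.
#include <assert.h>

int main(void) {
    /* (x << 4) - x == 16*x - x == 15*x */
    for (int x = 0; x < 1000; ++x)
        assert((x << 4) - x == x * 15);
    return 0;
}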
This is obviously CPU dependent, but you can expect that bitwise operations will never take more CPU cycles to complete, and will typically take fewer. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines, a specific instruction completing earlier doesn't mean your program necessarily runs faster.
Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).
For whatever it's worth, on x86 specifically, when the divisor is a runtime-variable value and so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs, the time taken by / and % operations in CPU cycles can be looked up in Agner Fog's instruction tables. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs, if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data, DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)
So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.
For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.
Unfortunately, integer division is slightly irregular on signed values, as it rounds towards zero, and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division by a simple bit operation; instead it may either call a routine for integer division, or replace it with bit operations plus additional logic to handle the irregularity. This may depend on the optimization level and on which of the operands are constants.
This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.
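To make the irregularity concrete, a small sketch (assuming a typical two's-complement target, where >> on negative values is an arithmetic shift):
#include <stdio.h>

int main(void) {
    int a = -7;
    printf("%d\n", a / 2);   /* -3: division truncates toward zero      */
    printf("%d\n", a >> 1);  /* -4: arithmetic shift rounds down        */
    printf("%d\n", a % 2);   /* -1: remainder takes the dividend's sign */
    printf("%d\n", a & 1);   /*  1: just the low bit                    */
    return 0;
}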
C operators cannot be meaningfully compared in terms of "performance". There's no such thing as "faster" or "slower" operators at the language level. Only the resulting compiled machine code can be analyzed for performance. In your specific example the resulting machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.
Here is the optimized code generated by the compiler (GCC 4.6, -O3) for both options:
int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;
Generated code for opt1:
l %r1,520(%r11)
nilf %r1,1
st %r1,516(%r11)
asi 520(%r11),1
Generated code for opt2:
l %r1,520(%r11)
nilf %r1,2147483649
ltr %r1,%r1
jhe .L14
ahi %r1,-1
oilf %r1,4294967294
ahi %r1,1
.L14: st %r1,512(%r11)
So 4 extra instructions... which are nothing in a production environment. This would be a premature optimization and would just introduce complexity.
Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:
int modulus (int a, int b) {
return a % b;
}
and a program calls modulus(135, 16), nowhere in the compiled code will there be any trace of bitwise magic. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved, and it leaves it as such.
But you may know that only powers of two will be passed to the function. In that case, the only way to optimize your code is to rewrite the function as
unsigned int modulus_2 (unsigned int a, unsigned int b) {
return a & (b - 1);
}
The compiler cannot do that for you.
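A brief usage check (example values mine): with a power-of-two second argument the two functions agree.
#include <assert.h>

/* assumes the modulus and modulus_2 functions defined above */
int main(void) {
    assert(modulus(135, 16) == 7);     /* computed with a division */
    assert(modulus_2(135, 16) == 7);   /* computed as 135 & 15     */
    return 0;
}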
Bitwise operations are much faster.
This is why the compiler will use bitwise operations for you.
Actually, I think it will be faster to implement it as:
~i & 1
Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x=0. But (I hope) you are not going to use this in your C++ code.
In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.
I have a problem accessing the 32 most significant and the 32 least significant bits in Verilog. I have written the following code, but I get the error "Illegal part-select expression". The point here is that I don't have access to a 64-bit register. Could you please help?
`MLT: begin
    if (multState==0) begin
        {C,Res}<={A*B}[31:0];
        multState=1;
    end
    else begin
        {C,Res}<={A*B}[63:32];
        multState=2;
    end
end
Unfortunately, the bit-select and part-select features of Verilog are part of expression operands. They are not Verilog operators (see Sec. 5.2.1 of the Verilog 2005 standard, IEEE Std 1364-2005) and can therefore not be applied to arbitrary expressions, but only directly to registers or wires.
There are various ways to do what you want but I would recommend using a temporary 64 bit variable:
wire [31:0] A, B;
reg [63:0] tmp;
reg [31:0] ab_lsb, ab_msb;
always @(posedge clk) begin
tmp = A*B;
ab_lsb <= tmp[31:0];
ab_msb <= tmp[63:32];
end
(The assignments to ab_lsb and ab_msb could be conditional. Otherwise a simple "{ab_msb, ab_lsb} <= A*B;" would do the trick as well of course.)
Note that I'm using a blocking assignment to assign 'tmp', as I need its value in the following two lines. This also means that it is unsafe to access 'tmp' from outside this always block.
Also note that the concatenation hack {A*B} is not needed here, as A*B is assigned to a 64 bit register. This also fits the recommendation in Sec 5.4.1 of IEEE Std 1364-2005:
Multiplication may be performed without losing any overflow bits by assigning the result
to something wide enough to hold it.
However, you said: "The point here is that I don't have access to a 64 bit register."
So I will describe a solution that does not use any 64-bit Verilog registers. This will, however, not have any impact on the resulting hardware; it will only look different in the Verilog code.
The idea is to access the MSB bits by shifting the result of A*B. The following naive version of this will not work:
ab_msb <= (A*B) >> 32; // Don't do this -- it won't work!
The reason why this does not work is that the width of A*B is determined by the left-hand side of the assignment, which is 32 bits. Therefore the result of A*B will only contain the lower 32 bits of the result.
One way of making the bit width of an operation self-determined is by using the concatenation operator:
ab_msb <= {A*B} >> 32; // Don't do this -- it still won't work!
Now the result width of the multiplication is determined using the maximum width of its operands. Unfortunately, both operands are 32 bits wide, so we still have a 32-bit multiplication. We therefore need to extend one operand to 64 bits, e.g. by prepending zeros (I assume unsigned operands):
ab_msb <= {{32'd0, A}*B} >> 32;
Accessing the LSB bits is easy, as this is the default behavior anyway:
ab_lsb <= A*B;
So we end up with the following alternative code:
wire [31:0] A, B;
reg [31:0] ab_lsb, ab_msb;
always @(posedge clk) begin
ab_lsb <= A*B;
ab_msb <= {{32'd0, A}*B} >> 32;
end
Xilinx XST 14.2 generates the same RTL netlist for both versions. I strongly recommend the first version, as it is much easier to read and understand. If only 'ab_lsb' or 'ab_msb' is used, the synthesis tool will automatically discard the unused bits of 'tmp', so there is really no difference.
If this is not the information you were looking for, you should probably clarify why and how you "don't have access to 64 bit registers". After all, your own code also tries to access bits [63:32] of a 64-bit value. And since you can't calculate the upper 32 bits of the product A*B without also performing almost all of the calculations required for the lower 32 bits, you might be asking for something that is not possible.
You are mixing blocking and non-blocking assignments here:
{C,Res}<={A*B}[63:32]; //< non-blocking
multState=2; //< blocking
this is considered bad practice.
I'm not sure that a concatenation of just {A*B} is valid; at best it does nothing.
The way you have encoded it, it looks like you will end up with two hardware multipliers. What makes you say you do not have a 64-bit reg available? reg does not have to imply flip-flops: if you have two 32-bit regs, then you could have one 64-bit one. I would personally do the multiply on one line, then split the result up and output it as two 32-bit sections.
However:
x <= (a*b)[31:0] is unfortunately not allowed. If x is 32 bits wide it will take the LSBs anyway, so all you need is:
x <= (a*b);
To take the MSBs you could try:
reg [31:0] throw_away;
{x, throw_away} <= (a*b) ;
I have this line of code:
base_num = (arr[j]/base)%256;
This line runs in a loop, and the / and % operations take a lot of resources and time to perform. I would like to change this line and apply bit operations in order to maximize the program's performance. How can I do that?
Thanks.
If base is the nth power of two, you can replace the division by it with a right shift by n. Then, since taking an integer mod 256 is equivalent to taking its last 8 bits, you can AND the result with 0xFF. Alternatively, you can reverse the operations: first AND with (0xFF << n), i.e. 255*base, and then shift right by n.
base_num = arr[j] >> n;
base_num &= 0xFF;
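These two steps can equally be written as a single expression (using the question's variable names, and assuming base == 1 << n with an unsigned arr[j]):
base_num = (arr[j] >> n) & 0xFF;  /* (arr[j]/base) % 256 in one step */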
Of course, any half-decent compiler should be able to do this for you.
Add -O1 or greater to your compiler options and the compiler will do it for you.
In gcc, -O1 turns on -ftree-slsr which is, according to the docs,
Perform straight-line strength reduction on trees. This recognizes related expressions involving multiplications and replaces them by less expensive calculations when possible.
This will replace the modulo, and the division as well if base is constant. However, if you know that base will be some non-constant power of two, you can refactor the surrounding code to give you the log2 of that number, and shift right by that amount.
You could also just declare base_num as an 8-bit integer:
#include <stdint.h>
uint8_t base_num;
uint16_t crap;
crap = 0xFF00;
base_num = crap;
If your compiler is standards compliant, it will put the value of byte(0xFF00), i.e. 0x00, into base_num.
I have yet to meet a compiler that does saturating arithmetic in plain C (nor in C++ or C#), but if one did, it would compute sat_byte(0xFF00), which, being greater than 0xFF, would put 0xFF into base_num.
Keep in mind that your compiler will warn you about a loss of precision in this instance. Your compiler may even error out (Visual Studio does with Treat Warnings as Errors on). If that happens, you can just do:
base_num = (uint8_t)crap;
but this seems like what you are trying to avoid.
What it seems you are trying to do is remove the modulus operator, since that requires a division, and division is the most costly basic arithmetic operation. I generally would not think of this as a bottleneck in any way, as any "intelligent" compiler (even in debug mode) would "optimize" it to:
base_num = crap & 0xFF;
on any supported platform (every mainstream processor I've heard of: x86, AMD64, ARM, MIPS), which should be all of them. I would be dumbfounded to hear of a processor that lacks basic AND and OR arithmetic instructions.
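A tiny check of the truncation behavior described above (variable names taken from the answer):
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t crap = 0xFF00;
    uint8_t base_num = (uint8_t)crap;  /* the conversion keeps only the low byte */
    printf("0x%02X\n", base_num);      /* prints 0x00 */
    return 0;
}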
I'm talking about this:
If we have the letter 'M', which is 77 in decimal and 4D in hex,
I am looking for the fastest way to get the D.
I thought about two ways:
Given x is a byte.
x <<= 4; x >>= 4
x %= 16
Any other ways? Which one is faster?
Brevity is nice - explanations are better :)
x &= 0x0f
is, of course, the right answer. It exactly expresses the intent of what you're trying to achieve, and on any sane architecture will always compile down to the minimum number of instructions (i.e. 1). Do use hex rather than decimal whenever you put constants in a bit-wise operator.
x <<= 4; x >>= 4
will only work if your 'byte' is a proper unsigned type. If it was actually a signed char then the second operation might cause sign extension (i.e. your original bit 3 would then appear in bits 4-7 too).
Without optimization this will of course take two instructions, but with GCC on OS X, even -O1 will reduce this to the first answer.
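A short demonstration of that sign-extension pitfall (my sketch; the exact results of shifting negative values are implementation-defined, but this is what typical two's-complement targets do):
#include <stdio.h>

int main(void) {
    signed char x = 0x4D;  /* 'M': high nibble 4, low nibble D  */
    x <<= 4;               /* becomes 0xD0, so bit 7 is now set */
    x >>= 4;               /* arithmetic shift: 0xFD, not 0x0D  */
    printf("0x%02X\n", (unsigned char)x);
    return 0;
}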
x %= 16
Even without the optimizer enabled, your compiler will almost certainly do the right thing here and turn that expensive div/mod operation into the first answer. However, it can only do that for powers of two, and this paradigm doesn't make it quite as obvious what you're trying to achieve.
I always use x &= 0x0f
There are many good answers and some of them are technically the right ones.
On a broader scale, one should understand that C/C++ is not an assembler. The programmer's job is to tell the compiler the intention of what is to be achieved. The compiler will pick the best way to do it, depending on the architecture and various optimization flags.
x &= 0x0F; is the clearest way to tell the compiler what you want to achieve. If shifting up and down is faster on some architecture, it is the compiler's job to know that and do the right thing.
A single AND operation can do it:
x = (x & 0x0F);
It will depend on the architecture to some extent; shifting up and back down on an ARM is probably the fastest way. However, the compiler should do that for you. In fact, all of the suggested methods will probably be optimized to the same code by the compiler.
x = x & 15