LLVM constant in DAG changes when lowering

%bf.clear = and i8 %bf.load, -32
-view-dag-combine1-dags
-view-legalize-dags
How can I prevent this change? (-32 is correct.)

It's a perfectly legal transformation. If i8 is not a legal type for your target, the legalizer promotes i8 to i32. The -32 constant is legalized to 224 because both have the same low 8 bits (0xE0); the upper bits of the promoted value do not matter and do not influence the result.
So, if this is a problem, then I'd suppose that there is some problem elsewhere.
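A minimal C++ sketch of why the rewrite is harmless, assuming the promotion from i8 to i32 described above. The function names are hypothetical; the point is only that -32 and 224 share the bit pattern 0xE0, so the low 8 bits of the result are the same whatever sits in the upper 24 bits.

```cpp
#include <cassert>
#include <cstdint>

// The 8-bit operation from the question: and i8 %x, -32.
constexpr std::uint8_t and_i8(std::uint8_t x) {
    return static_cast<std::uint8_t>(x & static_cast<std::uint8_t>(-32));
}

// The same operation done in 32 bits with the constant 224, then truncated
// back to 8 bits. The upper 24 bits are deliberately filled with junk to
// show that they cannot leak into the low byte.
constexpr std::uint8_t and_promoted(std::uint8_t x) {
    return static_cast<std::uint8_t>((0xABCDEF00u | x) & 224u);
}

static_assert(static_cast<std::uint8_t>(-32) == 224, "same bit pattern, 0xE0");

// Compares both forms for every possible 8-bit input.
inline bool promotion_preserves_result() {
    for (unsigned v = 0; v < 256; ++v) {
        const auto x = static_cast<std::uint8_t>(v);
        if (and_i8(x) != and_promoted(x)) return false;
    }
    return true;
}
```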

Related

Checking the top bits of an i64 Value in LLVM IR

I am going to keep this short and to the point, but if further clarifications are needed please let me know.
I have an i64 Value that I want to check the top bits of if they are zeros or not. If they are zeros, I would do something, if they are not, I would do something else. How do I instrument the IR to allow this to happen at runtime?
One thing I found is that LLVM has an intrinsic "llvm.ctlz" that counts the leading zeros and puts them in an i64 Value, but how do I use its return value to do the checking? Or how do I instrument so the checking happens at runtime?
Any help or suggestions would be appreciated. Thanks!

You didn't say how many top bits, so I'll do an example with the top 32 bits. Given i64 %x, I'd check it with %result = icmp uge i64 %x, 4294967296 because 4294967296 is 2^32 and that is the first value which has a 1 bit in the top 32 bits. %result is true when any of the top bits is set and false when they are all zero. If you want to check the top two bits to be zero, use 2^62 (4611686018427387904) instead.
In order to do two different things based on the value of %result, in general you'll want to branch on it. BasicBlock has a method splitBasicBlock that takes an instruction to split at. Use that to split your block into a before and an after. Create new blocks for the true side and the false side, and add a branch on your result to your new blocks: br i1 %result, label %cond_true, label %cond_false. Make sure those two new blocks branch back to the after block.
Depending on what you want to do, you may not need an entire block, for instance if you're only calculating a value and not doing any side-effecting operations you might be able to use a select instruction instead of a branch and separate blocks.
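The comparison itself can be sketched in plain C++, assuming the "top 32 bits" case from the answer. The function names are hypothetical; the ternary in pick plays the role that select would play in the IR.

```cpp
#include <cassert>
#include <cstdint>

// True when any of the top 32 bits of x is set, mirroring
// icmp uge i64 %x, 4294967296.
constexpr bool top32_nonzero(std::uint64_t x) {
    return x >= (std::uint64_t{1} << 32);
}

// Equivalent formulation: shift the top half down and compare with zero.
constexpr bool top32_nonzero_shift(std::uint64_t x) {
    return (x >> 32) != 0;
}

// Chooses between two values without separate blocks, analogous to select.
constexpr std::uint64_t pick(std::uint64_t x,
                             std::uint64_t if_set,
                             std::uint64_t if_clear) {
    return top32_nonzero(x) ? if_set : if_clear;
}
```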

Pre evaluate LLVM IR

Let's suppose we have expressions like:
%rem = srem i32 %i.0, 10
%mul = mul nsw i32 %rem, 2
%i.0 is a llvm::PHINode which I can get the bounds.
The question is: Is there a way to get the value of %mul during compile time? I'm writing a llvm Pass and I need to evaluate some expressions which use %i.0. I'm searching for a function, class or something else which I will give a value to %i.0 and it will evaluate the expression and return the result.

You could clone the code (the containing function or the entire module, depending on how much context you need), then replace %i.0 with a constant value, run the constant propagation pass on the code, and finally check whether %mul is assigned to a constant value and if so, extract it.
It's not elegant, but I think it would work. Just pay attention to:
Make sure %mul is not elided out - for example, return it from the function, or store its value to memory, or something.
Be aware constant propagation assumes some things about the code, in particular that it already passed through mem2reg.
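For intuition only, here is a hand-written C++ model of what folding would produce once %i.0 is replaced by a constant. This is not the clone-and-propagate approach and uses no LLVM API; eval_mul and mul_range are hypothetical helpers, and mul_range simply enumerates the PHI bounds instead of running a pass.

```cpp
#include <cassert>
#include <cstdint>

// Evaluates the two instructions from the question for a concrete %i.0:
//   %rem = srem i32 %i.0, 10
//   %mul = mul nsw i32 %rem, 2
// C++ '%' truncates toward zero, which matches srem.
constexpr std::int32_t eval_mul(std::int32_t i0) {
    return (i0 % 10) * 2;
}

struct Range {
    std::int32_t lo;
    std::int32_t hi;
};

// Smallest and largest %mul for %i.0 in [lo, hi], found by enumeration.
// Assumes lo <= hi; the loop variable is 64-bit so hi == INT32_MAX is safe.
inline Range mul_range(std::int32_t lo, std::int32_t hi) {
    Range r{eval_mul(lo), eval_mul(lo)};
    for (std::int64_t i = lo; i <= hi; ++i) {
        const std::int32_t v = eval_mul(static_cast<std::int32_t>(i));
        if (v < r.lo) r.lo = v;
        if (v > r.hi) r.hi = v;
    }
    return r;
}
```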

Julia/LLVM Efficient Division of Integer Numbers with Integer Result

I ran into a basic type stability issue where dividing two Integers will produce some concrete type of AbstractFloat.
typeof(60 * 5 / 60)
> Float64
Now this is the safe thing to do, but it incurs runtime overhead converting to a float.
What if we know that division will always result in a number with remainder 0, ie. an Integer?
We can use either:
div(60 * 5 , 60)
fld(60 * 5 , 60)
Which gives us some concrete type of Integer, however this approach still has overhead which we can see from the LLVM IR:
@code_llvm div(60 * 5 , 60)
So is there any magic we can do to remove the runtime overhead when we know that the result will not have a remainder?
Possible Solution Paths:
I would prefer this be solved using a Julia construct, even if we need to create it, rather than injecting LLVM IR... But then again, we could just wrap that injection into a Julia function...
Or maybe we need a macro like @inbounds for safe integer division resulting in an integer.
Or maybe there is some purely mathematical way to perform this that applies to any language?

Integer division is one of the slowest cache-independent operations on a CPU; indeed, floating-point division is faster on most CPUs (test it yourself and see). If you know what you'll be dividing by in advance (and want to divide by it many times), it can be worth precomputing factors that allow you to replace integer-division with multiplication/shift/add. There are many sites that describe this basic idea, here's one.
For an implementation in julia, see
https://gist.github.com/simonster/a3b691e71cc2b3826e39
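The precomputation idea can be sketched in C++ for a deliberately small case. This is not the code from the gist; it assumes 16-bit unsigned operands so that everything fits in 64-bit arithmetic, and magic_for and fast_div are hypothetical names. The multiplier is M = ceil(2^32 / d), and the quotient is the top half of M * n.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed multiplier for dividing 16-bit values by d (requires d >= 1).
// Equals ceil(2^32 / d); held in 64 bits so that d == 1 does not overflow.
constexpr std::uint64_t magic_for(std::uint16_t d) {
    return 0xFFFFFFFFull / d + 1;
}

// n / d computed as one multiply and one shift; M must be magic_for(d).
constexpr std::uint16_t fast_div(std::uint16_t n, std::uint64_t M) {
    return static_cast<std::uint16_t>((M * n) >> 32);
}

// Compares fast_div against ordinary division for every 16-bit dividend.
inline bool matches_ordinary_division(std::uint16_t d) {
    const std::uint64_t M = magic_for(d);
    for (std::uint32_t n = 0; n <= 0xFFFFu; ++n) {
        const std::uint32_t q = fast_div(static_cast<std::uint16_t>(n), M);
        if (q != n / d) return false;
    }
    return true;
}
```

The multiplier is only worth computing when the same divisor is reused many times, as the answer says. Wider operands need a wider product than this sketch uses.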

You're right: there is a little bit of overhead in the div function, but it's not because there may be a remainder. It's because div(typemin(Int),-1) is an error, as is div(x, 0). So the overhead you're seeing in @code_llvm is just the checks for those cases. The LLVM instruction that you want is just sdiv i64 %0, %1... and the processor will even throw a SIGFPE in those error conditions. We can use llvmcall to create our own "overhead-free" version:
julia> unsafe_div(x::Int64,y::Int64) = Base.llvmcall("""
%3 = sdiv i64 %0, %1
ret i64 %3""", Int64, Tuple{Int64, Int64}, x, y)
unsafe_div (generic function with 1 method)
julia> unsafe_div(8,3)
2
julia> @code_llvm unsafe_div(8,3)
define i64 @julia_unsafe_div_21585(i64, i64) {
top:
%2 = sdiv i64 %0, %1
ret i64 %2
}
julia> unsafe_div(8,0)
ERROR: DivideError: integer division error
in unsafe_div at none:1
So if that works, why does Julia insist on inserting those checks into the LLVM IR itself? It's because LLVM considers those error cases to be undefined behavior within its optimization passes. So if LLVM can ever prove that it would error through static analysis, it changes its output to skip the division (and subsequent exception) entirely! This custom div function is indeed unsafe:
julia> f() = unsafe_div(8,0)
f (generic function with 2 methods)
julia> f()
13315560704
julia> @code_llvm f()
define i64 @julia_f_21626() {
top:
ret i64 undef
}
On my machine (an old Nehalem i5), this unsafe version can speed div up by about 5-10%, so the overhead here isn't really all that terrible relative to the inherent cost of integer division. As @tholy notes, it's still really slow compared to almost all other CPU operations, so if you're frequently dividing by the same number, you may want to investigate the alternatives in his answer.
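The two guarded cases are the same ones that are undefined for signed division in C++, so the checks can be sketched there as well. This is an illustration of the idea, not Julia's implementation; checked_div is a hypothetical name and the sketch assumes C++17 for std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Returns x / y, or std::nullopt for the two inputs where a raw signed
// division is undefined: y == 0, and INT64_MIN / -1 (the quotient does
// not fit in 64 bits).
inline std::optional<std::int64_t> checked_div(std::int64_t x, std::int64_t y) {
    if (y == 0) return std::nullopt;
    if (x == std::numeric_limits<std::int64_t>::min() && y == -1) {
        return std::nullopt;
    }
    return x / y;
}
```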

How can I find the size of a type?

I'm holding a Type* in my hand. How do I find out its size (the size objects of this type will occupy in memory) in bits / bytes? I see all kinds of methods allowing me to get "primitive" or "scalar" size, but that won't help me with aggregate types...

If you only need the size because you are inserting it into the IR (e.g., so you can send it to a call to malloc()), you can use the getelementptr instruction to do the dirty work (with a little casting), as described here (with updating for modern LLVM):
Though LLVM does not contain a special purpose sizeof/offsetof instruction, the
getelementptr instruction can be used to evaluate these values. The basic idea
is to use getelementptr from the null pointer to compute the value as desired.
Because getelementptr produces the value as a pointer, the result is casted to
an integer before use.
For example, to get the size of some type, %T, we would use something like
this:
%Size = getelementptr %T* null, i32 1
%SizeI = ptrtoint %T* %Size to i32
This code is effectively pretending that there is an array of T elements,
starting at the null pointer. This gets a pointer to the 2nd T element
(element #1) in the array and treats it as an integer. This computes the
size of one T element.
The good thing about doing this is that it is useful in exactly the cases where you do not care what the value is; where you just need to pass the correct value from the IR to something. That's by far the most common case for my need for sizeof()-alike operations in the IR generation.
The page also goes on to describe how to do an offsetof() equivalent:
To get the offset of some field in a structure, a similar trick is used. For
example, to get the address of the 2nd element (element #1) of { i8, i32* }
(which depends on the target alignment requirement for pointers), something
like this should be used:
%Offset = getelementptr {i8,i32*}* null, i32 0, i32 1
%OffsetI = ptrtoint i32** %Offset to i32
This works the same way as the sizeof trick: we pretend there is an instance of
the type at the null pointer and get the address of the field we are interested
in. This address is the offset of the field.
Note that in both of these cases, the expression will be evaluated to a
constant at code generation time, so there is no runtime overhead to using this
technique.
The IR optimizer also converts the values to constants.
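The same "address of element 1" and "address of a field" ideas can be checked in C++ against sizeof and offsetof. This sketch measures on real objects instead of a null base pointer, since arithmetic on a null pointer is not valid C++; Pair and the helper names are hypothetical stand-ins for { i8, i32* }.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counterpart of the { i8, i32* } struct in the quoted text.
struct Pair {
    std::int8_t first;
    std::int32_t* second;
};

// Distance in bytes between element 0 and element 1 of an array of T,
// i.e. the size one T occupies including padding.
template <typename T>
std::size_t size_via_array_stride() {
    T elems[2] = {};
    const auto* p0 = reinterpret_cast<const unsigned char*>(&elems[0]);
    const auto* p1 = reinterpret_cast<const unsigned char*>(&elems[1]);
    return static_cast<std::size_t>(p1 - p0);
}

// Offset of Pair::second in bytes, measured from the start of an object.
inline std::size_t offset_of_second() {
    Pair obj{};
    const auto* base = reinterpret_cast<const unsigned char*>(&obj);
    const auto* field = reinterpret_cast<const unsigned char*>(&obj.second);
    return static_cast<std::size_t>(field - base);
}
```

Both results depend on the target, exactly as the quoted text says: the padding after the first field follows the pointer alignment of the platform.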

The size depends on the target (for several reasons, alignment being one of them).
In LLVM versions 3.2 and above, you need to use DataLayout, in particular its getTypeAllocSize method. This returns the size in bytes, there's also a bit version named getTypeAllocSizeInBits. A DataLayout instance can be obtained by creating it from the current module: DataLayout* TD = new DataLayout(M).
With LLVM up to and including version 3.1, use TargetData instead of DataLayout. It exposes the same getTypeAllocSize methods, though.

Would you use num%2 or num&1 to check if a number is even?

Well, there are at least two low-level ways of determining whether a given number is even or not:
1. if (num%2 == 0) { /* even */ }
2. if ((num&1) == 0) { /* even */ }
I consider the second option to be far more elegant and meaningful, and that's the one I usually use. But it is not only a matter of taste; the actual performance may vary: usually the bitwise operations (such as the bitwise AND here) are far more efficient than a mod (or div) operation. Of course, you may argue that some compilers will be able to optimize it anyway, and I agree...but some won't.
Another point is that the second one might be a little harder to comprehend for less experienced programmers. On that I'd answer that it will probably only benefit everybody if these programmers take that short time to understand statements of this kind.
What do you think?
The two snippets given are correct only if num is either an unsigned int, or a negative number with a two's complement representation, as some comments rightfully state.

I code for readability first so my choice here is num % 2 == 0. This is far more clear than num & 1 == 0. I'll let the compiler worry about optimizing for me and only adjust if profiling shows this to be a bottleneck. Anything else is premature.
I consider the second option to be far more elegant and meaningful
I strongly disagree with this. A number is even because its congruency modulo two is zero, not because its binary representation ends with a certain bit. Binary representations are an implementation detail. Relying on implementation details is generally a code smell. As others have pointed out, testing the LSB fails on machines that use ones' complement representations.
Another point is that the second one might be a little harder to comprehend for less experienced programmers. On that I'd answer that it will probably only benefit everybody if these programmers take that short time to understand statements of this kind.
I disagree. We should all be coding to make our intent clearer. If we are testing for evenness the code should express that (and a comment should be unnecessary). Again, testing congruency modulo two more clearly expresses the intent of the code than checking the LSB.
And, more importantly, the details should be hidden away in an isEven method. So we should see if(isEven(someNumber)) { // details } and only see num % 2 == 0 once in the definition of isEven.
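A minimal sketch of such a helper in C++, assuming any built-in integer type; is_even is a hypothetical name. Comparing the remainder with 0 rather than 1 matters for negative inputs, because -3 % 2 is -1 in C++.

```cpp
#include <cassert>
#include <type_traits>

// The one place that knows how evenness is tested.
template <typename Int>
constexpr bool is_even(Int n) {
    static_assert(std::is_integral<Int>::value,
                  "is_even expects an integer type");
    return n % 2 == 0;  // also correct for negative n: -3 % 2 is -1, not 0
}
```

Call sites then read if (is_even(someNumber)) and never mention the operator at all.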

If you're going to say that some compilers won't optimise %2, then you should also note that some compilers use a ones' complement representation for signed integers. In that representation, &1 gives the wrong answer for negative numbers.
So what do you want - code which is slow on "some compilers", or code which is wrong on "some compilers"? Not necessarily the same compilers in each case, but both kinds are extremely rare.
Of course if num is of an unsigned type, or one of the C99 fixed-width integer types (int8_t and so on, which are required to be 2's complement), then this isn't an issue. In that case, I consider %2 to be more elegant and meaningful, and &1 to be a hack that might conceivably be necessary sometimes for performance. I think for example that CPython doesn't do this optimisation, and the same will be true of fully interpreted languages (although then the parsing overhead likely dwarfs the difference between the two machine instructions). I'd be a bit surprised to come across a C or C++ compiler that didn't do it where possible, though, because it's a no-brainer at the point of emitting instructions if not before.
In general, I would say that in C++ you are completely at the mercy of the compiler's ability to optimise. Standard containers and algorithms have n levels of indirection, most of which disappears when the compiler has finished inlining and optimising. A decent C++ compiler can handle arithmetic with constant values before breakfast, and a non-decent C++ compiler will produce rubbish code no matter what you do.

I define and use an "IsEven" function so I don't have to think about it, then I choose one method or the other and forget how I check if something is even.
Only nitpick/caveat is I'd just say that with the bitwise operation, you're assuming something about the representation of the numbers in binary, with modulo you are not. You are interpreting the number as a decimal value. This is pretty much guaranteed to work with integers. However consider that the modulo would work for a double, however the bitwise operation would not.

Your conclusion about performance is based on a popular false premise.
For some reason you insist on translating the language operations into their "obvious" machine counterparts and make the performance conclusions based on that translation. In this particular case you concluded that a bitwise-and & operation of C++ language must be implemented by a bitwise-and machine operation, while a modulo % operation must somehow involve machine division, which is allegedly slower. Such an approach is of very limited use, if any.
Firstly, I can't imagine a real-life C++ compiler that would interpret the language operations in such a "literal" way, i.e. by mapping them into the "equivalent" machine operations. Mostly because more often than one'd think the equivalent machine operations simply do not exist.
When it comes to such basic operations with an immediate constant as an operand, any self-respecting compiler will always immediately "understand" that both num & 1 and num % 2 for integral num do exactly the same thing, which will make the compiler generate absolutely identical code for both expressions. Naturally, the performance is going to be exactly the same.
BTW, this is not called "optimization". Optimization, by definition, is when the compiler decides to deviate from the standard behavior of abstract C++ machine in order to generate the more efficient code (preserving the observable behavior of the program). There's no deviation in this case, meaning that there's no optimization.
Moreover, it is quite possible that on the given machine the most optimal way to implement both is neither bitwise-and nor division, but some other dedicated machine-specific instruction. On top of that, it is quite possible that there won't be any need for any instruction at all, since even-ness/odd-ness of a specific value might be exposed "for free" through the processor status flags or something like that.
In other words, the efficiency argument is invalid.
Secondly, to return to the original question, the more preferable way to determine the even-ness/odd-ness of a value is certainly the num % 2 approach, since it implements the required check literally ("by definition"), and clearly expresses the fact that the check is purely mathematical. I.e. it makes clear that we care about the property of a number, not about the property of its representation (as would be in case of num & 1 variant).
The num & 1 variant should be reserved for situations when you want access to the bits of value representation of a number. Using this code for even-ness/odd-ness check is a highly questionable practice.

It's been mentioned a number of times that any modern compiler would create the same assembly for both options. This reminded me of the LLVM demo page that I saw somewhere the other day, so I figured I'd give it a go. I know this isn't much more than anecdotal, but it does confirm what we'd expect: x%2 and x&1 are implemented identically.
I also tried compiling both of these with gcc-4.2.1 (gcc -S foo.c) and the resultant assembly is identical (and pasted at the bottom of this answer).
Program the first:
int main(int argc, char **argv) {
return (argc%2==0) ? 0 : 1;
}
Result:
; ModuleID = '/tmp/webcompile/_27244_0.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32"
target triple = "i386-pc-linux-gnu"
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind readnone {
entry:
%0 = and i32 %argc, 1 ; <i32> [#uses=1]
ret i32 %0
}
Program the second:
int main(int argc, char **argv) {
return ((argc&1)==0) ? 0 : 1;
}
Result:
; ModuleID = '/tmp/webcompile/_27375_0.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32"
target triple = "i386-pc-linux-gnu"
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind readnone {
entry:
%0 = and i32 %argc, 1 ; <i32> [#uses=1]
ret i32 %0
}
GCC output:
.text
.globl _main
_main:
LFB2:
pushq %rbp
LCFI0:
movq %rsp, %rbp
LCFI1:
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl -4(%rbp), %eax
andl $1, %eax
testl %eax, %eax
setne %al
movzbl %al, %eax
leave
ret
LFE2:
.section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
.set L$set$0,LECIE1-LSCIE1
.long L$set$0
LSCIE1:
.long 0x0
.byte 0x1
.ascii "zR\0"
.byte 0x1
.byte 0x78
.byte 0x10
.byte 0x1
.byte 0x10
.byte 0xc
.byte 0x7
.byte 0x8
.byte 0x90
.byte 0x1
.align 3
LECIE1:
.globl _main.eh
_main.eh:
LSFDE1:
.set L$set$1,LEFDE1-LASFDE1
.long L$set$1
LASFDE1:
.long LASFDE1-EH_frame1
.quad LFB2-.
.set L$set$2,LFE2-LFB2
.quad L$set$2
.byte 0x0
.byte 0x4
.set L$set$3,LCFI0-LFB2
.long L$set$3
.byte 0xe
.byte 0x10
.byte 0x86
.byte 0x2
.byte 0x4
.set L$set$4,LCFI1-LCFI0
.long L$set$4
.byte 0xd
.byte 0x6
.align 3
LEFDE1:
.subsections_via_symbols

It all depends on context. I actually prefer the &1 approach myself if it's a low level, system context. In many of these kinds of contexts, "is even" basically means has low bit zero to me, rather than is divisible by two.
HOWEVER: Your one liner has a bug.
You must go
if( (x&1) == 0 )
not
if( x&1 == 0 )
The latter ANDs x with 1==0, ie it ANDs x with 0, yielding 0, which always evaluates as false of course.
So if you did it exactly as you suggest, all numbers are odd!
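The precedence issue can be made explicit in C++. In this sketch the unparenthesized form is written out the way the compiler parses it, so the code says what it does; both function names are hypothetical.

```cpp
#include <cassert>

// What was intended: mask first, then compare.
constexpr bool is_even_intended(int x) {
    return (x & 1) == 0;
}

// What `x & 1 == 0` means: == binds tighter than &, so it is parsed as
// x & (1 == 0), i.e. x & 0, which is 0 for every x.
constexpr int unparenthesized_value(int x) {
    return x & static_cast<int>(1 == 0);
}
```

Because the second function returns 0 for every input, an if built on it never takes its branch, which is why every number appears odd.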

Any modern compiler will optimise away the modulo operation, so speed is not a concern.
I'd say using modulo would make things easier to understand, but creating an is_even function that uses the x & 1 method gives you the best of both worlds.

They're both pretty intuitive.
I'd give a slight edge to num % 2 == 0, but I really don't have a preference. Certainly as far as performance goes, it's probably a micro-optimization, so I wouldn't worry about it.

I spent years insisting that any reasonable compiler worth the space it consumes on disk would optimize num % 2 == 0 to num & 1 == 0. Then, analyzing disassembly for a different reason, I had a chance to actually verify my assumption.
It turns out, I was wrong. Microsoft Visual Studio, all the way up through version 2013, generates the following object code for num % 2 == 0:
and ecx, -2147483647 ; the parameter was passed in ECX
jns SHORT $IsEven
dec ecx
or ecx, -2
inc ecx
$IsEven:
neg ecx
sbb ecx, ecx
lea eax, DWORD PTR [ecx+1]
Yes, indeed. This is in Release mode, with all optimizations enabled. You get virtually equivalent results whether building for x86 or x64. You probably won't believe me; I barely believed it myself.
It does essentially what you would expect for num & 1 == 0:
not eax ; the parameter was passed in EAX
and eax, 1
By way of comparison, GCC (as far back as v4.4) and Clang (as far back as v3.2) do what one would expect, generating identical object code for both variants. However, according to Matt Godbolt's interactive compiler, ICC 13.0.1 also defies my expectations.
Sure, these compilers are not wrong. It's not a bug. There are plenty of technical reasons (as adequately pointed out in the other answers) why these two snippets of code are not identical. And there's certainly a "premature optimization is evil" argument to be made here. Granted, there's a reason it took me years to notice this, and even then I only stumbled onto it by mistake.
But, like Doug T. said, it is probably best to define an IsEven function in your library somewhere that gets all of these little details correct so that you never have to think about them again—and keep your code readable. If you regularly target MSVC, perhaps you'll define this function as I've done:
#include <cassert>

bool IsEven(int num)
{
    // Test the low bit, and verify in debug builds that it agrees with %.
    const bool result = (num & 1) == 0;
    assert(result == ((num % 2) == 0));
    return result;
}

Both approaches are not obvious especially to someone who is new to programming. You should define an inline function with a descriptive name. The approach you use in it won't matter (micro optimizations most likely won't make your program faster in a noticeable way).
Anyway, I believe 2) is much faster as it doesn't require a division.

I don't think the modulo makes things more readable. Both make sense, and both versions are correct. And computers store numbers in binary, so you can just use the binary version.
The compiler may replace the modulo version with an efficient version. But that sounds like an excuse for preferring the modulo.
And readability in this very special case is the same for both versions. A reader that is new to programming may not even know that you can use modulo 2 to determine the even-ness of a number. The reader has to deduce it. He may not even know the modulo operator!
When deducing the meaning behind the statements, it could even be easier to read the binary version:
if( ( num & 1 ) == 0 ) { /* even */ }
if( ( 00010111b & 1 ) == 0 ) { /* not taken: low bit is 1, so odd */ }
if( ( 00010110b & 1 ) == 0 ) { /* taken: low bit is 0, so even */ }
(I used the "b" suffix for clarification only; it's not C/C++.)
With the modulo version, you have to double-check how the operation is defined in its details (e.g. check documentation to make sure that 0 % 2 is what you expect).
The binary AND is simpler and there are no ambiguities!
Only the operator precedence may be a pitfall with binary operators. But it should not be a reason to avoid them (some day even new programmers will need them anyway).

At this point, I may be just adding to the noise, but as far as readability goes, the modulo option makes more sense. If your code is not readable, it's practically useless.
Also, unless this is code to be run on a system that's really strapped for resources (I'm thinking microcontroller), don't try to optimize for the compiler's optimizer.
