Clang sizeof("literal") optimization - c++

Experiencing with C++, I tried to understand the performance difference between sizeof and strlen for string literal.
Here my small benchmark code:
#include <iostream>
#include <cstring>
#define LOOP_COUNT 1000000000
unsigned long long rdtscl(void)
{
unsigned int lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
int main()
{
unsigned long long before = rdtscl();
size_t ret;
for (int i = 0; i < LOOP_COUNT; i++)
ret = strlen("abcd");
unsigned long long after = rdtscl();
std::cout << "Strlen " << (after - before) << " ret=" << ret << std::endl;
before = rdtscl();
for (int i = 0; i < LOOP_COUNT; i++)
ret = sizeof("abcd");
after = rdtscl();
std::cout << "Sizeof " << (after - before) << " ret=" << ret << std::endl;
}
Compiling with clang++, I get the following result:
clang++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 36 ret=4
Sizeof 62092396 ret=5
With g++:
g++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen
Strlen 30 ret=4
Sizeof 30 ret=5
I strongly suspect that g++ does optimize the loop with sizeof and clang++ don't.
Is this result a known issue?
EDIT:
The assembly generated by clang++ for the loop with sizeof:
rdtsc
mov %edx,%r14d
shl $0x20,%r14
mov $0x3b9aca01,%ecx
xchg %ax,%ax
add $0xffffffed,%ecx // 0x400ad0
jne 0x400ad0 <main+192>
mov %eax,%eax
or %rax,%r14
rdtsc
And the one by g++:
rdtsc
mov %edx,%esi
mov %eax,%ecx
rdtsc
I don't understand why clang++ do the {add, jne} loop, it seems useless. Is it a bug?
For information:
g++ (GCC) 5.1.0
clang version 3.6.2 (tags/RELEASE_362/final)
EDIT2:
It it likely to be a bug in clang.
I opened a bug report.

I'd call that a bug in clang.
It's actually optimising the sizeof itself, just not the loop.
To make the code much clearer, I change the std::cout to printf, and then you get the following LLVM-IR code for main:
; Function Attrs: nounwind uwtable
define i32 #main() #0 {
entry:
%0 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i = extractvalue { i32, i32 } %0, 1
%conv2.i = zext i32 %asmresult1.i to i64
%shl.i = shl nuw i64 %conv2.i, 32
%asmresult.i = extractvalue { i32, i32 } %0, 0
%conv.i = zext i32 %asmresult.i to i64
%or.i = or i64 %shl.i, %conv.i
%1 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.25 = extractvalue { i32, i32 } %1, 0
%asmresult1.i.26 = extractvalue { i32, i32 } %1, 1
%conv.i.27 = zext i32 %asmresult.i.25 to i64
%conv2.i.28 = zext i32 %asmresult1.i.26 to i64
%shl.i.29 = shl nuw i64 %conv2.i.28, 32
%or.i.30 = or i64 %shl.i.29, %conv.i.27
%sub = sub i64 %or.i.30, %or.i
%call2 = tail call i32 (i8*, ...) #printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* #.str, i64 0, i64 0), i64 %sub, i64 4)
%2 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult1.i.32 = extractvalue { i32, i32 } %2, 1
%conv2.i.34 = zext i32 %asmresult1.i.32 to i64
%shl.i.35 = shl nuw i64 %conv2.i.34, 32
br label %for.cond.5
for.cond.5: ; preds = %for.cond.5, %entry
%i4.0 = phi i32 [ 0, %entry ], [ %inc10.18, %for.cond.5 ]
%inc10.18 = add nsw i32 %i4.0, 19
%exitcond.18 = icmp eq i32 %inc10.18, 1000000001
br i1 %exitcond.18, label %for.cond.cleanup.7, label %for.cond.5
for.cond.cleanup.7: ; preds = %for.cond.5
%asmresult.i.31 = extractvalue { i32, i32 } %2, 0
%conv.i.33 = zext i32 %asmresult.i.31 to i64
%or.i.36 = or i64 %shl.i.35, %conv.i.33
%3 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
%asmresult.i.37 = extractvalue { i32, i32 } %3, 0
%asmresult1.i.38 = extractvalue { i32, i32 } %3, 1
%conv.i.39 = zext i32 %asmresult.i.37 to i64
%conv2.i.40 = zext i32 %asmresult1.i.38 to i64
%shl.i.41 = shl nuw i64 %conv2.i.40, 32
%or.i.42 = or i64 %shl.i.41, %conv.i.39
%sub13 = sub i64 %or.i.42, %or.i.36
%call14 = tail call i32 (i8*, ...) #printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* #.str, i64 0, i64 0), i64 %sub13, i64 5)
ret i32 0
}
As you can see, the call to printf is using the constant 5 from sizeof, and the for.cond.5: starts the empty loop:
a "phi" node (which selects the "new" value of i based on where we came from - before loop -> 0, inside loop -> %inc10.18)
an increment
a conditional branch that jumps back if %inc10.18 is not 100000001.
I don't know enough about clang and LLVM to explain why that empty loop isn't optimised away. But it's certainly not the sizeof that is taking time, as there is no sizeof inside the loop.
It's worth noting that sizeof is ALWAYS a constant at compile-time, it NEVER "takes time" beyond loading a constant value into a register.

The difference is, sizeof() is not a function call. The value, "returned" by sizeof() is known at compile-time.
At the same moment, strlen is function call (executed at run-time, apparently) and is not optimized at all. It seeks '\0' in a string, and it doesn't even have a single clue if it is a dynamically allocated string or a string literal.
So, sizeof is expected to be always faster for string literals.
I am not an expert, but your results might be explained by scheduling algorithms or overflow in your time variables.

Related

Why std::unique_ptr isn't optimized while std::variant can? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I tried to compare the overhead of std::visit(std::variant polymorphism) and virtual function(std::unique_ptr polymorphism).(please note my question is not about overhead or performance, but optimization.)
Here is my code.
https://quick-bench.com/q/pJWzmPlLdpjS5BvrtMb5hUWaPf0
#include <memory>
#include <variant>
struct Base
{
virtual void Process() = 0;
};
struct Derived : public Base
{
void Process() { ++a; }
int a = 0;
};
struct VarDerived
{
void Process() { ++a; }
int a = 0;
};
static std::unique_ptr<Base> ptr;
static std::variant<VarDerived> var;
static void PointerPolyMorphism(benchmark::State& state)
{
ptr = std::make_unique<Derived>();
for (auto _ : state)
{
for(int i = 0; i < 1000000; ++i)
ptr->Process();
}
}
BENCHMARK(PointerPolyMorphism);
static void VariantPolyMorphism(benchmark::State& state)
{
var.emplace<VarDerived>();
for (auto _ : state)
{
for(int i = 0; i < 1000000; ++i)
std::visit([](auto&& x) { x.Process();}, var);
}
}
BENCHMARK(VariantPolyMorphism);
I know it's not good benchmark test, it was only draft during my test.
But I was surprised at the result.
std::visit benchmark was high(which means slow) without any optimization.
But When I turn on optimization (higher than O2), std::visit benchmark is extremely low(which means extremely fast) while std::unique_ptr isn't.
I'm wondering why the same optimization can't be applied to the std::unique_ptr polymorphism?
I've compiled your code with Clang++ to LLVM (without your benchmarking) with -Ofast. Here's what you get for VariantPolyMorphism, unsurprisingly:
define void #_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
ret void
}
On the other hand, PointerPolyMorphism does really execute the loop and all calls:
define void #_Z19PointerPolyMorphismv() local_unnamed_addr #2 personality i32 (...)* #__gxx_personality_v0 {
%1 = tail call dereferenceable(16) i8* #_Znwm(i64 16) #8, !noalias !8
tail call void #llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(16) %1, i8 0, i64 16, i1 false), !noalias !8
%2 = bitcast i8* %1 to i32 (...)***
store i32 (...)** bitcast (i8** getelementptr inbounds ({ [3 x i8*] }, { [3 x i8*] }* #_ZTV7Derived, i64 0, inrange i32 0, i64 2) to i32 (...)**), i32 (...)*** %2, align 8, !tbaa !11, !noalias !8
%3 = getelementptr inbounds i8, i8* %1, i64 8
%4 = bitcast i8* %3 to i32*
store i32 0, i32* %4, align 8, !tbaa !13, !noalias !8
%5 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* #_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
store i8* %1, i8** bitcast ({ { %struct.Base* } }* #_ZL3ptr to i8**), align 8, !tbaa !4
%6 = icmp eq %struct.Base* %5, null
br i1 %6, label %7, label %8
7: ; preds = %8, %0
br label %11
8: ; preds = %0
%9 = bitcast %struct.Base* %5 to i8*
tail call void #_ZdlPv(i8* %9) #7
br label %7
10: ; preds = %11
ret void
11: ; preds = %7, %11
%12 = phi i32 [ %17, %11 ], [ 0, %7 ]
%13 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* #_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
%14 = bitcast %struct.Base* %13 to void (%struct.Base*)***
%15 = load void (%struct.Base*)**, void (%struct.Base*)*** %14, align 8, !tbaa !11
%16 = load void (%struct.Base*)*, void (%struct.Base*)** %15, align 8
tail call void %16(%struct.Base* %13)
%17 = add nuw nsw i32 %12, 1
%18 = icmp eq i32 %17, 1000000
br i1 %18, label %10, label %11
}
The reason for this is that both your variables are static. This allows the compiler to infer that no code outside the translation unit has access to your variant instance. Therefore your loop doesn't have any visible effect and can be safely removed. However, although your smart pointer is static, the memory it points to could still change (as a side-effect of the call to Process, for example). The compiler can therefore not easily prove that is safe to remove the loop and doesn't.
If you remove the static from both VariantPolyMorphism you get:
define void #_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
store i32 0, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* #var, i64 0, i32 0, i32 1), align 4, !tbaa !16
store i32 1000000, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* #var, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0), align 4, !tbaa !18
ret void
}
Which isn't surprising once again. The variant can only contain VarDerived so nothing needs to be computed at run-time: The final state of the variant can already be determined at compile-time. The difference, though, now is that some other translation unit might want to access the value of var later on and the value must therefore be written.
Your variant can store only singe type, so this is same as single regular variable (it is working more like an optional).
You are running test without optimizations enabled
Result is not secured from optimizer so it can trash your code.
Your code actually do not utilizes polymorphism, some compilers are able to figure out that there is only one implementation of Base class and drop virtual calls.
This is better but still not trustworthy:
ver 1, ver 2 with arrays.
Yes polymorphism can be expensive when used in tight loops.
Witting benchmarks for such small extremely fast features is hard and full of pitfalls, so must be approached with extreme caution, since you reaching limitations of benchmark tool.

LLVM generates unefficient IR

I playing with LLVM and tried to compile simple C++ code using it
#include <stdio.h>
#include <stdlib.h>
int main()
{
int test = rand();
if (test % 2)
test += 522;
else
test *= 333;
printf("test %d\n", test);
}
Especially to test how LLVM treats code branches
Result I got is very strange, it gives valid result on execution, but looks unefficient
; Function Attrs: nounwind
define i32 #main() local_unnamed_addr #0 {
%1 = tail call i32 #rand() #3
%2 = and i32 %1, 1
%3 = icmp eq i32 %2, 0
%4 = add nsw i32 %1, 522
%5 = mul nsw i32 %1, 333
%6 = select i1 %3, i32 %5, i32 %4
%7 = tail call i32 (i8*, ...) #printf(i8* getelementptr inbounds ([9 x i8], [9 x i8]* #.str, i64 0, i64 0), i32 %6)
ret i32 0
}
It looks like it executing both ways even if only one is needen
My question is: Should not LLVM in this case generate labels and why?
Thank you
P.S. I'm using http://ellcc.org/demo/index.cgi for this test
Branches can be expensive, so generating code without branches at the cost of one unnecessary add or mul instruction, will usually work out to be faster in practice.
If you make the branches of your if longer, you'll see that it'll eventually become a proper branch instead of a select.
The compiler tends to have a good understanding of which option is faster in which case, so I'd trust it unless you have specific benchmarks that show the version with select to be slower than a version that branches.

how to see content of a method pointer?

typedef int (D::*fptr)(void);
fptr bfunc;
bfunc=&D::Bfunc;
cout<<(reinterpret_cast<unsigned long long>(bfunc)&0xffffffff00000000)<<endl;
complete code available at : https://ideone.com/wRVyTu
I am trying to use reinterpret_cast, but the compiler throws error
prog.cpp: In function 'int main()': prog.cpp:49:51: error: invalid cast from type 'fptr {aka int (D::*)()}' to type 'long long unsigned int' cout<<(reinterpret_cast<unsigned long long>(bfunc)&0xffffffff00000000)<<endl;
My questions are :
why is reinterpret_cast not suitable for this occasion?
Is there another way, I can see the contents of the method pointer?
Using clang++ to compile a slightly modified version of your code (removed all the cout to not get thousands of lines...), we get this for main:
define i32 #main() #0 {
entry:
%retval = alloca i32, align 4
%bfunc = alloca { i64, i64 }, align 8
%dfunc = alloca { i64, i64 }, align 8
store i32 0, i32* %retval, align 4
store { i64, i64 } { i64 1, i64 16 }, { i64, i64 }* %bfunc, align 8
store { i64, i64 } { i64 9, i64 0 }, { i64, i64 }* %dfunc, align 8
ret i32 0
}
Note that the bfunc and dfunc are two 64-bit integer values. If I compile for 32-bit x86 it is two i32 (so 32-bit integer values).
So, if we make main look like this:
int main() {
// your code goes here
typedef int (D::*fptr)(void);
fptr bfunc;
fptr dfunc;
bfunc=&D::Bfunc;
dfunc=&D::Dfunc;
D d;
(d.*bfunc)();
return 0;
}
the generated code looks like this:
; Function Attrs: norecurse uwtable
define i32 #main() #0 {
entry:
%retval = alloca i32, align 4
%bfunc = alloca { i64, i64 }, align 8
%dfunc = alloca { i64, i64 }, align 8
%d = alloca %class.D, align 8
store i32 0, i32* %retval, align 4
store { i64, i64 } { i64 1, i64 16 }, { i64, i64 }* %bfunc, align 8
store { i64, i64 } { i64 9, i64 0 }, { i64, i64 }* %dfunc, align 8
call void #_ZN1DC2Ev(%class.D* %d) #3
%0 = load { i64, i64 }, { i64, i64 }* %bfunc, align 8
%memptr.adj = extractvalue { i64, i64 } %0, 1
%1 = bitcast %class.D* %d to i8*
%2 = getelementptr inbounds i8, i8* %1, i64 %memptr.adj
%this.adjusted = bitcast i8* %2 to %class.D*
%memptr.ptr = extractvalue { i64, i64 } %0, 0
%3 = and i64 %memptr.ptr, 1
%memptr.isvirtual = icmp ne i64 %3, 0
br i1 %memptr.isvirtual, label %memptr.virtual, label %memptr.nonvirtual
memptr.virtual: ; preds = %entry
%4 = bitcast %class.D* %this.adjusted to i8**
%vtable = load i8*, i8** %4, align 8
%5 = sub i64 %memptr.ptr, 1
%6 = getelementptr i8, i8* %vtable, i64 %5
%7 = bitcast i8* %6 to i32 (%class.D*)**
%memptr.virtualfn = load i32 (%class.D*)*, i32 (%class.D*)** %7, align 8
br label %memptr.end
memptr.nonvirtual: ; preds = %entry
%memptr.nonvirtualfn = inttoptr i64 %memptr.ptr to i32 (%class.D*)*
br label %memptr.end
memptr.end: ; preds = %memptr.nonvirtual, %memptr.virtual
%8 = phi i32 (%class.D*)* [ %memptr.virtualfn, %memptr.virtual ], [ %memptr.nonvirtualfn, %memptr.nonvirtual ]
%call = call i32 %8(%class.D* %this.adjusted)
ret i32 0
}
This is not entirely trivial to follow, but in essense:
%memptr.adj = Read adjustment from bfunc[1]
%2 = %d[%memptr.adj]
cast %2 to D*
%memptr.ptr = bfunc[0]
if (%memptr.ptr & 1) goto is_virtual else goto is_non_virtual
is_virtual:
%memptr.virtual=vtable[%memptr.ptr-1]
goto common
is_non_virtual:
%memptr.non_virtual = %memptr.ptr
common:
if we came from
is_non_virtual: %8 = %memptr.non_virtual
is_virtual: %8 = %memptr.virutal
call %8
I skipped some type-casts and stuff to make it simpler.
NOTE This is NOT meant to say "this is how it is implemented always. It's one example of what the compiler MAY do. Different compilers will do this subtly differently. But if the function may or may not be virtual, the compiler first has to figure out which. [In the above example, I'm fairly sure we can turn on optimisation and get much better code, but it would presumably just figure out exactly what's going on and remove all of the code, which for understanding how it works is pointless]
There is a very simple answer to this. Pointers-to-methods are not 'normal' pointers and can not be cast to those, even through reinterpret_cast. One can cast first to void*, and than to the long long, but this is really ill-advised.
Remember, size of pointer-to-method is not neccessarily (and usually is not!) equal to the size of 'normal' pointer. The way most compilers implement pointer-to-method, it is twice the size of 'normal' pointer.
GCC is going to complain for the pointer-to-method to void* cast in pedantic mode, but will generate code still.

LLVM Can't find getelementptr instruction

I have this byte code fragment:
define void #setGlobal(i32 %a) #0 {
entry:
%a.addr = alloca i32, align 4
store i32 %a, i32* %a.addr, align 4
%0 = load i32* %a.addr, align 4
store i32 %0, i32* #Global, align 4
%1 = load i32* %a.addr, align 4
store i32 %1, i32* getelementptr inbounds ([5 x i32]* #GlobalVec, i32 0, i64 0), align 4
store i32 2, i32* getelementptr inbounds ([5 x i32]* #GlobalVec, i32 0, i64 2), align 4
ret void
}
I am using this code to find the getelementptr from "store i32 %1, i32* getelementptr inbounds ([5 x i32]* #GlobalVec, i32 0, i64 0), align 4":
for (Module::iterator F = p_Module.begin(), endF = p_Module.end(); F != endF; ++F) {
for (Function::iterator BB = F->begin(), endBB = F->end(); BB != endBB; ++BB) {
for (BasicBlock::iterator I = BB->begin(), endI = BB->end(); I
!= endI; ++I) {
if (StoreInst* SI = dyn_cast<StoreInst>(I)) {
if (Instruction *gep = dyn_cast<Instruction>(SI->getOperand(1)))
{
if (gep->getOpcode() == Instruction::GetElementPtr)
{
//do something
}
}
}
}
}
}
This code can't find the getelementptr. What am I doing wrong?
There are no getelementptr instructions in your bitcode snippet, which is why you can't find them.
The two cases that look like a getelementptr instructions are actually constant expressions - the telltale sign is that they appear as part of another instruction (store), which is not something you can do with regular instructions.
So if you want to search for that expression, you need to look for type GetElementPtrConstantExpr, not GetElementPtrInst.

How can you print instruction in llvm

From an llvm pass, I need to print an llvm instruction (Type llvm::Instruction) on the screen, just like as it appears in the llvm bitcode file. Actually my compilation is crashing, and does not reach the point where bitcode file is generated. So for debugging I want to print some instructions to know what is going wrong.
Assuming I is your instruction
I.print(errs());
By simply using the print method.
For a simple Hello World program, using C++'s range-based loops, you can do something like this:
for(auto& B: F){
for(auto& I: B){
errs() << I << "\n";
}
}
This gives the output:
%3 = alloca i32, align 4
%4 = alloca i8**, align 8
store i32 %0, i32* %3, align 4
store i8** %1, i8*** %4, align 8
%5 = call i32 (i8*, ...) #printf(i8* getelementptr inbounds ([15 x i8], [15 x i8]* #.str, i64 0, i64 0))
ret i32 0