Explanation for D vs. C++ performance difference

Simple example in D:
import std.stdio, std.conv, core.memory;

class Foo {
    int x;
    this(int _x) { x = _x; }
}

void main(string[] args) {
    GC.disable();
    int n = to!int(args[1]);
    Foo[] m = new Foo[n];
    for (int i = 0; i < n; i++) {
        m[i] = new Foo(i);
    }
}
C++ code:
#include <cstdlib>
using namespace std;

class Foo {
public:
    int x;
    Foo(int _x);
};

Foo::Foo(int _x) {
    x = _x;
}

int main(int argc, char** argv) {
    int n = atoi(argv[1]);
    Foo** gx = new Foo*[n];
    for (int i = 0; i < n; i++) {
        gx[i] = new Foo(i);
    }
    return 0;
}
No compilation flags were used.
Compiling and running:
>dmd td.d
>time ./td 10000000
>real 0m2.544s
The analogous example in C++ (gcc), running:
>time ./tc 10000000
>real 0m0.523s
Why? Such a simple example, and such a big difference: 2.54s and 0.52s.

You're mainly measuring three differences:
1. The difference between the code generated by gcc and dmd.
2. The extra time D takes to allocate using the GC.
3. The extra time D takes to allocate a class.
Now, you might think that point 2 is invalid because you used GC.disable();, but this only makes it so that the GC won't collect as it normally does. It does not make the GC disappear entirely and automatically redirect all memory allocations to C's malloc. It still must do most of what it normally does to ensure that the GC knows about the memory allocated, and all that takes time. Normally, this is a relatively insignificant part of program execution (even ignoring the benefits GCs give). However, your benchmark makes it the entirety of the program which exaggerates this effect.
Therefore, I suggest you consider two changes to your approach:
Either switch to using gdc to compare against gcc or switch to dmc to compare to dmd
Make the programs more equivalent. Either have both D and C++ allocate structs on the heap or, at the very least, make it so that D is allocating without touching the GC. If you're optimizing a program for maximum speed, you'd be using structs and C's malloc anyway, regardless of language.
I'd even recommend a 3rd change: since you're interested in maximum performance, you ought to try to come up with a better program entirely. Why not switch to structs and have them located contiguously in memory? This would make allocation (which is, essentially, the entire program) as fast as possible.
Running your code above with dmd (for D) and dmc (for C++) on my machine results in the following times:
DMC 8.42n (no flags) : ~880ms
DMD 2.062 (no flags) : ~1300ms
Modifying the code to the following:
C++ code:
#include <cstdlib>

struct Foo {
    int x;
};

int main(int argc, char** argv) {
    int n = atoi(argv[1]);
    Foo* gx = (Foo*) malloc(n * sizeof(Foo));
    for (int i = 0; i < n; i++) {
        gx[i].x = i;
    }
    free(gx);
    return 0;
}
D code:
import std.conv;

struct Foo {
    int x;
}

void main(string[] args) {
    int n = to!int(args[1]);
    Foo[] m = new Foo[](n);
    foreach (i, ref e; m) {
        e.x = cast(int) i;  // the foreach index is size_t, so narrow it explicitly
    }
}
Running my code with DMD and DMC results in the following times:
DMC 8.42n (no flags) : ~95ms +- 20ms
DMD 2.062 (no flags) : ~95ms +- 20ms
Essentially identical (I'd have to start using some statistics to give you a better idea of which one is truly faster, but at this scale it's irrelevant). Notice that this is much, much faster than the naive approach, and that D is equally capable of using this strategy. In this case the run-time difference is negligible, yet we retain the benefits of using a GC, and there are definitely far fewer things that can go wrong when writing the D code (notice how your program failed to delete all of its allocations?).
Furthermore, if you absolutely wanted to, D allows you to use C's standard library by importing std.c.stdlib; this would let you truly bypass the GC and achieve maximum performance with C's malloc, if necessary. In this case it's not necessary, so I erred on the side of safer, more readable code.

Try this one:
import std.stdio, std.conv, core.memory;

class Foo {
    int x = void;
    this(in int _x) { x = _x; }
}

void main(string[] args) {
    GC.disable();
    int n = to!int(args[1]);
    Foo[] m = new Foo[n];
    foreach (i; 0 .. n) {
        m[i] = new Foo(i);
    }
}

Related

Why is this code extremely slow? Anything related to cache behavior?

I started doing some data-oriented design experiments. I initially started with some OOP code and found that some of it was extremely slow, and I don't know why. Here is one example.
I have a game object:
class GameObject
{
public:
    float m_Pos[2];
    float m_Vel[2];
    float m_Foo;

    void UpdateFoo(float f) {
        float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
        m_Foo += mag * f;
    }
};
Then I create 1,000,000 objects using new and loop over them calling UpdateFoo():
for (unsigned i = 0; i < OBJECT_NUM; ++i)
{
    v_objects[i]->UpdateFoo(10.0);
}
It takes about 20 ms to finish the loop. Strange things happen when I comment out float m_Pos[2], so the object looks like this:
class GameObject
{
public:
    //float m_Pos[2];
    float m_Vel[2];
    float m_Foo;

    void UpdateFoo(float f) {
        float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
        m_Foo += mag * f;
    }
};
and suddenly the loop takes about 150 ms to finish. If I put anything before m_Vel, it is much faster again. I tried putting some padding between m_Vel and m_Foo, or in other places except before m_Vel, and it was still slow.
I tested with VS2008 and VS2010 in release builds, on an i7-4790.
Any idea how this difference could happen? Is it related to some cache behavior?
Here is the whole sample:
#include <iostream>
#include <math.h>
#include <vector>
#include <Windows.h>
using namespace std;

class GameObject
{
public:
    //float m_Pos[2];
    float m_Velocity[2];
    float m_Foo;

    void UpdateFoo(float f)
    {
        float mag = sqrtf(m_Velocity[0] * m_Velocity[0] + m_Velocity[1] * m_Velocity[1]);
        m_Foo += mag * f;
    }
};

#define OBJECT_NUM 1000000

int main(int argc, char **argv)
{
    vector<GameObject*> v_objects;
    for (unsigned i = 0; i < OBJECT_NUM; ++i)
    {
        GameObject * pObject = new GameObject;
        v_objects.push_back(pObject);
    }

    LARGE_INTEGER nFreq;
    LARGE_INTEGER nBeginTime;
    LARGE_INTEGER nEndTime;
    QueryPerformanceFrequency(&nFreq);
    QueryPerformanceCounter(&nBeginTime);

    for (unsigned i = 0; i < OBJECT_NUM; ++i)
    {
        v_objects[i]->UpdateFoo(10.0);
    }

    QueryPerformanceCounter(&nEndTime);
    double dWasteTime = (double)(nEndTime.QuadPart - nBeginTime.QuadPart) / (double)nFreq.QuadPart * 1000;
    printf("finished: %f", dWasteTime);

    // for (unsigned i = 0; i < OBJECT_NUM; ++i)
    // {
    //     delete(v_objects[i]);
    // }
}
then I create 1,000,000 of objects using new, and then loop over calling UpdateFoo()
There's your problem right there. Don't allocate a million teeny things individually that are going to be processed repeatedly using a general-purpose allocator.
Try storing the objects contiguously or in contiguous chunks. An easy solution is store them all in one big std::vector. To remove in constant time, you can swap the element to remove with the last and pop back. If you need stable indices, you can leave a hole behind to be reclaimed on insertion (can use a free list or stack approach). If you need stable pointers that don't invalidate, deque might be an option combined with the "holes" idea using a free list or separate stack of indices to reclaim/overwrite.
You can also just use a free list allocator and use placement new against it while careful to free using the same allocator and manually invoke the dtor, but that gets messier faster and requires more practice to do well than the data structure approach. I recommend instead to just seek to store your game objects in some big container so that you get back the control over where everything is going to reside in memory and the spatial locality that results.
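As a rough illustration of the contiguous-storage idea described above (a minimal sketch under my own assumptions, not drop-in code for your project; the swap-and-pop removal assumes element order does not matter):

#include <cmath>
#include <cstddef>
#include <vector>

struct GameObject {
    float m_Pos[2];
    float m_Vel[2];
    float m_Foo;

    void UpdateFoo(float f) {
        float mag = std::sqrt(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
        m_Foo += mag * f;
    }
};

int main() {
    // All objects live contiguously; iterating over them is cache-friendly.
    std::vector<GameObject> objects(1000000);

    for (GameObject &o : objects)
        o.UpdateFoo(10.0f);

    // Constant-time removal when order does not matter: swap with the last
    // element and pop. Note that indices/pointers to the moved element change.
    std::size_t victim = 42;
    objects[victim] = objects.back();
    objects.pop_back();
    return 0;
}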
I tested on vs2008 and vs2010 in release build, i7-4790. Any idea how this difference could happen? Is it related to any cache coherent behavior.
If you are benchmarking and building the project properly, maybe the allocator is fragmenting memory more when GameObject is smaller, and you are incurring more cache misses as a result. That would seem to be the most likely explanation, but it is difficult to know for sure without a good profiler.
That said, instead of analyzing it further, I recommend the above solution so that you don't have to worry about where the allocator is allocating every teeny thing in memory.

Virtual function call cost is 1.5x that of a normal function call (with test case)

I have to decide whether to use templates or virtual inheritance.
In my situation, the trade-offs make it really hard to choose.
Finally, it boiled down to: "How much does a virtual call really cost (in CPU time)?"
I found very few resources that dare to measure the vtable cost in actual numbers, e.g. https://stackoverflow.com/a/158644, which points to page 26 of http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf.
Here is an excerpt from it:
However, this overhead (of virtual) is on the order of 20% and 12% – far less than the variability between compilers.
Before relying on that claim, I decided to test it myself.
My test code is a little long (~40 lines); you can also see it in action at the links below.
The number is the ratio of the time spent on virtual calls divided by the time spent on normal calls.
Unexpectedly, the result contradicts what open-std states:
http://coliru.stacked-crooked.com/a/d4d161464e83933f : 1.58
http://rextester.com/GEZMC77067 (with custom -O2): 1.89
http://ideone.com/nmblnK : 2.79
My own desktop computer (Visual C++, -O2) : around 1.5
Here it is:
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <vector>
using namespace std;

class B2 {
public:
    int randomNumber = ((double) rand() / RAND_MAX) * 10;
    virtual ~B2() = default;
    virtual int f(int n) { return -n + randomNumber; }
    int g(int n) { return -n + randomNumber; }
};

class C : public B2 {
public:
    int f(int n) override { return n - randomNumber; }
};

int main() {
    std::vector<B2*> bs;
    const int numTest = 1000000;
    for (int n = 0; n < numTest; n++) {
        if (((double) rand() / RAND_MAX) > 0.5) {
            bs.push_back(new B2());
        } else {
            bs.push_back(new C());
        }
    }
    auto t1 = std::chrono::system_clock::now();
    int s = 0;
    for (int n = 0; n < numTest; n++) {
        s += bs[n]->f(n);
    }
    auto t2 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        s += bs[n]->g(n);
    }
    auto t3 = std::chrono::system_clock::now();
    auto t21 = t2 - t1;
    auto t32 = t3 - t2;
    std::cout << t21.count() << " " << t32.count()
              << " ratio=" << (((float) t21.count()) / t32.count()) << std::endl;
    std::cout << s << std::endl;
    for (int n = 0; n < numTest; n++) {
        delete bs[n];
    }
}
Question
Is it to be expected that a virtual call is at least 50% slower than a normal call?
Did I test it the wrong way?
I have also read:
AI Applications in C++: How costly are virtual functions? What are the possible optimizations?
Virtual functions and performance - C++

Cannot resize a C++ std::vector that's a member variable of a class

Code (simplified version).
(Part of) the class definition:
struct foo {
    std::vector<int> data;
    foo(int a = 0) : data(a + 1, 0) {}
    void resize(int a) {
        data.resize(a + 1, 0);
    }
};
The a+1 part is because I want the data to be 1-indexed to simplify some operations.
In global scope:
int k;
foo bar;
In the main function:
std::cin>>k;
bar.resize(k);
Later in the main function, there is a call to another member function (in foo) that accesses the data, causing a segmentation fault (SIGSEGV).
After debugging, I found that data.size() returns 0, which is very unexpected.
After a very long session of debugging, I feel very confident that the problem is with the resizing, which shouldn't cause any problems (it's from the standard library, after all!).
P.S. Don't accuse me of putting things in global scope or giving public access to class members. I'm not writing any "real" program; I'm just practicing for a programming competition.
After a very long session of debugging, I feel very confident that the problem is with the resize
It is almost certain that:
The issue doesn't have anything to do with resize().
You have a memory-related bug somewhere (double delete, uninitialized/dangling pointer, buffer overrun etc).
The thing with memory-related bugs is that they can be completely symptomless until well after the buggy code has done the damage.
My recommendation would be to run your program under valgrind (or at least show us an SSCCE that doesn't work for you).
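For example, assuming your binary is ./a.out (memcheck is valgrind's default tool):
>valgrind --leak-check=full ./a.out
Valgrind will report invalid reads and writes at the point where they happen, which is usually much closer to the real bug than the eventual crash.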
The following works fine for me
#include <vector>
#include <cstdlib>

struct foo {
    std::vector<int> data;
    explicit foo(int a = 0) : data(a + 1, 0) {}
    void resize(int a) {
        data.resize(a + 1, 0);
    }
};

int main() {
    foo test_foo(1);
    for (size_t i = 0; i < 1000; ++i) {
        int a = std::rand() % 65536;
        test_foo.resize(a);
        if (test_foo.data.size() != a + 1)
            return 1;
    }
    return 0;
}

Instrumenting C/C++ code using LLVM

I want to write an LLVM pass to instrument every memory access.
Here is what I am trying to do.
Given any C/C++ program (like the one given below), I am trying to insert calls to some function before and after every instruction that reads from or writes to memory. For example, consider the C++ program below (Account.cpp):
#include <stdio.h>

class Account {
    int balance;
public:
    Account(int b)
    {
        balance = b;
    }
    ~Account() { }
    int read()
    {
        int r;
        r = balance;
        return r;
    }
    void deposit(int n)
    {
        balance = balance + n;
    }
    void withdraw(int n)
    {
        int r = read();
        balance = r - n;
    }
};

int main()
{
    Account* a = new Account(10);
    a->deposit(1);
    a->withdraw(2);
    delete a;
}
So after the instrumentation, my program should look like this:
#include <stdio.h>

class Account
{
    int balance;
public:
    Account(int b)
    {
        balance = b;
    }
    ~Account() { }
    int read()
    {
        int r;
        foo();
        r = balance;
        foo();
        return r;
    }
    void deposit(int n)
    {
        foo();
        balance = balance + n;
        foo();
    }
    void withdraw(int n)
    {
        foo();
        int r = read();
        foo();
        foo();
        balance = r - n;
        foo();
    }
};

int main()
{
    Account* a = new Account(10);
    a->deposit(1);
    a->withdraw(2);
    delete a;
}
where foo() may be any function, such as one that gets the current system time or increments a counter, and so on.
Please give me examples (source code, tutorials, etc.) and steps on how to run it. I have read the tutorial on writing an LLVM pass at http://llvm.org/docs/WritingAnLLVMPass.html, but I couldn't figure out how to write a pass for the above problem.
I'm not very familiar with LLVM, but I am a bit more familiar with GCC (and its plugin machinery), since I am the main author of GCC MELT (a high-level domain-specific language to extend GCC, which, by the way, you could use for your problem). So I will try to answer in general terms.
You should first know why you want to adapt a compiler (or a static analyzer). It is a worthwhile goal, but it does have drawbacks (in particular, with respect to redefining some operators or other constructs in your C++ program).
The main point when extending a compiler (be it GCC or LLVM or something else) is that you very probably have to handle all of its internal representation (and you probably cannot skip parts of it, unless you have a very narrowly defined problem). For GCC, that means handling the more than 100 kinds of Tree-s and nearly 20 kinds of Gimple-s: in the GCC middle end, the tree-s represent operands and declarations, and the gimple-s represent instructions.
The advantage of this approach is that once you've done that, your extension should be able to handle any software accepted by the compiler. The drawback is the complexity of the compilers' internal representations (which is explained by the complexity of the definitions of the C and C++ source languages accepted by the compilers, by the complexity of the target machine code they generate, and by the increasing distance between the source and target languages).
So hacking a general compiler (be it GCC or LLVM), or a static analyzer (like Frama-C), is quite a big task (more than a month of work, not a few days). For dealing only with a tiny C++ program like the one you are showing, it is not worth it. But it is definitely worth the effort if you plan to deal with large source software bases.
Regards
Try something like this (you need to fill in the blanks and make the iterator loop work despite the fact that items are being inserted):
class ThePass : public llvm::BasicBlockPass {
public:
    ThePass() : BasicBlockPass() {}
    virtual bool runOnBasicBlock(llvm::BasicBlock &bb);
};

bool ThePass::runOnBasicBlock(BasicBlock &bb) {
    bool retval = false;
    for (BasicBlock::iterator bbit = bb.begin(), bbie = bb.end(); bbit != bbie;
         ++bbit) { // Make the loop work given the insertions
        Instruction *i = bbit;
        CallInst *beforeCall = // INSERT THIS
        beforeCall->insertBefore(i);
        if (!i->isTerminator()) {
            CallInst *afterCall = // INSERT THIS
            afterCall->insertAfter(i);
        }
    }
    return retval;
}
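To make the blanks a bit more concrete, here is a rough sketch of one way to create those calls with IRBuilder on a more recent LLVM (this assumes an external void foo(), instruments only loads and stores, and omits the pass-registration boilerplate from the tutorial; the exact API varies between LLVM versions):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a call to "void foo()" before and after every load/store in F.
static bool instrumentMemoryAccesses(Function &F) {
    Module *M = F.getParent();
    LLVMContext &Ctx = M->getContext();
    FunctionCallee Foo = M->getOrInsertFunction(
        "foo", FunctionType::get(Type::getVoidTy(Ctx), /*isVarArg=*/false));

    bool Changed = false;
    for (BasicBlock &BB : F) {
        // Collect the memory instructions first so inserting calls does not
        // invalidate the iteration.
        SmallVector<Instruction *, 16> MemOps;
        for (Instruction &I : BB)
            if (isa<LoadInst>(I) || isa<StoreInst>(I))
                MemOps.push_back(&I);

        for (Instruction *I : MemOps) {
            IRBuilder<> Before(I);               // insertion point: just before I
            Before.CreateCall(Foo);
            IRBuilder<> After(I->getNextNode()); // loads/stores are never terminators
            After.CreateCall(Foo);
            Changed = true;
        }
    }
    return Changed;
}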
Hope this helps!

C++ virtual function call versus boost::function call speedwise

I wanted to know how fast a single-inheritance virtual function call is compared to an equivalent boost::function call. Are they almost the same in performance, or is boost::function slower?
I'm aware that performance may vary from case to case, but, as a general rule, which is faster, and to how large a degree?
Thanks,
Guilherme
-- edit
KennyTM's test was sufficiently convincing for me. boost::function doesn't seem to be that much slower than a vcall for my own purposes. Thanks.
As a very special case, consider calling an empty function 10^9 times.
Code A:
struct X {
    virtual ~X() {}
    virtual void do_x() {}
};
struct Y : public X {}; // for the paranoid.

int main() {
    Y* x = new Y;
    for (int i = 100000000; i >= 0; --i)
        x->do_x();
    delete x;
    return 0;
}
Code B (with Boost 1.41):
#include <boost/function.hpp>

struct X {
    void do_x() {}
};

int main() {
    X* x = new X;
    boost::function<void (X*)> f;
    f = &X::do_x;
    for (int i = 100000000; i >= 0; --i)
        f(x);
    delete x;
    return 0;
}
Compile with g++ -O3, then time the runs with the time command:
Code A takes 0.30 seconds.
Code B takes 0.54 seconds.
Inspecting the assembly code, it seems that the slowness may be due to exception handling and to checking for the possibility that f is empty (NULL). But given that the price of one boost::function call is only 2.4 nanoseconds (on my 2 GHz machine), the actual code in your do_x() would likely dwarf it. I would say it's not a reason to avoid boost::function.
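As a side note, that empty-target check is observable from user code: calling an empty boost::function throws boost::bad_function_call. A minimal, illustrative snippet (not part of the benchmark above):

#include <boost/function.hpp>
#include <iostream>

int main() {
    boost::function<void ()> f;   // default-constructed: holds no target
    try {
        f();                      // the per-call emptiness check fires here
    } catch (const boost::bad_function_call &e) {
        std::cout << "empty target: " << e.what() << '\n';
    }
    return 0;
}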