I am here to ask whether my perception is actually correct.
I originally thought that constructing vector<T> v(someSize, init_value) would call a function such as vector<T>::reserve rather than vector<T>::push_back. I found some discussion related to this here: std::vector push_back is bottleneck, but this question is slightly different in its idea.
Running some experiments, I noticed that vector<T> v(someSize, init_value) seems to call push_back after all. Is this true? I have the following report from uftrace (https://github.com/namhyung/uftrace).
Avg total Min total Max total Function
========== ========== ========== ====================================
858.323 ms 858.323 ms 858.323 ms main
618.245 ms 618.245 ms 618.245 ms sortKaway
234.795 ms 234.795 ms 234.795 ms std::sort
72.752 us 72.752 us 72.752 us std::vector::_M_fill_initialize
65.788 us 49.551 us 82.026 us std::vector::vector
20.292 us 11.387 us 68.629 us std::vector::_M_emplace_back_aux
18.722 us 17.263 us 20.181 us std::equal
18.472 us 18.472 us 18.472 us std::vector::~vector
17.891 us 10.002 us 102.079 us std::vector::push_back // push_back?!
Does vector<T>::reserve also end up calling vector<T>::push_back eventually? Is there a faster way to construct the vector?
The above was the original post. After some comments, I tested a simple version, and realized I was completely mistaken.
#include <vector>
#include <functional>
#include <queue>
#include <cassert>
using namespace std; // for the time being
int main () {
    vector<int> v(10, 0);
    return 0;
}
This actually results in the following, which doesn't involve std::vector<T>::push_back.
# Function Call Graph for 'main' (session: 9ce7f6bb33885ff7)
=============== BACKTRACE ===============
backtrace #0: hit 1, time 12.710 us
[0] main (0x4009c6)
========== FUNCTION CALL GRAPH ==========
12.710 us : (1) main
0.591 us : +-(1) std::allocator::allocator
0.096 us : | (1) __gnu_cxx::new_allocator::new_allocator
: |
6.880 us : +-(1) std::vector::vector
4.338 us : | +-(1) std::_Vector_base::_Vector_base
0.680 us : | | +-(1) std::_Vector_base::_Vector_impl::_Vector_impl
0.445 us : | | | (1) std::allocator::allocator
0.095 us : | | | (1) __gnu_cxx::new_allocator::new_allocator
: | | |
3.294 us : | | +-(1) std::_Vector_base::_M_create_storage
3.073 us : | | (1) std::_Vector_base::_M_allocate
2.849 us : | | (1) std::allocator_traits::allocate
2.623 us : | | (1) __gnu_cxx::new_allocator::allocate
0.095 us : | | +-(1) __gnu_cxx::new_allocator::max_size
: | | |
1.867 us : | | +-(1) operator new
: | |
2.183 us : | +-(1) std::vector::_M_fill_initialize
0.095 us : | +-(1) std::_Vector_base::_M_get_Tp_allocator
: | |
1.660 us : | +-(1) std::__uninitialized_fill_n_a
1.441 us : | (1) std::uninitialized_fill_n
1.215 us : | (1) std::__uninitialized_fill_n::__uninit_fill_n
0.988 us : | (1) std::fill_n
0.445 us : | +-(1) std::__niter_base
0.096 us : | | (1) std::_Iter_base::_S_base
: | |
0.133 us : | +-(1) std::__fill_n_a
Sorry for the confusion. Yes, the library implementation works as we expect: it doesn't involve push_back when the vector is constructed with an initial size.
Phew, looks like you answered your own question! I was extremely confused for a moment. Hypothetically, I could imagine some obscure vector implementation building its fill constructor out of reserve and push_back, but that is practically inconceivable in any decent implementation, and certainly not in high-quality ones like the libstdc++ that accompanies GCC.
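For reference, here is a minimal sketch of what a fill constructor along these lines does. This is illustrative only (toy_vector is an invented name), not the actual libstdc++ code, which goes through the allocator machinery and handles exception safety far more carefully:
#include <cstddef>
#include <memory>

// Illustrative fill constructor: one allocation up front, then in-place
// construction of every element. No reserve, no push_back.
template <class T>
struct toy_vector {
    T* begin_ = nullptr;
    T* end_ = nullptr;

    toy_vector(std::size_t n, const T& value) {
        std::allocator<T> alloc;
        begin_ = alloc.allocate(n);                           // cf. _M_create_storage
        end_   = std::uninitialized_fill_n(begin_, n, value); // cf. _M_fill_initialize
    }

    ~toy_vector() {
        for (T* p = begin_; p != end_; ++p) p->~T();
        std::allocator<T>().deallocate(begin_, end_ - begin_);
    }
};
toy_vector<int> v(10, 0); would produce exactly the allocate-then-fill call pattern visible in the trace above.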
On the contrary: this was almost two decades ago, but I tried to implement my own version of std::vector in hopes of matching its performance. This wasn't just some idle exercise. We had a software development kit and wanted some basic C++ containers for it, but the SDK had the goal of allowing people to write plugins for our software using compilers (and standard library implementations) different from ours. So we couldn't safely use std::vector across that boundary, since our version might not match the plugin writer's, and we were forced to begrudgingly roll our own containers for the SDK.
Instead I found std::vector to be incredibly efficient in ways that were hard to match, especially for plain old data types with trivial ctors and dtors. Again, this was over a decade ago, but I found that using the fill constructor of vector<int> in MSVC 5 or 6 (I forget which) translated to the same disassembly as using memset, in a way that my naive version, which simply looped through the elements and used placement new on them regardless of whether they were PODs, did not. The range ctor likewise effectively translated to a very fast memcpy for PODs. And that's precisely what made vector so hard to beat for me, at least back then: without getting deep into type traits and special-casing PODs, I couldn't match vector's performance for PODs. I could match it for UDTs, but most of our performance-critical code tended to use PODs.
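To illustrate the kind of special-casing I mean, here is a hedged modern sketch (C++17; copy_construct_range is an invented name, and the MSVC-era code would have used hand-rolled traits instead):
#include <cstring>
#include <new>
#include <type_traits>

// Bulk-copy trivially copyable ("POD-like") ranges with memcpy,
// fall back to per-element placement new for everything else.
template <class T>
void copy_construct_range(const T* first, const T* last, T* out) {
    if constexpr (std::is_trivially_copyable_v<T>) {
        std::memcpy(out, first, (last - first) * sizeof(T));  // range ctor as memcpy
    } else {
        for (; first != last; ++first, ++out)
            ::new (static_cast<void*>(out)) T(*first);        // one ctor call per element
    }
}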
So chances are that popular vector implementations today are at least as efficient as back when I conducted those tests, and I wanted to pitch in as a kind of reassurance that your vector implementation is most likely damned fast. The last thing I'd expect it to do is implement its fill ctor or range ctor using push_back.
While creating a version of std::basic_string_view for a private project (the choices were made for me: C++11; no boost:: allowed; and, with a pinch of NIH, no GSL either) I came to implement std::basic_string_view::max_size(), for which the standard (N4820, 21.4.2.3 Capacity) simply says:
Returns: The largest possible number of char-like objects that can be referred to by a basic_string_view.
Logically, this would be the maximum number that std::basic_string_view::size_type can represent: std::numeric_limits<std::basic_string_view::size_type>::max(), which comes out to 18446744073709551615 on my platform, where size_type is std::size_t.
I figured that, since I want to be compatible with the standard libraries, I should ensure that I arrive at the same number as other implementations. This is where I get lost.
Given auto max_size = string_view{"foo"}.max_size(), I get the following results:
+--------------+--------------------------+
| Library | Result |
+--------------+--------------------------+
| libstdc++ | 4611686018427387899 |
| libc++ | 18446744073709551615 |
| boost 1.72.0 | 3 |
+--------------+--------------------------+
If my interpretation is correct, then that means that libc++ and I agree on what the value should be. I feel that boost is completely wrong, since the specification for max_size is to return the largest possible number that a string_view, not this string_view, can refer to. However, as noted in the comments, boost::string_view predates the standard, so it is unfair to call it "completely wrong". Further, looking at the implementations of all three libraries: libc++ returns
numeric_limits<size_type>::max();
libstdc++ returns
(npos - sizeof(size_type) - sizeof(void*)) / sizeof(value_type) / 4;
and boost returns:
len_;
Basically, two implementations appear to be wrong, but the question is: which one is correct?
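To make the comparison concrete, here is a sketch of the two standard-library answers written as free functions. The names are invented for illustration, and the libstdc++ comment merely restates the expression quoted above, since the rationale isn't documented there:
#include <cstddef>
#include <limits>

// libc++-style answer: the largest value size_type can represent.
constexpr std::size_t max_size_theoretical() {
    return std::numeric_limits<std::size_t>::max();  // 18446744073709551615 on 64-bit
}

// libstdc++-style answer: discount some per-allocation overhead,
// then divide by 4, per the expression quoted above.
template <class CharT>
constexpr std::size_t max_size_conservative() {
    constexpr std::size_t npos = std::numeric_limits<std::size_t>::max();
    return (npos - sizeof(std::size_t) - sizeof(void*)) / sizeof(CharT) / 4;
}
For CharT = char this yields 4611686018427387899, matching the libstdc++ row of the table above.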
I am trying to understand Fortran 77 code but stumbled on an EQUIVALENCE statement in the code.
Here is part of the code:
REAL*8 DATA1(0:N-1)
COMPLEX*16 DATA2(0:N/2-1)
EQUIVALENCE(DATA1, DATA2)
...
...
CALL FFT(DATA1, N/2, -1)
Basically, the FFT subroutine is a one-dimensional complex-to-complex FFT engine. There are some permutations and matrix-vector multiplications in the subroutine.
The code uses DATA2 later in this manner:
K0=DATA2(0)
K1=DCONJG(DATA2(0))
Can anyone give me a clue about why the EQUIVALENCE statement is used? My assumption is that DATA1, which is REAL, is reinterpreted as DATA2, which is COMPLEX, after some changes are performed by the FFT subroutine. But if that is so, what about the imaginary part of DATA2, given that the FFT subroutine only contains REAL variables? And why are the array sizes of DATA1 and DATA2 different?
I cannot find any answer on this forum that satisfies my question. Thanks for your answers; it would help me a lot.
equivalence is one of Fortran's two features for storage-association of entities. (The other is common blocks on which topic I will remain silent here). The equivalence statement declares that the entities named in its argument list share the same storage locations. In this case data1 and data2 share the same memory locations.
If you have a tool for inspecting memory locations and point it at data1 you'll see something like this:
+----------+----------+----------+----------+----------+----------+
| | | | | |
| data1(0) | data1(1) | data1(2) | data1(3) | data1(4) | data1(...
| | | | | |
+----------+----------+----------+----------+----------+----------+
Point the same tool at data2 and you'll see something like this
+----------+----------+----------+----------+----------+----------+
| | |
| data2(0) | data2(1) | data2(....
| re im | re im | re im
+----------+----------+----------+----------+----------+----------+
but the 'truth' is rather more like
+----------+----------+----------+----------+----------+----------+
| | |
| data1(0) data1(1) | data1(2) data1(3) | data1(4) data1(...
| | |
| data2(0) | data2(1) | data2(....
| re im | re im | re im
+----------+----------+----------+----------+----------+----------+
data1(0) is at the same location as the real component of data2(0). data1(1) is the imaginary component of data2(0), and so forth.
This is one of the applications of equivalence which one still occasionally comes across -- being able to switch between viewing data as complex or as pairs of reals. However, it's not confined to this kind of type conversion; there's nothing to say you can't equivalence integers and reals, or any other types.
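As an aside for readers more at home in C++ (the language of the rest of this page): the closest sanctioned analogue is the C++11 guarantee that std::complex<T> may be reinterpreted as an array of two Ts, which gives exactly this real/complex view-switching:
#include <cassert>
#include <complex>

int main() {
    // DATA2-style view: two complex values...
    std::complex<double> data2[2] = {{1.0, 2.0}, {3.0, 4.0}};
    // ...and the DATA1-style view of the same storage as four reals.
    // C++11 explicitly blesses this reinterpretation for std::complex.
    double* data1 = reinterpret_cast<double*>(data2);
    assert(data1[0] == 1.0 && data1[1] == 2.0);  // re, im of data2[0]
    assert(data1[2] == 3.0 && data1[3] == 4.0);  // re, im of data2[1]
}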
Another use one occasionally still sees is the use of equivalence for remapping arrays from one rank to another. For example, given
integer, dimension(3,2) :: array2
integer, dimension(6) :: array1
and
equivalence(array1(1),array2(1,1))
the same elements can be treated as belonging to a rank-2 array or to a rank-1 array to suit the program's needs.
equivalence is generally frowned upon these days; most of what it has been used for can be done more safely with modern Fortran. For more, you might care to look at my answer to Is the storage of COMPLEX in fortran guaranteed to be two REALs?
How is the output of SBCL's deterministic profiler to be interpreted?
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------
seconds (total execution time), calls (overall number of calls), sec/call (average time per call), and name (well, the name of the function) are quite straightforward. What do consed and gc mean?
I guess consed reports the allocated memory (though in what unit?) and I'd say that gc reports the number of units reclaimed by the GC, but those two values never match, and they even use different representation schemes (gc has a . every three digits and consed a ,).
E.g. what would this example output tell me (if my guess is right I'd have a massive memory leak):
seconds | gc | consed | calls | sec/call | name
-------------------------------------------------------
0.011 | 0.000 | 965,488 | 6 | 0.001817 | PACKAGE:NAME
-------------------------------------------------------
0.011 | 0.000 | 965,488 | 6 | | Total
The columns are easier to interpret if you are familiar with the output of (time …) under SBCL (abbreviated):
Evaluation took:
0.771 seconds of real time
[ Run times consist of 0.061 seconds GC time, and 0.639 seconds non-GC time. ]
166,575,680 bytes consed
In your example, package:name was at the top of the call stack (among profiled functions only) for 0.011 seconds, GC did not affect it (either because it never ran during this time or because it was too fast to measure), and 965,488 bytes of SBCL-managed memory were allocated.
The amount of memory that was used and became unused after each GC cannot be broken down per function, because this information is not tracked. You can measure overall memory consumption by evaluating (sb-ext:gc :full t) (room) before and after, but note that the reported amount fluctuates slightly, that it does not include memory allocated by foreign code (C libraries, if your application uses them), and that the last three results and expressions at the REPL are retained.
I wondered if unordered_map is implemented using type erasure, since an unordered_map<Key, A*> and unordered_map<Key, B*> can use exactly the same code (apart from casting, which is a no-op in machine code). That is, the implementation of both could be based on unordered_map<Key, void*> to save code size.
Update: This technique is commonly referred to as the Thin Template Idiom (Thanks to the commenters below for pointing that out).
Update 2: I would be particularly interested in Howard Hinnant's opinion. Let's hope he reads this.
So I wrote this small test:
#include <iostream>
#include <string>
#if BOOST
# include <boost/unordered_map.hpp>
using boost::unordered_map;
#else
# include <unordered_map>
using std::unordered_map;
#endif
struct A { A(int x) : x(x) {} int x; };
struct B { B(int x) : x(x) {} int x; };
int main()
{
#if SMALL
    unordered_map<std::string, void*> ma, mb;
#else
    unordered_map<std::string, A*> ma;
    unordered_map<std::string, B*> mb;
#endif
    ma["foo"] = new A(1);
    mb["bar"] = new B(2);
    std::cout << ((A*) ma["foo"])->x << std::endl;
    std::cout << ((B*) mb["bar"])->x << std::endl;
    // yes, it leaks.
}
And determined the size of the compiled output with various settings:
#!/bin/sh
for BOOST in 0 1 ; do
  for OPT in 2 3 s ; do
    for SMALL in 0 1 ; do
      clang++ -stdlib=libc++ -O${OPT} -DSMALL=${SMALL} -DBOOST=${BOOST} map_test.cpp -o map_test
      strip map_test
      SIZE=$(echo "scale=1;$(stat -f "%z" map_test)/1024" | bc)
      echo boost=$BOOST opt=$OPT small=$SMALL size=${SIZE}K
    done
  done
done
It turns out that, with all the settings I tried, lots of the inner code of unordered_map gets instantiated twice:
With Clang and libc++:
| -O2 | -O3 | -Os
-DSMALL=0 | 24.7K | 23.5K | 28.2K
-DSMALL=1 | 17.9K | 17.2K | 19.8K
With Clang and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 23.9K | 23.9K | 32.5K
-DSMALL=1 | 17.4K | 17.4K | 22.3K
With GCC and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 21.8K | 21.8K | 35.5K
-DSMALL=1 | 16.4K | 16.4K | 26.2K
(With the compilers from Apple's Xcode)
Now to the question: is there some convincing technical reason why the implementers have chosen to omit this simple optimization?
Also: why the hell is the effect of -Os exactly the opposite of what is advertised?
Update 3:
As suggested by Nicol Bolas, I have repeated the measurements with shared_ptr<void/A/B> instead of naked pointers (created with make_shared and cast with static_pointer_cast). The tendency in the results is the same:
With Clang and libc++:
| -O2 | -O3 | -Os
-DSMALL=0 | 27.9K | 26.7K | 30.9K
-DSMALL=1 | 25.0K | 20.3K | 26.8K
With Clang and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 35.3K | 34.3K | 43.1K
-DSMALL=1 | 27.8K | 26.8K | 32.6K
Since I've been specifically asked to comment, I will, though I'm not sure I have much more to add than has already been said. (sorry it took me 8 days to get here)
I've implemented the thin template idiom before, for some containers, namely vector, deque and list. I don't currently have it implemented for any container in libc++. And I've never implemented it for the unordered containers.
It does save on code size. It also adds complexity, much more so than the referenced wikibooks link implies. One can also do it for more than just pointers. You can do it for all scalars which have the same size. For example, why have different instantiations for int and unsigned? Even ptrdiff_t can be stored in the same instantiation as T*. After all, it is all just a bag of bits at the bottom. But it is extremely tricky to get the member templates which take a range of iterators correct when playing these tricks. A minimal sketch of the pointer case follows.
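Here is a hedged sketch of the pointer case of the idiom (illustrative only, with invented names, and ignoring the iterator-range subtleties just mentioned):
#include <cstddef>
#include <vector>

// Untyped core: every pointer instantiation shares this one body of code.
class ptr_vector_impl {
    std::vector<void*> v_;
public:
    void push_back(void* p)       { v_.push_back(p); }
    void* at(std::size_t i) const { return v_.at(i); }
    std::size_t size() const      { return v_.size(); }
};

// Thin typed facade: the per-T code is just casts, which inline away.
template <class T>
class ptr_vector : private ptr_vector_impl {
public:
    void push_back(T* p)       { ptr_vector_impl::push_back(p); }
    T* at(std::size_t i) const { return static_cast<T*>(ptr_vector_impl::at(i)); }
    using ptr_vector_impl::size;
};
ptr_vector<A> and ptr_vector<B> would then share a single instantiation of the container machinery, which is the kind of saving measured above.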
There are disadvantages though (besides difficulty of implementation). It doesn't play nearly as nicely with the debugger. At the very least it makes it much more difficult for the debugger to display container innards. And while the code size savings can be significant, I would stop short of calling the code size savings dramatic. Especially when compared to the memory required to store the photographs, animations, audio clips, street maps, years of email with all of the attachments from your best friends and family, etc. I.e. optimizing code size is important. But you should take into account that in many apps today (even on embedded devices), if you cut your code size in half, you might cut your app size by 5% (statistics admittedly pulled from thin air).
My current position is that this particular optimization is one best paid for and implemented in the linker instead of in the template container. Though I know this isn't easy to implement in the linker, I have heard of successful implementations.
That being said, I still do try to make code size optimizations in templates. For example, in libc++, helper structures such as __hash_map_node_destructor are templated on as few parameters as possible, so that if any of their code gets outlined, one instantiation of the helper is more likely to serve more than one instantiation of unordered_map. This technique is debugger friendly, and not that hard to get right. It can even have some positive side effects for the client when applied to iterators (N2980).
In summary, I wouldn't hold it against code for going the extra mile and implementing this optimization. But I also wouldn't classify it as high a priority as I did a decade ago, both because linker technology has progressed, and the ratio of code size to application size has tended to fairly dramatically decrease.
When you have a void* parameter there is no type checking at compile-time.
Such maps as you propose would be a flaw in a program, since they would accept value elements of type A*, B*, and ever more unimaginably fancy types that have no business being in that map (for example int*, float*, std::string*, CString*, CWnd*... imagine the mess in your map).
Your optimisation is premature. And premature optimization is the root of all evil.
Is there a way to improve the boost ublas product performance?
I have two matrices A and B which I want to multiply/add/subtract/...
In MATLAB vs. C++ I get the following times [s] for operations on 2000x2000 matrices:
OPERATION | MATLAB | C++ (MSVC10)
A + B | 0.04 | 0.04
A - B | 0.04 | 0.04
AB | 1.0 | 62.66
A'B' | 1.0 | 54.35
Why is there such a huge performance loss here?
The matrices are only real doubles.
But I also need products of positive definite, symmetric, and rectangular matrices.
EDIT:
The code is trivial
using namespace boost::numeric::ublas;

matrix<double> A( 2000 , 2000 );
// Fill Matrix A
matrix<double> B = A;
matrix<double> C = A + B;
matrix<double> D = A - B;
matrix<double> E = prod(A,B);
matrix<double> F = prod(trans(A),trans(B));
EDIT 2:
The results are mean values of 10 runs. The stddev was less than 0.005.
I would expect a factor of 2-3 maybe, but not 50 (!)
EDIT 3:
Everything was benched in Release ( NDEBUG/MOVE_SEMANTICS/.. ) mode.
EDIT 4:
Preallocated Matrices for the product results did not affect the runtime.
Post your C++ code for advice on any possible optimizations.
You should be aware, however, that MATLAB is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand, Boost is free, while MATLAB decidedly is not.
I believe the best Boost performance can be had by binding the uBLAS code to an underlying LAPACK implementation.
You should use noalias on the left-hand side of matrix multiplications in order to get rid of unnecessary copies.
Instead of E = prod(A,B); use noalias(E) = prod(A,B);
From the documentation:
If you know for sure that the left hand expression and the right hand expression have no common storage, then assignment has no aliasing. A more efficient assignment can be specified in this case: noalias(C) = prod(A, B); This avoids the creation of a temporary matrix that is required in a normal assignment. 'noalias' assignment requires that the left and right hand side be size conformant.
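A minimal usage sketch putting this together (standard uBLAS API; the dimensions match the question):
#include <boost/numeric/ublas/matrix.hpp>

int main() {
    using namespace boost::numeric::ublas;
    matrix<double> A(2000, 2000), B(2000, 2000), E(2000, 2000);
    // Default assignment materializes a temporary to guard against aliasing:
    // E = prod(A, B);
    // E shares no storage with A or B, so the temporary can be skipped:
    noalias(E) = prod(A, B);
}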
There are many efficient BLAS implementations, such as ATLAS, GotoBLAS, and MKL; use one of them instead.
I won't pick at the code, but I would guess that ublas::prod(A, B) uses three nested loops, with no blocking, and is not cache friendly. If that's true, prod(A, trans(B)) will be much faster than the others.
If CBLAS is available, use cblas_dgemm to do the calculation. If not, you can simply rearrange the data, that is, use prod(A, trans(B)) instead.
You don't know what role memory management is playing here. prod has to allocate a 32 MB matrix, and so does trans, twice, and then you're doing all that 10 times. Take a few stackshots and see what it's really doing. My dumb guess is that if you pre-allocate the matrices you'll get a better result.
Other ways matrix multiplication could be sped up are:
pre-transposing the left-hand matrix, to be cache-friendly, and
skipping over zeros: a value is contributed only if both A(i,k) and B(k,j) are non-zero.
Whether this is done in uBLAS is anybody's guess.
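To illustrate the pre-transposition point, here is a hedged sketch (not uBLAS internals; this variant transposes the right-hand operand, assuming row-major n x n storage, with C pre-sized to n*n by the caller):
#include <cstddef>
#include <vector>

// Pre-transposing B makes the inner loop read both operands sequentially,
// which is far friendlier to the cache than striding down B's columns.
void matmul_pretransposed(const std::vector<double>& A,
                          const std::vector<double>& B,
                          std::vector<double>& C, std::size_t n) {
    std::vector<double> Bt(n * n);
    for (std::size_t k = 0; k < n; ++k)          // Bt = transpose(B)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + k] = B[k * n + j];
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)  // both rows contiguous
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
}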