Declaring a very large vector of ints? - c++

Is there a way to do this in C++ without having things crash on runtime?
Right now I am declaring
vector<vector<int> > myvec(veclength);
How can I crank up veclength as high as it will go (properly)? Even at 10^7 it crashes when I should have more than enough computer memory.

This should take approximately 250 MiB of space (see note 1) or less, depending on architecture, so memory definitely isn’t the problem here. Neither should max_size be, which would be on the order of 10^17 (≈ 2^64 / (8+8+8)).
I should mention that I corroborated these calculations by looking at the implementations of std::vector in GCC's libstdc++ and LLVM's libc++, and by testing on a live system. The calculated values correspond 1:1 to the real implementations, and the OP’s code works flawlessly with veclength = 10e7.
I therefore conclude that the real cause is elsewhere.
1) Calculated by approximating the size of each individual vector by three 64 bit integers to denote begin pointer, size and capacity respectively, and assuming that an empty vector has a default capacity of 0. Actual implementations may differ but probably not by much.
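The arithmetic in the note can be reproduced on any machine. The following is only a sketch: `estimated_outer_bytes` and `outer_max_size` are hypothetical helper names, and the estimate assumes that empty inner vectors allocate nothing beyond their own object.

```cpp
#include <cstddef>
#include <vector>

// Rough footprint of the outer vector's element storage only:
// n empty inner vectors, each occupying sizeof(std::vector<int>) bytes.
// Assumption: an empty inner vector owns no heap memory.
std::size_t estimated_outer_bytes(std::size_t n) {
    return n * sizeof(std::vector<int>);
}

// Upper bound this particular implementation reports for the outer vector.
std::size_t outer_max_size() {
    return std::vector<std::vector<int> >().max_size();
}
```

Where sizeof(std::vector<int>) is 24 (three 64-bit words), estimated_outer_bytes(10000000) comes to 240,000,000 bytes, which is in line with the figure quoted above.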

Based on my comment above, I think I might have a solution for you.
See max_size and test what the max_size is on your machine. The key here is that the vector size limit is reduced as the size of the vector elements increases, and since you have a vector of vector of int, the size of the outer vector is likely to be quite limited.
Here is the result one person got by running the above program. Note that a vector of int (size 4) has a max size of 1073741823, which is about 10^9. You're using a vector<vector<int> >, which will take up significantly more space per element, and thus significantly reduce the max size.
Max elements that can be inserted into a vector having elements of size '1' is: 4294967295
Max elements that can be inserted into a vector having elements of size '4' is: 1073741823
Max elements that can be inserted into a vector having elements of size '8' is: 536870911
Max elements that can be inserted into a vector having elements of size '4' is: 1073741823
If you change your data structure to be a vector of vector<int> pointers, you'll be able to store many more. I know this may fundamentally change a lot of your corresponding functions and structures, but that's the limitations of vector.
To whom it may concern: http://codepad.org/nAoPi7cV
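#include <iostream>
#include <vector>
int main()
{
    // Print the size of one inner vector and the max_size of the outer vector.
    std::cout << "Max elements that can be inserted into a vector having elements of size '"
              << sizeof( std::vector<int> ) << "' is: "
              << std::vector<std::vector<int> >().max_size() << std::endl;
}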
Max elements that can be inserted into a vector having elements of size '4' is: 1073741823
Max elements that can be inserted into a vector having elements of size '28' is: 153391689
log(153391689) ~= 8.2
So its max_size is big enough to hold 10^7 on the codepad compiling machine. On lesser machines, it may not be.
Also note that even though this max size is given, the program segfaults on construction: http://codepad.org/agKMMEjQ
int main()
{
std::vector<std::vector<int> > myvec(153391689);
}
Segmentation fault
If you reduce this further to the size proposed by the asker (10^7), the program crashes yet again: http://codepad.org/zMG0VCeg
std::vector<std::vector<int> > myvec(10000000);
std::bad_alloc: St9bad_alloc
Aborted.
If you reduce your attempted size further though, the program runs happily: http://codepad.org/sbMPppgx
std::vector<std::vector<int> > myvec(100000);
std::cout << myvec.size();
100000
The above program is extremely simple and encounters the exact problems the asker specified - therefore the problem does not lie elsewhere - the limitation is encountered due to the std::vector class. Codepad links have been left in for easy reference, but I got the same numbers when testing in a local environment.
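One way to probe a given size without aborting the whole program is to catch the exceptions that the constructor may throw. This is a minimal sketch, with `can_allocate_outer` as a hypothetical helper name; it only handles std::bad_alloc and std::length_error, so a hard crash such as the segmentation fault shown above would not be intercepted by it.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Returns true if an outer vector of n empty inner vectors could be built.
// Only exceptions are handled; a crash inside the allocator is not.
bool can_allocate_outer(std::size_t n) {
    try {
        std::vector<std::vector<int> > probe(n);
        return probe.size() == n;
    } catch (const std::bad_alloc&) {
        return false;  // the allocator could not provide the memory
    } catch (const std::length_error&) {
        return false;  // n exceeds max_size()
    }
}
```

Calling can_allocate_outer(veclength) with increasing values gives a rough picture of where a particular environment gives up.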

Related

Why a program starts slow but later gains the full speed?

I noticed that sometimes a program runs very slowly at first but later the performance is good. For example, I have some code which I run in a loop and the first iteration takes ages but the other iterations of the same code run pretty fast. It's hard to name the circumstances because I can't figure them out, and it seems that even a single literal can affect this behavior. I prepared a small code snippet:
#include <chrono>
#include <vector>
#include <iostream>
using namespace std;
int main()
{
const int num{ 100000 };
vector<vector<int>> octs;
for (int i{ 0 }; i < num; ++i)
{
octs.emplace_back(vector<int>{ 42 });
}
vector<int> datas;
for (int i{ 0 }; i < num; ++i)
{
datas.push_back(42);
}
for (int n{ 0 }; n < 10; ++n)
{
cout << "start" << '\n';
//cout << 0 << "start" << '\n';
auto start = chrono::high_resolution_clock::now();
for (int i{ 0 }; i < num; ++i)
{
vector<int> points{ 42 };
}
auto end = chrono::high_resolution_clock::now();
auto time = chrono::duration_cast<chrono::milliseconds>(end - start);
cout << time.count() << '\n';
}
cin.get();
return 0;
}
The first two vectors are essential, at least with Visual Studio. Though they're not in use, they affect the performance a lot. Moreover, tweaking them also has a performance effect (like changing the order of initialization, or removing push_back and allocating the necessary size in the constructor). But this code as it is gives me the following results:
with gcc there're no problems at all
with clang the first iteration takes two times longer than the others
with vs2013 the first iteration is 100 (yes, one hundred) times slower.
Moreover, with vs2013 if I uncomment the line cout << 0 << "start" << '\n'; the performance problem goes away and all iterations are equal!
What's going on?
For your first two loops, probably the biggest performance consideration is going to be the allocation of memory, and the copying of the vector contents to the larger buffer. In this case, the fact that the loops appear to be 'gaining speed' is not surprising.
This is due to the implementation details of the vector class. Let's look at the documentation:
Internally, vectors use a dynamically allocated array to store their
elements. This array may need to be reallocated in order to grow in
size when new elements are inserted, which implies allocating a new
array and moving all elements to it. This is a relatively expensive
task in terms of processing time, and thus, vectors do not reallocate
each time an element is added to the container.
Instead, vector containers may allocate some extra storage to
accommodate for possible growth, and thus the container may have an
actual capacity greater than the storage strictly needed to contain
its elements (i.e., its size). Libraries can implement different
strategies for growth to balance between memory usage and
reallocations, but in any case, reallocations should only happen at
logarithmically growing intervals of size so that the insertion of
individual elements at the end of the vector can be provided with
amortized constant time complexity (see push_back).
So under the hood, the actual memory allocated for your vector might be much more than what you are actually using. So the vector only needs to do the costly re-allocation and copy when you add a new element to the vector which wouldn't fit into its current buffer. Moreover, since it says that re-allocations should only happen at logarithmically growing intervals, you can expect that the vector class is roughly doubling the buffer size every time it needs to re-allocate. But note that the vector implementations on various platforms are highly tuned to be optimal for the most common usage patterns for the class, which could be one factor in the different performance you are seeing across tool chains and platforms.
So you should see the loops be slow on the first several executions, and then gain more speed as push_back and emplace operations need to do fewer re-allocations and copies to accommodate the new elements.
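The growth behaviour described above can be observed directly by watching capacity() change during a run of push_back calls. A small sketch follows; `count_reallocations` is a hypothetical helper, and the exact count without reserve depends on the library's growth factor.

```cpp
#include <cstddef>
#include <vector>

// Counts how often the capacity changes while pushing n elements.
// With reserve_first == true the buffer is sized up front, so no
// reallocation should be needed during the loop.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(42);
        if (v.capacity() != last_capacity) {
            ++reallocations;  // the buffer was replaced by a larger one
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

Comparing count_reallocations(100000, false) with count_reallocations(100000, true) shows how few reallocations a geometric growth strategy needs, and that reserve removes them entirely.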
So I think this is the main fact you can use to reason about how long your first two loops should take to execute. But for your specific examples, due to the simplicity of the program, the compiler may be taking some liberties with what code it generates. So we could imagine that a sufficiently clever optimizing compiler might be able to see that your vectors will only be growing to a size which it knows at compile time, num. And this is the biggest issue I suspect with your last loop, which seems like an arbitrary and useless test. For example, the nested loop in loop 3 can be optimized away entirely. I think this is the main reason why you are seeing such different run-time behavior across the different compilers.
If you want to get the real story, take a look at the assembly code that your compiler is generating.

Is there any array-like data structure that can grow in size on both sides?

I'm a student working on a small project for a high performance computing course, hence efficiency is a key issue.
Let's say that I have a vector of N floats and I want to remove the smallest n elements and the biggest n elements. There are two simple ways of doing this:
A
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
invert elements order // O(N)
remove the last n elements // O(1)
B
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
remove the first n elements // O(N)
In A, inverting the element order requires swapping all the elements, while in B removing the first n elements requires moving all the others to occupy the positions left empty. Using std::remove would give the same problem.
If I could remove the first n elements for free then solution B would be cheaper. That should be easy to achieve if, instead of having a vector, i.e. an array with some empty space after vector::end(), I had a container with some free space also before vector::begin().
So the question is: does there already exist an array-like container (i.e. contiguous memory, no linked lists) in some library (STL, Boost) that allows O(1) insertion/removal on both sides of the array?
If not, do you think that there are better solutions than creating such a data structure?
Have you thought of using std::partition with a custom functor like the example below:
#include <iostream>
#include <vector>
#include <algorithm>
template<typename T>
class greaterLess {
T low;
T up;
public:
greaterLess(T const &l, T const &u) : low(l), up(u) {}
bool operator()(T const &e) { return !(e < low || e > up); }
};
int main()
{
std::vector<double> v{2.0, 1.2, 3.2, 0.3, 5.9, 6.0, 4.3};
auto it = std::partition(v.begin(), v.end(), greaterLess<double>(2.0, 5.0));
v.erase(it, v.end());
for(auto i : v) std::cout << i << " ";
std::cout << std::endl;
return 0;
}
This way you would erase elements from your vector in O(N) time.
Try boost::circular_buffer:
It supports random access iterators, constant time insert and erase operations at the beginning or the end of the buffer and interoperability with std algorithms.
Having looked at the source, it seems (and is only logical) that data is kept as a contiguous memory block.
The one caveat is that the buffer has fixed capacity and after exhausting it elements will get overwritten. You can either detect such cases yourself and resize the buffer manually, or use boost::circular_buffer_space_optimized with a humongous declared capacity, since it won't allocate it if not needed.
To shrink and grow a vector at both ends, you can use the idea of slices, reserving extra memory ahead of time at the front and back if efficient growth is needed.
Simply make a class that holds not only a length but also indices for the first and last elements, plus a suitably sized vector, to create a window onto the underlying block of stored floats. A C++ class can provide inlined functions for things like deleting items, indexing into the array, finding the nth largest value, and shifting the slice values down or up to insert new elements while maintaining sorted order. Should no spare elements be available, dynamically allocating a new, larger float store permits continued growth at the cost of an array copy.
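The window idea can be sketched in a few lines. This is only an illustration of the removal half of it: `SortedWindow` is a hypothetical class name, the data is sorted once on construction, and growth into spare room at either end is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// A window [first_, last_) over a sorted block of floats.
// Dropping elements at either end only moves an index, so it is O(1).
class SortedWindow {
public:
    explicit SortedWindow(std::vector<float> values)
        : data_(std::move(values)), first_(0), last_(data_.size()) {
        std::sort(data_.begin(), data_.end());  // O(N log N), done once
    }

    // Discard up to n of the smallest remaining values.
    void drop_front(std::size_t n) { first_ += std::min(n, size()); }

    // Discard up to n of the largest remaining values.
    void drop_back(std::size_t n) { last_ -= std::min(n, size()); }

    std::size_t size() const { return last_ - first_; }

    // Index relative to the window, not to the underlying storage.
    float operator[](std::size_t i) const { return data_[first_ + i]; }

private:
    std::vector<float> data_;
    std::size_t first_;  // index of the first live element
    std::size_t last_;   // one past the last live element
};
```

After construction, drop_front(n) followed by drop_back(n) leaves the middle N - 2n values addressable as w[0] .. w[w.size() - 1] without moving any element.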
A circular buffer is designed as a FIFO, with new elements added at the end and removed at the front, and it does not allow insertion in the middle. A self-defined class can also (trivially) support array subscript ranges other than 0..N-1.
Due to memory locality, the avoidance of excessive indirection through pointer chains, and the pipelining of subscript calculations on a modern processor, a solution based on an array (or a vector) is likely to be the most efficient, despite element copying on insertion. A deque would be suitable, but it fails to guarantee contiguous storage.
Supplementary info: researching classes that provide slices turns up some plausible alternatives to evaluate:
A) std::slice which uses slice_arrays
B) Boost Class Range
Hope this is the kind of specific information you were hoping for. In general, a simpler, clearer solution is more maintainable than a tricky one. I would expect slices and ranges on sorted data sets to be quite common, for example when filtering experimental data where "outliers" are excluded as faulty readings.
I think a good solution should actually be O(NlogN) for the sort plus 2 x O(1) for the removals, with binary searches at O(logN + 1) if you filter on outlying values instead of deleting a fixed number of small or large values. It also matters that the "O" is relatively fast too: sometimes an O(1) algorithm can in practice be slower than an O(N) one for practical values of N.
As a complement to @40two's answer: before partitioning the array, you will need to find the partitioning pivots, that is, the nth smallest number and the nth greatest number in an unsorted array.
There is a discussion on that in SO: How to find the kth largest number in unsorted array
There are several algorithms to solve this problem. Some are deterministic O(N); one of them is a variation on finding the median (median of medians). There are also some non-deterministic algorithms with O(N) average-case complexity.
A good source book to find those algorithms is Introduction to algorithms.
Also in books like
So eventually, your code will run in O(N) time.
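For the selection step, the standard library already offers std::nth_element, which runs in linear time on average. The sketch below uses it twice to trim both ends without a full sort; `trim_extremes` is a hypothetical helper name, and the surviving elements come back in unspecified order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Removes the n smallest and the n largest values in O(N) average time.
// The input is taken by value, so the caller's vector is left untouched.
std::vector<float> trim_extremes(std::vector<float> v, std::size_t n) {
    if (n == 0) {
        return v;
    }
    if (2 * n >= v.size()) {
        return std::vector<float>();  // nothing would survive
    }
    typedef std::vector<float>::difference_type diff_t;
    const diff_t k = static_cast<diff_t>(n);
    // After this call the first n positions hold the n smallest values.
    std::nth_element(v.begin(), v.begin() + k, v.end());
    // Within the remainder, move the n largest values to the last n positions.
    std::nth_element(v.begin() + k, v.end() - k, v.end());
    return std::vector<float>(v.begin() + k, v.end() - k);
}
```

If the result has to be ordered, sorting only the surviving N - 2n elements afterwards is still cheaper than sorting everything first.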

std::bad_alloc at transpose of Eigen::SparseMatrix

I'm trying to calculate the following:
A = X^t * X
I'm using the Eigen::SparseMatrix and get a std::bad_alloc error on the transpose() operation:
Eigen::SparseMatrix<double> trans = sp.transpose();
sp is also an Eigen::SparseMatrix, but it is very big. On one of the smaller datasets, the commands
std::cout << "Rows: " << sp.rows() << std::endl;
std::cout << "Cols: " << sp.cols() << std::endl;
give the following result:
Rows: 2061565968
Cols: 600
(I precompute the sizes of this matrix before I start to fill it)
Is there a limit on how many entries such a matrix can hold?
I'm using a 64bit Linux system with g++
Thanks in advance
Alex
The answer from ggael worked with a slight modification:
In the definition of the SparseMatrix one cannot omit the options, so the correct typedef is
typedef SparseMatrix<double, 0, std::ptrdiff_t> SpMat;
The 0 can also be exchanged for a 1; 0 means column-major and 1 means row-major.
Thank you for your help
By default Eigen::SparseMatrix uses int to store sizes and indices (for compactness). However, with that huge number of rows, you need to use 64-bit integers for both sp and sp.transpose():
typedef SparseMatrix<double, 0, std::ptrdiff_t> SpMat;
Note that you can directly write:
SpMat sp, sp2;
sp2 = sp.transpose() * sp;
even though sp.transpose() will have to be evaluated into a temporary anyway.
I think it is impossible to answer your question in its current state.
There are two things: the size of the matrix as a mathematical object, and its size understood as the memory it occupies. In dense matrices they are pretty much the same (linear dependence). But in the sparse case the memory occupation is not tied to the size of the matrix, but to the number of non-zero elements.
So, technically, you have pretty much unlimited size constraints - equal to the Size type. However, you are, of course, still bound by memory when it comes to the number of (non-zero) elements.
You obviously make a copy of a matrix. So you could try calculating the size of the data the matrix object needs to hold, and see if it fits within your memory.
This is not very trivial, but docs say that the storage is a list of non-zero elements. So a good estimate would probably be (2*sizeof(Index)+sizeof(Scalar))*sp.nonZeros() - for (x,y,value).
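The estimate is plain arithmetic, so it can be evaluated before attempting the copy. A minimal sketch under the (x, y, value) model stated above: `estimate_triplet_bytes` is a hypothetical helper, not an Eigen function, and the library's actual storage layout may differ from this model.

```cpp
#include <cstddef>

// Bytes needed if every non-zero entry is stored as (row, column, value):
// two indices of index_size bytes plus one scalar of scalar_size bytes.
std::size_t estimate_triplet_bytes(std::size_t non_zeros,
                                   std::size_t index_size,
                                   std::size_t scalar_size) {
    return (2 * index_size + scalar_size) * non_zeros;
}
```

For example, estimate_triplet_bytes(sp.nonZeros(), sizeof(std::ptrdiff_t), sizeof(double)) would give the figure for a matrix declared with a std::ptrdiff_t index type; doubling it approximates the footprint while both sp and its transposed copy are alive.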
You could also monitor RAM usage before calling the transpose, and see if it stays within the limit if you double it.
Note: The transposition is probably not the culprit there, but operator=. Maybe you can avoid making the copy.

If I want maximum speed, should I just use an array over a std::vector? [closed]

I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no blackbox unknowns that I may have to suffer through. Yet so often I see people criticized for using arrays over vectors on the internet that I can't help but wonder if I am missing some more information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
It all depends on how you implement your algorithms. std::vector is such a general container concept that it gives us flexibility but leaves us with the freedom and responsibility of structuring the implementation of an algorithm deliberately. Most of the efficiency overhead that we will observe from std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I ran the std::vector vs. array test described here:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>
class TestTimer
{
public:
TestTimer(const std::string & name) : name(name),
start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
{
}
~TestTimer()
{
using namespace std;
using namespace boost;
posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
posix_time::time_duration d = now - start;
cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
" seconds" << endl;
}
private:
std::string name;
boost::posix_time::ptime start;
};
struct Pixel
{
Pixel()
{
}
Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
{
}
unsigned char r, g, b;
};
void UseVector()
{
TestTimer t("UseVector");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.resize(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
}
}
void UseVectorPushBack()
{
TestTimer t("UseVectorPushBack");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels;
pixels.reserve(dimension * dimension);
for(int i = 0; i < dimension * dimension; ++i)
pixels.push_back(Pixel(255, 0, 0));
}
}
void UseArray()
{
TestTimer t("UseArray");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);
for(int i = 0 ; i < dimension * dimension; ++i)
{
pixels[i].r = 255;
pixels[i].g = 0;
pixels[i].b = 0;
}
free(pixels);
}
}
void UseVectorCtor()
{
TestTimer t("UseConstructor");
for(int i = 0; i < 1000; ++i)
{
int dimension = 999;
std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
}
}
int main()
{
TestTimer t1("The whole thing");
UseArray();
UseVector();
UseVectorCtor();
UseVectorPushBack();
return 0;
}
and here are results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
Clearly push_back wasn't a good choice here, and using the constructor is still more than 2 times slower than the array.
Now, providing Pixel with an empty copy constructor:
Pixel(const Pixel&) {}
gives us the following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm; otherwise, perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask, hoping you know better.
If you have just started with vectors, their behavior might be surprising. For example, this code:
class U{
int i_;
public:
U(){}
U(int i) : i_(i) {cout << "consting " << i_ << endl;}
U(const U& ot) : i_(ot.i_) {cout << "copying " << i_ << endl;}
};
int main(int argc, char** argv)
{
std::vector<U> arr(2,U(3));
arr.resize(4);
return 0;
}
results with:
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
Vectors are usually preferred over arrays because they allow bound-checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior like in an array. This bound-checking does however require additional CPU cycles. When you use the []-operator to access elements, no bound-checking is done and access should be as fast as an array. This however risks undefined behavior when your code is buggy.
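The difference between the two access styles is easy to show. A short sketch, with `checked_get` and `unchecked_get` as hypothetical helper names: at() throws std::out_of_range for a bad index, while operator[] performs no check at all.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Bounds-checked access: a bad index is reported as an exception,
// which is turned into a fallback value here.
int checked_get(const std::vector<int>& v, std::size_t i, int fallback) {
    try {
        return v.at(i);
    } catch (const std::out_of_range&) {
        return fallback;
    }
}

// Unchecked access: as cheap as indexing a plain array, but the caller
// must guarantee i < v.size(), otherwise the behavior is undefined.
int unchecked_get(const std::vector<int>& v, std::size_t i) {
    return v[i];
}
```

A common compromise is to use at() while debugging and operator[] in hot loops whose index ranges are already known to be valid.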
People who invented STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran though).
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go either for arrays or vectors, because I think that for your needs they may not be totally fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline, and make your array nice and tidy to be loaded up in memory at your application start up time, and either vector or array would work (provided you do the reserve call up front for vector, since the default allocation growth scheme doubles the size of the underlying array whenever it gets full, and you wouldn't want to use up 16 GB of memory for only 9 GB worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts in your set with a vector or an array. Recall that each insert within the array shifts all the successor elements by one place. Of course, an index, like the kd tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array into pages (either linked in a list or indexed by an array) and store data in the pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within pages. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.
Note that if pages are always full (except for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the supplementary element.
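The paging scheme can be sketched roughly as follows. Everything here is an assumption made for illustration: `PagedStore` is a hypothetical class, pages are kept in an indexing vector, a full page is split in half rather than spilling a single element, and no bounds checking is done on the page and offset arguments.

```cpp
#include <cstddef>
#include <vector>

// Elements live in fixed-capacity pages; an insert shifts at most one page.
template <typename T>
class PagedStore {
public:
    explicit PagedStore(std::size_t page_capacity)
        : capacity_(page_capacity < 2 ? 2 : page_capacity), pages_(1) {}

    // Insert value at (page, offset). The caller must pass a valid page
    // index and an offset no larger than that page's current size.
    void insert(std::size_t page, std::size_t offset, const T& value) {
        if (pages_[page].size() == capacity_) {
            // Page is full: move its upper half into a new page right after it.
            const std::size_t half = capacity_ / 2;
            const std::ptrdiff_t h = static_cast<std::ptrdiff_t>(half);
            std::vector<T> upper(pages_[page].begin() + h, pages_[page].end());
            pages_[page].erase(pages_[page].begin() + h, pages_[page].end());
            pages_.insert(pages_.begin() + static_cast<std::ptrdiff_t>(page) + 1, upper);
            if (offset > half) {  // the target position moved to the new page
                ++page;
                offset -= half;
            }
        }
        pages_[page].insert(pages_[page].begin() + static_cast<std::ptrdiff_t>(offset), value);
    }

    std::size_t page_count() const { return pages_.size(); }

    // Concatenate all pages in order, mainly useful for inspection.
    std::vector<T> flatten() const {
        std::vector<T> out;
        for (std::size_t p = 0; p < pages_.size(); ++p) {
            out.insert(out.end(), pages_[p].begin(), pages_[p].end());
        }
        return out;
    }

private:
    std::size_t capacity_;
    std::vector<std::vector<T> > pages_;
};
```

An external index such as a kd-tree would then store (page, offset) pairs, and only the entries pointing into the affected page need updating after an insert or a split.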
Some things to keep in mind:
array and vector allocation have upper limits, which is OS dependent (these limits might be different)
On my 32-bit system, the maximum allowed allocation for a vector of 3D points is around 180 million entries, so for larger datasets, one would have to find a different solution. Granted, on a 64-bit OS, that amount might be significantly larger (on 32-bit Windows, the maximum memory space for a process is 2 GB; I think they added some tricks in more advanced versions of the OS to extend that amount). Admittedly memory will be even more problematic for solutions like mine.
resizing a vector requires allocating the new size on the heap and copying the elements from the old memory chunk to the new one.
So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on Unices for instance, but as far as I know that function doesn't make any guarantee that the same memory chunk will be reused). The problem might be avoided in vector as well if a custom allocator that uses the same functions is used.
C++ doesn't make any assumption about the underlying memory architecture.
vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OS, that memory is actually cut in pages, which are mapped in and out of physical memory. So my solution is somehow to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.
So in essence my answer is concerned by the efficiency of updating the dataset in a manner which will favor clustering points close to each others. It supposes that such clustering is possible. If not the case, then just pushing a new point at the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std::vector, most list systems like this are slower than arrays as they allocate memory when they are resized, normally doubling the current capacity, although this is not always the case.
So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvements by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operations that are common between vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size) they perform exactly the same, because, by specification, they are the same.
vector access methods are so trivial that the simplest compiler optimization will wipe them out.
If you know in advance the size of a vector, just construct it by specifying the size or just call resize, and you will get the same as you can get with new[].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you get no penalty on push_back, since all the required memory is already allocated.
In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that reallocations of big amounts are less and less frequent.
Also, in case you know everything about sizes, and you need no dynamic memory and want the same vector interface, consider also std::array.
Sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really really want to make sure it's not re-allocating under the hood
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.

std::map and performance, intersecting sets

I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.
I'm finding the performance to be very slow.
Details:
- One of the sets has 150,000 numbers in it
- The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time
- I haven't done any profiling yet, but every time I break into the debugger while doing the intersection it's in malloc.c!
So, how can I improve this performance? Switch to a different data structure? Some how improve the memory allocation performance of map?
Update:
Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
Or, are there any tips for using these efficiently?
Update2:
See Fast C++ container like the C# HashSet<T> and Dictionary<K,V>?
Update3:
I benchmarked set_intersection and got horrible results:
(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms
Code:
int runIntersectionTestAlgo()
{
    set<int> set1;
    set<int> set2;
    set<int> intersection;
    // Create 100,000 values for set1
    for ( int i = 0; i < 100000; i++ )
    {
        int value = 1000000000 + i;
        set1.insert(value);
    }
    // Create 1,000 values for set2
    for ( int i = 0; i < 1000; i++ )
    {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        set2.insert(value);
    }
    set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(), inserter(intersection, intersection.end()));
    return intersection.size();
}
You should definitely be using preallocated vectors which are way faster. The problem with doing set intersection with stl sets is that each time you move to the next element you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector the next element will often be in your cache because it's physically close to the previous element.
The trick with vectors is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE, because it'll keep reallocating memory as it resizes itself during your initialization step.
Try something like this instead - it'll be WAY faster.
#include <algorithm>
#include <cstdlib>
#include <iterator>
#include <vector>
using namespace std;

int runIntersectionTestAlgo() {
    // The elements must be int: the values (around 10^9) do not fit in a char
    vector<int> vector1; vector1.reserve(100000);
    vector<int> vector2; vector2.reserve(1000);
    // Create 100,000 values for vector1
    for ( int i = 0; i < 100000; i++ ) {
        int value = 1000000000 + i;
        vector1.push_back(value);
    }
    sort(vector1.begin(), vector1.end());
    // Create 1,000 values for vector2
    for ( int i = 0; i < 1000; i++ ) {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        vector2.push_back(value);
    }
    sort(vector2.begin(), vector2.end());
    // Reserve at most 1,000 spots for the intersection
    vector<int> intersection; intersection.reserve(min(vector1.size(), vector2.size()));
    set_intersection(vector1.begin(), vector1.end(), vector2.begin(), vector2.end(), back_inserter(intersection));
    return intersection.size();
}
Without knowing any more about your problem, "check with a good profiler" is the best general advice I can give. Beyond that...
If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.
If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.
If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).
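As a rough sketch of the two layouts just described: the first function handles the dense case with one flag per possible value, and the small class handles the sparse-but-clustered case with a map keyed by the high-order bits whose entries are vectors indexed by the low-order bits. The names intersect_dense and ClusteredSet, the base/range parameters, and the 8-bit split are all assumptions made for illustration. A real split should follow how your data actually clusters.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Dense case: values are assumed to lie in [base, base + range).
// Values outside that window are ignored rather than indexed.
std::vector<std::uint32_t> intersect_dense(const std::vector<std::uint32_t>& a,
                                           const std::vector<std::uint32_t>& b,
                                           std::uint32_t base, std::size_t range) {
    std::vector<char> seen(range, 0);  // one flag per possible value
    for (std::uint32_t x : a) {
        if (x >= base && x - base < range) seen[x - base] = 1;
    }
    std::vector<std::uint32_t> result;
    for (std::uint32_t x : b) {
        if (x >= base && x - base < range && seen[x - base] == 1) {
            result.push_back(x);
            seen[x - base] = 2;  // report a value once even if b repeats it
        }
    }
    return result;
}

// Sparse-but-clustered case: map on the high bits, vector of flags on the low bits.
class ClusteredSet {
public:
    void insert(std::uint32_t x) {
        std::vector<bool>& block = blocks_[x >> kLowBits];
        if (block.empty()) block.resize(kBlockSize, false);
        block[x & (kBlockSize - 1)] = true;
    }
    bool contains(std::uint32_t x) const {
        auto it = blocks_.find(x >> kLowBits);
        return it != blocks_.end() && it->second[x & (kBlockSize - 1)];
    }
    std::size_t block_count() const { return blocks_.size(); }

private:
    static constexpr unsigned kLowBits = 8;  // arbitrary split, tune to the data
    static constexpr std::uint32_t kBlockSize = 1u << kLowBits;
    std::map<std::uint32_t, std::vector<bool>> blocks_;
};
```

In the clustered layout, nearby values share one map node and one contiguous block, so the number of allocations grows with the number of clusters instead of the number of elements.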
I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges (set_intersection, set_union, etc.).
I don't understand why you have to use a map to do intersection. Like people have said, you could put the sets in std::set's, and then use std::set_intersection().
Or you can put them into hash_set's. But then you would have to implement intersection manually: technically you only need to put one of the sets into a hash_set, and then loop through the other one, and test if each element is contained in the hash_set.
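A minimal sketch of that manual hash-based intersection, using std::unordered_set (the standard container since C++11) in place of the older non-standard hash_set. The function name intersect_hashed is made up for illustration.

```cpp
#include <unordered_set>
#include <vector>

// Puts one collection into a hash set, then loops over the other and keeps
// the values found in it. Results come out in the order they appear in b.
std::vector<int> intersect_hashed(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::unordered_set<int> lookup(a.begin(), a.end());
    std::vector<int> result;
    for (int x : b) {
        // erase returns the number of elements removed (0 or 1), so a value
        // repeated in b is only reported the first time it is seen.
        if (lookup.erase(x) > 0) {
            result.push_back(x);
        }
    }
    return result;
}
```

Hashing the smaller collection and looping over the larger one keeps the hash table small, although which order is faster in practice is worth measuring.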
Intersection with maps is slow; try a hash_map (however, this is not provided in all STL implementations).
Alternatively, sort both maps and do it in a merge-sort-like way.
What is your intersection algorithm? Maybe there are some improvements to be made?
Here is an alternate method
I do not know it to be faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).
This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.
Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.
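One way this first method could look, as a sketch only: order the sets by size, walk the smallest, and call find on the others, stopping at the first miss. The function name intersect_by_lookup is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Walks the smallest set and keeps the values that find() locates in all others.
std::vector<int> intersect_by_lookup(const std::vector<std::set<int>>& sets) {
    std::vector<int> result;
    if (sets.empty()) return result;

    // Order the sets from smallest to largest without copying them.
    std::vector<const std::set<int>*> ordered;
    for (const std::set<int>& s : sets) ordered.push_back(&s);
    std::sort(ordered.begin(), ordered.end(),
              [](const std::set<int>* l, const std::set<int>* r) {
                  return l->size() < r->size();
              });

    for (int x : *ordered.front()) {
        bool in_all = true;
        for (std::size_t i = 1; i < ordered.size(); ++i) {
            if (ordered[i]->find(x) == ordered[i]->end()) {
                in_all = false;  // first miss ends the search for this value
                break;
            }
        }
        if (in_all) result.push_back(x);
    }
    return result;
}
```

Each lookup costs O(log N) in the set being searched, so the total work is driven by the size of the smallest set.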
Here is another alternate method
Change the sets of numbers to std::vector<int> and use std::sort to sort from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set since the array is more tightly packed in memory. Actually, never mind that, you can then just iterate through the values in lock-step, looking at the ones with the same value. Increment only the iterators which are less than the minimum value you saw at the previous step (if the values were different).
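The lock-step walk over two sorted vectors can be sketched like this. It is essentially what std::set_intersection does for you, written out by hand. The name intersect_sorted is made up for illustration, and both inputs are assumed to be sorted already.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Walks two sorted vectors in lock-step and collects the common values.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    // Precondition (assumption of this sketch): both inputs are sorted.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> result;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            ++i;                    // a is behind, advance it
        } else if (*j < *i) {
            ++j;                    // b is behind, advance it
        } else {
            result.push_back(*i);   // same value in both
            ++i;
            ++j;
        }
    }
    return result;
}
```

Each element is visited at most once, so after sorting the intersection itself is linear in the combined size of the two vectors.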
Might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set), and throwing them into yet another map. This is doing a lot of work you don't need to do, since the keys of a standard set are in sorted order already. Instead, take a "merge-sort"-like approach. Spin over each iterator, dereferencing to find the min. Count the number that have that min, and increment those. If the count was N, add it to the intersection. Repeat until the first set hits its end (if you compare the sizes before starting, you won't have to check every set's end each time).
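Here is a rough sketch of that N-way merge, assuming the inputs are std::set<int>. For simplicity it checks every iterator for its end on each round instead of applying the size-comparison shortcut. The function name intersect_n_way is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Advances one iterator per set, always moving the ones sitting on the minimum.
// A value is in the intersection when all N iterators sit on it at once.
std::vector<int> intersect_n_way(const std::vector<std::set<int>>& sets) {
    std::vector<int> result;
    if (sets.empty()) return result;

    std::vector<std::set<int>::const_iterator> its;
    for (const std::set<int>& s : sets) its.push_back(s.begin());

    while (true) {
        // Once any set is exhausted, no further common value can exist.
        for (std::size_t k = 0; k < sets.size(); ++k) {
            if (its[k] == sets[k].end()) return result;
        }
        // Find the smallest value among the current positions.
        int lowest = *its[0];
        for (std::size_t k = 1; k < sets.size(); ++k) {
            lowest = std::min(lowest, *its[k]);
        }
        // Count and advance every iterator that sits on that minimum.
        std::size_t count = 0;
        for (std::size_t k = 0; k < sets.size(); ++k) {
            if (*its[k] == lowest) {
                ++count;
                ++its[k];
            }
        }
        if (count == sets.size()) result.push_back(lowest);
    }
}
```

No extra map is built here: the only allocation is the result vector, which avoids the repeated trips into malloc seen in the debugger.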
Responding to update: There do exist facilities to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:
std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;
But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.
Look at your algorithms, then choose the proper data type. If you're going to have set-like behaviour, and want to do intersections and the like, std::set is the container to use.
Since its elements are stored in sorted order, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.
I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.