Which container is most efficient for multiple insertions / deletions in C++?

I was set a homework challenge as part of an application process (I was rejected, by the way; I wouldn't be writing this otherwise) in which I was to implement the following functions:
// Store a collection of integers
class IntegerCollection {
public:
    // Insert one entry with value x
    void Insert(int x);

    // Erase one entry with value x, if one exists
    void Erase(int x);

    // Erase all entries, x, from <= x < to
    void Erase(int from, int to);

    // Return the count of all entries, x, from <= x < to
    size_t Count(int from, int to) const;
The functions were then put through a bunch of tests, most of which were trivial. The final test was the real challenge, as it performed 500,000 single insertions, 500,000 calls to Count and 500,000 single deletions.
The member variables of IntegerCollection were not specified and so I had to choose how to store the integers. Naturally, an STL container seemed like a good idea and keeping it sorted seemed an easy way to keep things efficient.
Here is my code for the four functions using a vector:
// Previous bit of code shown goes here
private:
    std::vector<int> integerCollection;
};

void IntegerCollection::Insert(int x) {
    /* using lower_bound to find the right place for x to be inserted
       keeps the vector sorted and makes life much easier */
    auto it = std::lower_bound(integerCollection.begin(), integerCollection.end(), x);
    integerCollection.insert(it, x);
}

void IntegerCollection::Erase(int x) {
    // find the location of the first element containing x and delete it if it exists
    auto it = std::find(integerCollection.begin(), integerCollection.end(), x);
    if (it != integerCollection.end()) {
        integerCollection.erase(it);
    }
}

void IntegerCollection::Erase(int from, int to) {
    if (integerCollection.empty()) return;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    /* std::vector::erase deletes entries between the two iterators
       fromBound (included) and toBound (not included) */
    integerCollection.erase(fromBound, toBound);
}

size_t IntegerCollection::Count(int from, int to) const {
    if (integerCollection.empty()) return 0;
    int count = 0;
    // lower_bound points to the first element of integerCollection >= from/to
    auto fromBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), from);
    auto toBound = std::lower_bound(integerCollection.begin(), integerCollection.end(), to);
    // advance the iterator until fromBound == toBound (we don't count elements of value = to)
    while (fromBound != toBound) {
        ++count;
        ++fromBound;
    }
    return count;
}
The company got back to me saying that they wouldn't be moving forward because my choice of container meant the runtime complexity was too high. I also tried using list and deque and compared the runtimes. As I expected, I found that list was dreadful and that vector had the edge over deque. So as far as I was concerned I had made the best of a bad situation, but apparently not!
I would like to know what the correct container to use in this situation is. deque only makes sense if I can guarantee insertion or deletion at the ends of the container, and list hogs memory. Is there something else that I'm completely overlooking?

We cannot know what would make the company happy. If they reject std::vector without concrete reasoning, I wouldn't want to work for them anyway. Moreover, we don't really know the precise requirements. Were you asked to provide one reasonably well performing implementation? Did they expect you to squeeze out the last percent of the provided benchmark by profiling a bunch of different implementations?
The latter is probably too much for a homework challenge as part of an application process. If it is the first, you can either:
- roll your own. It is unlikely that the interface you were given can be implemented more efficiently than one of the std containers does... unless your requirements are so specific that you can write something that performs well under that specific benchmark.
- use std::vector for data locality. See e.g. here for Bjarne himself advocating std::vector rather than linked lists.
- use std::set for ease of implementation. It seems like you want the container sorted, and the interface you have to implement fits that of std::set quite well (a sketch follows below).
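For illustration, here is a minimal sketch of the whole interface on top of std::multiset (multiset rather than set, since Erase(x) is specified to remove only one entry, which implies duplicates can exist). This is only one plausible reading of what they wanted, not necessarily the expected answer:
#include <set>
#include <iterator>
#include <cstddef>

class IntegerCollection {
public:
    void Insert(int x) { data.insert(x); }            // O(log N)

    void Erase(int x) {
        auto it = data.find(x);                       // O(log N)
        if (it != data.end()) data.erase(it);         // erases one entry only
    }

    void Erase(int from, int to) {
        data.erase(data.lower_bound(from), data.lower_bound(to));
    }

    std::size_t Count(int from, int to) const {
        // note: std::distance is O(N) for multiset iterators,
        // so Count is the weak spot of this variant
        return std::distance(data.lower_bound(from), data.lower_bound(to));
    }

private:
    std::multiset<int> data;
};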
Let's compare only insertion and erasure, assuming the container needs to stay sorted:

operation   std::set    std::vector
insert      O(log N)    O(N)
erase       O(log N)    O(N)
Note that the log(N) for the binary search to find the position to insert/erase in the vector can be neglected compared to the N.
Now you have to consider that the asymptotic complexity listed above completely neglects the non-linearity of memory access. In reality, data can be far apart in memory (std::set), leading to many cache misses, or it can be local, as with std::vector. The log(N) only wins for huge N. To get an idea of the difference, 500000/log(500000) is roughly 26410, while 1000/log(1000) is only ~100.
I would expect std::vector to outperform std::set for small container sizes, but at some point the log(N) wins over the cache. The exact location of this turning point depends on many factors and can only be reliably determined by profiling and measuring.

Nobody knows which container is MOST efficient for multiple insertions / deletions. That is like asking what the most fuel-efficient design for a car engine is; people are always innovating on car engines and make more efficient ones all the time. However, I would recommend a splay tree. The time required for an insertion or deletion in a splay tree is not constant: some insertions take a long time and some take only a very short time. However, the average time per insertion/deletion is always guaranteed to be O(log n), where n is the number of items stored in the splay tree. Logarithmic time is extremely efficient and should be good enough for your purposes.

The first thing that comes to mind is to hash the integer value so single look ups can be done in constant time.
The integer value can be hashed to compute an index in to an array of bools or bits, used to tell if the integer value is in the container or not.
Counting and deleting large ranges could be sped up from there by using multiple hash tables for specific integer ranges.
If you had 0x10000 hash tables, each storing ints from 0 to 0xFFFF, and you were using 32-bit integers, you could mask and shift the upper half of the int value and use that as an index to find the correct hash table to insert / delete values from.
IntHashTable containers[0x10000];

u_int32 hashIndex = (u_int32)value / 0x10000;                   // upper 16 bits pick the table
u_int32 valueInTable = (u_int32)value - (hashIndex * 0x10000);  // lower 16 bits are stored in it
containers[hashIndex].insert(valueInTable);
Count, for example, could be implemented like so, if each hash table kept a count of the number of elements it contains:
int indexStart = startRange / 0x10000;
int indexEnd = endRange / 0x10000;
int countTotal = 0;
// note: the first and last tables may only partially overlap the range,
// so they really need a partial count rather than a full count()
for (int i = indexStart; i <= indexEnd; ++i) {
    countTotal += containers[i].count();
}

Not sure if keeping the container sorted really is a requirement for removing the range; it might be based on position. Anyway, here is a link with some hints on which STL container to use.
In which scenario do I use a particular STL container?
Just FYI.
Vector may be a good choice, but it does a lot of reallocation, as you know. I prefer deque instead, as it doesn't require a big chunk of memory to allocate all items. For such requirements as you had, list probably fits better.

A basic solution for this problem might be std::map<int, int>, where the key is the integer you are storing and the value is the number of occurrences.
The problem with this is that you cannot quickly remove/count ranges; in other words, the complexity is linear.
For a quick count you would need to implement your own complete binary tree where you can know the number of nodes between two nodes (the upper and lower bound nodes), because you know the size of the tree and you know how many left and right turns you took to reach the upper and lower bound nodes. Note that we are talking about a complete binary tree; in a general binary tree you cannot make this calculation fast.
For quick range removal I do not know how to make it faster than linear.
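For reference, a minimal sketch of the std::map<int, int> idea (range Erase would walk the range the same way Count does, which is where the linear behavior comes from):
#include <map>
#include <cstddef>

class IntegerCollection {
public:
    void Insert(int x) { ++counts[x]; }               // O(log N)

    void Erase(int x) {
        auto it = counts.find(x);                     // O(log N)
        if (it != counts.end() && --it->second == 0)
            counts.erase(it);
    }

    std::size_t Count(int from, int to) const {
        std::size_t total = 0;
        const auto end = counts.lower_bound(to);
        // linear walk over the distinct values in [from, to)
        for (auto it = counts.lower_bound(from); it != end; ++it)
            total += it->second;
        return total;
    }

private:
    std::map<int, int> counts;  // value -> number of occurrences
};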

Related

Multiple insertions and deletion in priority_queue

Problem
In my algorithm I have to implement an "ordered queue" (I chose this name to distinguish the idea in my mind from existing implementations). I have to insert some values into a queue, where the value represents the order in the queue, and then I have to digest the queue following that order. I have the feeling that the best data structure for what I need is std::priority_queue, but I have some concerns about the efficiency of my program, in particular due to:
Interface which does not provide methods for insertions/deletions of multiple elements
(Possibly) Internal design of the class and its algorithms
From the documentation, both priority_queue::push and priority_queue::pop internally call std::push_heap and std::pop_heap, which both have complexity O(log(N)). I think it is very inefficient to insert/delete one element at a time, repairing the underlying heap at every call.
Why is it implemented this way? Maybe when you call std::push_heap and std::pop_heap in a sequence, the underlying heap structure is in the optimal case and the complexity is reduced with respect to O(log(N))?
Otherwise, is there a better data structure which fits my needs that I have not considered? I thought std::forward_list could also fulfil my needs on deletion (through forward_list::pop_front), but I fear that insertion becomes too expensive, as I would have to find the iterator for the correct place to insert, which is O(N).
I would prefer not to rely on any external library (Boost included) because the project must be lightweight and dependency-free.
Implementation
The program is equivalent to:
#include <cstdlib>
#include <functional>
#include <queue>
#include <vector>

struct MyType {
    double when;
    int who;
    MyType(double t, int i) : when(t), who(i) {}
    bool operator<(const MyType& other) const { return when < other.when; }
    bool operator>(const MyType& other) const { return when > other.when; }
};

// std::greater makes top() the entry with the smallest `when`, so remove()
// below consumes events in chronological order (std::less would surface the
// latest event instead, which is not what we want)
using OrderedQueue = std::priority_queue<MyType, std::vector<MyType>, std::greater<MyType>>;

const double TMax = 1e9; // some BIG stopping condition

double some_time() { /* routine to generate the time */ return TMax * (rand() / (double)RAND_MAX); }
int some_number() { /* routine to generate the number */ return rand() % 100; }

void populate(OrderedQueue& q) {
    unsigned Ni = 10; // number of insertions: it is not fixed in the real program
    for (unsigned i = 0; i < Ni; ++i) {
        q.emplace(some_time(), some_number());
    }
}

void use_MyType(MyType m) { /* routine that uses the top value */ }

void remove(double t, OrderedQueue& q) {
    while (!q.empty() && q.top().when < t) {
        use_MyType(q.top());
        q.pop();
    }
}

int main() {
    double t = 0;
    OrderedQueue q;
    while (t < TMax) {
        populate(q);
        remove(t, q);
        t += 1;
    }
}
I am particularly interested in the efficiency of populate() and remove(), because the loop in which they are called has very many iterations.
std::priority_queue is an adaptor for a heap structure. Given your requirements of consuming elements in order one by one, a heap is the most efficient structure.
Heap insertions are worst case O(log(N)), but are on average O(1). This is faster than e.g. a binary tree (std::map) insertion, which is always O(log(N)). Similarly, removing the top element from a heap is worst case O(log(N)), but on average much faster since a heap is partially sorted.
With that said, the effects of branch prediction and caching in modern computers cannot be neglected. The best way to answer a performance question is to benchmark it with your actual data and a representative number of elements. I would suggest benchmarking these 3 queue structures:
std::priority_queue<MyType, std::vector<MyType>>
std::priority_queue<MyType, std::deque<MyType>>
std::multiset<MyType> (std::map needs a separate key and value; a multiset is the closer fit here)
std::deque as the backing store may offer improved pop_front performance, at the expense of slower random access. So it should be benchmarked.
I would disregard std::list (std::forward_list) at this point - inserting into a linked list at the right place is O(N), plus a linked list isn't cache-friendly, so is definitely going to be a much slower solution.
For more details on Heap vs Binary Tree performance, see this related question.
To address your concerns:
Interface which does not provide methods for insertions/deletions of multiple elements
Inserting an element into a heap involves appending the element at the end and "repairing" the heap structure. This is what the std::push_heap algorithm does. It is entirely feasible to implement an algorithm to insert multiple elements this way and/or simply invoke std::make_heap after appending multiple elements to repair the entire heap.
Removing multiple elements from a heap isn't possible, since a heap is only sorted with respect to the first (top) element. After removing it, the heap structure needs to be adjusted to find the next top element. This is what the std::pop_heap algorithm does.
Internal design of the class and its algorithms
std::priority_queue is just an adapter around the heap algorithms. It's a convenience class that wraps a sequential container and invokes the heap algorithms on it. You don't have to use it; you can use std::vector, std::push_heap and std::pop_heap with exactly the same results (though the code might be less readable and more error-prone).
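To illustrate, here is a sketch of the same queue written directly against the heap algorithms, including the bulk-insert variant described above (assuming the MyType from the question, with its greater-than comparison; the function names are made up):
#include <algorithm>
#include <functional>
#include <vector>

std::vector<MyType> heap;              // min-heap on `when` via std::greater
const std::greater<MyType> cmp{};

void push_one(const MyType& m) {
    heap.push_back(m);
    std::push_heap(heap.begin(), heap.end(), cmp);   // O(log N)
}

void push_many(const std::vector<MyType>& batch) {
    heap.insert(heap.end(), batch.begin(), batch.end());
    // one O(N) repair instead of K separate O(log N) pushes;
    // worthwhile when K is a sizable fraction of N
    std::make_heap(heap.begin(), heap.end(), cmp);
}

MyType pop_top() {
    std::pop_heap(heap.begin(), heap.end(), cmp);    // moves the top to the back, O(log N)
    MyType m = heap.back();
    heap.pop_back();
    return m;
}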

Is there any array-like data structure that can grow in size on both sides?

I'm a student working on a small project for a high performance computing course, so efficiency is a key issue.
Let say that I have a vector of N floats and I want to remove the smallest n elements and the biggest n elements. There are two simple ways of doing this:
A:
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
invert the element order // O(N)
remove the last n elements // O(1)
B:
sort in ascending order // O(NlogN)
remove the last n elements // O(1)
remove the first n elements // O(N)
In A, inverting the element order requires swapping all the elements, while in B removing the first n elements requires moving all the others to occupy the positions left empty. Using std::remove would give the same problem.
If I could remove the first n elements for free, then solution B would be cheaper. That should be easy to achieve if, instead of a vector, i.e. an array with some empty space after vector::end(), I had a container with some free space also before vector::begin().
So the question is: does an array-like container (i.e. contiguous memory, no linked lists) already exist in some library (STL, Boost) that allows O(1) insertion/removal at both sides of the array?
If not, do you think that there are better solutions than creating such a data structure?
Have you thought of using std::partition with a custom functor, like the example below:
#include <iostream>
#include <vector>
#include <algorithm>

template<typename T>
class greaterLess {
    T low;
    T up;
public:
    greaterLess(T const &l, T const &u) : low(l), up(u) {}
    bool operator()(T const &e) const { return !(e < low || e > up); }
};

int main()
{
    std::vector<double> v{2.0, 1.2, 3.2, 0.3, 5.9, 6.0, 4.3};
    auto it = std::partition(v.begin(), v.end(), greaterLess<double>(2.0, 5.0));
    v.erase(it, v.end());

    for (auto i : v) std::cout << i << " ";
    std::cout << std::endl;

    return 0;
}
This way you would erase elements from your vector in O(N) time.
Try boost::circular_buffer:
It supports random access iterators, constant time insert and erase operations at the beginning or the end of the buffer and interoperability with std algorithms.
Having looked at the source, it seems (and is only logical) that data is kept as a contiguous memory block.
The one caveat is that the buffer has fixed capacity and after exhausting it elements will get overwritten. You can either detect such cases yourself and resize the buffer manually, or use boost::circular_buffer_space_optimized with a humongous declared capacity, since it won't allocate it if not needed.
To shrink and grow a vector at both ends, you can use the idea of slices, reserving extra memory at the front and back to expand into ahead of time, if efficient growth is needed.
Simply make a class with not only a length but indices for the first and last elements, plus a suitably sized vector, to create a window of data on the underlying block of stored floats. A C++ class can provide inlined functions for things like deleting items, addressing into the array, finding the nth largest value, and shifting the slice values down or up to insert new elements while maintaining sorted order. Should no spare elements be available, dynamic allocation of a new, larger float store permits continued growth at the cost of an array copy.
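A minimal sketch of that window idea (all names here are made up for illustration; it only covers the O(1) trimming, not the growth paths):
#include <algorithm>
#include <cstddef>
#include <vector>

class SliceVector {
public:
    explicit SliceVector(std::vector<float> sorted)
        : data(std::move(sorted)), first(0), last(data.size()) {}

    // trimming either end is O(1): only the window indices move
    void remove_front(std::size_t n) { first += std::min(n, size()); }
    void remove_back(std::size_t n) { last -= std::min(n, size()); }

    std::size_t size() const { return last - first; }
    float* begin() { return data.data() + first; }
    float* end() { return data.data() + last; }

private:
    std::vector<float> data;   // underlying contiguous store
    std::size_t first, last;   // live window [first, last)
};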
A circular buffer is designed as a FIFO, with new elements added at the end and removal at the front, and it does not allow insertion in the middle; a self-defined class can also (trivially) support array subscript values different from 0..N-1.
Due to memory locality, avoiding excessive indirection from pointer chains, and the pipelining of subscript calculations on a modern processor, a solution based on an array (or a vector) is likely to be most efficient, despite element copying on insertion. Deque would be suitable, but it fails to guarantee contiguous storage.
Additional supplementary info: researching classes providing slices finds some plausible alternatives to evaluate:
A) std::slice, which selects from a std::valarray (via std::slice_array)
B) the Boost Range library
Hope this is the kind of specific information you were hoping for; in general, a simpler, clearer solution is more maintainable than a tricky one. I would expect slices and ranges on sorted data sets to be quite common, for example filtering experimental data where "outliers" are excluded as faulty readings.
I think a good solution should actually be O(NlogN) plus two O(1) removals, with any binary searches O(logN + 1) for filtering on outlying values in place of deleting a fixed number of small or large values. It also matters that the "O" is relatively fast, too: sometimes an O(1) algorithm can in practice be slower, for practical values of N, than an O(N) one.
As a complement to #40two's answer: before partitioning the array, you will need to find the partitioning pivots, i.e. the nth smallest number and the nth greatest number in an unsorted array.
There is a discussion on that on SO: How to find the kth largest number in unsorted array
There are several algorithms to solve this problem. Some are deterministic O(N); one of them is a variation on finding the median (median of medians). There are also non-deterministic algorithms with O(N) average case.
A good source book for those algorithms is Introduction to Algorithms.
So eventually, your code will run in O(N) time.
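For what it's worth, std::nth_element performs exactly this selection in place; a sketch of dropping the n smallest and n largest in O(N) (survivor order is not preserved, and the function name is made up):
#include <algorithm>
#include <cstddef>
#include <vector>

void drop_extremes(std::vector<float>& v, std::size_t n) {
    if (v.size() <= 2 * n) { v.clear(); return; }
    // partition the n smallest to the front...
    std::nth_element(v.begin(), v.begin() + n, v.end());
    // ...and, among the rest, the n largest to the back
    std::nth_element(v.begin() + n, v.end() - n, v.end());
    v.erase(v.end() - n, v.end());
    v.erase(v.begin(), v.begin() + n);
}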

Sorting 1000-2000 elements with many cache misses

I have an array of 1000-2000 elements which are pointers to objects. I want to keep my array sorted, and obviously I want to do this as quickly as possible. They are sorted by a member and not allocated contiguously, so assume a cache miss whenever I access the sort-by member.
Currently I'm sorting on-demand rather than on-add, but because of the cache misses and [presumably] non-inlining of the member access, the inner loop of my quicksort is slow.
I'm doing tests and trying things now (and will see what the actual bottleneck is), but can anyone recommend a good alternative to speed this up?
Should I do an insertion sort instead of quicksorting on demand, or should I try to change my model to make the elements contiguous and reduce cache misses?
OR, is there a sort algorithm I've not come across which is good for data that is going to cache miss?
Edit: Maybe I worded this wrong :) I don't actually need my array sorted all the time (I'm not iterating through it sequentially for anything); I just need it sorted when I'm doing a binary chop to find a matching object, and doing that quicksort at that time (when I want to search) is currently my bottleneck, because of the cache misses and jumps. (I'm using a < operator on my object, but I'm hoping that inlines in release.)
Simple approach: insertion sort on every insert. Since your elements are not contiguous in memory, I'm guessing a linked list. If so, you could transform it into a linked list with jumps to the 10th element, the 100th, and so on (essentially a skip list). This is kind of similar to the next suggestion.
Or you reorganize your container structure into a binary tree (or whatever tree you like: B, B*, red-black, ...) and insert elements as you would insert them into a search tree.
Running a quicksort on each insertion is enormously inefficient. Doing a binary search and insert operation would likely be orders of magnitude faster. Using a binary search tree instead of a linear array would reduce the insert cost.
Edit: I missed that you were doing sort on extraction, not insert. Regardless, keeping things sorted amortizes sorting time over each insert, which almost has to be a win, unless you have a lot of inserts for each extraction.
If you want to keep the sort on-extract methodology, then maybe switch to merge sort, or another sort that has good performance for mostly-sorted data.
I think the best approach in your case would be changing your data structure to something logarithmic and rethinking your architecture. The bottleneck of your application is not the sorting itself, but the question of why you have to sort everything on each insert and then try to compensate for that with on-demand sorting.
Another thing you could try (based on your current implementation) is introducing an external mapping table of pointers to sort keys and sorting those second keys instead, but I actually doubt it would benefit you in this case.
Instead of the array of the pointers you may consider an array of structs which consist of both a pointer to your object and the sort criteria. That is:
Instead of
struct MyType {
    // ...
    int m_SomeField; // this is the sort criteria
};

std::vector<MyType*> arr;
You may do this:
struct ArrayElement {
    MyType* m_pObj;     // the actual object
    int m_SortCriteria; // should always be equal to m_pObj->m_SomeField
};

std::vector<ArrayElement> arr;
You may also remove the m_SomeField field from your struct, if you only access your object via this array.
This way, sorting your array won't need to dereference m_pObj on every iteration, so you'll make better use of the cache.
Of course you must keep m_SortCriteria always synchronized with the m_SomeField of the object (in case you're editing it).
As you mention, you're going to have to do some profiling to determine if this is a bottleneck and if other approaches provide any relief.
Alternatives to using an array are std::set or std::multiset, which are normally implemented as red-black binary trees and so have good performance for most applications. You're going to have to weigh using them against the frequency of the sort-when-searched pattern you implemented.
In either case, I wouldn't recommend rolling-your-own sort or search unless you're interested in learning more about how it's done.
I would think that sorting on insertion would be better. We are talking about O(log N) comparisons here, so roughly ceil(log2(N)) + 1 retrievals of the data to sort with.
For 2000 elements, that amounts to about 12.
What's great about this is that you can buffer the data of the element to be inserted; that's why you only need those dozen or so retrievals to actually insert.
You may wish to look at some inlining, but do profile before you're sure THIS is the tight spot.
Nowadays you could use a set: either std::set, if the values of your structure member are unique, or std::multiset if there are duplicate values in your structure member.
One side note: the concept of using raw pointers is in general not advisable.
STL containers (if used correctly) almost always give you optimized performance.
Anyway, please see some example code:
#include <iostream>
#include <array>
#include <algorithm>
#include <set>
#include <iterator>

// Demo data structure, whatever
struct Data {
    int i{};
};

// -----------------------------------------------------------------------------------------
// Everything in the below section is executed at compile time, not at runtime.
// It will create an array of some thousands of pointers.
constexpr std::size_t DemoSize = 4000u;
using DemoPtrData = std::array<const Data*, DemoSize>;
using DemoData = std::array<Data, DemoSize>;

consteval DemoData createDemoData() {
    DemoData dd{};
    int k{};
    for (Data& d : dd)
        d.i = k++ * 2;
    return dd;
}
constexpr DemoData demoData = createDemoData();

consteval DemoPtrData createDemoPtrData(const DemoData& dd) {
    DemoPtrData dpd{};
    for (std::size_t k{}; k < dpd.size(); ++k)
        dpd[k] = &dd[k];
    return dpd;
}
constexpr DemoPtrData dpd = createDemoPtrData(demoData);
// -----------------------------------------------------------------------------------------

struct Comp { bool operator () (const Data* d1, const Data* d2) const { return d1->i < d2->i; } };
using MySet = std::multiset<const Data*, Comp>;

int main() {
    // Add some thousands of pointers. They will be sorted according to the struct member.
    MySet mySet{ dpd.begin(), dpd.end() };

    // Extract a range of data: integer values between 42 and 52
    const Data* p42 = dpd[21];
    const Data* p52 = dpd[26];

    // Show result
    for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
        std::cout << (*iptr)->i << '\n';

    // Insert a new element
    Data d1{ 47 };
    mySet.insert(&d1);

    // Show again
    std::cout << "\n\n";
    for (auto iptr = mySet.lower_bound(p42); iptr != mySet.upper_bound(p52); ++iptr)
        std::cout << (*iptr)->i << '\n';
}

How does one remove duplicate elements in place in an array in O(n) in C or C++?

Is there any method to remove the duplicate elements in an array in place in C/C++ in O(n)?
Suppose elements are a[5]={1,2,2,3,4}
then resulting array should contain {1,2,3,4}
The solution can be achieved using two for loops but that would be O(n^2) I believe.
If, and only if, the source array is sorted, this can be done in linear time:
std::unique(a, a + 5); //Returns a pointer to the new logical end of a.
Otherwise you'll have to sort first, which is (99.999% of the time) n lg n.
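For the unsorted case, the usual idiom is to sort and then dedupe in place, O(n log n) overall; a minimal sketch:
#include <algorithm>
#include <vector>

std::vector<int> v{1, 2, 2, 3, 4, 2};
std::sort(v.begin(), v.end());                      // O(n log n)
v.erase(std::unique(v.begin(), v.end()), v.end());  // O(n), in place
// v is now {1, 2, 3, 4}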
Best case is O(n log n). Perform a heap sort on the original array: O(n log n) in time, O(1)/in-place in space. Then run through the array sequentially with 2 indices (source & dest) to collapse out repetitions. This has the side effect of not preserving the original order, but since "remove duplicates" doesn't specify which duplicates to remove (first? second? last?), I'm hoping that you don't care that the order is lost.
If you do want to preserve the original order, there's no way to do things in-place. But it's trivial if you make an array of pointers to elements in the original array, do all your work on the pointers, and use them to collapse the original array at the end.
Anyone claiming it can be done in O(n) time and in-place is simply wrong, modulo some arguments about what O(n) and in-place mean. One obvious pseudo-solution, if your elements are 32-bit integers, is to use a 4-gigabit bit-array (512 megabytes in size) initialized to all zeros, flipping a bit on when you see that number and skipping over it if the bit was already on. Of course then you're taking advantage of the fact that the range of values is bounded by a constant, so technically everything is O(1), but with a horrible constant factor. However, I do mention this approach since, if the value range is bounded by a small constant - for instance if you have 16-bit integers - it's a very practical solution.
Yes. Because access (insertion or lookup) on a hashtable is O(1), you can remove duplicates in O(N).
Pseudocode:
hashtable h = {}
numdups = 0
for (i = 0; i < input.length; i++) {
    if (!h.contains(input[i])) {
        input[i - numdups] = input[i]
        h.add(input[i])
    } else {
        numdups = numdups + 1
    }
}
// the first (input.length - numdups) entries now hold the unique elements
This is O(N).
Some commenters have pointed out that whether a hashtable is O(1) depends on a number of things. But in the real world, with a good hash, you can expect constant-time performance. And it is possible to engineer a hash that is O(1) to satisfy the theoreticians.
I'm going to suggest a variation on Borealid's answer, but I'll point out up front that it's cheating. Basically, it only works assuming some severe constraints on the values in the array - e.g. that all keys are 32-bit integers.
Instead of a hash table, the idea is to use a bitvector. This is an O(1) memory requirement, which should in theory keep Rahul happy (but won't). With 32-bit integers, the bitvector will require 512MB (i.e. 2**32 bits) - assuming 8-bit bytes, as some pedant may point out.
As Borealid should point out, this is a hashtable, just using a trivial hash function. This does guarantee that there won't be any collisions: the only way there could be a collision is by having the same value in the input array twice, but since the whole point is to ignore the second and later occurrences, this doesn't matter.
Pseudocode for completeness...
src = dest = input.begin();
while (src != input.end())
{
    if (!bitvector[*src])
    {
        bitvector[*src] = true;
        *dest = *src; dest++;
    }
    src++;
}
// at this point, dest gives the new end of the array
Just to be really silly (but theoretically correct), I'll also point out that the space requirement is still O(1) even if the array holds 64-bit integers. The constant term is a bit big, I agree, and you may have issues with 64-bit CPUs that can't actually use the full 64 bits of an address, but...
Take your example. If the array elements are bounded integers, you can create a lookup bitarray.
If you find an integer such as 3, turn the 3rd bit on.
If you find an integer such as 5, turn the 5th bit on.
If the array contains elements other than integers, or the elements are not bounded, using a hashtable would be a good choice, since hashtable lookup cost is constant.
The canonical implementation of the unique() algorithm looks like something similar to the following:
template<typename Fwd>
Fwd unique(Fwd first, Fwd last)
{
    if (first == last) return first;
    Fwd result = first;
    while (++first != last) {
        if (!(*result == *first))
            *(++result) = *first;
    }
    return ++result;
}
This algorithm takes a range of sorted elements. If the range is not sorted, sort it before invoking the algorithm. The algorithm will run in-place, and return an iterator pointing to one-past-the-last-element of the unique'd sequence.
If you can't sort the elements, then you've cornered yourself and have no choice but to use an algorithm with runtime performance worse than O(n) for the task.
This algorithm runs in O(n) time - big-oh of n, worst case, not amortized - and uses O(1) space.
The example you have given is a sorted array, so it is possible only in that case (given your constant space constraint).

std::map and performance, intersecting sets

I'm intersecting some sets of numbers, and doing this by storing a count of each time I see a number in a map.
I'm finding the performance to be very slow.
Details:
- One of the sets has 150,000 numbers in it
- The intersection of that set and another set takes about 300ms the first time, and about 5000ms the second time
- I haven't done any profiling yet, but every time I break into the debugger while doing the intersection, it's in malloc.c!
So, how can I improve this performance? Switch to a different data structure? Some how improve the memory allocation performance of map?
Update:
Is there any way to ask std::map or boost::unordered_map to pre-allocate some space?
Or, are there any tips for using these efficiently?
Update2:
See Fast C++ container like the C# HashSet<T> and Dictionary<K,V>?
Update3:
I benchmarked set_intersection and got horrible results:
(set_intersection) Found 313 values in the intersection, in 11345ms
(set_intersection) Found 309 values in the intersection, in 12332ms
Code:
int runIntersectionTestAlgo()
{
    set<int> set1;
    set<int> set2;
    set<int> intersection;

    // Create 100,000 values for set1
    for (int i = 0; i < 100000; i++)
    {
        int value = 1000000000 + i;
        set1.insert(value);
    }

    // Create 1,000 values for set2
    for (int i = 0; i < 1000; i++)
    {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        set2.insert(value);
    }

    set_intersection(set1.begin(), set1.end(), set2.begin(), set2.end(),
                     inserter(intersection, intersection.end()));
    return intersection.size();
}
You should definitely be using preallocated vectors, which are way faster. The problem with doing set intersection with STL sets is that each time you move to the next element, you're chasing a dynamically allocated pointer, which could easily not be in your CPU caches. With a vector, the next element will often be in your cache because it's physically close to the previous element.
The trick with vectors is that if you don't preallocate the memory for a task like this, it'll perform EVEN WORSE, because it'll go on reallocating memory as it resizes itself during your initialization step.
Try something like this instead - it'll be WAY faster.
int runIntersectionTestAlgo() {
    vector<int> vector1; vector1.reserve(100000);
    vector<int> vector2; vector2.reserve(1000);

    // Create 100,000 values for vector1
    for (int i = 0; i < 100000; i++) {
        int value = 1000000000 + i;
        vector1.push_back(value);
    }
    sort(vector1.begin(), vector1.end());

    // Create 1,000 values for vector2
    for (int i = 0; i < 1000; i++) {
        int random = rand() % 200000 + 1;
        random *= 10;
        int value = 1000000000 + random;
        vector2.push_back(value);
    }
    sort(vector2.begin(), vector2.end());

    // Reserve at most 1,000 spots for the intersection
    vector<int> intersection; intersection.reserve(min(vector1.size(), vector2.size()));

    set_intersection(vector1.begin(), vector1.end(), vector2.begin(), vector2.end(),
                     back_inserter(intersection));
    return intersection.size();
}
Without knowing any more about your problem, "check with a good profiler" is the best general advice I can give. Beyond that...
If memory allocation is your problem, switch to some sort of pooled allocator that reduces calls to malloc. Boost has a number of custom allocators that should be compatible with std::allocator<T>. In fact, you may even try this before profiling, if you've already noticed debug-break samples always ending up in malloc.
If your number-space is known to be dense, you can switch to using a vector- or bitset-based implementation, using your numbers as indexes in the vector.
If your number-space is mostly sparse but has some natural clustering (this is a big if), you may switch to a map-of-vectors. Use higher-order bits for map indexing, and lower-order bits for vector indexing. This is functionally very similar to simply using a pooled allocator, but it is likely to give you better caching behavior. This makes sense, since you are providing more information to the machine (clustering is explicit and cache-friendly, rather than a random distribution you'd expect from pool allocation).
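A rough sketch of that map-of-vectors layout, assuming 32-bit keys split at bit 16 (the split point and names are arbitrary, for illustration only):
#include <cstdint>
#include <map>
#include <vector>

// upper bits pick the cluster; lower bits live together in one contiguous vector
std::map<std::uint32_t, std::vector<std::uint16_t>> clusters;

void insert(std::uint32_t value) {
    clusters[value >> 16].push_back(static_cast<std::uint16_t>(value & 0xFFFF));
}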
I would second the suggestion to sort them. There are already STL set algorithms that operate on sorted ranges (like set_intersection, set_union, etc):
set_intersection
I don't understand why you have to use a map to do the intersection. As people have said, you could put the sets in std::set's and then use std::set_intersection().
Or you can put them into hash_sets. But then you would have to implement the intersection manually: technically you only need to put one of the sets into a hash_set, then loop through the other one and test whether each element is contained in the hash_set.
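That loop is straightforward to write with std::unordered_set (the standardized descendant of hash_set); a sketch:
#include <unordered_set>
#include <vector>

std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
    std::unordered_set<int> lookup(a.begin(), a.end());  // expected O(N) to build
    std::vector<int> result;
    for (int x : b)                                      // expected O(1) probe per element
        if (lookup.count(x))
            result.push_back(x);
    return result;
}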
Intersection with maps is slow; try a hash_map (however, this is not provided in all STL implementations).
Alternatively, sort both maps and do it in a merge-sort-like way.
What is your intersection algorithm? Maybe there are some improvements to be made?
Here is an alternate method
I do not know whether it is faster or slower, but it could be something to try. Before doing so, I also recommend using a profiler to ensure you really are working on the hotspot. Change the sets of numbers you are intersecting to use std::set<int> instead. Then iterate through the smallest one, looking at each value you find. For each value in the smallest set, use the find method to see if the number is present in each of the other sets (for performance, search from smallest to largest).
This is optimised in the case that the number is not found in all of the sets, so if the intersection is relatively small, it may be fast.
Then, store the intersection in std::vector<int> instead - insertion using push_back is also very fast.
Here is another alternate method
Change the sets of numbers to std::vector<int> and use std::sort to sort from smallest to largest. Then use std::binary_search to find the values, using roughly the same method as above. This may be faster than searching a std::set, since the array is more tightly packed in memory. Actually, never mind that: you can just iterate through the values in lock-step, looking at the ones with the same value, and increment only the iterators which are less than the minimum value you saw at the previous step (if the values were different).
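The lock-step walk looks like this (it is essentially what std::set_intersection does internally); a sketch over two sorted vectors:
#include <cstddef>
#include <vector>

std::vector<int> intersect_sorted(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;           // advance the smaller side
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}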
Might be your algorithm. As I understand it, you are spinning over each set (which I'm hoping is a standard set) and throwing them into yet another map. This is doing a lot of work you don't need to do, since the keys of a standard set are in sorted order already. Instead, take a "merge-sort"-like approach: spin over each iterator, dereferencing to find the min. Count the number that have that min and increment those. If the count was N, add it to the intersection. Repeat until the first map hits its end. (If you compare the sizes before starting, you won't have to check every map's end each time.)
Responding to the update: there do exist facilities to speed up memory allocation by pre-reserving space, like boost::pool_alloc. Something like:
std::map<int, int, std::less<int>, boost::pool_allocator< std::pair<int const, int> > > m;
But honestly, malloc is pretty good at what it does; I'd profile before doing anything too extreme.
Look at your algorithms, then choose the proper data type. If you're going to have set-like behaviour, and want to do intersections and the like, std::set is the container to use.
Since its elements are stored in a sorted way, insertion may cost you O(log N), but intersection with another (sorted!) std::set can be done in linear time.
I figured something out: if I attach the debugger to either RELEASE or DEBUG builds (e.g. hit F5 in the IDE), then I get horrible times.