I have a bit of an issue. I was recently told that for unordered input, say a million random values, using a set would be more efficient than filling a vector and then sorting it with the basic sort algorithm function. But when I tested both, checking them with the time command in the terminal and with valgrind, the vector came out ahead in both running time and space usage, even with the extra call to the sort function. The person who gave me the advice to use the set is a lot more experienced than I am with the C++ language, but I always have to test things out myself before taking people's advice. The test code follows.
For the set:
std::set<int> testSet;
for (int i(0); i <= 1000000; ++i)
    testSet.insert(-i);
For the vector:
std::vector<int> testVector;
for (int i(0); i <= 1000000; ++i)
    testVector.push_back(i * -1);
std::sort(testVector.begin(), testVector.end());
I know that these are not random values; that wouldn't be a fair comparison anyway, since set does not allow duplicates and vector does, so they would end up different sizes for this basic test. Can anyone clarify why the set should be used, apart from the no-duplicates point?
I did not run any tests with unordered_set either; I'm not too sure of the differences between the two anyway.
This is too vague and misses out on several crucial factors. If your friend said precisely this, then your friend (regardless of his or her experience) was wrong. More likely you are somewhat misinterpreting their words and reading into them a simplified version of matters.
When you want a sorted final product, the sorting is "amortized" when you insert into a set, because you get little bits of sorting action each time. If you will be inserting periodically and many times, then that spreading-out of the workload may be what you want. The total, when added up, may still be more than for a vector (consider the occasional rebalancing and so forth; your vector just needs to be moved to a larger block of memory once in a while), but you've spread it out so as not to noticeably slow down any other individual part of your program.
But if you're just dumping all the elements into a vector and sorting straight away, not only is there less work for the container & algorithm to do but you probably don't mind it taking a noticeable amount of time.
You haven't really stated your use case in any detail, so I won't pretend to give specifics here, but the only possible answer to your question as posed is both "it depends" and "the question is fundamentally somewhat meaningless"; you cannot just take two data structures and sorting methodologies and ask "which is more efficient?" without a use case. You have, however, correctly measured the time and space requirements, and if you've done that against your real-world use case then, well, you have your answer, don't you?
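If you want to compare the two in-process rather than with the shell's time command, a minimal sketch with std::chrono looks like this (the input size, seed, and setup are assumptions, not a definitive benchmark):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main() {
    std::mt19937 gen(42);                          // fixed seed so runs are repeatable
    std::uniform_int_distribution<int> dist;
    std::vector<int> input(1000000);
    for (int& v : input) v = dist(gen);            // identical input for both contenders

    auto t0 = std::chrono::steady_clock::now();
    std::set<int> s(input.begin(), input.end());   // sorting amortized across inserts
    auto t1 = std::chrono::steady_clock::now();

    std::vector<int> v(input);
    std::sort(v.begin(), v.end());                 // sorting done once, at the end
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("set:         %lld ms\n", (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("vector+sort: %lld ms\n", (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}

Note the set will end up smaller if the input contains duplicates, so compare the final sizes as well before reading too much into the numbers.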
Related
I'm building a little 2D game engine. Now I need to store the prototypes of the game objects (all kinds of information). The container will hold at most a few thousand elements, I'd guess, all with unique keys, and no elements will be deleted or added after the initial load. The key value is a string.
Various threads will run, and I need to send each of them a key (or index) and have them use it to access other information (like a texture for the render process or a sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster for accessing a known element. But I see that an unordered map also usually has constant-time access if I use the ::at element access. It would make the code much cleaner and also easier to maintain, because I would be dealing with much more understandable, human-readable strings.
So the question is: is the difference in speed between an access to vector[n] and an access to unordered_map.at("string") negligible compared to its benefits?
From what I understand, being able to access the various maps in different parts of the program, with different threads running, just by a "name" is a big deal for me, and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I've found information about it, I can't really tell whether I'm right or wrong.
Thank you for your time.
As an alternative, you could consider using an ordered vector, because the vector itself will not be modified after the initial load. You can easily write an implementation yourself with STL lower_bound etc., or use an implementation from a library (boost::flat_map).
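A minimal sketch of that idea, assuming a string key and a placeholder payload type (boost::container::flat_map packages the same technique):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // key -> payload (placeholder types)

// Sort once after the initial load; the table is never modified afterwards.
void prepare(std::vector<Entry>& table) {
    std::sort(table.begin(), table.end(),
              [](const Entry& a, const Entry& b) { return a.first < b.first; });
}

// Binary search over contiguous storage.
const int* find(const std::vector<Entry>& table, const std::string& key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry& e, const std::string& k) { return e.first < k; });
    return (it != table.end() && it->first == key) ? &it->second : nullptr;
}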
There is a blog post from Scott Meyers about container performance in this case. He did some benchmarks, and his conclusion was that an unordered_map is probably a very good choice, with high chances that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal perfect hash function, e.g. with gperf.
However, for these kinds of problems the first rule is to measure it yourself.
My problem was to find a record in a container by a given std::string as the key. Only keys that EXIST are looked up (not finding them was not an option), and the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears that unordered map would not be fast enough. So I tested it, and I want to share the results, hoping I haven't gotten everything wrong.
I just hope this can help others like me, and to get some feedback, because in the end I'm a beginner.
So, given a struct for a record, filled randomly, like this:
struct The_Mess
{
    std::string A_string;
    long double A_ldouble;
    char C[10];
    int* intPointer;
    std::vector<unsigned int> A_vector;
    std::string Another_String;
};
I made an unordered map, given that A_string contains the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector that I sort by the A_string value (which contains the key):
std::vector<The_Mess> The_Vector;
and also a sorted index vector, used for access as a third way:
std::vector<std::string> index;
The key will be a random string 0-20 characters in length (I wanted the worst possible scenario), containing uppercase and lowercase letters, digits, and spaces.
So, in short, our contenders are:
Unordered map. I measure the time the program takes to execute:
record = The_UnOrdMap.at( key );
(record is just a The_Mess struct.)
Sorted vector. Measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted index vector. Measured statements:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
The times are in nanoseconds and are an arithmetic average over 200 iterations. I have a vector, shuffled at every iteration, containing all the keys; at every iteration I cycle through it and look up each key in the three ways above.
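For reference, the measuring loop I'm describing looks roughly like this. It's a fragment that reuses the declarations above (The_UnOrdMap, index, record); the clock choice is an assumption:

#include <algorithm>
#include <chrono>
#include <random>

std::mt19937 rng(std::random_device{}());
long long total_ns = 0;
for (int iter = 0; iter < 200; ++iter) {
    std::shuffle(index.begin(), index.end(), rng);  // new key order every iteration
    auto t0 = std::chrono::steady_clock::now();
    for (const std::string& key : index)
        record = The_UnOrdMap.at(key);              // or one of the two vector lookups
    auto t1 = std::chrono::steady_clock::now();
    total_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
long long avg_ns = total_ns / (200LL * index.size());  // arithmetic average per lookup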
So these are my results:
I think the initial spikes are a fault of my testing logic (the table I iterate over contains only the keys generated so far, so it has only 1..n elements): 200 iterations searching for 1 key the first time, 200 iterations searching for 2 keys the second time, etc.
Anyway, it seems that in the end the best option is the unordered map, considering that it is a lot less code, it's easier to implement, and it will make the whole program way easier to read and probably to maintain/modify.
You have to think about caching as well. In the case of std::vector you'll have very good cache performance when accessing the elements: when one element is read from RAM, the CPU caches nearby memory as well, and this will include nearby portions of your std::vector.
When you use std::map this is no longer true. std::map is usually implemented as a self-balancing binary search tree whose nodes are allocated individually, so the values can be scattered around RAM. This imposes a great hit on cache performance, especially as the map gets bigger and bigger, because the CPU cannot prefetch the memory you're about to access. (std::unordered_map is a hash table rather than a tree, but its elements are also node-allocated, so much of the same caveat applies.)
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.
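If you want to see the effect for yourself, a quick sketch like this (the container size is arbitrary) usually shows the contiguous traversal winning by a wide margin:

#include <chrono>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    const int N = 1 << 22;
    std::vector<int> v;
    std::map<int, int> m;
    for (int i = 0; i < N; ++i) { v.push_back(i); m[i] = i; }

    auto time_sum = [](const char* label, auto&& f) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = f();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%lld in %lld us\n", label, s,
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    };
    time_sum("vector", [&] { long long s = 0; for (int x : v) s += x; return s; });            // contiguous
    time_sum("map",    [&] { long long s = 0; for (auto& kv : m) s += kv.second; return s; }); // node-based
}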
You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree; it is a hash table, and its bucket array is contiguous like a vector (the elements themselves are node-allocated). Granted, you are going to suffer if you have collisions due to your hashing function being bad, but if your key is a simple integer this is not going to happen. As a result, access to an element in the hash map will cost about the same as access to an element in the vector, plus the time spent computing the hash value of an integer, which is really non-measurable.
This is a follow-up question to this MIC (Boost.MultiIndex container) question. When adding items to the vector of reference wrappers, I spend about 80% of the time inside the ++ operator, whatever iterating approach I choose.
The query works as follows:
VersionView getVersionData(int subdeliveryGroupId, int retargetingId,
                           const std::wstring& flightName) const {
    VersionView versions;
    for (auto i = 0; i < 3; ++i) {
        for (auto j = 0; j < 3; ++j) {
            versions.insert(m_data.get<mvKey>().equal_range(
                boost::make_tuple(subdeliveryGroupId + i, retargetingId + j, flightName)));
        }
    }
    return versions;
}
I've tried the following ways to fill the reference wrapper:
template <typename InputRange> void insert(const InputRange& rng) {
    // 1) base::insert(end(), rng.first, rng.second);                  // 12ms
    // 2) std::copy(rng.first, rng.second, std::back_inserter(*this)); // 6ms
    /* 3)                                                              // 12ms
    size_t start = size();
    auto tmp = std::reference_wrapper<const VersionData>(VersionData(0, 0, L""));
    resize(start + boost::size(rng), tmp);
    auto beg = rng.first;
    for (; beg != rng.second; ++beg, ++start) {
        this->operator[](start) = std::reference_wrapper<const VersionData>(*beg);
    }
    */
    std::copy(rng.first, rng.second, std::back_inserter(*this));
}
Whatever I do, I pay for operator++ or for the size method, which just increments the iterator, meaning I'm still stuck in ++. So the question is whether there is a way to iterate over the result ranges faster. If there is no such way, is it worth trying to go down into the implementation of equal_range and add a new argument which holds a reference to the container of reference_wrappers, to be filled with the results instead of creating a range?
EDIT 1: sample code
http://coliru.stacked-crooked.com/a/8b82857d302e4a06/
Due to this bug it will not compile on Coliru.
EDIT 2: Call tree, with time spent in operator++
EDIT 3: Some concrete stuff. First of all, I didn't start this thread just because operator++ takes too much of the overall execution time and I don't like that "just because"; at this very moment it is the major bottleneck in our performance tests. Each request is usually processed in hundreds of microseconds; requests similar to this one (they are somewhat more complex) are processed in ~1000-1500 microseconds, which is still acceptable. The original problem was that once the number of items in the data structure grew to hundreds of thousands, performance deteriorated to something like 20 milliseconds. Now, after switching to MIC (which drastically improved code readability, maintainability, and overall elegance), I can reach something like 13 milliseconds per request, of which 80%-90% is spent in operator++. Now the question is whether this could be improved somehow, or should I look for some tar and feathers for myself? :)
The fact that 80% of getVersionData's execution time is spent in operator++ is not indicative of any performance problem per se; at most, it tells you that equal_range and std::reference_wrapper insertion are faster by comparison. Put another way, when you profile some piece of code you will typically find the locations where the most time is spent, but whether this is a problem or not depends on the required overall performance.
#kreuzerkrieg, your sample code does not exercise any kind of insertion into a vector of std::reference_wrappers! Instead, you're projecting the result of equal_range into a boost::any_range, which is expected to be fairly slow to iterate; basically, increment ops resolve to virtual calls.
So, unless I'm seriously missing something here, the sample code's performance or lack thereof does not have anything to do with whatever your problem is in the real code (assuming VersionView, whose code you don't show, is not using boost::any_range).
That said, if you can afford to replace your ordered indices with equivalent hashed indices, iteration will probably be faster, but this is an utter shot in the dark given that you're not showing the real stuff.
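For illustration, the index swap would look roughly like this in the container definition. The element type and key extractors are made up, since the real ones aren't shown:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <string>

struct Row {                       // hypothetical element type
    int groupId;
    int retargetingId;
    std::wstring flightName;
};

namespace bmi = boost::multi_index;

using Data = bmi::multi_index_container<
    Row,
    bmi::indexed_by<
        // hashed_non_unique instead of ordered_non_unique: equal_range on the
        // full composite key still works, without tree traversal on increment
        bmi::hashed_non_unique<
            bmi::composite_key<
                Row,
                bmi::member<Row, int, &Row::groupId>,
                bmi::member<Row, int, &Row::retargetingId>,
                bmi::member<Row, std::wstring, &Row::flightName>
            >
        >
    >
>;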
I think that you're measuring the wrong things entirely. When I scale up from 3x3x11111 to 10x10x111111 (so 111x as many items in the index), it still runs in 290ms.
And populating the stuff takes orders of magnitude more time. Even deallocating the container appears to take more time.
What Doesn't Matter?
I've contributed a version with some trade-offs, which mainly shows that there's no sense in tweaking things: View On Coliru
there's a switch to avoid the any_range (it doesn't make sense to use that if you care about performance)
there's a switch to tweak the flyweight:
#define USE_FLYWEIGHT 0 // 0: none 1: full 2: no tracking 3: no tracking no locking
again, it merely shows you could easily do without it, and you should consider doing so unless you need the memory optimization for the string (?). If so, consider using the OPTIMIZE_ATOMS approach:
OPTIMIZE_ATOMS basically does flyweight for wstring there. Since all the strings are repeated here, it will be mighty storage-efficient (although the implementation is quick and dirty and should be improved). The idea is applied much better here: How to improve performance of boost interval_map lookups
Here are some rudimentary timings:
As you can see, basically nothing actually matters for query/iteration performance.
Any Iterators: Do They Matter?
It might be the culprit with your compiler. With my compiler (gcc 4.8.2) it wasn't anything big, but see the disassembly of the accumulate loop without the any iterator:
As you can see from the sections I've highlighted, there doesn't seem to be much fat in the algorithm, the lambda, or the iterator traversal. Now with the any_iterator the situation is much less clear, and if your compiler optimizes less well, I can imagine it failing to inline elementary operations, making iteration slow. (Just guessing a little now.)
Ok, so the solution I applied is as follows:
In addition to the ordered_non_unique index (the 'byKey' one), I've added a random_access index. When the data is loaded, I rearrange the random index with m_data.get.begin(). Then, when the MIC is queried for data, I just do boost::equal_range on the random index with a custom predicate which emulates the same logic that was applied in the ordering of the 'byKey' index. That's it; it gave me a fast size() (O(1), as I understand it) and fast traversal.
Now I'm ready for your rotten tomatoes :)
EDIT 1:
Of course, I've changed the any_range from the bidirectional traversal tag to the random access one.
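A rough sketch of that arrangement, with hypothetical names (the real key is a composite, and the predicate must replicate the 'byKey' ordering exactly):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/random_access_index.hpp>
#include <boost/multi_index/member.hpp>
#include <algorithm>

struct Row { int key; /* payload */ };  // stand-in element type

namespace bmi = boost::multi_index;
using Data = bmi::multi_index_container<
    Row,
    bmi::indexed_by<
        bmi::ordered_non_unique<bmi::member<Row, int, &Row::key>>,  // the 'byKey' index
        bmi::random_access<>                                        // the added index
    >
>;

struct KeyOrder {  // replicates the 'byKey' ordering for binary search
    bool operator()(const Row& r, int k) const { return r.key < k; }
    bool operator()(int k, const Row& r) const { return k < r.key; }
};

void query(Data& m_data) {
    auto& rnd = m_data.get<1>();
    rnd.rearrange(m_data.get<0>().begin());  // mirror the sorted order once, at load time

    auto range = std::equal_range(rnd.begin(), rnd.end(), 42, KeyOrder{});
    auto n = range.second - range.first;     // O(1) size, plain random-access traversal
    (void)n;
}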
Context
I have code like this:
...
vector<int> values = ...;
vector<vector<int>> buckets;
// reserve space for values and each buckets sub-vector
for (int i = 0; i < values.size(); i++) {
    buckets[values[i]].push_back(i);
}
...
So I get a "buckets" with indexes of entries that have same value. Those buckets are used then in further processing.
Actually I'm working with native dynamic arrays (int** buckets;), but for simplicity's sake I've used vectors above.
I know the size of each bucket before filling.
Size of vectors is about 2,000,000,000.
The problem
As you can see, the code above accesses the "buckets" array in a random manner, so it has constant cache misses, which slows execution time dramatically.
Yes, I see such misses in profile report.
Question
Is there a way to improve speed of such code?
I've tried to create an aux vector and put the first occurrence of each value there, so that I could put two indexes into the corresponding bucket once I found the second one. This approach didn't give any speedup.
Thank you!
Why are you assuming it's cache misses that make your code slow? Have you profiled or is that just what came to mind?
There are a number of things very wrong with your code from a performance perspective. The first and most obvious is that you never reserve the vectors' sizes. What happens is that each vector starts out very small (say, 2 elements); then, each time you add past its capacity, it resizes and copies the contents over to a new memory location. If there really are 2 billion entries, you're resizing maybe 30 times!
You need to call vector.reserve() (or vector.resize(), depending on which behavior works best for you) before you look at other improvements.
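A sketch of what that looks like for the bucket layout above, assuming (as the question states) the per-bucket sizes can be counted up front:

#include <vector>
using std::vector;

void fill_buckets(const vector<int>& values, vector<vector<int>>& buckets) {
    // First pass: count how big each bucket will be.
    vector<int> counts(buckets.size(), 0);
    for (int v : values) ++counts[v];

    // Reserve once, so push_back never reallocates.
    for (size_t b = 0; b < buckets.size(); ++b)
        buckets[b].reserve(counts[b]);

    // Second pass: fill.
    for (size_t i = 0; i < values.size(); ++i)
        buckets[values[i]].push_back((int)i);
}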
EDIT
Seriously? You mention in your PS that you're not even using a vector? How are we supposed to guess what your actual code looks like and how it will perform?
Is foo at least reversible and surjective on a given interval? Then you can run through that interval and fill buckets[j] completely with bar(j, k) (if bar is the inverse of foo), for k in [0, ..., MAX_BAR_J), then continue with j+1, and so forth.
If, however, foo has hashing properties, you have very little chance, because you cannot predict which index the next i will take you to. So I see no chance right now.
I've always been told vectors are fast, and in my years of programming experience, I've never seen anything to contradict that. I decided to (prematurely optimize and) write an associative class that was a thin wrapper around a sequential container (namely ::std::vector) and provided the same interface as ::std::map. Most of the code was really easy, and I got it working with little difficulty.
However, in my tests with various-sized POD types (4 to 64 bytes) and std::strings, with counts varying from eight to two thousand, ::std::map::find was faster than my ::associative::find, usually by around 15%, for almost all tests. I made a Short, Self-Contained, Correct (Compilable) Example that clearly shows this at ideone. I checked MSVC9's implementation of ::std::map::find and confirmed that it matches my vecfind and ::std::lower_bound code quite closely, and I cannot account for why ::std::map::find runs faster, except for a discussion on Stack Overflow where people speculated that the binary search does not benefit at all from the locality of the vector until the last comparisons (making it no faster), and that it requires pointer arithmetic that ::std::map nodes don't require, making it slower.
Today someone challenged me on this and provided this code at ideone, which, when I tested it, showed the vector to be over twice as fast.
Do the coders of Stack Overflow want to enlighten me about this apparent discrepancy? I've gone over both sets of code, and they seem equivalent to me, but maybe I'm blind from playing with both of them so much.
(Footnote: this is very close to one of my previous questions, but my code had several bugs which were addressed. Due to new information/code, I felt this was different enough to justify a separate question. If not, I'll work on merging them.)
What makes you think that mapfind() is faster than vecfind()?
The ideone output for your code reports about 50% more ticks for mapfind() than for vecfind(). Running the code here (x86_64 Linux, g++ 4.5.1), mapfind() takes about twice as long as vecfind().
Making the map/vector larger by a factor of 10, the difference increases to about 3×.
Note, however, that the sum of the second components differs. The map contains only one pair with any given first component (with my local PRNG, that makes the map two elements short), while the vector can contain multiple such pairs.
The number of elements you're putting into your test container is greater than the number of possible outputs from rand() in Microsoft's implementation, so you're getting repeated numbers. The sorted vector will contain all of them, while the map will throw out the duplicates. Check the sizes after filling them: the vector will have 100,000 elements, the map 32,768. Since the map is much shorter, of course it will have better performance.
Try a multimap for an apples-to-apples comparison.
I see some problems with the code (http://ideone.com/41iKt) you posted on ideone.com. (ideone actually shows the vector as faster, but a local build with the Visual Studio 11 Developer Preview shows the map as faster.)
First, I moved the map variable up and used it to initialize the vector, to get the same element ordering and uniquing; then I gave lower_bound a custom comparator that compares only first, since that's what the map will be doing. After these changes, Visual Studio 11 shows the vector as faster for the same 100,000 elements (although the ideone time doesn't change significantly). http://ideone.com/b3OnH
With test_size reduced to 8, the map is still faster. This isn't surprising; this is how algorithmic complexity works: all the constants in the function that truly describes the run time matter at small N. I had to raise test_size to about 2,700 for the vector to pull even with, and then ahead of, the map on this system.
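The comparator change mentioned above looks something like this (a sketch; the pair type is assumed to match what the test stores):

#include <algorithm>
#include <utility>
#include <vector>

using P = std::pair<int, int>;

// Compare only .first, mirroring what std::map does with its key.
struct FirstLess {
    bool operator()(const P& a, const P& b) const { return a.first < b.first; }
    bool operator()(const P& a, int k) const { return a.first < k; }
    bool operator()(int k, const P& b) const { return k < b.first; }
};

// v must be sorted with FirstLess.
const P* vecfind(const std::vector<P>& v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key, FirstLess{});
    return (it != v.end() && it->first == key) ? &*it : nullptr;
}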
A sorted std::vector has two advantages over std::map:
Better data locality: vector stores all data contiguously in memory
Smaller memory footprint: vector does not need much bookkeeping data (e.g., no tree node objects)
Whether these two effects matter depends on the scenario. There are two factors that are likely to have a major impact:
Data type
It is an advantage for the std::vector if the elements are primitive types like integers. In that case, the locality really helps because all data needed by the search is in a contiguous location in memory.
If the elements are, say, strings, then the locality does not help that much. The contiguous vector memory now stores only pointers to string data that is potentially scattered all over the heap.
Data size
If the std::vector fits into a particular cache level but the std::map does not, the std::vector will have an advantage. This is especially the case if you keep repeating the test over the same data.
I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like 500k insertions, but only 100 distinct values).
Currently I'm using a std::set (only insert and the iteration interface), which is decent but still not fast enough. std::unordered_set was twice as slow, and the same goes for the Google hash maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and not bothering with the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison versus insertion.
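The bookkeeping is tiny; a sketch, assuming std::set as the underlying container per the question:

#include <set>

struct DedupInserter {
    std::set<int> s;
    bool has_last = false;
    int last = 0;

    void insert(int v) {
        if (has_last && last == v) return;  // run of equal values: skip the set lookup
        s.insert(v);
        last = v;
        has_last = true;
    }
};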
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array and, for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or, instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values, checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
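Assuming you have found such a collision-free hash, the container degenerates to a flag array plus an inverse-lookup table. A sketch (the hash below is a stand-in that is only collision-free if the 100 values happen to differ in their low 7 bits):

#include <array>

constexpr int TABLE_SIZE = 128;

// Placeholder: a real perfect hash would be generated offline (e.g. with gperf).
int perfect_hash(int value) { return value & (TABLE_SIZE - 1); }

struct TinySet {
    std::array<int, TABLE_SIZE> flags{};    // zero-initialized
    std::array<int, TABLE_SIZE> inverse{};  // inverse_hash as a lookup table

    void insert(int value) {
        int h = perfect_hash(value);
        flags[h] = 1;
        inverse[h] = value;                 // remember the original key
    }
    template <typename F> void for_each(F f) const {
        for (int n = 0; n < TABLE_SIZE; ++n)
            if (flags[n]) f(inverse[n]);    // enumerate inserted keys
    }
};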
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and the lookup in assembler; on linear data structures this is very simple, so the coding and maintenance effort of an assembler implementation shouldn't be unduly hard.
You might want to consider implementing a HashTree using a base-10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10), or adjust your bucket size based on your expected distribution so that you only have a couple of keys per bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list, though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to it.
Maybe a set with a B-tree (instead of a binary tree) as the internal data structure. I found this article on CodeProject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? Flagging all 32-bit values takes "only" 2^32 bits, which is 512MB (4G bits / 8); not much for a high-end server. That would make your insertions O(1). But it could make the iteration slow, although skipping all words that are only zeroes would optimize away most of the iteration. If your 100 numbers lie in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
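A sketch of that brute-force bitmap, with min/max tracking to narrow the scan (the ctz intrinsic is gcc/clang-specific):

#include <cstdint>
#include <vector>

struct BitmapSet {
    std::vector<uint64_t> words = std::vector<uint64_t>(1ull << 26);  // 2^32 bits = 512 MB
    uint32_t lo = UINT32_MAX;
    uint32_t hi = 0;

    void insert(uint32_t v) {                         // O(1)
        words[v >> 6] |= 1ull << (v & 63);
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }
    template <typename F> void for_each(F f) const {
        if (lo > hi) return;                          // nothing inserted yet
        for (uint32_t w = lo >> 6; w <= (hi >> 6); ++w) {
            uint64_t bits = words[w];
            while (bits) {                            // all-zero words cost one test
                unsigned b = __builtin_ctzll(bits);   // index of lowest set bit
                f((w << 6) + b);
                bits &= bits - 1;                     // clear that bit
            }
        }
    }
};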
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure that a naive unordered set of elements, packed in a fixed array with a simple swap-to-front when an insert collides, is too slow? It's a simple experiment that might show you have memory-locality issues rather than algorithmic issues.
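The experiment is about a dozen lines; a sketch, sized for the ~100 distinct values from the question:

#include <algorithm>

struct FlatSet {
    int items[100];
    int count = 0;

    void insert(int v) {
        for (int i = 0; i < count; ++i) {
            if (items[i] == v) {                // duplicate insert ("collision")
                std::swap(items[0], items[i]);  // swap the hot value to the front
                return;
            }
        }
        items[count++] = v;                     // new distinct value: append
    }
    // iteration: items[0] .. items[count-1] are the distinct inserted keys
};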