How to make a fast search for an object with a particular value in a vector of structs or classes? c++


If I have thousands of struct or class objects in a vector, how can I quickly find the ones I need?
For example:
I am making a game and need the fastest possible collision detection. Each tile is a struct, and the map is a vector of many tiles, each with x and y values.
So basically I do:
for (std::size_t i = 0; i < tiles.size(); ++i)
{
    // check whether tiles[i].x == 100 && tiles[i].y == 200
}
Is there a different approach, such as smart pointers or something similar, that finds particular objects faster?

You should sort your vector and then use standard library algorithms such as std::binary_search, std::lower_bound, or std::upper_bound.
This gives you O(log n) lookups, which is better than the O(n) you get from walking the entire vector or from the standard library algorithm std::find.
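A minimal sketch of the sort-then-binary-search approach, assuming a hypothetical Tile struct that holds only x and y (the names Tile, tileLess, sortTiles and findTile are illustrative and not from the question):

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical tile type; only x and y come from the question.
struct Tile {
    int x;
    int y;
};

// Order tiles by x first, then by y.
inline bool tileLess(const Tile& a, const Tile& b) {
    return std::tie(a.x, a.y) < std::tie(b.x, b.y);
}

// Sort once, after the map has been loaded.
inline void sortTiles(std::vector<Tile>& tiles) {
    std::sort(tiles.begin(), tiles.end(), tileLess);
}

// Binary search in a vector previously sorted by sortTiles().
// Returns a pointer to the matching tile, or nullptr if there is none.
inline const Tile* findTile(const std::vector<Tile>& tiles, int x, int y) {
    const Tile key{x, y};
    auto it = std::lower_bound(tiles.begin(), tiles.end(), key, tileLess);
    if (it != tiles.end() && it->x == x && it->y == y) {
        return &*it;
    }
    return nullptr;
}
```

The lookup is only valid while the vector stays sorted with the same comparison, so this fits best when the tile map changes rarely.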

I think you need to look beyond the simple search for a value in a group of structs, especially if you plan to search among a large number of them.
How are the structs generated, how are they collected, and how do you keep track of them? Is there a common key that you can use to order them as you create them?
You should focus on keeping them sorted as you add them to the structure; that way you avoid a massive burst of computation every time you perform a search. Choose a good data structure (for example, an AVL tree) so that insertion, deletion, and search are all O(log n).
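One way to keep elements ordered as they are added is std::set, sketched below. Note the assumption: std::set is typically implemented as a red-black tree rather than an AVL tree, but it offers the same O(log n) insertion, deletion, and search. Tile, TileLess, TileSet and containsTile are hypothetical names.

```cpp
#include <set>
#include <tuple>

// Hypothetical tile type; only x and y come from the question.
struct Tile {
    int x;
    int y;
};

// Strict weak ordering by (x, y), used by the set to keep tiles sorted.
struct TileLess {
    bool operator()(const Tile& a, const Tile& b) const {
        return std::tie(a.x, a.y) < std::tie(b.x, b.y);
    }
};

// Balanced search tree that stays ordered on every insert and erase.
using TileSet = std::set<Tile, TileLess>;

// O(log n) membership test.
inline bool containsTile(const TileSet& tiles, int x, int y) {
    return tiles.find(Tile{x, y}) != tiles.end();
}
```

Inserting a tile with coordinates that are already present is a no-op, because the set treats equal (x, y) pairs as the same element.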

A vector is just an unordered collection of objects. There is not really any way to do what you are asking unless you sort the vector in a specific way (e.g. if it is sorted, you can jump to the middle of the vector and discard half of the remaining elements at each step).
You may be better off picking a different data structure (either instead of the vector or in combination with it)
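As one possible "different data structure" used in combination with the vector: if the tiles sit on a regular grid (an assumption the question does not state), a flat lookup table can map a position directly to an index into the tile vector. TileGrid, its members, and the fixed tileSize are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Maps a world position to the index of the tile covering it, in O(1).
// Assumes square tiles of a fixed size laid out on a columns x rows grid.
class TileGrid {
public:
    TileGrid(int columns, int rows, int tileSize)
        : columns_(columns), rows_(rows), tileSize_(tileSize),
          cells_(static_cast<std::size_t>(columns) * static_cast<std::size_t>(rows), -1) {}

    // Record that the tile stored at tileIndex (in the tile vector) covers (x, y).
    void set(int x, int y, int tileIndex) {
        if (inBounds(x, y)) cells_[cellOf(x, y)] = tileIndex;
    }

    // Returns the tile index for the cell containing (x, y), or -1 if empty or out of range.
    int indexAt(int x, int y) const {
        return inBounds(x, y) ? cells_[cellOf(x, y)] : -1;
    }

private:
    bool inBounds(int x, int y) const {
        return x >= 0 && y >= 0 && x / tileSize_ < columns_ && y / tileSize_ < rows_;
    }

    std::size_t cellOf(int x, int y) const {
        return static_cast<std::size_t>(y / tileSize_) * static_cast<std::size_t>(columns_)
             + static_cast<std::size_t>(x / tileSize_);
    }

    int columns_;
    int rows_;
    int tileSize_;
    std::vector<int> cells_;  // -1 marks an empty cell
};
```

This trades memory (one int per grid cell) for constant-time lookups, which is often acceptable for tile maps.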

For example:
for_each(v.begin(), v.end(), [](int e)
{
    if (e % 2 != 0) // vector elements that are not divisible by 2
        cout << e << endl;
});

Related

C++ Find in a vector of <int, pair>

So previously I only had 1 key I needed to look up, so I was able to use a map:
std::map <int, double> freqMap;
But now I need to look up 2 different keys. I was thinking of using a vector with std::pair i.e.:
std::vector <int, std::pair<int, double>> freqMap;
Eventually I need to look up both keys to find the correct value. Is there a better way to do this, or will this be efficient enough (will have ~3k entries). Also, not sure how to search using the second key (first key in the std::pair). Is there a find for the pair based on the first key? Essentially I can access the first key by:
freqMap[key1]
But not sure how to iterate and find the second key in the pair.
Edit: Ok adding the use case for clarification:
I need to look up a val based on 2 keys, a mux selection and a frequency selection. The raw data looks something like this:
Mux, Freq, Val
0, 1000, 1.1
0, 2000, 2.7
0, 10e9, 1.7
1, 1000, 2.2
1, 2500, 0.8
6, 2000, 2.2

The blanket answer to "which is faster" is generally "you have to benchmark it".
But besides that, you have a number of options. A std::map is more efficient than other data structures on paper, but not necessarily in practice. If you truly are in a situation where this is performance critical (i.e. avoid premature optimisation) try different approaches, as sketched below, and measure the performance you get (memory-wise and cpu-wise).
Instead of using a std::map, consider throwing your data into a struct, give it proper names and store all values in a simple std::vector. If you modify the data only seldom, you can optimise retrieval cost at the expense of additional insertion cost by sorting the vector according to the key you are typically using to find an entry. This will allow you to do binary search, which can be much faster than linear search.
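A minimal sketch of that idea for the Mux/Freq/Val data, assuming both keys together identify an entry. FreqEntry, entryLess, sortEntries and lookup are hypothetical names, and the choice of std::uint64_t for the frequency is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Named fields instead of nested pairs.
struct FreqEntry {
    int mux;
    std::uint64_t freq;
    double val;
};

// Order by mux first, then by freq.
inline bool entryLess(const FreqEntry& a, const FreqEntry& b) {
    return std::tie(a.mux, a.freq) < std::tie(b.mux, b.freq);
}

// Sort once after loading; re-sort (or insert in order) after modifications.
inline void sortEntries(std::vector<FreqEntry>& entries) {
    std::sort(entries.begin(), entries.end(), entryLess);
}

// Binary search on the sorted vector.
// Returns a pointer to the value, or nullptr if the key pair is absent.
inline const double* lookup(const std::vector<FreqEntry>& entries,
                            int mux, std::uint64_t freq) {
    const FreqEntry key{mux, freq, 0.0};
    auto it = std::lower_bound(entries.begin(), entries.end(), key, entryLess);
    if (it != entries.end() && it->mux == mux && it->freq == freq) {
        return &it->val;
    }
    return nullptr;
}
```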
However, linear search can be surprisingly fast on a std::vector because of both cache locality and branch prediction. Both of which you are likely to lose when dealing with a map, unordered_map or (binary searched) sorted vector. So, although O(n) sounds much more scary than, say, O(log n) for map or even O(1) for unordered_map, it can still be faster under the right conditions.
Especially if you discover that you don't have a discernible index member you can use to sort your entries, you will have to either stick to linear search in contiguous memory (i.e. vector) or invest into a doubly indexed data structure (effectively something akin to two maps or two unordered_maps). Having two indexes usually prevents you from using a single map/unordered_map.
If you can pack your data more tightly (e.g. do you need an int, or would a std::uint8_t do the job? Do you need a double?), you will improve cache locality, and with only 3k entries there is a good chance that an unsorted vector performs best. Although operations on a std::size_t are typically faster in themselves than operations on smaller types, iterating over contiguous memory usually offsets this effect.
Conclusion: Try an unsorted vector, a sorted vector (+binary search), a map and an unordered_map. Do proper benchmarking (with several repetitions) and pick the fastest one. If it doesn't make a difference pick the one that is the most straight-forward to understand.
Edit: Given your example data, it sounds like the first key has an extremely small domain. As far as I can tell, "Mux" seems to be limited to a small number of different values which are close to each other. In such a situation you may consider using a std::array as your primary indexing structure, with a suitable lookup structure as the second level. For example:
std::array<std::vector<std::pair<std::uint64_t,double>>,10>
std::array<std::unordered_map<std::uint64_t,double>,10>
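A sketch of how the second variant might be used, assuming mux values are non-negative and below 10 as in the declaration above. FreqTable, kMuxCount, insertValue and findValue are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Assumed upper bound on distinct mux values, matching the 10 used above.
constexpr std::size_t kMuxCount = 10;

// First level: direct indexing by mux. Second level: hash lookup by frequency.
using FreqTable = std::array<std::unordered_map<std::uint64_t, double>, kMuxCount>;

// Stores val under (mux, freq). Returns false if mux is out of range.
inline bool insertValue(FreqTable& table, std::size_t mux, std::uint64_t freq, double val) {
    if (mux >= table.size()) return false;
    table[mux][freq] = val;
    return true;
}

// Returns a pointer to the stored value, or nullptr if the key pair is absent.
inline const double* findValue(const FreqTable& table, std::size_t mux, std::uint64_t freq) {
    if (mux >= table.size()) return nullptr;
    auto it = table[mux].find(freq);
    return it == table[mux].end() ? nullptr : &it->second;
}
```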

At what point does an std::map make more sense for grouping objects compared to two vectors and a linear search?

I am trying to sort a large collection of objects into a series of groups, which represent some kind of commonality between them.
There seem to be two ways I can go about this:
1) I can manage everything by hand, sorting out all the objects into a vector of vectors. However, this means that I have to iterate over all the upper level vectors every time I want to try and find an existing group for an ungrouped object. I imagine this will become very computationally expensive very quickly as the number of disjoint groups increases.
2) I can use the identifiers of each object that I'm using to classify them as a key for an std::map, where the value is a vector. At that point, all I have to do is iterate over all the input objects once, calling myMap[object.identifier].push_back(object) each time. The map will sort everything out into the appropriate vector, and then I can just iterate over the resulting values afterwards.
My question is...
Which method would be best to use? It seems like a vector of vectors would be faster initially, but it's going to slow down as more and more groups are created. AFAIK, std::map uses RB trees internally, which means that finding the appropriate vector to add the object to should be faster, but you're going to pay for that when the tree inevitably needs to be rebalanced.
The additional memory consumption from an std::map doesn't matter. I'm dealing with anywhere from 12000 to 80000 individual objects that need to be grouped together, and I expect there to be anywhere from 12000 to 20000 groups once everything is said and done.

Instead of using either of your mentioned approaches directly, I suggest you evaluate std::unordered_map for your use case. It uses a hash table with buckets internally and has average constant complexity for search, insertion and removal.
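A minimal sketch of the grouping pass with std::unordered_map, assuming the identifier is an int (the question does not say what type it is). Object, Groups and groupByIdentifier are hypothetical names.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical object type; only the identifier matters for grouping.
struct Object {
    int identifier;
    std::string name;
};

using Groups = std::unordered_map<int, std::vector<Object>>;

// Single pass over the input: operator[] creates an empty group on first use,
// so each object lands in its group in average constant time.
inline Groups groupByIdentifier(const std::vector<Object>& objects) {
    Groups groups;
    for (const Object& object : objects) {
        groups[object.identifier].push_back(object);
    }
    return groups;
}
```

Unlike std::map, iterating over the result visits the groups in no particular order.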

Chained hash table keys with universal hashing: does it need a rehash?

I am implementing a chained hash table using a vector of lists. I resized my vector to a prime number, let's say 5. To choose the key I am using universal hashing.
My question is, do I need to rehash my vector? I mean, this code will always generate a key in the range 0 to 4, because it depends on the size of my hash table. That causes collisions, of course, but the new strings will all be added to the lists at the positions in the vector... so it seems I don't need to resize/rehash the whole thing. What do you think? Is this a mistake?

Yes, you do. Otherwise objects will be in the wrong hash bucket and when you search for them, you won't find them. The whole point of hashing is to make locating an object faster -- that won't work if objects aren't where they're supposed to be.
By the way, you probably shouldn't be doing this. There are people who have spent years developing efficient hashing algorithms. Trying to roll your own will result in poor performance. Start with the article on linear hashing in Wikipedia.

do I need to rehash my vector?
Your container could continue to function without rehashing, but searches, insertions and erasures will perform more and more like those of a plain list instead of a hash table. For example, if you've inserted 10,000 elements into 5 buckets, you can expect each list in your vector to have roughly 2,000 elements, and you may have to search all 2,000 to see if a value you're considering inserting is a duplicate, to find a value to erase, or simply to return an iterator to it. Sure, 2,000 is better than 10,000, but it's a long way from the O(1) performance expected of a quality hash table implementation. Your non-resizing implementation is still "O(N)".
Is this a mistake?
Yes, a fundamental one.
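A minimal sketch of a chained hash set that grows and rehashes once the number of elements exceeds the number of buckets. Two assumptions to note: std::hash stands in for the asker's universal hash function, and the growth rule (2n + 1) is arbitrary and does not keep the bucket count prime. ChainedSet and its members are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ChainedSet {
public:
    // bucketCount must be greater than zero.
    explicit ChainedSet(std::size_t bucketCount = 5) : buckets_(bucketCount) {}

    // Inserts value; returns false if it was already present.
    bool insert(const std::string& value) {
        if (contains(value)) return false;
        // Keep the load factor at or below 1 so chains stay short.
        if (size_ + 1 > buckets_.size()) rehash(buckets_.size() * 2 + 1);
        buckets_[bucketOf(value, buckets_.size())].push_back(value);
        ++size_;
        return true;
    }

    bool contains(const std::string& value) const {
        for (const std::string& s : buckets_[bucketOf(value, buckets_.size())]) {
            if (s == value) return true;
        }
        return false;
    }

    std::size_t size() const { return size_; }
    std::size_t bucketCount() const { return buckets_.size(); }

private:
    // The bucket index depends on the bucket count, which is why
    // every element must be redistributed when the table grows.
    static std::size_t bucketOf(const std::string& value, std::size_t count) {
        return std::hash<std::string>{}(value) % count;
    }

    void rehash(std::size_t newCount) {
        std::vector<std::list<std::string>> fresh(newCount);
        for (auto& bucket : buckets_) {
            for (auto& s : bucket) {
                fresh[bucketOf(s, newCount)].push_back(std::move(s));
            }
        }
        buckets_ = std::move(fresh);
    }

    std::vector<std::list<std::string>> buckets_;
    std::size_t size_ = 0;
};
```

With this policy the average chain length stays bounded by a constant, which is what gives the expected O(1) behaviour described above.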

3D-Grid of bins: nested std::vector vs std::unordered_map

Pros, I need some opinions on performance for the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Maybe Option A)
vector<vector<vector<deque<Obj>>>> grid; // (SizeX, SizeY, SizeZ)
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add&delete deques when data is added/deleted. Probably less memory used, but maybe less fast? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of grid-positions
Which is the nearest non-empty ObjC-bucket to position x,y,z?
PS:
I had also thought about a tree-based data structure before, reading about nearest neighbour approaches. Since my data is so regular, I thought I'd save all the tree-building work of dividing the cells into smaller pieces and just make a static 3D grid of the final leaves. That's how I came to ask here about the best way to store this grid.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.

Answer to 1st question
As @Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector, or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. It consumes O(N^3) space, where N is the number of grid points along one dimension. To get the best performance when iterating over the data, nest your loops so that the last index varies fastest: put the first index in the outermost loop and the last index in the innermost loop.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in O(N) space, with N being the number of elements. Here is a benchmark comparing several hash maps. I have had good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add it to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
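A minimal sketch of option B with std::unordered_map. Pos3D, Pos3DHash and Pos3DEqual are named in the question but not defined there, so the definitions below are assumptions, as are Obj, addObject and removeFront. The hash combiner is a simple illustrative one, not a tuned choice.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <unordered_map>

struct Obj {
    int id;
};

struct Pos3D {
    int x;
    int y;
    int z;
};

struct Pos3DEqual {
    bool operator()(const Pos3D& a, const Pos3D& b) const {
        return a.x == b.x && a.y == b.y && a.z == b.z;
    }
};

// Combines the three coordinate hashes; unsigned overflow wraps, which is fine here.
struct Pos3DHash {
    std::size_t operator()(const Pos3D& p) const {
        std::size_t h = std::hash<int>{}(p.x);
        h = h * 31 + std::hash<int>{}(p.y);
        h = h * 31 + std::hash<int>{}(p.z);
        return h;
    }
};

using Pos3DMap = std::unordered_map<Pos3D, std::deque<Obj>, Pos3DHash, Pos3DEqual>;

// Creates the deque for a grid point on first use.
inline void addObject(Pos3DMap& grid, const Pos3D& pos, const Obj& obj) {
    grid[pos].push_back(obj);
}

// Removes the oldest object at pos and drops the deque once it is empty,
// so only occupied grid points consume memory.
inline void removeFront(Pos3DMap& grid, const Pos3D& pos) {
    auto it = grid.find(pos);
    if (it == grid.end()) return;
    it->second.pop_front();
    if (it->second.empty()) grid.erase(it);
}
```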
Answer to 2nd question
I'd say your data is 4D if you have a variable number of elements along the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index; for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.
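On the side question about a map<int, Obj> and "all objects with keys between 780 and 790": because std::map keeps its keys ordered, its lower_bound and upper_bound member functions locate the ends of such a range in O(log n) each. A minimal sketch, where Obj and objectsInRange are hypothetical names:

```cpp
#include <map>
#include <vector>

struct Obj {
    int id;
};

// Collects all objects whose key lies in the closed range [lo, hi].
inline std::vector<Obj> objectsInRange(const std::map<int, Obj>& objects, int lo, int hi) {
    std::vector<Obj> result;
    if (lo > hi) return result;
    auto first = objects.lower_bound(lo);  // first key >= lo
    auto last = objects.upper_bound(hi);   // first key > hi
    for (auto it = first; it != last; ++it) {
        result.push_back(it->second);
    }
    return result;
}
```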

Map of vector of struct vs Vector of struct

I am making a small project program that involves inputting quotes that would be later saved into a database (in this case a .txt file). There are also commands that the user would input such as list (which shows the quote by author) and random (which displays a random quote).
Here's the structure if I would use a map (with the author string as the key):
struct Information {
    string quoteContent;
    vector<string> tags;
};
and here's the structure if I would use the vector instead:
struct Information {
    string author;
    string quoteContent;
    vector<string> tags;
};
Note: The largest number of quotes I've had in the database is 200 (imported from a file).
I was just wondering which data structure would yield better performance. I'm still pretty new to this c++ thing, so any help would be appreciated!

For your data volumes it obviously doesn't matter from a performance perspective, but std::multimap will likely let you write shorter, more comprehensible and maintainable code. Regarding general performance of vector vs maps (which is good to know about but likely only becomes relevant with millions of data elements or low-latency requirements)...
A vector doesn't do any automatic sorting for you, so you'd probably push_back quotes as you read them and then do one std::sort once the data is loaded. After that you can find elements very quickly by author with std::binary_search or std::lower_bound, or identify insertion positions for new quotes using e.g. std::lower_bound. If you want to insert a new quote thereafter, though, you have to move the existing vector elements from that position onward out of the way to make room, which is relatively slow. As you're just doing a few ad-hoc insertions based on user input, the time to do that with only a few hundred quotes in the vector will be totally insignificant. For the purposes of learning programming, though, it's good to understand that a multimap is arranged as a kind of branching binary tree, with pointers linking the data elements, which allows for relatively quick insertion (and deletion). For some applications, following all those pointers around can be more expensive (i.e. slower) than using a vector's contiguous memory (which works better with CPU cache memory), but in your case the data elements are all strings and vectors of strings, which will likely (unless short string optimisations kick in) require jumping all over memory anyway.
In general, if author is naturally a key for your data, just use a std::multimap... it'll do all your operations in reasonable time, maybe not the fastest but never particularly slow, unlike vector for post-data-population mid-container insertions (/deletions).
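A minimal sketch of the multimap variant, using the asker's map-style Information struct with the author as key. QuoteDb and quotesBy are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

struct Information {
    std::string quoteContent;
    std::vector<std::string> tags;
};

// Author name -> quote; a multimap allows several quotes per author.
using QuoteDb = std::multimap<std::string, Information>;

// Returns the text of every quote by the given author (the "list" command).
inline std::vector<std::string> quotesBy(const QuoteDb& db, const std::string& author) {
    std::vector<std::string> result;
    auto range = db.equal_range(author);  // [first, second) spans all entries with this key
    for (auto it = range.first; it != range.second; ++it) {
        result.push_back(it->second.quoteContent);
    }
    return result;
}
```

Quotes are added with db.emplace(author, Information{...}); entries with the same author keep their insertion order.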

It depends on the purpose. Both data structures have their pros and cons.
Vectors:
Positional indexing with at() or operator[].
There is no find member function; you have to use the std::find algorithm.
Maps:
Elements can be looked up by key.
Positional indexing is not applicable; elements are stored by key.
(Use std::unordered_map for typically faster lookups than std::map when you don't need ordering.)
Choose the data structure based on what you want to achieve.

The golden rule is: "When in doubt, measure."
i.e. Write some tests, do some benchmarking.
Anyway, considering that you have circa 200 items, I don't think there will be a significant difference between the two cases on modern PC hardware. Big-O notation matters when N is big (e.g. 10,000s, 100,000s, 1,000,000s, etc.)
vector tends to be simpler than map, and I'd use it as the default container of choice (unless your main goal is to access the items given the author's name as a key, in this case map seems more logically suited).
Another option might be to have a vector with items sorted using author's names, so you can use binary search (which is O(logN)) inside the vector.
