I have a container of Employees with fields std::string name and bool is_manager. I want to iterate over the container separately for regular and manager employees. The container can be very large, so I do not want to do a linear scan checking the is_manager property each time. The number of managers is also very small, e.g. 10 out of 100,000, so a full scan over the container is inefficient. So I want to pre-cache the memory addresses of Employees with and without is_manager == true and have a RegularEmployeeIterator and a ManagerEmployeeIterator. I think this can be pre-cached and organized as a list/vector of pointers?
And I want to be able to sort the container of Employees by the name field and retain the ability to iterate over regulars and managers.
How can I implement that in C++? Specifically, I have no idea how iterators are implemented in C++, how to define several of them for a single collection based on a property value, whether my idea of caching the addresses works, etc.
organized as a list/vector of pointers?
Whatever the problem, a list is not the data structure you need.
1. Unrealistic answer for a beginner, but you claim it's the problem you're solving
ok actually I have tens of millions of entries
I mean, this is very clearly a learning exercise, and you insisting that your data is tens of millions of entries large is... only moderately helpful, because that's the point where, if access times are important, you stop storing the composite object in one container:
std::vector<Employee> employees; //10⁷ employees
but would group the data according to the properties you're going to work on at the same time:
std::vector<bool> bossiness; //10⁷ bits – std::vector<bool> has an optimization!
std::vector<std::string> names; //10⁷ std::strings
and as a matter of fact, if you know your data doesn't change, you wouldn't even do that, because the names vector is a dereferencing nightmare that wastes a lot of memory on redundant information, when you could just as well go
std::vector<bool> bossiness; //10⁷ bits – std::vector<bool> has an optimization!
std::string all_names; // a **very** long string containing all names, one after the other
std::vector<size_t> name_begins; // 10⁷ name beginnings; through all_names.substr(name_begins[i], name_begins[i+1] - name_begins[i]) you can access the i-th name
Now, to speed up looking for bosses, you just start by making a run-length encoded list of the 64-bit regions in your bossiness vector where at least one bit is set. You could do elegant k-d trees if your problem becomes multidimensional, but at the sparsity you have, run-length encoding on machine word sizes will probably still beat the hell out of that.
But that's an optimization level you need when writing a database system or a 3D game with millions of vertices. You're learning C++. You're not writing these kinds of things, so:
2. Realistic answer that you didn't want when offered in the question
i.e. 10 out of 100000
so, let's really go with a problem size of 10⁵. I.e., a small problem.
You need to do your Employee vector, and add a bosses vector:
std::vector<Employee> employees;
std::vector<size_t> boss_indices;
Then you need to do your linear search once:
// if you know a safe and not too outlandish upper bound for the number of managers, reserve that memory once to avoid resizing the vector while filling it, as that's very expensive:
boss_indices.reserve(size_t(employees.size() * fraction_of_managers));
for(size_t idx = 0; idx < employees.size(); ++idx) {
    if(employees[idx].is_manager) {
        boss_indices.push_back(idx);
    }
}
Congratulations, an easy-to-use vector of indices. Indices into a std::vector are just as good as pointers to elements (it's a simple pointer dereference both ways, and the additional offset is usually merged into the dereference operation on any modern CPU I know of), but they survive the target vector being moved.
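Wrapped as a small function, the one-time scan above looks like this (a sketch; the Employee struct matches the question, the sample data is made up):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Employee {
    std::string name;
    bool is_manager;
};

// One-time linear pass: cache the indices of all managers.
std::vector<std::size_t> collect_boss_indices(const std::vector<Employee>& employees) {
    std::vector<std::size_t> boss_indices;
    for (std::size_t idx = 0; idx < employees.size(); ++idx)
        if (employees[idx].is_manager)
            boss_indices.push_back(idx);
    return boss_indices;
}
```

Iterating only the managers is then `for (std::size_t i : boss_indices) { /* employees[i] is a manager */ }`, which touches 10 entries instead of 100,000.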
And I want to be able to sort the container of Employees by the name field and retain the ability to iterate over regulars and managers.
have a class
#include <algorithm>
struct SortableEmployee {
    const Employee* empl;
    bool operator <(const SortableEmployee& other) const {
        return std::lexicographical_compare(
            empl->name.cbegin(), empl->name.cend(),
            other.empl->name.cbegin(), other.empl->name.cend());
    }
    SortableEmployee(const Employee* underlying) : empl(underlying) {
    }
};
and put it in a std::set to get a sorted version that you can iterate through (if two employees can share a name, use std::multiset instead, because std::set silently drops equivalent keys):
std::set<SortableEmployee> namebook;
for(const auto& individual : employees) {
namebook.emplace(&individual);
}
You can then iterate through it linearly as well (std::format requires C++20 and #include <format>):
for(const auto& sorted_empl : namebook) {
    std::cout << std::format("{}: is {}a manager\n",
                             sorted_empl.empl->name,
                             sorted_empl.empl->is_manager ? "" : "not ");
}
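Putting the pieces together (a sketch: a plain operator< on std::string stands in for the lexicographical_compare call, which does the same thing, and std::multiset is used so employees sharing a name all survive):

```cpp
#include <set>
#include <string>
#include <vector>

struct Employee {
    std::string name;
    bool is_manager;
};

struct SortableEmployee {
    const Employee* empl;
    bool operator<(const SortableEmployee& other) const {
        return empl->name < other.empl->name; // equivalent to lexicographical_compare on the chars
    }
};

// Build a name-sorted view over the container without moving any Employee.
std::multiset<SortableEmployee> build_namebook(const std::vector<Employee>& employees) {
    std::multiset<SortableEmployee> namebook;
    for (const auto& individual : employees)
        namebook.insert(SortableEmployee{&individual});
    return namebook;
}
```

The pointers stay valid as long as the employees vector is neither resized nor destroyed, so build the namebook after the container has reached its final size.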
Related
I want to keep a data structure for storing all the elements that I have seen till now. Considering that keeping an array for this is out of the question, as elements can be of the order of 10^9, which data structure should I use for achieving this: unordered_map or unordered_set in C++?
Maximum elements that will be visited in worst case : 10^5
-10^9 <= element <= 10^9
As @MikeCAT said in the comments, a map would only make sense if you wanted to store additional information about the element or the visit. But if you wanted only to store the truth value of whether the element has been visited or not, the map would look something like this:
// if your elements were strings
std::unordered_map<std::string, bool> isVisited;
and then this would just be a waste of space. Storing the truth value is redundant, if the mere presence of the string within the map already indicates that it has been visited. Let's see a comparison:
std::unordered_map<std::string, bool> isVisitedMap;
std::unordered_set<std::string> isVisitedSet;
// Visit some places
isVisitedMap["madrid"] = true;
isVisitedMap["london"] = true;
isVisitedSet.insert("madrid");
isVisitedSet.insert("london");
// Maybe the information expires so you want to remove them
isVisitedMap["london"] = false;
isVisitedSet.erase("london");
Now the elements stored in each structure will be:
For the map:
{{"london", false}, {"madrid", true}} <--- 2 entries
For the set:
{"madrid"} <--- 1 element. Much better
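In code, "have I seen this?" then becomes a pure membership test (a small sketch; the helper name is made up):

```cpp
#include <string>
#include <unordered_set>

// Presence in the set *is* the visited flag; no redundant bool needed.
bool has_visited(const std::unordered_set<std::string>& visited,
                 const std::string& place) {
    return visited.count(place) > 0; // or visited.contains(place) in C++20
}
```

Marking a place visited is `visited.insert(place);`, and expiring it is `visited.erase(place);`, which actually shrinks the container instead of leaving a stale false entry behind.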
In a project in which I had a binary tree converted to a binary DAG for optimization purposes (GRAPHGEN), I passed the exploration function a map from node pointers to bool:
std::map<BinaryDrag<conact>::node*, bool> &visited_fl
The map kept track of the pointers in order not to go through the same nodes again when doing multiple passes.
You could use a std::unordered_map<Value, bool>.
I want to keep a data structure for storing all the elements that I have seen till now.
A way to re-phrase that is to say "I want a data structure to store the set of all elements that I've seen till now". The clue is in the name. Without more information, std::unordered_set seems like a reasonable choice to represent a set.
That said, in practice it depends on details like what you're planning to do with this set. An array can be a good choice as well (yes, even for billions of elements), other set implementations may be better, and maps can be useful in some use cases.
I'm building a little 2D game engine. Now I need to store the prototypes of the game objects (all types of information). A container that will have at most, I guess, a few thousand elements, all with unique keys, and no elements will be deleted or added after a first load. The key value is a string.
Various threads will run, and I need to send everyone a key (or index) and with that access other information (like a texture for the render process or a sound for the mixer process) available only to those threads.
Normally I use vectors because they are way faster for accessing a known element. But I see that an unordered map also usually has constant-time access if I use ::at for element access. It would make the code much cleaner and also easier to maintain, because I would deal with much more understandable man-made strings.
So the question is: is the difference in speed between an access to vector[n] and unordered_map.at("string") negligible compared to its benefits?
From what I understand, being able to access various maps in different parts of the program, with different threads running, just with a "name" is a big deal for me, and the speed difference isn't that great. But I'm too inexperienced to be sure of this. Although I found information about it, I can't really tell whether I'm right or wrong.
Thank you for your time.
As an alternative, you could consider using an ordered vector, because the vector itself will not be modified. You can easily write an implementation yourself with STL lower_bound etc., or use an implementation from a library (e.g. boost::flat_map).
There is a blog post from Scott Meyers about container performance in this case. He did some benchmarks, and the conclusion would be that an unordered_map is probably a very good choice, with high chances that it will be the fastest option. If you have a restricted set of keys, you can also compute a minimal perfect hash function, e.g. with gperf.
However, for this kind of problem the first rule is to measure yourself.
My problem was to find a record in a container given a std::string as the key. The keys are guaranteed to exist (not finding them was not an option), and the elements of this container are generated only at the beginning of the program and never touched thereafter.
I had huge fears that an unordered map would not be fast enough. So I tested it, and I want to share the results, hoping I haven't mistaken everything.
I just hope this can help others like me, and to get some feedback, because in the end I'm a beginner.
So, given a struct of record filled randomly like this:
struct The_Mess
{
    std::string A_string;
    long double A_ldouble;
    char C[10];
    int* intPointer;
    std::vector<unsigned int> A_vector;
    std::string Another_String;
};
I made an unordered map, given that A_string contains the key of the record:
std::unordered_map<std::string, The_Mess> The_UnOrdMap;
and a vector that I sort by the A_string value (which contains the key):
std::vector<The_Mess> The_Vector;
with also a sorted index vector, used for access as a third way:
std::vector<std::string> index;
The key will be a random string of 0-20 characters in length (I wanted the worst possible scenario), containing capital and lowercase letters, numbers, and spaces.
So, in short, our contenders are:
Unordered map. I measure the time the program takes to execute:
record = The_UnOrdMap.at( key ); // record is just a The_Mess struct
Sorted vector. Measured statements:
low = std::lower_bound (The_Vector.begin(), The_Vector.end(), key, compare);
record = *low;
Sorted Index vector:
low2 = std::lower_bound( index.begin(), index.end(), key);
indice = low2 - index.begin();
record = The_Vector[indice];
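A minimal harness for timing statements like these (a sketch with made-up names; a real benchmark should average many shuffled lookups rather than time a single one):

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Time a single at() lookup in nanoseconds, writing the found value to record.
long long time_lookup_ns(const std::unordered_map<std::string, int>& table,
                         const std::string& key, int& record) {
    auto t0 = std::chrono::steady_clock::now();
    record = table.at(key);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

steady_clock is the right clock for intervals; single-digit-nanosecond readings are at the edge of its resolution, which is another reason to average over many iterations.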
The time is in nanoseconds and is an arithmetic average over 200 iterations. I have a vector containing all the keys, which I shuffle at every iteration; at every iteration I cycle through it and look up each key in the three ways.
So these are my results:
I think the initial spikes are a fault of my testing logic (the table I iterate over contains only the keys generated so far, so it has only 1-n elements). So 200 iterations of a 1-key search the first time, 200 iterations of a 2-key search the second time, etc.
Anyway, it seems that in the end the best option is the unordered map, considering that it's a lot less code, it's easier to implement, and it will make the whole program way easier to read and probably to maintain/modify.
You have to think about caching as well. In case of std::vector you'll have very good cache performance when accessing the elements - when accessing one element in RAM, CPU will cache nearby memory values and this will include nearby portions of your std::vector.
When you use std::map (or std::unordered_map) this is no longer true. Both are node-based containers: std::map is usually implemented as a self-balancing binary search tree, and std::unordered_map as a hash table whose entries are individually allocated nodes, so values can be scattered around the RAM. This imposes a great hit on cache performance, especially as the containers get bigger and bigger, because the CPU just cannot cache the memory that you're about to access.
You'll have to run some tests and measure performance, but cache misses can greatly hurt the performance of your program.
You are most likely to get the same performance (the difference will not be measurable).
Contrary to what some people seem to believe, unordered_map is not a binary tree. The underlying data structure is a hash table, i.e. an array of buckets, so its cache behaviour is much closer to a vector's than to a tree's. Granted, you are going to suffer if you have collisions due to your hashing function being bad. But if your key is a simple integer, this is not going to happen. As a result, access to an element in the hash map will be nearly the same as access to an element in the vector, plus the time spent computing the hash value for the integer, which is really non-measurable.
I have a list of IDs (integers).
They are sorted in a really efficient way so that my application can easily handle them, for example
9382
297832
92
83723
173934
(this sort is really important in my application).
Now I am facing the problem of having to access certain values of an ID in another vector.
For example certain values for ID 9382 are located on someVectorB[30].
I have been using
const int UNITS_MAX_SIZE = 400000;

class clsUnitsUnitIDToArrayIndex : public CBaseStructure
{
private:
    int m_content[UNITS_MAX_SIZE];
    long m_size;
protected:
    void ProcessTxtLine(string line);
public:
    clsUnitsUnitIDToArrayIndex();
    int *Content();
    long Size();
};
But now that I raised UNITS_MAX_SIZE to 400,000, I get page stack errors, and that tells me that I am doing something wrong. I think the entire approach is not really good.
What should I use if I want to locate an ID in a different vector if the "position" is different?
ps: I am looking for something simple that can be easily read-in from a file and that can also easily be serialized to a file. That is why I have been using this brute-force approach before.
If you want a mapping from ints to ints and your index numbers are non-consecutive, you should consider a std::map. In this case you would define it as such:
std::map<int, int> m_idLocations;
A map represents a mapping between two types. The first type is the "key" and is used for looking up the second type, known as the "value". For each id you can insert its position with:
m_idLocations[id] = position;
// or
m_idLocations.insert(std::pair<int,int>(id, position));
And you can look them up using the following syntax:
m_idLocations[id];
Typically a std::map in the STL is implemented using red-black trees, which have a worst-case lookup cost of O(log n). This is slightly slower than the O(1) that you'll be getting from the huge array; however, it's a substantially better use of space, and you're unlikely to notice the difference in practice unless you're storing truly gigantic amounts of numbers or doing an enormous number of lookups.
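Since the question also asks about easily reading the mapping from a file and serializing it back, a minimal sketch (the one "id position" pair per line file format is an assumption):

```cpp
#include <fstream>
#include <map>
#include <string>

// Load "id position" pairs, one per line, into an int-to-int map.
std::map<int, int> load_id_positions(const std::string& path) {
    std::map<int, int> m;
    std::ifstream in(path);
    int id, pos;
    while (in >> id >> pos)
        m[id] = pos;
    return m;
}

// Write the map back out in the same format.
void save_id_positions(const std::string& path, const std::map<int, int>& m) {
    std::ofstream out(path);
    for (const auto& [id, pos] : m)
        out << id << ' ' << pos << '\n';
}
```

Note that std::map iterates in key order, so the saved file comes out sorted by id; the loader does not care about the order either way.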
Edit:
In response to some of the comments, I think it's important to point out that moving from O(1) to O(log n) can make a significant difference in the speed of your application, not to mention the practical speed concerns of moving from fixed blocks of memory to a tree-based structure. However, I think it's important to initially represent what you're trying to say (an int-to-int mapping) and avoid the pitfall of premature optimization.
After you've represented the concept, you should then use a profiler to determine if and where the speed issues are. If you find that the map is causing issues, then you should look at replacing your mapping with something that you think will be quicker. Make sure to test that the optimization helped, and don't forget to include a big comment about what you are representing and why it needed to be changed.
If nothing else works, you can just allocate the array dynamically in the constructor. This will move the large array onto the heap and avoid your page stack error. You should also remember to release the resource when destroying your clsUnitsUnitIDToArrayIndex.
But the recommended usage is, as suggested by other members, to use a std::vector or std::map.
You are probably getting a stack overflow error due to int m_content[UNITS_MAX_SIZE]. The array is allocated on the stack, and 400,000 is a pretty big number for the stack. You can use std::vector instead; it is dynamically allocated, and you can return a reference to the vector member to avoid a copy:
std::vector<int> m_content = std::vector<int>(UNITS_MAX_SIZE); // as a member; in-class initializers need = or {}
const std::vector<int> &clsUnitsUnitIDToArrayIndex::Content() const
{
return m_content;
}
I am using a red-black tree implementation in C++ (std::map), but currently I see that my unsigned long long int indices get bigger and bigger for larger experiments. I am going for 700,000,000 indices, and each index stores a std::set that contains a few more int elements (about 1-10). We have 128 GB RAM, but I see that we are starting to run short of it; in fact, if possible, I want to go up even to 1,000,000,000 indices in my experiment.
I gave this some thought and was thinking about a forest of several maps put together. Basically, after a map hits a certain size threshold (or perhaps when bad_alloc starts to be thrown), save it to disk, clear it from memory, and then create another map, and keep doing that until I've got all indices. However, during the loading part this will be very inefficient, as we can only hold one map in RAM at a time. Worse, we need to check all maps for consistency.
So in this case, what data structures should I be looking at?
From your description, I think you have this:
typedef std::map<long long, std::set<int>> MyMap;
where the map is very big, and the individual sets are quite small. There are several sources of overhead here:
the individual entries in the map, each of which is a separate allocation;
the individual entries in the sets, ditto;
the structures which describe each set, independent of their contents.
With standard library components, it's not possible to eliminate all of these overheads; the semantics of associative containers pretty well mandates the individual allocation of each entry, and the use of red-black trees requires the addition of several pointers to each entry (in theory, only two pointers are required, but efficient implementation of iterators is difficult without parent pointers.)
However, you can reduce the overhead without losing functionality by combining the map with the sets, using a data structure like this:
typedef std::set<std::pair<long long, int>> MyMap;
You can still answer all the same queries, although a few of them are slightly less convenient. Remember that std::pair's default comparator sorts in lexicographical order, so all of the elements with the same first value will be contiguous. So you can, for example, query whether a given index has any ints associated with it by using:
it = theMap.lower_bound(std::make_pair(index, INT_MIN));
if (it != theMap.end() && it->first == index) {
// there is at least one int associated with index
}
The same call to lower_bound will give you a begin iterator for the ints associated with the index, while a call to upper_bound(std::make_pair(index, INT_MAX)) will give you the corresponding end iterator, so you can easily iterate over all the values associated with a given index.
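Put together, the per-index range query looks like this (a self-contained sketch with made-up data):

```cpp
#include <climits>
#include <set>
#include <utility>
#include <vector>

typedef std::set<std::pair<long long, int>> MyMap;

// Collect all ints associated with one index via the contiguous [lower, upper) range.
std::vector<int> ints_for_index(const MyMap& theMap, long long index) {
    std::vector<int> out;
    auto lo = theMap.lower_bound(std::make_pair(index, INT_MIN));
    auto hi = theMap.upper_bound(std::make_pair(index, INT_MAX));
    for (auto it = lo; it != hi; ++it)
        out.push_back(it->second); // it->first == index throughout this range
    return out;
}
```

Because the pairs sort lexicographically, the range between those two bounds is exactly the set of entries whose first element equals the index, and the ints come out in ascending order.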
That still might not be enough to store 700 million indices with associated sets of integers in 128GB unless the average set size is really small. The next step would have to be a b-tree of some form, which is not in the standard library. B-trees avoid the individual entry overhead by combining a number of entries into a single cluster; that should be sufficient for your needs.
It looks like it is time to switch to B-trees (maybe B+ or B*); this structure is used in databases to manage indices. Take a look here: this is a replacement for std-like associative containers with a B-tree inside. B-trees can be used to keep indices both in memory and on disk.
For such a large-scale dataset, you should really work with a proper database server, such as an SQL server. These servers are designed to work with cached large-scale datasets. An SQL server saves the data to permanent storage such as an HDD, while maintaining good read/write performance by caching frequently accessed pages, etc.
I'm looking for strategies to speed up an agent-based model that's based on objects of class Host, pointers to which are stored in a Boost multi-index container. I've used Shark to determine that the vast majority of the time is consumed by a function calcSI():
Function calcSI() has to compute for every instance of class Host certain probabilities that depend on attributes of other instances of class Host. (There are approximately 10,000-50,000 instances of Host, and these calculations are run for each host approximately 25,600 times.)
If I'm interpreting the profile correctly, the majority of the time spent in calcSI() goes to Host::isInfectedZ(int), which simply counts instances of something in a Boost unordered_multimap of type InfectionMap:
struct Infection {
public:
    explicit Infection( double it, double rt ) : infT( it ), recT( rt ) {}
    double infT;
    double recT;
};
typedef boost::unordered_multimap< int, Infection > InfectionMap;
Each Host has a member InfectionMap carriage, and Host::isInfectedZ(int) simply counts the number of Infections associated with a particular int key:
int Host::isInfectedZ( int z ) const {
return carriage.count( z );
}
I'm having trouble finding information on how costly the count function is for Boost's unordered multimaps. Should I increase the overhead by adding to Host a separate two-dimensional array to track the number of instances of each key (i.e., the number of Infections associated with each int)?
I'm wondering if a larger structural overhaul of the Boost multi-index, like eliminating one or two less-needed composite key indices, would be more helpful. The background maintenance of the multi-index doesn't appear in the profiler, which (maybe stupidly) makes me worry it might be large. I have 8 indices in the multi-index, most of which are ordered_non_unique.
Are there other things I should be concerned with that might not appear in the profiler, or am I missing a major result from the profiler?
Parallelization and multithreading of calcSI() are unfortunately not options.
Update: It might be helpful to know that InfectionMap carriage rarely has more than 10 pairs and usually has <5.
Update 2: I tried the strategy proposed in #1 above, giving each Host an array int carriageSummary[ INIT_NUM_STYPES ], which is indexed by the possible values of z (for most simulations, there are <10 possible values). The value of each entry tracks changes made to carriage. The Host::isInfectedZ( int z ) function now reads:
int Host::isInfectedZ( int z ) const {
//return carriage.count( z );
return carriageSummary[ z ];
}
And the time dedicated to this function appears to have dropped substantially, though I can't do an exact comparison right now.
Obviously, using an array is kind of bulky but okay for small ranges of z. Would some other container (i.e., not an unordered_map) be more efficient for larger ranges?
Would love any feedback on changing multi-index too.
Like you suggested in #1, try maintaining a carriage count array next to the Host::carriage unordered_multimap and keep them both "synchronised". Your Host::isInfectedZ would then use the (hopefully) faster carriage count array:
int Host::isInfectedZ( int z ) const {
return carriageCount[ z ];
}
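A self-contained sketch of keeping the two structures in sync (std::unordered_multimap stands in for the Boost container here, and the mutator names acquire/clearStrain are made up):

```cpp
#include <array>
#include <unordered_map>

struct Infection {
    explicit Infection(double it, double rt) : infT(it), recT(rt) {}
    double infT;
    double recT;
};

constexpr int NUM_STRAINS = 10; // assumed small range of z values

class Host {
    std::unordered_multimap<int, Infection> carriage;
    std::array<int, NUM_STRAINS> carriageCount{}; // zero-initialised summary
public:
    void acquire(int z, double infT, double recT) { // hypothetical mutator
        carriage.emplace(z, Infection(infT, recT));
        ++carriageCount[z];                         // keep the summary in sync
    }
    void clearStrain(int z) {                       // hypothetical mutator
        carriage.erase(z);                          // erases every Infection with key z
        carriageCount[z] = 0;
    }
    int isInfectedZ(int z) const { return carriageCount[z]; } // now an O(1) array read
};
```

The key discipline is that carriage is only ever modified through mutators that also touch carriageCount; if any code path updates one without the other, the summary silently goes stale.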
If the range of integers that can be passed into isInfectedZ is large, then use an associative array for your carriage count.
You can use std::map or boost::unordered_map for the associative array. For lookups, the former has logarithmic time complexity and the latter has constant time complexity. But since this associative array would typically be very small, std::map might actually be faster. std::map may also have less space overhead. Try both and run your profiler to see. My bet is on std::map. :-)
EDIT:
Upon seeing your answer to my comment, I would suggest using a regular fixed-size array for the carriage count. Forget about the associative array stuff.
EDIT2:
You might want to scrap
typedef boost::unordered_multimap< int, Infection > InfectionMap;
and roll your own hand-written InfectionMap class, since you're dealing with such small indices.
RESPONSE TO UPDATE #2:
Glad to see you've made an improvement. I doubt you'll find a container that is "less bulky" than a fixed array of, say, 16 integers. STL and Boost containers allocate memory in chunks and will end up at least as large as your fixed-size array, even if they hold few elements.
You might be interested in boost::array, which wraps an STL-like interface around a C-style fixed array. This will make it easier to swap out your fixed-size array for a std::vector or std::map.
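In modern C++, std::array is the standard equivalent of boost::array; a tiny sketch of the count array as one (the helper function is made up):

```cpp
#include <array>
#include <cstddef>

// A fixed-size count array as std::array: C-array layout, STL-style interface.
using CarriageCount = std::array<int, 16>;

inline std::size_t count_nonzero(const CarriageCount& counts) {
    std::size_t n = 0;
    for (int c : counts)  // range-for and .size() work, unlike with a decayed raw array
        if (c != 0)
            ++n;
    return n;
}
```

Declaring one as `CarriageCount counts{};` value-initialises every slot to zero, which a raw C array member would not do by default.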