Vector or Map or Hash map for C++?

I have a large number of records, around 4,000,000, that I want to access repeatedly, putting information into a class that is linked to each record. I am not sure what kind of data structure I should use: vectors, maps, or hash maps? I don't need to insert records, but I do need to read a table that contains sets of these record numbers (or names), then grab some of the data linked to each record and process it. Is lookup in a map fast enough that I don't need a hash map for this? Each record is structured as a class, and I have never used a map or hash map with a class as its value (if that is even possible).
Thanks in advance guys.
Edited:
I do not need to have all the records in memory at the same time for now. I need to give the data a structure first, and then grab the data from some of the records. The total number of records is around 20 million; I want to read each raw record and, if its basic info doesn't already exist in the new map or vector I am building, put the rest of the data in there. With 20 million records, it would be excruciating to scan through 4 million records for every record just to check whether its basic info already exists. I have around 4 million kinds of packages, and each package can have more than one type of service (roughly 5 (20/4) per package). I want to read each record and, if the package ID does not yet exist in the vector (or whatever I end up using), push the basic info into it; the services related to that package should be saved in a vector inside the package class.

These three data structures have each a different purpose.
A vector is basically a dynamic array, which is good for indexed values.
A map is a sorted data structure with O(log n) retrieval and insertion time (implemented as a balanced binary tree, usually red-black). This is best if you can't find an efficient hash function.
A hash_map uses hashes to retrieve objects. If you have a well-defined hash function with a low collision rate, you will get constant retrieval and insertion time on average. hash_maps are usually faster than maps, but not always; it depends heavily on the hash function.
For your example, I think that it is best to use a hash_map where the key would be the record number (assuming record numbers are unique).
If these record numbers are dense (meaning there are few or no gaps between the indexes, say: 1,2,4,5,8,9,10...), you can use a vector. If your records come from a database with an auto-increment primary key and not many deletions, this will usually be the case.
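If the numbers are not dense, here is a minimal sketch of the hash_map suggestion above, using std::unordered_map (the standardized successor of hash_map). The Package class and its fields are hypothetical, based on the question's description:
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical Package class: basic info plus the services
// that belong to the package (~5 per package in the question).
struct Package {
    std::string basicInfo;
    std::vector<std::string> services;
};

int main() {
    // Key: package ID (assumed unique), value: the Package class.
    std::unordered_map<std::uint64_t, Package> packages;

    std::uint64_t id = 12345;   // hypothetical record number
    // operator[] creates an empty Package on first access, so
    // "insert if missing, then append a service" is one lookup.
    packages[id].services.push_back("some service");
}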

Related

Can I reinterpret a memory mapped file of key-value pairs as a map in order to sort them?

I have a memory mapped file that contains key-value pairs. Both the key and value are uint32_t, and all the keys and values are stored in the file in binary, where a key immediately precedes its value. The file contains only these pairs, no delimiters.
I want to be able to sort all of these key-value pairs by increasing key.
The following just compiled in my code:
struct FileAsMap { std::map<uint32_t, uint32_t> keyValueMap; };
const FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
but I don't really know what to do from here, since by definition the map container keeps a strict weak ordering of the pairs by key. If I just reinterpret the mapped file as a map, how can I get the pairs into order?
This is not an answer, but the explanation doesn't fit within comment limits.
The keys in a map are usually unique (at least in std::map they are). But maps in general differ from one another in how they organize the stored keys. For example, std::map is based on a balanced binary tree, with an average complexity of O(log n) for retrieving a given key, where n is the number of elements in the map. std::unordered_map, on the other hand, is internally a hash map with O(1) average access time; it looks up a key in constant time regardless of the number of elements.
In any case, all these containers demand a dedicated internal in-memory structure, which practically never looks like a simple stream of key-value pairs. That's why I said in my first comment that it's almost impossible to reuse one of the standard maps as a convenient accessor for mmap-ed data without first reading and unpacking the data stream.
But you can create your own map-like class that iterates over the data in the mmap-ed area and checks, in its lookup operator, whether a stored key matches the requested one. I guess the simplest implementation would take a single screen of code.
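A minimal sketch of such a class, assuming the file holds tightly packed pairs of uint32_t key and value as the question describes (the class and member names are hypothetical):
#include <cstddef>
#include <cstdint>
#include <optional>

// View over the mmap-ed region; does not own or copy the data.
class MmapPairView {
public:
    struct Pair { std::uint32_t key; std::uint32_t value; };

    MmapPairView(const Pair* data, std::size_t count)
        : data_(data), count_(count) {}

    // Sequential scan: O(n) per lookup.
    std::optional<std::uint32_t> find(std::uint32_t key) const {
        for (std::size_t i = 0; i < count_; ++i)
            if (data_[i].key == key)
                return data_[i].value;
        return std::nullopt;
    }

private:
    const Pair* data_;
    std::size_t count_;
};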
But beware: a sequential scan is a relatively expensive operation, so if you have enough elements in the file it can become unacceptably slow. In that case you'll need some optimized indexing; for example, all keys could be read at the start of processing and an index array built. But all these questions depend heavily on the details of the task, so it's better to stop the explanation here.
If you have any further questions, feel free to ask. Of course, a good question assumes that you have already studied the subject and have now encountered a particular problem that you can't solve yourself.
There are a lot of reasons why the answer is no. The two simplest are:
A map is a structure that stores data in a form in which it's already sorted. Your data isn't already sorted, so it's simply not a map.
The map class has its own internal data structure that it uses to store maps. Unless your file replicates this internal structure perfectly (which it almost certainly can't since it likely includes pointers into memory) the map class will misunderstand the data in the file.
How did you serialize the data to the file?
Assuming that you serialized a struct consisting of maps, you'd de-serialize as below:
FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
This gives access to the entire structure (blob).
(*fileAsMap).keyValueMap gives access to the map.

data structure advice on c++

I am looking for a data structure in C++ and I need some advice.
I have nodes, every node has unique_id and group_id:
1 1.1.1.1
2 1.1.1.2
3 1.1.1.3
4 1.1.2.1
5 1.1.2.2
6 1.1.2.3
7 2.1.1.1
8 2.1.1.2
I need a data structure to answer those questions:
what is the group_id of node 4
give me list (probably vector) of unique_id's that belong to group 1.1.1
give me list (probably vector) of unique_id's that belong to group 1.1
give me list (probably vector) of unique_id's that belong to group 1
Is there a data structure that can answer these questions (and what is the time complexity of inserting and answering)? Or should I implement it myself?
I would appreciate an example.
EDIT:
At the beginning I need to build this data structure. Most of the activity is reading by group ID; insertions will happen, but less often than reads.
Time complexity is more important than memory usage.
To me, hierarchical data like the group ID calls for a tree structure. (I assume that for 500 elements this is not really necessary, but it seems natural and scales well.)
Each element in the first two levels of the tree would just hold vectors (if they come ordered) or maps (if they come un-ordered) of sub-IDs.
The third level in the tree hierarchy would hold pointers to leaves, again in a vector or map, which contain the fourth group ID part and the unique ID.
Questions 2-4 are easily and quickly answered by navigating the tree.
For question 1 one needs an additional map from unique IDs to leaves in the tree; each element inserted into the tree also has a pointer to it inserted into the map.
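A hedged sketch of that layout, using nested std::maps for the levels; the four-part integer encoding of group IDs and all names are assumptions:
#include <array>
#include <map>
#include <vector>

// Group IDs are assumed to be four integer parts (e.g. 1.1.2.3).
using GroupId = std::array<int, 4>;

struct GroupTree {
    // part1 -> part2 -> part3 -> part4 -> unique_id
    std::map<int, std::map<int, std::map<int, std::map<int, int>>>> root;
    std::map<int, GroupId> byUniqueId;   // extra index for question 1

    void insert(int uniqueId, const GroupId& g) {
        root[g[0]][g[1]][g[2]][g[3]] = uniqueId;
        byUniqueId[uniqueId] = g;
    }

    // Question 2: all unique IDs under a three-part prefix like 1.1.1.
    std::vector<int> idsInGroup(int a, int b, int c) const {
        std::vector<int> out;
        auto i1 = root.find(a);
        if (i1 == root.end()) return out;
        auto i2 = i1->second.find(b);
        if (i2 == i1->second.end()) return out;
        auto i3 = i2->second.find(c);
        if (i3 == i2->second.end()) return out;
        for (const auto& kv : i3->second)
            out.push_back(kv.second);   // collect unique_ids
        return out;
    }
};
Questions 3 and 4 work the same way, stopping one or two levels higher and walking the subtree.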
First of all, if you are going to have only a small number of nodes then it would probably make sense not to mess with advanced data structuring. Simple linear search could be sufficient.
Next, it looks like a good job for SQL, so it may be a good idea to incorporate the SQLite library into your app. But even if you really want to do it without SQL, it's still a good hint: what you need are two index trees to support quick searching through your array. The complexity (if using balanced trees) will be logarithmic for all operations.
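As a rough illustration of the two-index idea (all names are hypothetical):
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Nodes live in one vector; two balanced trees act as indexes into it.
struct Node { int unique_id; std::string group_id; };

std::vector<Node> nodes;
std::map<int, std::size_t> byId;                    // unique_id -> position
std::multimap<std::string, std::size_t> byGroup;    // group_id  -> position

void addNode(const Node& n) {
    nodes.push_back(n);
    byId[n.unique_id] = nodes.size() - 1;             // O(log n)
    byGroup.insert({n.group_id, nodes.size() - 1});   // O(log n)
}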
Depends...
How often do you insert? Or do you mostly read?
How often do you access by Id or GroupId?
With a max of 500 nodes I would put them in a simple vector where the ID is the offset into the array (if the IDs are indeed as shown). The group search can then be implemented by iterating over the array and comparing the partial group IDs.
If this is too expensive, you really access the structure a lot and need very high performance, or you do a lot of inserts, I would implement a tree with a hash map for the IDs.
If the data is stored in a database, you may use a SELECT ... CONNECT BY if your system supports it, and query the information directly from the DB.
Sorry for not providing a clear answer, but the solution depends on too many factors ;-)
Sounds like you need a container with two separate indexes on unique_id and group_id. Question 1 will be handled by the first index, Questions 2-4 will be handled by the second.
Maybe take a look at the Boost Multi-index Containers Library.
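A sketch of what that could look like with Boost.MultiIndex; the Node struct is an assumption based on the question:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <string>

struct Node {
    int unique_id;
    std::string group_id;   // e.g. "1.1.1.2"
};

namespace bmi = boost::multi_index;

// One container, two ordered indexes: by unique_id and by group_id.
using NodeSet = bmi::multi_index_container<
    Node,
    bmi::indexed_by<
        bmi::ordered_unique<
            bmi::member<Node, int, &Node::unique_id>>,
        bmi::ordered_non_unique<
            bmi::member<Node, std::string, &Node::group_id>>>>;
Index 0 (get<0>()) then serves question 1, and index 1 (get<1>()) serves the group queries, e.g. via equal_range.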
I am not sure of the perfect data structure for this, but I would make use of a map.
It will give you O(log n) lookup for question 1, and O(log n) insertion and deletion. The issue comes with questions 2, 3, and 4, where the efficiency will be O(n), n being the number of nodes.

What data structure is the quickest for inserting and searching for elements?

I am writing a program that will do basic compression using a lookup table. To create the table, I will read in a text file (size 2 MB), find the 255 most common words, and store them in another text file. I am trying to use a vector now, but the runtime is slow: about a minute to insert into the vector, sort it, and output the top 255 elements to another text file. Insertion appears to be the problem, since I have to check whether each word already exists in the vector, incrementing a counter if it does or appending the word if it doesn't. I need an efficient way of inserting elements into a data structure only when they are not already in it (no duplicates).
std::unordered_map is likely to be the best for your purpose, no guarantees. You can "add a key if and only if not already present" just by using operator[].
You'll make one pass over the 2 MB, splitting it into words and counting frequencies (one lookup in the structure per word). Then use std::partial_sort_copy (the version that takes a comparator) to get the top 255 by frequency count from the unordered_map. You should partial_sort_copy into a vector or array and then use that to write the file.
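A minimal sketch of that approach, assuming whitespace-separated words and a hypothetical input file name:
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // One pass: operator[] inserts a zero count for a new word,
    // so counting and "insert if absent" are one lookup.
    std::unordered_map<std::string, std::size_t> freq;
    std::ifstream in("input.txt");   // hypothetical file name
    std::string word;
    while (in >> word)
        ++freq[word];

    // Top 255 words by frequency via partial_sort_copy.
    using Entry = std::pair<std::string, std::size_t>;
    std::vector<Entry> top(std::min<std::size_t>(255, freq.size()));
    std::partial_sort_copy(
        freq.begin(), freq.end(), top.begin(), top.end(),
        [](const Entry& a, const Entry& b) { return a.second > b.second; });

    for (const auto& e : top)
        std::cout << e.first << ' ' << e.second << '\n';
}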
For 2 MB of data, anything over a few seconds is certainly slower than it "should" be, and a few seconds is still slower than it could be. So you're right to be concerned about your vector, but you should also profile your code to make sure that it really is the vector costing you the time, not some other issue.
Try using an STL map or set; it is much faster than a vector for this kind of lookup.

How to sort LevelDB by value

I'm using LevelDB to store records (key-value), where the key is a 64-bit hash and the value is a double. To make an analogy: think of the 64-bit hash as a unique ID of a customer and the double as an account balance (i.e. how much money they have in their account). I want to sort the database by account balance and list the customers with the highest balance first. However, the database cannot fit into memory, so I have to use some other method to sort it by account balance.
I'm considering using STXXL, but it requires that I make a copy of the database into a single flat file, then I can use STXXL to do an external sort (which would make a bunch of smaller files, sort them and then merge them back into another single flat file). Is there a better approach to sorting the data or should I go with the STXXL sort?
How many entries do you have? Could an unsigned 32-bit integer be used as an index (that would allow 4,294,967,296 entries) to work out how to sort the original data?
I.e. create pairs of 32-bit indexes and account balances, sort on the balances, then use the 32-bit indexes to work out what order the original data should be in?
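For illustration, a sketch of that idea; it assumes the (index, balance) pairs themselves fit in RAM, and all names are hypothetical:
#include <algorithm>
#include <cstdint>
#include <vector>

// Keep only (index, balance) pairs in memory, sort by balance
// descending, then use the indexes to visit the original records.
struct Entry {
    std::uint32_t index;   // position of the record in the original data
    double balance;
};

void sortByBalance(std::vector<Entry>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) {
                  return a.balance > b.balance;   // highest balance first
              });
}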

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about the contents. Which will be the better data structure to use:
A map with one of the parameters, unique across all entries in the lookup table, as the key, and the rest of the information as the value.
A static array for each unique entry, accessed when needed according to the key (which will be the same as the one used in the map).
I want my software to be robust, since any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly (O(1)). As a dumb example, for a substitution cipher you could have a char cipher[256] and simply index it with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
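For illustration, a minimal sketch of such a directly indexed table:
// A substitution cipher where the character code itself is the index.
char cipher[256];

char substitute(unsigned char c) {
    return cipher[c];   // O(1): the key is the array index
}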
You might also consider a hashtable (see unordered_map).
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
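A sketch of that dense-key layout (the size and names are hypothetical):
#include <array>
#include <cstddef>

struct Record { /* payload for one entry */ };

// 10000 slots, zero-initialized: unused keys hold nullptr.
std::array<Record*, 10000> table{};

Record* lookup(std::size_t key) {
    return key < table.size() ? table[key] : nullptr;   // O(1) access
}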
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) means around ~12 accesses; a hash table should be pretty much one or two accesses at most.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
The lookup time in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter in most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;
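For illustration, a hypothetical instantiation of that typedef; note that operator[] creates missing inner maps on first access:
#include <map>
#include <string>

using MyKeyType = int;
using MyValueType = std::string;
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;

int main() {
    MyDoubleMapType m;
    m[1][2] = "value";   // each level is an O(log n) lookup
}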