Is it efficient to read the contents of a file into an unordered_map if it has over 1000 entries - c++

I'm making a hash table that's supposed to give pretty fast lookup time for some values I type before hand. I didn't know how to go about it but my friend said I should make a text file and just have an unordered map that reads from the text file and puts the values in the code before I run it. Is this efficient? Is there a better way to do this?
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?

As said in comments, your idea is good enough unless these structures are really large, megabytes.
If you have reasons to worry about the performance of that, e.g. if you want to support millions of records or very large values, more complicated approaches can be more efficient.
When I only need 64-bit support, I sometimes make a single binary file, optimized for memory mapping the complete one. Specifically, a fixed-size header, then sorted arrays of (key,offset) tuples serving as a primary index (can use binary search there, the OS only fetches required pages from mapped files and it caches them in RAM in quite aggressive manner), then values at the offsets specified in the index.

Use std::map when
You need ordered data.
You would have to print/access the data (in
sorted order). You need predecessor/successor of elements.
Use std::unordered_map when
You need to keep count of some data (Example – strings) and no ordering is
required.
You need single element access i.e. no traversal.
Also side note, the values are supposed to be structures. Is it going to be possible to read them into the code with an unordered map?
Definately you can but i hope you knew that you cannot read file with map fstream is there for that purpose.

Related

Can I reinterpret a memory mapped file of key-value pairs as a map in order to sort them?

I have a memory mapped file that contains key-value pairs. Both the key and value are uint32_t, and all the keys and values are stored in the file in binary, where a key immediately proceeds the value. The file contains only these pairs, no delimiters.
I want to be able to sort all of these key-value pairs by increasing key.
The following just compiled in my code:
struct FileAsMap { map<uint32_t, uint32_t> keyValueMap; };
const FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
but I don't really know what to do from here, since by definition the map container keeps a strict weak ordering of the pairs by key. If I just reinterpret the mapped file as a map, how can I get the pairs to order?
it's not an answer but explanations don't fit into comment limitations.
The keys in a map are usually unique (at least in std::map they are). But maps in general differ one from another in method they sort stored keys. For example std::map is based on a balanced binary tree with average complexity of retrieving a given key equal to O(ln(n)) where n is a number of elements in the map. Or e.g. std::unordered_map is a hashmap internally with the average access time = O(1). That is it looks for a key in constant time regardless of number of elements inside.
In any case all these data containers demand dedicated internal in-memory structure which practically never looks like a simple stream of key-value pairs. That's why I've told above in the first comment that it's almost impossible to reuse one of standard maps as a convenient data accessor for mmap-ed data w/o prior read and unpack the data stream.
But you can create your own map-like class which would iterate over data in mmap-ed area and would check in its operator[](size_t i) if a stored key matches the requested one. Iguess that a simplest implementation would take a single screen of code.
But beware: sequental scan is a relatively expensive operation, so if you got enough elements in the file, it could become unacceptable slow. In this case you'll need some optimized indexing. For example all keys are read in the beginning of processing and an indexing array is built. But all these questions heavily depend on task details, ao it's better to stop explanations now.
If you have any further questions feel free to ask. Of course a good question assumes that you have already studied the subject and now have encountered a particular problem that you can't solve yoursef
There are a lot of reasons why the answer is no. The two simplest are:
Maps are a structure that stores data in a form in which it's already sorted. Your data isn't already sorted, so it's simply not a map.
The map class has its own internal data structure that it uses to store maps. Unless your file replicates this internal structure perfectly (which it almost certainly can't since it likely includes pointers into memory) the map class will misunderstand the data in the file.
How did u serialize the data to the file?
Assuming that you serialized a struct consisting of maps, you'd de-serialize as below:
FileAsMap* fileAsMap = reinterpret_cast<FileAsMap*>(mmappedData);
Gives access to entire structure (blob).
(*fileAsMap).keyValueMap gives access to map.

What data structure is the quickest for inserting and searching for elements?

I am writing a program that will do a basic compression using a lookup table. To create the table, I will read in a text file (size 2MB) and then find the 255 most common words and store them into another text file. I am trying to use a vector now, but the runtime is slow at about a minute of runtime to insert into the vector, sort it, and then output the top 255 elements to another text file. The insertion appears to be the problematic since I am having to check for whether or not it already exists inside of the vector and then increment a counter if it does exist, or add the element to the end of the vector if it doesn't. I need to find an efficient way of inserting elements into a data structure only when they are not already inside of the data structure (No Duplicates).
std::unordered_map is likely to be the best for your purpose, no guarantees. You can "add a key if and only if not already present" just by using operator[].
You'll make one pass over the 2MB splitting into words and counting the frequencies (one lookup in the structure per word). Then use std::partial_sort_copy (the version that takes a comparator) to get the top 255 by frequency count from the unordered_map. You should partial_sort_copy into a vector or array and then use that to write the file.
For 2 MB of data, anything over a few seconds is certainly slower than it "should" be, and a few seconds is still slower than it could be. So you're right to be concerned about your vector, but you should also profile your code to make sure that it really is the vector costing you the time, not some other issue.
Try using STL map or set it is much faster than vector: see here

STL container for combining multiple very large raster data files in C++

I am working on a piece of code that need to do something very similar to the combine function in ArcGIS in C/C++. See: http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Combining%20multiple%20rasters
The C++ code will read in multiple very large raster data files (2GB+) in chunks, find unique combinations and output to a single map. For example, if there were 3 maps and <1,3,5> existed, respectfully, in the first cell of the three maps then I want all subsequent instances of <1,3,5> to have the same key in the final output map.
What STL containers should I be using to store the maps? Reading in the files in chunks certainly adds more complexity of the project. The algorithm needs to be very fast so I cannot use vectors which have a O(n) complexity for searching. Currently, I'm thinking of using a unsorted_map of unsorted_multimaps but I am not sure if this is correct and if I am going to get the performance I need.
Any thoughts?
Yes, std::map or std::unordered_map is correct choice. unordered_map is faster and more compact if you don't need order. If you need something even faster, you can replace it with other map implementation.
Use something compact for the key, something like std::tuple or std::array.

STL Map versus Static Array

I have to store information about contents in a lookup table such that it can be accessed very quickly.I might need to refer some of the elements in look up table recursively to get complete information about contents. What will be better data structure to use:
Map with one of parameter, which will be unique to all the entries in look up table, as key and rest of the information as value
Use static array for each unique entries and access them when needed according to key(which will be same as the one used in MAP).
I want my software to be robust as if we have any crash it will be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly ( O(1) ). As a dumb example, for a substitution cipher, you could have a char cipher[256] and simply index with the ASCII code of a character to get the substitution character. If the keys are complex objects or simply too many, you're probably stuck with a map.
You might also consider a hashtable (see unordered_map).
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If however your keys are themselves between say 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of your elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, you'd get O(log n) to mean around ~12 accesses, a hash table should be pretty much one or two accesses tops.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
The lookup times in a std::map should be O=ln(n), with a linear search in a static array in the worst case O=n.
I'd strongly opt for a std::map even if it has a larger memory footprint (which should not matter, in the most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;

data structure for storing array of strings in a memory

I'm considering of data structure for storing a large array of strings in a memory. Strings will be inserted at the beginning of the programm and will not be added or deleted while programm is running. The crucial point is that search procedure should be as fast as it can be. Saving of memory is not important. I incline to standard structure hash_set from standard library, that allows to search elements in the structure with about constant time. But it's not guaranteed that this time will be short. Will anyone suggest a better standard desicion?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the
beginning of the programm and will not
be added or deleted while programm is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
**Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*
A hash_set with a suitable number of buckets would be ideal, alternatively a vector with the strings in dictionary order, searched used binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative contaner as you've mentioned, the allocation strategy mentioned in Raymond Chen's Blog would be efficient.