I am trying to represent a relation (table) in C++ code:
The columns of the relation are integers.
The number of columns in the relation is fixed, but only determined at runtime.
No duplicates should be allowed (this is the major source of cost).
I want to have a map from names to relations.
Any ideas for an efficient implementation? The main issue here is detecting duplicates at insertion time, which can be very costly.
Make each row of the table a struct Row.
Use a std::set or std::unordered_set to store these structs. A duplicate (i.e. a lookup hit) can be detected in O(d log n) time for std::set (log n comparisons, each costing O(d)) or in amortized O(d) time for std::unordered_set, where d is the number of columns.
To efficiently map from names to rows, create a boost::bimap<std::string, Row>.
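A minimal sketch of the set-based approach, assuming each Row keeps its columns in a std::vector<int> and using a simple, made-up polynomial hash combiner:

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_set>
#include <vector>

// One row of the relation; the column count is fixed at runtime.
struct Row {
    std::vector<int> cols;
    bool operator==(const Row& other) const { return cols == other.cols; }
};

// Hashes all d columns, so hashing one row costs O(d).
struct RowHash {
    std::size_t operator()(const Row& r) const {
        std::size_t h = 0;
        for (int c : r.cols)
            h = h * 31 + std::hash<int>()(c); // simple polynomial combine
        return h;
    }
};

using Relation = std::unordered_set<Row, RowHash>;

int main() {
    std::map<std::string, Relation> relations; // names -> relations
    Relation& r = relations["edges"];
    r.insert(Row{{1, 2}});
    // insert() returns {iterator, bool}; a false second means a duplicate.
    bool duplicate = !r.insert(Row{{1, 2}}).second;
    return duplicate ? 0 : 1;
}

Here a plain std::map from names to whole relations handles the name lookup; the boost::bimap suggested above would take its place when look-up in both directions is needed.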
KennyTM has a point. You could use SQLite. As described in the link, you can use it to create a temporary in-memory database.
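For illustration, a minimal sketch of that route (the table and column names are invented here); the UNIQUE constraint makes SQLite itself detect and skip duplicate rows:

#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db); // temporary in-memory database

    // The UNIQUE constraint rejects duplicate rows at insertion time.
    sqlite3_exec(db,
        "CREATE TABLE rel (c0 INTEGER, c1 INTEGER, UNIQUE (c0, c1));",
        nullptr, nullptr, nullptr);

    // INSERT OR IGNORE silently skips rows that already exist.
    sqlite3_exec(db, "INSERT OR IGNORE INTO rel VALUES (1, 2);",
                 nullptr, nullptr, nullptr);
    sqlite3_exec(db, "INSERT OR IGNORE INTO rel VALUES (1, 2);",
                 nullptr, nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}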
Related
I have a simple requirement: I need a map of type <int, int>; however, I need the theoretically fastest possible retrieval time.
I used both map and the newly proposed unordered_map from TR1. I found that, at least while parsing a file and creating the map by inserting one element at a time, map took only 2 minutes while unordered_map took 5 minutes.
As it is going to be part of code executed on a Hadoop cluster and will contain ~100 million entries, I need the smallest possible retrieval time.
Another piece of possibly helpful information: currently the keys being inserted are a range of integers from 1, 2, ... up to ~10 million.
I can also require the user to specify the maximum value and to insert in the order above; will that significantly affect my implementation? (I heard map is based on red-black trees, and inserting in increasing order leads to better (or worse?) performance.)
Here is the code:
map<int, int> Label; // this is being changed to unordered_map
fstream LabelFile("Labels.txt");
string inputLine, curnode, nodelabel;
bool failed = false;

// Creating the map from Labels.txt
if (LabelFile.is_open())
{
    while (! LabelFile.eof() )
    {
        getline(LabelFile, inputLine);
        try
        {
            curnode = inputLine.substr(0, inputLine.find_first_of("\t"));
            nodelabel = inputLine.substr(inputLine.find_first_of("\t") + 1, inputLine.size() - 1);
            Label[atoi(curnode.c_str())] = atoi(nodelabel.c_str());
        }
        catch (char* strerr)
        {
            failed = true;
            break;
        }
    }
    LabelFile.close();
}
Tentative Solution: After reviewing the comments and answers, I believe a dynamic C++ array (e.g. std::vector) would be the best option, since the implementation will use dense keys. Thanks.
Insertion for unordered_map should be O(1) and retrieval should be roughly O(1); it's essentially a hash table.
Your timings are therefore way off, or there is something wrong with your implementation or usage of unordered_map.
You need to provide some more information, and possibly show how you are using the container.
As per section 6.3 of n1836, the complexities for insertion/retrieval are given here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf
One issue you should consider is that your implementation may need to continually rehash the structure, as you say you have 100 million+ items. In that case, when instantiating the container, if you have a rough idea of how many "unique" elements will be inserted, you can pass that in as a parameter to the constructor and the container will be instantiated with a bucket table of appropriate size.
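For instance (a sketch; the bucket count below is just an illustrative guess at the number of unique keys):

#include <unordered_map>

int main() {
    // Pre-size the bucket table for roughly 100 million unique keys,
    // so the container does not rehash repeatedly while loading.
    std::unordered_map<int, int> Label(100000000);
    return 0;
}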
The extra time loading the unordered_map is due to dynamic array resizing. The resizing schedule doubles the number of cells each time the table exceeds its load factor, so starting from an empty table, expect O(lg n) copies of the entire data table. You can eliminate these extra copies by sizing the hash table up front. Specifically:
Label.rehash(expected_number_of_entries / Label.max_load_factor());
Dividing by the max_load_factor accounts for the empty cells that are necessary for the hash table to operate.
unordered_map (at least in most implementations) gives fast retrieval, but relatively poor insertion speed compared to map. A tree is generally at its best when the data is randomly ordered, and at its worst when the data is ordered (you constantly insert at one end of the tree, increasing the frequency of re-balancing).
Given that it's ~10 million total entries, you could just allocate a large enough array and get really fast lookups -- assuming enough physical memory that it doesn't cause thrashing; that's not a huge amount of memory by modern standards.
Edit: yes, a vector is basically a dynamic array.
Edit2: The code you've added has some problems. Your while (! LabelFile.eof() ) loop is broken; you normally want something like while (LabelFile >> inputdata) instead. You're also reading the data somewhat inefficiently: what you're apparently expecting is two numbers separated by a tab. That being the case, I'd write the loop something like:
while (LabelFile >> node >> label)
Label[node] = label;
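A self-contained sketch of that corrected loop, assuming the same Labels.txt format of two whitespace-separated integers per line:

#include <fstream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> Label;
    std::ifstream LabelFile("Labels.txt");

    // operator>> skips whitespace (including tabs), and the loop
    // ends cleanly on EOF or on a malformed line.
    int node, label;
    while (LabelFile >> node >> label)
        Label[node] = label;
    return 0;
}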
I have an efficiency-critical application in which I need an array-like data structure A. Its keys are 0, 1, 2, ..., and its values are distinct uint64_t values. I need two constant-time operations:
1. Given i, return A[i];
2. Given val, return i such that A[i] == val
I prefer not to use a hash table, because when I tried GLib's GHashTable it took around 20 minutes to load 60 million values into the hash table (if I remove the insertion statement, it takes only around 6 seconds). That time is not acceptable for my application. Or maybe somebody can recommend other hash table libraries? I tried uthash.c; it crashed immediately.
I also tried SDArray, but it doesn't seem to be the right fit.
Does anybody know of a data structure that would fulfill my requirements, or of any efficient hash table implementations? I prefer C/C++.
Thanks.
In general, you need two hash tables for this task. As you know, hash tables give you a key look-up in expected constant time. Searching for a value requires iterating through the whole data structure, since information about the values isn't encoded in the hash look-up table.
Use two hash tables: One for key-value and one (reversed) for value-key look-up. In your particular case, the forward search can be done using a vector as long as your keys are "sequential". But this doesn't change the requirement for a data structure enabling fast reverse look-up.
Regarding the hash table implementation: in C++11, you have the new standard container std::unordered_map available.
An implementation might look like this (of course this is tweakable, e.g. introducing const-correctness, passing by reference, etc.):
std::unordered_map<K,T> kvMap; // hash table for forward search
std::unordered_map<T,K> vkMap; // hash table for backward search

void insert(std::pair<K,T> item) {
    kvMap.insert(item);
    vkMap.insert(std::make_pair(item.second, item.first));
}

// expected O(1)
T valueForKey(K key) {
    return kvMap[key];
}

// expected O(1)
K keyForValue(T value) {
    return vkMap[value];
}
A clean C++11 implementation should "wrap" the key-value hash map, so you have the "standard" interface in your wrapper class. Always keep the reverse map in sync with the forward map.
Regarding creation performance: in most implementations, there is a way to tell the data structure how many elements are going to be inserted, called "reserve". For hash tables, this is a huge performance benefit, as dynamically resizing the data structure (which happens every now and then during insertion) completely restructures the whole hash table, since it changes the mapping from hash values to buckets.
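A short sketch of pre-sizing both maps (the 60 million figure mirrors the question, as do the key/value types):

#include <cstddef>
#include <cstdint>
#include <unordered_map>

int main() {
    std::unordered_map<std::size_t, std::uint64_t> kvMap; // key -> value
    std::unordered_map<std::uint64_t, std::size_t> vkMap; // value -> key

    const std::size_t n = 60000000; // expected number of entries
    kvMap.reserve(n); // allocate buckets up front: no rehashing while loading
    vkMap.reserve(n);
    return 0;
}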
I would go for two vectors (assuming that your values are really distinct), as this gives O(1) access where map is O(log n):
vector<uint64_t> values;
vector<size_t> keys;

values.resize(maxSize); // size the vectors up front, so no reallocation occurs while reading the data
keys.resize(maxSize);   // (resize, not reserve: operator[] below requires the elements to exist)
Then, when reading in data
values[keyRead] = data;
keys[valueRead] = key;
Reading the information back is then just:
data = values[currentKey];
key = keys[currentData];
I have a table that is accessed by row and column, where those two are not integral. They will, however, be unique and taken from the same set. The table will need to be expanded, but always at the end. Removal from the middle may be necessary, but is not a priority.
I am currently testing 2 approaches:
map<Key, int> headers;
vector<vector<Value> > table;
Or:
map<Key, map<Key, Value> > table;
Which is going to be more appropriate? I am also open to new suggestions.
Examples showing basic usage (though both very much oversimplified) are here and here.
It all depends on how this structure is going to be used: How densely populated the table will be, how efficient various operations have to be, how big the payload type (Value) will be, etc.
Your first approach (vector of vectors, with a map to translate indices) is a dense representation: every value is stored explicitly in the table. If vectors grow by a factor of L, then the total allocation excess for the data proper can go up to L^2. For example, if L == 1.25, you may end up with over 50% excess storage; if sizeof(Value) is large or your table is big, that may be prohibitive. Also, expanding the table may occasionally be quite expensive (when the vectors must be reallocated).
Your second approach (map of maps) is potentially sparse. However, if all table (row, column) pairs are accessed, it will become dense. Also, the book-keeping overhead for maps is a bit larger than for vectors, so for small Value sizes the vector-of-vectors approach might be more space-efficient. If most of your table will be populated by "default" values, then you might improve things by distinguishing between read and write access to the table: a read could perform a "find" and return a synthesized default value if no table entry was found.
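A sketch of that read/write split for the map-of-maps variant (the Key and Value types are placeholders):

#include <map>
#include <string>

using Key = std::string;
using Value = double;
using Table = std::map<Key, std::map<Key, Value>>;

// Read access: find without inserting; synthesize a default when absent.
Value readCell(const Table& t, const Key& row, const Key& col) {
    auto r = t.find(row);
    if (r == t.end()) return Value{}; // default for a missing row
    auto c = r->second.find(col);
    if (c == r->second.end()) return Value{}; // default for a missing cell
    return c->second;
}

// Write access: operator[] creates entries as needed.
void writeCell(Table& t, const Key& row, const Key& col, Value v) {
    t[row][col] = v;
}

This keeps reads from silently densifying the sparse representation, which is exactly what operator[] would otherwise do.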
I have to store information about contents in a lookup table such that it can be accessed very quickly. I might need to refer to some of the elements in the lookup table recursively to get complete information about the contents. Which will be the better data structure to use:
A map with one of the parameters, which will be unique across all entries in the lookup table, as the key, and the rest of the information as the value, or
A static array for each unique entry, accessed when needed according to a key (the same one as would be used in the map).
I want my software to be robust; any crash would be catastrophic for my product.
It depends on the range of keys that you have.
Usually, when you say lookup table, you mean a smallish table which you can index directly (O(1)). As a dumb example, for a substitution cipher you could have a char cipher[256] and simply index it with the ASCII code of a character to get the substitution character. If the keys are complex objects, or there are simply too many of them, you're probably stuck with a map.
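That direct-index idea in a minimal sketch (the ROT13-style substitution is just for illustration):

#include <iostream>
#include <string>

int main() {
    // A 256-entry substitution table, indexed directly by character code: O(1).
    char cipher[256];
    for (int i = 0; i < 256; ++i)
        cipher[i] = static_cast<char>(i); // identity by default
    for (int i = 0; i < 26; ++i)
        cipher['a' + i] = 'a' + (i + 13) % 26; // ROT13 for lowercase letters

    for (char c : std::string("hello"))
        std::cout << cipher[static_cast<unsigned char>(c)];
    std::cout << '\n'; // prints "uryyb"
    return 0;
}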
You might also consider a hashtable (see unordered_map).
Reply:
If the key itself can be any 32-bit number, it wouldn't make sense to store a very sparse 4-billion element array.
If, however, your keys are themselves between, say, 0..10000, then you can have a 10000-element array containing pointers to your objects (or the objects themselves), with only 2000-5000 of the elements containing non-null pointers (or meaningful data, respectively). Access will be O(1).
If you can have large keys, then I'd probably go with the unordered_map. With a map of 5000 elements, O(log n) means around ~12 accesses; a hash table should be pretty much one or two accesses tops.
I'm not familiar with perfect hashes, so I can't advise about their implementation. If you do choose that, I'd be grateful for a link or two with ideas to keep in mind.
Lookup time in a std::map is O(log n), while a linear search in a static array is O(n) in the worst case.
I'd strongly opt for a std::map, even if it has a larger memory footprint (which should not matter in most cases).
Also you can make "maps of maps" or even deeper structures:
typedef std::map<MyKeyType, std::map<MyKeyType, MyValueType> > MyDoubleMapType;
I have a dense matrix where the indices correspond to genes. While gene identifiers are often integers, they are not contiguous integers. They could be strings instead, too.
I suppose I could use a boost sparse matrix of some sort with integer keys, and it wouldn't matter if they're contiguous. Or would this still occupy a great deal of space, particularly if some genes have identifiers that are nine digits?
Further, I am concerned that sparse storage is not appropriate, since this is an all-by-all matrix (there will be a distance in each and every cell, provided the gene exists).
I'm unlikely to need to perform any matrix operations (e.g., matrix multiplication). I will need to pull vectors out of the matrix (slices).
It seems like the best type of matrix would be keyed by a Boost unordered_map (a hash map), or perhaps even simply an STL map.
Am I looking at this the wrong way? Do I really need to roll my own? I thought I saw such a class somewhere before.
Thanks!
You could use a std::map to map the gene identifiers to unique, consecutively assigned integers (every time you add a new gene identifier to the map, you can give it the map's size as its identifier, assuming you never remove genes from the map).
If you want to be able to search for the identifier of a gene based on its unique integer, you can use a second map or you could use a boost::bimap, which provides a bidirectional mapping of elements.
As for which matrix container to use, you might consider boost::ublas::matrix; it provides vector-like access to rows and columns of the matrix.
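A sketch of that identifier-assignment trick combined with a ublas matrix (the gene names here are invented):

#include <boost/numeric/ublas/matrix.hpp>
#include <cstddef>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::size_t> geneIndex;

    // Assign the next consecutive index the first time a gene is seen.
    auto indexOf = [&](const std::string& gene) -> std::size_t {
        auto it = geneIndex.find(gene);
        if (it == geneIndex.end())
            it = geneIndex.emplace(gene, geneIndex.size()).first;
        return it->second;
    };

    std::size_t a = indexOf("BRCA1"); // 0
    std::size_t b = indexOf("TP53");  // 1

    boost::numeric::ublas::matrix<double> dist(2, 2);
    dist(a, b) = 0.42; // store a distance keyed by the dense indices
    return 0;
}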
If you don't need matrix operations, you don't need a matrix. A 2D map with string keys can be done with map<string, map<string, T> > (for your value type T) in plain C++, or with Boost's unordered_map accordingly.
There is Boost.MultiArray, which will allow you to work with non-contiguous indexes.
If you want an efficient implementation that works with matrices of static size, there is also Boost.LA, which is now on the review schedule.
And last, there is also NT2, which should be submitted to Boost soon.