Find appropriate data storage and search engine - C++

I have many objects. Each object is associated with a number of key-value pairs. The key is an arbitrary string (keys may differ between objects, and there is no full list of possible keys); the value can be numeric, string, datetime, etc.
I need to search through this collection using complex search queries. In the simplest implementation, the user must be able to specify a list of interesting keys and a condition on each key's value, e.g.
key1: not present
key2: present
key3: == "value3"
key4: < 42
key5: contains "value5"
The engine must find all objects which satisfy all conditions (i.e. conditions are AND'ed). In the perfect implementation, the user is able to specify the condition using some query language, e.g.
key1 = "value1" AND (key2 < 3 OR key3 < 3)
I'm using C++ with Qt on Windows (Linux support is not necessary but would be nice). I don't want to use databases that require installation (especially with administrator rights); I want the solution to be portable.
Please suggest a good way to implement this from scratch, or using any library or database that satisfies my needs.
Upd: the question is about storing large amounts of data on disk and searching through it quickly. It may also be about parsing and processing search queries. It is not about the data structures needed to represent the data in memory; that part is simple enough.
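The AND'ed condition list above can be modelled with a variant value type plus per-key predicates. A minimal in-memory sketch, assuming C++17; all type and function names (Value, Condition, matches) are illustrative, not from the question, and disk storage / query parsing would be layered on top:

```cpp
#include <functional>
#include <map>
#include <string>
#include <variant>
#include <vector>

// A value can be numeric or string (a datetime type could be added as a
// further alternative). All names here are illustrative, not from the question.
using Value  = std::variant<double, std::string>;
using Object = std::map<std::string, Value>;   // one object's key-value pairs

// A condition on one key: presence/absence plus an optional value predicate.
struct Condition {
    std::string key;
    bool present = true;                        // key must (not) be present
    std::function<bool(const Value&)> pred;     // e.g. == "value3", < 42
};

// All conditions are AND'ed, exactly as in the simplest implementation above.
bool matches(const Object& obj, const std::vector<Condition>& conds) {
    for (const Condition& c : conds) {
        auto it = obj.find(c.key);
        if (it == obj.end()) {
            if (c.present) return false;        // required key is missing
            continue;                           // "not present" is satisfied
        }
        if (!c.present) return false;           // forbidden key is present
        if (c.pred && !c.pred(it->second)) return false;
    }
    return true;
}
```

A query-language front end (the `key1 = "value1" AND ...` form) would then just parse into a small tree of such predicates.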

If the keys are unique, use std::map or (C++11) std::unordered_map.
If the keys are not unique, use std::multimap or (C++11) std::unordered_multimap.
The unordered variants offer average O(1) search and insert, but require that you provide a quality hash function (not easy to write) and may rehash as the container grows and shrinks.
Searching is provided by the containers.
Serialization is left as an exercise for the reader.

Use nested maps, like std::map<key1, std::map<key2, value>>, and so on.

Related

suggestion for a C++ Datastructure with Two Column and supporting CRUD Operation

I want to design/find a C++ Datastructure/Container which supports two column of data and CRUD operation on those data. I reviewed the STL Containers but none of them supports my requirement (correct me if I am wrong).
My exact requirement is as follows
Datastructure with Two Columns.
Supports the following Functions
Search for a Specific item.
Search for a List of items matching a criteria
Both columns should support the above-mentioned search operation, i.e., I should be able to search for data in either column.
Update a specific item
Delete a specific item
Add new item
I prefer search operation to be faster than add/delete operation.
In addition, I will be sharing this data between threads, so I need mutex support (I can also implement mutex locking on this data separately).
Does any existing STL container meet my requirements, or is there another library or data structure that fits them best?
Note: I can't use a database or SQLite to store my data.
Thank you
Regards,
Dinesh
If one of the columns is unique then you can probably use a map. Otherwise, define a class with two member variables representing the columns and store instances in a vector. There are standard algorithms that will help you search the container.
Search for a Specific item.
If you need one-way mapping (i.e., fast search over values in one column), you should use the map or multimap container classes. There is, however, no bidirectional map in the standard library, so you should build your own as a pair of (multi)maps or use another library, such as boost::bimap.
Your best bet is Boost.Bimap, because it will make it easy for you when you want to search based on either column. If you decided that you need more columns, then Boost.Multi_index might be better. Here is an example!
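For reference, here is what a hand-rolled stand-in for Boost.Bimap could look like when Boost is not an option: two synchronized multimaps so both columns are searchable. All names are illustrative, and the mutex requirement would still have to be handled around it:

```cpp
#include <map>
#include <string>
#include <vector>

// Two synchronized multimaps keep both columns searchable; add/erase touch
// both maps so they never drift apart. Names here are illustrative.
class TwoColumn {
    std::multimap<std::string, int> by_name_;   // column 1 -> column 2
    std::multimap<int, std::string> by_id_;     // column 2 -> column 1
public:
    void add(const std::string& name, int id) {
        by_name_.emplace(name, id);
        by_id_.emplace(id, name);
    }
    void erase(const std::string& name, int id) {
        auto r1 = by_name_.equal_range(name);
        for (auto it = r1.first; it != r1.second; ++it)
            if (it->second == id) { by_name_.erase(it); break; }
        auto r2 = by_id_.equal_range(id);
        for (auto it = r2.first; it != r2.second; ++it)
            if (it->second == name) { by_id_.erase(it); break; }
    }
    // Search column 1: every id stored under this name.
    std::vector<int> ids_for(const std::string& name) const {
        std::vector<int> out;
        auto r = by_name_.equal_range(name);
        for (auto it = r.first; it != r.second; ++it) out.push_back(it->second);
        return out;
    }
    // Search column 2: every name stored under this id.
    std::vector<std::string> names_for(int id) const {
        std::vector<std::string> out;
        auto r = by_id_.equal_range(id);
        for (auto it = r.first; it != r.second; ++it) out.push_back(it->second);
        return out;
    }
};
```

Lookups on either column are O(log n + k); Boost.Bimap or Boost.Multi_index provide the same thing without the manual bookkeeping.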

C++ (Hashmap style) Data Structure Ideal For This Scenario?

People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick as they will be performed several hundred thousand times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory consumption... though it can't be too greedy!
I am developing for a company that will use the program commercially, so any third-party data structure must be free of licensing restrictions; but if the STL has a data structure that will do the job efficiently, then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good:
hash_map and unordered_map are generally implemented with hash tables. Thus the order is not maintained. unordered_map insert/delete/query will be O(1) (constant time) where map will be O(log n), with n the number of items in the data structure. So unordered_map is faster, and if you don't care about the order of the items it should be preferred over map. Sometimes you want to maintain order (ordered by the key), and for that map would be the choice.
A prefix tree (trie), with the element stored at each terminal node, also fits this scenario. It is very fast, even faster than a hash map, because no hash value is computed, and getting a value is purely O(n) where n is the key length. It is a bit memory-hungry, but common key prefixes are shared along the same node path.
EDIT: this assumes the keys are strings, not simple values like integers.
As for ready-made solutions, I'd recommend google::dense_hash_map. It is really fast, especially for numeric keys. You will have to reserve one specific key value as the "empty_key". Also, here is a really nice comparison of different hash-map implementations.
An excerpt
Library Linux-intCPU (sec) Linux-strCPU (sec) Linux PeakMem (MB)
glib 3.490 4.720 24.968
ghthash 3.260 3.460 61.232
CC’s hashtable 3.040 4.050 129.020
TR1 1.750 3.300 28.648
STL hash_set 2.070 3.430 25.764
google-sparse 2.560 6.930 5.42/8.54
google-dense 0.550 2.820 24.7/49.3
khash (C++) 1.100 2.900 6.88/13.1
khash (C) 1.140 2.940 6.91/13.1
STL set (RB) 7.840 18.620 29.388
kbtree (C) 4.260 17.620 4.86/9.59
NP’s splaytree 11.180 27.610 19.024
However, by setting a "deleted_key", this map can also perform deletions. So maybe it is possible to create a custom solution that is even more efficient. But apart from that minor point, any hash map should suit your needs exactly (note that std::map is an ordered tree-based map and thus slower).
What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without knowing more about the key, it's difficult to say. I'm curious how you can guarantee that there will be no possibility of collisions: this implies a small (less than the size of the table), finite set of keys. If that is the case, it may be more efficient to map the keys to small ints and use std::vector (with empty slots for the entries not present).
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that- also in Boost/TR1/C++0x.
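As a sketch of the add/get-only workload with 32-bit uint keys: std::unordered_map gives average O(1) for both operations, and integer keys hash cheaply. Assumes C++17 (for std::optional); the type and function names are illustrative:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Average O(1) add and get, keyed by a 32-bit uint; hashing an integer key
// is cheap. Type and function names are illustrative.
using Table = std::unordered_map<std::uint32_t, std::string>;

void add(Table& t, std::uint32_t key, std::string value) {
    t.emplace(key, std::move(value));   // no-op if the key already exists
}

std::optional<std::string> get(const Table& t, std::uint32_t key) {
    auto it = t.find(key);
    if (it == t.end()) return std::nullopt;
    return it->second;
}
```

Calling t.reserve(expected_size) up front avoids rehashes during the bulk of the several-hundred-thousand inserts.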

What's the best way to search from several map<key,value>?

I have created a vector which contains several map<>.
vector<map<key,value>*> v;
v.push_back(&map1);
// ...
v.push_back(&map2);
// ...
v.push_back(&map3);
At any point in time, if a value has to be retrieved, I iterate through the vector and look up the key in every map element (i.e. v[0], v[1], etc.) until it is found. Is this the best way? I am open to any suggestions. This is just an idea I had; I have yet to implement it this way (please point out any mistakes).
Edit: It's not important, in which map the element is found. In multiple modules different maps are prepared. And they are added one by one as the code progresses. Whenever any key is searched, the result should be searched in all maps combined till that time.
Without more information on the purpose and use, it might be a little difficult to answer. For example, is it necessary to have multiple map objects? If not, then you could store all of the items in a single map and eliminate the vector altogether, which would make lookups more efficient. If there are duplicate entries in the maps, then the key for each value could include the differentiating information that currently determines which map the values go into.
If you need to know which submap the key was found in, try:
unordered_map<key, pair<mapid, value>>
This has much better complexity for searching.
If the keys do not overlap, i.e., are unique throughout all maps, then I'd advise a set or unordered_set with a custom comparison functor, as this will help with the lookup. Or even extend the first map with the new maps, if profiling shows that is fast enough / faster.
If the keys are not unique, go with a multiset or unordered_multiset, again with a custom comparison functor.
You could also sort your vector manually and search it with binary_search. In any case, I advise using a tree to store all the maps.
It depends on how your maps are "independently created", but if it's an option, I'd make just one global map (or multimap) object and pass that to all your creators. If you have lots of small maps all over the place, you can just call insert on the global one to merge your maps into it.
That way you have only a single object in which to perform lookup, which is reasonably efficient (O(log n) for multimap, expected O(1) for unordered_multimap).
This also saves you from having to pass raw pointers to containers around and having to clean up!
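A sketch of this merge-into-one-container idea, with hypothetical key/value types (std::string to int) standing in for whatever the real maps hold:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical key/value types standing in for whatever the real maps hold.
using Map = std::map<std::string, int>;

// Merge every per-module map into one global multimap (duplicate keys are
// kept), so later searches touch a single container.
std::multimap<std::string, int> merge_all(const std::vector<const Map*>& maps) {
    std::multimap<std::string, int> global;
    for (const Map* m : maps)
        global.insert(m->begin(), m->end());
    return global;
}

// One O(log n + k) lookup instead of a scan over every map.
std::vector<int> find_all(const std::multimap<std::string, int>& global,
                          const std::string& key) {
    std::vector<int> out;
    auto r = global.equal_range(key);
    for (auto it = r.first; it != r.second; ++it)
        out.push_back(it->second);
    return out;
}
```

In a live system the per-module creators would insert directly into the shared multimap instead of building their own maps first.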

Fastest C++ Container: Unique Values

I am writing an email application that interfaces with a MySQL database. I have two tables that source my data: one contains unsubscriptions, the other is a standard user table. Currently, I create a vector of pointers to email objects and initially store all of the unsubscribed emails in it. I then have a standard SQL loop that checks whether each email is absent from the unsubscribe vector before adding it to the global send-email vector. My question is: is there a more efficient way of doing this? I have to search the unsubscribe vector for every single email in my system, up to 50K of them. Is there a better structure for searching? And a better structure for maintaining a unique collection of values, perhaps one that would simply discard a value it already contains?
If your C++ Standard Library implementation supports it, consider using a std::unordered_set or a std::hash_set.
You can also use std::set, though its overhead might be higher (it depends on the cost of generating a hash for the object versus the cost of comparing two of the objects several times).
If you do use a node based container like set or unordered_set, you also get the advantage that removal of elements is relatively cheap compared to removal from a vector.
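A sketch of the hash-set approach, assuming C++11 and illustrative names:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Load the unsubscribed addresses into a hash set once, then filter the full
// list with average O(1) membership tests instead of scanning a vector per
// address. Names are illustrative.
std::vector<std::string>
emails_to_send(const std::vector<std::string>& all,
               const std::unordered_set<std::string>& unsubscribed) {
    std::vector<std::string> out;
    for (const std::string& e : all)
        if (unsubscribed.find(e) == unsubscribed.end())
            out.push_back(e);
    return out;
}
```

With ~50K addresses this turns the O(M*N) vector scan into roughly O(M+N) overall.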
Tasks like this (set manipulations) are better left to what is MEANT to execute them - the database!
E.g. something along the lines of:
SELECT email FROM all_emails_table e WHERE NOT EXISTS (
SELECT 1 FROM unsubscribed u where e.email=u.email
)
If you want an ALGORITHM, you can do this fast by retrieving both the list of emails AND the list of unsubscriptions as ORDERED lists. Then you go through the email list (which is ordered) while gliding along the unsubscribe list: at each step, move one position forward in whichever list has the smaller current element. This algorithm is O(M+N) instead of O(M*N) like your current one.
Or, you can build a hash map from unsubscribed email address to 1. Then you do find() calls on that map, which for a correct hash implementation are O(1) per lookup.
Unfortunately, there is no hash map in the C++ standard yet - please see this SO question for existing implementations (a couple of options there are SGI's STL hash_map and Boost's and/or TR1's std::tr1::unordered_map).
One of the comments on that post indicates it will be added to the standard: "With this in mind, the C++ Standard Library Technical Report introduced the unordered associative containers, which are implemented using hash tables, and they have now been added to the Working Draft of the C++ Standard."
Store your email addresses in a std::set or use std::set_difference().
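A sketch of the std::set_difference() route; because both sets iterate in sorted order, the subtraction is a single O(M+N) merge pass (names are illustrative):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// std::set keeps its elements sorted and unique, so subtracting the
// unsubscribe set is a single O(M+N) merge pass over both ranges.
std::vector<std::string>
subtract(const std::set<std::string>& all,
         const std::set<std::string>& unsubscribed) {
    std::vector<std::string> out;
    std::set_difference(all.begin(), all.end(),
                        unsubscribed.begin(), unsubscribed.end(),
                        std::back_inserter(out));
    return out;
}
```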
The best way to do this is within MySQL, I think. You can modify your users table schema with another column, a BIT column, for "is unsubscribed". Better yet: add a DATETIME column for "date unsubscribed" with a default value of NULL.
If using a BIT column, your query becomes something like:
SELECT * FROM `users` WHERE `unsubscribed` <> 0b1;
If using a DATETIME column, your query becomes something like:
SELECT * FROM `users` WHERE `date_unsubscribed` IS NULL;

Python equivalent of std::set and std::multimap

I'm porting a C++ program to Python. There are some places where it uses std::set to store objects that define their own comparison operators. Since the Python standard library has no equivalent of std::set (a sorted associative container), I tried using a normal dictionary and then sorting it when iterating, like this:
def __iter__(self):
    items = self._data.items()
    items.sort()
    return iter(items)
However, profiling has shown that all the calls from .sort() to __cmp__ are a serious bottleneck. I need a better data structure - essentially a sorted dictionary. Does anyone know of an existing implementation? Failing that, any recommendations on how I should implement this? Read performance is more important than write performance and time is more important than memory.
Bonus points if it supports multiple values per key, like the C++ std::multimap.
Note that the OrderedDict class doesn't fit my needs, because it returns items in the order of insertion, whereas I need them sorted using their __cmp__ methods.
For the sorted dictionary, you can (ab)use the stable nature of Python's timsort: basically, keep the items partially sorted, append new items at the end while setting a "dirty" flag, and sort the remainder before iterating. See this entry for details and an implementation (Alex Martelli's answer):
Key-ordered dict in Python
You should use sort(key=...).
The key function you use will be related to the cmp you are already using. The advantage is that the key function is called n times, whereas cmp is called n log n times, and typically key does half the work that cmp does.
If you can include your __cmp__(), we can probably show you how to convert it to a key function.
If you are doing lots of iterations between modifications, you should cache the value of the sorted items.
Python does not have a built-in data structure for this, though the bisect module provides functionality for keeping a sorted list with appropriately efficient algorithms.
If you have a sorted list of keys, you can couple it with a collections.defaultdict(list) to provide multimap-like functionality.
In his book "Programming in Python 3", Mark Summerfield introduces a sorted dictionary class. The source code is available in this zip archive - look for SortedDict.py. The SortedDict class is described in detail in the book (which I recommend very much). It supports arbitrary keys for comparison and multiple values per key (which any dictionary in Python does, so that's not that big a deal, I think).
This is a late post, but if anyone is looking for this now, here you go: https://grantjenks.com/docs/sortedcontainers/
It is not built-in, but it is an easy pip install. It has sorted dicts and lists, both with full support for insertion, deletion, indexing, and binary search. Most of the operations have amortized O(log n) complexity.