Fastest C++ Container: Unique Values

I am writing an email application that interfaces with a MySQL database. I have two tables that are sourcing my data, one of which contains unsubscriptions, the other of which is a standard user table. As of now, I'm creating a vector of pointers to email objects and initially storing all of the unsubscribed emails in it. I then have a standard SQL loop in which I check whether each email is not in the unsubscribe vector, and if so add it to the global send-email vector. My question is: is there a more efficient way of doing this? I have to search the unsub vector for every single email in my system, up to 50K different ones. Is there a better structure for searching? And a better structure for maintaining a unique collection of values? Perhaps one that would simply discard a value if it already contains it?

If your C++ Standard Library implementation supports it, consider using std::unordered_set (or the older, non-standard hash_set extension that some implementations ship).
You can also use std::set, though its overhead might be higher (it depends on the cost of generating a hash for the object versus the cost of comparing two of the objects several times).
If you do use a node based container like set or unordered_set, you also get the advantage that removal of elements is relatively cheap compared to removal from a vector.
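For instance, a minimal sketch of the lookup-set idea (assuming the addresses are plain std::string values and a C++11 library is available; the function name is just illustrative):

#include <string>
#include <unordered_set>
#include <vector>

// Returns the addresses that may be mailed, skipping unsubscribed ones.
std::vector<std::string> filterSubscribed(const std::vector<std::string>& allEmails,
                                          const std::vector<std::string>& unsubscribed)
{
    // O(1) average lookup per address instead of a linear scan of a vector.
    std::unordered_set<std::string> unsub(unsubscribed.begin(), unsubscribed.end());

    std::vector<std::string> toSend;
    for (const std::string& email : allEmails)
        if (unsub.find(email) == unsub.end())
            toSend.push_back(email);
    return toSend;
}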

Tasks like this (set manipulations) are better left to what is MEANT to execute them - the database!
E.g. something along the lines of:
SELECT email FROM all_emails_table e WHERE NOT EXISTS (
    SELECT 1 FROM unsubscribed u WHERE u.email = e.email
)
If you want an ALGORITHM, you can do this fast by retrieving both the list of emails AND the list of unsubscriptions as ORDERED lists. Then you go through the e-mail list (which is ordered), and as you do so you also advance along the unsubscribe list: at each step you move one position forward in whichever list has the smaller current element. This algorithm is O(M+N) instead of O(M*N) like your current one; a sketch of the walk is shown below.
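A rough sketch of that ordered walk (assuming both lists arrive as already-sorted std::vector<std::string>s; names are illustrative):

#include <string>
#include <vector>

// Both inputs must be sorted ascending.
std::vector<std::string> subscribedOnly(const std::vector<std::string>& emails,
                                        const std::vector<std::string>& unsub)
{
    std::vector<std::string> result;
    std::size_t i = 0, j = 0;
    while (i < emails.size()) {
        if (j == unsub.size() || emails[i] < unsub[j]) {
            result.push_back(emails[i]);   // not on the unsubscribe list
            ++i;
        } else if (unsub[j] < emails[i]) {
            ++j;                           // advance the side with the smaller element
        } else {
            ++i;                           // match: skip this address
        }
    }
    return result;
}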
Or, you can build a hash map which maps each unsubscribed e-mail address to 1. Then you do find() calls on that map, which for a decent hash implementation are O(1) on average for each lookup.
Unfortunately, there's no hash map in the C++ standard yet - please see this SO question for existing implementations (a couple of ideas there are SGI STL's hash_map, Boost, and/or TR1's std::tr1::unordered_map).
One of the comments on that post indicates it will be added to the standard: "With this in mind, the C++ Standard Library Technical Report introduced the unordered associative containers, which are implemented using hash tables, and they have now been added to the Working Draft of the C++ Standard."

Store your email addresses in a std::set, or use std::set_difference(), as sketched below.
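For example, a minimal std::set_difference sketch over two sorted sets (the sample addresses are made up):

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

int main() {
    std::set<std::string> allEmails    = {"a@x.com", "b@x.com", "c@x.com"};
    std::set<std::string> unsubscribed = {"b@x.com"};

    std::vector<std::string> toSend;
    // Copies every address in allEmails that is not in unsubscribed.
    std::set_difference(allEmails.begin(), allEmails.end(),
                        unsubscribed.begin(), unsubscribed.end(),
                        std::back_inserter(toSend));
    // toSend now holds "a@x.com" and "c@x.com".
}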

The best way to do this is within MySQL, I think. You can modify your users table schema with another column, a BIT column, for "is unsubscribed". Better yet: add a DATETIME column for "date unsubscribed" with a default value of NULL.
If using a BIT column, your query becomes something like:
SELECT * FROM `users` WHERE `unsubscribed` <> 0b1;
If using a DATETIME column, your query becomes something like:
SELECT * FROM `users` WHERE `date_unsubscribed` IS NULL;

Related

Find appropriate data storage and search engine

I have many objects. Each object is associated with a number of key-value pairs. The key is an arbitrary string (keys may be different for different objects, and there is no full list of possible keys); the value can be numeric, string, datetime, etc.
I need to search through this collection using complex search queries. In the simplest implementation, the user must be able to specify a list of interesting keys and a condition on each key's value, e.g.
key1: not present
key2: present
key3: == "value3"
key4: < 42
key5: contains "value5"
The engine must find all objects which satisfy all conditions (i.e. conditions are AND'ed). In the perfect implementation, the user is able to specify the condition using some query language, e.g.
key1 = "value1" AND (key2 < 3 OR key3 < 3)
I'm using C++ with Qt on Windows (Linux support is not necessary but would be nice). I don't want to use databases that require installation (especially with administrator rights); I want the solution to be portable.
Please suggest a good way to implement this from scratch, or using any library or database that satisfies my needs.
Upd: the question is about storing large amounts of data on disk and searching through it quickly. It may also be about parsing and processing search queries. It is not about the data structures needed to represent the data in memory; that part is simple enough.
If the keys are unique, use std::map or (C++11) std::unordered_map.
If the keys are not unique, use std::multimap or (C++11) std::unordered_multimap.
The latter have O(1) average search and insert, but require that you provide a quality hash function (not easy to write) and may re-hash as the map grows and shrinks.
Searching is provided by the containers.
Serialization is left as an exercise for the reader.
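As a rough in-memory sketch of the multimap variant (key/value types and the sample conditions are only illustrative):

#include <iostream>
#include <map>
#include <string>

int main() {
    // One object's attributes; keys need not be unique, hence multimap.
    std::multimap<std::string, std::string> attrs;
    attrs.insert({"key2", "whatever"});
    attrs.insert({"key3", "value3"});
    attrs.insert({"key5", "abc value5 def"});

    // "key2: present" -- does the key exist at all?
    bool hasKey2 = attrs.count("key2") > 0;

    // "key3 == value3" -- check every value stored under key3.
    auto range = attrs.equal_range("key3");
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == "value3")
            std::cout << "key3 matches\n";

    std::cout << (hasKey2 ? "key2 present\n" : "key2 absent\n");
}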
Use nested maps, e.g. std::map<key1, std::map<key2, value>>.

Suggestion for a C++ data structure with two columns and support for CRUD operations

I want to design/find a C++ data structure/container which supports two columns of data and CRUD operations on that data. I reviewed the STL containers but none of them supports my requirements (correct me if I am wrong).
My exact requirements are as follows:
A data structure with two columns.
Supports the following functions:
Search for a Specific item.
Search for a List of items matching a criteria
Both columns should support the above-mentioned search operations, i.e., I should be able to search for data in either column.
Update a specific item
Delete a specific item
Add new item
I prefer search operations to be faster than add/delete operations.
In addition, I will be sharing this data between threads, hence it needs to support mutex locking (I can also implement the mutex lock on this data separately).
Does any of the existing STL containers meet my requirements, or is there any other library or data structure that best fits them?
Note: I can't use a database or SQLite to store my data.
Thank you
Regards,
Dinesh
If one of the columns is unique then you can probably use std::map. Otherwise, define a class with two member variables representing the columns and store instances in a vector. There are standard algorithms that will help you search the container (see the sketch below).
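For instance, a short sketch of the vector-of-rows approach (the Row type, field names, and criteria are only illustrative):

#include <algorithm>
#include <string>
#include <vector>

struct Row {
    std::string colA;
    int         colB;
};

int main() {
    std::vector<Row> table = {{"alpha", 1}, {"beta", 2}, {"beta", 30}};

    // Search for a specific item by either column (linear scan).
    auto it = std::find_if(table.begin(), table.end(),
                           [](const Row& r) { return r.colA == "beta"; });

    // Update the item through the iterator once found.
    if (it != table.end())
        it->colB = 42;

    // Delete every row matching a criterion.
    table.erase(std::remove_if(table.begin(), table.end(),
                               [](const Row& r) { return r.colB > 10; }),
                table.end());

    // Add a new item.
    table.push_back({"gamma", 7});
}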
Search for a Specific item.
If you need a one-way mapping (i.e. fast search over the values in one column), you should use the map or multimap container classes. There is, however, no bidirectional map in the standard library, so you have to build your own as a pair of (multi)maps or use another library, such as boost::bimap.
Your best bet is Boost.Bimap, because it makes it easy to search based on either column. If you decide that you need more columns, then Boost.MultiIndex might be better. Here is an example:
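A minimal Boost.Bimap sketch (the column types and sample data are only illustrative):

#include <iostream>
#include <string>
#include <boost/bimap.hpp>

int main() {
    typedef boost::bimap<std::string, int> Table;
    Table table;

    // Add new items (both "columns" at once).
    table.insert(Table::value_type("alpha", 1));
    table.insert(Table::value_type("beta", 2));

    // Search by the left column...
    Table::left_map::const_iterator l = table.left.find("beta");
    if (l != table.left.end())
        std::cout << l->second << '\n';   // prints 2

    // ...or by the right column.
    Table::right_map::const_iterator r = table.right.find(1);
    if (r != table.right.end())
        std::cout << r->second << '\n';   // prints alpha

    // Delete through either view.
    table.left.erase("alpha");
}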

multi_index_container

So I have a boost::multi_index_container with multiple non-unique indexes. I would like to find an elegant way to do a relational-database-style query to find all elements that match a set of criteria using multiple indexes.
For instance, given a list of connections between devices, I'd like to search for all elements whose source is 'server' and whose destination is 'pc2'. I've got an index on Source and an index on Dest.
Source   Dest   Port
-------  ----   ----
server   pc1      23
server   pc1      27
server   pc1      80
server   pc2      80   <- want to find these two
server   pc2      90   <-
printer  pc3     110
printer  pc1     110
scanner  mac    8080
Normally I might do lower_bound and upper_bound on the first index (to match 'server'), then do a linear search between those iterators to find those elements that match in the "Dest" column, but that's not very satisfying, since I've got a second index. Is there an elegant stl/boost-like way to take advantage of the fact that there are two indexes and avoid a linear search (or an equivalent amount of work, such as adding all intermediate results to another container, etc.)?
(Obviously in the example, a linear search would be fastest, but if there were 10000 items with 'server' as the source, having the second index would start to be nice.)
Any ideas are appreciated!
You might simply get some inspiration from relational databases...
... but first we need to demystify a thing about indexes.
Compound Indexes
In a relational database there are two types of indexes:
regular indexes: an index on one column
compound indexes: an index on multiple columns at once
The two give different performance results. When you need to use two indexes, there is a merge pass to combine the results they return (sometimes called a join or index merge); a compound index avoids that pass, and can therefore provide a speed boost.
Multi-Index
Boost multi-index can use compound indexes; you are free to provide your own hashing or comparison function, after all.
A key difference with a relational database is that you cannot have an efficient merge pass (merging two ROWID sets), because that requires intrinsic knowledge of the structures to be efficient; therefore you are indeed stuck with a linear search among the results of the first search. It is up to you to pick the most discriminating search to do first.
Note: the name multi-index refers to the idea that it automatically maintains multiple indexes when you insert, update, and delete your elements. It also means that you can search using any of those indexes, with a performance profile that you decide. But it is not a full-blown database engine with statistics, heuristics, and a query engine.
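A minimal sketch of such a compound (composite) key with Boost.MultiIndex (the Connection record and all names are only illustrative):

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/tuple/tuple.hpp>

struct Connection {
    std::string source;
    std::string dest;
    int         port;
};

namespace bmi = boost::multi_index;

// A single ordered index over the pair (source, dest).
typedef bmi::multi_index_container<
    Connection,
    bmi::indexed_by<
        bmi::ordered_non_unique<
            bmi::composite_key<
                Connection,
                bmi::member<Connection, std::string, &Connection::source>,
                bmi::member<Connection, std::string, &Connection::dest>
            >
        >
    >
> ConnectionSet;

int main() {
    ConnectionSet conns;
    conns.insert(Connection{"server", "pc1", 23});
    conns.insert(Connection{"server", "pc2", 80});
    conns.insert(Connection{"server", "pc2", 90});

    // Both rows with source == "server" and dest == "pc2", found in
    // logarithmic time with no linear scan over all the "server" rows.
    auto range = conns.equal_range(
        boost::make_tuple(std::string("server"), std::string("pc2")));
    (void)range;
}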
The most elegant way to do a relational-database style query is to use a relational database. I'm not being flippant; you're using the wrong data structure. If "relational-database style query" operations are going to be something that you do frequently, I would strongly urge you to invest in SQLite.
The purpose of Boost.MultiIndex is not to be a quick-and-dirty database.

Hash table with two keys

I have a large amount of data the I want to be able to access in two different ways. I would like constant time look up based on either key, constant time insertion with one key, and constant time deletion with the other. Is there such a data structure and can I construct one using the data structures in tr1 and maybe boost?
Use two parallel hash-tables. Make sure that the keys are stored inside the element value, because you'll need all the keys during deletion.
Have you looked at Bloom Filters? They aren't O(1), but I think they perform better than hash tables in terms of both time and space required to do lookups.
It's hard to tell why you need to do this, but as someone said, try using two different hash tables.
Roughly, in C++ (two std::unordered_maps, one per key):
#include <iostream>
#include <string>
#include <unordered_map>

struct MyObj {
    std::string inKey;
    int         outKey;
    std::string data;
};

int main() {
    std::unordered_map<std::string, int> inHash;   // inKey  -> outKey
    std::unordered_map<int, MyObj>       outHash;  // outKey -> object

    // Hello myObj example!!
    MyObj myObj{"one", 1, "blahblah..."};

    // adding stuff
    inHash[myObj.inKey] = myObj.outKey;
    outHash[myObj.outKey] = myObj;

    // finding stuff
    // straight
    MyObj found = outHash.at(1);
    // the other way; still constant time on average
    int key = inHash.at("one");
    found = outHash.at(key);
    std::cout << found.data << '\n';

    // deleting stuff: erase from both tables, using both keys
    inHash.erase(myObj.inKey);
    outHash.erase(myObj.outKey);
}
Not sure if that's what you're looking for.
This is one of the limits of the design of the standard containers: a container in a sense "owns" the contained data and expects to be the only owner... containers are not merely "indexes".
For your case a simple, but not 100% effective, solution is to have two std::maps with "Node *" as the value and to store both keys in the Node structure (so you have each key stored twice). With this approach you can update your data structure with reasonable overhead (you will do some extra map searches, but that should be fast enough).
A possibly "correct" solution however would IMO be something like
struct Node
{
    Key     key1;   // key used by the first hash table
    Key     key2;   // key used by the second hash table
    Payload data;
    Node   *Collision1Prev, *Collision1Next;   // chain links within table 1
    Node   *Collision2Prev, *Collision2Next;   // chain links within table 2
};
basically having each node in two different hash tables at the same time.
Standard containers cannot be combined this way. Other examples I have coded by hand in the past are a hash table whose nodes are also in a doubly-linked list, or a tree whose nodes are also in an array.
For very complex data structures (e.g. a network of structures where each one is both the "owner" of several chains and part of several other chains simultaneously) I have sometimes even resorted to code generation (i.e. scripts that generate correct pointer-handling code given a description of the data structure).

Python equivalent of std::set and std::multimap

I'm porting a C++ program to Python. There are some places where it uses std::set to store objects that define their own comparison operators. Since the Python standard library has no equivalent of std::set (a sorted container ordered by the elements' own comparison operators), I tried using a normal dictionary and then sorting it when iterating, like this:
def __iter__(self):
    items = self._data.items()
    items.sort()
    return iter(items)
However, profiling has shown that all the calls from .sort() to __cmp__ are a serious bottleneck. I need a better data structure - essentially a sorted dictionary. Does anyone know of an existing implementation? Failing that, any recommendations on how I should implement this? Read performance is more important than write performance and time is more important than memory.
Bonus points if it supports multiple values per key, like the C++ std::multimap.
Note that the OrderedDict class doesn't fit my needs, because it returns items in the order of insertion, whereas I need them sorted using their __cmp__ methods.
For the sorted dictionary, you can (ab)use the stable nature of Python's timsort: basically, keep the items partially sorted, append new items at the end as needed while setting a "dirty" flag, and sort the remainder before iterating. See this entry for details and an implementation (Alex Martelli's answer):
Key-ordered dict in Python
You should use sort(key=...).
The key function you use will be related to the cmp you are already using. The advantage is that the key function is called n times, whereas cmp is called about n log n times, and typically key does half the work that cmp does.
If you can include your __cmp__(), we can probably show you how to convert it to a key function.
If you are doing lots of iterations between modifications, you should cache the value of the sorted items.
Python does not have built-in data-structures for this, though the bisect module provides functionality for keeping a sorted list with appropriately efficient algorithms.
If you have a list of sorted keys, you can couple it with a collections.defaultdict(list) to provide multimap-like functionality.
In his book "Programming in Python 3", Mark Summerfield introduces a sorted dictionary class. The source code is available in this zip archive - look for SortedDict.py. The SortedDict class is described in detail in the book (which I recommend very much). It supports arbitrary keys for comparison and multiple values per key (which any dictionary in Python does, so that's not that big a deal, I think).
This is a late post but if anyone is looking for this now, here you go: https://grantjenks.com/docs/sortedcontainers/
This is not a built-in but just an easy pip install. It has sorted dicts and lists both with full support for insert, delete, indexing and binary search. Most of the operations have amortised O(log(n)) complexity.