Is there a structure in Python similar to C++ STL map? - c++

Is there a structure in Python which supports similar operations to C++ STL map and complexity of operations correspond to C++ STL map?

dict is usually close enough - what do you want that it doesn't do?
If the answer is "provide order", then what's actually wrong with for k in sorted(d.keys())? Uses too much memory, maybe? If you're doing lots of ordered traversals interspersed with inserts then OK, point taken, you really do want a tree.
dict is actually a hash table rather than a b-tree. But then map isn't defined to be a b-tree, so it doesn't let you do things like detaching sub-trees as a new map, it just has the same performance complexities. All that's really left to worry about is what happens to dict when there are large numbers of hash collisions, but it must be pretty rare to use Python in situations where you want tight worst-case performance guarantees.

Python SortedDict is a similar to C++ STL map. You can read about it here or here.
SortedDict is a container of key-value pairs in which an order is
imposed on the keys according to their ordered relation to each other.
As with Python’s built-in dict data type, SortedDict supports fast
insertion, deletion, and lookup by key.

I believe that the standard python type dict() will do the trick in most cases. The difference from C++'s std::map is that dict is impemented as a hash map and C++'s map is tree-based.

Python dictionaries [5.5].

Look at bintrees module (pip install bintrees). This package provides Binary- RedBlack- and AVL-Trees written in Python and Cython/C.

Have you looked into Python dictionaries?

In most cases, the python dictionary dict() will work just fine similar to stl map of C++ because they both are implemented using a hash table.
Please look at python_dictionaries

use this : from sortedcontainers import SortedDict
don't use this: from collections import OrderedDict

Related

Suggested method for fast reverse lookup by value in key/value pairs in C++

I need to have a large key/value pair array in C++ memory that might look something like the following. However, the real world example will be tens of thousands of records:
key value
----- -----
1 20
2 20
3 12
4 3
5 blank
6 3
7 blank
...
Given a value of '3', I need a fast reverse lookup of matching keys (4 & 6 in this example). I would like to avoid iterating the list. Is there a recommended solution in C++?
I would prefer to use the Qt library. I have looked at QMap and QHash, but a fast reverse lookup doesn't seem an option?
Is there a recommended solution in C++?
If Boost can be an option (I know you prefer Qt). Then you can use Boost.Bimap
Boost.Bimap is a bidirectional maps library for C++. With Boost.Bimap
you can create associative containers in which both types can be used
as key. A bimap can be thought of as a combination of a
std::map and a std::map. The learning curve of bimap is
almost flat if you know how to use standard containers. A great deal
of effort has been put into mapping the naming scheme of the STL in
Boost.Bimap. The library is designed to match the common STL
containers.
Otherwise, since Qt has not a suitable data-structure. You can use two maps, to achieve your goal. However, it duplicates memory.
What you're looking for is bidirectional map concept. Sadly, Qt does not provide a readily available solution for this.
If you're looking to use only Qt libraries, you'll have to maintain two QMaps (or QHashes or something equivalent).
QMap<int, QPair<int,bool> > forwardLookupMap
QMap<QPair<int, bool>, int> reverseLookupMap
You would have to maintain reverse lookup map by hand, which is cumbersome, and memory requirements don't look too good.
Alternative option would be using third-party implementation, like Boost.Bimap

C++ (Hashmap style) Data Structure Ideal For This Scenario?

People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick as they will be used several 100,000 times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory-consumption... though it can't be too
greedy!
I am developing for a company that will use the program commercially so any third-party data structures should come with no copyright protection or anything, but if the STL has a data structure that will do the job efficiently then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good?
hash_map and unordered_map are generally implemented with hash tables.
Thus the order is not maintained. unordered_map insert/delete/query
will be O(1) (constant time) where map will be O(log n) where n is the
number of items in the data structure. So unordered_map is faster, and
if you don't care about the order of the items should be preferred
over map. Sometimes you want to maintain order (ordered by the key)
and for that map would be the choice.
Looks like a prefix tree (with element at each node end) also fits in this scenario. It's damn fast, even faster than hash map because no hash value calculation is done and getting a value is purely O(n) where n is the key length. It's a bit memory hungry but common prefix of keys are shared in the same node path.
EDIT: I assume the keys are string, not simple values like integers
As for build-in solutions I'd recommand google::dense_hash_map. They are really fast especially for numeric keys. You'll have to decide on a specific key that will be reserved as "empty_key". Moreover here is a really nice comparison of different hash-map implementations.
An excerpt
Library Linux-intCPU (sec) Linux-strCPU (sec) Linux PeakMem (MB)
glib 3.490 4.720 24.968
ghthash 3.260 3.460 61.232
CC’s hashtable 3.040 4.050 129.020
TR1 1.750 3.300 28.648
STL hash_set 2.070 3.430 25.764
google-sparse 2.560 6.930 5.42/8.54
google-dense 0.550 2.820 24.7/49.3
khash (C++) 1.100 2.900 6.88/13.1
khash (C) 1.140 2.940 6.91/13.1
STL set (RB) 7.840 18.620 29.388
kbtree (C) 4.260 17.620 4.86/9.59
NP’s splaytree 11.180 27.610 19.024
However, when setting a "deleted_key", this map can also perform deletions. So maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash-map should exactly suit your needs (note that "map" is an ordered tree-map and thus slower).
What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without
knowing more about the key, it's difficult to say. I'm curious about
how you can guarantee that there will be no possibility of collisions:
this implies a small (less than the size of the table), finite set of
keys. If this is the case, it may be more efficient to map the keys to
a small int, and use std::vector (with empty slots for the entries not
present).
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that- also in Boost/TR1/C++0x.

Python equivalent of std::set and std::multimap

I'm porting a C++ program to Python. There are some places where it uses std::set to store objects that define their own comparison operators. Since the Python standard library has no equivalent of std::set (a sorted key-value mapping data structure) I tried using a normal dictionary and then sorting it when iterating, like this:
def __iter__(self):
items = self._data.items()
items.sort()
return iter(items)
However, profiling has shown that all the calls from .sort() to __cmp__ are a serious bottleneck. I need a better data structure - essentially a sorted dictionary. Does anyone know of an existing implementation? Failing that, any recommendations on how I should implement this? Read performance is more important than write performance and time is more important than memory.
Bonus points if it supports multiple values per key, like the C++ std::multimap.
Note that the OrderedDict class doesn't fit my needs, because it returns items in the order of insertion, whereas I need them sorted using their __cmp__ methods.
For the sorted dictionary, you can (ab)use the stable nature of python's timsort: basically, keep the items partially sorted, append items at the end when needed, switching a "dirty" flag, and sort the remaining before iterating. See this entry for details and implementation (A Martelli's answer):
Key-ordered dict in Python
You should use sort(key=...).
The key function you use will be related to the cmp you are already using. The advantage is that the key function is called n times whereas the cmp is called nlog n times, and typically key does half the work that cmp does
If you can include your __cmp__() we can probably show you how to convert it to a key function
If you are doing lots of iterations between modifications, you should cache the value of the sorted items.
Python does not have built-in data-structures for this, though the bisect module provides functionality for keeping a sorted list with appropriately efficient algorithms.
If you have a list of sorted keys, you can couple it with a collections.defaultdict(list) to provide multimap-like functionality.
In his book "Programming in Python 3", Mark Summerfield introduces a sorted dictionary class. The source code is available in this zip archive - look for SortedDict.py. The SortedDict class is described in detail in the book (which I recommend very much). It supports arbitrary keys for comparison and multiple values per key (which any dictionary in Python does, so that's not that big a deal, I think).
This is a late post but if anyone is looking for this now, here you go: https://grantjenks.com/docs/sortedcontainers/
This is not a built-in but just an easy pip install. It has sorted dicts and lists both with full support for insert, delete, indexing and binary search. Most of the operations have amortised O(log(n)) complexity.

hashing a dictionary in C++

hi I want to use a hashmap for words in the dictionary and the indices of the words in the dicionary.
What would be the fastest hash algorithm for this?
Thanks!
At the bottom of this page there is a section A Note on Hash Functions with some information which you might find useful.
For convenience, I'll just replicate some links here:
Bob Jenkins
Paul Hsieh
Fowler/Noll/Vo (FNV)
MurmurHash
There are many different hashing algorithms, of varying efficiency, but the most important issue is that it scatter the items fairly uniformly across the different hash buckets.
However, you may as well assume that the Microsoft engineers/library engineers have done a decent job of writing an efficient and effective hash algorithm, and just using the built-in libraries/classes.
The fastest hash function will be
template <class T>
size_t hash(T key) {
return 0;
}
however, though the hashing will be mighty fast, you will suffer performance elsewhere. What you want is to try several hashing algorithms on actual data and see which one actually gives you the best performance in aggregate on the actual data you expect to use if the hashing or lookup is even a performance bottleneck. Until then, go with something handy. MD5 is pretty widely available.
Have you tried just using the STL hash_map and seeing if it serves your needs before rolling anything more complex?
http://www.sgi.com/tech/stl/hash_map.html
boost has a hash function that you can reuse for your own data (predefined for common types). That'd probably work well & fast enough if your needs aren't special.
What is your use case? A radix search tree (trie) might be more suitable than a hash if you're mapping from string to integer. Tries have the advantage of reducing key comparisons for variable length keys. (e.g., strings)
Even a binary search tree (e.g., STL's map) might be superior to a hash based container in terms of memory use and number of key comparisons. A hash is more efficient only if you have very few collisions.

Idiomatic way to do list/dict in Cython?

My problem: I've found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.
I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.
Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?
If not I guess I'll just have to write it in C++ and wrap in a Cython import.
Cython now has template support, and comes with declarations for some of the STL containers.
See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library
Here's the example they give:
from libcpp.vector cimport vector
cdef vector[int] vect
cdef int i
for i in range(10):
vect.push_back(i)
for i in range(10):
print vect[i]
Doing similar operations in Python as in C++ can often be slower. list and dict are actually implemented very well, but you gain a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.
Incidentally, std::vector is implemented in a pretty similar way to list. std::map, though, is actually implemented in a way that many operations are slower than dict as its size gets large. For suitably large examples of each, dict overcomes the constant factor by which it's slower than std::map and will actually do operations like lookup, insertion, etc. faster.
If you want to use std::map and std::vector, nothing is stopping you. You'll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.
There are C API calls for controlling the creation of objects with some detail. You can say "Make a list with at least this many elements", but this doesn't improve the overall complexity of your list creation-and-filling operation. It certainly doesn't change much later as you try to change your list.
My general advice is
If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.
I doubt you are going to get any speedup you want out of using std::vector over list for a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don't know without measuring, nor do you. ;) ).
dict actually does its job really well. I definitely wouldn't try introducing a new general-purpose type for use in Python based on std::map, which has worse algorithmic complexity in time for many important operations and—in at least some implementations—leaves some optimisations to the user that dict already has.
If I did want something that worked a little more like std::map, I'd probably use a database. This is generally what I do if stuff I want to store in a dict (or for that matter, stuff I store in a list) gets too big for me to feel comfortable storing in memory. Python has sqlite3 in the stdlib and drivers for all other major databases available.
C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because using templates/generics one specifies that the vector will only contain elements of a certain type, e.g. vector with tuples of three elements. Cython can't do this last thing and it sounds nontrivial -- it would have to be enforced at compile time, somehow (typechecking at runtime is what Python already does). So right now when you pop something off a list in Cython there is no way of knowing in advance what type it is , and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it's the most crucial shortcoming of Cython for non-numerical tasks.
The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.
Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.
And std::map shouldn't be compared to dict; std::map maintains keys in sorted order because it is a balanced tree, dict solves a different problem. A better comparison would be dict and Google's hashtable.
You can take a look at the standard array module for Python if this is appropriate for your Cython setting. I'm not sure since I have never used Cython.
There is no way to get native Python lists/dicts up to the speed of a C++ map/vector or even anywhere close. It has nothing to do with allocation or type declaration but rather paying the interpreter overhead. The example you mention (numpy) is a C extension and is written in C for precisely this reason.
Just because it was not mentioned here: You can easily wrap for example a C++ vector in a custom extension type.
from libcpp.vector cimport vector
cdef class pyvector:
"""Extension type wrapping a vector"""
cdef vector[long] _data
cpdef void push_back(self, long x):
self._data.push_back(x)
#property
def data(self):
return self._data
In this way, you can store your data in a vector allowing fast Cython operations while still being able to access the data (with some overhead) from the Python side.