Idiomatic way to do list/dict in Cython? - c++

My problem: I've found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.
I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.
Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?
If not I guess I'll just have to write it in C++ and wrap in a Cython import.

Cython now has template support, and comes with declarations for some of the STL containers.
See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library
Here's the example they give:
from libcpp.vector cimport vector
cdef vector[int] vect
cdef int i
for i in range(10):
vect.push_back(i)
for i in range(10):
print vect[i]

Doing similar operations in Python as in C++ can often be slower. list and dict are actually implemented very well, but you gain a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.
Incidentally, std::vector is implemented in a pretty similar way to list. std::map, though, is actually implemented in a way that many operations are slower than dict as its size gets large. For suitably large examples of each, dict overcomes the constant factor by which it's slower than std::map and will actually do operations like lookup, insertion, etc. faster.
If you want to use std::map and std::vector, nothing is stopping you. You'll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.
There are C API calls for controlling the creation of objects with some detail. You can say "Make a list with at least this many elements", but this doesn't improve the overall complexity of your list creation-and-filling operation. It certainly doesn't change much later as you try to change your list.
My general advice is
If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.
I doubt you are going to get any speedup you want out of using std::vector over list for a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don't know without measuring, nor do you. ;) ).
dict actually does its job really well. I definitely wouldn't try introducing a new general-purpose type for use in Python based on std::map, which has worse algorithmic complexity in time for many important operations and—in at least some implementations—leaves some optimisations to the user that dict already has.
If I did want something that worked a little more like std::map, I'd probably use a database. This is generally what I do if stuff I want to store in a dict (or for that matter, stuff I store in a list) gets too big for me to feel comfortable storing in memory. Python has sqlite3 in the stdlib and drivers for all other major databases available.

C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because using templates/generics one specifies that the vector will only contain elements of a certain type, e.g. vector with tuples of three elements. Cython can't do this last thing and it sounds nontrivial -- it would have to be enforced at compile time, somehow (typechecking at runtime is what Python already does). So right now when you pop something off a list in Cython there is no way of knowing in advance what type it is , and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it's the most crucial shortcoming of Cython for non-numerical tasks.
The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.
Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.
And std::map shouldn't be compared to dict; std::map maintains keys in sorted order because it is a balanced tree, dict solves a different problem. A better comparison would be dict and Google's hashtable.

You can take a look at the standard array module for Python if this is appropriate for your Cython setting. I'm not sure since I have never used Cython.

There is no way to get native Python lists/dicts up to the speed of a C++ map/vector or even anywhere close. It has nothing to do with allocation or type declaration but rather paying the interpreter overhead. The example you mention (numpy) is a C extension and is written in C for precisely this reason.

Just because it was not mentioned here: You can easily wrap for example a C++ vector in a custom extension type.
from libcpp.vector cimport vector
cdef class pyvector:
"""Extension type wrapping a vector"""
cdef vector[long] _data
cpdef void push_back(self, long x):
self._data.push_back(x)
#property
def data(self):
return self._data
In this way, you can store your data in a vector allowing fast Cython operations while still being able to access the data (with some overhead) from the Python side.

Related

Mapping vectors of arbitrary type

I need to store a list vectors of different types, each to be referenced by a string identifier. For now, I'm using std::map with std::string as the key and boost::any as it's value (example implementation posted here).
I've come unstuck when trying to run a method on all the stored vector, e.g.:
std::map<std::string, boost::any>::iterator it;
for (it = map_.begin(); it != map_.end(); ++it) {
it->second.reserve(100); // FAIL: refers to boost::any not std::vector
}
My questions:
Is it possible to cast boost::any to an arbitrary vector type so I can execute its methods?
Is there a better way to map vectors of arbitrary types and retrieve then later on with the correct type?
At present, I'm toying with an alternative implementation which replaces boost::any with a pointer to a base container class as suggested in this answer. This opens up a whole new can of worms with other issues I need to work out. I'm happy to go down this route if necessary but I'm still interested to know if I can make it work with boost::any, of if there are other better solutions.
P.S. I'm a C++ n00b novice (and have been spoilt silly by Python's dynamic typing for far too long), so I may well be going about this the wrong way. Harsh criticism (ideally followed by suggestions) is very welcome.
The big picture:
As pointed out in comments, this may well be an XY problem so here's an overview of what I'm trying to achieve.
I'm writing a task scheduler for a simulation framework that manages the execution of tasks; each task is an elemental operation on a set of data vectors. For example, if task_A is defined in the model to be an operation on "x"(double), "y"(double), "scale"(int) then what we're effectively trying to emulate is the execution of task_A(double x[i], double y[i], int scale[i]) for all values of i.
Every task (function) operate on different subsets of data so these functions share a common function signature and only have access to data via specific APIs e.g. get_int("scale") and set_double("x", 0.2).
In a previous incarnation of the framework (written in C), tasks were scheduled statically and the framework generated code based on a given model to run the simulation. The ordering of tasks is based on a dependency graph extracted from the model definition.
We're now attempting to create a common runtime for all models with a run-time scheduler that executes tasks as their dependencies are met. The move from generating model-specific code to a generic one has brought about all sorts of pain. Essentially, I need to be able to generically handle heterogenous vectors and access them by "name" (and perhaps type_info), hence the above question.
I'm open to suggestions. Any suggestion.
Looking through the added detail, my immediate reaction would be to separate the data out into a number of separate maps, with the type as a template parameter. For example, you'd replace get_int("scale") with get<int>("scale") and set_double("x", 0.2) with set<double>("x", 0.2);
Alternatively, using std::map, you could pretty easily change that (for one example) to something like doubles["x"] = 0.2; or int scale_factor = ints["scale"]; (though you may need to be a bit wary with the latter -- if you try to retrieve a nonexistent value, it'll create it with default initialization rather than signaling an error).
Either way, you end up with a number of separate collections, each of which is homogeneous, instead of trying to put a number of collections of different types together into one big collection.
If you really do need to put those together into a single overall collection, I'd think hard about just using a struct, so it would become something like vals.doubles["x"] = 0.2; or int scale_factor = vals.ints["scale"];
At least offhand, I don't see this losing much of anything, and by retaining static typing throughout, it certainly seems to fit better with how C++ is intended to work.

Dealing with the surprising lack of ParList in scala.collections.parallel

So scala 2.9 recently turned up in Debian testing, bringing the newfangled parallel collections with it.
Suppose I have some code equivalent to
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):List[Int} = s.map(expensiveFunction)
now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList... but to my surprise, there isn't one! (Just ParVector, ParMap, ParSet...).
As a workround, this (or a one-line equivalent) seems to work well enough:
def process(s:List[Int]):List[Int} = {
val ps=scala.collection.parallel.immutable.ParVector()++s
val pr=ps.map(expensiveFunction)
List()++pr
}
yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.
My question is a sort of an aggregated:
Why isn't there a ParList ?
Given there isn't a ParList, is there a
better pattern/idiom I should adopt so that
I don't feel like they're missing ?
Am I just "behind the times" using Lists a
lot in my scala programs (like all the Scala books I
bought back in the 2.7 days taught me) and
I should actually be making more use of
Vectors ? (I mean in C++ land
I'd generally need a pretty good reason to use
std::list over std::vector).
Lists are great when you want pattern matching (i.e. case x :: xs) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys).
Hence it does not make much sense (to have a parallel List) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:
xs.toIndexedSeq.par
First, let me show you how to make a parallel version of that code:
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq
That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List, call .toList instead of .seq.
As for the questions:
There isn't a ParList because a List is an intrinsically non-parallel data structure, because any operation on it requires traversal.
You should code to traits instead of classes -- Seq, ParSeq and GenSeq, for example. Even performance characteristics of a List are guaranteed by LinearSeq.
All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.
Furthermore, there wasn't a collection like Vector in Scala 2.7 -- an immutable collection with (near) constant indexed access.
A List cannot be easily split into various sub-lists which makes it hard to parallelise. For one, it has O(n) access; also a List cannot strip its tail, so one need to include a length parameter.
I guess, taking a Vector will be the better solution.
Note that Scala’s Vector is different from std::vector. The latter is basically a wrapper around standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.

D naming conventions: How is Phobos organized?

I'm making my own little library of handy functions and I'm trying to follow Phobos's naming convention but I'm getting really confused. How do I know where things would fit?
Example:
If there was a function like foldRight in Phobos (basically reduce in the reverse direction), which module would I find it in?
I can think of several:
std.algorithm: Because it's expressing an algorithm
std.array: Because I'm likely going to use it on arrays
std.container: Because it's used on containers, rather than single objects
std.functional: Because it's used mainly in functional programming
std.range: Because it operates on ranges as well
but I have no idea which one would be a good choice -- I could give a convincing argument for at least 3 of them.
What's the convention?
std.algorithm: yep and you can implement it like reduce!fun(retro(r))
this module specifies algorithms that run on sequences
std.array: no because it can also run on other ranges
these are helper functions that run only on build-in arrays
std.container: no because it doesn't define a data structure (like a treeset)
this defines data structures that are not build in into the language (for now a linked list, a binary tree and a deterministic array in terms of memory management)
std.functional: no because it doesn't operate on a function but on a range
this one takes a function and returns a different one
std.range: no because it doesn't define a range or provide a different way to iterate over one
the lack of a clear structure is one of my gripes with the phobos library TBH but really reading the first paragraph of the docs should tell you quite a bit of where to put the function

Nested Dictionary/Array in C++

Everybody,
I am a Python/C# guy and I am trying to learn C++.
In Python I used to do things like:
myRoutes = {0:[1,2,3], 1:[[1,2],[3,4]], 2:[[1,2,3],[[1,2,3],[1,2,3]],[4]]}
Basically when you have arrays of variable length and you don't want to wast a 2D matrix for them, nesting arrays into a dictionary to keep track of them is a good option.
In C++ I tried std::map<int, std:map<int, std:map<int, int> > > and it works but I feel there got to bo a better way to do this.
I prefer to stick to the standard libraries but popular libraries like boost are also acceptable for me.
I appreciate your help,
Ali
It looks like part of the question is: 'How do I store heterogeneous data in a container?' There are several different approaches:
1) Use a ready-made class of array types that abstracts away the exact details (i.e., dimensionality). Example: Boost Basic Linear Algebra
2) Make an explicit list of element types, using Boost.variant
#import "boost/variant.hpp"
typedef boost::variant< ArrayTypeA, ArrayTypeB > mapelement;
typedef std::map<int, mapelement> mappingtype;
Building a visitor for variant type is a bit involved (it involves writing a subclass of boost::static_visitor<desired_return_type> with a single operator() overload for each type in your variant). On the plus side, visitors are statically type checked to ens ure that they implement the exact right set of handlers.
3) Use a wrapper type, like Boost.Any to wrap the different types of interest.
Overall, I think the first option (using a specialized array class) is probably the most reliable. I have also used variants heavily in recent code, and although the compile errors are long, once one gets used to those, it is great to have compile time checking, which I prefer strongly over Python's "run and find out you were wrong later" paradigm.
Many of us share the pain you're experiencing right now but there are solutions to them. One of those solutions is the Boost library (it's like the 2nd standard library of C++.) There's quite a few collection libraries. In your case I'd use Boost::Multi-Dimentional Arrays.
Looks like this:
boost::multi_array<double,3> myArray(boost::extents[2][2][2]);
Which creates a 2x2x2 array. The first type in the template parameters "double" specifies what type the array will hold and the second "3" the dimension count of the array. You then use the "extents" to impart the actual size of each of the dimensions. Quite easy to use and syntatically clear in its intent.
Now, if you're dealing with something in Python like foo = {0:[1,2,3], 1:[3,4,5]} what you're really looking for is a multimap. This is part of the standard library and is essentially a Red-Black tree, indexed by key but with a List for the value.

Why is the use of tuples in C++ not more common?

Why does nobody seem to use tuples in C++, either the Boost Tuple Library or the standard library for TR1? I have read a lot of C++ code, and very rarely do I see the use of tuples, but I often see lots of places where tuples would solve many problems (usually returning multiple values from functions).
Tuples allow you to do all kinds of cool things like this:
tie(a,b) = make_tuple(b,a); //swap a and b
That is certainly better than this:
temp=a;
a=b;
b=temp;
Of course you could always do this:
swap(a,b);
But what if you want to rotate three values? You can do this with tuples:
tie(a,b,c) = make_tuple(b,c,a);
Tuples also make it much easier to return multiple variable from a function, which is probably a much more common case than swapping values. Using references to return values is certainly not very elegant.
Are there any big drawbacks to tuples that I'm not thinking of? If not, why are they rarely used? Are they slower? Or is it just that people are not used to them? Is it a good idea to use tuples?
A cynical answer is that many people program in C++, but do not understand and/or use the higher level functionality. Sometimes it is because they are not allowed, but many simply do not try (or even understand).
As a non-boost example: how many folks use functionality found in <algorithm>?
In other words, many C++ programmers are simply C programmers using C++ compilers, and perhaps std::vector and std::list. That is one reason why the use of boost::tuple is not more common.
Because it's not yet standard. Anything non-standard has a much higher hurdle. Pieces of Boost have become popular because programmers were clamoring for them. (hash_map leaps to mind). But while tuple is handy, it's not such an overwhelming and clear win that people bother with it.
The C++ tuple syntax can be quite a bit more verbose than most people would like.
Consider:
typedef boost::tuple<MyClass1,MyClass2,MyClass3> MyTuple;
So if you want to make extensive use of tuples you either get tuple typedefs everywhere or you get annoyingly long type names everywhere. I like tuples. I use them when necessary. But it's usually limited to a couple of situations, like an N-element index or when using multimaps to tie the range iterator pairs. And it's usually in a very limited scope.
It's all very ugly and hacky looking when compared to something like Haskell or Python. When C++0x gets here and we get the 'auto' keyword tuples will begin to look a lot more attractive.
The usefulness of tuples is inversely proportional to the number of keystrokes required to declare, pack, and unpack them.
For me, it's habit, hands down: Tuples don't solve any new problems for me, just a few I can already handle just fine. Swapping values still feels easier the old fashioned way -- and, more importantly, I don't really think about how to swap "better." It's good enough as-is.
Personally, I don't think tuples are a great solution to returning multiple values -- sounds like a job for structs.
But what if you want to rotate three values?
swap(a,b);
swap(b,c); // I knew those permutation theory lectures would come in handy.
OK, so with 4 etc values, eventually the n-tuple becomes less code than n-1 swaps. And with default swap this does 6 assignments instead of the 4 you'd have if you implemented a three-cycle template yourself, although I'd hope the compiler would solve that for simple types.
You can come up with scenarios where swaps are unwieldy or inappropriate, for example:
tie(a,b,c) = make_tuple(b*c,a*c,a*b);
is a bit awkward to unpack.
Point is, though, there are known ways of dealing with the most common situations that tuples are good for, and hence no great urgency to take up tuples. If nothing else, I'm not confident that:
tie(a,b,c) = make_tuple(b,c,a);
doesn't do 6 copies, making it utterly unsuitable for some types (collections being the most obvious). Feel free to persuade me that tuples are a good idea for "large" types, by saying this ain't so :-)
For returning multiple values, tuples are perfect if the values are of incompatible types, but some folks don't like them if it's possible for the caller to get them in the wrong order. Some folks don't like multiple return values at all, and don't want to encourage their use by making them easier. Some folks just prefer named structures for in and out parameters, and probably couldn't be persuaded with a baseball bat to use tuples. No accounting for taste.
As many people pointed out, tuples are just not that useful as other features.
The swapping and rotating gimmicks are just gimmicks. They are utterly confusing to those who have not seen them before, and since it is pretty much everyone, these gimmicks are just poor software engineering practice.
Returning multiple values using tuples is much less self-documenting then the alternatives -- returning named types or using named references. Without this self-documenting, it is easy to confuse the order of the returned values, if they are mutually convertible, and not be any wiser.
Not everyone can use boost, and TR1 isn't widely available yet.
When using C++ on embedded systems, pulling in Boost libraries gets complex. They couple to each other, so library size grows. You return data structures or use parameter passing instead of tuples. When returning tuples in Python the data structure is in the order and type of the returned values its just not explicit.
You rarely see them because well-designed code usually doesn't need them- there are not to many cases in the wild where using an anonymous struct is superior to using a named one.
Since all a tuple really represents is an anonymous struct, most coders in most situations just go with the real thing.
Say we have a function "f" where a tuple return might make sense. As a general rule, such functions are usually complicated enough that they can fail.
If "f" CAN fail, you need a status return- after all, you don't want callers to have to inspect every parameter to detect failure. "f" probably fits into the pattern:
struct ReturnInts ( int y,z; }
bool f(int x, ReturnInts& vals);
int x = 0;
ReturnInts vals;
if(!f(x, vals)) {
..report error..
..error handling/return...
}
That isn't pretty, but look at how ugly the alternative is. Note that I still need a status value, but the code is no more readable and not shorter. It is probably slower too, since I incur the cost of 1 copy with the tuple.
std::tuple<int, int, bool> f(int x);
int x = 0;
std::tuple<int, int, bool> result = f(x); // or "auto result = f(x)"
if(!result.get<2>()) {
... report error, error handling ...
}
Another, significant downside is hidden in here- with "ReturnInts" I can add alter "f"'s return by modifying "ReturnInts" WITHOUT ALTERING "f"'s INTERFACE. The tuple solution does not offer that critical feature, which makes it the inferior answer for any library code.
Certainly tuples can be useful, but as mentioned there's a bit of overhead and a hurdle or two you have to jump through before you can even really use them.
If your program consistently finds places where you need to return multiple values or swap several values, it might be worth it to go the tuple route, but otherwise sometimes it's just easier to do things the classic way.
Generally speaking, not everyone already has Boost installed, and I certainly wouldn't go through the hassle of downloading it and configuring my include directories to work with it just for its tuple facilities. I think you'll find that people already using Boost are more likely to find tuple uses in their programs than non-Boost users, and migrants from other languages (Python comes to mind) are more likely to simply be upset about the lack of tuples in C++ than to explore methods of adding tuple support.
As a data-store std::tuple has the worst characteristics of both a struct and an array; all access is nth position based but one cannot iterate through a tuple using a for loop.
So if the elements in the tuple are conceptually an array, I will use an array and if the elements are not conceptually an array, a struct (which has named elements) is more maintainable. ( a.lastname is more explanatory than std::get<1>(a)).
This leaves the transformation mentioned by the OP as the only viable usecase for tuples.
I have a feeling that many use Boost.Any and Boost.Variant (with some engineering) instead of Boost.Tuple.