Is there an MPI_AllReduce implemenentation for sparse vectors?

Is there an MPI_AllReduce implemenentation for sparse vectors? - mapreduce

I need to synchronize intermediate solutions of an optimization problem solved distributively over a number of worker processors. The solution vector is known to be sparse.
I have noticed that if I use MPI_AllReduce, the performance is good compared to my own AllReduce implementation.
However, I believe, the performance can be further improved if AllReduce could communicate only the nonzero entries in the solution vector. I could not find any such implementation of AllReduce.
Any ideas?
It seems that MPI_type_indexed can not be used as the indices of the nonzero entries are not known in advance.

There aren't sparse collectives in MPI. It's something that the MPI Forum has discussed in the past (to what end, I don't know), but there has also been research in the area. Usually though, when discussing these sorts of things in the forum, I believe they relate more to collectives that don't involve all processes rather than all of the data.
As Hristo said in the comments, the goal of MPI (according to some) has always been to enable more optimized tricks on top of MPI and to just use it as a low level library to abstract the communication calls. Obviously, this hasn't been how MPI has actually be used most of the time, but you can still write your own sparse collectives. Sounds like a good paper to me.

Similar problem here. Most likely you will need to implement your custom MPI_Allreduce().
There is an optimized implementation here. Very possibly you have already found this link: https://fs.hlrs.de/projects/par/mpi//myreduce.html
If you want ideas for a better performance implementation, some here:
https://dl.acm.org/citation.cfm?id=2642791
https://dl.acm.org/citation.cfm?id=2642773
Note that they don't provide an implementation and you may need to pay an small fee.
Good luck

Related

Solver software for finding ALL solutions to a pure integer MIP

I have a set of problems (sets of equations and inequalities) for which I know that all variables have to be integers, and have finitely many solutions. I know that if I take any random objective function and let an lp or mip solver onto it, it finds a solution, however I want all solutions to the problem, and of course, as efficiently as possible. I don't really care about optimizing anything, but apparently most of the software that deals with it does. Is there any solver that can do that? If so, which one is the best/simplest one, or which one would you recommend? At best one that can be used as a C/C++ library.

There is a nice blog post by Paul Rubin on how to find K best solutions, which can be easily generalized to get all the solutions. As Ali suggested one of the approaches is to use a solution pool. Two other approaches are:
Use an incumbent callback to track and reject solutions.
Use an incumbent callback with solution injection.
See the blog post for details.

IBM ILOG CPLEX has a solution pool feature and it's free for academic purposes.
I guess you can probably get all solution if you set the maximum pool size sufficiently large. I don't know for sure, never tried.

Super high performance C/C++ hash map (table, dictionary) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!

I would recommend you to try Google SparseHash (or the C11 version Google SparseHash-c11) and see if it suits your needs. They have a memory efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, it was the best hashtable implementation available in terms of speed (however with drawbacks).

What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. Never used myself, but was advertised to me on few occasions.
You can also try to benchmark STL containers (std::hash_map, etc). Depending on platform/implementation and source code tuning (preallocate as much as you can dynamic memory management is expensive) they could be performant enough.
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?

Just use boost::unordered_map (or tr1 etc) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest to precisely analyze your requirements to find a faster substitute.

If you have a multithreaded program, you can find some useful hash tables in intel thread building blocks library. For example, tbb::concurrent_unordered_map has the same api as std::unordered_map, but it's main functions are thread safe.
Also have a look at facebook's folly library, it has high performance concurrent hash table and skip list.

khash is very efficient. There is author's detailed benchmark: https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it also shows khash beats many other hash libraries.

from android sources (thus Apache 2 licensed)
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
look at hashmap.c, pick include/cutils/hashmap.h, if you don't need thread safety you can remove mutex code, a sample implementation is in libcutils/str_parms.c

First check if existing solutions like libmemcache fits your need.
If not ...
Hash maps seems to be the definite answer to your requirement. It provides o(1) lookup based on the keys. Most STL libraries provide some sort of hash these days. So use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance wise for your needs.
If it is not, you should explore some good fast hashing algorithms found on the net
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll a hashing module by yourself, that fixes the problem that you saw with the STL containers you have tested, and one of the hashing algorithms above. Be sure to post the results somewhere.
Oh and its interesting that you have multiple maps ... perhaps you can simplify by having your key as a 64 bit num with the high bits used to distinguish which map it belongs to and add all key value pairs to one giant hash. I have seen hashes that have hundred thousand or so symbols working perfectly well on the basic prime number hashing algorithm quite well.
You can check how that solution performs compared to hundreds of maps .. i think that could be better from a memory profiling point of view ... please do post the results somewhere if you do get to do this exercise
I believe that more than the hashing algorithm it could be the constant add/delete of memory (can it be avoided?) and the cpu cache usage profile that might be more crucial for the performance of your application
good luck

Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.

I would suggest uthash. Just include #include "uthash.h" then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.

http://incise.org/hash-table-benchmarks.html gcc has a very very good implementation. However, mind that it must respect a very bad standard decision :
If a rehash happens, all iterators are invalidated, but references and
pointers to individual elements remain valid. If no actual rehash
happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
This means basically the standard says that the implementation MUST BE based on linked lists.
It prevents open addressing which has better performance.
I think google sparse is using open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all competition in memory usage. (also it doesn't have any plateau, pure straight line wrt number of elements)

When should the STL algorithms be used instead of using your own?

I frequently use the STL containers but have never used the STL algorithms that are to be used with the STL containers.
One benefit of using the STL algorithms is that they provide a method for removing loops so that code logic complexity is reduced. There are other benefits that I won't list here.
I have never seen C++ code that uses the STL algorithms. From sample code within web page articles to open source projects, I haven't seen their use.
Are they used more frequently than it seems?

Short answer: Always.
Long answer: Always. That's what they are there for. They're optimized for use with STL containers, and they're faster, clearer, and more idiomatic than anything you can write yourself. The only situation you should consider rolling your own is if you can articulate a very specific, mission-critical need that the STL algorithms don't satisfy.
Edited to add: (Okay, so not really really always, but if you have to ask whether you should use STL, the answer is "yes".)

You've gotten a number of answers already, but I can't really agree with any of them. A few come fairly close to the mark, but fail to mention the crucial point (IMO, of course).
At least to me, the crucial point is quite simple: you should use the standard algorithms when they help clarify the code you're writing.
It's really that simple. In some cases, what you're doing would require an arcane invocation using std::bind1st and std::mem_fun_ref (or something on that order) that's extremely dense and opaque, where a for loop would be almost trivially simple and straightforward. In such a case, go ahead and use the for loop.
If there is no standard algorithm that does what you want, take some care and look again -- you'll often have missed something that really will do what you want (one place that's often missed: the algorithms in <numeric> are often useful for non-numeric uses). Having looked a couple of times, and confirmed that there's really not a standard algorithm to do what you want, instead of writing that for loop (or whatever) inline, consider writing an generic algorithm to do what you need done. If you're using it one place, there's a pretty good chance you can use it two or three more, at which point it can be a big win in clarity.
Writing generic algorithms isn't all that hard -- in fact, it's often almost no extra work compared to writing a loop inline, so even if you can only use it twice you're already saving a little bit of work, even if you ignore the improvement in the code's readability and clarity.

STL algorithms should be used whenever they fit what you need to do. Which is almost all the time.

When should the STL algorithms be used instead of using your own?
When you value your time and sanity and have more fun things to do than reinventing the wheel again and again.
You need to use your own algorithms when project demands it, and there are no acceptable alternatives to writing stuff yourself, or if you identified STL algorithm as a bottleneck (using profiler, of course), or have some kind of restrictions STL doesn't conform to, or adapting STL for the task will take longer than writing algorithm from scratch (I had to use twisted version of binary search few times...). STL is not perfect and isn't fit for everything, but when you can, you should use it. When someone already did all the work for you, there is frequently no reason to do the same thing again.

I write performance critical applications. These are the kinds of things that need to process millions of pieces of information in as fast a time as possible. I wouldn't be able to do some of the things that I do now if it weren't for STL. Use them always.

There are many good algorithms besides stuff like std::foreach.
However there are lots of non-trivial and very useful algorithms:
Sorting: std::sort, std::upper_bound, std::lower_bound, std::binary_search
Min/Max std::max, std::min, std::partition, std::min_element, std::max_element
Search like std::find, std::find_first_of etc.
And many others.
Algorithms like std::transform are much useful with C++0x lambda expressions or stuff like boost::lambda or boost::bind

If I had to write something due this afternoon, and I knew how to do it using hand-made loops and would need to figure out how to do it in STL algorithms, I would write it using hand-made loops.
Having said that, I would work to make the STL algorithms a reliable part of my toolkit, for reasons articulated in the other answers.
--
Reasons you might not see it is in code is that it is either legacy code or written by legacy programmers. We had about 20 years of C++ programming before the STL came out, and at that point we had a community of programmers who knew how to do things the old way and had not yet learned the STL way. This will likely remain for a generation.

Bear in mind that the STL algorithms cover a lot of bases, but most C++ developers will probably end up coding something that does something equivalent to std::find(), std::find_if() and std::max() almost every day of their working lives (if they're not using the STL versions already). By using the STL versions you separate the algorithm from both the logical flow of your code and from the data representation.
For other less commonly used STL algorithms such as std::merge() or std::lower_bound() these are hugely useful routines (the first for merging two sorted containers, the second for working out where to insert an item in a container to keep it ordered). If you were to try to implement them yourself then it would probably take a few attempts (the algorithms aren't complicated, but you'd probably get off-by-one errors or the like).
I myself use them every day of my professional career. Some legacy codebases that predate a stable STL may not use it as extensively, but if there's a newer project that is intentionally avoiding it I would be inclined to think it was by a part-time hacker who was still labouring under the mid-90's assumption that templates are slow and therefore to be avoided.

The only time I don't use STL algorithms is when the cross-platform implementation differences affect the outcome of my program. This has only happened in one or two rare cases (on the PlayStation 3). Although the interface of the STL is standardized across platforms, the implementation is not.
Also, in certain extremely high performance applications (think: video games, video game servers) we replaced a some STL structures with our own to eke out a bit more efficiency.
However, the vast majority of the time using STL is the way to go. And in my other (non-video game) jobs, I used the STL exclusively.

The main problem with STL algorithms until now was that, even though the algorithm call itself clearer, defining the functors that you'd need to pass to them would make your code longer and more complex, due to the way the language forced you to do it. C++0x is expected to change that, with its support for lambda expressions.
I've been using STL heavily for the past six years and although I tried to use STL algorithms anywhere I could, in most instances it would make my code more obscure, so I got back to a simple loop. Now with C++0x is the opposite, the code seems to always look simpler with them.
The problem is that by now C++0x support is still limited to a few compilers, even because the standard is not completely finished yet. So probably we will have to wait a few years to really see widespread use of STL algorithms in production code.

I would not use STL in two cases:
When STL is not designed for your task. STL is nearly the best for general purposes. However, for specific applications STL may not always be the best. For example, in one of my programs, I need a huge hash table while STL/tr1's hashmap equivalence takes too much memory.
When you are learning algorithms. I am one of the few who enjoy reinventing the wheels and learn a lot in this process. For that program, I reimplemented a hash table. It really took me a lot of time, but in the end all the efforts paid off. I have learned many things that greatly benefit my future career as a programmer.

When you think you can code it better than a really clever coder who spent weeks researching and testing and trying to cope with every conceivable set of inputs.
For most Earthlings the answer is never!

I want to answer the "when not to use STL" case, with a clear example.
(as a challenge to all of you, show me that you can solve this with any STL algorithm until C++17)
Convert a vector of integers std::vector<int> into a vector of std::pair<int,int>, i.e.:
Convert
std::vector<int> values = {1,2,3,4,5,6,7,8,9,10};
to
std::vector<std::pair<int,int>> values = { {1,2}, {3,4} , {5,6}, {7,8} ,{9,10} };
And guess what? It's impossible with any STL algorithm until C++17.
See the complete discussion on solving this problem here:
How can I convert std::vector<T> to a vector of pairs std::vector<std::pair<T,T>> using an STL algorithm?
So to answer your question: Use STL algorithm always only when it perfectly fits your problem. Don't hack an STL algorithm to make it fit your problem.

Are they used more frequently than it seems?
I've never seen them used; except in books. Maybe they're used in the implementation of the STL itself. Maybe they'll become more used because easier to use (see for example Lambda functions and expressions), or even become obsoleted (see for example the Range-based for-loop), in the next version of C++ .

A few sorting questions

I have found a way that improves (as far as I have tested) upon the quicksort algorithm beyond what has already been done. I am working on testing it and then I want to get the word out about it. However, I would appreciate some help with some things. So here are my questions. All of my code is in C++ by the way.
One of the sorts I have been comparing to my quicksort is the std::sort from the C++ Standard Library. However, it appears to be extremely slow. I am only sorting arrays of ints and longs, but it appears to be around 8-10 times slower than both my quicksort and a standard quicksort by Bentley and McIlroy (and maybe Sedgewick). Does anyone have any ideas as to why it is so slow? The code I use for the sort is just
std::sort(a,a+numelem);
where a is the array of longs or ints and numelem is the number of elements in the array. The numbers are very random, and I have tried different sizes as well as different amounts of repeated elements. I also tried qsort, but it is even worse as I expected.
Edit: Ignore this first question - it's been resolved.
I would like to find more good quicksort implementations to compare with my quicksort. So far I have a Bentley-McIlroy one and I have also compared with the first published version of Vladimir Yaroslavskiy's dual-pivot quicksort. In addition, I plan on porting timsort (which is a merge sort I believe) and the optimized dual-pivot quicksort from the jdk 7 source. What other good quicksorts implementations do you know about? If they aren't in C or C++ that might be okay because I am pretty good at porting, but I would prefer C or C++ ones if you know of them.
How would you recommend getting out the word about my additions to the quicksort? So far my quicksort seems to be significantly faster than all other quicksorts that I've tested it against. The main source of its speed is that it handles repeated elements much more efficiently than other methods that I've found. It almost completely eradicates worst case behavior without adding much time in checking for repeated elements. I posted about it on the Java forums, but got no response. I also tried writing to Jon Bentley because he was working with Vladimir on his dual-pivot quicksort and got no response (though I wasn't terribly surprised by this). Should I write a paper about it and put it on arxiv.org? Should I post in some forums? Are there some mailing lists to which I should post? I have been working on this for some time now and my method is legit. I do have some experience with publishing research because I am a PhD candidate in computational physics. Should I try approaching someone in the Computer Science department of my university? By the way, I have also developed a different dual-pivot quicksort, but it isn't better than my single-pivot quicksort (though it is better than Vladimir's dual-pivot quicksort with some datasets).
I really appreciate your help. I just want to add what I can to the computing world. I'm not interested in patenting this or any absurd thing like that.

If you have confidence in your work, definitely try to discuss it with someone knowledgeable at your university as soon as possible. It's not enough to show that your code runs faster than another procedure on your machine. You have to mathematically prove whatever performance gain you claim to have achieved through analysis of your algorithm. I'd say the first thing to do is make sure both algorithms you are comparing are implemented and compiled optimally - you may just be fooling yourself here. The likelihood of an individual achieving such a marked improvement upon such an important sorting method without already having thorough knowledge of its accepted variants just seems minuscule. However, don't let me discourage you. It should be interesting anyway. Would you be willing to post the code here?
...Also, since quicksort is especially vulnerable to worst-case scenarios, the tests you choose to run may have a huge effect, as will the choice of pivots. In general, I would say that any data set with a large number of equivalent elements or one that is already highly sorted is never a good choice for quicksort - and there are already well-known ways of combating that situation, and better alternative sorting methods.

If you have truly made a breakthrough and have the math to prove it, you should try to get it published in the Journal of the ACM. It's definitely one of the more prestigious journals for computer science.
The second best would be one of the IEEE journals such as Transactions on Software Engineering.

Matrix classes in c++

I'm doing some linear algebra math, and was looking for some really lightweight and simple to use matrix class that could handle different dimensions: 2x2, 2x1, 3x1 and 1x2 basically.
I presume such class could be implemented with templates and using some specialization in some cases, for performance.
Anybody know of any simple implementation available for use? I don't want "bloated" implementations, as I'll running this in an embedded environment where memory is constrained.
Thanks

You could try Blitz++ -- or Boost's uBLAS

I've recently looked at a variety of C++ matrix libraries, and my vote goes to Armadillo.
The library is heavily templated and header-only.
Armadillo also leverages templates to implement a delayed evaluation framework (resolved at compile time) to minimize temporaries in the generated code (resulting in reduced memory usage and increased performance).
However, these advanced features are only a burden to the compiler and not your implementation running in the embedded environment, because most Armadillo code 'evaporates' during compilation due to its design approach based on templates.
And despite all that, one of its main design goals has been ease of use - the API is deliberately similar in style to Matlab syntax (see the comparison table on the site).
Additionally, although Armadillo can work standalone, you might want to consider using it with LAPACK (and BLAS) implementations available to improve performance. A good option would be for instance OpenBLAS (or ATLAS). Check Armadillo's FAQ, it covers some important topics.
A quick search on Google dug up this presentation showing that Armadillo has already been used in embedded systems.

std::valarray is pretty lightweight.

I use Newmat libraries for matrix computations. It's open source and easy to use, although I'm not sure it fits your definition of lightweight (it includes over 50 source files which Visual Studio compiles it into a 1.8MB static library).

CML matrix is pretty good, but may not be lightweight enough for an embedded environment. Check it out anyway: http://cmldev.net/?p=418

Another option, altough may be too late is:
https://launchpad.net/lwmatrix

I for one wasn't able to find simple enough library so I wrote it myself: http://koti.welho.com/aarpikar/lib/
I think it should be able to handle different matrix dimensions (2x2, 3x3, 3x1, etc) by simply setting some rows or columns to zero. It won't be the most fastest approach since internally all operations will be done with 4x4 matrices. Although in theory there might exist that kind of processors that can handle 4x4-operations in one tick. At least I would much rather believe in existence of such processors that than go optimizing those low level matrix calculations. :)

How about just store the matrix in an array, like
2x3 matrix = {2,3,val1,val2,...,val6}
This is really simple, and addition operations are trivial. However, you need to write your own multiplication function.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js