Are there any working examples of distributed domain maps for associative and/or opaque domains in Chapel, or any hints on how one would distribute a non-rectangular structure such as a graph over multiple locales? I know about distributed sparse arrays, but I am looking at less structured data. The documentation mentions a prototype domain map for associative domains -- is it available anywhere to experiment with? Thank you.
Yes, these distributed associative domains are new in 1.19 (which as of this writing will be released soon, but you can try them out using a master branch before then). The documentation for them here has an example:
https://chapel-lang.org/docs/master/modules/dists/HashedDist.html
I just finished a course in data structs and algorithms (cpp) in my school and I am interested in databasing in the real world... so specifically SQL.
So my question is what is the difference between SQL and for example c++ stl std::multimap? Is SQL faster? or can I make an equally as fast (time complexity wise) homemade SQL using a c++ STL?
thanks!
(sorry I'm new to programming outside the boundaries of my classes)
The obvious difference is that SQL is a query language for interacting with a database, while the STL is a library (conventionally, "STL" also refers to a certain subset of the standard C++ library). As such, these are apples and oranges.
SQL actually entails a suite of standards specifying various parts of a database system. For a database system to be useful, it is desirable that certain characteristics are met (ACID). Even just looking at these, there is no requirement that STL containers meet them. I think only consistency would even be desirable for an STL container:
STL container mutations are not required to be atomic: when an exception is thrown from within one of the mutating functions the container may become unusable, i.e., STL containers are only required to meet the basic exception guarantee.
As mentioned, [successful] mutations yield a consistent state.
STL containers can't be concurrently mutated and read, i.e., there is no concept of isolation. If you want to access an STL container in a concurrent environment, you need to make sure that there is no other accessor while the container is being mutated (you can have as many concurrent readers as you like while there is no mutator, though).
There is no concept of durability for STL containers, while durability may be considered the core feature of databases (well, all ACID features can be considered core database features).
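To make the atomicity point concrete, here is a minimal sketch of how you can get all-or-nothing behaviour on top of a container that only gives the basic guarantee: mutate a copy, then swap it in. The names `mutate_atomically` and `transfer` are made up for this illustration, and it assumes the default allocator and comparator, for which swapping standard containers does not throw. It is nowhere near what a database does (no isolation, no durability, and it copies the whole container).

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: run `mutate` on a copy and publish the result only
// if it finished without throwing. If it throws, `target` is untouched.
template <class Container, class Mutation>
void mutate_atomically(Container& target, Mutation mutate) {
    Container scratch(target);  // work on a private copy
    mutate(scratch);            // may throw; target is still unchanged
    target.swap(scratch);       // publish the finished result
}

// Example two-step update that must not be observed half-done.
void transfer(std::map<std::string, int>& balances,
              const std::string& from, const std::string& to, int amount) {
    mutate_atomically(balances, [&](std::map<std::string, int>& b) {
        b[to] += amount;  // first half of the update
        if (b[from] < amount) throw std::runtime_error("insufficient funds");
        b[from] -= amount;  // second half of the update
    });
}
```

If `transfer` throws halfway through, the caller still sees the old balances, which is the "A" in ACID in miniature.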
Databases certainly use some data structures and algorithms internally to provide the ACID features. This is an area where the STL may come in, although primarily with its key strength, i.e., algorithms which aren't really "algorithms" but rather "solvers for specific problems": the STL is the foundation of an efficient algorithm library which is applicable to arbitrary data structures (well, that's the objective; I don't think it is achieved yet). Sadly, important areas of data structures are not appropriately covered. In particular, algorithms on trees, especially B-trees, tend to be important for databases but are not covered by the STL at all.
The STL container std::multimap<...> does contain a tree (typically a red/black tree, but that's not mandated), but it is tied to this particular in-memory representation. There is no way to apply the algorithms used to implement this particular data structure to some suitable persistent representation. Also, a std::multimap<...> still uses just one key (the "multi" refers to allowing multiple elements with the same key, not to having multiple keys), while databases typically require multiple look-up mechanisms (indices, which are utilized when executing queries based on a query plan for each query).
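A small sketch of what the "multi" means in practice: several values can sit under one key, and equal_range returns all of them. The helper name `values_for` and the int/string types are only chosen for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// Collect every value stored under `key`; a multimap may hold several.
std::vector<std::string> values_for(
        const std::multimap<int, std::string>& index, int key) {
    std::vector<std::string> out;
    auto range = index.equal_range(key);  // [first, second) all have this key
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

Looking things up by anything other than that one key (say, by the string) would mean a linear scan or a second container that you keep in sync by hand, which is the job indices do in a database.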
You got multiple questions and the interesting one (in my opinion) is this: "... or can I make an equally as fast (time complexity wise) homemade SQL using a c++ STL?"
In a perfect world where STL covers all algorithms, yes, you could create a query evaluator for a database based on the STL algorithms. You could even use some of the STL containers as auxiliary data structures although the primary data structures in a database are properly represented in persistent storage. To create an actual database you'd also need something which translates the query into a good query plan which can then be executed.
Of course, if all you really need are some look-ups by a key in a data structure which is read at some point by the program, you don't need a full-blown database, and look-ups are probably faster using suitable STL containers.
Note that time complexity tends to be useful for guiding a quick evaluation of different approaches. However, in practice the constant factors tend to matter, and often an algorithm with inferior time complexity behaves better. The canonical example is quicksort, which outperforms the "superior" algorithms (e.g. heapsort or mergesort) for typical inputs (although in practice introsort is actually used: a hybrid of quicksort, heapsort, and insertion sort that combines the strengths of these algorithms to behave well on all inputs). BTW, to get an illustration of the algorithms you may want to watch the Hungarian Sort Dancers.
I have a large data set with BINARY user/items feature matrix:
I need to cluster both users and items. Is there any way to do them simultaneously in Mahout?
More importantly, if I use loglikelihood as a similarity measure, what clustering
algorithms will actually support such distance metric to cluster the data?
No, clustering by users and items are separate processes. Though in spirit it's exactly the same process, just applied two different ways.
If you want more specific answers within Mahout you'd have to say more about what parts of the code you are using because there are several different parts that involve clustering.
There are some agglomerative clustering pieces in the project, which work for any similarity metric. The other implementations that I know of are definitely of the "k-means" variety, assuming a continuous vector space and not vectors over {0,1}. You would need a k-medoids kind of algorithm, I think, and as far as I know that isn't in the project.
I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend that you try Google SparseHash (or the C++11 version, Google SparseHash-c11) and see if it suits your needs. They have a memory-efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, and it was the best hash table implementation available in terms of speed (with some drawbacks, however).
Check out the LGPL'd Judy arrays. I have never used them myself, but they have been recommended to me on a few occasions.
You can also try to benchmark the STL containers (std::unordered_map, or the older non-standard hash_map, etc). Depending on platform/implementation and source code tuning (preallocate as much as you can; dynamic memory management is expensive), they could be performant enough.
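As a rough sketch of the "preallocate" advice for std::unordered_map: calling reserve() up front sizes the bucket array so that no rehash happens while you stay within the expected number of entries. The `Quote` struct and `make_table` are placeholders invented for this example.

```cpp
#include <cstddef>
#include <unordered_map>

struct Quote {  // hypothetical value type
    double price;
    int size;
};

// Build a table whose bucket array is already big enough for
// `expected_entries` elements, so inserting up to that many does not rehash.
std::unordered_map<int, Quote> make_table(std::size_t expected_entries) {
    std::unordered_map<int, Quote> table;
    table.reserve(expected_entries);
    return table;
}
```

Note that reserve() only takes care of the bucket array. std::unordered_map is node-based, so each insert still allocates a node unless you also plug in a pooling allocator, and with heavy add/delete churn that is usually where the time goes.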
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1 etc) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest to precisely analyze your requirements to find a faster substitute.
If you have a multithreaded program, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map, but its main functions are thread safe.
Also have a look at Facebook's Folly library; it has a high-performance concurrent hash table and skip list.
khash is very efficient. The author's detailed benchmark is at https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it shows khash beating many other hash libraries.
From the Android sources (thus Apache 2 licensed):
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
Look at hashmap.c and pick up include/cutils/hashmap.h. If you don't need thread safety, you can remove the mutex code. A sample is in libcutils/str_parms.c.
First check if existing solutions like libmemcache fit your needs.
If not ...
Hash maps seem to be the definite answer to your requirement. They provide O(1) lookup based on the keys. Most STL implementations provide some sort of hash map these days, so use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough performance wise for your needs.
If it is not, you should explore some good, fast hashing algorithms found on the net:
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll your own hashing module that fixes the problems you saw with the STL containers you tested, built on one of the hashing algorithms above. Be sure to post the results somewhere.
Oh, and it's interesting that you have multiple maps ... perhaps you can simplify by making your key a 64-bit number, with the high bits used to distinguish which map it belongs to, and adding all key-value pairs to one giant hash. I have seen hashes with a hundred thousand or so symbols working perfectly well on the basic prime number hashing algorithm.
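A minimal sketch of that single-giant-hash idea, under an assumed layout of 16 bits for the map id and 48 bits for the original key. The split and the function names are just for illustration; pick whatever widths fit your ids and keys.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Assumed layout: top 16 bits = map id, low 48 bits = original key.
constexpr unsigned kKeyBits = 48;
constexpr std::uint64_t kKeyMask = (std::uint64_t{1} << kKeyBits) - 1;

// Pack (map id, key) into one 64-bit key for the shared table.
inline std::uint64_t make_key(std::uint16_t map_id, std::uint64_t key) {
    assert(key <= kKeyMask);  // the key must fit in the low 48 bits
    return (std::uint64_t{map_id} << kKeyBits) | key;
}

// Recover the two halves from a packed key.
inline std::uint16_t map_id_of(std::uint64_t packed) {
    return static_cast<std::uint16_t>(packed >> kKeyBits);
}
inline std::uint64_t key_of(std::uint64_t packed) {
    return packed & kKeyMask;
}
```

The same key 42 in map 1 and in map 2 then becomes two distinct entries in one std::unordered_map<std::uint64_t, Value>. The price is that you can no longer cheaply clear or iterate a single logical map.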
You can check how that solution performs compared to hundreds of maps ... I think it could be better from a memory profiling point of view ... please do post the results somewhere if you get to do this exercise.
I believe that, more than the hashing algorithm, it could be the constant allocation and freeing of memory (can it be avoided?) and the CPU cache usage profile that are crucial for the performance of your application.
good luck
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just add #include "uthash.h", then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
http://incise.org/hash-table-benchmarks.html gcc has a very, very good implementation. However, mind that it must respect a very bad decision in the standard:
If a rehash happens, all iterators are invalidated, but references and
pointers to individual elements remain valid. If no actual rehash
happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In effect, the standard forces the implementation to be node-based, i.e., built on linked lists (separate chaining): elements may not move in memory when the table is rehashed.
It prevents open addressing which has better performance.
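You can see the guarantee in action with a small check like the one below (the function name is made up for this example). It takes the address of an element, forces the table to rehash and grow, and confirms the element has not moved. An open-addressing table that stores elements inline in its slots generally can't promise this, since growing the table relocates them.

```cpp
#include <string>
#include <unordered_map>

// Returns true if the address of an element survives rehashing,
// which the standard guarantees for std::unordered_map.
bool element_address_survives_rehash() {
    std::unordered_map<int, std::string> m;
    m[1] = "one";
    const std::string* before = &m.at(1);

    m.rehash(m.bucket_count() * 8 + 1);          // force a bigger bucket array
    for (int i = 2; i < 10000; ++i) m[i] = "x";  // trigger further rehashes

    return before == &m.at(1) && *before == "one";
}
```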
I think Google sparsehash uses open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all the competition in memory usage (it also doesn't have any plateau; it is a pure straight line with respect to the number of elements).
I'm looking for suggestions regarding in-memory key-value store engines or libraries, that have C++ interfaces or that are written in C++.
I'm looking for solutions that can scale without any problems to about 100 million key-value pairs and that are compatible with/compilable on Linux and win32/64.
How about std::map?
http://cplusplus.com/reference/stl/map/
If you really need to store that many pairs in memory, consider Sparse Hash. It has a special implementation which is optimized for low memory consumption.
std::map is fine provided that the keys and values are small and the available memory is large (for about 100 million pairs).
If that's not the case, and you want to run a program over the key-value pairs, consider using a standard MapReduce API. MapReduce is specifically meant to be used on distributed systems to process large data, especially key-value pairs. There are also nice C++ APIs for MapReduce.
http://en.wikipedia.org/wiki/MapReduce
Try Tokyo Cabinet, it supports hashtables and B+trees:
http://1978th.net/tokyocabinet/
Try FastDB, though you may get more than you asked for. Tokyo Cabinet also seems to support in-memory databases. (Or databases backed by a file mapped with mmap. With modern operating systems, there's not much difference between an "in-RAM" database and something mmap'd, as the OS caching makes the latter very efficient too.)
A hash map (also called unordered map) is the best bet for that many pairs. You can find an implementation in Boost and TR1.
Edit:
Some people have questioned the size. If he's got, say, a 64-bit server, there's plenty of space for 100 million key-value pairs.
Oracle Berkeley DB is what you need.
I need to store a large number of integers. There can be
duplicates in the input stream of integers; I just need
to store the distinct ones among them.
I was using an STL set initially, but it ran out of memory when
the number of input integers got too high.
I am looking for some C++ container library which would
allow me to store numbers with the said requirement, possibly
backed by a file, i.e. the container should not try to keep all numbers in memory.
I don't need to store this data persistently, I just need to find
unique values amongst it.
Take a look at the STXXL; might be what you're looking for.
Edit: I haven't used it myself, but from the docs - you could use stream::runs_creator to create sorted runs of your data (however much fits in memory), then stream::runs_merger to merge the sorted streams, and finally use stream::unique to filter uniques.
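To show the shape of that pipeline without STXXL, here is a sketch using only the standard library: cut the input into sorted runs, then do a k-way merge that emits each distinct value once. This is not STXXL's API, and everything here stays in memory; `make_runs` and `merge_unique` are names invented for the example. In a real external-memory version each run would be written to a file and streamed back during the merge.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Split the input into sorted, de-duplicated runs of at most run_size values.
std::vector<std::vector<int>> make_runs(const std::vector<int>& input,
                                        std::size_t run_size) {
    run_size = std::max<std::size_t>(run_size, 1);  // guard against 0
    std::vector<std::vector<int>> runs;
    for (std::size_t i = 0; i < input.size(); i += run_size) {
        const std::size_t end = std::min(input.size(), i + run_size);
        std::vector<int> run(input.begin() + static_cast<std::ptrdiff_t>(i),
                             input.begin() + static_cast<std::ptrdiff_t>(end));
        std::sort(run.begin(), run.end());
        run.erase(std::unique(run.begin(), run.end()), run.end());
        runs.push_back(std::move(run));
    }
    return runs;
}

// K-way merge of sorted runs, emitting each distinct value exactly once.
std::vector<int> merge_unique(const std::vector<std::vector<int>>& runs) {
    using Item = std::pair<int, std::size_t>;  // (value, index of its run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);  // read position per run
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.push({runs[r][0], r});
    }
    std::vector<int> out;
    while (!heap.empty()) {
        const Item top = heap.top();  // smallest value not yet consumed
        heap.pop();
        if (out.empty() || out.back() != top.first) out.push_back(top.first);
        const std::size_t r = top.second;
        if (++pos[r] < runs[r].size()) heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```

Only one value per run has to be in memory during the merge, which is why the approach scales past RAM once the runs live on disk.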
Since you need more than RAM allows, you might look at memcached.
Have you considered using a DB (maybe SQLite)? Or would it be too slow?
You should seriously at least try a database before concluding it is too slow. All you need is one of the lightweight key-value store ones. In the past I have used Berkeley DB, but here is a list of other ones.