I know that ETS has limited concurrency guarantees; for instance, two writes to the same row at the same time aren't going to collide. But I can't seem to find this sort of information for DETS. Does anyone know?
Note that I'm not asking about DETS running under the auspices of Mnesia, and I'm not asking about any particular scheme - say assigning a single process per row of the DETS table. I just want to know what the limited concurrency guarantees of DETS are, if any.
Thanks.
As far as I can tell, DETS currently does not support concurrency.
From the dets manual (my highlights):
It is worth noting that the ordered_set type present in Ets is not yet
implemented by Dets, neither is the limited support for concurrent
updates which makes a sequence of first and next calls safe to use on
fixed Ets tables. Both these features will be implemented by Dets in a
future release of Erlang/OTP. Until then, the Mnesia application (or
some user implemented method for locking) has to be used to implement
safe concurrency. Currently, no library of Erlang/OTP has support for
ordered disk based term storage.
I'm finishing up implementing a sort of "Terasort Lite" program in Chapel, based on a distributed bucket sort, and I'm noticing what seem to be significant performance bottlenecks around access to a block-distributed array. A rough benchmark shows Chapel takes ~7 seconds to do the whole sort of 3.5 MB with 5 locales, while the original MPI+C program does it in around 8.2 ms with 5 processes. My local machine has 16 cores, so I don't need to oversubscribe MPI to get 5 processes working.
The data to be sorted are loaded into a block-distributed array across the locales so that each locale has an even (and contiguous) share of the unsorted records. In an MPI+C bucket sort, each process would have its records in memory and sort those local records. To that end, I've written a locale-aware implementation of qsort (based on the C stdlib implementation), and this is where I see extreme performance bottlenecks. The overall bucket sort procedure takes a reference to a block-distributed array, and qsort is called with the local subdomain, qsort(records[records.localSubdomain()]), from within a coforall loop's on loc do block.
My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance. I've checked, and each locale's qsort call is only ever accessing array indices within its local subdomain; I would expect that this means that no communication is required, since each locale accesses only the portion of the domain that it owns. Is this a reasonable expectation, or is this simultaneous access to private portions of a distributed array causing communication overhead?
For context, I am running locally on one physical machine using the UDP GASNet communication substrate, and the documentation notes that this is not expected to give good performance. Nonetheless, I do plan to move the code to a cluster with InfiniBand, so I'd still like to know whether I should approach the problem a different way to get better performance. Please let me know if there's any other information that would help this question be answered. Thank you!
Thanks for your question. I can answer some of the questions here.
First, I'd like to point out some other distributed sort implementations in Chapel:
The distributed sort in Arkouda
distributedPartitioningSortWithScratchSpace: I have not worked on this in a while, but if I recall correctly it is a distributed sample sort backed by a not-in-place radix sort, and I think it is tested, so it should run.
I have other partial work trying to port over ips4o, but that algorithm is pretty complicated; I didn't get all the way through it, and it's not running yet.
Generally I would expect a radix sort to outperform a quick sort for the local problems unless they are very small.
Now to your questions:
My main question is how Chapel maintains coherence on distributed arrays, and whether any type of coherence actions across locales are what's obliterating my performance.
There is a cache for remote data that is on by default. It can be disabled with --no-cache-remote when compiling, but I suspect it is not the problem here. In particular, it mainly performs coherence activities at memory fences (which include on statements, task ends, and use of sync/atomic variables). But you can turn it off and see if that changes things.
Distributed arrays and domains currently use an eager privatization strategy. That means that once they are created, some elements of the data structure are replicated across all locales. Since this involves all locales, it can cause performance problems when running multilocale.
You can check for communication within your kernel with the CommDiagnostics module or with the local block. The CommDiagnostics module will allow you to count or trace communication events while the local block will halt your program if communication is attempted within it.
Another possibility is that the compiler is not generating communication, but the code runs slower because the compiler has trouble optimizing when the data might be remote. The indicator that this is the problem would be that the performance you get when compiling with CHPL_COMM=none is significantly faster than when running with 1 locale with GASNet and UDP. (You could alternatively use the --local and --no-local flags to compare.) Some ways to potentially help that:
instead of records[records.localSubdomain()], you could try records.localSlice(records.localSubdomain()) but that uses an undocumented feature. I do not know why it is undocumented, though.
using a local block within your computation should solve this as well but note that we generally try to solve the problem in other ways since the local block is a big hammer.
Slicing has more overhead than we would like; see e.g. https://github.com/chapel-lang/chapel/issues/13317. As I said, there might also be privatization costs (I don't remember the current situation of slicing and privatization off-hand). In my experience, for local sorting code, you are better off passing start and end arguments as ints, or maybe a range; but slicing to get the local part of the array is certainly more important in the distributed setting.
Lastly, you mentioned that you're trying to analyze the performance when running oversubscribed. If you haven't already seen it, check out this documentation about oversubscription that recommends a setting.
I am getting up to speed on distributed systems (studying for an upcoming interview), and specifically on the basics for how a distributed system works for a distributed, consistent key-value storage system managed in memory.
My specific questions I am stuck on that I would love just a high level answer on if it's no trouble:
#1
Let's say we have 5 servers that are responsible for serving reads, and I have one writer. When I write the value 'foo' to the key 'k1', I understand it has to propagate to all of those servers so they all store the value 'foo' for the key 'k1'. Is this correct, or does the writer only write to a majority (quorum) for this to work?
#2
After #1 above takes place, let's say a read comes in for 'k1' concurrently with a write that replaces 'foo' with 'bar', but not all of the servers have been updated with 'bar' yet. This means some hold 'foo' and some hold 'bar'. If I had lots of concurrent reads, it's conceivable some would return 'foo' and some 'bar', since the new value hasn't propagated everywhere yet.
When we're talking about eventual consistency, this is expected, but if we're talking about strong consistency, how do you avoid #2 above? I keep seeing content about quorum and timestamps but on a high level, is there some sort of intermediary that sorts out what the correct value is? Just wanted to get a basic idea first before I dive in more.
Thank you so much for any help!
In doing more research, I found that "consensus algorithms" such as Paxos or Raft are the correct solution here. The idea is that your nodes need to arrive at a consensus on what the value is. If you read up on Paxos or Raft you'll learn everything you need to; it's quite complex to explain here, but there are videos/resources out there that cover this well.
Another thing I found helpful was learning more about Dynamo and DynamoDB. They handle the subject as well, although they take the eventually consistent approach rather than the strongly consistent one.
Hope this helps someone, and message me if you'd like more details!
Reading about the CAP theorem will help you solve your problem. You are looking for consistency and partition tolerance in this question, so you have to sacrifice availability. The system needs to block and wait until all nodes finish writing; in other words, the change cannot be read before all nodes have applied it.
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer scientist Eric Brewer, states that any distributed data store can only provide two of the following three guarantees:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
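To give a high-level feel for the quorum idea the question mentions (as opposed to waiting for every node), here is a tiny illustrative calculation in C++; the replica counts are made-up examples, not tied to any particular system:

    // With N replicas, a write acknowledged by W nodes and a read that consults
    // R nodes are guaranteed to overlap whenever R + W > N, so at least one node
    // in every read quorum already has the latest committed write.
    #include <cstdio>

    int main() {
        const int N = 5;            // replicas
        const int W = 3;            // write quorum (a majority)
        const int R = 3;            // read quorum (a majority)

        int overlap = R + W - N;    // 3 + 3 - 5 = 1 node in common, at minimum

        std::printf("overlap = %d -> %s\n", overlap,
                    overlap > 0 ? "every read quorum sees the latest write"
                                : "stale reads are possible");
        return 0;
    }

This is the arithmetic behind "write to a majority, read from a majority": the guaranteed intersection is what lets a strongly consistent system avoid waiting for all nodes, though it still needs versions/timestamps (or a consensus protocol) to decide which of the overlapping values is newest.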
Considering my lack of C++ knowledge, please try to read my intent and not my poor technical question.
This is the backbone of my program https://github.com/zaphoyd/websocketpp/blob/experimental/examples/broadcast_server/broadcast_server.cpp
I'm building a websocket server with websocket++ (and oh, is websocket++ sweet; I highly recommend it), and I can easily manipulate per-user data thread-safely because it really doesn't need to be touched by different threads. However, I do want to be able to write to an array (I'm going to use the catch-all term "array" from weaker languages like VB, PHP, and JS) in one function's thread (with multiple iterations that could be running simultaneously) and also read it in one or more threads.
Take Stack Overflow as an example: if I wanted to have all of the IDs (the primary-key column of all articles) sorted in a particular way, in this case by net votes, and held in memory, I'm thinking I would have a function that's called in its own boost::thread, fired whenever a vote on the site comes in, to reorder the array.
How can I do this without locking & blocking? I'm 100% fine with users reading from an old array while another is being built, but I absolutely do not want their reads or the thread writes to ever fail/be blocked.
Does a lock-free array exist? If not, is there some way to build the new array in a temporary array and then write it to the actual array when the building is finished without locking & blocking?
Have you looked at Boost.Lockfree?
Uh, uh, uh. Complicated.
Look here (for an example): RCU -- and this is only about multiple reads along with ONE write.
My guess is that multiple writers at once are not going to work. You should instead look for a more efficient representation than an array, one that allows for faster updates. How about a balanced tree? An O(log n) update should never block anything in a noticeable fashion.
Regarding boost -- I'm happy that it finally has proper support for thread synchronization.
Of course, you could also keep a copy and batch the updates. Then a background process merges the updates and copies the result for the readers.
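To make the "keep a copy and batch the updates" idea concrete, here is a minimal sketch in C++; the names (ArticleId, snapshot, publish) are illustrative, and note that the shared_ptr atomic free functions may fall back to an internal lock in some standard libraries (C++20's std::atomic<std::shared_ptr> is an alternative):

    // Copy-then-publish: readers never block; the single writer rebuilds a
    // fresh vector off to the side and atomically swaps the shared pointer in.
    #include <memory>
    #include <vector>

    using ArticleId = long;
    using IdArray   = std::vector<ArticleId>;

    std::shared_ptr<const IdArray> g_ids = std::make_shared<const IdArray>();

    // Reader: grab a snapshot; it stays valid even if the writer publishes later.
    std::shared_ptr<const IdArray> snapshot() {
        return std::atomic_load(&g_ids);
    }

    // Single writer (e.g. the vote-handling thread): build the reordered array,
    // then publish it in one step. Readers keep their old snapshot alive.
    void publish(IdArray reordered) {
        auto fresh = std::make_shared<const IdArray>(std::move(reordered));
        std::atomic_store(&g_ids, fresh);
    }

Readers working off an older snapshot simply see the previous ordering until their next call to snapshot(), which matches the "I'm fine with users reading from an old array" requirement.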
We are developing a client/server network application, and we find that there are so many locks around a std::map that the server's performance has become poor.
I wonder if it is possible to implement a lock-free map, and if so, how? Is there any open-source code out there?
EDIT:
Actually, we use the std::map to store socket information: we wrap each socket file descriptor together with some other necessary information, such as IP address, port, and socket type (TCP or UDP).
To summarize, we have a global map, say:
std::map<int, socketInfor*> SocketsMap; // keyed by socket file descriptor
Every thread that sends data needs to access SocketsMap, and each of them has to take a mutex before reading from or writing to SocketsMap, so the concurrency level of the whole application is greatly decreased by all the locking around SocketsMap.
To avoid this concurrency problem, we see two solutions: 1. store each socketInfor* separately; 2. use some kind of lock-free map.
I would prefer to find some kind of lock-free map, because the code changes required by this solution are much smaller than those required by solution 1.
Actually there's a way, although I haven't implemented it myself: there's a paper on a lock-free map using hazard pointers from eminent C++ expert Andrei Alexandrescu.
Yes, I have implemented a Lock-Free Unordered Map (docs) in C++ using the "Split-Ordered Lists" concept. It's an auto-expanding container and supports millions of elements on a 64-bit CAS without ABA issues. Performance-wise, it's a beast (see page 5). It's been extensively tested with millions of random ops.
Would a hash map suit? Have a look at Intel Threading Building Blocks; they have an interesting concurrent map. I'm not sure it's lock-free, but hopefully you're interested in good multithreading performance rather than lock-freedom specifically. You can also check the CityHash library.
EDIT:
Actually, TBB's hash map is not lock-free.
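For reference, here is a minimal sketch of what swapping the global std::map for TBB's concurrent_hash_map might look like; it uses fine-grained per-bucket locking rather than being lock-free, and the socketInfor type is assumed from the question:

    // Each accessor holds a lock only on the entry it touches, so threads
    // working on different sockets no longer serialize on one global mutex.
    #include <tbb/concurrent_hash_map.h>

    struct socketInfor;   // IP address, port, TCP/UDP type, ... (from the question)

    using SocketsTable = tbb::concurrent_hash_map<int, socketInfor*>;
    SocketsTable g_sockets;   // keyed by socket file descriptor

    void addSocket(int fd, socketInfor* info) {
        SocketsTable::accessor a;          // write access to just this entry
        g_sockets.insert(a, fd);
        a->second = info;
    }

    socketInfor* findSocket(int fd) {
        SocketsTable::const_accessor a;    // read access to just this entry
        return g_sockets.find(a, fd) ? a->second : nullptr;
    }

Whether this is acceptable depends on whether you truly need lock-freedom or just better scalability than a single global mutex.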
I'm surprised nobody has mentioned it, but Cliff Click has implemented a wait-free hash map in Java, which I believe could be ported to C++.
If you use C++11, you can have a look at AtomicHashMap of facebook/folly
You can implement the map using optimistic design or transactional memory.
This approach is especially effective if the chance that two operations concurrently address the map while one of them is changing its structure is relatively small, and you do not want the overhead of locking every time.
However, from time to time a collision will occur, and you will have to resolve it somehow (usually by rolling back to the last stable state and retrying the operations).
If your hardware supports good enough atomic operations, this can easily be done with compare-and-swap (CAS), where you change only the reference: whenever you change the map, you work on a copy of it rather than the original, and you make the copy the primary only when you commit.
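Here is a minimal sketch of that copy-then-CAS scheme, using the socket table from the question; the names are illustrative, and a real version would also have to manage the lifetime of the socketInfor objects and the cost of copying the whole map on every change:

    // Each writer copies the current table, applies its change to the copy, and
    // tries to swap the shared reference in; if another writer published first,
    // it retries against the newer table. Readers only take snapshots.
    #include <map>
    #include <memory>

    struct socketInfor;   // from the question
    using SocketTable = std::map<int, socketInfor*>;

    std::shared_ptr<const SocketTable> g_table = std::make_shared<const SocketTable>();

    void insertSocket(int fd, socketInfor* info) {
        auto current = std::atomic_load(&g_table);
        for (;;) {
            auto copy = std::make_shared<SocketTable>(*current);   // work on a copy
            (*copy)[fd] = info;
            std::shared_ptr<const SocketTable> desired = copy;
            // Commit only if nobody else published a new table in the meantime.
            if (std::atomic_compare_exchange_weak(&g_table, &current, desired))
                return;
            // CAS failed: 'current' now holds the newer table, so retry against it.
        }
    }

    std::shared_ptr<const SocketTable> snapshotTable() {
        return std::atomic_load(&g_table);   // readers never block writers
    }

This works best when writes (socket setup/teardown) are rare compared to reads (sends); under heavy write contention, repeatedly copying the map becomes the bottleneck.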
What's the optimal level of concurrency that the C++ implementation of BerkeleyDB can reasonably support?
How many threads can I have hammering away at the DB before throughput starts to suffer because of resource contention?
I've read the manual and know how to set the number of locks, lockers, database page size, etc. but I'd just like some advice from someone who has real-world experience with BDB concurrency.
My application is pretty simple: I'll be doing gets and puts of records that are about 1 KB each. No cursors, no deleting.
It depends on what kind of application you are building. Create a representative test scenario, and start hammering away. Then you will know the definitive answer.
Besides your use case, it also depends on CPU, memory, front-side bus, operating system, cache settings, etcetera.
Seriously, just test your own scenario.
If you need some numbers (that actually may mean nothing in your scenario):
Oracle Berkeley DB: Performance Metrics and Benchmarks
Performance Metrics & Benchmarks: Berkeley DB
I strongly agree with Daan's point: create a test program, and make sure the way in which it accesses data mimics as closely as possible the patterns you expect your application to have. This is extremely important with BDB because different access patterns yield very different throughput.
Other than that, these are general factors I found to be of major impact on throughput:
Access method (which in your case I guess is BTREE).
Level of persistency with which you configured BDB (for example, in my case the 'DB_TXN_WRITE_NOSYNC' environment flag improved write performance by an order of magnitude, but it compromises persistency); see the sketch below.
Does the working set fit in cache?
Number of Reads Vs. Writes.
How spread out your access is (remember that BTREE has a page level locking - so accessing different pages with different threads is a big advantage).
Access pattern, meaning how likely threads are to lock one another, or even deadlock, and what your deadlock resolution policy is (this one may be a killer).
Hardware (disk & memory for cache).
This amounts to the following point:
There are two key ways to scale a BDB-based solution so that it offers greater concurrency: either minimize the number of locks in your design, or add more hardware.
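For illustration, here is a hedged sketch of opening a transactional Berkeley DB environment with the DB_TXN_WRITE_NOSYNC flag mentioned in the list above; the path, cache size, and flag set are examples rather than a tuned configuration, and error handling (DbException) is omitted:

    #include <db_cxx.h>

    int main() {
        DbEnv env(0);
        env.set_cachesize(0, 64 * 1024 * 1024, 1);   // 64 MB cache (illustrative)
        env.set_flags(DB_TXN_WRITE_NOSYNC, 1);       // write the log on commit but don't fsync it
        env.open("/path/to/env",
                 DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
                 DB_INIT_LOG | DB_INIT_TXN | DB_THREAD,
                 0);

        Db db(&env, 0);
        db.open(nullptr, "data.db", nullptr, DB_BTREE,
                DB_CREATE | DB_AUTO_COMMIT | DB_THREAD, 0);

        // ... gets and puts of ~1 KB records from multiple threads ...

        db.close(0);
        env.close(0);
        return 0;
    }

The trade-off is exactly the one described above: commits no longer force a log flush, so an operating-system or machine crash can lose the most recent transactions, but throughput for small writes improves dramatically.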
Doesn't this depend on the hardware as well as number of threads and stuff?
I would make a simple test, run it with an increasing number of threads hammering away, and see what seems best.
What I did when working against a database of unknown performance was to measure the turnaround time of my queries. I kept upping the thread count until turnaround time started to suffer, and dropping the thread count until it improved again (well, it was processes in my environment, but whatever).
There were moving averages and all sorts of metrics involved, but the take-away lesson was: just adapt to how things are working at the moment. You never know when the DBAs will improve performance or hardware will be upgraded, or perhaps another process will come along to load down the system while you're running. So adapt.
Oh, and another thing: avoid process switches if you can - batch things up.
Oh, I should make this clear: this all happened at run time, not during development.
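As a rough sketch of that run-time adaptation (not the original poster's code), a small feedback controller might look like the following; the window size, latency budget, step size, and upper bound are placeholders:

    // Track a moving average of query turnaround time and nudge the target
    // worker count up while latency stays within budget, down when it degrades.
    // Intended to be driven from a single coordinating thread.
    #include <chrono>
    #include <deque>
    #include <numeric>

    class AdaptiveWorkers {
    public:
        explicit AdaptiveWorkers(double latencyBudgetMs) : budgetMs_(latencyBudgetMs) {}

        // Call after every completed query with its measured turnaround time.
        void record(std::chrono::milliseconds turnaround) {
            window_.push_back(static_cast<double>(turnaround.count()));
            if (window_.size() > 50) window_.pop_front();             // sliding window

            double avg = std::accumulate(window_.begin(), window_.end(), 0.0)
                         / window_.size();
            if (avg < budgetMs_ && workers_ < 64)      ++workers_;    // room to grow
            else if (avg > budgetMs_ && workers_ > 1)  --workers_;    // back off
        }

        int workers() const { return workers_; }   // current target thread count

    private:
        double budgetMs_;
        std::deque<double> window_;
        int workers_ = 4;
    };

The point is the same as above: the controller doesn't need to know why latency changed (DBA tuning, new hardware, a competing process); it just keeps adjusting toward whatever the system can currently sustain.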
The way I understand things, Samba created tdb to allow "multiple concurrent writers" to any particular database file. So if your workload has multiple writers, your performance may be bad (as in, the Samba project chose to write its own system, apparently because it wasn't happy with Berkeley DB's performance in this case).
On the other hand, if your workload has lots of readers, then the question is how well your operating system handles multiple readers.