Transactions in Graph Engine - graphengine

Is it possible to implement transactions in Graph Engine?
I like to do multiple updates on different cells and then commit or rollback these changes.
Even with one cell it is difficult. When I use the next code the modification is not written to disk but the memory is changed!
using (Character_Accessor characterAccessor = Global.LocalStorage.UseCharacter(cellId, CellAccessOptions.StrongLogAhead))
{
characterAccessor.Name = "Modified";
throw new Exception("Test exception");
}

My understanding is: Regardless of you throwing this Exception or not: The changes are always only in memory - until you explicitly call Global.LocalStorage.SaveStorage().
You could implement your transaction by saving the storage before you start the transaction, then make the changes, and in case you want to rollback, just call Global.LocalStorage.ResetStorage().
All this, of course, only if you do not need high-performance throughput and access the database on a single thread.

The write-ahead log is only flushed to the disk at the end of the "using" scope -- when the accessor is being disposed and the lock in the memory storage is about to be released.
This is like a mini-transaction on a single cell. Others cannot access the cell when you hold the lock. You could do multiple changes to the cell and "commit" them at the end -- or, making a shadow copy at the beginning of the using scope, and then rollback later to this copy when anything goes wrong (this is still a manual process though).
Also, please check this out: https://github.com/Microsoft/GraphEngine/tree/multi_cell_lock
We are working on enabling one thread to hold multiple locks. This will make multi-entity transactions much easier to implement.

Related

How to minimize the mutex locking for an object when only 1 thread mostly uses that object and the other thread(s) use it rarely?

Scenario
Suppose there are "Thread_Main" and "Thread_DB", with a shared SQLite database object. It's guaranteed that,
"Thread_main" seldom uses SQLite object for reading (i.e. SELECT())
"Thread_DB" uses the SQLite object most of the time for various INSERT, UPDATE, DELETE operations
To avoid data races and UB, SQLite should be compiled with SQLITE_THREADSAFE=1 (default) option. That means, before every operation, an internal mutex will be locked, so that DB is not writing when reading and vice versa.
"Thread_Main" "Thread_DB" no. of operation on DB
============= =========== ======================
something INSERT 1
something UPDATE 2
something DELETE 3
something INSERT 4
... ... ... (collapsed)
something INSERT 500
something DELETE 501
... ... ... (collapsed)
something UPDATE 1000
something UPDATE 1001
... ... ... (collapsed)
SELECT INSERT 1200 <--- here is a serious requirement of mutex
... ... ... (collapsed)
Problem
As seen in above, out of 100s of operations, the need of real mutex is required only once in a while. However to safeguard that small situation, we have to lock it for all the operations.
Question: Is there a way in which "Thread_DB" holds the mutex most of the time, so that every time locking is not required? The lock/unlocks can happen only when "Thread_Main" requests for it.
Notes
One way is to queue up the SELECT in the "Thread_DB". But in larger scenario with several DBs running, this will slow down the response and it won't be real time. Can't keep the main thread waiting for it.
I also considered to have a "Thread_Main" integer/boolean variable which will suggest that "Thread_Main" wants to SELECT. Now if any operation is running in "Thread_DB" at that time, it can unlock the mutex. This is fine. But if no writeable operation is running on that SQLite object, then "Thread_main" will keep waiting, as there is no one in "Thread_DB" to unlock. Which will again delay or even hang the "Thread_Main".
Here's a suggestion: modify your program somewhat so that Thread_Main has no access to the shared object; only Thread_DB is able to access it. Once you've done that, you won't need to do any serialization at all, and Thread_DB can work at full efficiency.
Of course the fly in the ointment is that Thread_Main does sometimes need to interact with the DB object; how can it do that if it doesn't have any access to it?
The solution to that issue is message-passing. When Thread_Main needs to do something with the DB, it should pass a Message object of some sort to Thread_DB. The Message object should contain all the details necessary to characterize the desired interaction. When Thread_DB receives the Message object, Thread_DB can call its execute(SQLite & db) method (or whatever you want to call it), at which point the necessary data insertion/extraction can occur from within the context of the Thread_DB thread. When the interaction has completed, any results can be stored inside the Message object and the Message object can then be passed back to the main thread for the main thread to deal with the results. (the main thread can either block waiting for the Message to be sent back, or continue to operate asynchronously to the DB thread, it's up to you)

Backing up a running rocksdb-instance

I would like to backup a running rocksdb-instance to a location on the same disk in a way that is safe, and without interrupting processing during the backup.
I have read:
Rocksdb Backup Instructions
Checkpoints Documentation
Documentation in rocksdb/utilities/{checkpoint.h,backupable_db.{h,cc}}
My question is whether the call to CreateNewBackupWithMetadata is marked as NOT threadsafe to express, that two concurrent calls to this function will have unsafe behavior, or to indicate that ANY concurrent call on the database will be unsafe. I have checked the implementation, which appears to be creating a checkpoint - which the second article claims are used for online backups of MyRocks -, but I am still unsure, what part of the call is not threadsafe.
I currently interpret this as, it is unsafe, because CreateBackup... calls DisableFileDeletions and later EnableFileDeletions, which, of course, if two overlapping calls are made, may cause trouble. Since the SST-files are immutable, I am not worried about them, but am unsure whether modifying the WAL through insertions can corrupt the backup. I would assume that triggering a flush on backup should prevent this, but I would like to be sure.
Any pointers or help are appreciated.
I ended up looking into the implementation way deeper, and here is what I found:
Recall a rocksdb database consists of Memtables, SSTs and a single WAL, which protects data in the Memtables against crashes.
When you call rocksdb::BackupEngine::CreateBackupWithMetadata, there is no lock taken internally, so this call can race, if two calls are active at the same time. Most notably this call does Disable/EnableFileDeletions, which, if called by one call, while another is still active spells doom for the other call.
The process of copying the files from the database to the backup is protected from modifications while the call is active by creating a rocksdb::Checkpoint, which, if flush_before_backup was set to true, will first flush the Memtables, thus clearing the active WAL.
Internally the call to CreateCustomCheckpoint calls DB::GetLiveFiles in db_filecheckpoint.cc. GetLiveFiles takes the global database lock (_mutex), optionally flushes the Memtables, and retrieves the list of SSTs. If a flush in GetLiveFiles happens while holding the global database-lock, the WAL must be empty at this time, which means the list should always contain the SST-files representing a complete and consistent database state from the time of the checkpoint. Since the SSTs are immutable, and since file deletion through compaction is turned off by the backup-call, you should always get a complete backup without holding writes on the database. However this, of course, means it is not possible to determine the exact last write/sequence number in the backup when concurrent updates happen - at least not without inspecting the backup after it has been created.
For the non-flushing version, there maybe WAL-files, which are retrieved in a different call than GetLiveFiles, with no lock held in between, i.e. these are not necessarily consistent, but I did not investigate further, since the non-flushing case was not applicable to my use.

When can safely access mutex protected variable without locking?

A common pattern of storing config in my code is a "map[string]interface{}" protected by RWMutex, but usually after app initiated (could be triggered in multiple go-routine), the map becomes totally readonly. So I have a feeling that from some point of time on, the RWMutex on read should be unnecessary.
An example of this config map is at
http://play.golang.org/p/tkbj9DBok_
One fact that brought me to think of this is in some of production code it actually doing this way of unprotected access of shared object (though it's mostly readonly after it's init'ed), I understand normal way of using RWMutex to protect, but interesting part is this malformed code haven't run into problem in past months.
Is that true that after some accurate "time point" that writes are flushed from cache into memory and with a guarantee of no more writes needed, reads can actually go without RWMutex.RLock? If YES, when is the time point or how to setup the conditions before lockless access?
As long as no one is modifying the map, it should be safe for multiple threads to read it at once. Unfortunately, without any locking you'll have no way to make sure no one else is reading the map when you want to update it.
So one solution is to never update the map, but instead replace it atomically. The read-copy-update algorithm could be used here. Rather than directly accessing the map, so you need to dereference the pointer to access the map. To update it, you can do the following:
acquire an "update lock" mutex.
make a copy of the map. You want to copy all keys/values manually: simple assignment won't work because maps are reference types.
make your changes to the copy of the map.
use StorePointer from the sync/atomic package to atomically update the pointer to the live map to point to your new map.
release the mutex.
Everything that runs before the atomic update in (4) will see the old map, and everything after will see the new map. At no point will those goroutines be reading from a map that is being written to, so there is no need for an RWMutex.
RLock and RUnlock are very fast operations. If there are no writers they are essentially lockless and take just 1 atomic operation each (http://golang.org/src/sync/rwmutex.go?h=RLock#L29). So unless your application performs poorly because reading configuration is slow I would suggest to use RWLock.
Note, that my initial answer was a proposal to implement Freeze() operation for Register but later I realized that the correct implementation won't be any faster than using RWLock.

LInux/c++, how to protect two data structures at same time?

I'm developing some programs with c/c++ in Linux. my question is:
I have an upper-level class called Vault, inside of which there are an OrderMap which uses unordered_map as data structure and a OrderBook, which has 2 std::list in side.
Both the OrderMap and OrderBook stores Order* as the element, they share an Order object which is allocated on heap. So either the OrderBook or the OrderMap modifies the order within.
I have two threads that will do read and write operations on them. the 2 threads can insert/modify/retrieve(read)/delete the elements.
My Question is: how can I protect this big structure of "Vault"? I can actually protect the map or list, but I don't know how to protect them both at the same time.
anybody gives me some ideas?
Add a mutex and lock it before accessing any of the two.
Make them private, so you know accesses are made via your member functions (which have proper locks)
Consider using std::shared_ptr instead of Order *
I don't think I've ever seen a multi-thread usage of an order book. I really think you'd be better off using it in one thread only.
But, to get back to your question, I'll assume you're staying with 2 threads for whatever reasons.
This data structure is too complex to make it lock free. So these are the multi-thread options I can see:
1. If you use a single Vault instance for both threads you will have to lock around it. I assume you don't care to burn some cpu time, so I strongly suggest you use a spin lock, not a mutex.
2. If you can allow having 2 Vault instances this can improve things as every thread can keep it's own private instance and exchange modifications with the other thread using some other means: a lock free queue, or something else.
3. If your book is fast enough to copy, you can have a single central Vault pointer, make a copy on every update, or a group of updates, CAS onto that central pointer, and reuse the old one to not have to allocate every time. You'll end up with one spare instance per thread. Like this:
Vault* old_ptr = centralVault;
Vault* new_ptr = cachedVault ? cachedVault : new Vault;
do {
*new_ptr = *old_ptr;
makeChanges(new_ptr);
}
while( !cas(&centralVault, old_ptr, new_ptr) );
cachedVault = old_ptr;

How to modify a data structure while a process is already accessing it ?

I have written a program (suppose X) in c++ which creates a data structure and then uses it continuously.
Now I would like to modify that data structure without aborting the previous program.
I tried 2 ways to accomplish this task :
In the same program X, first I created data structure and then tried to create a child process which starts accessing and using that data structure for some purpose. The parent process continues with its execution and asks the user for any modification like insertion, deletion, etc and takes input from console and subsequently modification is done. The problem here is, it doesn't modify the copy of data structure that the child process was using. Later on, I figured out this won't help because the child process is using its own copy of data structure and hence modifications done via parent process won't be reflected in it. But definitely, I didn't want this to happen. So I went for multithreading.
Instead of creating child process, I created an another thread which access that data structure and uses it and tried to take user input from console in different thread. Even,
this didn't work because of very fast switching between threads.
So, please help me to solve this issue. I want the modification to be reflected in the original data structure. Also I don't want the process (which is accessing and using it continuously) to wait for sometimes since it's time crucial.
First point: this is not a trivial problem. To handle it at all well, you need to design a system, not just a quick hack or two.
First of all, to support the dynamic changing, you'll almost certainly want to define the data structure in code in something like a DLL or .so, so you can load it dynamically.
Part of how to proceed will depend on whether you're talking about data that's stored strictly in memory, or whether it's more file oriented. In the latter case, some of the decisions will depend a bit on whether the new form of a data structure is larger than an old one (i.e., whether you can upgrade in place or no).
Let's start out simple, and assume you're only dealing with structures in memory. Each data item will be represented as an object. In addition to whatever's needed to access the data, each object will provide locking, and a way to build itself from an object of the previous version of the object (lazily -- i.e., on demand, not just in the ctor).
When you load the DLL/.so defining a new object type, you'll create a collection of those the same size as your current collection of existing objects. Each new object will be in the "lazy" state, where it's initialized, but hasn't really been created from the old object yet.
You'll then kick off a thread that walks makes the new collection known to the rest of the program, then walks through the collection of new objects, locking an old object, using it to create a new object, then destroying the old object and removing it from the old collection. It'll use a fairly short timeout when it tries to lock the old object (i.e., if an object is in use, it won't wait for it very long, just go on to the next. It'll iterate repeatedly until all the old objects have been updated and the collection of old objects is empty.
For data on disk, things can be just about the same, except your collections of objects provide access to the data on disk. You create two separate files, and copy data from one to the other, converting as needed.
Another possibility (especially if the data can be upgraded in place) is to use a single file, but embed a version number into each record. Read some raw data, check the version number, and use appropriate code to read/write it. If you're reading an old version number, read with the old code, convert to the new format, and write in the new format. If you don't have space to update in place, write the new record to the end of the file, and update the index to indicate the new position.
Your approach to concurrent access is similar to sharing a cake between a classroom full of blindfolded toddlers. It's no surprise that you end up with a sticky mess. Each toddler will either have to wait their turn to dig in or know exactly which part of the cake she alone can touch.
Translating to code, the former means having a lock or mutex that controls access to a data structure so that only one thread can modify it at any time.
The latter can be done by having a data structure that is modified in place by threads that each know exactly which parts of the data structure they can update, e.g. by passing a struct with details on which range to update, effectively splitting up the data beforehand. These should not overlap and iterators should not be invalidated (e.g. by resizing), which may not be possible for a given problem.
There are many many algorithms for handling resource competition, so this is grossly simplified. Distributed computing is a significant field of computer science dedicated to these kinds problems; study the problem (you didn't give details) and don't expect magic.