On my Apache Tomcat server I have an OpenRDF Sesame triplestore to handle RDF triples related to users and documents and bidirectional links between such entities:
http://local/id/doc/123456 myvocabulary:title "EU Economy"
http://local/id/doc/456789 myvocabulary:title "United States Economy"
http://local/id/user/JohnDoe myvocabulary:email "john#doe.com"
http://local/id/user/JohnDoe myvocabulary:hasWritten http://local/id/doc/123456
These triples state that the user John Doe, with email "john#doe.com", has written the book "EU Economy".
A Java application running on multiple clients uses this server through an HTTPRepository to insert/update/remove such triples.
Problems come from concurrent connections. If one Java client deletes the book "456789" and another client simultaneously links the same book to "John Doe", we can end up in a situation where "John Doe" links to a book that no longer exists.
To try to find a solution, I have created two transactions. The first one (T1) is:
(a) Check if book id exists (i.e. "456789").
(b) If yes, link the given profile (i.e. "JohnDoe") to this book.
(c) If no, return an error.
The second one is (T2):
(d) Delete book by id (i.e. "456789").
The problem is that if the sequence is (T1,a) (T2,d) (T1,b) (T1,c), there are again consistency issues.
My question is: how can I handle locking (like MySQL's FOR UPDATE or GET_LOCK) to properly isolate such transactions with Sesame?
Older versions of Sesame (2.7.x and older) support no transaction isolation over HTTP. Over an HTTP connection, transactions merely batch operations together on the client side; no lock is obtained from the server, so there is no way to control isolation in this scenario.
So the only way to deal with this in older Sesame versions is to be robust in your queries, rather than relying on full data consistency (which is a bit of an odd concept in a schemaless/semi-structured data paradigm anyway). For example in this particular case, make sure that when you query for the books linked to a profile, the book data is actually there - don't just rely on the reference.
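For instance, a query along these lines only returns books whose own data is still present, so a dangling hasWritten link is simply skipped. This is a sketch against the Sesame Repository API; the my: namespace is an assumption standing in for the question's vocabulary, so substitute your actual IRIs:

```java
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

public class RobustBookQuery {

    // Only return books that still have a title: a dangling hasWritten
    // reference to a deleted book simply produces no result row.
    private static final String QUERY =
        "PREFIX my: <http://local/vocabulary/> \n" +
        "SELECT ?book ?title WHERE { \n" +
        "  <http://local/id/user/JohnDoe> my:hasWritten ?book . \n" +
        "  ?book my:title ?title . \n" +
        "}";

    static void printBooks(RepositoryConnection conn) throws Exception {
        TupleQuery query = conn.prepareTupleQuery(QueryLanguage.SPARQL, QUERY);
        TupleQueryResult result = query.evaluate();
        try {
            while (result.hasNext()) {
                System.out.println(result.next().getValue("title"));
            }
        } finally {
            result.close();
        }
    }
}
```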
In Sesame 2.8 and newer, however, full transaction isolation support is available over HTTP, and additional control over the exact transaction isolation level is available as well, on a per-transaction basis. The locking scheme is dependent on the specific triplestore implementation you use.
Sesame's native store uses optimistic locking, which means that it assumes a transaction will be able to make the updates it wants, and throws an exception when a conflict occurs. Setting the isolation level for a transaction controls how the store handles locking for concurrent transactions. The Sesame Programmers manual has more details on transaction handling and the available isolation levels. The default isolation level for transactions on the native store is SNAPSHOT_READ.
As for your example transactions: in the default isolation level, T1 and T2 both observe consistent snapshots of the store for their queries, and the sequence as you sketch it plays out: T1 sees the book exists, thus adds it to the profile, and T2 gets to delete it. The end result will be that the profile is linked to a non-existent book - but actually, this is not technically an inconsistency, because T2 does not do any verification on whether a particular book is used in a profile, or not. No matter which transaction isolation level you use, if in your scenario T2 gets executed after T1, the end result will be a link to a non-existent book. If you want to ensure that you cannot get into that situation, you need to extend T2 to do a check that the book about to be deleted is not linked to a profile, and make the isolation level SNAPSHOT or SERIALIZABLE.
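For illustration, a sketch of T2 extended with that check, using the Sesame 2.8 Repository API. The vocabulary IRI and the choice of SERIALIZABLE are assumptions; adapt both to your setup:

```java
import org.openrdf.IsolationLevels;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.repository.RepositoryConnection;

public class DeleteBookSafely {

    // T2 extended: refuse to delete a book that is still linked from a profile.
    static void deleteBook(RepositoryConnection conn, String bookIri) throws Exception {
        ValueFactory vf = conn.getValueFactory();
        URI book = vf.createURI(bookIri);
        URI hasWritten = vf.createURI("http://local/vocabulary/hasWritten"); // assumed vocabulary IRI

        conn.begin(IsolationLevels.SERIALIZABLE); // conflicting concurrent transactions will fail
        try {
            if (conn.hasStatement(null, hasWritten, book, false)) {
                conn.rollback();
                throw new IllegalStateException("Book is still linked to a profile: " + bookIri);
            }
            conn.remove(book, null, null); // delete all triples about the book
            conn.commit();
        } catch (Exception e) {
            if (conn.isActive()) {
                conn.rollback();
            }
            throw e;
        }
    }
}
```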
Related
I am trying to understand how good Raft can be for collaborative editing when the state is just a JSON blob that can have arrays in it.
My intuition is that Raft is built for safety while CRDT is built for speed (sacrificing availability). Curious to get more opinions on how feasible it is to use Raft for collaborative editing.
First of all, Raft requires that all writes come through the same actor (the leader) and be applied in the same order before being committed. This means that:
If you don't have access to a current leader from your machine, you won't be able to commit any writes.
In order to secure a total order, you need to wait for a commit confirmation from the leader, which may require more than one round trip. For the collaborative editing case this cripples the responsiveness of your application, because you cannot commit the next update (e.g. a key press) before the previous one has been confirmed by the remote server.
If your leader fails, you'll need to wait until the next one is elected before any further updates can be committed.
There's a specific set of conflict resolution problems that Raft doesn't really know how to deal with. The simplest example: two people typing at the same position - you could easily end up with text from both of them being interleaved (e.g. at the same position A writes 'hello' and B writes 'world'; the result could be any interleaving of these, e.g. 'hwelolrldo').
Besides other concerns - like membership and redeliveries - Raft by itself doesn't offer a valuable solution for the issues above. You'd need to solve them yourself.
How do I run Substrate in fake validating mode for development purposes (is there anything similar to --dev in geth where transactions are mined instantly)?
Actually, Substrate recently integrated two new consensus (or "block authoring", whichever you prefer to call them) algorithms that might be exactly what you need:
1- Manual seal: Where there is one author and it authors a block whenever you tell it via an RPC call.
2- Instant seal: Where there is one author and it attempts to author a block as soon as it sees a transaction in the pool, most often leading to one transaction per block.
This is pretty recent work and perhaps has not been reflected in the docs yet. But you can find an in-depth intro to it here. Check out the video description for code examples.
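For illustration, manual seal is driven through a JSON-RPC method (engine_createBlock, if memory serves) that any HTTP client can call. A rough Java sketch against a locally running node; the port and parameter values are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ManualSealExample {
    public static void main(String[] args) throws Exception {
        // Ask a manual-seal node to author (and finalize) a block, including
        // whatever is currently sitting in the transaction pool.
        String body = "{\"jsonrpc\":\"2.0\",\"id\":1,"
            + "\"method\":\"engine_createBlock\","
            + "\"params\":[true, true, null]}"; // create_empty, finalize, parent_hash

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9933")) // assumed default HTTP RPC port
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```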
I created a new Google Cloud Platform project and a Datastore.
The Datastore was created as "Firestore in Datastore mode".
But I think Firestore in Datastore mode and the old Datastore behave differently when a conflict occurs.
E.g. in the following case:
procA: -> enter transaction -> get -> put -----------------> exit transaction
procB: -----> enter transaction -> get -> put -> exit transaction
Old Datastore:
procB completes and its data is updated.
procA gets a conflict and its data is rolled back.
Firestore in Datastore mode:
procB waits before exiting its transaction until procA has completed; then procB gets a conflict.
procA completes and its data is updated.
Is this the intended behavior?
I cannot find it documented in the Google Cloud Platform documentation.
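For reference, the transactions in both procA and procB are a plain read-modify-write, roughly like this (a minimal sketch with the Google Cloud Datastore Java client; the kind and property names are made up):

```java
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Transaction;

public class CounterUpdate {
    public static void main(String[] args) {
        Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
        Key key = datastore.newKeyFactory().setKind("Counter").newKey("my-counter");

        Transaction txn = datastore.newTransaction();   // enter transaction
        try {
            Entity current = txn.get(key);              // get
            long value = current == null ? 0 : current.getLong("count");
            txn.put(Entity.newBuilder(key)              // put
                .set("count", value + 1)
                .build());
            txn.commit();                               // exit transaction
        } finally {
            if (txn.isActive()) {
                txn.rollback();                         // e.g. after a concurrency conflict
            }
        }
    }
}
```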
I've been giving it some thought and I think the change may actually be intentional.
In the old behaviour that you describe, basically the shorter transaction, even if it starts after the longer one, is successful, preempting the longer one and causing it to fail and be re-tried. Effectively this gives priority to the shorter transactions.
But imagine that you have a peak of activity with a bunch of shorter transactions - they will keep preempting the longer one(s), which will keep being re-tried until eventually reaching the maximum retry limit and failing permanently, increasing datastore contention in the process due to the retries. I actually hit such a scenario in my transaction-heavy app and had to adjust my algorithms to work around it.
By contrast, the new behaviour gives all transactions a fair chance of success, regardless of their duration or the level of activity - no priority handling. True, this comes at a price: shorter transactions that start after, and overlap with, longer ones will take longer overall. IMHO the new behaviour is preferable to the old one.
The behavior you describe is caused by the chosen concurrency mode for Firestore in Datastore mode. The default mode is Pessimistic concurrency for newly created databases. From the concurrency mode documentation:
Pessimistic
Read-write transactions use reader/writer locks to enforce isolation and serializability. When two or more concurrent read-write transactions read or write the same data, the lock held by one transaction can delay the other transactions. If your transaction does not require any writes, you can improve performance and avoid contention with other transactions by using a read-only transaction.
To get back the 'old' behavior of Datastore, choose "Optimistic" concurrency instead (link to command). This will make the faster transaction win and remove the blocking behavior.
I would recommend that you take a look at the documentation on Transactions and batched writes. There you will find more information and examples on how to perform transactions with Firestore.
It also clarifies the get(), set(), update(), and delete() operations.
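For example, a read-then-write transaction with the Java client looks roughly like this (a sketch along the lines of that documentation; the collection and field names are just placeholders):

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

public class TransactionExample {
    public static void main(String[] args) throws Exception {
        Firestore db = FirestoreOptions.getDefaultInstance().getService();
        DocumentReference docRef = db.collection("cities").document("SF");

        // All reads must happen before any writes inside the transaction body.
        ApiFuture<Void> future = db.runTransaction(transaction -> {
            DocumentSnapshot snapshot = transaction.get(docRef).get(); // read first
            long newPopulation = snapshot.getLong("population") + 1;
            transaction.update(docRef, "population", newPopulation);   // then write
            return null;
        });
        future.get(); // the transaction function may be retried on concurrent edits
    }
}
```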
I can highlight the following points from the documentation, which are very important to keep in mind when working with transactions:
Read operations must come before write operations.
A function calling a transaction (transaction function) might run more than once if a concurrent edit affects a document that the transaction reads.
Transaction functions should not directly modify application state.
Transactions will fail when the client is offline.
Let me know if the information helped you!
I am looking for a database library that can be used within an editor to replace a custom document format. In my case the document would contain a functional program.
I want application data to be persistent even while editing, so that when the program crashes, no data is lost. I know that all databases offer that.
On top of that, I want to access and edit the document from multiple threads, processes, possibly even multiple computers.
Format: a simple key/value database would totally suffice. SQL usually needs to be wrapped, and if I can avoid pulling in a heavy ORM dependency, that would be splendid.
Revisions: I want to be able to roll back changes up to the first change to the document that has ever been made, not only in one session, but also between sessions/program runs.
I need notifications: each process must be able to be notified of changes to the document so it can update its view accordingly.
I see these requirements as rather basic, a foundation to solve the usual tough problems of an editing application: undo/redo, multiple views on the same data. Thus, the database system should be lightweight and undemanding.
Thank you for your insights in advance :)
Berkeley DB is an undemanding, light-weight key-value database that supports locking and transactions. There are bindings for it in a lot of programming languages, including C++ and python. You'll have to implement revisions and notifications yourself, but that's actually not all that difficult.
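To give an idea of how undemanding it is, a transactional write with the Java edition (Berkeley DB JE) looks roughly like this (a minimal sketch; the environment directory and key/value layout are made up):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.Transaction;

public class DocumentStore {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);               // enable locking + transactions
        // The directory must already exist on disk.
        Environment env = new Environment(new File("doc-store"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database db = env.openDatabase(null, "document", dbConfig);

        // Write one key/value pair atomically; a crash before commit() loses nothing.
        Transaction txn = env.beginTransaction(null, null);
        db.put(txn,
            new DatabaseEntry("node/42".getBytes(StandardCharsets.UTF_8)),
            new DatabaseEntry("(lambda (x) (* x x))".getBytes(StandardCharsets.UTF_8)));
        txn.commit();

        db.close();
        env.close();
    }
}
```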
It might be a bit more than what you're asking for, but you should definitely look at CouchDB.
It is a document database with "document" being defined as a JSON record.
It stores all the changes to a document as revisions, so you get revisions out of the box.
It has a powerful JavaScript-based view engine to aggregate all the data you need from the database.
All the commits to the database are written to the end of the repository file and the writes are atomic, meaning that unsuccessful writes do not corrupt the database.
Another nice bonus you'll get is easy and flexible replication of your database.
See the full feature list on their homepage.
On the minus side (depending on your point of view) is the fact that it is written in Erlang and (as far as I know) runs as an external process...
I don't know anything about notifications though - it seems that if you are working with replicated databases, the changes are instantly replicated/synchronized between databases. Other than that I suppose you should be able to roll your own notification schema...
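To make the revision handling concrete: every update has to carry the _rev you last read, and CouchDB answers 409 Conflict if someone else updated the document in the meantime. A rough sketch against a local CouchDB over plain HTTP (the database name, document id, and _rev value are made up):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchDbUpdate {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // The update must include the _rev we read earlier; a stale _rev
        // is rejected with 409 Conflict instead of silently overwriting.
        String body = "{\"_id\":\"doc-1\","
            + "\"_rev\":\"3-917fa2381192822767f010b95b45325b\","
            + "\"title\":\"My document\"}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:5984/editor/doc-1"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body()); // 201 or 409
    }
}
```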
Check out ZODB. It doesn't have notifications built in, so you would need a messaging system there (since you may use separate computers). But it has transactions, you can roll back forever (unless you pack the database, which removes earlier revisions), you can access it directly as an integrated part of the application, or it can run as client/server (with multiple clients of course), you can have automatic persistency, there is no ORM, etc.
It's pretty much Python-only though (it's based on Pickles).
http://en.wikipedia.org/wiki/Zope_Object_Database
http://pypi.python.org/pypi/ZODB3
http://wiki.zope.org/ZODB/guide/index.html
http://wiki.zope.org/ZODB/Documentation
I'm implementing a document server. Currently, if two users open the same document, then modify it and save the changes, the document's state will be undefined (either the first user's changes are saved permanently, or the second's). This is entirely unsatisfactory. I considered two possibilities to solve this problem:
The first is to lock the document when it is opened by someone for the first time, and unlock it when it is closed. But if the network connection to the server is suddenly interrupted, the document would stay in a forever-locked state. The obvious solution is to send regular pings to the server. If the server doesn't receive K pings in a row (K > 1) from a particular client, the documents locked by that client are unlocked. If that client re-appears, the documents are locked again, provided no one else has locked them in the meantime. This would also help if the client application (running in a web browser) is terminated unexpectedly, making it impossible to send a 'quitting, unlock my documents' signal to the server.
The second is to store multiple versions of the same document saved by different users. If changes to the document are made in rapid succession, the system would offer either to merge versions or to select a preferred version. To optimize storage space, only document diffs should be kept (just like source control software).
What method should I choose, taking into consideration that the connection to the server might sometimes be slow and unresponsive? How should the parameters (ping interval, rapid succession interval) be determined?
P.S. Unfortunately, I can't store the documents in a database.
The first option you describe is essentially a pessimistic locking model whilst the second is an optimistic model.
Which one to choose really comes down to a number of factors but essentially boils down to how the business wants to work. For example, would it unduly inconvenience the users if a document they needed to edit was locked by another user? What happens if a document is locked and someone goes on holiday with their client connected? What is the likely contention for each document - i.e. how likely is it that the same document will be modified by two users at the same time?, how localised are the modifications likely to be within a single document? (If the same section is modified regularly then performing a merge may take longer than simply making the changes again).
Assuming the contention is relatively low and/or the size of each change is fairly small then I would probably opt for an optimistic model that resolves conflicts using an automatic or manual merge. A version number or a checksum of the document's contents can be used to determine if a merge is required.
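The version-number variant can be as small as this (purely illustrative sketch; VersionedDocumentStore is a hypothetical name, not an existing library):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical server-side store: a save succeeds only if the client
// edited the version it claims to have read; otherwise a merge is needed.
public class VersionedDocumentStore {

    static final class Versioned {
        final long version;
        final String content;
        Versioned(long version, String content) {
            this.version = version;
            this.content = content;
        }
    }

    private final Map<String, Versioned> documents = new HashMap<>();

    /** Returns the new version, or -1 if the client's copy was stale and a merge is required. */
    public synchronized long save(String docId, long expectedVersion, String newContent) {
        Versioned current = documents.get(docId);
        long currentVersion = current == null ? 0 : current.version;
        if (currentVersion != expectedVersion) {
            return -1; // someone else saved in the meantime - merge (manually or automatically)
        }
        documents.put(docId, new Versioned(currentVersion + 1, newContent));
        return currentVersion + 1;
    }
}
```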
My suggestion would be something like your first one. When the first user (Bob) opens the document, he acquires a lock so that other users can only read the current document. If Bob saves the document while he is using it, he keeps the lock. Only when he closes the document is it unlocked, and other people can edit it.
If the second user (Kate) opens the document while Bob has the lock on it, Kate will get a message saying the document is not editable, but she can read it until the lock has been released.
So what happens when Bob acquires the lock, maybe saves the document once or twice but then exits the application leaving the lock hanging?
As you said yourself, requiring the client with the lock to send pings at a certain frequency is probably the best option. If you don't get a ping from the client for a set amount of time, this effectively means the client is not responding anymore. If this is a web application you can use JavaScript for the pings. The lock on the document (in its last-saved state) is then released, and Kate can acquire it.
A ping can contain the name of the document that the client has a lock on, and the server can calculate when the last ping for that document was received.
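On the server, the bookkeeping for that can be as small as a map from document to lock owner and last-ping time (an illustrative sketch; the class name and timeout are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lock table: a lock silently expires when its owner
// has not pinged for longer than LOCK_TIMEOUT_MS.
public class LockTable {
    private static final long LOCK_TIMEOUT_MS = 30_000; // e.g. 3 missed pings at 10 s intervals

    private static final class Lock {
        final String owner;
        final long lastPingMillis;
        Lock(String owner, long lastPingMillis) {
            this.owner = owner;
            this.lastPingMillis = lastPingMillis;
        }
    }

    private final Map<String, Lock> locks = new HashMap<>();

    /** Try to acquire (or refresh) the lock on a document. Returns true on success. */
    public synchronized boolean acquireOrPing(String docId, String client) {
        long now = System.currentTimeMillis();
        Lock lock = locks.get(docId);
        if (lock == null || lock.owner.equals(client)
                || now - lock.lastPingMillis > LOCK_TIMEOUT_MS) {
            locks.put(docId, new Lock(client, now)); // new, refreshed, or expired lock taken over
            return true;
        }
        return false; // locked by someone else who is still alive
    }

    public synchronized void release(String docId, String client) {
        Lock lock = locks.get(docId);
        if (lock != null && lock.owner.equals(client)) {
            locks.remove(docId);
        }
    }
}
```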
Currently documents are published by a limited group of people, each of them working on a separate subject. So, the inconvenience introduced by locks is minimized.
People mostly extend existing documents and correct mistakes in them.
Speaking about the pessimistic model, the 'client left connected for N days' scenario could be avoided by setting the lock expiry date to, say, one day after the lock start date. Because the documents being edited are by no means mission-critical, and are modified by multiple users quite rarely, that could be enough.
Now consider the optimistic model. How should the differences be detected, if the documents have some regular (say, hierarchical) structure? If not? What are the chances of successful automatic merge in these cases?
The situation becomes more complicated because some of the documents (edited by the 'admins' user group) contain important configuration information (the global document index, user roles, etc.). To my mind, locks are more advantageous for precisely this kind of information, because it isn't changed on an everyday basis. So some hybrid solution might be acceptable.
What do you think?