Document Server: Handling Concurrent Saves

I'm implementing a document server. Currently, if two users open the same document, modify it, and save their changes, the document's final state is undefined: either the first user's changes are kept, or the second's. This is entirely unsatisfactory. I considered two possibilities to solve this problem:
The first is to lock the document when it is first opened and unlock it when it is closed. But if the network connection to the server is suddenly interrupted, the document would stay locked forever. The obvious fix is to have clients send regular pings to the server. If the server misses K pings in a row (K > 1) from a particular client, the documents locked by that client are unlocked. If that client reappears, its documents are locked again, provided no one else has locked them in the meantime. This also helps if the client application (running in a web browser) is terminated unexpectedly, making it impossible to send a 'quitting, unlock my documents' signal to the server.
The second is to store multiple versions of the same document as saved by different users. If changes to the document are made in rapid succession, the system offers either to merge the versions or to select a preferred one. To save storage space, only document diffs should be kept (just like source control software).
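For the diff part, Python's standard difflib can both produce and replay line diffs; a minimal sketch of what I have in mind (illustrative only, not a finished design):

import difflib

class DiffStore:
    # Keeps the base version plus one ndiff delta per saved revision,
    # roughly like source control. Replay is O(number of revisions).
    def __init__(self, base_lines):
        self.base = list(base_lines)
        self.deltas = []

    def save(self, new_lines):
        self.deltas.append(list(difflib.ndiff(self.latest(), new_lines)))

    def latest(self):
        lines = self.base
        for delta in self.deltas:
            # restore(delta, 2) rebuilds the "after" side of the diff
            lines = list(difflib.restore(delta, 2))
        return lines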
What method should I choose, taking into consideration that the connection to the server might sometimes be slow and unresponsive? How should the parameters (ping interval, rapid succession interval) be determined?
P.S. Unfortunately, I can't store the documents in a database.

The first option you describe is essentially a pessimistic locking model whilst the second is an optimistic model.
Which one to choose really comes down to a number of factors, but essentially boils down to how the business wants to work. For example: would it unduly inconvenience the users if a document they needed to edit was locked by another user? What happens if a document is locked and someone goes on holiday with their client still connected? What is the likely contention for each document, i.e. how likely is it that the same document will be modified by two users at the same time? And how localised are the modifications likely to be within a single document? (If the same section is modified regularly, performing a merge may take longer than simply making the changes again.)
Assuming the contention is relatively low and/or the size of each change is fairly small, I would probably opt for an optimistic model that resolves conflicts using an automatic or manual merge. A version number or a checksum of the document's contents can be used to determine whether a merge is required.
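As a sketch of that check (assuming a dict-like server-side store; a real server would need the compare and the write to happen atomically):

import hashlib

def save_document(store, doc_id, new_text, base_checksum):
    # base_checksum is the checksum of the version the client started from.
    current = store[doc_id]
    if hashlib.sha256(current.encode()).hexdigest() != base_checksum:
        return "merge_required"  # someone else saved in the meantime
    store[doc_id] = new_text
    return "saved"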

My suggestion would be something like your first option. When the first user (Bob) opens the document, he acquires a lock, so other users can only read the current document. If Bob saves the document while he is using it, he keeps the lock. Only when he closes the document is it unlocked so other people can edit it.
If the second user (Kate) opens the document while Bob holds the lock, Kate will get a message saying the document is not editable, but she can read it until the lock has been released.
So what happens when Bob acquires the lock, maybe saves the document once or twice but then exits the application leaving the lock hanging?
As you said yourself, requiring the client holding the lock to send pings at a certain frequency is probably the best option. If you don't get a ping from the client for a set amount of time, that effectively means the client is no longer responding. If this is a web application, you can use JavaScript for the pings. The document then stays at its last-saved state, its lock is released, and Kate can acquire it.
A ping can contain the name of the document that the client has a lock on, and the server can calculate when the last ping for that document was received.
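A minimal in-memory sketch of that bookkeeping (my illustration; the interval and K are tuning assumptions, and a real server would persist this state and serialize access to it):

import time

PING_INTERVAL = 15        # seconds between client pings (assumption)
MISSED_PINGS_ALLOWED = 3  # the K from the question

lock_owner = {}  # document name -> client id
last_ping = {}   # document name -> monotonic time of the holder's last ping

def on_ping(client_id, document):
    # Each ping names the document the client holds a lock on.
    if lock_owner.get(document) == client_id:
        last_ping[document] = time.monotonic()

def expire_stale_locks():
    # Run periodically: release locks whose holder missed K pings in a row.
    deadline = PING_INTERVAL * MISSED_PINGS_ALLOWED
    for document, stamp in list(last_ping.items()):
        if time.monotonic() - stamp > deadline:
            lock_owner.pop(document, None)
            last_ping.pop(document, None)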

Currently documents are published by a limited group of people, each of them working on a separate subject. So, the inconvenience introduced by locks is minimized.
People mostly extend existing documents and correct mistakes in them.
Speaking about the pessimistic model, the 'left client connected for N days' scenario could be avoided by setting the lock expiry to, say, one day after the lock is taken. Because the documents being edited are by no means mission-critical and are modified by multiple users quite rarely, that could be enough.
Now consider the optimistic model. How should the differences be detected if the documents have some regular (say, hierarchical) structure? And if they don't? What are the chances of a successful automatic merge in each case?
The situation becomes more complicated because some of the documents (edited by the 'admins' user group) contain important configuration information (the global document index, user roles, etc.). To my mind, locks are more advantageous for precisely this kind of information, because it isn't changed on an everyday basis. So some hybrid solution might be acceptable.
What do you think?

Related

How does Raft compare with CRDT for collaborative editing?

I am trying to understand how well Raft can work for collaborative editing when the state is just a JSON blob that can have arrays in it.
My intuition is that Raft is built for safety while CRDT is built for speed (sacrificing availability). Curious to get more opinions on how feasible it is to use Raft for collaborative editing.
First of all, Raft requires that all writes go through the same actor (the leader) and be committed in a single agreed order. This means that:
If you can't reach the current leader from your machine, you won't be able to commit any writes.
In order to secure total order, you need to wait for a commit confirmation from the leader, which may require more than one round trip. For collaborative editing this cripples the responsiveness of your application, because you cannot commit the next update (e.g. a key press) before the previous one has been confirmed by the remote server.
If the leader fails, you'll need to wait until the next one is elected before any further updates can be committed.
There's a specific set of conflict-resolution problems that Raft doesn't really know how to deal with. The simplest example: two people typing at the same cursor position at the same time - you could easily end up with text from both of them being interleaved (e.g. at the same position A writes 'hello' and B writes 'world'; the result could be any interleaving of the two, e.g. 'hwelolrldo').
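A toy illustration of the point, assuming naive index-based insert operations (if each keystroke were a separate operation, the characters could interleave arbitrarily, as in the example above):

def apply(text, pos, insert):
    return text[:pos] + insert + text[pos:]

text = "12345"
# The committed order is whatever the leader decides; Raft guarantees
# everyone applies it identically, but it does not reconcile intent.
log = [(5, "hello"), (5, "world")]
for pos, s in log:
    text = apply(text, pos, s)
print(text)  # '12345worldhello' - B's text lands before A's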
Besides other concerns - like membership and redeliveries - Raft by itself doesn't offer a solution to the issues above. You'd need to solve them yourself.

Has the transaction behavior changed when a conflict occurred in firestore datastore?

I created a new Google Cloud Platform project and Datastore.
The Datastore was created as "Firestore in Datastore mode".
But I think Firestore in Datastore mode and the old Datastore behave differently when a conflict occurs.
For example, consider the following case:
procA: -> enter transaction -> get -> put -----------------> exit transaction
procB: -----> enter transaction -> get -> put -> exit transaction
Old Datastore:
procB completes and its data is updated.
procA hits a conflict and its data is rolled back.
Firestore in Datastore mode:
procB waits before exiting its transaction until procA completes; then procB hits the conflict.
procA completes and its data is updated.
Is this the specified behavior?
I cannot find any documentation about it on Google Cloud Platform.
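For reference, here is roughly how I reproduce it (a sketch with the google-cloud-datastore Python client; the kind and field names are placeholders). I run this read-modify-write from two processes with different delays between the get and the put:

from google.cloud import datastore

client = datastore.Client()

def increment(client):
    # One read-modify-write transaction, as in procA/procB above.
    with client.transaction():
        key = client.key("Counter", "shared")
        entity = client.get(key) or datastore.Entity(key=key)
        entity["count"] = entity.get("count", 0) + 1
        client.put(entity)  # committed when the 'with' block exits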
I've been giving it some thought and I think the change may actually be intentional.
In the old behaviour that you describe, the shorter transaction, even if it starts after the longer one does, is the one that succeeds, preempting the longer one and causing it to fail and be retried. Effectively this gives priority to the shorter transactions.
But imagine that you have a peak of activity with a bunch of shorter transactions - they will keep preempting the longer one(s), which will keep being retried until eventually reaching the maximum retry limit and failing permanently, increasing datastore contention in the process due to the retries. I actually hit such a scenario in my transaction-heavy app and had to adjust my algorithms to work around it.
By contrast, the new behaviour gives all transactions a fair chance of success regardless of their duration or the level of activity - no priority handling. True, this comes at a price: shorter transactions that start after longer ones and overlap them will take longer overall. IMHO the new behaviour is preferable to the old one.
The behavior you describe is caused by the concurrency mode chosen for your Firestore in Datastore mode database. The default mode for newly created databases is pessimistic concurrency. From the concurrency mode documentation:
Pessimistic
Read-write transactions use reader/writer locks to enforce isolation and serializability. When two or more concurrent read-write transactions read or write the same data, the lock held by one transaction can delay the other transactions. If your transaction does not require any writes, you can improve performance and avoid contention with other transactions by using a read-only transaction.
To get back the 'old' behavior of Datastore, choose "Optimistic" concurrency instead (link to command). This will make the faster transaction win and remove the blocking behavior.
I would recommend you take a look at the documentation Transactions and batched writes. There you will find more information and examples of how to perform transactions with Firestore.
It also clarifies the get(), set(), update(), and delete() operations.
I can highlight the following points from the documentation, which are very important to notice when working with transactions:
Read operations must come before write operations.
A function calling a transaction (transaction function) might run more than once if a concurrent edit affects a document that the transaction reads.
Transaction functions should not directly modify application state.
Transactions will fail when the client is offline.
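To make those rules concrete, here is a minimal transaction sketch with the google-cloud-firestore Python client (the collection and field names are made up):

from google.cloud import firestore

db = firestore.Client()
ref = db.collection("counters").document("shared")

@firestore.transactional
def increment(transaction, ref):
    # Read operations come before write operations, and this function
    # may run more than once if a concurrent edit intervenes.
    snapshot = ref.get(transaction=transaction)
    count = snapshot.get("count") if snapshot.exists else 0
    transaction.set(ref, {"count": count + 1})

increment(db.transaction(), ref)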
Let me know if the information helped you!

ReST philosophy - how to handle services and side effects

I've been diving into ReST lately, and a few things still bug me:
1) Since there are only resources and no services to call, how can I provide operations to the client that only do stuff and don't change any data?
For example, in my application it is possible to trigger a service that connects to a remote server and executes a shell script. I don't see how this scenario maps to a resource.
2) Another thing I'm not sure about is side effects: Let's say I have a resource that can be in certain states. When transitioning into another state, a lot of things might happen (e-mails might be sent). The transition is triggered by the client. Should I handle this transition merely by letting the resource be updated via PUT? This feels a bit odd.
For the client this means that updating an attribute of this resource might only change the attribute, or it might also do a lot of other things. So PUT =/= PUT, kind of.
And implementation-wise, I have to check what exactly the PUT request changed and trigger the side effects accordingly. So there would be a lot of checks like if (old_attribute != new_attribute) { side_effects }.
Is this how it's supposed to be?
BR,
Philipp
Since there are only resources and no services to call, how can I provide operations to the client that only do stuff and don't change any data?
HTTP is a document transport application. Send documents (ie: messages) that trigger the behaviors that you want.
In other words, you can think about the message you are sending as a description of a task, or as an entry being added to a task queue. "I'm creating a task resource that describes some work I want done."
Jim Webber covers this pretty well.
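For instance, the shell-script scenario from the question could be modelled like this (a sketch; the endpoint and field names are hypothetical):

import requests

# POST a task resource describing the work, rather than calling a "service".
resp = requests.post(
    "https://api.example.org/script-runs",
    json={"host": "remote-1", "script": "cleanup.sh"},
)
task_url = resp.headers["Location"]  # 201 Created points at the new task

# The client then polls the task resource for the outcome.
status = requests.get(task_url).json()["status"]  # e.g. "queued", "done"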
Another thing I'm not sure about is side effects: Let's say I have a resource that can be in certain states. When transitioning into another state, a lot of things might happen (e-mails might be sent). The transition is triggered by the client. Should I handle this transition merely by letting the resource be updated via PUT?
Maybe, but that's not your only choice -- you could handle the transition by having the client put some other resource (ie, a message describing the change to be made). That affords having a number of messages (commands) that describe very specific modifications to the domain entity.
In other words, you can work around PUT =/= PUT by putting more specific things.
(In HTTP, the semantics of PUT are effectively create or replace. Which is great for dumb documents, or CRUD, but need a bit of design help when applied to an entity with its own agency.)
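As a sketch (resource and transition names hypothetical), the client posts a transition message instead of PUT-ing the whole modified document:

import requests

# The server knows exactly which side effects belong to 'submit';
# no diffing of old vs. new attributes is needed.
requests.post(
    "https://api.example.org/orders/42/transitions",
    json={"transition": "submit"},
)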
And implementation wise, I have to check what exacty the PUT request changed, and according to that trigger the side effects.
Is this how it's supposed to be?
Sort of. Review Udi Dahan's talk on reliable messaging; it's not REST specific, but it may help clarify the separation of responsibilities here.

Choice of storage and caching

I hope the title is chosen well enough to ask this question.
Feel free to edit if not and please accept my apologies.
I am currently laying out an application that is interacting with the web.
Explanation of the basic flow of the program:
The user enters a UserID into my program, which is then used to access multiple XML files over the web:
http://example.org/user/userid/?xml=1
This file contains several IDs of products the user owns in a DRM system. This list is then used to access stats and information about the user's interaction with each product:
http://example.org/user/appid/stats/?xml=1
This also contains links to various images which are specific to that application. And those may change at any time and need to be downloaded for display in the app.
This is where the horror starts, at least for me :D.
1.) How do I store that information on the PC of the user?
I thought about using a directory per userid, with subfolders per appid, to cache the images and XML files and load them on demand. I also thought about using a zipfile with the same internal structure.
Or would one rather use a local db like sqlite for that?
The average number of applications might be around 100-300, with anywhere from roughly 5 to 700 stats and images per app.
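If I went the SQLite route, I imagine a single table keyed by user, app, and resource, roughly like this (just a sketch; whether to inline image bytes as BLOBs or store file paths instead is open):

import sqlite3

conn = sqlite3.connect("cache.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cache (
        user_id    TEXT,
        app_id     TEXT,
        resource   TEXT,  -- e.g. 'stats.xml' or an image URL
        body       BLOB,
        fetched_at REAL,  -- download time, epoch seconds
        PRIMARY KEY (user_id, app_id, resource)
    )
""")
conn.commit()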
2.) When should I refresh the content?
The bad thing is that the website this data is downloaded from - or rather the XMLs - does not contain any timestamp of when the data was last refreshed/changed. So I would need to hash all the files and compare them at the moment the user accesses the data, which can take an indefinite amount of time because it is web-based. Okay, there are timeouts, but I would need to block access to the content until the data is either downloaded and processed or the timeout occurs. In both cases the application would be inaccessible for a short or maybe even long time, and I want to avoid that. I could let the user refresh manually when he needs it, but I hoped there are better methods for that.
Especially with the above mentioned numbers of apps and stuff.
Thanks for reading and all of that and please feel free to ask if I forgot to explain something.
It's probably worth using a DB since it saves you messing around with file formats for structured data. Remember to delete and rebuild it from time to time (or make sure old stuff is thoroughly removed and compact it from time to time, but it's probably easier to start again, since it's just a cache).
If the web service gives you no clues about when to reload, then you'll just have to decide for yourself, but do be sure to check the HTTP headers for any caching instructions as well as the XML data[*]. Decide on a reasonable staleness for the data (the amount of time a user spends staring at the results is an absolute minimum, since they'll see results that stale no matter what you do). Whenever you download anything, record the date/time you downloaded it. Flush old data from the cache.
To prevent long delays refreshing data, you could:
visually indicate that the data is stale, but display it anyway and replace it once you've refreshed.
allow staler data when the user has a lot of stuff visible than when they're looking at only a small amount. That way you "do nothing" while waiting for a small amount of stuff, but not while waiting for a large amount of stuff.
run a background task that does nothing other than expiring old stuff out of the cache and reloading it. The main app always displays the best available, however old that is.
Or some combination of tactics.
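The third tactic could be as simple as this sketch (the max age is a tuning assumption; 'fetch' is whatever downloads and parses one XML):

import threading, time

MAX_AGE = 3600  # chosen staleness, in seconds

def get_cached(key, cache, fetch):
    # Return whatever is cached immediately; refresh in the background
    # if it is too old. 'cache' maps key -> (body, fetched_at).
    body, fetched_at = cache.get(key, (None, 0))
    if time.time() - fetched_at > MAX_AGE:
        def refresh():
            cache[key] = (fetch(key), time.time())
        threading.Thread(target=refresh, daemon=True).start()
    return body  # may be None or stale; the UI marks it accordingly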
[*] Come to think of it, if the web server provides reasonable caching instructions, then it might be simplest to forget about any sort of storage or caching in your app. Just grab the XML files and display them, but grab them via a caching web proxy that you've integrated into your app. I don't know which proxies make this easy - you can compile Squid yourself (of course), but I don't know whether you can link it into another app without modifying it yourself.

Is there a database implementation that has notifications and revisions?

I am looking for a database library that can be used within an editor to replace a custom document format. In my case the document would contain a functional program.
I want application data to be persistent even while editing, so that when the program crashes, no data is lost. I know that all databases offer that.
On top of that, I want to access and edit the document from multiple threads, processes, possibly even multiple computers.
Format: a simple key/value database would totally suffice. SQL usually needs to be wrapped, and if I can avoid pulling in a heavy ORM dependency, that would be splendid.
Revisions: I want to be able to roll back changes up to the first change to the document that has ever been made, not only in one session, but also between sessions/program runs.
I need notifications: each process must be able to be notified of changes to the document so it can update its view accordingly.
I see these requirements as rather basic, a foundation to solve the usual tough problems of an editing application: undo/redo, multiple views on the same data. Thus, the database system should be lightweight and undemanding.
Thank you for your insights in advance :)
Berkeley DB is an undemanding, lightweight key-value database that supports locking and transactions. There are bindings for it in a lot of programming languages, including C++ and Python. You'll have to implement revisions and notifications yourself, but that's actually not all that difficult.
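For example, revisions and in-process notifications can be layered over any key-value store like this (a sketch; a plain dict stands in for the Berkeley DB handle, and cross-process notifications would still need a messaging layer):

import json

class RevisionedStore:
    def __init__(self):
        self.kv = {}         # stand-in for the Berkeley DB handle
        self.listeners = []  # callbacks fired on every write

    def put(self, key, value):
        rev = self.kv.get(key, 0) + 1
        self.kv[key] = rev                           # 'key' holds latest rev number
        self.kv[f"{key}@{rev}"] = json.dumps(value)  # 'key@N' holds revision N
        for callback in self.listeners:
            callback(key, rev)

    def get(self, key, rev=None):
        rev = rev or self.kv.get(key, 0)
        return json.loads(self.kv[f"{key}@{rev}"])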
It might be a bit more power than you asked for, but you should definitely look at CouchDB.
It is a document database with "document" being defined as a JSON record.
It stores all changes to documents as revisions, so you get revision history out of the box.
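For example, over CouchDB's HTTP API every write returns a new revision id, and an update must quote the revision it is based on, so conflicting saves get rejected rather than silently lost (a sketch assuming a local CouchDB with a 'docs' database):

import requests

base = "http://localhost:5984/docs"
r1 = requests.put(f"{base}/doc1", json={"text": "first draft"}).json()
requests.put(f"{base}/doc1",
             json={"text": "second draft", "_rev": r1["rev"]})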
It has a powerful JavaScript-based view engine to aggregate all the data you need from the database.
All commits to the database are appended to the end of the database file, and writes are atomic, meaning that unsuccessful writes do not corrupt the database.
Another nice bonus you'll get is easy and flexible replication of your database.
See the full feature list on their homepage
On the minus side (depending on your point of view) is the fact that it is written in Erlang and (as far as I know) runs as an external process...
I don't know anything about notifications, though - it seems that if you are working with replicated databases, the changes are instantly replicated/synchronized between them. Other than that, I suppose you should be able to roll your own notification scheme...
Check out ZODB. It doesn't have notifications built in, so you would need a messaging system there (since you may use separate computers). But it has transactions, you can roll back forever (unless you pack the database, which removes earlier revisions), you can access it directly as an integrated part of the application or run it as client/server (with multiple clients, of course), you get automatic persistence, there is no ORM, etc.
It's pretty much Python-only, though (it's based on pickles).
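A minimal session looks like this (a sketch; every commit becomes a revision that db.undo() can roll back, until you pack the storage):

import ZODB, ZODB.FileStorage
import transaction

storage = ZODB.FileStorage.FileStorage("document.fs")
db = ZODB.DB(storage)
root = db.open().root()

root["document"] = {"title": "draft", "body": ""}
transaction.commit()  # persisted; survives a crash from here on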
http://en.wikipedia.org/wiki/Zope_Object_Database
http://pypi.python.org/pypi/ZODB3
http://wiki.zope.org/ZODB/guide/index.html
http://wiki.zope.org/ZODB/Documentation