Cassandra, schema and process design for concurrent writes - concurrency

This is a long-winded question about Cassandra schema design. I'm here to get input from experts on a use case I'm working on. All inputs, suggestions, and critiques are welcome. Here goes my question.
We would like to collect REVIEWS from our USERS about some PAPERS we are about to publish. For each paper we seek 3 reviews, but we send out review invites to 3*2 = 6 users. All 6 users can submit their reviews to our system, but only the first 3 count, and these first 3 reviewers will be rewarded for their work.
In our Cassandra DB, there are three tables: USER, PAPER and REVIEW. The USER and PAPER tables are simple: each user corresponds to a row in the USER table with a unique USER_ID; similarly, each paper has a unique PAPER_ID in the PAPER table.
The REVIEW table looks like this:
CREATE TABLE REVIEW(
    PAPER_ID uuid,
    USER_ID uuid,
    REVIEW_CONTENT text,
    PRIMARY KEY(PAPER_ID, USER_ID)
);
We use PAPER_ID as the partition key of the REVIEW table so that all reviews of a given paper are stored in a single Cassandra partition. For each paper we have, we pick 6 users, insert 6 entries into the REVIEW table, and send out 6 invites to those users. So, for paper "P1", there are 6 entries in the REVIEW table that look like this:
PAPER_ID | USER_ID | REVIEW_CONTENT
---------|---------|---------------
P1       | U1      | null
P1       | U2      | null
P1       | U3      | null
P1       | U4      | null
P1       | U5      | null
P1       | U6      | This paper ...
...      | ...     | ...
Users submit reviews via a web browser over HTTP. At the backend, we use the following process to handle a submitted review (using paper "P1" as an example):
1. Use partition key "P1" to get all 6 entries out of the REVIEW table.
2. Find out how many of these 6 entries have non-null values in the REVIEW_CONTENT column (a non-null value indicates that the corresponding user has already submitted his review; for example, in the table above, user "U6" has submitted his review while the other 5 have not).
3. If this number >= 3, we already have enough reviews; return to the current reviewer with a message like "Thanks, we already have enough reviews."
4. If this number < 2, save the current review to the corresponding entry in the REVIEW table and return to the reviewer with a message like "Your review has been accepted." (E.g. if the current reviewer is "U1", fill the REVIEW_CONTENT column of the "P1, U1" entry with the current review content.)
5. If this number = 2, this is the most complicated case, as the current submission is the last one we'll accept. Here we first save the current review to the REVIEW table, then find the ids of all three users that have submitted reviews (including the current user) and record their ids in a transaction table to pay them rewards later. (A sketch of this whole flow follows the list.)
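To make the steps concrete, here is a minimal sketch of this read-then-write flow, assuming the Python cassandra-driver with an already-connected session (record_rewards is a hypothetical helper); this is the racy version the rest of the question critiques:
def handle_submission(session, paper_id, user_id, content):
    # Step 1: fetch all 6 entries for this paper's partition.
    rows = session.execute(
        "SELECT user_id, review_content FROM review WHERE paper_id = %s",
        (paper_id,))
    # Step 2: count how many entries already have content.
    submitted = [r.user_id for r in rows if r.review_content is not None]
    # Step 3: enough reviews already.
    if len(submitted) >= 3:
        return "Thanks, we already have enough reviews."
    # Steps 4 and 5 both save the current review.
    session.execute(
        "UPDATE review SET review_content = %s WHERE paper_id = %s AND user_id = %s",
        (content, paper_id, user_id))
    # Step 5: the current submitter is the 3rd; record the rewards.
    if len(submitted) == 2:
        record_rewards(paper_id, submitted + [user_id])  # hypothetical helper
    return "Your review has been accepted."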
But this process does not work: it does not handle concurrent submissions correctly. Consider the following case: two users have already submitted their reviews, and meanwhile 3 other users are submitting theirs via three concurrent runs of the process shown above. At step 5, each of the three will think he is the 3rd and last submitter and will insert new records into the transaction table. This leads to double counting: a single user may be rewarded more than once for the same review he submitted.
Another problem with this process is that it may never reach step 5. Say there are no submissions in the REVIEW table yet, and 4 users submit their reviews at the same time. All of them save their reviews at step 4. After this, every later submitter is rejected because there are already 4 accepted reviews. But since we never reached step 5, no ids are ever recorded in the transaction table, and the users will never get any rewards.
So here comes my question: how should I handle my use case using Cassandra as the back-end DB? Will a Cassandra COUNTER help? If so, how? I have not thought through how to use COUNTER yet, but this blog (http://aphyr.com/posts/294-call-me-maybe-cassandra) warned that Cassandra COUNTER is not safe (quote: "Consequently, Cassandra counters will over- or under-count by a wide range during a network partition."). Will Cassandra's Compare and Set (CAS) feature help? If so, how? Again, the same blog warned that "Cassandra lightweight transactions are not even close to correct."

Rather than creating empty entries in your review table, I would consider leaving it empty and only filling it as the reviews are submitted. To handle concurrency, add a timeuuid field as a clustering column:
CREATE TABLE review(
    paper_id uuid,
    submission_time timeuuid,
    user_id uuid,
    content text,
    PRIMARY KEY (paper_id, submission_time)
);
When a user makes their submission, add the entry to the table. Then AFTER the write is successful, query the table (on only the paper_id) and find out if the user's id is one of the first three. Respond to the user accordingly. Since you're committed to a small set of reviewers, the extra overhead of fetching all the reviews should be minimal (especially since you wouldn't need to include the content column in the query).
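A minimal sketch of that flow, assuming the Python cassandra-driver with a connected session (uuid_from_time is the driver's timeuuid helper):
import time
from cassandra.util import uuid_from_time

def submit_review(session, paper_id, user_id, content):
    # Write the review first; the timeuuid clustering column orders rows
    # by submission time within the paper's partition.
    session.execute(
        "INSERT INTO review (paper_id, submission_time, user_id, content) "
        "VALUES (%s, %s, %s, %s)",
        (paper_id, uuid_from_time(time.time()), user_id, content))
    # AFTER the write succeeds, read back only the first three submitters;
    # the content column is deliberately left out to keep the query cheap.
    winners = session.execute(
        "SELECT user_id FROM review WHERE paper_id = %s LIMIT 3",
        (paper_id,))
    if any(row.user_id == user_id for row in winners):
        return "Your review has been accepted."
    return "Thanks, we already have enough reviews."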
If you need to track who's reviewing the papers, add a set of user ids to the paper table and write the six user ids there.
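For example (a sketch; the invited_reviewers column name is hypothetical):
# One-time schema change:
session.execute("ALTER TABLE paper ADD invited_reviewers set<uuid>")

def record_invitees(session, paper_id, user_ids):
    # user_ids: the six invited reviewers' uuids.
    session.execute(
        "UPDATE paper SET invited_reviewers = %s WHERE paper_id = %s",
        (set(user_ids), paper_id))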

Related

How do I save a transaction log to the database

Good day guys,
I need your opinion on this problem. Although I am using Django for my project, I am sure this problem is not tied to Django alone. I am working on a services booking system. In my database I have 3 tables, listed below:
User_Table with fields:
• Id
• Username
• Fullname
Services_Table with fields:
• Id
• Name
• Price
Transaction_Table with fields:
• Id
• User_id
• Services_id (many-to-many relationship)
When a service gets booked, I write it to the transaction table using user_id and services_id as foreign keys to the User and Services tables, meaning it's the id values that are saved.
When a client wants to view his or her transaction history, I provide it by running queries like:
price = transaction.service.price
service_name = transaction.service.name
total_cost = sum of all services selected
so as not to present the user with id values for price and service_name.
Now here is my problem: in the future, if the admin decides to change the name and price of a service and the client goes back to view his or her old transaction log, the new values get populated because I referenced them by ids, which is not what I want. I want the client to see the old values, as a receipt would show them, even after I update the services table.
What do you suggest I do in this case?
You should record every transaction made, capturing the price and the total amount at the moment the transaction was made. The Transaction model should have fields recording every detail about the transaction.
This means:
You would have a txn_service table, where all services in a transaction are saved and linked to the transaction table.
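A minimal sketch of that table as a Django model (model and field names are illustrative, not taken from your schema):
from django.db import models

class TxnService(models.Model):
    transaction = models.ForeignKey('Transaction', on_delete=models.CASCADE)
    service = models.ForeignKey('Service', on_delete=models.PROTECT)
    # Snapshots taken at booking time, immune to later edits of the service.
    service_name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=10, decimal_places=2)

def book_service(txn, service):
    # Copy the current name and price so the client's history reads like a receipt.
    TxnService.objects.create(
        transaction=txn, service=service,
        service_name=service.name, price=service.price)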

How can I select records n-at-a-time for multiple users to edit?

I am using a Django backend with postgresql.
Let's say I have a database with a table called Employees with about 20,000 records.
I need to allow multiple users to edit and verify the Area Code field for every record in Employees.
I'd prefer to allow a user to view the records, say, 30 at a time (to reduce burnout).
How can I select 30 records at a time from Employees to send to the front end UI for editing, without letting multiple users edit the same records, or re-selecting a record that has already been verified?
I don't need comments on the content of the database (these are example table and field names).
One way to do this would be to add 2 more fields to your table, say for example assigned_to and verified. You can update assigned_to, which can be a foreign key to the verifying user, when you allow the user to view that Employee. This will create a record preventing the Employee from being chosen twice. assigned_to can also double as a record of who verified this Employee for future reference.
verified could be simply a Boolean field which keeps track if the Employee has already been verified and can be updated when the user confirms the verification
The actual selects can be done like this:
employees = Employee.objects.filter(assigned_to=None, verified=False)[:30]
Then
for emp in employees:
    emp.assigned_to = user
    emp.save()
Note: This can still potentially cause a race condition if 2 users make this request at exactly the same time. To avoid this, another possibility is to partition the employee table into groups, one per user, with no overlap. This would ensure that no 2 users ever work on the same employees.
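If you are on PostgreSQL, another option (my own suggestion, not part of the answer above) is to take row locks inside a transaction with Django's select_for_update, so concurrent requests skip rows that are already claimed (skip_locked needs Django 2.0+ and PostgreSQL 9.5+):
from django.db import transaction

def claim_batch(user, batch_size=30):
    with transaction.atomic():
        employees = list(
            Employee.objects
            .select_for_update(skip_locked=True)  # concurrent callers skip locked rows
            .filter(assigned_to=None, verified=False)[:batch_size])
        for emp in employees:
            emp.assigned_to = user
            emp.save()
    return employees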

Deduplication / matching in CouchDB?

I have documents in CouchDB. The schema looks like this:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person as long as their email, personal_blog_url, or telephone are identical.
I have 3 views created, which basically map email/blog_url/telephone to userIds and then group the userIds under the same key, e.g.,
_view/by_email:
----------------------------------
key values
a_email@gmail.com [123, 345]
b_email@gmail.com [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Or whether it is a good practice to do such deduplication in couchdb?
Or what would be a good way to do a deduplication in couch then?
P.S. In the final view, suppose that for all dupes we only keep the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
function (doc) {
    // Emit each identifying field only if the document actually has it,
    // so users with partial data still get indexed.
    if (doc.email) emit([1, doc.email], [doc._id]);
    if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
    if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}
Merge the lists of ids in the reduce function.
When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id is smaller than the changed doc's id, update the changed doc's realId field; otherwise update the documents in the list with the changed doc's id.
I suggest using a separate document to store each { userId, realId } relation.
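Roughly, in Python (a sketch using requests; the database name, design document, view name, and the realid bookkeeping documents are all assumptions for illustration):
import requests

DB = "http://localhost:5984/users"

def on_change(doc):
    # Query the merged view for every identifying field the doc carries;
    # group=true lets the reduce step merge the id lists per key.
    keys = [[1, doc.get("email")],
            [2, doc.get("personal_blog_url")],
            [3, doc.get("telephone")]]
    resp = requests.post(DB + "/_design/dedup/_view/identities?group=true",
                         json={"keys": [k for k in keys if k[1]]})
    ids = {i for row in resp.json()["rows"] for i in row["value"]}
    ids.add(doc["_id"])
    real_id = min(ids)
    # Store each { userId, realId } relation in its own document.
    for dup in ids - {real_id}:
        requests.put(DB + "/realid_" + str(dup),
                     json={"userId": dup, "realId": real_id})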
You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key Values
--- ------
[userId, 'phone'] 777-555-1212
[userId, 'email'] username@example.com
[userId, 'url'] favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
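From Python, that query could look like this (a sketch using requests; the design document and view names are assumed):
import json
import requests

def identity_rows(user_id):
    # Array keys sort so [userId] .. [userId, {}] spans all of that user's rows.
    params = {"startkey": json.dumps([user_id]),
              "endkey": json.dumps([user_id, {}])}
    resp = requests.get(
        "http://localhost:5984/users/_design/dedup/_view/by_user_field",
        params=params)
    return resp.json()["rows"]  # one row per known phone/email/url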
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.

Doctrine 2.1 - Recording activities throughout entities

I want to be able to record actions a logged-in user does: persists, updates, etc.
I have set up discriminators etc. and it works perfectly; however, it only records newly persisted data...
so i have info on a table called user_actions,
1 - Added a new customer,
2 - Added a new memo
etc
however, it doesn't record any updates to entities in my db...
such as 1 - Updated user - id 1
...
I am thinking of dumping the discriminator superclass and using the old way to record, like creating a table with the fields:
id | action type | description | user ID | date
I'm not sure; what is the best way to log all transactions in Doctrine 2.1?
thanks
Have you considered HasLifecycleCallbacks? You can track not only PostPersist but also PostUpdate and PostRemove (or even the Pre* events).

Grouping Custom Attributes in a Query

I have an application that allows for "contacts" to be made completely customized. My method of doing that is letting the administrator setup all of the fields allowed for the contact. My database is as follows:
Contacts
id
active
lastactive
created_on
Fields
id
label
FieldValues
id
fieldid
contactid
response
So the contacts table only tells whether they are active, plus their identifier; the fields table only holds the label of the field and its identifier; and the fieldvalues table is what actually holds the data for contacts (name, address, etc.).
This setup has worked just fine for me up until now. The client would like to be able to pull a cumulative report, say a count of contacts per city, grouped by state. Effectively the data would have to look like the following:
California (from fields table)
    Costa Mesa (from fields table)   5 (counted in fieldvalues table)
    Newport                          2
Connecticut
    Wallingford                      2
    Clinton                          2
    Berlin                           5
The state field might be id 6 and the city field might be id 4. I don't know if I have just been looking at this code way too long to figure it out or what.
The SQL to create those three tables can be found at https://s3.amazonaws.com/davejlong/Contact.sql
You've got an Entity Attribute Value (EAV) model. Use the field and fieldvalue tables for searching only - the WHERE clause. Then make life easier by keeping the full entity's data in a CLOB off the main table (e.g. Contacts.data) in a serialized format (WDDX is good for this). Read the data column out, deserialize, and work with it on the server side. This is much easier than the myriad of joins you'd need to reproduce the fully hydrated entity from an EAV setup.
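A sketch of the serialized-data idea in Python (JSON standing in for WDDX; the Contacts.data column and the DB-API cursor are assumptions):
import json

def save_contact_data(cursor, contact_id, fields):
    # fields is e.g. {"state": "California", "city": "Costa Mesa", ...}
    cursor.execute("UPDATE Contacts SET data = %s WHERE id = %s",
                   (json.dumps(fields), contact_id))

def load_contact(cursor, contact_id):
    cursor.execute("SELECT data FROM Contacts WHERE id = %s", (contact_id,))
    (blob,) = cursor.fetchone()
    # One read and one deserialize; no EAV joins needed to hydrate the entity.
    return json.loads(blob)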