Database Sharding with Concurrency Control and ACID Properties

Lately, I've been reading about database sharding, database concurrency control, and ACID properties, and I've been thinking about some scenarios that are a little bit tricky for me.
Suppose we want a transaction that transfers money from one account to another, and suppose customers (accounts) are sharded by country, e.g. US customers on one server separated from European customers on another (for the sake of scalability).
A transaction for such a system would look something like:
BEGIN TRANSACTION
UPDATE Account SET balance = balance - 100.0 WHERE id = 1;
UPDATE Account SET balance = balance + 100.0 WHERE id = 2;
COMMIT;
1. Suppose that Account #1 is in Europe and Account #2 is in the US. How are the ACID properties preserved in this situation? From the application we would have a separate session with each shard (separate database servers), which means separate transactions!
2. This might also be a problem for deadlock detection: how could we detect deadlocks in a concurrent application if the above transaction were executed by two different threads in opposite orders?
I know this would be easy if a single database held all the records, since it would have total control over the data, but with distributed databases I believe we need some communication between the databases, or perhaps a central agent, to handle such cases.

What you need is a consensus algorithm, so that the shards holding account 1 and account 2 can both agree that the $100 transaction happened between them.
This is a vast topic; there are people with PhDs who study nothing but consensus algorithms.
Well-known algorithms in this space include:
Paxos
Raft
Two-phase commit (2PC)
I would start with two-phase commit (the simplest of the three above) if the bar for extreme performance can be lowered a bit.
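To make the flow concrete, here is a minimal sketch of a two-phase commit coordinator in Python. The Shard class is only a stand-in for a real session with each database server (its prepare/commit/rollback would map onto the database's prepared-transaction support), so treat this as an illustration of the protocol rather than production code:

# Sketch only: Shard stands in for a connection to one database shard.
class Shard:
    def __init__(self, name):
        self.name = name

    def prepare(self, statement):
        # Ask the shard to durably stage the change and vote yes/no.
        print(f"{self.name}: PREPARE {statement}")
        return True  # a real shard would return False (or raise) if it cannot commit

    def commit(self):
        print(f"{self.name}: COMMIT PREPARED")

    def rollback(self):
        print(f"{self.name}: ROLLBACK PREPARED")


def transfer(eu_shard, us_shard, amount):
    work = [
        (eu_shard, f"UPDATE Account SET balance = balance - {amount} WHERE id = 1"),
        (us_shard, f"UPDATE Account SET balance = balance + {amount} WHERE id = 2"),
    ]
    prepared = []
    # Phase 1: every participating shard votes on its part of the transfer.
    for shard, statement in work:
        if not shard.prepare(statement):
            for p in prepared:  # someone voted no: abort everywhere that prepared
                p.rollback()
            raise RuntimeError(f"transfer aborted: {shard.name} voted no")
        prepared.append(shard)
    # Phase 2: all votes were yes, so the coordinator tells every shard to commit.
    for shard in prepared:
        shard.commit()


transfer(Shard("eu-shard"), Shard("us-shard"), 100.0)

Note that this sketch deliberately ignores the hard failure cases (the coordinator crashing between the two phases, a shard crashing after voting yes); surviving those is what the heavier consensus machinery is for.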


How do we know the amount of money in each account in blockchain?

I am trying to learn how blockchain works. I know that a blockchain is essentially a list of blocks containing transactions; what I cannot understand is how we then know how much money each account holds, since we are only maintaining a list of transactions.
In distributed platforms, two models of accounting for account balances are most commonly used:
a state model, where each node, after executing (or receiving) a transaction, updates the account's state record in its local database accordingly (Ethereum, Hyperledger Fabric)
an unspent-outputs (UTXO) model, where the account balance is formed from the sum of the transaction outputs that have not yet been spent (Bitcoin, Corda)
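As a toy illustration of the difference (made-up data, not any real chain's format):

# State model: every node keeps a mutable balance per account and updates it per transaction.
balances = {"alice": 0, "bob": 0}
for sender, receiver, amount in [(None, "alice", 50), ("alice", "bob", 20)]:
    if sender is not None:
        balances[sender] -= amount
    balances[receiver] += amount
print(balances["alice"])  # 30

# UTXO model: no balance field exists anywhere; the balance is derived on demand by
# summing the transaction outputs addressed to you that nothing has spent yet.
utxos = [
    {"owner": "alice", "amount": 50, "spent": True},   # consumed by the payment to bob
    {"owner": "bob",   "amount": 20, "spent": False},
    {"owner": "alice", "amount": 30, "spent": False},  # alice's change output
]
print(sum(u["amount"] for u in utxos if u["owner"] == "alice" and not u["spent"]))  # 30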

DynamoDB - Reducing number of queries

After my users log in, the app makes too many requests to DynamoDB, and I am thinking about different ways to reduce the number of calls.
The app allows user to trigger certain alerts that get sent to other users. For instance: "Shipment received, come to the deck", "Shipment completed", etc.
These are the calls made:
Get company's software license expiration date.
Get the computer's location in the building (i.e. "Office A").
Get the kinds of alerts that can be triggered (i.e. "Shipment received, come to the deck", "Shipment completed", etc).
Get information about the user (i.e. the company teams the user belongs to, and the admin level the user has, which can be 0, 1, 2, or 3).
Potential solutions I have thought about:
Put the company's license expiration date as an attribute of each computer (This would reduce the number of queries by 1). However, if I need to update the company's license expiration date, then I need to update it for EVERY SINGLE computer I have in the system, which sounds impractical to me since I may have 200, 300 or perhaps even more computers in the database.
Add the company's license expiration date as an attribute of the alerts (This would reduce the number of queries by 1); which seems more reasonable because there are only about 15 different kinds of alerts, so if I need to change the license expiration date later on, it is not too bad.
Cache information on the user's device; however, I can't seem to find a good strategy to keep the information stored locally as updated as possible.
I still think these 3 options do not sound too good, so I am hoping someone can point me in the right direction. Is there a good way to reduce the number of calls? I am retrieving information about 4 different entities (license, computer, alert, user); should I just keep those 4 calls after users log in?
Here are a few things that can be done for each component.
Get information about the user
Keep it in a session store and update the store whenever the details change. Session stores are usually implemented with a cache like Redis.
Computer location
Keep it in a distributed cache like Redis and initialize it lazily. Whenever a new write happens to a computer's location (rare, IMO), remove the entry from Redis using DynamoDB Streams and AWS Lambda.
Kind of alerts
Same as Computer location
License expiration date
If possible, don't allow the license expiry date to change (issue a new license in those cases, so that traceability is maintained) and cache the license expiry forever. Otherwise, treat it the same as the computer location.
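As a sketch of the lazy (cache-aside) lookup for the computer location: the "Computers" table name, "computer_id" key, and "location" attribute below are placeholders for whatever your schema uses, and the Streams + Lambda invalidation would live in a separate function.

import boto3
import redis

r = redis.Redis(host="localhost", port=6379)
computers = boto3.resource("dynamodb").Table("Computers")  # table name is an assumption


def get_computer_location(computer_id):
    cache_key = f"computer:{computer_id}:location"
    cached = r.get(cache_key)
    if cached is not None:
        return cached.decode()  # cache hit: no DynamoDB call at all

    # Cache miss: read from DynamoDB once, then populate Redis lazily.
    item = computers.get_item(Key={"computer_id": computer_id}).get("Item")
    if item is None:
        return None
    location = item["location"]
    # TTL is only a safety net; the Streams + Lambda invalidator deletes the key on writes.
    r.set(cache_key, location, ex=24 * 3600)
    return location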

Complex events spread over years

I have a scenario whereby, if part of a query matches an event, I want to fetch some other events from a datastore to test against the rest of the query,
e.g. "if JANE DOE buys from my store, did she buy anything else over the last 3 years?" sort of thing.
Does Flink, Storm or WSO2 provide support for such complex event processing?
Flink can do this, but it would require that you process all events starting from the earliest that you care about (e.g. 3 years ago), so that you can construct the state for each customer. Flink then lets you manage this state (typically with RocksDB) so that you wouldn't have to replay all the events in the face of system failures.
If you can't replay all of the history, then typically you'd put this into some other store (Cassandra/HBase, Elasticsearch, etc) with the scalability and performance characteristics you need, and then use Flink's async function support to query it when you receive a new event.
WSO2 Stream Processor lets you implement such functionality with its incremental time-based analytics feature. To implement the scenario you've mentioned, you can feed the events that are triggered when a customer arrives into a construct called an 'aggregate'. As you keep feeding events into an aggregate, it summarizes the data over time and saves it to a configured persistence store such as a database.
You can query this aggregate to get the state for a given period of time. For example, the following query fetches the name, total items bought, and average transaction value for the year 2014-2015:
from CustomerSummaryRetrievalStream as b join CustomerAggregation as a
on a.name == b.name
within "2014-01-01 00:00:00 +05:30", "2015-01-01 00:00:00 +05:30"
per "years"
select a.name, a.total, a.avgTxValue
insert into CustomerSummaryStream;

DynamoDB - limit on number of tables per account

We are working on deploying our product (currently on-prem) on AWS and are looking at DynamoDB as an alternative to Cassandra, mainly to avoid the DevOps costs associated with running a large number of Cassandra clusters.
The DynamoDB documentation says that the per-account limit on the number of tables is 256 per region, but that it can be increased by contacting AWS support. What is the maximum this limit can be raised to per account?
Our product is separated into distinct logical units, where each unit has several tables (say 100). Each customer can have several such units. Each logical unit can be backed up (i.e. a snapshot taken), and that snapshot can be restored at any time in the future (to overwrite the current content of all tables). The backup/restore performance - the time taken to take a snapshot or import old data for all the tables - needs to be good; it cannot be several minutes or hours.
We were thinking of using a distinct set of tables for each logical unit, so that backup/restore is quick using EMR on S3. But if we follow this approach, we will exceed the 256-table limit even with one customer. It looks like there are 2 options:
Create a new account for each logical unit of each customer. Is this possible? We will have a main corporate account, I suppose (I am still learning about this), but can it have a set of sub-accounts for our customers using IAM, each of which is considered an independent AWS account?
Use each table in a true multi-tenant manner, where the primary key contains the customer id + logical unit id. But in this scenario, when using EMR to back up an entire table, we will need to selectively back up a specific set of rows/items, which may number in the millions, while other read/write operations are going on against a different set of items. Is this feasible at large scale?
Any other thoughts on how to approach this?
Thanks for any info.
I would suggest changing the approach: rather than thinking about how to get more tables by creating more accounts, think about how to use fewer tables.
Having said that, you could contact support and increase the number of tables for your account.
I think you will run into a cost problem, though, due to the current pricing model of provisioning throughput per table.
Many people split tables based on time frame,
e.g. this week's table, last week's table, then roll the data into last month's table, and so on.
This helps when analyzing the data with EMR/Redshift, so you won't have to pull the whole table every time.
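For completeness, the "fewer tables" direction can look roughly like option 2 from the question: one shared table with the tenant and logical unit packed into the partition key. The table and attribute names below are placeholders, sketched with boto3.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("LogicalUnitItems")  # single shared table (name assumed)


def put_record(customer_id, unit_id, record_id, payload):
    table.put_item(Item={
        "pk": f"{customer_id}#{unit_id}",  # tenant + logical unit packed into one partition key
        "sk": record_id,
        "payload": payload,
    })


def records_for_unit(customer_id, unit_id):
    # One Query per logical unit; this is also the slice a selective backup would export.
    # (Pagination via LastEvaluatedKey omitted for brevity.)
    resp = table.query(KeyConditionExpression=Key("pk").eq(f"{customer_id}#{unit_id}"))
    return resp["Items"]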

Django webapp - tracking financial account information

I need some coding advice as I am worried that I am creating, well, bloated code that is inefficient.
I have a webapp that keeps track of a company's financial data. I have a table called Accounts with a collection of records corresponding to the typical financial accounts such as revenue, cash, accounts payable, accounts receivable, and so on. These records are simply name holders to be pointed at as foreign keys.
I also have a table called Account_Transaction which records all the transactions of money in and out of all the accounts in Accounts. Essentially, the Account_Transaction table does all the heavy lifting while pointing to the various accounts being altered.
For example, when a sale is made, two records are created in the Account_Transaction table. One record to increase the cash balance and a second record to increase the revenue balance.
Trans Record 1:
Acct: Cash
Amt: 50.00
Date: Nov 1, 2011
Trans Record 2:
Acct: Revenue
Amt: 50.00
Date: Nov 1, 2011
So now I have two records, but they each point to a different account. Now if I want to view my cash balance, I have to look at each Account_Transaction record and check if the record deals with Cash. If so, add/subtract the amount of that record and move to the next.
During a typical business day, there may be upwards of 200-300 transactions like the one above. As such, the Account_Transaction table will grow pretty quickly. After a few months, the table could have a few thousand records. Granted, this isn't much for a database; however, every time the user wants to know the current balance of, say, accounts receivable, I have to traverse the entire Account_Transaction table to sum up all records that deal with the account name "Accounts Receivable".
I'm not sure I have designed this in the most optimal manner. I had considered creating a distinct table for each account (one for "Cash", another for "Accounts Receivable", another for "Revenue", etc.), but with that approach I would be creating 15-20 tables with the exact same parameters, other than their name. That seemed like poor design, so I went with this Account_Transaction idea.
Does this seem like an appropriate way to handle this kind of data? Is there a better way to do this that I should really be adopting?
Thanks!
Why do you need to iterate through all the records to figure out the status of the Accounts Receivable account? Am I missing something in thinking you can't just use a .filter within the Django ORM to selectively pick the records you need?
As your records grow, you could add some date filtering to your reports. In most cases, your accountant will only want numbers for this quarter, month, etc., not entire historic data.
Add an index to that column to optimize selection, and then check out Django's aggregation support to Sum up values directly in the database.
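In Django ORM terms that could look roughly like this; the model and field names are placeholders for whatever the app actually uses, and note that Django indexes foreign key columns automatically, so the explicit index mainly matters for the date column:

from django.db import models
from django.db.models import Sum


class Account(models.Model):
    name = models.CharField(max_length=100)


class AccountTransaction(models.Model):
    account = models.ForeignKey(Account, on_delete=models.CASCADE)  # FK column is indexed by default
    amount = models.DecimalField(max_digits=12, decimal_places=2)
    date = models.DateField(db_index=True)  # indexed so date-range reports stay fast


def balance(account_name, since=None):
    # Let the database do the summing instead of looping over rows in Python.
    qs = AccountTransaction.objects.filter(account__name=account_name)
    if since is not None:
        qs = qs.filter(date__gte=since)  # e.g. only this quarter or month
    return qs.aggregate(total=Sum("amount"))["total"] or 0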
Finally, you could do some conservative caching to speed things up for "quick view" style reports where you just want a total number very quickly, but you need to be careful not to serve stale numbers, so resetting that cache on any change to the records would be a must.
Why don't you keep track of the current available amount in the Account table itself? Account_Transaction would then only be used to view the transaction history.
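For example, reusing the models sketched above and assuming Account gains a balance DecimalField, the history row and the stored balance can be kept in sync inside one database transaction:

from django.db import transaction
from django.db.models import F

# Account and AccountTransaction as sketched above, with a balance column added to Account.


@transaction.atomic
def post_transaction(account, amount, date):
    # Append to the history table and bump the stored balance in the same DB transaction,
    # so the two can never drift apart.
    AccountTransaction.objects.create(account=account, amount=amount, date=date)
    # F() makes the database perform the increment, avoiding read-modify-write races.
    Account.objects.filter(pk=account.pk).update(balance=F("balance") + amount)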