DDD - Can we update a value with an atomic database increment in the repository save method to avoid concurrency issues?

Hi everyone,
I don't understand why, when we talk about concurrency in DDD with aggregates, we don't just enforce it with an atomic database update (when we can).
Imagine I have to manage a quantity of concert tickets (a number) in a Concert entity.
The invariant here is: the quantity can't drop below 0 (obviously).
async execute(command: BookConcertCommand): Promise<any> {
  const concert = await this.concertRepository.findById(command.concertId);
  if (!concert) {
    return;
  }
  concert.ticketQuantity -= 1;
  await this.concertRepository.save(concert);
}
If two users buy a ticket at the same time, there may be a concurrency problem.
Why is the usual advice:
Optimistic concurrency
Pessimistic concurrency
Eventual consistency
...
when we could do just this in the save:
UPDATE concerts SET ticketQuantity = ticketQuantity - 1
And also, why can't we enforce the invariant inside this save with a WHERE clause like: WHERE ticketQuantity > 0?
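For reference, here is a minimal sketch of the atomic update I mean, in Python with sqlite3 purely for illustration (the concerts table and column names are taken from the example above):

import sqlite3

def book_ticket(conn: sqlite3.Connection, concert_id: int) -> bool:
    # Decrement and enforce the invariant in one atomic statement:
    # the WHERE clause guarantees the quantity never drops below 0.
    cur = conn.execute(
        "UPDATE concerts SET ticketQuantity = ticketQuantity - 1 "
        "WHERE id = ? AND ticketQuantity > 0",
        (concert_id,),
    )
    conn.commit()
    # rowcount is 0 when the concert is sold out (or missing),
    # so the caller can tell whether the booking succeeded.
    return cur.rowcount == 1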

Something worth keeping in mind when talking about DDD is the difference between analyzing the domain using DDD tools (BCs, aggregates, etc.) and implementing a domain model.
Usually, saying DDD implies both.
The approach you describe goes against a domain model, as you are making the invariants part of the SQL logic rather than of the domain model (think of it as a model in the sense of a mathematical model).
The approach you propose would also need two different "save" methods, e.g. one for buying a ticket vs. one for changing the event name.
I have usually heard this kind of pattern referred to as a Transaction Script approach, which is more or less the opposite of what DDD is.
DDD helps when it is not so easy to map this logic onto SQL queries.
Still, nothing prohibits you from studying the business in a DDD way (event storming?) and then implementing it as a transaction script.
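For contrast, a minimal sketch of keeping the invariant inside the domain model rather than in SQL (Python purely for illustration; names follow the question, and you would still need something like optimistic locking on save to make it safe under concurrency):

class SoldOutError(Exception):
    pass

class Concert:
    # The aggregate owns the invariant: quantity never drops below 0.
    def __init__(self, concert_id, ticket_quantity):
        self.id = concert_id
        self.ticket_quantity = ticket_quantity

    def book_ticket(self):
        if self.ticket_quantity <= 0:
            raise SoldOutError(f"Concert {self.id} is sold out")
        self.ticket_quantity -= 1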

Related

Django custom creation manager logic for temporal database

I am trying to develop a Django application that has built-in logic around temporal states for objects. The desire is to be able to have a singular object representing a resource, while having attributes of that resource be able to change over time. For example, a desired use case is to query the owner of a resource at any given time (last year, yesterday, tomorrow, next year, ...).
Here is what I am working with...
from django.db import models

class Resource(models.Model):
    id = models.AutoField(primary_key=True)

class ResourceState(models.Model):
    id = models.AutoField(primary_key=True)
    # Link the resource this state is applied to
    resource = models.ForeignKey(Resource, related_name='states', on_delete=models.CASCADE)
    # Track when this state is ACTIVE on a resource
    start_dt = models.DateTimeField()
    end_dt = models.DateTimeField()
    # Temporal fields, can change between ResourceStates
    owner = models.CharField(max_length=100)
    description = models.TextField(max_length=500)
I feel like I am going to have to create a custom interface to interact with this state. Some example use cases (interface is completely up in the air)...
# Get all of the states that were ever active on resource 1 (this is already possible)
Resource.objects.get(id=1).states.all()
# Get the owner of resource 1 from the state that was active yesterday, this is non-standard behavior
Resource.objects.get(id=1).states.at(YESTERDAY).owner
# Create a new state for resource 1, active between tomorrow and infinity (None == infinity)
# This is obviously non-standard if I want to enforce one-state-per-timepoint
Resource.objects.get(id=1).states.create(
    start_dt=TOMORROW,
    end_dt=None,
    owner="New Owner",
    description="New Description"
)
I feel the largest amount of custom logic will be required to do creates. I want to enforce that only one ResourceState can be active on a Resource for any given timepoint. This means that to create some ResourceState objects, I will need to adjust/remove others.
>> resource = Resource.objects.get(id=1)
>> resource.states.all()
[ResourceState(start_dt=None, end_dt=None, owner='owner1')]
>> resource.states.create(start_dt=YESTERDAY, end_dt=TOMORROW, owner='owner2')
>> resource.states.all()
[
    ResourceState(start_dt=None, end_dt=YESTERDAY, owner='owner1'),
    ResourceState(start_dt=YESTERDAY, end_dt=TOMORROW, owner='owner2'),
    ResourceState(start_dt=TOMORROW, end_dt=None, owner='owner1')
]
I know I will have to do most of the legwork around defining the logic, but is there any intuitive place where I should put it? Does Django provide an easy place for me to create these methods? If so, where is the best place to apply them? Against the Resource object? Using a custom Manager to deal with interacting with related 'ResourceState' objects?
Re-reading the above, it is a bit confusing, but this isn't a simple topic either!! Please let me know if anyone has any ideas for how to do something like the above!
Thanks a ton!
Too long for a comment, and purely some thoughts rather than a full answer, but having dealt with many date-effective records in financial systems (not in Django), some things come to mind:
My gut would be to start by putting it on the save method of the resource model. You are probably right in needing a custom manager as well.
I'd probably also flirt with the idea of an is_current boolean field on the state model, but some care would be needed with future-dated state records. If there is only one active state at a time, I'd also examine whether you need an end date at all. Having both start and end definitely makes the raw SQL queries (if ever needed) easier: date() between state.start and state.end <- this gives the current record; substitute any date to get that date's effective record.
Also, give some consideration to the open-ended end date where you don't know the end date: your queries will have to handle the NULLs properly. You may also need to consider the open-ended start date (say, for a load of historical data where the original start date is unknown).
I'd suggest staying away from using some super-early date as a fill-in (same for a date far in the future for unknown end dates). If you end up with lots of transactions, your query optimizer may thank you; however, I may be old and this may not matter anymore.
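A minimal sketch of the custom-manager side of this, assuming the models above and that start_dt/end_dt allow NULL (null=True) to represent open-ended ranges:

from django.db import models
from django.db.models import Q

class ResourceStateQuerySet(models.QuerySet):
    def at(self, when):
        # A state is effective at `when` if it started on or before `when`
        # (or has an open start) and ends after `when` (or has an open end).
        return self.get(
            Q(start_dt__lte=when) | Q(start_dt__isnull=True),
            Q(end_dt__gt=when) | Q(end_dt__isnull=True),
        )

On ResourceState you would then declare objects = ResourceStateQuerySet.as_manager(). Related managers inherit the default manager's queryset methods, so Resource.objects.get(id=1).states.at(YESTERDAY).owner works as in the question; the overlap-splitting create logic would live alongside it on the same manager.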
If you like to read about this stuff, I'd recommend a look at 1.8 in https://www.amazon.ca/Art-SQL-Stephane-Faroult/dp/0596008945/ and chapter 6:
"But before settling for one solution, we must acknowledge that
valuation tables come in all shapes and sizes. For instance, those of
telecom companies, which handle tremendous amounts of data, have a
relatively short price list that doesn't change very often. By
contrast, an investment bank stores new prices for all the securities,
derivatives, and any type of financial product it may be dealing with
almost continuously. A good solution in one case will not necessarily
be a good solution in another.
Handling data that both accumulates and changes requires very careful
design and tactics that vary according to the rate of change."

AWS DynamoDB Table Design: Store two UserIDs and Details in Table

I'm building an app where two users can connect with each other, and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection as one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests storing the connection (or friendship) in both directions, with two records, because it makes querying easier and faster:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data ("summary") into the tables so it can be read faster with just one query. That would mean saving the user details into the connection table as well, and saving the userIds into an attribute of the user table:
Connections:
#  userIdA  userIdB  userDetails                     status
1  123      456      { userId: 456, name: "Bob" }    connected
2  456      123      { userId: 123, name: "Alice" }  connected
Users:
#  userId  name   connections
1  123     Alice  { 456 }
2  456     Bob    { 123 }
This database model makes it pretty easy to query connections, but seems difficult to update when user details change. Also, I'm not sure if I need the userIds within the user table again, because I can easily query on a userId.
What do you think about that database model?
In general, nosql databases are often used with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if, during an update, some of the intermediate answers aren't right. For example, for a few seconds while Alice is becoming Bob's friend, it might be fine if "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using nosql it's generally because performance matters to you. It's also almost certainly because you care about the performance of operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; nosql is not generally the answer in that situation)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated:
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band - that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm-friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
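As a concrete illustration of the two-direction rows with duplicated summary details, here is a hedged sketch using boto3 (the Connections table and its attribute names are assumed from the question):

import boto3

dynamodb = boto3.resource("dynamodb")
connections = dynamodb.Table("Connections")  # assumed table name

def add_friendship(user_a, user_b):
    # One item per direction, so a single Query on the hash key returns
    # all of a user's connections; the summary details (name) are
    # duplicated so the friend list renders without a second lookup.
    with connections.batch_writer() as batch:
        batch.put_item(Item={
            "userIdA": user_a["userId"],
            "userIdB": user_b["userId"],
            "userDetails": {"userId": user_b["userId"], "name": user_b["name"]},
            "status": "connected",
        })
        batch.put_item(Item={
            "userIdA": user_b["userId"],
            "userIdB": user_a["userId"],
            "userDetails": {"userId": user_a["userId"], "name": user_a["name"]},
            "status": "connected",
        })

add_friendship({"userId": "123", "name": "Alice"}, {"userId": "456", "name": "Bob"})

Remember that whenever a name changes, you would queue a job (e.g. via SQS, as above) to rewrite every row that duplicates it.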
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (comma-separated, or an array of userIds, assuming a user can have multiple connections)
This structure can help ensure consistency across your data.

Identifying needed statistics - Azure SQL Data Warehouse

Is there any hint or directive that can be used with EXPLAIN of a query on Azure SQL Data Warehouse that would return recommended statistics that were not available to the optimizer? Alternatively, is there a tool that can analyze a workload and make recommendations?
Today, no. Right now the recommendation is to create statistics on every column, as these are needed to create an optimal parallel query plan (i.e. how to move data around between nodes to return a result, since it's an MPP architecture).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices#maintain-statistics
An example of how to script this out can be found here as well (example H).
https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics#examples-create-statistics
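To give a flavor of what such a script produces, here is a hedged Python sketch that emits one CREATE STATISTICS statement per column (the linked example H does this in pure T-SQL; the column list here is hard-coded for illustration, where it would normally come from sys.columns):

def create_statistics_ddl(schema, table, columns):
    # One single-column statistics object per column, named by convention.
    return [
        f"CREATE STATISTICS [stat_{schema}_{table}_{col}] "
        f"ON [{schema}].[{table}] ([{col}]);"
        for col in columns
    ]

for stmt in create_statistics_ddl("dbo", "FactSales", ["DateKey", "StoreKey", "Amount"]):
    print(stmt)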
As you know, statistics should be created (according to this article):
on columns involved in JOINs, GROUP BY, HAVING and WHERE clauses.
There are no tools to do this (yet), but if you have access to the EXPLAIN plans, they give you certain information. For example, the shuffle_columns element lists all columns involved in a SHUFFLE_MOVE:
<shuffle_columns>col;</shuffle_columns>
as well as myriad other information. Review the annotation I did of an Azure SQL Data Warehouse plan here.
Lastly (and I haven't actually done this, I've only been thinking about doing it), you could set up a copy of your database on SQL Server 2016, bearing in mind the syntax differences (e.g. distribution, lack of unique indexes, etc.). This would give you access to certain useful resources like execution plans, including index suggestions, and certain trace flags which tell you which stats were used. That said, the database engines and indexing are really different, so I don't know how worthwhile this might be. I'll post back if I progress my thinking on this. I do find the question "Why is this query going slow?" much harder to answer on this platform than on the ordinary "box product" SQL Server, because the tools aren't as mature yet.

Reducing the number of calls to MongoDB with mongoengine

I'm working to optimize a Django application that's (mainly) backed by MongoDB. It's dying under load testing. On the current problematic page, New Relic shows over 700 calls to pymongo.collection:Collection.find. Much of the code was written by junior coders, and normally I would look for places to add indices, make smarter joins and remove loops to reduce query calls, but joins aren't an option here. What I have done (after adding indices based on EXPLAINs) is try to reduce the cost in loops by making a general query and then filtering that smaller set in the loops*. While I've gotten the number down from 900 queries, 700 still seems insane, even with the intense amount of work being done on the page. I thought perhaps find was called even when filtering an existing queryset, but the code suggests it's always a database query.
I've added some logging to mongoengine to see where the queries come from and to look at EXPLAIN statements, but I'm not having a ton of luck sifting through the wall of info. mongoengine itself seems to be part of the performance problem: I switched to mongomallard as a test and got a 50% performance improvement on the page. Unfortunately, I got errors on a bunch of other pages (as best I can tell, Mallard doesn't do well when filtering an existing queryset; the error complains about a call to deepcopy that's happening in a generator, which you can't do - I hit a brick wall there). While Mallard doesn't seem like a workable replacement for us, it does suggest a lot of the processing time is spent converting objects to and from Python in mongoengine.
What can I do to further reduce the calls? Or am I focusing on the wrong thing and should be attacking the problem somewhere else?
EDIT: providing some code/models
The page in question displays the syllabus for a course, showing all the modules in the course, their lessons and the concepts under the lessons. For each concept, the user's progress in the concept is also shown. So there's a lot of looping to get the hierarchy teased out (and it's not stored according to any of the patterns the Mongo docs suggest).
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
The course's modules, lessons and concepts are stored in courseware_containers; we need to get all of the concepts so we can get the list of ids in teaching_element_instances to find the most recent one the user has worked on (if any) for that concept and then look up their progress.
* Just to be clear, I am using a profiler and looking at times and doing things The Right Way as best I know, not simply changing things and hoping for the best.
The code sample isn't bad per se, but there are a number of areas that should be considered and may help improve performance.
class CourseVersion(Document):
    ...
    course_instances = ListField(ReferenceField('CourseInstance'))
    courseware_containers = ListField(EmbeddedDocumentField('CoursewareContainer'))

class CoursewareContainer(EmbeddedDocument):
    id = UUIDField(required=True, binary=False, default=uuid.uuid4)
    ....
    courseware_containers = ListField(EmbeddedDocumentField('self'))
    teaching_element_instances = ListField(StringField())
Review
Unbounded lists.
course_instances, courseware_containers, teaching_element_instances
If these fields are unbounded and continuously grow, then the document will move on disk as it grows, causing disk contention on heavily loaded systems. There are two patterns to help minimise this:
a) Turn on power-of-two sizes. This will cost disk space but should lower the amount of IO churn as the document grows.
b) Initial padding - custom-pad the document on insert so it gets put into a larger extent, and then remove the padding. Really an anti-pattern, but it may give you some mileage.
The final barrier is the maximum document size: 16MB. You can't grow your data bigger than that.
Lists of ReferenceFields - course_instances
MongoDB doesn't have joins, so it costs an extra query to look up a ReferenceField - essentially they are an in-app join. That isn't bad per se, but it's important to understand the tradeoff. By default, mongoengine won't automatically dereference the field; only when you access course_version.course_instances will it do another query and then populate the whole list of references. So it can cost you another query - if you don't need the data, then exclude() it from the query to stop any leaking queries.
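For example (field names from the models above):

# Avoid the implicit in-app join when the references aren't needed.
version = CourseVersion.objects.exclude('course_instances').first()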
EmbeddedFields
These fields are part of the document, so there is no cost for them other than the wire cost of transmitting and loading the data. As they are part of the document, you don't need select_related to get this data.
teaching_element_instances
Are these a list of ids? The code sample above says it's a StringField. Either way, if you don't need to dereference the whole list, then storing the _ids as a StringField and manually dereferencing may be more efficient if coded correctly - especially if you just need the latest (last?) id.
Model complexity
The CoursewareContainer is complex. For any given CourseVersion you have n CoursewareContainers, which themselves have a list of n containers, and those each have n containers, and so on...
Finding the most recent instances
We need to get all of the concepts so we can get the list of ids in
teaching_element_instances to find the most recent one the user has
worked on (if any) for that concept and then look up their progress.
I'm unsure if there is a single instance you are after, or one per Container, or one per Course. Either way, the logic for querying the data should be examined. If it's a single instance you are after, then that could be stored against the user to simplify the logic of looking this up. If it's per course or container, then to improve performance ensure you minimise the number of queries: if possible, collect all the ids and then at the end issue a single $in query, rather than doing a query per container.
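A hedged sketch of that batching idea against the models above (TeachingElementInstance is an assumed model name):

def collect_te_ids(container):
    # Walk the container tree, collecting ids instead of querying per node.
    ids = list(container.teaching_element_instances)
    for child in container.courseware_containers:
        ids.extend(collect_te_ids(child))
    return ids

all_ids = []
for container in course_version.courseware_containers:
    all_ids.extend(collect_te_ids(container))

# One $in round trip at the end instead of a query per container.
instances = TeachingElementInstance.objects(id__in=all_ids)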
Mongoengine costs
Currently, there is a performance cost to loading the data into Mongoengine classes - if you don't need the classes and are happy to work with simple dictionaries then either issue a raw pymongo query or use as_pymongo.
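For example (the query itself is illustrative):

# Plain dicts straight from pymongo, skipping class hydration.
docs = CourseVersion.objects.only('courseware_containers').as_pymongo()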
Schema design
The schema looks logical enough, but is it suitable for the use case? In essence, is it using MongoDB's strengths, or is it putting a relational peg in a document-database-shaped hole? I can't answer that for you, but I do know the way to the happy path with MongoDB is to design the schema based on its use case. With relational databases, schema design from the outset is simple - you normalise; with document databases, how the data is used is a primary factor.
MongoDB best practices
There are many other best practices, and MongoDB has a guide which might be of interest: MongoDB Operations Best Practices.
Feel free to contact me via the Mongoengine mailing list to discuss further and if needs be discuss in private.
Ross

Unit Testing & Primary Keys

I am new to Unit Testing and think I might have dug myself into a corner.
In your Unit Tests, what is the better way to handle primary keys?
Hopefully an example will paint some context. Say I create several instances of an object (let's say Person).
My unit test is to test that the correct relationships are being created.
My code creates Homer and his children Bart and Lisa. He also has friends Barney, Karl & Lenny.
I've separated my data layer with an interface. My preference is to keep the primary key simple, e.g. on save, Person.PersonID = new Random().Next(10000); instead of, say, Barney.PersonID = 9110, Homer.PersonID = 3243, etc.
It doesn't matter what the primary key is, it just needs to be unique.
Any thoughts???
EDIT:
Sorry I haven't made it clear. My project is setup to use Dependency Injection. The data layer is totally separate. The focus of my question is, what is practical?
I have a class called "Unique" which produces unique objects (strings, integers, etc.). It makes sure they're unique per-test by keeping an internal static counter. That counter value is incremented per key generated and included in the key somehow.
So when I'm setting up my test
var foo = new Foo {
    ID = Unique.Integer()
};
I like this as it communicates that the value is not important for this test, just the uniqueness.
I have a similar class, 'Some', that does not guarantee uniqueness. I use it when I need an arbitrary value for a test. It's useful for enums and entity objects.
None of these are thread-safe or anything like that; it's strictly test code.
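A minimal sketch of that idea in Python, for illustration (the original is C#; the names are mine):

import itertools

class Unique:
    # Produces per-test-run unique values by embedding a counter.
    _counter = itertools.count(1)

    @staticmethod
    def integer():
        return next(Unique._counter)

    @staticmethod
    def string(prefix="unique"):
        return f"{prefix}-{next(Unique._counter)}"

foo_id = Unique.integer()  # communicates that only uniqueness matters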
There are several possible corners you may have dug yourself into that could ultimately lead to the question that you're asking.
Maybe you're worried about re-using primary keys and overwriting or incorrectly loading data that's already in the database (say, if you're testing against a dev database as opposed to a clean test database). In that case, I'd recommend you set up your unit tests to create their records' PKs using whatever sequence a normal application would or to test in a clean, dedicated testing database.
Maybe you're concerned about the efficacy of your code with PKs beyond a simple 1,2,3. Rest assured, this isn't something one would typically test for in a straightforward application, because most of it is outside the concern of your application: generating a number from a sequence is the DB vendor's problem, keeping track of a number in memory is the runtime/VM's problem.
Maybe you're just trying to learn what the best practice is for this sort of thing. I would suggest you set up the database by inserting records before executing your test cases, using the same facilities that your application itself will use to insert records; presumably your application code will rely on a database-vended sequence number for PKs, and if so, use that. Finally, after your test cases have executed, your tests should roll back any changes they made to the database to ensure the test is idempotent over multiple executions. This is my sorry attempt at describing a design pattern called test fixtures.
Consider using GUIDs. They're unique across space and time, meaning that even if two different computers generated them at the same exact instant in time, they would be different. In other words, they're effectively guaranteed to be unique. Random numbers are never good; there is a considerable risk of collision.
You can generate a Guid using the static class and method:
Guid.NewGuid();
Assuming this is C#.
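For comparison, the Python equivalent (assuming you just need an opaque unique value in a test) would be:

import uuid
test_id = uuid.uuid4()  # random UUID; collision risk is vanishingly small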
Edit:
Another thing, if you just want to generate a lot of test data without having to code it by hand or write a bunch of for loops, look into NBuilder. It might be a bit tough to get started with (Fluent methods with method chaining aren't always better for readability), but it's a great way to create a huge amount of test data.
Why use random numbers? Does the numeric value of the key matter? I would just use a sequence in the database and call nextval.
The essential problem with database unit testing is that primary keys do not get reused. Rather, the database creates a new key each time you create a new record, even if you delete the record with the original key.
There are two basic ways to deal with this:
Read the generated Primary Key from the database and use it in your tests, or
Use a fresh copy of the database each time you test.
You could put each test in a transaction and roll the transaction back when the test completes, but rolling back transactions doesn't always work with Primary Keys; the database engine will still not reuse keys that have been generated once (in SQL Server anyway).
When a test executes against a database through another piece of code, it ceases to be a unit test. It is called an "integration test", because you are testing the interactions of different pieces of code and how they "integrate" together. Not that it really matters, but it's fun to know.
When you perform a test, the following things should occur:
Begin a db transaction
Insert known (possibly bogus) test items/entities
Call the (one and only one) function to be tested
Test the results
Rollback the transaction
These things should happen for each and every test. With NUnit, you can get away with writing steps 1 and 5 just once in a base class and then inheriting from that in each test class. NUnit will execute [SetUp]- and [TearDown]-decorated methods in a base class.
In step 2, if you're using SQL, you'll have to write your queries such that they return the PK numbers back to your test code.
INSERT INTO Person(FirstName, LastName)
VALUES ('Fred', 'Flintstone');
SELECT SCOPE_IDENTITY(); --SQL Server example, other db vendors vary on this.
Then you can do this
INSERT INTO Person(FirstName, LastName, SpouseId)
VALUES ('Wilma', 'Flintstone', @husbandId);
SET @wifeId = SCOPE_IDENTITY();
UPDATE Person SET SpouseId = @wifeId
WHERE Person.Id = @husbandId;
SELECT @wifeId;
or whatever else you need.
In step 4, if you use SQL, you have to re-SELECT your data and test the values returned.
Steps 2 and 4 are less complicated if you are lucky enough to be able to use a decent ORM like (N)Hibernate (or whatever).
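Putting steps 1-5 together, here is a hedged sketch of the pattern in Python's unittest, with sqlite3 standing in for the real database (the answer above describes the same structure with NUnit and SQL Server):

import sqlite3
import unittest

class TransactionalTestCase(unittest.TestCase):
    # Steps 1 and 5 live in setUp/tearDown; subclasses do steps 2-4.

    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE Person (Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
        )
        self.conn.commit()  # persist the schema; each test's writes stay uncommitted

    def tearDown(self):
        self.conn.rollback()  # step 5: discard everything the test inserted
        self.conn.close()

class PersonTest(TransactionalTestCase):
    def test_insert_returns_pk(self):
        # Step 2: insert a known test row; read back the generated PK
        # rather than assuming a hard-coded value.
        cur = self.conn.execute(
            "INSERT INTO Person (FirstName, LastName) VALUES (?, ?)",
            ("Fred", "Flintstone"),
        )
        pk = cur.lastrowid
        # Steps 3-4: call the code under test, then re-SELECT and assert.
        row = self.conn.execute(
            "SELECT FirstName FROM Person WHERE Id = ?", (pk,)
        ).fetchone()
        self.assertEqual(row[0], "Fred")

(sqlite3 opens the transaction implicitly on the first write, which covers step 1.)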