How to deal with big aggregate roots with lots of children? - concurrency

I'm quite new to DDD and in our recent project we faced a problem I didn't expect.
In our domain we model the budget of a company.
Keeping things simple, the budget is a table with a bunch of rows and columns. Every department in a company has a different budget for a given year. The table could have quite a few rows and a fixed set of columns (basically, a name and a value for every month).
The business rules our business people want to enforce are applied at the whole-table level: you can't have two rows with the same name, you can't edit a locked table, etc.
So, after some elaboration, we decided to make a single table an aggregate in this bounded context.
Then things started to be interesting.
Basically, we have two problems now:
One table could be edited by many users at the same time, and even if one person is editing a single cell while another person edits a completely different cell, from the aggregate's point of view those are parallel edits to the same aggregate. Each of them requires us to load the whole aggregate, apply the change, check the business rules, and save it back to the database.
Since we are using event sourcing, the load operation becomes slower and slower with every event we commit to the database, so we decided to use a snapshot approach: we take a snapshot every X events so that loading doesn't take too long. But at some point we realised that after a week one of our tables had thousands of edits and the snapshot event was a giant JSON string around 1,000,000 characters long. Even transferring it from the database is quite slow.
At this point I started to think that making the entire table an aggregate was a mistake and that we could take a more granular approach, but I don't know of any rule in DDD I could refer to, and I don't quite understand how I could enforce business rules on the entire table if I split the aggregate into rows or something like that.
Could any of you please tell me where I was wrong and what I could do to improve the model, with references to sources I could reason about with the team?

Aggregates are consistency boundaries, which generally means that you want to keep them small for the reasons you've encountered.
It probably makes sense to have the table be an aggregate so that table-level constraints and transitions (e.g. locked/unlocked state, for sure) can be enforced/handled. But I'm not sure the table needs to contain all the contents of the rows: the content of each row can be modeled as its own aggregate, with the table aggregate tracking the sequence of rows (obviously, accessing the rows would be through the aggregate root).
In this approach, the table aggregate can enforce, at write time, invariants that only touch the table, and a row aggregate can enforce, at write time, invariants that only touch a single row. Enforcement of invariants crossing multiple rows will have to be done via at least one projection, which means that the system cannot guarantee to reject every write that violates the invariant.
This implies that there will almost surely eventually be a case where the desired invariant is violated (in which case, calling it an invariant is a little bit of an abuse of terminology, but bear with me...), and the system (including users, operators, etc.) will need to take a compensating action to restore the invariant. The nature of the compensating action is a business concern: sometimes it can be automated, sometimes it's a matter of alerting for manual action.
If that's actually unacceptable, and the business demands are such that the table needs to be able to enforce invariants covering the content of multiple rows, then you are pretty well stuck with the giant table aggregate (note that you'd hit this problem even if you weren't doing DDD).
Depending on which language you're implementing in, you may find that taking advantage of the actor model and having each table be the internal state of a single actor pays off in performance terms. The actor effectively serves as an in-memory, always-current snapshot of the aggregate, because it only processes one message at a time; the latest persisted snapshot and the events since it only need to be replayed once to rehydrate the actor. If the table state is a million bytes of serialized JSON, data size alone doesn't force you onto the clustering features of some actor frameworks, so any actor model implementation should work. This talk explores aspects of this approach.
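For illustration, here is a minimal Python sketch of that single-writer idea, not tied to any actor framework (TableActor, rename_row and row_renamed are invented names, and persisting the produced events is left to the caller):

import queue
import threading

class TableActor:
    """Owns one table's state; handles one command at a time, so there are no parallel edits."""

    def __init__(self, snapshot_rows, events_since_snapshot):
        # Rehydrate once: start from the snapshot, replay only the newer events.
        self.rows = dict(snapshot_rows)              # row_id -> row name
        for event in events_since_snapshot:
            self._apply(event)
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def rename_row(self, row_id, new_name):
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((("rename_row", row_id, new_name), reply))
        result = reply.get()
        if isinstance(result, Exception):
            raise result
        return result                                # new events for the caller to persist

    def _run(self):
        while True:
            command, reply = self._mailbox.get()     # one message at a time
            try:
                events = self._decide(command)       # invariants checked against in-memory state
                for event in events:
                    self._apply(event)
                reply.put(events)
            except Exception as error:
                reply.put(error)

    def _decide(self, command):
        _, row_id, new_name = command
        if new_name in self.rows.values():           # table-wide invariant: unique row names
            raise ValueError("duplicate row name")
        return [("row_renamed", row_id, new_name)]

    def _apply(self, event):
        _, row_id, new_name = event
        self.rows[row_id] = new_name

Because only the actor's own thread touches self.rows, concurrent edits queue up instead of conflicting, and the full event history is replayed at most once, when the actor is created.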

If you have defined the budgeting domain around tables and rows then you may need to reconsider your design. I understand that many budgets are probably defined in spreadsheets, which is where the thinking comes from, but the domain terminology (ubiquitous language) would probably not be around a table/sheet and rows.
Chances are that the domain expert refers to budgeting items and probably states that an item is defined on a row and that column A is the item name and B is the January budget, etc.
The domain should be defined along the lines of the actual business language, which would probably make a lot more sense and keep aggregates substantially smaller and more manageable. I've worked on some pretty horrendous budgeting spreadsheets in my career, so I can imagine trying to store the history as a series of events would explode the storage quite quickly (even with snapshots). Not even a real spreadsheet stores its data as a series of events; there is an undo history in the current session, but that is about it.
I'm no financial guru but you probably have a budget item of sorts and then for each year you'll have a budget with the values for each month. One budget item is quite a bit simpler than everything in one large aggregate. The budget items can be grouped and categorised as a separate exercise.
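To make that concrete, a rough sketch of what such a smaller aggregate could look like (BudgetItem, set_month and the non-negative rule are invented for illustration, not taken from the actual domain):

from dataclasses import dataclass, field

@dataclass
class BudgetItem:
    """One budget item for one department and year; a small aggregate on its own."""
    department: str
    year: int
    name: str
    monthly_values: dict = field(default_factory=dict)   # month (1-12) -> amount

    def set_month(self, month, amount):
        if not 1 <= month <= 12:
            raise ValueError("month must be between 1 and 12")
        if amount < 0:
            raise ValueError("a budget amount cannot be negative")   # assumed rule, for illustration
        self.monthly_values[month] = amount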
It isn't inconceivable that I've misinterpreted your scenario but that would be my take on this.

Related

Algorithm or data structure for broadcast messages in 3D

Let's say some threads produce data, and every piece of data has an associated 3D coordinate. Other threads consume this data, and every consumer thread has a cubic volume of interest described by a center and a "radius" (size of the cube). Consumer threads can update their cube-of-interest parameters (e.g. move the cube) from time to time. Every piece of data is broadcast: a copy of it should be received by every thread whose cube of interest includes the data's coordinate.
What multi-threaded data structure can be used for this with the best performance? I am using C++, but a pointer to a generic algorithm is fine too.
Bonus: it would be nice if the algorithm could generalize to multiple network nodes (some nodes produce data and some consume it, with the same rules as for threads).
Extra information: there are more consumers than producers, and there are many more data broadcasts than cube-of-interest changes (cube size changes are very rare, but moving the cube is quite common). It's okay if a consumer starts receiving data for its new cube of interest only after some delay (but until then it should continue to receive data for the previous cube).
Your terminology is problematic. A cube by definition does not have a radius; a sphere does. A broadcast by definition is received by everyone, it is not received only by those who are interested; a multicast is.
I have encountered this problem in the development of an MMORPG. The approach taken in the development of that MMORPG was a bit wacky, but in the decade that followed my thinking has evolved so I have a much better idea of how to go about it now.
The solution is a bit involved, but it does not require any advanced notions like space partitioning, and it is reusable for all kinds of information that the consumers will inevitably need besides just 3D coordinates. Furthermore, it is reusable for entirely different projects.
We begin by building a light-weight data modelling framework which allows us to describe, instantiate, and manipulate finite, self-contained sets of inter-related observable data known as "Entities" in memory and perform various operations on them in an application-agnostic way.
Description can be done in simple object-relational terms. ("Object-relational" means relational with inheritance.)
Instantiation means that given a schema, the framework creates a container (an "EntitySpace") to hold, during runtime, instances of entities described by the schema.
Manipulation means being able to read and write properties of those entities.
Self-contained means that although an entity may contain a property which is a reference to another entity, the other entity must reside within the same EntitySpace.
Observable means that when the value of a property changes, a notification is issued by the EntitySpace, telling us which property of which entity has changed. Anyone can register for notifications from an EntitySpace, and receives all of them.
Once you have such a framework, you can build lots of useful functionality around it in an entirely application-agnostic way. For example:
Serialization: you can serialize and de-serialize an EntitySpace to and from markup.
Filtering: you can create a special kind of EntitySpace which does not contain storage, and instead acts as a view into a subset of another EntitySpace, filtering entities based on the values of certain properties.
Mirroring: You can keep an EntitySpace in sync with another, by responding to each property-changed notification from one and applying the change to the other, and vice versa.
Remoting: You can interject a transport layer between the two mirrored parts, thus keeping them mirrored while they reside on different threads or on different physical machines.
Every node in the network must have a corresponding "agent" object running inside every node that it needs data from. If you have a centralized architecture, (and I will continue under this hypothesis,) this means that within the server you will have one agent object for each client connected to that server. The agent represents the client, so the fact that the client is remote becomes irrelevant. The agent is only responsible for filtering and sending data to the client that it represents, so multi-threading becomes irrelevant, too.
An agent registers for notifications from the server's EntitySpace and filters them based on whatever criteria you choose. One such criterion for an Entity which contains a 3D-coordinate property can be whether that 3D-coordinate is within the client's area of interest. The center-of-sphere-and-radius approach will work, the center-of-cube-and-size approach will probably work even better. (No need for calculating a square.)
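As a rough illustration of that filtering step, a small Python sketch (Agent, CubeOfInterest and on_property_changed are invented names; in the described design this would be driven by the EntitySpace's property-changed notifications):

from dataclasses import dataclass

@dataclass
class CubeOfInterest:
    center: tuple        # (x, y, z)
    half_size: float     # half of the cube's edge length

    def contains(self, point):
        # Axis-aligned containment check: no distances, no squaring.
        return all(abs(p - c) <= self.half_size for p, c in zip(point, self.center))

class Agent:
    """Represents one remote consumer inside the producer's process."""
    def __init__(self, cube, send):
        self.cube = cube     # replaced (rarely) when the consumer moves its cube of interest
        self.send = send     # callable that ships data to the remote consumer

    def on_property_changed(self, entity_id, coordinate, payload):
        # Called for every notification; forward only what this consumer cares about.
        if self.cube.contains(coordinate):
            self.send((entity_id, payload))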

Are materialized views in redshift worth their costs

I'm working on a project in AWS Redshift with a few billion rows, where the main queries are rollups on time units. The current implementation has MVs for all these rollups. It seems to me that if Redshift is all it's cracked up to be, and the dist and sort keys are defined correctly, the MVs should not be necessary, given their costs in extra storage and maintenance (refresh). I'm wondering if anyone has analyzed this in a similar application.
You're thinking along the right path but the real world doesn't always allow for 'just do it better'.
You are correct that sometimes MVs are just used to forego the effort of optimizing a complex query, but sometimes not. The selection of keys, especially the distribution key, is a compromise between optimizing different workloads. Distribute one way and query A gets faster but query B gets slower. But if the results of query B don't need to be completely up to date, one can make an MV out of B and only pay the price on refresh.
Sometimes queries are very complex and time consuming (and not because they aren't optimized). If the results of such a query don't need to include the latest info to be valid, an MV can make the cost of running it infrequent. [In reality, MVs often represent complex subqueries that are referenced by a number of other queries, which accentuates the frequent vs. infrequent value of the MV.]
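As a sketch of that trade-off (the table, view and connection details below are made up; the CREATE/REFRESH MATERIALIZED VIEW statements follow Redshift's documented syntax):

import psycopg2  # assumes network access and credentials for the cluster

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="analytics", user="analyst", password="...")
cur = conn.cursor()

# Pay for the expensive rollup once per refresh instead of on every query.
cur.execute("""
    CREATE MATERIALIZED VIEW daily_rollup AS
    SELECT event_day, metric, SUM(value) AS total
    FROM facts
    GROUP BY event_day, metric;
""")

# Run on a schedule (e.g. at low-usage times); queries against the MV
# then see the data as of the last refresh.
cur.execute("REFRESH MATERIALIZED VIEW daily_rollup;")
conn.commit()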
Sometimes query types don't match well to Redshift's distributed, columnar nature and just don't perform well. Again, current-ness of data can be played off against cluster workload and these queries can be run at low usage times.
With all that said I think you are on the right path as I've also been trying to get people to see that many, many queries are just poorly written. Too often in the data world functionally correct equals done and in reality this is only half done. I've rewritten queries that were taking 90 minutes to execute (browning out the cluster when they ran) and got them down to 17 seconds. So keep up the good fight but use MVs as a last resort when compromise is the only solution.

Best way to partition AWS Athena tables for querying S3 data with high cardinality

We have a bucket in S3 where we store thousands of records every day (we end up having many GBs of data that keep increasing) and we want to be able to run Athena queries on them.
The data in S3 is stored in patterns like this: S3://bucket/Category/Subcategory/file.
There are multiple categories (more than 100) and each category has 1-20 subcategories. All the files we store in S3 (in Apache Parquet format) contain sensor readings. There are categories with millions of sensor readings (sensors send thousands per day) and categories with just a few hundred readings (sensors send on average a few readings per month), so the data is not split evenly across categories. A reading includes a timestamp, a sensorid and a value, among other things.
We want to run Athena queries on this bucket's objects, based on date and sensorid with the lowest cost possible. e.g.: Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
What is the best way to partition our Athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries? We have the freedom to save one reading per file, resulting in millions of files (easy to partition per sensorid or date, but what about performance if we have millions of files per day?), or multiple readings per file (far fewer files, but we can't directly partition per sensorid or date, because not all readings in a file are from the same sensor and we need to save them in the order they arrive). Is Athena a good solution for our case, or is there a better alternative?
Any insight would be helpful. Thank you in advance
Some comments.
Is Athena a good solution for our case or is there a better alternative?
Athena is great when you don't need or want to set up a more sophisticated big data pipeline: you simply put (or already have) your data in S3, and you can start querying it immediately. If that's enough for you, then Athena may be enough for you.
Here are a few things that are important to consider to properly answer that specific question:
How often are you querying? (i.e., is it worth having some sort of big data cluster running non-stop, like an EMR cluster? or is it better to just pay when you query, even if it means that per query your cost could end up higher?)
How much flexibility do you want when processing the dataset? (i.e., does Athena offer all the capabilities you need?)
What are all the data stores that you may want to query "together"? (i.e., is and will all the data be in S3? or do you or will you have data in other services such as DynamoDB, Redshift, EMR, etc...?)
Note that none of these answers would necessarily say "don't use Athena" — they may just suggest what kind of path you may want to follow going forward. In any case, since your data is in S3 already, in a format suitable for Athena, and you want to start querying it already, Athena is a very good choice right now.
Give me all the readings in that category above that value, or Give me the last readings of all sensorids in a category.
In both examples, you are filtering by category. This suggests that partitioning by category may be a good idea (whether you're using Athena or not!). You're doing that already, by having /Category/ as part of the objects' keys in S3.
One way to identify good candidates for a partitioning scheme is to think about all the queries (at least the most common ones) that you're going to run, and check which equality filters and groupings they use. E.g., thinking in terms of SQL: if you often have queries with WHERE XXX = ?, then XXX is a natural candidate for partitioning.
Maybe you have many more different types of queries, but I couldn't help but notice that both your examples had filters on category, thus it feels "natural" to partition by category (like you did).
Feel free to add a comment with other examples of common queries if that was just some coincidence and filtering by category is not as important/common as the examples suggest.
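To make the partitioning idea concrete, a hedged boto3 sketch (bucket, database, table and column names are all invented; adjust them to your schema). It defines an external table partitioned by category, so queries that filter on category only scan the matching prefixes:

import boto3  # assumes AWS credentials and a default region are configured

athena = boto3.client("athena")

def run(sql):
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "sensors"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )

# External table over the existing Parquet files, partitioned by category.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings (
    sensorid string,
    ts       timestamp,
    value    double
)
PARTITIONED BY (category string)
STORED AS PARQUET
LOCATION 's3://my-bucket/';
""")

# The existing prefixes are not in Hive key=value form (category=...),
# so each category prefix is registered as a partition explicitly.
run("""
ALTER TABLE sensor_readings ADD IF NOT EXISTS
PARTITION (category = 'temperature') LOCATION 's3://my-bucket/temperature/';
""")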
What is the best way to partition our athena table? And what is the best way to store our readings in S3 so that it is easier for Athena to run the queries?
There's hardly a single (i.e., best) answer here. It's always a trade-off based on lots of characteristics of the data set (structure; size; number of records; growth; etc) and the access patterns (proportion of reads and writes; kinds of writes, e.g. append-only, updates, removals, etc; presence of common filters among a large proportion of queries; which queries you're willing to sacrifice in order to optimize others; etc).
Here's some general guidance (not only for Athena, but in general, in case you decide you may need something other than Athena).
There are two very important things to focus on to optimize a big data environment:
I/O is slow.
Spread work evenly across all "processing units" you have, ideally fully utilizing each of them.
Here's why they matter.
First, for a lot of "real world access patterns", I/O is the bottleneck: reading from storage is many orders of magnitude slower than filtering a record in the CPU. So try to focus on reducing the amount of I/O. This means both reducing the volume of data read as well as reducing the number of individual I/O operations.
Second, if you end up with uneven distribution of work across multiple workers, it may happen that some workers finish quickly but other workers take much longer, and their work cannot be divided further. This is also a very common issue. In this case, you'll have to wait for the slowest worker to complete before you can get your results. When you ensure that all workers are doing an equivalent amount of work, they'll all be working at near 100% and they'll all finish at approximately the same time, so you don't have to keep waiting for the slower ones.
Things to have in mind to help with those goals:
Avoid too big and too small files.
If you have a huge number of tiny files, then your analytics system will have to issue a huge number of I/O operations to retrieve data. This hurts performance (and, in case of S3, in which you pay per request, can dramatically increase cost).
If you have a small number of huge files, depending on the characteristics of the file format and the worker units, you may end up not being able to parallelize work too much, which can cause performance to suffer.
Try to keep the file sizes uniform, so that you don't end up with a worker unit finishing too quickly and then idling (may be an issue in some querying systems, but not in others).
Keeping files in the range of "a few GB per file" is usually a good choice.
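For example, a periodic compaction job along these lines can merge many small Parquet files into fewer, larger ones (a pyarrow sketch with placeholder paths; for S3 you would pass an s3fs filesystem instead of a local directory):

import pyarrow.parquet as pq

# Read a prefix full of small Parquet files as one logical dataset...
table = pq.ParquetDataset("readings/temperature/").read()

# ...and rewrite it as a single, larger file; repeat per prefix on a schedule.
pq.write_table(table, "compacted/temperature/part-000.parquet")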
Use compression (and prefer splittable compression algos).
Compressing files greatly improves performance because it reduces I/O tremendously: most "real world" datasets have a lot of common patterns, and are thus highly compressible. When data is compressed, the analytics system spends less time reading from storage, and the "extra CPU time" spent to decompress the data before it can truly be queried is negligible compared to the time saved on reading from storage.
Keep in mind that some compression algorithms are non-splittable: one must start from the beginning of the compressed stream to access bytes in the middle. With a splittable compression algorithm, it's possible to start decompressing from multiple positions in the file. There are multiple benefits, including that (1) an analytics system may be able to skip large portions of the compressed file and only read what matters, and (2) multiple workers may be able to work on the same file simultaneously, since they can each access different parts of the file without having to go over the entire thing from the beginning.
Notably, gzip is non-splittable (but since you mention Parquet specifically, keep in mind that the Parquet format may use gzip internally, and may compress multiple parts independently and just combine them into one Parquet file, leading to a structure that is splittable; in other words: read the specifics about the format you're using and check if it's splittable).
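When writing Parquet with pyarrow, the codec is just a parameter (placeholder paths again):

import pyarrow.parquet as pq

table = pq.read_table("compacted/temperature/part-000.parquet")

# Snappy: fast, and the usual default for analytics workloads.
pq.write_table(table, "out/part-snappy.parquet", compression="snappy")

# Gzip compresses harder; because Parquet compresses each column chunk
# independently, the file still splits cleanly at row-group boundaries.
pq.write_table(table, "out/part-gzip.parquet", compression="gzip")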
Use columnar storage.
That is, storing data "per columns" rather than "per rows". This way, a single large I/O operation will retrieve a lot of data for the column you need rather than retrieving all the columns for a few records and then discarding the unnecessary columns (reading unnecessary data hurts performance tremendously).
Not only do you reduce the volume of data read from storage, you also improve how fast a CPU can process that data, since you'll have lots of pages of memory with useful data, and the CPU has a very simple set of operations to perform; this can dramatically improve performance at the CPU level.
Also, by keeping data organized by columns, you generally achieve better compression, leading to even less I/O.
You mention Parquet, so this is taken care of. If you ever want to change it, remember about using columnar storage.
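The practical payoff is being able to read only the columns a query needs, e.g. with pyarrow (placeholder path):

import pyarrow.parquet as pq

# Only the two listed columns are fetched from storage;
# everything else in the file is skipped entirely.
readings = pq.read_table("compacted/temperature/part-000.parquet",
                         columns=["sensorid", "value"])
print(readings.num_rows)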
Think about queries you need in order to decide about partitioning scheme.
Like in the example above about the category filtering, that was present in both queries you gave as examples.
When you partition like in the example above, you greatly reduce I/O: the querying system knows exactly which files it needs to retrieve, and avoids having to read the entire dataset.
There you go.
These are just some high-level guidance. For more specific guidance, it would be necessary to know more about your dataset, but this should at least get you started in asking yourself the right questions.

How to store data structures when certain operations need to be performed in faster than O(n) time?

(I'm a newcomer to databases, so apologies if this is a strange question. Feel free to disagree with my point of view if you think I'm not thinking clearly.)
Some data structures have support for operations that can be completed in better than O(n) time, where n is the number of items currently stored in the structure. For example, heaps allow for O(log n) insertion and deletion of items. I don't understand the correct way to store such data structures inside a database.
Question. In regards to databases in general, and in regards to Django 2.2.7 and Postgres 12.0 specifically, what is the correct way to store data structures when faster than O(n) operations are required?
The remainder of this question is elaboration and discussion.
Suppose, for example, that our database consists of two tables, one called Person and one called Task. Each Task has an associated Person field called assignee representing the person to whom the task was assigned, and an associated int field called priority representing the urgency of the task.
Now any given person in the real world might want to query the database for the highest-priority Task that they've been assigned. The simplest way to service such a request is just to go through each row of Task, one at a time. Unfortunately, assuming that each Person has at least one task, this quickly becomes inefficient as the number of rows in Person grows.
To improve the time-complexity of the query, we might add another column to the Person table with type list(Task), called tasks. The idea is that this field will maintain a list of all tasks that this person has been assigned. This change causes the database to use a little more space, but greatly improves performance when someone asks for the highest-priority task they've been assigned. (I'm not an expert, but I think this process of adding redundant information to improve performance is called 'denormalization' - can someone who knows their stuff confirm that I'm using this term correctly?)
Anyway, even with the aforementioned denormalization in place, there's still an issue. Namely, what happens if someone has a huge number of tasks associated to them? In this case, even if the Person table includes a tasks field, the amount of time it takes the database to service the request could be very high.
In my computer science degree, we were taught to solve these kinds of problems by choosing appropriate data structures. In this case, we would probably change the type of the tasks field, so that instead of pointing to an object of type list(Task) it instead points to an object of type heap(Task).
However the correct way to do this is very non-obvious to me. If the heap is stored as an array of items on the hard drive, and if every operation with the heap requires us to load the entire array into memory, perform the operation, and then store it again, well now we're back to O(n) time complexity just to perform one insertion or pop operation, which usually take O(log n) time.
So my question is really how to avoid this.
I don't understand the correct way to store such data structures inside a database.
You don't store data structures inside a database, you store data inside a database using the specific data structures offered by the database.
You mentioned PostgreSQL. That is a specific product, broadly compatible with the SQL database standard, using data structures compatible with the relational model of data. It defines the data structures it uses, and the time-complexity of using them. In your specific example, relational databases offer a data structure to solve your problem, called an index.
In my computer science degree, we were taught to solve these kinds of problems by choosing appropriate data structures.
Right. Having chosen the appropriate data structure, you then use a data storage product that implements it to store your data. A data structure is not a thing that can be stored.
Note that relational databases are just one way of storing and representing data. They have proven to be very useful, and offer a variety of data structures (most notably, the table and the index). But there are other data structures that cannot be implemented well by a relational database. In which case you use a different product. Redis, for example, bills itself as a data-structures server, and offers a specific set of data structures and access patterns that are quite different from a relational database. Graph databases would be another example.
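For instance, if the 'highest-priority task per assignee' lookup ever did outgrow a relational index, a data-structures server like Redis models it directly as a sorted set. A small sketch with the redis-py client (key and member names are invented):

import redis

r = redis.Redis()  # assumes a local Redis instance

# One sorted set per assignee: member = task id, score = priority.
r.zadd("tasks:314159", {"task:42": 5, "task:43": 9})

# Highest-priority task in O(log n), without scanning the whole set.
print(r.zrevrange("tasks:314159", 0, 0, withscores=True))  # [(b'task:43', 9.0)]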
The big-O notation may be misleading. For example, an index lookup is O(log n), but the base of the logarithm is so high that even for the biggest possible tables you don't have to read more than perhaps 6 index blocks.
It is similar with locating free space in a table. Even a large table has a free space map that is so small that locating a block with free space will be fast.
I think your concerns fall into the area of premature optimization.
If you want to manually implement all your own data structures, what are you doing using PostgreSQL in the first place? You use sophisticated software because it has already done those things for you.
create index on task (assignee, priority);
select * from task where assignee=314159 order by priority desc limit 1;
Unless you have tried this simple solution and it did not work, then there is nothing here which needs bespoke optimizations.
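In Django terms (the question mentions Django 2.2), roughly the same thing looks like this; the model and field names follow the question's example, and the composite index mirrors the SQL above:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)

class Task(models.Model):
    assignee = models.ForeignKey(Person, on_delete=models.CASCADE)
    priority = models.IntegerField()

    class Meta:
        indexes = [
            models.Index(fields=["assignee", "priority"]),  # (assignee, priority) composite index
        ]

def highest_priority_task(person_id):
    # The index turns this into a quick lookup rather than a scan of the whole Task table.
    return Task.objects.filter(assignee_id=person_id).order_by("-priority").first()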

DynamoDB Eventually consistent reads vs Strongly consistent reads

I recently came to know about the two read modes of DynamoDB, but I am not clear on when to choose which. Can anyone explain the trade-offs?
Basically, if you NEED to have the latest values, use a strongly consistent read. You'll get the guaranteed current value.
If your app is okay with potentially outdated information (mere seconds or less out of date), then use eventually consistent reads.
Use-cases for strongly consistent reads:
Bank balance (Want to know the latest amount)
Location of a locomotive on a train network (Need absolute certainty to guarantee safety)
Stock trading (Need to know the latest price)
Use-cases for eventually consistent reads:
Number of Facebook friends (Does it matter if another was added in the last few seconds?)
Number of commuters who used a particular turnstile in the past 5 minutes (Not important if it is out by a few people)
Stock research (Doesn't matter if it's out by a few seconds)
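In boto3 terms the choice is a single flag per read; a minimal sketch (table and key names are made up):

import boto3

table = boto3.resource("dynamodb").Table("Accounts")

# Strongly consistent: reflects every write acknowledged before the read
# (and consumes twice the read capacity of an eventually consistent read).
strong = table.get_item(Key={"account_id": "42"}, ConsistentRead=True)

# Eventually consistent (the default): may briefly return slightly stale data.
eventual = table.get_item(Key={"account_id": "42"})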
Apart from the other answers, in short, the reason for these read modes is:
Let's say you have a User table in the eu-west-1 region. Without you being aware of it, AWS handles multiple Availability Zones in the background, e.g. replicating your data in case of failure. Basically, there are copies of your table, and once you write an item, multiple resources need to be updated.
But when you read, there is a chance that you are reading from a not-yet-updated copy without being aware of it. It usually takes under a second for DynamoDB to update. This is why it's called eventually consistent: it will eventually be consistent, in a short amount of time :)
When making a decision, knowing this reasoning helps me understand and design my use cases.