CQRS - foreign key on read side database

I'm working on my first project using CQRS, and some things are not really clear to me.
Suppose I have in my model some customers and every customer has a list of orders.
In my read model (supported by a standard relational database), I will have a projection with the list of all customers. Moreover I'll have a projection with the list of all orders.
In this second projection, does it make sense to have a foreign key to the table with all the customers? Or is it better to denormalize immediately and also store all the relevant customer data in the orders table?

I think it depends on your requirements.
One school of thought is to denormalise the data for all your view models in so far as you would have one-table-per-view. On the other end of the spectrum you could keep a highly normalised database to support your views. You could also opt for somewhere in between. There are trade-offs in terms of speed, storage size, ease of use and scalability in these decisions. For example, if you had hundreds of very similar view models, it might make more sense to have a more normalised data model. Another example might be where one particular view generates orders of magnitude more traffic than any other view - you'd likely want to optimise this particular view more than the others. There isn't really a one-size-fits-all solution.
How about this crazy thought- do both ;) see which you prefer after working with them for a while. One of the great things about CQRS is you have the freedom to make these decisions. If you combine this with event sourcing and the ability to rebuild your views, then you can just change your mind later :)
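To make the "do both" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, event names and handler functions are all made up for illustration; they are not taken from the question. It keeps a normalised pair of projections (orders referencing customers) next to a denormalised one-table-per-view projection, and shows what each costs when a customer is renamed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant A: normalised read model, orders reference customers.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL NOT NULL
    );
    -- Variant B: denormalised, one table per view (customer data copied in).
    CREATE TABLE order_view (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        customer_name TEXT NOT NULL,
        total REAL NOT NULL
    );
""")


# Hypothetical projection handlers, one per event type.
def on_customer_registered(customer_id, name):
    conn.execute("INSERT INTO customers VALUES (?, ?)", (customer_id, name))


def on_order_placed(order_id, customer_id, total):
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (order_id, customer_id, total))
    # The denormalised view copies the customer name at write time.
    (name,) = conn.execute("SELECT name FROM customers WHERE id = ?",
                           (customer_id,)).fetchone()
    conn.execute("INSERT INTO order_view VALUES (?, ?, ?, ?)",
                 (order_id, customer_id, name, total))


def on_customer_renamed(customer_id, new_name):
    # Variant A: a single row changes.
    conn.execute("UPDATE customers SET name = ? WHERE id = ?",
                 (new_name, customer_id))
    # Variant B: every copied name has to be rewritten.
    conn.execute("UPDATE order_view SET customer_name = ? WHERE customer_id = ?",
                 (new_name, customer_id))


on_customer_registered(1, "Alice")
on_order_placed(10, 1, 99.5)
on_customer_renamed(1, "Alice Smith")

# Variant A pays with a join on every read; Variant B reads a single table.
normalised = conn.execute(
    "SELECT o.id, c.name, o.total FROM orders o "
    "JOIN customers c ON c.id = o.customer_id").fetchall()
denormalised = conn.execute(
    "SELECT order_id, customer_name, total FROM order_view").fetchall()
```

Both queries return the same rows; the difference is where the work happens (at read time for the join, at write time for the copied columns).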


Django model internationalization

What is the best way to have model fields translated in Django? I've thought about adding extra fields to the model (one field per language) or creating another model with all texts in every language, is there a recommended way to achieve that?
Thank you very much
NB : I first voted to close this as primarily opinion based but then it struck me that there were actually technical reasons to choose one solution or the other...
Both approaches are valid, and as a matter of fact you'll find reusable Django apps based on one or the other.
Technically, there are pros and cons to each design.
Using distinct "translation" objects means you'll have an additional join or query (to get both the "master" model and its translation(s)), but you have no overhead on the master model itself (without translation). Also, it makes create/update operations more complicated.
Using additional "hidden" per-language fields avoids the join / additional query overhead and keeps create/update operations simple, but makes records much bigger, so it has some overhead with respect to the database itself (page cache management etc.) and the volume of data going back and forth between your Django process and the database.
As a general rule, if you have to support a lot of languages and/or have to translate a lot of text fields for each model, you'll probably want to use a distinct model for translations, while if you have few languages and only a couple "translatable" fields per model the "hidden field" approach will be simpler to implement and will avoid the extra queries / joins.
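To show what the two designs look like at the database level, here is a rough sketch with Python's sqlite3 module rather than actual Django models. The `product` tables, the column names and the two helper functions are invented for the example; a real Django translation app would generate something similar for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Design 1: "hidden" per-language columns on the master table.
    CREATE TABLE product_wide (
        id INTEGER PRIMARY KEY,
        price REAL NOT NULL,
        title_en TEXT,
        title_fr TEXT
    );
    -- Design 2: a distinct translation table, one row per language.
    CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL NOT NULL);
    CREATE TABLE product_translation (
        product_id INTEGER NOT NULL REFERENCES product(id),
        lang TEXT NOT NULL,
        title TEXT NOT NULL,
        PRIMARY KEY (product_id, lang)
    );
""")
conn.execute("INSERT INTO product_wide VALUES (1, 9.99, 'Chair', 'Chaise')")
conn.execute("INSERT INTO product VALUES (1, 9.99)")
conn.executemany("INSERT INTO product_translation VALUES (?, ?, ?)",
                 [(1, "en", "Chair"), (1, "fr", "Chaise")])


def title_wide(product_id, lang):
    # Design 1: no join, but every row carries every language.
    # The column name comes from a fixed whitelist, never from user input.
    column = {"en": "title_en", "fr": "title_fr"}[lang]
    row = conn.execute(
        f"SELECT {column} FROM product_wide WHERE id = ?",
        (product_id,)).fetchone()
    return row[0]


def title_joined(product_id, lang):
    # Design 2: the master table stays narrow, at the price of a join.
    row = conn.execute(
        "SELECT t.title FROM product p "
        "JOIN product_translation t ON t.product_id = p.id "
        "WHERE p.id = ? AND t.lang = ?",
        (product_id, lang)).fetchone()
    return row[0]
```

Adding a language means a schema change in the first design (a new column per translatable field) and only new rows in the second, which is one reason the second tends to win when there are many languages.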
As far as I'm concerned, I've had experience with both solutions and found the "hidden field" solution (using django-modeltranslation) to work fine for our current needs (four languages supported and we should not get many more, roughly four translatable fields per model at most, and those models are rarely updated so we can cache aggressively if needed), but you may have totally different needs.
In all cases, don't even try implementing this from scratch; use one of the existing Django apps instead. It will save you a lot of time and pain.
Check out their docs django translation
This will help too
Localization: How to Create Language Files - Python Django Tutorials

Correct implementation of the Filter (Criteria) Design Pattern

The design pattern is explained here:
I'm working on software very similar to Adobe Lightroom or ACDSee but with different purposes. The user (photographer) is able to import thousands of images from his hard drive (it wouldn't be weird to have over 100k/200k images).
We have a side panel where users can create custom "filters" which are expressions like:
Does contain the keyword: "car"
Does not contain the keyword "woods"
Camera model is "Nikon D300s"
Camera model is "Canon 7D Mark II"
Directory is "C:\today_pictures"
You can get the idea from the above example.
We have a SQLite database where all image information is stored. The question is: should we load ALL Photo objects into memory from the database the first time the program is loaded and implement the Criteria/Filter design pattern as explained in the website cited above, so that our Criteria classes filter objects? Or is it better to have the Criteria classes generate an SQL query that is then executed to retrieve only what's needed from the database?
We are developing the program with C++ (Qt).
TL;DR: It's already properly implemented in SQLite, and look at how long that took. You'll face the same burden.
It'd be a horrible case of data duplication to read the data from the database and store it again in another data structure. Use database queries to implement the query that the user gave you. Let the database execute the query. That's what databases are for.
By reimplementing a search/query system for ~500k records, you'll be rewriting large chunks of a bog-standard database yourself. It'd be a mostly pointless exercise. SQLite is very well tested and is essentially foolproof. It'll cost you thousands of hours of work to reimplement even a small fraction of its capabilities and reliability/resiliency. If that doesn't scream "reinventing the wheel", I don't know what does.
The database also allows you to very easily implement lookahead/dropdowns to aid the user in writing the query. For example, as the user types out "camera model is", you can offer autocompletion or a dropdown to select one or more models from.
You paid the "price" of a database, it'd be a shame for it all to go to waste. So, use it. It'll give you lots of leverage, and allow you to implement features two orders of magnitude faster than otherwise.
The pattern you've linked to is just a pattern. It doesn't mean that it's an exact blueprint of how to design your application to perform on real data. You'll eventually be fighting things such as concurrency (a file scanning thread running to update the metadata), indexing, resiliency in the face of crashes, etc. In the end you'll end up with big chunks of SQLite reimplemented for your particular application. 500k metadata records are not much: if you design your query translator well and support it with proper indexes, it'll work perfectly well.
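A minimal sketch of such a "query translator" follows, written in Python with the standard sqlite3 module for brevity (the same structure carries over to C++/Qt). The schema (`photo`, `photo_keyword`), the criterion classes and `find_photos` are all assumptions made up for this example. Each criterion only knows how to render itself as a parameterised SQL fragment, and the database does the actual filtering.

```python
import sqlite3


class HasKeyword:
    """Criterion: the photo does (or does not) carry a keyword."""

    def __init__(self, keyword, negate=False):
        self.keyword = keyword
        self.negate = negate

    def to_sql(self):
        op = "NOT EXISTS" if self.negate else "EXISTS"
        sql = (f"{op} (SELECT 1 FROM photo_keyword k "
               "WHERE k.photo_id = p.id AND k.keyword = ?)")
        return sql, [self.keyword]


class CameraIs:
    """Criterion: the photo was taken with the given camera model."""

    def __init__(self, model):
        self.model = model

    def to_sql(self):
        return "p.camera_model = ?", [self.model]


def _combine(parts, joiner, if_empty):
    # Join the sub-criteria's fragments and concatenate their parameters.
    if not parts:
        return if_empty, []
    clauses, params = [], []
    for part in parts:
        sql, args = part.to_sql()
        clauses.append(f"({sql})")
        params.extend(args)
    return joiner.join(clauses), params


class AllOf:
    def __init__(self, *parts):
        self.parts = parts

    def to_sql(self):
        return _combine(self.parts, " AND ", "1")


class AnyOf:
    def __init__(self, *parts):
        self.parts = parts

    def to_sql(self):
        return _combine(self.parts, " OR ", "0")


def find_photos(conn, criterion):
    # User-supplied values only ever travel as bound parameters.
    where, params = criterion.to_sql()
    rows = conn.execute(
        f"SELECT p.path FROM photo p WHERE {where} ORDER BY p.path", params)
    return [path for (path,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photo (id INTEGER PRIMARY KEY, path TEXT, camera_model TEXT);
    CREATE TABLE photo_keyword (photo_id INTEGER, keyword TEXT);
    CREATE INDEX idx_photo_camera ON photo (camera_model);
    CREATE INDEX idx_keyword ON photo_keyword (keyword, photo_id);
""")
conn.executemany("INSERT INTO photo VALUES (?, ?, ?)", [
    (1, "a.jpg", "Nikon D300s"),
    (2, "b.jpg", "Canon 7D Mark II"),
    (3, "c.jpg", "Nikon D300s"),
])
conn.executemany("INSERT INTO photo_keyword VALUES (?, ?)",
                 [(1, "car"), (2, "car"), (2, "woods"), (3, "woods")])

user_filter = AllOf(
    HasKeyword("car"),
    HasKeyword("woods", negate=True),
    AnyOf(CameraIs("Nikon D300s"), CameraIs("Canon 7D Mark II")),
)
matches = find_photos(conn, user_filter)
```

The Criteria pattern survives intact; only the "apply to a list of objects" step is replaced by "render to SQL", so no Photo objects need to be loaded up front.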

When creating models in Django, should database normalization be taken into consideration?

I'm new to Django and web frameworks in general, but have worked with DBMSs quite often. Knowing that each class within django models maps to a table in the database, should the models be based on an ERD where tables are normalized? Would normal form matter in this case? Thanks!
Assuming you're using a SQL backend (i.e. not something like mongodb), then the same guidelines for normalisation would apply. Remember, django is just a pretty way to access the database, but in the background you're still executing a series of sql queries which will benefit in the same way from normalisation.
That said, a lot of the business logic that you would normally build into the database can now be handled by django, so it is possible to get away with a slightly de-normalised structure if it makes working with it easier. The approach I usually take is to normalise where it makes sense to avoid duplication, and de-normalise where a normalised structure would result in really complicated queries (django doesn't like complicated queries). I ensure consistency in the data through the use of signal receivers or by overriding the save method.

When to use Haystack/ElasticSearch vs Django's ORM

So I implemented Haystack with ElasticSearch a week ago within our BETA application. One thing I have noticed is that getting a large amount of data back to our users (for example listing all the users within the application) is much faster by going through Haystack than Django's ORM. Now, I will be releasing a REST service (with TastyPie) to serve tablets within the next weeks, as I want to be able to access the information from iPads, Nexus tablets and so on.
One thing I was wondering, is when should I be querying the ORM vs Haystack/ElasticSearch? For example, if the user on the tablet is requesting a specific set of users, should we let TastyPie query the ORM, or go to ElasticSearch?
If we look at this answer Django: Haystack or ORM, we can all agree that a DB is made to retrieve and write data. However, could we say that retrieving data can be faster with Haystack/ElasticSearch once the search engine has been updated?
I am a bit confused: if Haystack is so much faster, when should we not be querying it?!
To make things clear, I guess you're talking about querying Elasticsearch via Haystack without later instantiating any objects for your search results with data from your database.
Some points to consider besides the points mentioned in the other post:
A search engine like Elasticsearch is highly optimized for full-text searches (with SQL, this depends heavily on the database/engine you are using).
Queries that involve a lot of relations/joins will most likely be easier to handle with the ORM, but on the other hand you can, e.g., save data from foreign-key relations in a denormalized fashion when using ES, which could give you a performance boost. Of course you can denormalize your database tables as well, but this is quite often considered bad practice unless you know what you are doing, e.g. when solving a performance bottleneck.
ES is relatively easy to scale, while scaling your SQL DB might be more complicated.
Most likely this is a decision that depends very much on your use case, the amount of data to process and the queries you are intending to run. So the best thing of course is - as always - to do some benchmarking yourself and compare these two solutions. But don't do any premature optimisations, as one big advantage of the ORM is that it keeps things simple - you don't have to care much about the integrity of your data or maintain an additional system.

case studies or examples of high throughput services with highly dynamic data

I'm looking for some architecture ideas on a problem at work that I may have to solve.
the problem.
1) our enterprise LDAP has become a "contact master" filled with years of stale data and unused and unmaintained attributes.
2) management has decided that LDAP will no longer serve as a company phone book. it is for authorization purposes only.
3) the company has contact type data about people in hundreds of different sources. we need to scrub all the junk out of LDAP and give the other applications a central repo to store all this data about a person.
the ideal goal
1) have a single source to store all the various attributes about a person
2) the company probably has info on 500k people ( read 500K rows)
3) i estimate there could be 500 to 1000 optional attributes on these people. (read 500+ columns)
4) data would primarily be set/get via xml over jms (this infrastructure is already in place)
5) individual groups within the company could "own" columns. only they would be allowed to write to their columns, they would be responsible for keeping the data clean.
6) a single record lookup should be returned in sub seconds
7) system should support 1 million requests per hour at peak.
8) the primary goal is to serve real time data to the enterprise, reporting is a secondary goal.
9) we are a java, oracle, teradata shop. we are your typical big IT shop.
my thoughts:
1) originally i thought LDAP might work, but it doesn't scale when new columns are added.
2) my next thought was some kind of no-sql solution, but from what i have read, I don't think i can get the performance I need, and it's still relatively new. I'm not sure i can get my manager to sign off on something like that for such a critical project.
3) i think there will be a meta-data component to the solution that will track who owns the columns and what each column represents, and the original source system.
Thanks for reading, and thanks in advance for any thoughts.
With Teradata-grade tools an SQL-based solution may be feasible. I came across an article on database design a while ago that discussed "anchor modeling".
Basically, the idea is to create a single, dumb, synthetic primary key table, while all real or meta data lives in other tables (subsets) and is attached by way of a foreign key + join.
I see the benefit of this design being two-fold. First, you can more easily compartmentalize data storage either for organizational or performance reasons. Second, you only create additional rows for records that have data in any given subset, so you use less space and indexing and searching are faster.
Subsets might be based on maintainer or some other criteria. XML set/get would be per-subset/record (rather than global record). All subsets for a given record can be composited and cached. Additional subsets can be created for metadata, search indexes, etc., and these can be queried independently.
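As a rough illustration of the layout described above (a sketch only, using Python's sqlite3 module; the subset tables, their columns and `composite_record` are invented for the example and not taken from any anchor modeling tool): one key-only table, one table per owning group, and a composite view built with left joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    -- The "dumb" anchor: nothing but a synthetic key.
    CREATE TABLE person (person_id INTEGER PRIMARY KEY);
    -- One subset per owning group; a row exists only if there is data.
    CREATE TABLE person_contact (
        person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
        desk_phone TEXT,
        mobile_phone TEXT
    );
    CREATE TABLE person_facilities (
        person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
        building TEXT,
        badge_number TEXT
    );
""")
conn.executemany("INSERT INTO person VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO person_contact VALUES (1, '555-0100', NULL)")
conn.execute("INSERT INTO person_facilities VALUES (2, 'HQ-3', 'B-778')")


def composite_record(person_id):
    """Assemble one person's attributes from every subset table."""
    row = conn.execute(
        "SELECT p.person_id, c.desk_phone, c.mobile_phone, "
        "       f.building, f.badge_number "
        "FROM person p "
        "LEFT JOIN person_contact c ON c.person_id = p.person_id "
        "LEFT JOIN person_facilities f ON f.person_id = p.person_id "
        "WHERE p.person_id = ?", (person_id,)).fetchone()
    if row is None:
        return None
    # Drop attributes nobody has filled in for this person.
    return {key: row[key] for key in row.keys() if row[key] is not None}
```

Write permissions can then be granted per subset table, which maps onto the "groups own their columns" requirement, and the result of `composite_record` is the natural thing to cache.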
NoSQL seems similar to LDAP (in theory, at least) but the benefit of a good NoSQL tool would include greater abstraction of metadata, versioning, and organization. In fact, from what I've read it seems that NoSQL datastores are designed to address some of the issues you've raised with respect to scaling and loosely structured data. There's a good question on SO regarding datastores.
Production NoSQL
Off-hand, there are a handful of large companies using NoSQL in massively-scaled environments, such as Google's Bigtable. It seems like the perfect tool for:
6) a single record lookup should be returned in sub seconds
7) system should support 1 million requests per hour at peak.
Bigtable is only available (to my knowledge) through AppEngine. Other, similar technologies are listed here.
Other Thoughts
The bigger picture view looks more or less the same regardless of the technology you decide to use. E.g. compartmentalize storage, composite views, cache views, stick metadata somewhere so you can find things.
The performance characteristics you're targeting are going to require some kind of caching and/or optimization based on real-world usage patterns. Regardless of the solution you choose, you probably can't resolve that in the design phase.
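For a sense of scale, the peak figure from the question works out to roughly 278 requests per second on average. The quick calculation below uses only the numbers given in the question; real traffic will be burstier than a flat average, so treat it as a lower bound for capacity planning rather than a target.

```python
PEAK_REQUESTS_PER_HOUR = 1_000_000
SECONDS_PER_HOUR = 3600

# Average arrival rate during the peak hour.
requests_per_second = PEAK_REQUESTS_PER_HOUR / SECONDS_PER_HOUR

# Time available per request if they were handled strictly one at a time.
serial_budget_ms = 1000 / requests_per_second

print(f"{requests_per_second:.0f} requests/s on average at peak")
print(f"{serial_budget_ms:.1f} ms per request if handled serially")
```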
A couple thoughts:
1) our enterprise LDAP has become a "contact master" filled with years of stale data and unused and unmaintained attributes.
This isn't really a technological problem. You will have this problem with a new system as well, LDAP or not.
"LDAP ... doesn't scale"
There are lots of huge LDAP systems out there. LDAP is surely a dark art, but I'd be willing to bet that it scales better than any SQL equivalent in this situation. Not to mention that LDAP is a standard for this kind of info, and as such it is accessible from zillions of different kinds of systems.
Maybe what you're looking for is a new LDAP system that's easier to manage / has better admin tools?
You may want to look into Len Silverston's Party Model. Here's a link to his book: http://www.amazon.com/Data-Model-Resource-Book-Vol/dp/0471380237.
I have no experience building something on that scale, though I think that thinking of it as 500k rows x 500 - 1000 columns sounds a bit ridiculous.