I'm trying to understand the layout of the microservices pattern. Given that each microservice would run on its on VM (for sake of example) how does the database fit into this architecture?
Would each service, in turn, connect to the consolidated database to read/write data?
Thanks for any insight
There's no one size fits all solution.
The general principle is that each microservice should make the right decision for itself in terms of what the right persistence architecture should be. It might be connected to a central SQL database, or it could be using a filesystem, or it could be using NoSQL data store, or memcached, or whatever. (This is why people talk about eventual consistency a lot with microservices.)
You want to do it this way to really capture the benefits of microservices.
You want each microservice to be independently shippable, so that you're not blocked on anything. Stronger coupling to centralized infrastructure reduces the independence of the microservice.
Persistence requirements are highly variable. If you're running a search microservice, you don't need the ACID semantics of a typical SQL database. If you're doing payments, you need ACID. If you're storing and processing images, you might just use the filesystem. Etc.
In my experience when dealing with mSOA it always comes to Data Warehouse solution in the end. And this is the natural choice if you have a dedicated DB (cluster) per micro-service. After all the business should be able to use that info from your domain. Even Data Vault Modeling will be a good fit here.
Related
Amazon said
NoSQL design requires a different mindset than RDBMS design. For an RDBMS, you can create a normalized data model without thinking about access patterns. You can then extend it later when new questions and query requirements arise. For DynamoDB, by contrast, you shouldn't start designing your schema until you know the questions it needs to answer. Understanding the business problems and the application use cases up front is absolutely essential.
It seems that I should design the tables after designing the product for efficient query cost.
But a product can be pivoted or be appended new features. In early stage, nobody knows where the product goes.
Is dynamodb suitable for growing or pivotable product?
In my opinion, the main benefit of Dynamo DB over other NoSQL solutions is that it is a managed database service. You pay for reads and writes and you never worry about scaling to handle larger data, more users. If you are doing a prototype or don't have technical know-how to setup a database server and host in the cloud it could be useful and cost effective. It has its limitations however so if you do have technical resources consider another open source NoSQL option.
I think that statement by Amazon is confusing and is probably more marketing than anything else. Use NoSQL in cases where your data is only accessed in distinct elements that do no have to be combined in a complex manner. It's also helpful if you don't have an exact schema defined because NoSQL doesn't require a hard set schema you can store any fields in a table and you can always add new fields. This is helpful when things are changing rapidly and you don't want to migrate everything as strictly as an RDBMS would require. If however you're going to have to run complex logic or calculations combining data from across tables you should use an RDBMS. You could use NoSQL for some data and and RDBMS for other data in a hybrid fashion but in that case you probably wouldn't want to use Dynamo DB because you'd want full ownership to set it up properly. Hope this helps I'm sure others have more to say and I welcome comments to help me refine my answer.
When building a microservice oriented application, i wonder what could be the appropriate microservice granularity.
Let's image an application consisting of:
A set of various resources types where each resource map a given business model. (ex: In a todo app resources could be User, TodoList and TodoItem...)
Each of those resources are saved within a NoSQL database that could be replicated.
Each of those resources are exposed through a REST Api
The application manage an internal chat room.
An Api gateway for gathering chat room and REST api interaction.
The application front end: an SPA application connected to the API Gateway
The first (and naive) approach when thinking about how microservices could match the need of this application would be:
One monolith service for managing EVERY resources and business logic:
By managing i mean providing the REST API for all of those resources and handling the persistance of those resources within the database.
One service for each Database replica.
One service providing the internal chat room using websocket or whatever.
One service for Authentification.
One service for the api gateway.
One service serving the static assets for the SPA front end.
An other approach could be to split service 1 into as many service as business models exist in the system. (let's call those services resource services)
I wonder what are the benefit of this second approach.
In fact i see a lot of downsides with this approach:
Need to setup an inter service communication process.
When requesting a service representing resource X that have a relation with resource Y, a lot more work are needed (i.e: interservice request)
More devops work.
More difficulty to share common code between resource services.
Where to put business logic ?
When starting a fresh project this second approach seams to me a bit of an over engineered work.
I feel like starting with the first approach and THEN split the monolith resource service into several specific services depending on the observed needs will minimize the complexity and risks.
What's your opinions regarding that ?
Is there any best practices ?
Thanks a lot !
The first approach is not microservice way, by definition.
And yes, idea is to split - each service for Bounded Context - One for Users, one for Inventory, Todo things etc etc.
The idea of microservices, at very simple, assumes:
You want to pay extra dev-ops work for modularity, and complete/as much as possible removal of dependencies between different bounded contexts (see dev/product/pjm teams).
It's idea lies around ownership, modularity, allowing separate teams develop their own piece of code, without requirement from them to know the rest of the system . As long as there is Umbiqutious Language (common set of conventions/communication protocols/terminology/documentation) they can work in completeley isolated, autotonmous fashion.
Maintaining, managing, testing, and develpoing become much faster - in cost of initial dev-ops and sophisticated architecture engeneering investment.
Sharing code should be minimal, and if required, could be done to represent the Umbiqutious Language (common communication interface/set of conventions). Sharing well-documented code, which acts as integration/infrastructure mini-framework, and have special dev/dev-ops/team attached to it ccould be easy business, as long as it, as i said, well-documented, and threated as separate architecture-related sub-project.
Properly engeneered Microservice architecture could lessen maintenance and development times by huge margin, but it requires quite serious reason to use it (there lot of reasons, and lots of articles on that, I wont start it here) and quite serious engeneering investment at start.
It brings modularity, concept of ownership, de-coupling of different contexts of your app.
My personal advise check if you really need MS architecture. If you can not invest engenerring though and dev-ops effort at start and do not have proper reasons for such system - why bother?
If you do need MS, i would really advise against the first method. You will develop wrong thing's, will miss the true challenges of MS, and could end with huge refactor, which could take more work than engeneering MS system from start properly. It's like to make square to make it fit into round bucket later.
Now answering your question title: granularity. (your question body bit different from your post title).
Attach it to Domain Model / Bounded Context. You can make meaty services at start, in order to avoid complex distributed transactions.
First just answer question if you need them in your design/architecture?
If not, probably you did a good design.
Passing reference ids between models from different microservices should suffice, and if not, try to rethink if more of complex transactions could be avoided.
If your system have unavoidable amount of distributed trasnactions, perhaps look towards using/making some CQRS mini-framework as your "shared code infrastructure component" / communication protocol.
It is the key problem of the microservices or any other SOA approach. It is where the theory meets the reality. In general you should not force the microservices architecture for the sake of it. This should rather naturally come from functional decomposition (top-down) and operational, technological, dev-ops needs (bottom-up). First approach is closer to what you would need to do, however at the first step do not focus so much on the technology aspect. Ask yourself why would you need to implement a separate service for particular business function. Treat it as a micro-application with all its technical resources. Ask yourself if there is reason to implement particular function as a full-stack app.
Some, of the functionalities you have mentioned in scenario 1) are naturally ok, such as 'authentication' service - this is probably good candidate.
For the business functions decomposition into separate service, focus on the 'dependencies' problem, if there are too many dependencies and you see that you have to implement bigger chunk of data mode - naturally this is not a micro service any more.
Try to put litmus test , if you can 'turn off' particular functionality and the system still makes sense - it is the candidate for service or further decomposition
I read the document that both for data analysis and in cluster structure but I don't understand what use case different.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics.Amazon Elasticsearch
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (afaik) mostly used for BI purpuses and other compute-intensive jobs, the Amazon Elasticsearch is an out-of-the-box ElasticSearch managed cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything in order to manage your servers (this is what you pay for). Using the AWS Console you can add new cluster and you don't need to run any commands on order to install any software - you just need to choose which server to run your cluster on (number of nodes, disk, ram, etc).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
I agree with #IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - Whether speaking of AWS ElasticSearch Service, or FOSS / Enterprise ElasticSearch (which have signifficant differences, between, even) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) Database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases, however unlike RDBMS data models most signifficantly, ElasticSearch or NoSQL are not going to provide you with FULL ACID Compliance, or Transactional Statement Processing, so if your use case prioritizes data integrity, constrainability, reliability, audit ability, regulatory compliance, recover ability (to Point in Time, even), and normalization of data model for performance and least repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the Droids you're looking for..." and you should be implementing a RDBMS solution. As already mentioned, the AWS Redshift Service is based on PostgreSQL - which is one of the most popular OpenSource RDBMS flavors out there, just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between RDBMS and NoSQL categories, as it is a Search Engine / Index that works most optimally with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing,but perhaps the most important thing I could stress is that in my experience it typically does not scale very cost effectively (even managed cluster services) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - and for most will likely become cost PROHIBITIVE VERY fast. That said, Elastic Search DOES still have very optimal use cases, so is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly let's call NoSQL what it is, a Document Store that stores collections of documents (most often in JSON format) and while they also do indexing, offer some semblance of an Authentication and Authorization model, provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D )- Still NOT a traditional database, likely won't provide you with much control over your data's integrity - BUT that is precisely what "NoSQL" Document Stores were designed to work best for - UNSTRUCTURED DATA - where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non mission critical data). Last - while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble RDBMS, I am not aware of ANY in that category at current that could claim to offer all that a relational database does, with Oracle MySQL's DocumentStore being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope Developers with similar questions come across this thread, and after reading are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves - everything we do in our profession is about data - either generating it, transporting it, rendering it, transforming it....it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.
I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2k per msg). Initially some "real time" processing and monitoring of the events is in scope so I'll be using redis to keep some of that data to make decisions, expunging them when it makes sense. I was going to persist all of the entities, including events in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications or does it typically play more of a passive role? If not is it realistic to think I can scale the Postgres enough for the event data to start with only that? Also if not, does the "window of data and archival" method make sense?
EDIT Here are some things I've seen before writing the post:
Some say "yes go for it" regarding the should I use Redshift for this question.
Some say "eh not performant enough for most web apps" and support the front it with a postgres database camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, tendency to do many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in an SQLite or a local PostgreSQL or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data as I received it from an in-memory buffer. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel OLAP mostly-read workload.
That can result in all kinds of concurrency-related failures that would not have been failures on typical DB that support read-committed isolation by default.
So if for example I am trying to implement something that looks like Facebook's Graph API that needs to be very quick and support millions of users, what is the disadvantage of just using Redis instead of a RDBMS?
Thanks!
Jonathan
There are plenty of potential benefits and potential drawbacks of using Redis instead of a classical RDBMS. They are very different beasts indeed.
Focusing only on the potential drawbacks:
Redis is an in-memory store: all your data must fit in memory. RDBMS usually stores the data on disks, and cache part of the data in memory. With a RDBMS, you can manage more data than you have memory. With Redis, you cannot.
Redis is a data structure server. There is no query language (only commands) and no support for a relational algebra. You cannot submit ad-hoc queries (like you can using SQL on a RDBMS). All data accesses should be anticipated by the developer, and proper data access paths must be designed. A lot of flexibility is lost.
Redis offers 2 options for persistency: regular snapshotting and append-only files. None of them is as secure as a real transactional server providing redo/undo logging, block checksuming, point-in-time recovery, flashback capabilities, etc ...
Redis only offers basic security (in term of access rights) at the instance level. RDBMS all provide fine grained per-object access control lists (or role management).
A unique Redis instance is not scalable. It only runs on one CPU core in single-threaded mode. To get scalability, several Redis instances must be deployed and started. Distribution and sharding are done on client-side (i.e. the developer has to take care of them). If you compare them to a unique Redis instance, most RDBMS provide more scalability (typically providing parallelism at the connection level). They are multi-processed (Oracle, PostgreSQL, ...) or multi-threaded (MySQL, Microsoft SQL Server, ... ), taking benefits of multi-cores machines.
Here, I have only described the main drawbacks, but keep in mind there are also plenty of benefits in using Redis (very fast, good concurrency support, low latency, protocol pipelining, good to easily implement optimistic concurrent patterns, good usability/complexity ratio, excellent support from Salvatore and Pieter, pragmatic no-nonsense approach, ...)
For your specific problem (graph), I would suggest to have a look at neo4J or OrientDB which are specifically designed to store graph-oriented data.
I have some additions:
There is a value length limitations in redis. When using redis, you always think about your redis K,V size, especially in redis cluster