Our shop has recently started taking an SOA approach to application development. We are seeing great benefits from the separation of concerns, reusability, and the other advantages of SOA/microservices.
However, one big item we're stuck on is aggregating, filtering, and paginating results across services. Let me describe the issue with a scenario.
Say we have 3 services:
PersonService - Stores information on people (names, addresses, etc)
ItemService - Stores information on items that are purchasable.
PaymentService - Stores information regarding payments that people have made for different items.
Now, say we want to build a reporting/admin tool that can display / report on multiple services in aggregate. For instance, we want to display a paginated list of Payments, along with the Person and Item that each payment was for. This is pretty straightforward: Grab the list of payments, then query PersonService and ItemService for the respective Person and Item records.
However, the issue comes into play when we want to then filter down that data: For instance, displaying a paginated list of payments made by people with the first name 'Bob', who have purchased the item 'Car'. This makes things much more complicated, because we need to filter results from 3 different services without knowing how many results each service is going to return.
From a performance perspective, querying all of the services over and over again to narrow down the results would be costly, so I've been researching better solutions. However, I cannot find concrete solutions to this problem (or at least a "best practice"). In a monolithic application, we'd simply use SQL joins across the different tables. I'm having a ton of trouble figuring out how/if something similar is possible across services.
My question to the community is: What would your approach be? Things I've considered:
Using some sort of search index (Elasticsearch, Solr) that contains all data for all services (updated via events pushed out by services), and then querying the search index for results.
Attempting to understand how projects like GraphQL and Neo4j may assist us with these issues.

I side with Sam Newman, who says something like this in Chapter 4 ("The Shared Database") of his book:
Remember when we talked about the core principles behind good microservices? Strong cohesion and loose coupling --with database integration, we lose both things. Database integration makes it very easy for services to share data, but does nothing about sharing behaviour. Our internal representation is exposed over the wire to our consumers, and it can be very difficult to avoid making breaking changes, which inevitably leads to fear of any changes at all. Avoid at (nearly) all costs.
This is the point I make when I curse at Content-Management-Systems.
In my view a microservice is autonomous, which it cannot be if it shares things or consumes shared things. The only exception I make is for Domain-Objects: they represent the shared understanding of the business model and must be used solely in communication between microservices.
It depends on the microservice itself whether an ER or an aggregation-oriented database (document-based or graph-based) better suits its needs.
The funny thing is, by being loosely coupled and autonomous you are able to do just that!
If a PaymentService shares the behaviour of "how many payments for Person A",
it needs to know Person A in order to fulfill this. But everything it knows about Person A must originate from the PersonService, either at runtime (the PaymentService may store just an id) or event-based (the PaymentService stores the data it needs, up to the full user Domain-Object, and that copy is updated when the PersonService triggers and supplies a change). The PaymentService itself does not share users.

The answer to this question is that you need a separate Read Database or Materialized View that aggregates data from multiple databases, and makes it ready for fast retrieval. See the CQRS pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
The data in the Materialized View might not be fully up to date, meaning there might be a small delay between when a change is made by the respective microservice and when the Materialized View is updated. This is usually fine, as retrieving the data fast is more important than whether the data is stale for a few seconds or even minutes (there are systems where the Materialized View can take 2-5 minutes to be updated, and that might still be acceptable).
A common way to implement this Read Database or Materialized View from CQRS is the Event Sourcing pattern, where we listen to a queue for new updates and update the Read Database as they arrive. See the Event Sourcing pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
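As a rough illustration of the read-database idea, here is a minimal Python sketch that applies events to a local SQLite read model and then filters and paginates with an ordinary join. The event names (PersonChanged, ItemChanged, PaymentRecorded), their payload shapes, and the table layout are assumptions made for the example, and the queue is replaced by a plain list.

```python
import sqlite3

# In-memory stand-in for the read database owned by the reporting side.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE item    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payment (id INTEGER PRIMARY KEY, person_id INTEGER,
                          item_id INTEGER, amount REAL);
""")

# Hypothetical event types, each mapped to the upsert that applies it.
UPSERTS = {
    "PersonChanged": "INSERT OR REPLACE INTO person VALUES (:id, :first_name)",
    "ItemChanged": "INSERT OR REPLACE INTO item VALUES (:id, :name)",
    "PaymentRecorded": "INSERT OR REPLACE INTO payment "
                       "VALUES (:id, :person_id, :item_id, :amount)",
}

def apply_event(event):
    """Apply one event from the (assumed) queue to the read model."""
    db.execute(UPSERTS[event["type"]], event["data"])

def payments_page(first_name, item_name, page, page_size):
    """Filter across all three entities and paginate in one local query."""
    rows = db.execute(
        """SELECT pay.id, p.first_name, i.name, pay.amount
           FROM payment pay
           JOIN person p ON p.id = pay.person_id
           JOIN item i ON i.id = pay.item_id
           WHERE p.first_name = ? AND i.name = ?
           ORDER BY pay.id
           LIMIT ? OFFSET ?""",
        (first_name, item_name, page_size, page * page_size),
    )
    return rows.fetchall()

events = [
    {"type": "PersonChanged", "data": {"id": 1, "first_name": "Bob"}},
    {"type": "PersonChanged", "data": {"id": 2, "first_name": "Alice"}},
    {"type": "ItemChanged", "data": {"id": 10, "name": "Car"}},
    {"type": "ItemChanged", "data": {"id": 11, "name": "Bike"}},
    {"type": "PaymentRecorded", "data": {"id": 100, "person_id": 1, "item_id": 10, "amount": 20000.0}},
    {"type": "PaymentRecorded", "data": {"id": 101, "person_id": 2, "item_id": 10, "amount": 18000.0}},
    {"type": "PaymentRecorded", "data": {"id": 102, "person_id": 1, "item_id": 11, "amount": 300.0}},
    {"type": "PaymentRecorded", "data": {"id": 103, "person_id": 1, "item_id": 10, "amount": 500.0}},
]
for event in events:
    apply_event(event)

page_one = payments_page("Bob", "Car", page=0, page_size=1)
page_two = payments_page("Bob", "Car", page=1, page_size=1)
```

Because every upsert is keyed by id, replaying an event is harmless, and a later PersonChanged event for the same id simply overwrites the stale name.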

Storing this data in an Elasticsearch/Solr/Cognitive Search type of service, in addition to SQL, could help solve some of these problems.
In your given example, the person document in the search index (Elasticsearch/Solr/Cognitive Search) would have a property called "items" that contains the list of items that person has paid for.
That way, you can filter across objects, get a paginated list that is sorted by any property of the person. You can add similar information on other documents to better suit your business needs.
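To make the document shape concrete, here is a small sketch in plain Python rather than in any particular search engine's API. The PersonDoc fields other than "items" and the search_people helper are hypothetical names for illustration; a real index would run the filter, sort, and paging on the server.

```python
from dataclasses import dataclass, field

@dataclass
class PersonDoc:
    """Denormalized document: the person plus the items they have paid for."""
    person_id: int
    first_name: str
    last_name: str
    items: list = field(default_factory=list)

def search_people(docs, first_name, item, sort_by, page, page_size):
    """Filter on person and item fields together, sort, then slice one page."""
    hits = [d for d in docs if d.first_name == first_name and item in d.items]
    hits.sort(key=lambda d: getattr(d, sort_by))
    start = page * page_size
    return hits[start:start + page_size]

docs = [
    PersonDoc(1, "Bob", "Young", ["Car", "Bike"]),
    PersonDoc(2, "Bob", "Adams", ["Car"]),
    PersonDoc(3, "Alice", "Brown", ["Car"]),
    PersonDoc(4, "Bob", "Miller", ["Bike"]),
]

first_page = search_people(docs, "Bob", "Car", sort_by="last_name", page=0, page_size=1)
second_page = search_people(docs, "Bob", "Car", sort_by="last_name", page=1, page_size=1)
```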
Using a graph database would seem to solve your problem from 10,000 ft, but you will run into pagination problems when you operate at scale. Graph databases do not paginate well (they generally have to visit all the matching nodes anyway, even when you only need one page), and this can cause timeouts and performance issues.

You can use replicated tables.
Most databases have a replication feature.
If you have a PersonService with a person table and a PaymentService with a payment table, create a ReportService that has both person and payment tables, filled by the replication feature.


Is dynamodb suitable for growing or pivotable product?

Amazon said
NoSQL design requires a different mindset than RDBMS design. For an RDBMS, you can create a normalized data model without thinking about access patterns. You can then extend it later when new questions and query requirements arise. For DynamoDB, by contrast, you shouldn't start designing your schema until you know the questions it needs to answer. Understanding the business problems and the application use cases up front is absolutely essential.
It seems that I should design the tables after designing the product, to keep query costs efficient.
But a product can pivot or gain new features. In the early stage, nobody knows where the product is going.
Is dynamodb suitable for growing or pivotable product?
In my opinion, the main benefit of DynamoDB over other NoSQL solutions is that it is a managed database service. You pay for reads and writes, and you never have to worry about scaling to handle more data or more users. If you are building a prototype, or don't have the technical know-how to set up a database server and host it in the cloud, it could be useful and cost effective. It has its limitations, however, so if you do have technical resources, consider another, open-source NoSQL option.
I think that statement by Amazon is confusing and is probably more marketing than anything else. Use NoSQL in cases where your data is only accessed as distinct elements that do not have to be combined in a complex manner. It's also helpful if you don't have an exact schema defined: NoSQL doesn't require a fixed schema, so you can store any fields in a table and you can always add new ones. This is helpful when things are changing rapidly and you don't want to migrate everything as strictly as an RDBMS would require. If, however, you're going to have to run complex logic or calculations combining data from across tables, you should use an RDBMS. You could use NoSQL for some data and an RDBMS for other data in a hybrid fashion, but in that case you probably wouldn't want to use DynamoDB, because you'd want full ownership to set it up properly. Hope this helps. I'm sure others have more to say, and I welcome comments to help me refine my answer.

I've been researching different approaches to streaming data to a real-time dashboard. One approach I have used in the past is a star schema with dimension and fact tables, which is one way of implementing aggregate tables. For example, the dashboard would contain multiple charts: total sales for the day, total sales by product, total sales by manufacturer, and so on.
But what if this needed to be real-time? What if the data needs to stream to these charts and do the analytical processing real-time?
I've been looking into solutions like Kinesis streams and Kafka, but I may be missing something obvious. For example, consider the following example. A company runs a website where they sell pies. The company has a backend dashboard where they keep track of all data and analytics related to sales, users, orders, etc.
A customer places an order through the website
The relational (MySQL) database receives this new order
The charts and analytical data update in real time on the backend, for example total sales for the day, or total sales for the year by user.
If the scenario is that this data needs to be streamed, what is the best approach to this? Aggregate tables seem like the obvious choice, but it seems they would be periodic and not real-time. Kinesis/Kafka feels like it would fit somewhere in here. The other option would be something like Redshift, but it's pretty pricey and still may not be the best way to address the issue and scale.
Here is an example of a chart that would need to be updated in real time and that could suffer from simply running aggregate SQL queries in place when there are tons and tons of rows to parse.
For "always up-to-date" reports like this (sales, users, orders, etc.) that don't need live updates with near-zero latency, stream processing might be overkill, and a ROLAP-like approach seems better in terms of effort versus result.
You mentioned Redshift: if you are already prepared to mirror your data for analytics purposes and the only problem is the price, you can consider free open-source alternatives that can handle OLAP (aggregate) queries in real time (like Yandex ClickHouse, or maybe MongoDB in some cases).
A lot depends on the dataset size; unless you have really big data to aggregate (hundreds of GB), you can try to keep using MySQL with some tricks:
use a separate MySQL replica with high IOPS for analytics and replicate only the tables needed to build your reports; possibly use another table engine that is more suitable for analytical queries. Set up indexes specifically for these queries, to avoid a full table scan when you only need numbers for the last few weeks.
pre-calculate metrics for previous periods (with a materialized-view-like approach) and refresh them on a schedule (say, daily), then combine the pre-calculated aggregates with on-the-fly aggregates for the last period only, to get current report data without scanning the whole facts table each time.
use a data visualization backend that can efficiently cache report data in memory, to prevent SQL DB overload from many similar queries (so if the same report or dashboard is displayed for 100 users, the SQL DB load will be the same as for 1). BTW, I develop a solution like that (I cannot advertise it here as it is a commercial product).
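The second trick (pre-calculated history plus an on-the-fly aggregate for the current period) can be sketched as follows. SQLite stands in for MySQL so that the example is self-contained, and the orders and daily_totals tables and both function names are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT, amount REAL);
    CREATE INDEX orders_day ON orders (day);
    CREATE TABLE daily_totals (day TEXT PRIMARY KEY, total REAL);
""")

def refresh_daily_totals(today):
    """Scheduled job: recompute totals for closed days (strictly before today)."""
    db.execute("DELETE FROM daily_totals WHERE day < ?", (today,))
    db.execute(
        """INSERT INTO daily_totals
           SELECT day, SUM(amount) FROM orders WHERE day < ? GROUP BY day""",
        (today,),
    )

def total_sales(today):
    """Pre-calculated history plus a live aggregate over today's rows only."""
    history = db.execute(
        "SELECT COALESCE(SUM(total), 0) FROM daily_totals WHERE day < ?", (today,)
    ).fetchone()[0]
    live = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE day >= ?", (today,)
    ).fetchone()[0]
    return history + live

db.executemany(
    "INSERT INTO orders (day, amount) VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
)
refresh_daily_totals("2024-01-02")
total_before = total_sales("2024-01-02")

# A new order arrives today; no refresh is needed for it to show up.
db.execute("INSERT INTO orders (day, amount) VALUES (?, ?)", ("2024-01-02", 2.5))
total_after = total_sales("2024-01-02")
```

Only the rows for the current day are scanned at query time; everything older is read from the much smaller daily_totals table.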
This is a typical trade-off for most architects. Amazon Redshift offers exemplary read optimisations, but the AWS stack comes at a price. You may try using Cassandra, but it comes with its own set of challenges. When it comes to analytics, I never recommend going real-time, for the reasons elaborated below.
Doing analytics in real time is not desirable, especially using MySQL
The solution to the above is to segregate the transactional and analytical infrastructure. This involves cost, but it will make sure you don't have to spend time on housekeeping once you scale. MySQL is a row-based RDBMS mostly used for storing transactional data. Being row-based, it optimises writes (i.e. the writes are almost real-time) and thus compromises on reads. When I say this, I refer to a typical analytics dataset running into millions of records per day. If your dataset is not that voluminous, you might still be able to render a graph showing transactional status. But since you're referring to Kafka, I assume the dataset is very large.
A real-time dashboard with visualisations gives a bad customer experience
Considering the above point, even if you go for a warehouse or a read-optimised infrastructure, you need to understand how the visualisations work. If 100 people access the dashboard at the same time, 100 connections will be made to the database, all fetching the same data, putting it in memory, applying the calculations, parameters and filters defined in your dashboard, adjusting the refined dataset in the visualisation and then rendering the dashboard. Until all of this finishes, the dashboard will simply freeze. A poorly constructed query, inefficient use of indexes, etc. will make matters even worse.
The above problems will be amplified as your dataset grows. Good practices to achieve what you need would be:
Aim for an almost real-time system (a delay of 1 hr, 30 mins, 15 mins, etc.) rather than an absolutely real-time one. This lets you create a flat file with the data already fetched into memory. Your dashboard will simply read this data and will respond extremely quickly to filters etc. It also avoids multiple connections to the database.
Have a data structure or database/warehouse optimised for reads.
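The first practice amounts to serving every viewer from a snapshot that is recomputed at most once per interval. A minimal single-process sketch follows. The CachedReport class, the 15-minute TTL, and the stubbed expensive_query are illustrative assumptions, and the clock is injected only so the behaviour can be shown without waiting.

```python
import time

class CachedReport:
    """Serve a precomputed snapshot; recompute only after the TTL has elapsed."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._compute()  # the only place the database is hit
            self._expires_at = now + self._ttl
        return self._value

calls = []

def expensive_query():
    """Stand-in for the real aggregate query against the warehouse."""
    calls.append(1)
    return {"total_sales_today": 1234.5}

fake_now = [0.0]
report = CachedReport(expensive_query, ttl_seconds=900, clock=lambda: fake_now[0])

for _ in range(100):  # 100 dashboard viewers within the same 15 minutes
    report.get()
queries_within_ttl = len(calls)

fake_now[0] = 901.0  # the TTL has passed
report.get()
queries_after_expiry = len(calls)
```

This version is not thread-safe; a multi-threaded server would need a lock around the refresh, or a background job that writes the snapshot.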
For these types of operational analytics use-cases where the real-time nature of the data is critical, you're completely correct that most "traditional" methods can be quite clumsy, especially as your data size increases. A quick overview of your options:
Historical Approach (TLDR– Meh)
Up until about 5 years ago, the de facto way to do this looked something like
Set up a primary OLTP database that will handle the data in its raw form and have stricter guarantees on performance or ACID properties. Usually this is something SQL-esque, i.e. MySQL, PostgreSQL.
Set up a secondary OLAP database that is meant for serving offline (aka non user-facing) queries. This could also be a SQL-esque db but its schema would be drastically different because it stores the data in enriched form.
Set up some mechanism by which you can keep these 2 in sync. This pretty much boils down to either a) changing your application to always write to both databases and performing the necessary data enrichment or b) building a stand-alone application that reads from your OLTP database, performs the necessary transformations and enrichment and writes to your OLAP database
Plug your dashboard into your OLAP database which will have a schema and indexes optimized for the kind of queries you want.
Using your example about the pie store, the OLTP database would be used to store the purchases of all the pies and reference things like customer ids, billing information, delivery information, etc. In contrast, the OLAP database might just maintain a table with a schema
purchase_totals(day: Date, weekNumber: int, dayOfWeek: int, year: int, total: float)
While the weekNumber, dayOfWeek, and year are technically redundant, they make your queries faster! With the proper indexes on these fields, your dashboard has turned into 5 simple (and fast!) aggregation queries with a group by and sum, and then the differences week-over-week or year-over-year can be computed on the client side. As long as your dashboard refreshes every minute or so, you have near-real-time data at your fingertips.
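Here is a sketch of one of those aggregation queries against the purchase_totals schema, using SQLite for self-containment. The index name, the add_day helper, the choice of ISO week numbering, and the sample figures are assumptions for illustration only.

```python
import sqlite3
from datetime import date

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE purchase_totals (
        day TEXT PRIMARY KEY, weekNumber INTEGER, dayOfWeek INTEGER,
        year INTEGER, total REAL);
    CREATE INDEX idx_year_week ON purchase_totals (year, weekNumber);
""")

def add_day(d, total):
    """Store one day's total with the redundant ISO week fields filled in."""
    iso_year, week, day_of_week = d.isocalendar()
    db.execute(
        "INSERT INTO purchase_totals VALUES (?, ?, ?, ?, ?)",
        (d.isoformat(), week, day_of_week, iso_year, total),
    )

def weekly_totals(year):
    """One simple group-by/sum that the index can serve."""
    return db.execute(
        """SELECT weekNumber, SUM(total) FROM purchase_totals
           WHERE year = ? GROUP BY weekNumber ORDER BY weekNumber""",
        (year,),
    ).fetchall()

add_day(date(2024, 1, 1), 100.0)
add_day(date(2024, 1, 2), 50.0)
add_day(date(2024, 1, 8), 120.0)
add_day(date(2024, 1, 9), 80.0)

weeks = weekly_totals(2024)
# Week-over-week difference computed on the client side.
week_over_week = weeks[1][1] - weeks[0][1]
```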
Current Approach (TLDR– Ok)
The recent trends in computing, database technologies, and data science/analytics have led to improvements to the above process, namely by replacing certain components of it. The changes include
Making the OLTP db, the OLAP db, or both a NoSQL database (Mongo usually being the most popular). The pro here is that you have a more flexible schema which won't break if something upstream changes (say, you start selling cakes in addition to pies).
Keeping the SQL db but shifting to cloud provider solution like AWS RDS or Google Cloud SQL. This fundamentally doesn't change anything about the architecture, but it does significantly reduce your operational burden.
Using hard-to-maintain ETL pipelines on top of streaming platforms like Kafka or AWS Kinesis to act as the middle layer between OLAP and OLTP.
Using dedicated tools for data cleaning and transformation as you plan out how to do your ETL
Using dedicated visualization tools on top of your OLAP db (think Tableau)
Using a pull-based approach for getting data out of your OLTP db or your application directly instead of waiting for it to eventually reach your OLAP db. This is helpful for online services because it actually gives you both the data you want AND confirmation that the service is alive and running well (because it just served your request for data). Systems like Prometheus are quite popular for this now.

Microservices Architecture: Cross Service data sharing

Consider the following micro services for an online store project:
Users Service keeps account data about the store's users (including first name, last name, email address, etc.)
Purchase Service keeps track of details about users' purchases.
Each service provides a UI for viewing and managing its relevant entities.
The Purchase Service index page lists purchases. Each purchase item should have the following fields:
id, full name of purchasing user, purchased item title and price.
Furthermore, as part of the index page, I'd like to have a search box to let the store manager search purchases by purchasing user name.
It is not clear to me how to get back data which the Purchase Service does not hold - for example: a user's full name.
The problem gets worse when trying to do more complicated things like search purchases by purchasing user name.
I figured that I can obviously solve this by syncing users between the two services, broadcasting some sort of event on user creation (and saving only the relevant user properties on the Purchase Service end). That's far from ideal in my view. How do you deal with this when you have millions of users? Would you create millions of records in each service that consumes user data?
Another obvious option is exposing an API at the Users Service end which brings back user details based on given ids. That means that every page load in the Purchase Service, I'll have to make a call to the Users Service in order to get the right user names. Not ideal, but I can live with it.
What about implementing a purchase search based on user name? Well I can always expose another API endpoint at the Users Service end which receives the query term, perform a text search over user names in the Users Service, and then return all user details which match the criteria. At the Purchase Service, map the relevant ids back to the right names and show them in the page. This approach is not ideal either.
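For concreteness, the two API-based options could look roughly like this. Both Users Service endpoints are hypothetical and are stubbed with an in-memory dictionary; in practice each call would be an HTTP request.

```python
# Stand-ins for each service's own data store.
USERS = {1: "Bob Smith", 2: "Alice Jones", 3: "Bobby Tables"}
PURCHASES = [
    {"id": 100, "user_id": 1, "title": "Keyboard", "price": 49.0},
    {"id": 101, "user_id": 2, "title": "Mouse", "price": 19.0},
    {"id": 102, "user_id": 3, "title": "Monitor", "price": 199.0},
]

def users_by_ids(ids):
    """Hypothetical Users Service endpoint: batch lookup, id -> full name."""
    return {uid: USERS[uid] for uid in ids if uid in USERS}

def users_search(term):
    """Hypothetical Users Service endpoint: text search over full names."""
    return {uid: name for uid, name in USERS.items() if term.lower() in name.lower()}

def list_purchases(page, page_size):
    """Index page: one batched Users Service call per page, not one per row."""
    rows = PURCHASES[page * page_size:(page + 1) * page_size]
    names = users_by_ids({p["user_id"] for p in rows})
    return [{**p, "user_name": names.get(p["user_id"])} for p in rows]

def search_purchases(term):
    """Search: ask the Users Service for matching users, then filter locally."""
    matches = users_search(term)
    return [
        {**p, "user_name": matches[p["user_id"]]}
        for p in PURCHASES
        if p["user_id"] in matches
    ]

first_page = list_purchases(page=0, page_size=2)
bob_purchases = search_purchases("bob")
```

The weak spot is visible in search_purchases: a broad search term can return a very large set of user ids that then has to be carried into the purchase query.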
Am I missing something? Is there another approach for implementing the above? Maybe the fact that I'm facing this issue is a sort of code smell? I would love to hear other solutions.
This seems to be a very common and central question when moving into microservices. I wish there was a good answer for that :-)
About the suggested pattern already mentioned here, I would use the term Data Denormalization rather than Polyglot Persistence, as it doesn't necessarily need to involve different persistence technologies. The point is that each service handles its own data. And yes, you have data duplication and you usually need some kind of event bus to share data across services.
There's another option, which is a sort of a take on the first - making the search itself as a separate service.
So in your example, you have the User service for managing users. The Purchases service manages purchases. Each handles its own data and only the data it needs (so, for instance, the Purchases service doesn't really need the user name, only the ID). And you have a third service - the Search Service - that consumes data produced by other services, and creates a search "view" from the combined data.
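A minimal in-memory sketch of such a Search Service might look like this. The event names (UserSaved, PurchaseCreated) and their fields are assumptions, and a real implementation would sit behind a message queue and a proper search index.

```python
class SearchService:
    """Builds a combined search view from events published by other services."""

    def __init__(self):
        self._user_names = {}  # user_id -> full name
        self._purchases = {}   # purchase_id -> purchase fields

    def handle(self, event):
        """Consume one event; the event shapes are hypothetical."""
        kind = event["type"]
        if kind == "UserSaved":
            self._user_names[event["user_id"]] = event["full_name"]
        elif kind == "PurchaseCreated":
            self._purchases[event["purchase_id"]] = {
                "user_id": event["user_id"],
                "title": event["title"],
                "price": event["price"],
            }

    def search(self, term):
        """Return purchases whose purchasing user's name contains the term."""
        term = term.lower()
        results = []
        for purchase_id in sorted(self._purchases):
            purchase = self._purchases[purchase_id]
            name = self._user_names.get(purchase["user_id"], "")
            if term in name.lower():
                results.append({"id": purchase_id, "user_name": name,
                                "title": purchase["title"], "price": purchase["price"]})
        return results

service = SearchService()
for event in [
    {"type": "UserSaved", "user_id": 1, "full_name": "Dana Cole"},
    {"type": "PurchaseCreated", "purchase_id": 500, "user_id": 1, "title": "Keyboard", "price": 49.0},
    # Events may arrive out of order: this purchase precedes its user.
    {"type": "PurchaseCreated", "purchase_id": 501, "user_id": 2, "title": "Mouse", "price": 19.0},
    {"type": "UserSaved", "user_id": 2, "full_name": "Eli Cole"},
]:
    service.handle(event)

cole_purchases = service.search("cole")
```

Joining the name at query time means a later UserSaved event for a renamed user takes effect immediately, without rewriting stored purchases.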
It's totally fine to keep appropriate data in different databases, it's called Polyglot Persistence. Yes, you would like to keep user data and data about purchases separately and use message queue for sync. Millions of users seems fine to me, it's scalability, not design issue ;-)
In case of search - you probably want to search more than just username, right? So, if you use message queue to update data between services you can also easily route this data to ElasticSearch, for example. And from ElasticSearch perspective it doesn't really matter what field to index - username or product title.
I usually use both approaches. Sometimes I have another service which sits on top of x other services and combines the data. I don't really like this approach because it causes dependencies and coupling between services. So in general, in my last projects we tried to stick to polyglot persistence.
Also keep in mind that if you need x sub-HTTP-requests to combine data in some kind of middleware service, it will lead to higher latency. We always try to cut down the number of requests per task and to handle everything possible through asynchronous queues (especially data sync).
If you conceptualize modules as the owners and controllers of the data they work on, then your model must also communicate that data out of that module to others. In contrast, the modules in a manufacturing process have the access to change data without possessing and controlling it.
Microservices is an architecture for distributed processing, like most code, where modules pass the data around to work on it. From classic articles by Harvard Business Review and McKinsey on the subject of owning members of a supply chain, I identified complexities arising from this model and wrote an article teaching programmers what you need to know: http://www.powersemantics.com/p.html
Manufacturing is an architecture for integrated processing, where modules work on the data without passing it around from point to point. This can be accomplished by having modules configured to access the same memory, files or database tables. My architecture shows how to accomplish this on memory via reference properties.
When you consider "exposing an API at the Users Service end which brings back user details based on given ids", you need to be aware that creates what HBR calls "irreversible" complexity, which I've dubbed centralization complexity. Don't build A->B (distributed) systems, because you can't decentralize them later after failing to separate requirements. Requirements in production processes represent user instructions, and centralized modules only enable you to change the wrong users' processes. In other words, centralized modules don't document user groups or distinguish them from derived-product-users.

Can Datomic simplify querying data contained in dynamically accessed HTML documents?

I need to write an API which would provide access to data being served as HTML documents from a web server. I need for my users to be able to perform queries over the data.
Say on a web site there is a page which lists items and their owners. Then there is additional set of profile pages for owners which for each owner provide information about their reputation. An example query I may need to answer is "Give me ID's and owners of all items submitted in 2013 whose owners have reputation of at least 10".
Given a query to answer, I need to be able to screen scrape only the parts of the web site I need for answering the query at hand. And ideally cache the obtained information for future use with new queries.
I have no problem writing the screen scraping part, but I am struggling with designing the storage/query/cache part. Is there something about Clojure/Datomic that makes it an especially suitable technology choice for this kind of processing of data? I have been pointed in this direction before.
It seems a nice challenge, but I'm not sure about a few things: a) would you like to expose a Datalog query box to your users and so make them learn a Datalog-like syntax? b) what exact kind of results do you wish to cache: raw DB responses, HTML-formatted text, JSON?
Anyway, if you haven't already, I suggest you install the Datomic console and play with it a little to get a feel for it, as it seems to me the closest thing to what you want to achieve at the moment: https://www.youtube.com/watch?v=jyuBnl0XQ6s http://blog.datomic.com/2013/10/datomic-console.html
For the API I suggest you to use http://clojure-liberator.github.io/liberator/ as it provides sane defaults to implement REST services and let you focus on your app behaviour

Best practices for integrating two systems via a web-service

In my case the separate system is a web-service (but it could conceivably be anything).
My question is what are the best practices when you integrate against a separate system such as a web-service when it comes to data?
Example: Web-service provides a list of products. Products are grouped using categories. You can get all products in a sub-category. You can get a specific product by its id (an integer) or its name (a unique value).
In my application:
I display the list of categories and products - and the user can choose the product and specify an order quantity.
Should I store the name of the category or the id of the category?
Should I store the name of the product or the id of the product?
How should I name the field in the database that stores the data from the web-service
(CategoryId or WsCategoryId: so that by convention one knows where the value is coming from?)
Any other best practices?
Any other references?
From your question I understand that the web service's interface looks something like this:
Since you are asking if you should store CategoryName, I assume that it is unique (same as ProductName).
I also assume that the web service handles cases where products or categories are renamed transparently (i.e. by providing a redirect or any other means which allow you to detect this and handle it accordingly). If it doesn't, do not consider storing names as references to products or categories - always use IDs.
I would provide the same answer to your questions #1 and #2. Even though uniqueness of ProductName and CategoryName will technically allow you to store them in your application as unique identifiers of products and categories, I would opt for storing their IDs instead. The main decision point would be your storage medium. Since you are using a database, and the web service allows you to access objects by unique numerical IDs, database normalization rules should apply - hence you should store IDs.
The above however assumes that you are using a relational database - if you are using a NoSQL database, I assume that storing names instead of IDs would be a viable option as well (at least as far as I can tell with my current understanding of NoSQL solutions, unfortunately I don't have any practical experience with any of them yet).
Regarding question #3 - I would stick with the naming conventions that you already use in your database. There are many different conventions for naming tables and columns out there, so I really doubt that there are any standardized conventions on how to name columns referencing web service objects. I would name them according to your existing naming conventions and in a way that purpose of the columns is clear to everybody who is using the system. Note that if there is a chance that you will be using other web services in the future, you should consider keeping the name of the service in the column name rather than using a generic ws prefix - e.g. AmazonProductId or AmazonCategoryId.
I'll try to point out a few items from my experience, but I would not label them as best practices - just topics to think about.
In my experience, I found it useful to treat data from web services in the same fashion as data from a database - at least from an application's perspective, where your storage layer would be abstracted from application logic. By this I mean that you should think about and prepare for similar scenarios regardless of whether your storage medium is a database or a web service. Like databases, web services can go down, both can have their data or integrity corrupted, and both will require you to sanitize or otherwise process data on input.
Caching of data should be an item which is high on your list - apart from the obvious performance reasons, it can allow you to deal with outages of the web service (to an extent limited by which data you cache).
An example would be that your application displays a list of the products most frequently purchased in your application. If your application stores only the IDs of products, you will have to make one or more requests to the web service in order to retrieve the names of all products which you need to display in the list. If you cache product names locally or in your database, you will achieve better performance, conserve your resources and you will also have a failsafe scenario in case the web service goes down.
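A bare-bones illustration of that caching idea is below. The ProductNameCache class and the stubbed fetch function are hypothetical; a real cache would also need expiry or invalidation so that renamed or deleted products do not linger.

```python
class ProductNameCache:
    """Cache-first lookup of product names by ID."""

    def __init__(self, fetch_name):
        self._fetch_name = fetch_name  # callable that asks the web service
        self._names = {}               # product_id -> cached name

    def name_for(self, product_id):
        if product_id not in self._names:
            # Only uncached IDs reach the web service; errors propagate.
            self._names[product_id] = self._fetch_name(product_id)
        return self._names[product_id]

# Stub standing in for the remote web service.
state = {"up": True, "fetches": 0}
CATALOG = {42: "Apple Pie", 7: "Pecan Pie"}

def fetch_name_from_service(product_id):
    if not state["up"]:
        raise ConnectionError("web service unavailable")
    state["fetches"] += 1
    return CATALOG[product_id]

cache = ProductNameCache(fetch_name_from_service)
top_sellers = [42, 42, 7, 42]

names = [cache.name_for(pid) for pid in top_sellers]
fetches_for_list = state["fetches"]

state["up"] = False  # simulate an outage
names_during_outage = [cache.name_for(pid) for pid in top_sellers]
```

During the outage the list still renders from cached names; only a product that was never cached would fail.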
Referential integrity is one other important aspect to think about when working with web services. As the web service is completely separate from your database, you do not have the option to create foreign keys as you would do in a database-only solution. This means that data changes in the web service (i.e. product updates or deletions) can break the integrity of data in your database.
Regarding references, these depend mostly on the type of web service that you are about to use (you didn't specify which service you will be using). If the service is based on REST principles, I can recommend Restful Web Services by Leonard Richardson and Sam Ruby. Even though it isn't focused on application/service integration as such, it's a great introduction into REST.