Building a serverless secure MMO auction house system with Lambda + DynamoDB

I'm building an auction house system for an MMO. The system is pretty straightforward:
Players can post items to sell; the item will be retained inside the order and appear on the market.
Players can post orders to buy items; the gold will be retained inside the order and the system will look for orders. If there is a match, it will swap them.
We're considering using AWS to make things simpler to scale. But we also have some concerns about security, since any hack on the auction house would pretty much ruin the game. And, well, since it's an online game, there will be people trying to hack it pretty much from day 1.
Is a serverless architecture recommended for this type of system in terms of security? Are there any unforeseen problems I might run into in the future?
Is NoSQL (DynamoDB) recommended for this? I can see relational databases keeping things a bit more secure because of integrity constraints and ACID, and I'm a bit uncomfortable with having my own code handle all the relationships in such critical software.

For the DB architecture:
I think the key point here is the part where your "system will look for orders" and matches them. What matters for this is your indexing. If you can create an index with a good key (one that has a large number of distinct values that are requested fairly uniformly), then querying items on the market will be fast and less costly. Otherwise, searching items on the market will be very inefficient, for instance if your index key is GoldenSword and 10000+ items share that same key. I don't know the requirements for your auction house, such as how many ways you can search or add items, so I can't speculate much on this. One downside: if your DB design changes over time, or is weakly designed at first, you will be forced to add Global Secondary Indexes, which can become very costly, as DynamoDB is costly in itself.
See: Best practices for DynamoDB
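To make the point about key cardinality concrete, here is a minimal sketch that compares two key choices for 10000 orders of the same item. It is plain Python with no AWS calls; the function names, the `order-N` ids and the shard count are all assumptions made up for illustration, and appending a shard suffix is just one common way to spread a hot key, not necessarily the right design for this auction house.

```python
import hashlib
from collections import Counter

SHARDS = 8  # hypothetical number of shards per item type


def naive_key(item_type: str, order_id: str) -> str:
    """Partition key that is just the item type: every GoldenSword order collides."""
    return item_type


def sharded_key(item_type: str, order_id: str, shards: int = SHARDS) -> str:
    """Partition key with a deterministic shard suffix derived from the order id."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{item_type}#{shard}"


def key_distribution(key_fn, item_type: str, n_orders: int) -> Counter:
    """Count how many orders land on each partition key value."""
    return Counter(key_fn(item_type, f"order-{i}") for i in range(n_orders))


if __name__ == "__main__":
    naive = key_distribution(naive_key, "GoldenSword", 10_000)
    sharded = key_distribution(sharded_key, "GoldenSword", 10_000)
    print("naive:  ", len(naive), "key(s), largest bucket", max(naive.values()))
    print("sharded:", len(sharded), "key(s), largest bucket", max(sharded.values()))
```

The trade-off is that reading every GoldenSword order now means querying each shard and merging the results, so this only pays off when a single key would otherwise take most of the traffic.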
For security:
It is perfectly fine and secure as long as you put decent restrictions on the IAM roles and policies used for accessing DynamoDB and for running Lambda in your AWS account, rather than giving full access to everything and to all DynamoDB operations and tables. As long as your IAM credentials are safe, the data will be safe. I have never seen a security issue specific to serverless architecture. Plus, DynamoDB has an easy-to-use encryption at rest feature.
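As a rough illustration of what "decent restrictions" could look like, the sketch below builds an IAM policy document that allows only a handful of DynamoDB actions on a single table. The table name, region and account id are placeholders, and the helper `table_policy` is hypothetical; the exact action list would depend on what each Lambda function really needs.

```python
import json


def table_policy(region: str, account_id: str, table: str, actions: list) -> dict:
    """Build an IAM policy document limited to the given actions on one table."""
    if any("*" in action for action in actions):
        raise ValueError("wildcard actions defeat the purpose of least privilege")
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # The table itself plus its indexes; nothing else in the account.
                "Resource": [table_arn, f"{table_arn}/index/*"],
            }
        ],
    }


if __name__ == "__main__":
    policy = table_policy(
        region="us-east-1",
        account_id="123456789012",  # placeholder account id
        table="AuctionOrders",      # hypothetical table name
        actions=[
            "dynamodb:GetItem",
            "dynamodb:PutItem",
            "dynamodb:UpdateItem",
            "dynamodb:Query",
        ],
    )
    print(json.dumps(policy, indent=2))
```

Giving each function its own role with its own narrow policy means a bug in one handler cannot be used to scan or delete tables that handler never needed.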

Related

Which one is better among DynamoDB and AWS ElasticSearchService for querying and storing logs?

I'm building a GUI tool for querying logs and am looking for the cheaper option. DDB will get its logs from an S3 bucket via Lambda, whereas ES will get the same logs streamed from CloudWatch. The thing is, my queries are gonna be simple, not complex ones, so I'm leaning towards DDB. Any input will be appreciated.

If you have fixed access patterns that can be queried using the partition key and sort key, staying within the limits of querying on a sort key, then DynamoDB is certainly a very good option. There are other factors, like size of the data, and number of records in a partition.
If you can do most of the filtering with the above, but need to further reduce the data based on values outside the key, you can still use DynamoDB, but your mileage may vary on how good it is. It becomes very dependent on data size and filtering complexity.
There is certainly a point where the complexity of the queries goes beyond what DynamoDB is designed for. At that point ES is often a good answer. Keep in mind that ES isn't a fully managed service, and you pay for the time it's running, regardless of use. I tend to avoid these types of services when I can, but if cost is not a significant factor for you, and you feel comfortable managing the ES cluster, then ES is a great option for advanced querying.
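For the "fixed access pattern on partition key and sort key" case, a query could look roughly like the sketch below. It only builds the parameters for the DynamoDB Query operation, so it runs without AWS access; with boto3 they could be passed as `boto3.client("dynamodb").query(**params)`. The table layout is an assumption: a partition key named `service` and a sort key named `ts` holding ISO-8601 timestamps in one consistent format and timezone.

```python
def build_log_query(table: str, service: str, start_iso: str, end_iso: str,
                    limit: int = 50) -> dict:
    """Build keyword arguments for the low-level DynamoDB Query operation.

    Assumes a hypothetical table whose partition key is `service` and whose
    sort key is `ts`, an ISO-8601 timestamp string.
    """
    return {
        "TableName": table,
        # Equality on the partition key, a range on the sort key.
        "KeyConditionExpression": "#svc = :svc AND #ts BETWEEN :start AND :end",
        # Placeholders avoid clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#svc": "service", "#ts": "ts"},
        "ExpressionAttributeValues": {
            ":svc": {"S": service},
            ":start": {"S": start_iso},
            ":end": {"S": end_iso},
        },
        "Limit": limit,
        "ScanIndexForward": False,  # newest entries first
    }


if __name__ == "__main__":
    params = build_log_query(
        "AppLogs", "checkout", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"
    )
    for name, value in params.items():
        print(name, "=", value)
```

Anything beyond this shape, such as matching on the message text, would have to go into a filter applied after the key lookup, which is where the "mileage may vary" caveat starts to apply.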

Is DynamoDB suitable for a growing or pivotable product?

Amazon says:
NoSQL design requires a different mindset than RDBMS design. For an RDBMS, you can create a normalized data model without thinking about access patterns. You can then extend it later when new questions and query requirements arise. For DynamoDB, by contrast, you shouldn't start designing your schema until you know the questions it needs to answer. Understanding the business problems and the application use cases up front is absolutely essential.
It seems that I should design the tables only after designing the product, to keep query costs efficient.
But a product can pivot or have new features added. In the early stage, nobody knows where the product is going.
Is DynamoDB suitable for a growing or pivotable product?

In my opinion, the main benefit of DynamoDB over other NoSQL solutions is that it is a managed database service. You pay for reads and writes and never have to worry about scaling to handle more data or more users. If you are doing a prototype or don't have the technical know-how to set up a database server and host it in the cloud, it could be useful and cost effective. It has its limitations, however, so if you do have technical resources, consider an open source NoSQL option.
I think that statement by Amazon is confusing and is probably more marketing than anything else. Use NoSQL in cases where your data is only accessed as distinct elements that do not have to be combined in a complex manner. It's also helpful if you don't have an exact schema defined: because NoSQL doesn't require a fixed schema, you can store any fields in a table and you can always add new fields. This is helpful when things are changing rapidly and you don't want to migrate everything as strictly as an RDBMS would require. If, however, you're going to have to run complex logic or calculations combining data from across tables, you should use an RDBMS. You could use NoSQL for some data and an RDBMS for other data in a hybrid fashion, but in that case you probably wouldn't want to use DynamoDB because you'd want full ownership to set it up properly. Hope this helps. I'm sure others have more to say, and I welcome comments to help me refine my answer.

Querying / Pagination Across Microservices

Our shop has recently started taking on an SOA approach to application development. We are seeing some great benefits with the separation of concerns, reusability, and other benefits of SOA/microservices.
However, one big item we're stuck on is aggregating, filtering, and paginating results across services. Let me describe the issue with a scenario.
Say we have 3 services:
PersonService - Stores information on people (names, addresses, etc)
ItemService - Stores information on items that are purchasable.
PaymentService - Stores information regarding payments that people have made for different items.
Now, say we want to build a reporting/admin tool that can display / report on multiple services in aggregate. For instance, we want to display a paginated list of Payments, along with the Person and Item that each payment was for. This is pretty straightforward: Grab the list of payments, then query PersonService and ItemService for the respective Person and Item records.
However, the issue comes into play when we want to then filter down that data: For instance, displaying a paginated list of payments made by people with the first name 'Bob', who have purchased the item 'Car'. This makes things much more complicated, because we need to filter results from 3 different services without knowing how many results each service is going to return.
From a performance perspective, querying all of the services over and over again to narrow down the results would be costly, so I've been researching better solutions. However, I cannot find concrete solutions to this problem (or at least a "best practice"). In a monolithic application, we'd simply use SQL joins across the different tables. I'm having a ton of trouble figuring out how/if something similar is possible across services.
My question to the community is: What would your approach be? Things I've considered:
Using some sort of search index (Elasticsearch, Solr) that contains all data for all services (updated via events pushed out by services), and then querying the search index for results.
Attempting to understand how projects like GraphQL and Neo4j may assist us with these issues.

I side with Sam Newman, who says something like this in Chapter 4, "The Shared Database", of his book:
Remember when we talked about the core principles behind good microservices? Strong cohesion and loose coupling -- with database integration, we lose both things. Database integration makes it very easy for services to share data, but does nothing about sharing behaviour. Our internal representation is exposed over the wire to our consumers, and it can be very difficult to avoid making breaking changes, which inevitably leads to fear of any changes at all. Avoid at (nearly) all costs.
This is the point I make when I curse at Content-Management-Systems.
In my view a microservice is autonomous, which it cannot be if it shares things or consumes shared things. The only exception I make here is domain objects: they represent the shared understanding of the business model and should be the only thing used in communication between microservices.
Whether a relational (ER) or an aggregate-oriented database (document-based or graph-based) better suits the needs depends on the microservice itself.
The funny thing is, by being loosely coupled and autonomous you are able to do just that!
If the PaymentService shares the behaviour "how many payments for Person A", it needs to know Person A in order to fulfill this. But everything it knows about Person A must originate from the PersonService, either at runtime (the PaymentService just stores an id) or event-based (the PaymentService stores the data it needs from the domain object user, which is updated whenever the PersonService triggers and supplies a change). The PaymentService itself does not share users.

The answer to this question is that you need a separate Read Database or Materialized View that aggregates data from multiple databases, and makes it ready for fast retrieval. See the CQRS pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
The data in the Materialized View might not be "the most up to date", meaning there might be a small delay between when a change is made by the respective microservice and when the Materialized View is updated. This is fine, as retrieving the data fast is more important than whether the data is stale for a few seconds or even minutes (there are systems where the Materialized View can take 2-5 minutes to be updated, and that might still be acceptable).
The best pattern to implement this Read Database or Materialized View from CQRS, is typically the Event Sourcing pattern, where we can listen to a queue for new updates and update the Read Database immediately. See the Event Sourcing pattern: https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
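A minimal sketch of such a read model for the Payments/Person/Item scenario from the question might look like the following. The event names and fields (`PersonUpdated`, `ItemUpdated`, `PaymentMade`) are invented for illustration, a plain Python list stands in for the queue, and an in-memory dict stands in for the read database.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentReadModel:
    """Denormalized view of payments, fed by events from the three services."""
    people: dict = field(default_factory=dict)    # person_id -> first name
    items: dict = field(default_factory=dict)     # item_id -> item name
    payments: dict = field(default_factory=dict)  # payment_id -> flattened row

    def apply(self, event: dict) -> None:
        """Update the view from one event (event shapes are hypothetical)."""
        kind = event["type"]
        if kind == "PersonUpdated":
            self.people[event["person_id"]] = event["first_name"]
            for row in self.payments.values():
                if row["person_id"] == event["person_id"]:
                    row["first_name"] = event["first_name"]
        elif kind == "ItemUpdated":
            self.items[event["item_id"]] = event["name"]
            for row in self.payments.values():
                if row["item_id"] == event["item_id"]:
                    row["item_name"] = event["name"]
        elif kind == "PaymentMade":
            self.payments[event["payment_id"]] = {
                "payment_id": event["payment_id"],
                "person_id": event["person_id"],
                "item_id": event["item_id"],
                "amount": event["amount"],
                "first_name": self.people.get(event["person_id"]),
                "item_name": self.items.get(event["item_id"]),
            }

    def search(self, first_name: str, item_name: str, page: int, page_size: int) -> list:
        """Filter on person and item fields locally, then paginate (page is 1-based)."""
        rows = sorted(
            (r for r in self.payments.values()
             if r["first_name"] == first_name and r["item_name"] == item_name),
            key=lambda r: r["payment_id"],
        )
        start = (page - 1) * page_size
        return rows[start:start + page_size]


if __name__ == "__main__":
    view = PaymentReadModel()
    queue = [  # stand-in for events published by the three services
        {"type": "PersonUpdated", "person_id": "p1", "first_name": "Bob"},
        {"type": "ItemUpdated", "item_id": "i1", "name": "Car"},
        {"type": "PaymentMade", "payment_id": 1, "person_id": "p1",
         "item_id": "i1", "amount": 20000},
    ]
    for event in queue:
        view.apply(event)
    print(view.search("Bob", "Car", page=1, page_size=10))
```

Because the filter and the pagination both run against one denormalized store, the reporting tool never has to guess how many rows each service will return.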

Storing this data in an Elasticsearch/Solr/Cognitive Search type service in addition to SQL could help solve some of these problems.
In your given example, the person object in the search index (Elasticsearch/Solr/Cognitive Search) would have a property called "items" containing the list of items paid for by that person.
That way, you can filter across objects and get a paginated list sorted by any property of the person. You can add similar information to other documents to better suit your business needs.
Using a graph database would seem to solve your problem from 10,000 ft, but you will run into pagination problems when you operate at scale. Graph databases do not do pagination well (they have to visit all the nodes anyway, even when you only need one page of the list), which will cause timeouts and performance issues.

You can use replication tables.
Most databases have a replication feature.
If you have a PersonService with a person table and a PaymentService with a payment table, then create a ReportService that has both the person and payment tables, filled by the replication feature.

What is the difference between AWS Elasticsearch and AWS Redshift?

I read in the documentation that both are for data analysis and both use a cluster structure, but I don't understand how their use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. (Source: Amazon Elasticsearch)
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. (Source: Amazon Redshift)

Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (afaik) is mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything in order to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster, and you don't need to run any commands in order to install any software - you just need to choose what to run your cluster on (number of nodes, disk, RAM, etc).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch

I agree with @IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is - what are my requirements for my use cases to best fulfill my stakeholder / customer needs, first, and then which data storage technology best aligns with my requirements second...
To be clear - whether speaking of AWS ElasticSearch Service, or FOSS / Enterprise ElasticSearch (which have significant differences between them, even) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) Database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases. However, and most significantly, unlike RDBMS data models, ElasticSearch or NoSQL are not going to provide you with FULL ACID Compliance or Transactional Statement Processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (to Point in Time, even), and normalization of the data model for performance and least repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, "NoSQL and Elastic are not the Droids you're looking for..." and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift Service is based on PostgreSQL - which is one of the most popular OpenSource RDBMS flavors out there - just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between the RDBMS and NoSQL categories, as it is a Search Engine / Index that works most optimally with "single index" type use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. But perhaps the most important thing I could stress is that in my experience it typically does not scale very cost effectively (even with managed cluster services) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - and for most it will likely become cost PROHIBITIVE VERY fast. That said, Elastic Search DOES still have very optimal use cases, so it is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly, let's call NoSQL what it is: a Document Store that stores collections of documents (most often in JSON format). While these stores also do indexing, offer some semblance of an Authentication and Authorization model, and provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle, that SQL is now the preferred means of querying data from their NoSQL instances! :D ), they are still NOT a traditional database and likely won't provide you with much control over your data's integrity - BUT that is precisely what "NoSQL" Document Stores were designed to work best for - UNSTRUCTURED DATA - where you may not always know what your data model is going to look like from the start, or your use case prioritizes data model flexibility over enforcing data integrity in general (non mission critical data). Last - while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble an RDBMS, I am not aware of ANY in that category at present that could claim to offer all that a relational database does, with Oracle MySQL's DocumentStore being probably the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope Developers with similar questions come across this thread, and after reading are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves - everything we do in our profession is about data - either generating it, transporting it, rendering it, transforming it....it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!

This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
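To give a feel for that kind of "complex logical query", here is a small sketch of the toothbrush-and-toothpaste question, minus the year-over-year comparison. Python's built-in `sqlite3` is used purely as a stand-in so the example runs anywhere; it is not Redshift, and the `customers` and `purchases` tables with their columns are hypothetical.

```python
import sqlite3

QUERY = """
SELECT city, COUNT(*) AS buyers
FROM (
    SELECT c.customer_id, c.city
    FROM customers AS c
    JOIN purchases AS p ON p.customer_id = c.customer_id
    WHERE p.product IN ('toothbrush', 'toothpaste')
    GROUP BY c.customer_id, c.city
    HAVING COUNT(DISTINCT p.product) = 2
) AS both_products
GROUP BY city
ORDER BY city
"""


def buyers_of_both(conn: sqlite3.Connection) -> list:
    """Return (city, number of customers who bought both products) rows."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Create a tiny in-memory data set with made-up customers and purchases."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE purchases (customer_id INTEGER, product TEXT);
        INSERT INTO customers VALUES (1, 'Leeds'), (2, 'Leeds'), (3, 'York');
        INSERT INTO purchases VALUES
            (1, 'toothbrush'), (1, 'toothpaste'),
            (2, 'toothbrush'),
            (3, 'toothpaste'), (3, 'toothbrush'), (3, 'toothbrush');
    """)
    return conn


if __name__ == "__main__":
    for city, buyers in buyers_of_both(demo_connection()):
        print(city, buyers)
```

A join plus a grouped `HAVING` over whole tables is routine for a SQL warehouse, whereas a search index is built around finding and ranking individual documents.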
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.

What type of database is best for high number of users and high concurrency?

We are building a web-based application that needs to support a large number of users in a very high concurrency environment. Users will be attempting to change the same record at the same time. In terms of data volume in the database, we expect it to be very low (we're not trying to build the next Facebook); instead, we need to provide each user a very quick turnaround time for each request, so from the database perspective we need a solution that scales very easily as we add more users and records.
We are currently looking at relational and object-based databases, and also distributed database systems such as Cassandra and Hypertable. We prefer the open source solutions over commercial.
We're just looking for some direction, we don't need details on how to build the solution. Any suggestions would be greatly appreciated.

Amazon's SimpleDB supports conditional puts and consistent reads, but at that point you're defeating the purpose and might as well just use MySQL/Percona and scale vertically.
Do you really need ACID? Something's gotta give. And eventual consistency isn't all that bad, right? :)
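For readers unfamiliar with conditional puts, the idea can be sketched with an in-memory stand-in: a write succeeds only if the record's version is still the one the writer read, so two users changing the same record cannot silently overwrite each other. The class and function names here are hypothetical and this is not SimpleDB's API.

```python
import threading


class VersionConflict(Exception):
    """Raised when the record changed since the caller last read it."""


class RecordStore:
    """In-memory stand-in for a store that supports conditional puts."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); a missing key reads as version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if the stored version still equals expected_version."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(key)
            self._data[key] = (current_version + 1, value)
            return current_version + 1


def increment(store, key, retries=10):
    """Read-modify-write with retry on conflict (optimistic concurrency)."""
    for _ in range(retries):
        version, value = store.get(key)
        try:
            return store.put_if_version(key, version, (value or 0) + 1)
        except VersionConflict:
            continue  # someone else won the race; re-read and try again
    raise VersionConflict(key)


if __name__ == "__main__":
    store = RecordStore()
    version, _ = store.get("seat-12")
    store.put_if_version("seat-12", version, "alice")    # first writer wins
    try:
        store.put_if_version("seat-12", version, "bob")  # stale version
    except VersionConflict:
        print("bob must re-read and retry")
```

The cost is that losers of a race have to re-read and retry, which is the "something's gotta give" part when many users hit the same record.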
