AWS Lambda with backing database performance

Does anyone have performance metric links or a tips & tricks list for using AWS Lambda with a backing RDBMS (Aurora with a MySQL or PostgreSQL engine) and/or DocumentDB? I want to use this functionality, but I first want at least some idea of what my performance calculations will need to be, to determine whether or not it is feasible for certain operations.
I see things like these:
https://docs.aws.amazon.com/lambda/latest/dg/configuration-database.html
https://cloudonaut.io/passwordless-database-authentication-for-aws-lambda
etc.
However, nothing seems to include performance metrics. Since database authentication/authorization (authc/authz) can be expensive, I want to see what I'm in for. I don't want to blow the pricing for concurrent connections out of the water, nor suffer excessive delays whenever a Lambda needs to access the database.
The issue here is wanting to run mostly serverless for some things while understanding that it's simply not the best solution for every scenario.
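For what it's worth, the biggest per-invocation cost you can usually control is the connection handshake: Lambda execution environments are often reused between invocations, so initializing the connection outside the handler lets warm invocations skip authentication entirely (and RDS Proxy exists for pooling connections across many concurrent Lambdas). A minimal sketch, assuming a MySQL-flavored Aurora endpoint and the PyMySQL client, with all endpoint/credential/table names as illustrative placeholders:

```python
import os
import pymysql

# Created once per execution environment, not once per invocation.
# Warm invocations reuse this connection and skip the auth handshake.
# All endpoint/credential names are illustrative placeholders.
connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
    connect_timeout=5,
)

def lambda_handler(event, context):
    # Reconnect only if the cached connection went stale while frozen.
    connection.ping(reconnect=True)
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT id, name FROM items WHERE owner_id = %s",
            (event["userId"],),
        )
        return {"statusCode": 200, "body": str(cursor.fetchall())}
```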

Related

Use of Redis cluster vs standalone Redis

I have a question about when it makes sense to use a Redis cluster versus standalone Redis.
Suppose one has a real-time gaming application that allows multiple instances of the game and wants to implement a real-time leaderboard for each instance (games are created by communities of users).
Suppose at any time we have, say, 100 simultaneous matches running.
Based on the use cases outlined here:
https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf
https://redislabs.com/solutions/use-cases/leaderboards/
https://aws.amazon.com/blogs/database/building-a-real-time-gaming-leaderboard-with-amazon-elasticache-for-redis/
We can implement each leaderboard using a Sorted Set held in memory.
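For concreteness, a minimal sketch of one per-match leaderboard with the redis-py client (the key naming scheme is invented for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def record_score(match_id: str, player: str, score: float) -> None:
    # One Sorted Set per match; Redis keeps members ordered by score.
    r.zadd(f"leaderboard:{match_id}", {player: score})

def top_players(match_id: str, n: int = 10):
    # Highest scores first, scores included.
    return r.zrevrange(f"leaderboard:{match_id}", 0, n - 1, withscores=True)
```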
Now I would like to implement some sort of persistence, where leaderboard state is saved as a snapshot at the end of each game, so that each of these independent Sorted Sets is saved as a snapshot file.
I have a question about design choices:
Would a Redis cluster make sense for this scenario? Or would it make more sense to have standalone Redis instances and create a new database for each game?
As far as I know, there is only a single database 0 for a Redis cluster (https://redis.io/topics/cluster-spec).
In that case, how would snapshotting the dataset for each leaderboard at different times work?
From what I can see, using a Redis cluster only makes sense for large-scale monolithic applications and may not be the best approach for the scenario described above. Is that the case?
Or, if one goes with AWS ElastiCache for Redis in cluster mode, can I configure snapshotting for individual datasets?
You are correct: clustering is a way of scaling out to handle really high request loads and to store tons of data.
It doesn't sound like you need to bother with a cluster.
I'd be very surprised if a standalone Redis setup were your bottleneck before you had several tens of thousands of simultaneous players.
If you are unsure, you can probably mock up some simulated load and see what it can handle. My guess is that you are better off focusing on other complexities of your game until you start reaching quite serious usage. Which is a good problem to have. :)
You might however want to consider having one or two replica instances, which is a different thing.
Secondly, cluster or not, why do you want to use snapshots (SAVE or BGSAVE) to persist your scoreboard?
If you want individual snapshots per game, and it's only a few keys per game, why not just have your application read those keys and persist them to a traditional DB when needed? You can, for example, use MULTI, DUMP and RESTORE to achieve something very similar to snapshotting, but on just the specific keys you want.
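To make that concrete, here is a rough sketch of per-key "snapshots" with DUMP and RESTORE via redis-py; where the payload is persisted (S3, a BLOB column in an RDBMS, a file) is left open, and the key names are illustrative:

```python
import redis

r = redis.Redis()

def snapshot_leaderboard(match_id: str) -> bytes:
    # DUMP serializes a single key in Redis's internal format.
    # Persist the returned payload wherever you like.
    return r.dump(f"leaderboard:{match_id}")

def restore_leaderboard(match_id: str, payload: bytes) -> None:
    # RESTORE recreates the key from a DUMP payload.
    # ttl=0 means no expiry; replace=True overwrites an existing key.
    r.restore(f"leaderboard:{match_id}", 0, payload, replace=True)
```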
It doesn't sound like multiple databases are warranted for this.
Multiple databases on clustered Redis are only supported in the Enterprise version, so not on ElastiCache. But the approach mentioned above should work just fine.

How to build complex apps with AWS Lambda and SOA?

We currently run a Java backend which we're hoping to move away from, switching to Node running on AWS Lambda & Serverless.
Ideally, during this process, we want to build out a fully service-oriented architecture.
My question: if our frontend Angular app requests the current user's ordered items, it would need to hit three services to get that information: the user service, the order service, and the item service.
Does this mean we would need to make three GET requests to these services? At the moment we would have a single endpoint built for that specific request, which can take advantage of DB joins for optimal performance.
I understand the benefits of SOA, but how do we scale when performing more complex requests such as this? Are there any good resources I can take a look at?
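One pattern worth reading about in this context is API composition: a thin aggregator fans out to the services concurrently and merges the results, so the frontend still makes a single request and pays roughly the latency of the slowest call rather than the sum. A sketch using only the Python standard library (the service URLs are made up):

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Illustrative placeholder endpoints for the three services.
SERVICES = {
    "user":   "https://users.internal/api/users/{uid}",
    "orders": "https://orders.internal/api/orders?user={uid}",
    "items":  "https://items.internal/api/items?user={uid}",
}

def fetch(url: str) -> dict:
    with urlopen(url, timeout=3) as resp:
        return json.load(resp)

def get_user_ordered_items(uid: str) -> dict:
    # Fan out in parallel instead of calling the services serially.
    with ThreadPoolExecutor(max_workers=len(SERVICES)) as pool:
        futures = {name: pool.submit(fetch, url.format(uid=uid))
                   for name, url in SERVICES.items()}
        return {name: f.result() for name, f in futures.items()}
```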
Looking at your question, I would advise you to align your priorities first: why do you want to move away from the Java backend that you're running now? Which problems do you want to overcome?
You're combining the microservices architecture and the concept of serverless infrastructure in your question. Both can be used in conjunction, but they don't have to be. A lot of companies are using microservices, even bigger enterprises like Uber (on NodeJS), but serverless infrastructures like Lambda are really just getting started. I would advise you to read up on microservices especially; e.g. here are some nice articles. You'll also find answers to your question about performance and joins.
When considering an architecture based on Lambda, do consider that no state whatsoever is possible inside a Lambda function. This is a step further than the stateless services we usually talk about, which generally just mean that no 'client state' (e.g. session data) exists between requests. A Lambda function cannot hold any state at all, so e.g. a persistent DB connection pool is not possible. For all the downsides, there's also a lot of stuff you don't have to deal with, which can be very beneficial, especially in terms of scalability.

What is the difference between AWS Elasticsearch and AWS Redshift?

I've read the documentation, and both are for data analysis and run as clusters, but I don't understand how their use cases differ.
Amazon Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. (Source: Amazon Elasticsearch)
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. (Source: Amazon Redshift)
Amazon Redshift is a hosted data warehouse product, while Amazon Elasticsearch is a hosted ElasticSearch cluster.
Redshift is based on PostgreSQL and (AFAIK) mostly used for BI purposes and other compute-intensive jobs, while Amazon Elasticsearch is an out-of-the-box managed ElasticSearch cluster (which you cannot use to run SQL queries, since ES is a NoSQL database).
Both Amazon Redshift and Amazon ES are managed services, which means you don't need to do anything to manage your servers (this is what you pay for). Using the AWS Console you can add a new cluster without running any commands to install software - you just need to choose which servers to run your cluster on (number of nodes, disk, RAM, etc.).
If you are not familiar with ElasticSearch you should check their website.
Edit: It is now possible to write SQL queries on ElasticSearch: SQL Support for AWS ElasticSearch
I agree with @IMSoP's assertions above...
To compare the two is like comparing an elephant and a tiger - you're not really asking the right question quite yet.
What you should really be asking is: what are the requirements of my use cases, to best fulfill my stakeholder / customer needs, first; and which data storage technology best aligns with those requirements, second...
To be clear - whether speaking of the AWS ElasticSearch Service or of FOSS / Enterprise ElasticSearch (which themselves have significant differences between them) - ElasticSearch is NOT a Relational Database (RDBMS), nor is it quite a NoSQL (Document Store) database, either...
ElasticSearch is a Search Engine / Index. It does some things very well, for very specific use cases. Unlike RDBMS data models, however, ElasticSearch and NoSQL stores are not going to provide you with FULL ACID compliance or transactional statement processing. So if your use case prioritizes data integrity, constrainability, reliability, auditability, regulatory compliance, recoverability (even to a point in time), and a normalized data model that minimizes repetition of data while providing deep cardinality and enforcing model constraints for optimal integrity, then "NoSQL and Elastic are not the droids you're looking for..." and you should be implementing an RDBMS solution. As already mentioned, the AWS Redshift service is based on PostgreSQL - one of the most popular open-source RDBMS flavors out there, just offered by AWS as a fully managed solution / service for their customers.
Elastic falls between the RDBMS and NoSQL categories: it is a Search Engine / Index that works best with "single index" use cases, where A LOT of content is indexed all at once and those documents aren't updated very frequently after the initial bulk indexing. Perhaps the most important thing I can stress is that, in my experience, it typically does not scale very cost-effectively (even as a managed cluster service) if you want your clusters to perform well, not degrade over time, retain large historical datasets, and remain highly available for your consumers - for most it will likely become cost PROHIBITIVE very fast. That said, ElasticSearch DOES still have very good use cases, so it is always worth evaluating against your unique requirements - just keep scalability and cost in mind while doing so.
Lastly, let's call NoSQL what it is: a Document Store that stores collections of documents (most often in JSON format). Document stores also do indexing, offer some semblance of an authentication and authorization model, and provide CRUD operability (or even SQL support nowadays, which makes the career Enterprise Data Engineer in me giggle - SQL is now the preferred means of querying data from NoSQL instances! :D). Still, a Document Store is NOT a traditional database and likely won't provide you with much control over your data's integrity - but that is precisely what "NoSQL" Document Stores were designed to work best for: UNSTRUCTURED DATA, where you may not always know what your data model is going to look like from the start, or where your use case prioritizes data model flexibility over enforcing data integrity in general (non-mission-critical data). And while most modern NoSQL Document Stores may have SOME features that appear on the surface to resemble an RDBMS, I am not aware of ANY at current that could claim to offer all that a relational database does, with Oracle MySQL's Document Store probably being the best of both worlds in my opinion (and not just because I've worked with it every day for the last decade, either...).
So - I hope developers with similar questions come across this thread and, after reading, are much better informed to make the most optimal design decisions for their use cases - because if we're all being honest with ourselves, everything we do in our profession is about data: generating it, transporting it, rendering it, transforming it... it all starts and ends with data, and making the most optimal data storage decisions for your applications will literally define the rest of your project!
Cheers!
This strikes me as like asking "What is the difference between apples and oranges? I've heard they're both types of fruit."
AWS has an overview of the analytics products they offer, which at the time of writing lists 21 different services. They also have a list of database products which includes Redshift and 10 others. There's no particularly obvious reason why these two should be compared, and the others on both pages ignored.
There is inevitably a lot of overlap between the capabilities of these tools, so there is no way to write an exhaustive list of use cases for each. Their strengths and weaknesses, and the other tools they integrate easily with, will change over time, and some differences are a matter of "taste" or "style".
Regarding the two picked out in the question:
Elasticsearch is a product built by elastic.co, which AWS can manage the installation and configuration for. As its name suggests, its core functionality is based around search - it can be used to build a flexible but fast product search for an e-commerce site, for instance. It's also commonly used along with other tools to search and aggregate logs and monitoring data.
Redshift is a database system built by AWS, based on PostgreSQL but optimised for extremely large data sets. It is designed for "data warehouse" applications, where you want to write complex logical queries against the data, like "how many people in each city bought both a toothbrush and toothpaste, this year compared to last year".
Rather than trying to make an abstract comparison of all the different services available, it makes more sense to start from the use case which you actually have, and see which tool best fits that need.

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2 KB per message). Initially some "real-time" processing and monitoring of the events is in scope, so I'll be using Redis to keep some of that data around for making decisions, expunging it when that makes sense. I was going to persist all of the entities, including events, in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My question is:
Is it even normal to use something like Redshift as a backend for web applications, or does it typically play more of a passive role? If not, is it realistic to think I can scale Postgres enough to start with only that for the event data? And if not, does the "window of data and archival" method make sense?
EDIT: Here are some things I saw before writing this post:
Some say "yes, go for it" regarding the "should I use Redshift for this?" question.
Some say "eh, not performant enough for most web apps" and are in the "front it with a Postgres database" camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
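As a rough sketch of that journal idea, assuming newline-delimited JSON and size-based rotation (file names and limits are invented):

```python
import json
import os
import time

JOURNAL = "events.journal"
MAX_BYTES = 64 * 1024 * 1024  # rotate at ~64 MB

def append_event(event: dict) -> None:
    # Append-only writes are cheap and crash-friendly; the rotated
    # files are what a periodic job uploads to Redshift for analysis.
    with open(JOURNAL, "a") as f:
        f.write(json.dumps(event) + "\n")
    if os.path.getsize(JOURNAL) >= MAX_BYTES:
        os.rename(JOURNAL, f"{JOURNAL}.{int(time.time())}")
```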
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are quite bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, their habit of doing many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs.
Redshift is based on a painfully old PostgreSQL with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
Personally, I'd avoid an ORM entirely for this - I'd just accumulate data locally in SQLite or a local PostgreSQL instance or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data from an in-memory buffer as I received it. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
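A minimal sketch of that buffer-and-batch idea with psycopg2, whose execute_values helper emits one multi-valued INSERT per batch (the table, columns, and DSN are invented):

```python
import psycopg2
from psycopg2.extras import execute_values

# Local staging DB (not Redshift); the DSN is an illustrative placeholder.
conn = psycopg2.connect("dbname=staging")

buffer = []        # in-memory accumulator for incoming events
BATCH_SIZE = 1000  # flush threshold; tune against your data-loss window

def record_event(ts, payload):
    buffer.append((ts, payload))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    # One multi-valued INSERT per batch instead of a round trip per row.
    with conn, conn.cursor() as cur:
        execute_values(cur, "INSERT INTO events (ts, payload) VALUES %s", buffer)
    buffer.clear()
```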
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards) there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel, mostly-read OLAP workloads.
That can result in all kinds of concurrency-related failures that would not have been failures on a typical DB that supports read-committed isolation by default.
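In practice, that means any client talking to Redshift directly needs retry logic around serialization failures. A hedged sketch with psycopg2, which speaks Redshift's Postgres-derived protocol (the exact error code surfaced can vary, so adjust the check to what your driver reports):

```python
import time
import psycopg2

def run_with_retry(conn, sql, params=(), attempts=5):
    # Under SERIALIZABLE isolation, concurrent transactions can abort with
    # SQLSTATE 40001 ("serialization failure"); the standard remedy is to
    # retry the whole transaction with backoff.
    for attempt in range(attempts):
        try:
            with conn, conn.cursor() as cur:  # the with-block rolls back on error
                cur.execute(sql, params)
                return cur.fetchall() if cur.description else None
        except psycopg2.OperationalError as exc:
            if getattr(exc, "pgcode", None) != "40001" or attempt == attempts - 1:
                raise
            time.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
```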

What type of database is best for high number of users and high concurrency?

We are building a web-based application that needs to support a large number of users in a very high-concurrency environment. Users will be attempting to change the same record at the same time. In terms of data volume, we expect the database to be very small (we're not trying to build the next Facebook); instead we need to provide each user a very quick turnaround time for each request, so from the database perspective we need a solution that scales very easily as we add more users and records.
We are currently looking at relational and object-based databases, and also at distributed database systems such as Cassandra and Hypertable. We prefer open-source solutions over commercial ones.
We're just looking for some direction; we don't need details on how to build the solution. Any suggestions would be greatly appreciated.
Amazon's SimpleDB supports conditional puts and consistent reads, but at that point you're defeating the purpose and might as well just use MySQL/Percona and scale vertically.
Do you really need ACID? Something's gotta give, and eventual consistency isn't all that bad, right? :)