This question is more about best practices when developing a web service, so it may be a bit vague.
Let's say my service uses the Spring container, which creates a single controller object shared by all requests. In my controller I inject an instance of the DynamoDB mapper, created once in the Spring container.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptionalConfig.html
Question:
Shouldn't we create a pool of DynamoDB client objects and mappers so that parallel requests to the service are served from the pool? Or should we inject the same (or a new) instance of the DynamoDB mapper object for all requests? Why don't we use something like c3p0 for DynamoDB connections?
There is a pretty significant difference between how a relational database works and how DynamoDB works.
With a typical relational database engine such as MySQL, PostgreSQL or MSSQL, each client application instance is expected to establish a small number of connections to the engine and keep them open while the application is in use. When parts of the application need to interact with the database, they borrow a connection from the pool, use it to make a query, and release the connection back to the pool. This makes efficient use of the connections, removes the overhead of setting up and tearing down connections, and reduces the thrashing that results from repeatedly creating and releasing connection resources.
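For illustration, here is a minimal sketch of that borrow/use/release pattern using c3p0 (the pool the question mentions); the JDBC URL, credentials, pool sizes and query are placeholders, not values from the question:

```java
import com.mchange.v2.c3p0.ComboPooledDataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PooledQueryExample {
    // One pooled DataSource per application instance, created at startup.
    private static final ComboPooledDataSource pool = new ComboPooledDataSource();

    static {
        // Placeholder connection settings -- replace with your own.
        pool.setJdbcUrl("jdbc:mysql://db.example.com:3306/mydb");
        pool.setUser("app_user");
        pool.setPassword("secret");
        pool.setMinPoolSize(5);
        pool.setMaxPoolSize(20);
    }

    public int countOrders() throws Exception {
        // getConnection() borrows a connection from the pool;
        // close() returns it to the pool instead of tearing it down.
        try (Connection conn = pool.getConnection();
             PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM orders");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        }
    }
}
```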
Now, switching over to DynamoDB, things look a bit different. You no longer have persistent connections from the client to a database server. When you execute a DynamoDB operation (query, scan, etc.) it's an HTTP request/response, which means the connection is established ad hoc and lasts only for the duration of the request. DynamoDB is a web service, and it takes care of load balancing and routing to give you consistent performance regardless of scale. In this case it is generally better for applications to use a single DynamoDB client object per instance and let the client and the associated service-side infrastructure take care of load balancing and routing.
Now, the DynamoDB client for your stack (i.e. the Java client, the .NET client, the JavaScript/Node.js client, etc.) will typically make use of an underlying HTTP client that is pooled, mostly to minimize the cost of creating and tearing down those connections. You can tweak some of those settings, and in some cases provide your own HTTP client pool implementation, but usually that is not needed.
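In practice, that usually means wiring up exactly one client and one mapper as Spring beans and injecting them everywhere. A rough sketch along those lines (the region, max-connections value and bean names are assumptions, not something from the question):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DynamoDbConfig {

    // One client for the whole application; it maintains its own pooled
    // HTTP connections under the hood.
    @Bean
    public AmazonDynamoDB amazonDynamoDB() {
        ClientConfiguration httpConfig = new ClientConfiguration()
                .withMaxConnections(50);   // optional tweak; the default is usually fine
        return AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)          // placeholder region
                .withClientConfiguration(httpConfig)
                .build();
    }

    // One mapper, injected into every controller -- no pool needed.
    @Bean
    public DynamoDBMapper dynamoDBMapper(AmazonDynamoDB client) {
        return new DynamoDBMapper(client);
    }
}
```

Both the client and the mapper are thread-safe, so sharing one instance across parallel requests is fine.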
The Postgres query builder my Lambda functions use, Knex, uses prepared statements so I'm unable to fully take advantage of RDS Proxy since the sessions are pinned. I try to ensure that the lambdas run for as little time as possible so that the pinned session completes as quickly as possible and its connection is returned to the pool.
I was wondering how I might be able to make the sessions shorter and more granular and thinking about creating and closing a connection to AWS RDS Proxy with each query.
What performance considerations should I take into account to determine the viability of this approach?
Things I'm thinking of:
RDS Proxy connection overhead (latency and memory)
The time that RDS Proxy takes to return a closed connection back to the pool and make it reusable by others (haven't been able to find documentation on this)
Overhead of Knex's local connection pool
Using RDS Proxy when building applications with Lambda functions is an infrastructure pattern recommended by AWS. Relational databases are not built to handle huge numbers of connections, while Lambda can scale out to thousands of concurrent instances.
RDS Proxy connection overhead (latency and memory)
This would definitely increase your latency, but you will see a great improvement in the CPU and memory usage of your database, which would ultimately prevent unnecessary failures. It's a good trade-off when you can do a lot of other optimizations on the lambda side.
The time that RDS Proxy takes to return a closed connection back to the pool and make it reusable by others (haven't been able to find documentation on this)
While working with Lambdas, you should drop the connection to your RDS Proxy as soon as you finish processing your logic, without worrying about the time RDS Proxy takes to return the closed connection to the pool. Once the connection is dropped, RDS Proxy keeps it warm in the pool of connections it maintains for a certain duration. If another Lambda tries to make a connection in the meantime, it can reuse that still-warm connection from the pool. Dropping the database connection at the right time in your Lambda saves you Lambda processing time -> money.
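The question's stack is Knex on Node.js, but the same open-late, close-early idea can be sketched in any runtime. Here is a rough illustration with plain JDBC in a Java Lambda handler; the proxy endpoint, credentials and query are all placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ShortSessionHandler implements RequestHandler<String, Integer> {

    // Placeholder RDS Proxy endpoint.
    private static final String URL =
            "jdbc:postgresql://my-proxy.proxy-abc123.us-east-1.rds.amazonaws.com:5432/mydb";

    @Override
    public Integer handleRequest(String input, Context context) {
        // Open the connection as late as possible and close it as soon as the
        // query is done, so RDS Proxy can hand the backing connection to
        // another invocation without waiting for this Lambda to finish.
        try (Connection conn = DriverManager.getConnection(URL, "app_user", "secret");
             PreparedStatement ps = conn.prepareStatement("SELECT COUNT(*) FROM orders");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```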
Overhead of Knex's local connection pool
I would suggest not relying on Knex's local connection pool with Lambda, as it won't do any good (keep the pool max at 1). Every Lambda execution is independent of the others: the pool is never shared, and the connection doesn't persist after the execution completes, unless you plan to use it with a serverless-offline style local framework for development purposes.
Read More about AWS Lambda + RDS Proxy usage: https://aws.amazon.com/blogs/compute/using-amazon-rds-proxy-with-aws-lambda/
AWS Documentation on where to use RDS Proxy: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-proxy-planning.html
AWS SQS provides long polling and short polling, but why doesn't it provide a push mechanism, like RabbitMQ?
The application could establish a long-lived connection and consume the messages pushed from the SQS queue.
This is a design choice and makes sense when considering use cases for which SQS is designed.
Serverless Computing: SQS is a core service when it comes to designing serverless architectures. In such systems there is no concept of "persistent servers" and hence no need for long-lived connections. This is also why the pricing model of SQS is based primarily on API calls.
REST API Access vs Connections: To me, this says it all. In a serverless environment, having REST APIs in front of the services is necessary, because I cannot program around when a compute node is provisioned or deprovisioned (in Lambda, for example, there are no hooks for these events). That means I would either have to introduce a new layer (connection pools) or live with dangling connections. Otherwise I would end up opening and closing a connection for every single operation (or Lambda invocation), which would not give me any of the benefits of the "connection" in the first place. Here, having a REST API makes sense (see the long-polling sketch after these points).
This is also why DynamoDB (a database) is made accessible via a REST API, and why Aurora now has a serverless alternative which, as you guessed, also has a REST API.
Overhead of long-lived connections: The overhead of long-lived connections, on either end, is expensive enough that it would require a completely different architecture. This again ties back to the point above about not having servers to keep the connections open in the first place.
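Since consumers talk to SQS through plain API calls, the closest thing to "push" is long polling: a ReceiveMessage call that waits server-side (up to 20 seconds) for messages to arrive. A minimal sketch with the AWS SDK for Java; the queue URL is a placeholder:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class LongPollingConsumer {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

        // Long polling: the ReceiveMessage call blocks server-side for up to
        // 20 seconds until messages arrive, instead of returning immediately.
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withWaitTimeSeconds(20)
                .withMaxNumberOfMessages(10);

        for (Message message : sqs.receiveMessage(request).getMessages()) {
            System.out.println("Received: " + message.getBody());
            // Delete after successful processing so the message is not redelivered.
            sqs.deleteMessage(queueUrl, message.getReceiptHandle());
        }
    }
}
```

Setting WaitTimeSeconds to 0 gives you short polling instead, which returns immediately.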
Disclaimer: This answer comes from my experience of building architecture on AWS.
I am accessing an Aurora DB from my Java Lambda code. I have set my Lambda concurrency to 1.
Since creating and closing a database connection is an expensive process, I have created the MySQL connection and made it static, so the same connection is reused every time. I haven't added any code to close the connection.
Will it cause any problems?
Will it automatically close after some days?
Most certainly yes! When your Lambda "cools down", your connection to the database will be broken. The next time you invoke your Lambda, it goes through a cold start, and your Lambda code should initialize the connection again. This is a standard issue when working with persistent connections from serverless infrastructure.
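A common mitigation is to keep the static connection but treat it as disposable: check that it is still alive at the start of each invocation and reopen it if it is not. A rough sketch, assuming the MySQL JDBC driver and placeholder connection details:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AuroraHandler implements RequestHandler<String, String> {

    // Cached across warm invocations of the same container; lost on cold start.
    private static Connection connection;

    private static Connection getConnection() throws SQLException {
        // Reopen the connection if the container is fresh or the old
        // connection was dropped while the Lambda was idle.
        if (connection == null || !connection.isValid(2)) {
            connection = DriverManager.getConnection(
                    "jdbc:mysql://my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/mydb", // placeholder
                    "app_user", "secret");
        }
        return connection;
    }

    @Override
    public String handleRequest(String input, Context context) {
        try (PreparedStatement ps = getConnection().prepareStatement("SELECT NOW()");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getString(1);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```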
What you need is something like a REST API for your data access, and that's something Aurora Serverless supports (in beta at the time of writing).
https://aws.amazon.com/about-aws/whats-new/2018/11/aurora-serverless-data-api-beta/
Each request is an independent HTTP request, and you don't end up managing persistent connections yourself.
I'm trying to find a good architecture for connecting to the database. It is required that the database connection is not repeated in each Lambda function. In addition, the current approach creates many connections from individual Lambdas instead of one shared connection. Can I implement the structure as in the figure below, so that one Lambda connects to the database and everyone else uses its connection in their code?
Your proposed architecture will not work as intended: unless your invocations of the DB Lambda are frequent enough to always keep it warm, and you cache the connection outside the handler for reuse on subsequent invocations, your DB Lambda will create a new connection for each invocation. Moreover, if your invocations of the DB Lambda spin up multiple containers to serve simultaneous requests, you will end up with that many connections anyway, instead of just one.
An ideal solution would be to replace the DB Lambda with a tiny EC2 instance.
The DB connection can be cached in your "DB Lambda" when the Lambda stays warm. If the Lambda does not stay warm, then invoking that lambda will suffer the price of a cold lambda, which could be dominated by having to recreate a DB connection, or could be dominated by what other work you do in "DB Lambda".
How frequently you expect your Lambdas to go cold is something to take into consideration; it depends on the statistics of your incoming traffic. Whether you are willing to suffer the latency of recreating a DB connection once in a while is another consideration.
Managing a tiny EC2 instance, as someone else suggested, could be a lot of extra work, depending on whether your cloud service is a complex set of backend services and whether that service shuts down during periods of inactivity. Managing EC2 instances is more work than managing Lambdas.
I do see one potential problem with your architecture. If for whatever reason your "DB Lambda" fails, the calling Lambda won't know. That could be a problem if you need to handle that situation and do cleanup.
I have a system that consists of one central server, many mobile clients and many worker servers. Each worker server has its own database and may be on the customer's infrastructure (when they purchase the on-premises installation).
In my current design, mobile clients send updates to the central server, which updates its database. The worker servers periodically poll the central server to get updated information. This "pull model" creates a lot of requests and is still not sufficient, because workers often use outdated information.
I want a "push model", where the central server can "post" updates to "somewhere" that persists the last version of the data. Then workers can "subscribe" to this "somewhere" and always be up to date.
The main problems are:
A worker server may be offline when an update happens. When it comes back online, it should receive the updates it missed.
A new worker server may be created and need to get the updated data, even data that was posted before it existed.
A bonus point:
Not needing to manage this "somewhere" myself. My application is deployed on AWS, so if there's any combination of services I can use to achieve this, that would be great. Everything I found has limited data retention time.
The problems with a push model are:
If clients are offline, the central system would need a retry mechanism, which would generate many more requests than a pull model
The clients might be behind firewalls, so cannot receive the message
It is not scalable
A pull model is much more efficient:
Clients should retrieve the latest data when they start, and also at regular intervals
New clients simply connect to the central server -- no need to update the central server with a list of clients (depending upon your security needs)
It is much more scalable
There are several options for serving data to pulling clients:
Via an API call, powered by AWS API Gateway. You would then need an AWS Lambda function or a web server to handle the request.
Directly from DynamoDB (but the clients would require access credentials)
From an Amazon S3 bucket
Using an S3 bucket has many advantages: Highly scalable, a good range of security options (public; via credentials; via pre-signed URLs), no servers required.
Simply put the data in an S3 bucket and have the clients "pull" the data. You could have one set of files for "every" client, and a specific file for each individual client, thereby enabling individual configuration. Just think of S3 as a very large key-value datastore.
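As a rough sketch of the pull side, a worker could fetch a shared file plus its own per-worker file on start-up and then on a schedule; the bucket and key names below are placeholders:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class WorkerConfigPuller {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private static final String BUCKET = "my-central-config-bucket"; // placeholder

    // Called on worker start-up and then on a schedule (e.g. every minute).
    public String pullConfig(String workerId) {
        // Data shared by every worker.
        String shared = s3.getObjectAsString(BUCKET, "config/shared.json");
        // Data specific to this worker, enabling per-worker configuration.
        String mine = s3.getObjectAsString(BUCKET, "config/workers/" + workerId + ".json");
        return shared + "\n" + mine;
    }
}
```

Unlike queue-based services, the objects stay in the bucket until you delete them, which also addresses the data-retention concern.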