I have a system that consists of one central server, many mobile clients, and many worker servers. Each worker server has its own database and may run on the customer's infrastructure (when the customer purchases an on-premises installation).
In my current design, mobile clients send updates to the central server, which updates its database. The worker servers periodically poll the central server to get updated information. This "pull model" creates a lot of requests and is still not sufficient, because workers often work with outdated information.
I want a "push model", where the central server can "post" updates to "somewhere", which persist the last version of the data. Then workers can "subscribe" to this "somewhere" and be always up-to-date.
The main problems are:
A worker server may be offline when an update happens. When it comes back online, it should receive the updates it missed.
A new worker server may be created and need the updated data, including data that was posted before it existed.
A bonus point:
Not needing to manage this "somewhere" myself. My application is deployed on AWS, so if there is any combination of services I can use to achieve this, that would be great. Everything I have found has limited data retention periods.
The problems with a push model are:
If clients are offline, the central system would need a retry mechanism, which would generate many more requests than a pull model
The clients might be behind firewalls, so they cannot receive the message
It is not scalable
A pull model is much more efficient:
Clients should retrieve the latest data when they start, and also at regular intervals
New clients simply connect to the central server -- no need to update the central server with a list of clients (depending upon your security needs)
It is much more scalable
There are several options for serving traffic to pull requests:
Via an API call, powered by AWS API Gateway. You would then need an AWS Lambda function or a web server to handle the request.
Directly from DynamoDB (but the clients would require access credentials)
From an Amazon S3 bucket
Using an S3 bucket has many advantages: it is highly scalable, it offers a good range of security options (public; via credentials; via pre-signed URLs), and no servers are required.
Simply put the data in an S3 bucket and have the clients "pull" the data. You could have one set of files shared by every client, plus a specific file for each individual client, thereby enabling individual configuration. Just think of S3 as a very large key-value datastore.
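To make the S3 option concrete, here is a minimal sketch of the worker-side pull using the AWS SDK for Java v2. The bucket name and key layout ("shared/latest.json", "workers/<id>.json") are hypothetical placeholders, not a prescribed structure:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class ConfigPuller {
    // Bucket and key names are placeholders for illustration only.
    private static final String BUCKET = "my-central-config";
    private final S3Client s3 = S3Client.create();

    public String fetchSharedConfig() {
        // "shared/latest.json" holds the data common to every worker.
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(BUCKET)
                .key("shared/latest.json")
                .build();
        return s3.getObjectAsBytes(request).asUtf8String();
    }

    public String fetchWorkerConfig(String workerId) {
        // Each worker can also have its own file, e.g. "workers/<id>.json".
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket(BUCKET)
                .key("workers/" + workerId + ".json")
                .build();
        return s3.getObjectAsBytes(request).asUtf8String();
    }
}
```

Because S3 always serves the latest version of each object, a worker that was offline, or one that was just created, simply reads the current objects when it starts; there is no stream of missed messages to replay.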
I need to notify all machines behind a load balancer when something happens.
For example, I have machines behind a load balancer which cache data, and if the data changes I want to notify the machines so they can dump their caches.
I feel as if I'm missing something, as it seems I might be overcomplicating how I talk to all the machines behind my load balancer.
--
Options I've considered
SNS
The problem with this is that each individual machine would need to be publicly accessible over HTTPS.
SNS Straight to Machines
Machines would subscribe themselves to SNS with their EC2 URL on startup. To achieve this I'd need to either
open those machines up to HTTP from anywhere (not just the load balancer), or
create a security group which lets the SNS IP ranges into the machines over HTTPS.
This security group could be static (the IPs don't appear to have changed since ~2014, from what I can gather).
I could create a scheduled Lambda which updates this security group from the JSON file of IP ranges provided by AWS if I wanted to ensure the list was always up to date.
SNS via LB with fanout
The load balancer URL would be subscribed to SNS. When a notification is received, one of the machines behind the load balancer would receive it.
That machine would use the AWS API to look up the Auto Scaling group it belongs to, find the other machines attached to the same load balancer, and then send them the same message over their internal URLs; a rough sketch of that lookup follows.
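If that option were pursued, the group lookup could look roughly like this with the AWS SDK for Java v2 (the Auto Scaling group name is a placeholder, and paging and error handling are omitted):

```java
import java.util.List;
import java.util.stream.Collectors;
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.DescribeInstancesRequest;

public class GroupPeers {
    public static List<String> privateIps(String asgName) {
        AutoScalingClient asg = AutoScalingClient.create();
        Ec2Client ec2 = Ec2Client.create();

        // 1. Find the instance IDs in the Auto Scaling group.
        List<String> instanceIds = asg.describeAutoScalingGroups(
                        DescribeAutoScalingGroupsRequest.builder()
                                .autoScalingGroupNames(asgName)
                                .build())
                .autoScalingGroups().get(0).instances().stream()
                .map(i -> i.instanceId())
                .collect(Collectors.toList());

        // 2. Resolve them to private IPs so peers can be called internally.
        return ec2.describeInstances(
                        DescribeInstancesRequest.builder()
                                .instanceIds(instanceIds)
                                .build())
                .reservations().stream()
                .flatMap(r -> r.instances().stream())
                .map(i -> i.privateIpAddress())
                .collect(Collectors.toList());
    }
}
```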
SQS with fanout
Each machine would be a queue worker; one would receive the message and forward it on to the other machines in the same way as the SNS fanout described above.
Redis PubSub
I could set up a Redis cluster that each node subscribes to in order to receive the updates. This seems a costly option given the task at hand (especially given that I'm operating in many regions and AZs).
Websocket MQTT Topics
Each node would subscribe to an MQTT topic and receive the update this way. Not every region I use supports IoT Core yet, so I'd need to either host my own broker in each region or have every region connect to its nearest supported region (or even a single region). I'm not sure about the stability of this, but it seems like it might be a good option.
I suppose a third-party websocket service like Pusher or something could be used for this purpose.
Polling for updates
Each node caches x items; I would have to poll for each item individually or build some means of determining which items have changed so they can be fetched in a bulk request.
This seems excessive though. Hypothetically, with 50 items at a polling interval of 10 seconds, that is
6 requests per item per minute, so
6 * 50 * 60 * 24 = 432,000 requests per day to some web service/Lambda etc. That just seems a bad option for this use case when most of those requests will say nothing has changed. A push/subscription model seems better than a pull/get model.
I could also use long polling perhaps?
DynamoDB Streams
The change that would cause a cache clear is made in a global DynamoDB table (not owned by or known to this service), so I could perhaps allow access to read the stream from that table in every region and listen for changes via that route. That couples the two services pretty tightly, though, which I'm not keen on.
My job is to move our existing Java calculation (a servlet packaged as a WAR file) from our own server to AWS. This is a calculation with no user interface or database. Other companies should be able to call the calculation from their programs. The servlet takes a POST request with a JSON payload and sends a JSON payload back to the client after the calculation is performed. The calculation is relatively heavy and therefore time-consuming (1-2 sec.).
I have decided to use AWS Elastic Beanstalk for the cloud computing, but I'm in doubt as to which EB environment to use - web server or worker environment? And should I use AWS API Gateway in front of EB?
Hopefully somebody can clarify this for me.
A worker environment provisions an SQS queue into which you submit your jobs. To make it reachable from outside AWS you would have to front it with API Gateway (the preferred way).
However, the worker environment works asynchronously. It does not return job results to the caller, so you would need some other mechanism for your clients to get the results back, e.g. through a different API call.
An alternative is the web server environment, where the clients get the response back directly from your JSON-processing application. 1-2 seconds is not that long a wait for an HTTP request.
For a more complex solution based on EB, one could look at Creating links between Elastic Beanstalk environments. You would have a front-end environment for your clients linked with a worker environment that does the JSON processing.
The other way would be to rewrite the app as a Lambda function, if that is possible of course. Lambda seems like a good fit for 1-2 second processing tasks.
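If the Lambda route is workable, the servlet's POST handling maps fairly directly onto a request handler behind API Gateway. A minimal sketch, where CalculationService is a hypothetical wrapper around the existing calculation code that accepts and returns JSON strings:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;

public class CalculationHandler
        implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    // Hypothetical wrapper around the existing calculation code:
    // takes the request JSON as a String and returns the result JSON.
    private final CalculationService service = new CalculationService();

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request,
                                                      Context context) {
        // API Gateway's proxy integration passes the POST body here;
        // a 1-2 second calculation fits comfortably within Lambda's timeout.
        String resultJson = service.calculate(request.getBody());
        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody(resultJson);
    }
}
```

One thing to watch with a JVM-based Lambda is cold-start latency on the first request; provisioned concurrency can mitigate it if the callers are latency-sensitive.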
My first question is, do I need SQS queues to receive my remote data, or can it go directly into an Amazon cloud storage solution like S3 or EC2?
Currently, my company uses a third-party vendor to gather and report on our remote data. By remote data, I mean data coming from our machines out in the wilderness. These data are uploaded a few times each day to Amazon Web Services SQS queues (set up by the third-party vendor), and then the vendor polls the data from the queues, removing it and saving it in their own on-premises databases for one year only. This company only provides reporting services to us, so they don't need to store the data long-term.
Going forward, we want to own the data and store it permanently in Amazon Web Services (AWS). Then we want to use machine learning to monitor the data and report any potential problems with the machines.
To repeat my first question, do we need SQS queues to receive this data, or can it go directly into an Amazon cloud storage solution like S3 or EC2?
My second question is, can an SQS queue send data to two different places? That is, can the queue send the data to the third-party vendor and also to an Amazon Web Services database?
I am an analyst/data scientist, so I know how to use the data once it's in a database. I just don't know the best way of getting it into a database.
You don't really need to have a queue. Whenever you push an item into a queue, a function gets triggered and you can perform your custom logic there, whether you want to store the information in S3/EC2 or send it to any other HTTP service.
Your Lambda (function) can easily send the data to any other third-party service.
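As a rough sketch of that pattern, a Lambda function subscribed to the SQS queue could write each message to S3 (and, if needed, forward it elsewhere). The bucket name and key scheme below are placeholders:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class QueueToS3Handler implements RequestHandler<SQSEvent, Void> {

    private static final String BUCKET = "my-telemetry-archive"; // placeholder name
    private final S3Client s3 = S3Client.create();

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            // Store the raw message body; the message ID keeps the keys unique.
            s3.putObject(PutObjectRequest.builder()
                            .bucket(BUCKET)
                            .key("raw/" + message.getMessageId() + ".json")
                            .build(),
                    RequestBody.fromString(message.getBody()));
            // This is also the point where the payload could be forwarded
            // to the third-party vendor over HTTP if needed.
        }
        return null;
    }
}
```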
This question is more about best practices when developing a web service, so it may be a bit vague.
Let's say my service uses the Spring container, which creates a standard controller object for all requests. Now, in my controller I inject an instance of the DynamoDB mapper, created once in the Spring container.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptionalConfig.html
Question:
Shouldn't we create a pool of DynamoDB client objects and mappers so that parallel requests to the service are served from the pool? Or should we inject the same/a new instance of the DynamoDB mapper object for every request? Why don't we use something like C3P0 for DynamoDB connections?
There is a pretty significant difference between how a relational database works and how DynamoDB works.
With a typical relational database engine such as MySQL, PostgreSQL or MSSQL, each client application instance is expected to establish a small number of connections to the engine and keep them open while the application is in use. Then, when parts of the application need to interact with the database, they borrow a connection from the pool, use it to make a query, and release it back to the pool. This makes efficient use of the connections, removes the overhead of setting up and tearing down connections, and reduces the thrashing that results from creating and releasing connection object resources.
Now, switching over to DynamoDB, things look a bit different. You no longer have persistent connections from the client to a database server. When you execute a DynamoDB operation (query, scan, etc.) it is an HTTP request/response, which means the connection is established ad hoc and lasts only for the duration of the request. DynamoDB is a web service, and it takes care of load balancing and routing to give you consistent performance regardless of scale. In this case it is generally better for applications to use a single DynamoDB client object per instance and let the client and the associated service-side infrastructure take care of the load balancing and routing.
Now, the DynamoDB client for your stack (i.e. the Java client, .NET client, JavaScript/Node.js client, etc.) will typically make use of an underlying HTTP client whose connections are pooled, mostly to minimize the cost of creating and tearing down those connections. You can tweak some of those settings, and in some cases provide your own HTTP client pool implementation, but usually that is not needed.
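In Spring terms, that usually means exposing one client and one mapper as singleton beans rather than pooling them. A minimal sketch using the v1 SDK classes the linked page refers to:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DynamoDbConfig {

    // One thread-safe client per application instance; the SDK's underlying
    // HTTP connection pool handles concurrent requests.
    @Bean
    public AmazonDynamoDB amazonDynamoDB() {
        return AmazonDynamoDBClientBuilder.standard().build();
    }

    // The mapper is also thread-safe, so a single shared instance is enough.
    @Bean
    public DynamoDBMapper dynamoDBMapper(AmazonDynamoDB client) {
        return new DynamoDBMapper(client);
    }
}
```

Both AmazonDynamoDB and DynamoDBMapper are documented as thread-safe, so sharing a single instance across controller requests is fine.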
We need to sync data between different web servers. The idea is very basic: when an entity is created on one server, it should be sent to all the other servers. What's the right way to do it? We are currently evaluating two approaches: Amazon's SQS and SNS services, and a custom implementation with some key-value database (like memcached and memqueue). What are the common pitfalls of custom implementations? Any feedback will be highly appreciated.
SQS would work OK if you create a new queue for each server and write the data to every queue. The biggest downside is that each server will need to poll its queue for new messages.
SNS would work more efficiently because it allows you to broadcast a message to multiple locations. However, delivery is essentially one-shot: if a machine can't receive its notification when SNS sends it, SNS will not keep retrying.
You don't specify how many messages you are sending or what your performance requirements are, but any SQS/SNS system will likely be much, much slower (mostly due to the latency between sending a message and the servers receiving it) than a local memcached/key-value server solution.
A mixed solution would be to use a persistent store (like SimpleDB) and use SNS to alert the servers that new data is available.
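For the SNS part of such a mixed solution, a common pattern is to give each web server its own SQS queue and subscribe all of those queues to a single topic, so one publish fans out to every server, and a server that was down simply catches up from its own queue. A rough one-time setup sketch with the AWS SDK for Java v2 (names are placeholders; the SQS access policy that allows SNS to deliver to the queue is omitted):

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.SubscribeRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.GetQueueAttributesRequest;
import software.amazon.awssdk.services.sqs.model.QueueAttributeName;

public class FanoutSetup {
    public static void subscribeQueue(String topicArn, String queueName) {
        SqsClient sqs = SqsClient.create();
        SnsClient sns = SnsClient.create();

        // Create (or look up) the per-server queue.
        String queueUrl = sqs.createQueue(b -> b.queueName(queueName)).queueUrl();
        String queueArn = sqs.getQueueAttributes(GetQueueAttributesRequest.builder()
                        .queueUrl(queueUrl)
                        .attributeNames(QueueAttributeName.QUEUE_ARN)
                        .build())
                .attributes().get(QueueAttributeName.QUEUE_ARN);

        // Subscribe the queue to the topic; a single publish to the topic
        // then lands a copy of the message in every subscribed queue.
        sns.subscribe(SubscribeRequest.builder()
                .topicArn(topicArn)
                .protocol("sqs")
                .endpoint(queueArn)
                .build());
    }
}
```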