I have an interesting problem where I have a job processing architecture that has a limit on how many jobs can be processed at once. When another job is about to start, it needs to check how many jobs are being processed, and if it is at the threshold, add the job to a queue.
What has stumped me is the best way to implement a "counter" that tracks the number of jobs running at once. This counter needs to be read, incremented, and decremented from different Lambda functions.
My first thought was a CloudWatch high-resolution custom metric, but its 1-second granularity is not quick enough, as the system breaks if too many jobs are submitted. Additionally, I'm not sure a metric can even be incremented up and down purely from code. The only thing I can think of now is an entire separate DB or EC2 instance, but that seems like complete overkill for just ONE number. We are not using a DB for data storage (that lives in another cloud platform), only S3.
Any suggestions on what to do next? Thank you so much :)
You could use a DynamoDB table to hold your counter as a single item. Keep in mind, though, that a naive read-then-write against DynamoDB can lead to race conditions, so you'll want to "lock" the item or, better, use an atomic conditional update.
Depending on your load, this could potentially be free under the Free Tier.
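For concreteness, here is a minimal sketch of the conditional-update approach using boto3. The table name `job-counter`, key attribute `pk`, counter attribute `running`, and threshold of 20 are all hypothetical placeholders:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

TABLE = "job-counter"                        # hypothetical table name
COUNTER_KEY = {"pk": {"S": "running-jobs"}}
MAX_CONCURRENT = 20                          # your processing threshold

def try_acquire_slot() -> bool:
    """Atomically increment the counter, but only if it is below the threshold."""
    try:
        dynamodb.update_item(
            TableName=TABLE,
            Key=COUNTER_KEY,
            UpdateExpression="ADD running :one",
            ConditionExpression="attribute_not_exists(running) OR running < :max",
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":max": {"N": str(MAX_CONCURRENT)},
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # at the threshold: queue the job instead
        raise

def release_slot() -> None:
    """Decrement the counter when a job finishes."""
    dynamodb.update_item(
        TableName=TABLE,
        Key=COUNTER_KEY,
        UpdateExpression="ADD running :minus_one",
        ExpressionAttributeValues={":minus_one": {"N": "-1"}},
    )
```

Because the condition is evaluated server-side, two Lambdas racing for the last slot cannot both succeed; the loser enqueues its job instead.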
A service runs on ECS and writes the requested URL to a DynamoDB table. Dynamic scaling was activated to keep the DynamoDB costs from becoming too high. At times, DynamoDB scales more slowly than requests come in, so some calls are not logged. My question now is whether writing to an SQS queue would be the better way here, because the documentation says:
Standard queues support a nearly unlimited number of API calls per second, per API action (SendMessage, ReceiveMessage, or DeleteMessage).
Of course, the messages would then have to be written back to DynamoDB, but another service can then do that.
Is the throughput of messages per second to SQS really unlimited, so it's definitely cheaper to send messages to SQS instead of increasing DynamoDB's writes per second?
I don't know if this qualifies as a good answer. But remembering a discussion with my architect at the time, we concluded that having a queue for precisely this problem is good practice, regardless of load. It retains requests even if services go down, so there is an added benefit.
SQS and DynamoDB fit two very different use cases. It's not so much which is better, it's which is right for what you need.
DynamoDB is a NoSQL document database. It is best when you have known access patterns to data that needs to persist over time, that you need to access quickly, but that you probably are not changing often (or at least the changes do not have to be immediately, sub-5 ms accessible). Each document in DynamoDB is similar (but also very different) to a row in a standard SQL table, in that it has attributes (columns) and keys (partition and sort key) and is retrievable through a query (though dynamic, on-the-fly queries are NOT a good fit for DynamoDB).
SQS is a queue system. It is not a long-term data store: payloads (typically JSON objects) are dropped onto the queue and then processed by some endpoint, whether that's a Lambda, a writer into DynamoDB, or something else entirely depending on your product's use case. It is perfect for when you receive bursts of data but your system needs time to handle each individual payload, such as when it must wait on other systems to finish before it can handle the next one; instead of scaling horizontally (handling all the payloads in parallel) you scale vertically (working through more payloads over time with a single thread or only a few threads). You cannot access the data while it is waiting in the queue, and you cannot query it; you can only wait until each message pops off the queue and into processing by whatever system you have set up to receive it.
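As a rough sketch of that flow in boto3 (the queue URL, message shape, and the eventual DynamoDB write are all hypothetical placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/request-log"  # hypothetical

# Producer: the ECS service drops the payload onto the queue instead of
# writing straight to DynamoDB.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"url": "/some/path"}))

# Consumer: a separate worker (or Lambda) drains the queue and performs the
# DynamoDB writes at whatever rate the table's capacity allows.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    record = json.loads(msg["Body"])
    # ... write `record` to DynamoDB here ...
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```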
The answer to your question is entirely dependent on your use case and your system, something we here at SO will never really understand or know, simply because we will always be hearing about it through you rather than experiencing it ourselves. To answer it, you need to understand the capabilities of both DynamoDB and SQS, the pros and cons of each, and then determine which is best for your product.
I have thousands of training jobs that I want to run on SageMaker. Basically I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not standard hyperparameter tuning, where we just want to optimize the hyperparameter; here we want to train for all of the hyperparameters). I have searched the docs quite extensively, and it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.
For example, let's say I have 10,000 training jobs and my quota is 20 instances. What is the best way to run these jobs while utilizing all my available instances? In particular:
Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs)?
Is it best practice to run a single training job per instance? If so, do I need to ask for a much higher quota on the number of instances?
If this functionality does not exist in SageMaker, is it worth using EC2 instead, since it's a bit cheaper?
Your question is very broad and the best way forward would depend on other details of your use-case, so we will have to make some assumptions.
[Queue manager]
SageMaker does not have a queue manager. If at the end you decide you need a queue manager, I would suggest looking towards AWS Batch.
[Single vs multiple training jobs]
Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save on time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; at a roughly 3-minute start time per round, depending on instance type, that is 25 hours of wait time alone. Depending on the complexity of each individual model, those 25 hours might be significant or totally acceptable.)
[Instance limit increase]
You can always ask for a limit increase, but going from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case it might be fine.
[One possible option] (Assuming multiple lightweight models)
You could create a single training job with the instance count set to the number of instances available to you.
Inside the training job, your code can run a for loop and perform all the individual training jobs you need.
In this case, you will need to know which instance is which so you can split the hyperparameter sets between them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run its subset of the trainings required.
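As a minimal sketch, with `load_hyperparameter_sets()` and `train_one_model()` as hypothetical stand-ins for your own code, each instance can pick out its share like this:

```python
import json

# SageMaker writes the cluster topology into every training container.
with open("/opt/ml/input/config/resourceconfig.json") as f:
    cfg = json.load(f)

hosts = sorted(cfg["hosts"])                 # e.g. ["algo-1", "algo-2", ...]
my_index = hosts.index(cfg["current_host"])  # this instance's rank in the cluster

# `load_hyperparameter_sets` and `train_one_model` are hypothetical
# stand-ins for your own code.
all_hpo_sets = load_hyperparameter_sets()

# Round-robin split: instance i trains every len(hosts)-th configuration.
for hpo in all_hpo_sets[my_index::len(hosts)]:
    train_one_model(hpo)
```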
Another thing to think about is whether you need to save the generated models (you probably do). You can save everything in the output model directory (the standard SageMaker approach), but this zips all the models into a single model.tar.gz file.
If you don't want that and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
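If you go that route, the checkpoint sync is configured on the estimator. A minimal sketch, with placeholder image, role, and bucket values:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image>",           # placeholder
    role="<your-sagemaker-execution-role-arn>",  # placeholder
    instance_count=20,                           # all instances available to you
    instance_type="ml.m5.xlarge",                # example type; pick your own
    # Anything the training code writes to /opt/ml/checkpoints (the default
    # local checkpoint path) is synced to this S3 prefix as it is written:
    checkpoint_s3_uri="s3://your-bucket/individual-models/",
)
estimator.fit("s3://your-bucket/training-data/")
```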
I am looking for some best practice advice on AWS, and hoping this question won't immediately be closed as too open to opinion.
I am working on a conversion of a Windows server application to AWS Lambda.
The server runs every 5 minutes and grabs all the files that have been uploaded to various FTP locations.
These files must be processed in a specific order, which might not be the order they arrive in, so it then sorts them and processes accordingly.
It interacts with a database to validate the files against information from previous files.
It then sends the relevant information on, and records new information in the database.
Errors are flagged, and logged in the database, to be dealt with manually.
Note that currently there is no parallel processing going on. This would be difficult because of the need to sort the files and process them in the correct order.
I have therefore been assuming the lambda will have to run as a single invocation on a schedule.
However, I have realised that the files can be partitioned according to where they come from, and those locations can be processed independently.
So I could have a certain amount of parallelism.
My question is what is the correct way to manage that limited parallelism in AWS?
A clunky way of doing it would be through the database, something like this:
- A lambda spins up and reads a particular table in the database.
- This table has a list of independent processing areas, with the columns "Status" and "StartTime".
- The lambda finds the oldest one not currently being processed, registers it as "processing", and updates the "StartTime".
- After processing, the status is set to "done" or some such.
I think this would work, but it doesn't feel quite right to be managing such things through the database.
Can someone suggest a pattern that my problem fits into, and the correct AWS way of doing this?
If you really want to do this with parallel Lambda invocations, then yes, you should absolutely use a database to coordinate their work.
The protocol you're thinking about seems reasonable. You need to use the transactional capabilities of the database to ensure that the parallel invocations don't interfere with each other, and you need to make sure the system is resilient to Lambda invocations that fail to happen.
When your Lambda is invoked to handle the scheduled event, it should decide how many additional parallel invocations are required and then make asynchronous Lambda calls to launch those additional instances. Those instances should recognize that they were invoked directly and skip that fan-out step.
After that, all of the parallel lambda invocations should do exactly the same thing. Make sure that none of them are special in any way, so you don't need to rely on any particular one completing without error. They should each pull work from a work queue in the DB until all the work is done.
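As a sketch of that work-queue pull, assuming a PostgreSQL table named `work_areas` with the status/start-time columns described in the question (all names hypothetical), each invocation can claim work atomically; `FOR UPDATE SKIP LOCKED` keeps parallel invocations from grabbing the same row:

```python
import psycopg2  # assuming PostgreSQL; the same idea works in any transactional DB

def claim_next_area(conn):
    """Atomically claim the oldest unclaimed processing area; None if all done."""
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE work_areas
               SET status = 'processing', start_time = now()
             WHERE area_id = (
                   SELECT area_id
                     FROM work_areas
                    WHERE status = 'pending'
                    ORDER BY start_time
                      FOR UPDATE SKIP LOCKED
                    LIMIT 1)
         RETURNING area_id
        """)
        row = cur.fetchone()
        return row[0] if row else None

def run(conn):
    # Every invocation loops until the queue is drained; none of them is special.
    while (area := claim_next_area(conn)) is not None:
        process_area(area)  # hypothetical processing routine
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE work_areas SET status = 'done' WHERE area_id = %s",
                        (area,))
```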
BUT NOTE: usually the kinds of tasks you're talking about are not CPU-bound. If that is the case, then running multiple parallel tasks inside the same Lambda invocation will make better use of your resources. You can do both, of course.
At a high / theoretical level I know exactly the type of architecture I want to build and how it would work, but I'm attempting to construct this as cheaply as possible using AWS services and my lack of familiarity with the offerings of AWS has me running in circles.
The Data
We run a video streaming platform. On busy nights we have about 100 simultaneous live streams going with upwards of 30,000 viewers. We expect this number to rise to 100,000 in the next few years. A live stream lasts, on average, 2 hours.
We send a heartbeat from our player every 10 seconds with information about the viewer -- how much data they've viewed, how much data they've buffered, what quality they're streaming, etc.
These heartbeats are sent directly to an AWS Kinesis endpoint.
Finally, we want to retain all past messages for at least 5 years (hopefully longer) so that we can look at historic analytics.
Some back of the envelope calculations suggest we will have 0.1 * 60 * 60 * 2 * 100000 * 365 * 5 = 131 billion heartbeat messages five years from now.
Our Old Pipeline
Our old system had a single Kinesis consumer. Aggregate data was stored in DynamoDB. Whenever a message arrived we would read the record from DynamoDB, update the record, then write the new record back. This read-update-write loop limited the speed at which we could process messages and made it so that each message coming in was dependent on the messages before it, so they could not be processed in parallel.
Part of the reason for this setup is that our message schema was not well designed from the outset. We send the timestamp at which the message was sent, but we do not send "amount of video watched since last heartbeat". As a result in order to compute the total viewer time we need to look up the last heartbeat message sent by this player, subtract the timestamps, and add that value. Similar issues exist with many other metrics.
Our New Pipeline
We've begun to run into scaling issues. During our peak hours analytics can be delayed by as much as four hours while waiting for a backlog of messages to be processed. If this backlog reaches 24 hours Kinesis will start deleting data. So we need to fix our pipeline to remove this dependency on past messages so we can process them in parallel.
The first part of this was updating the messages sent by our players. Our new specification includes only metrics that can be trivially summed, with no subtraction. So we can just keep adding to the "time viewed" metric, for instance, without any regard to past messages.
The second part of this was ensuring that Kinesis never backs up. We dump the raw messages to S3 as quickly as they arrive, with no processing (Kinesis Data Firehose), so that we can crunch analytics on them at our leisure.
Finally, we now want to actually extract information from these analytics as quickly as possible. This is where I've hit a snag.
The Questions We Want to Answer
As this is an analytics pipeline, our questions mostly revolve around filtering these messages and then aggregating fields for the remaining messages (possibly, in fact likely, with grouping). For instance:
How many Android users watched last night's stream in HD? (FILTER by stream and OS)
What's the average bandwidth usage among all users? (SUM and COUNT, with later division of the final aggregates which could be done on the dashboard side)
What percent of users last year were on any Apple device (iOS, tvOS, etc)? (COUNT, grouped by OS)
What's the average time spent buffering among Android users for streams in the past year? (a mix of all of the above)
Options
AWS Athena would allow us to query the data in S3 directly as if it were an ANSI SQL table. However, reading up on Athena, unless the data is properly formatted it can be incredibly slow. Some benchmarks I've seen show that processing 1.1 billion rows of CSV data can take up to 2 minutes, and I'm looking at processing 100x that much data.
AWS EMR and AWS Redshift sound like they are built for this purpose, but are complicated to set up and have a high base cost to run (requiring an EC2 cluster to remain active at all times). AWS Redshift also requires data be loaded into it, which sounds like it might be a very slow process, delaying our access to analytics
AWS Glue sounds like it may be able to take the raw messages as they arrive in S3 and convert them to Parquet files for more rapid querying via Athena
We could run a job to regularly batch messages to reduce the total number that must be processed. While a stream is live we'll receive one message every 10 seconds, but we really only care about the totals for a given viewer. This means that when a 2-hour stream concludes we can combine the 720 messages we've received from that player into a single "summary" message about the viewer's experience during the whole stream. This would massively reduce the amount of data we need to process, but exactly how and when to trigger this process isn't clear to me
The Ideal Architecture
This is a Big Data problem. The generic solution to Big Data problems is "don't take your data to your query, take your query to your data". If these messages were spread across 100 small storage nodes then each node could filter, sum, and count the subset of data they hold and pass these aggregates back to a central node which sums the sums and sums the counts. If each node is only operating on 1/100th of the data set then this kind of processing could theoretically be incredibly fast.
My Confusion
While I have a theoretical understanding of the "ideal" architecture, it's not clear to me if AWS works this way or how to construct a system that will function well like this.
S3 is a black box. It's not clear if Athena queries are run on individual nodes and aggregates are further reduced elsewhere, or if there's a system reading all of the data and aggregating it in a central location
Redshift requires the data to be copied into a Redshift database. This doesn't sound fast, nor distributed.
It's unclear to me how EMR works or if it will suit my purpose. Still researching
AWS Glue seems like it may need to be triggered by some event?
Parquet files seem to be like CSVs, in that multiple records reside in a single file, whereas I'm dumping one record per file. But perhaps there's a way to fix that? e.g. batching files every minute or every 5 minutes?
RDS or a similar service might be really good for this (indexing and whatnot) but would require a guaranteed schema (or necessitate migrating if our message schema changed) which is a concern. Migrating terabytes of data if we change our message schema sounds out of the question
Finally, along with wanting to get analytics results in as "real time" as possible (ideally we want to know within 1 minute when someone joins or leaves a stream), we want the dashboards to load quickly. Waiting 30 seconds to see the count of live viewers is horrendous. Dashboards should load in 2 seconds or less (ideally)
The plan is to use QuickSight to create dashboards (our old system had a hack-y Django app that read from our DynamoDB aggregates table, but I'd like to avoid creating more code for people to maintain)
I expect you are going to get a lot of different answers and opinions from the broad set of experts you have pinged with this. There is likely no single best answer to this as there are a lot of variables. Let me give you my best advice based on my experience in the field.
Kinesis to S3 is a good start and not moving data more than needed is the right philosophy.
You didn't mention Kinesis Data Analytics, and this could be a solution for SOME of your needs. It is best for questions about what is happening in the data feed right now. The longer-timeframe questions are better suited to the tools you mention. If you aren't too interested in what is happening in the most recent 10 minutes (or so), you can omit it.
S3 organization will be key to performing any analytics directly on the data there. You mention Parquet formatting, which is good, but partitioning is far more powerful. Organizing the S3 data into "days" or "hours" and setting up the partitioning based on this can greatly speed up any query that is limited in the time range it needs (don't read what you don't need).
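To make that concrete, here is a sketch of a partitioned layout and a time-bounded query against it; the bucket, database, and column names are all hypothetical:

```python
# Hypothetical partitioned layout, one prefix per hour:
#   s3://heartbeats/year=2020/month=06/day=15/hour=23/part-00000.parquet
# With those partitions registered in the Glue catalog, a time-bounded
# Athena query only scans the matching prefixes.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="""
        SELECT os, COUNT(DISTINCT viewer_id) AS viewers  -- hypothetical columns
        FROM heartbeats
        WHERE year = '2020' AND month = '06' AND day = '15'
          AND quality = 'HD'
        GROUP BY os
    """,
    QueryExecutionContext={"Database": "analytics"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://heartbeats-athena-results/"},
)
```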
Important safety note on S3: S3 is an object store, and as such there is overhead for each object you reference. Having many small objects (10,000+) treated as a single set of data is going to be slow no matter what solution you go with. You need to fix this before you go forward with any solution. It takes upwards of 0.5 sec to look up an object in S3, but if the file is small the transfer time is next to nothing; now multiply 0.5 sec by all the objects you have and see how long it will take to read them. This is not a function of the downstream tool you choose but of the S3 organization you have. S3 objects that are part of a Big Data solution should be at least 100MB in size to avoid suffering greatly from the object lookup time. The choice of Parquet or CSV files is moot without addressing object size and partitioning first.
Athena is good for occasional queries especially if the date ranges are limited. Is this the query pattern you expect? As you say "move the compute to the data" but if you use Athena to do large cross-sectional analytics where a large percentage of the data needs to be used, you are just moving the data to Athena every time you execute this query. Don't stop thinking about data movement at the point it is stored - think about the data movements to do the analytics also.
So a big question is how much data is needed, and how often, to support your analytics workloads and BI functions. This is the end result you are looking for. If a high percentage of the data is needed frequently, then a warehouse solution like Redshift with the data loaded to disk is the right answer. The data load time to Redshift is quite fast, as it loads the data from S3 in parallel (S3 is a cluster and Redshift is a cluster, so parallel loads can be done). If loading all your data into Redshift is what you need, then the load time is not your main concern; the cost is. It's a big, powerful tool with a price tag to match. The new RA3 instance type bends this cost curve down significantly for clusters with large data sizes, so it could be a possibility.
Another tool you haven't mentioned is Redshift Spectrum. This brings several powerful technologies together that could be important to you. First is the power of Redshift, with the ability to choose a smaller cluster size than would normally be used for your data size. S3 filtering and aggregation technology allows Spectrum to perform actions on the data in S3 itself (yes, the initial compute steps of the query are performed inside S3, potentially greatly reducing the data moved to Redshift). If your query patterns support this data reduction in S3, then the data movement will be small and the Redshift cluster can be small (cheap) too. This can be a powerful compromise point for IoT-style solutions like yours, since complex data models and joins are not needed.
You bring up Glue and conversion to Parquet. These can be good to do, but as I mentioned before, partitioning of the data in S3 is usually far more powerful. The value of Parquet increases as the width of your data increases: Parquet is a columnar format, so it is advantaged when only a subset of "columns" is needed. The downside is the conversion time/cost and the loss of easy human readability (which can be huge during debugging).
EMR is another choice you mention, but I generally advise clients against EMR unless they need the flexibility it brings to the analytics and have the skills to use it well. Without these, EMR tends to be an unneeded cost sink.
If this is really going to be a Big Data solution, then RDS (and Aurora) are not good choices. They are designed for transactional workloads, not analytics. The data size and analytics will not fit well or be cost effective.
Another tool in the space is S3 Select. Not likely what you are looking for but something to remember exists and can be a tool in the toolbox.
Hybrid solutions are common in this space when needs vary based on some factor. A common one is time of day: no one is running extensive reports at 3am, so the needed performance is much less then. Another is user group: some groups need simple analytics while others need much more power. Another factor is timeliness of data: does everyone need "up to the second" information, or is daily information sufficient? Trying to have one tool that does everything for everybody, all the time, is often a path to an expensive, oversized solution.
Since Redshift Spectrum and Athena can point at the same S3 data (well organized, since both will benefit), both tools can coexist on the same data. Also, because Redshift is ideal for sifting through huge mounds of data, it is well suited to producing summary tables and then writing them (in partitioned Parquet) to S3 for tools like Athena to use. All these cloud services can be run on schedules, and this includes Redshift and EMR (Athena is query-on-demand), so they don't need to run all the time. Redshift with Spectrum could run a few hours a day to perform deep analytics and summarize data for writing to S3. Your data scientists can also use Redshift for their hardcore work while Athena supports dashboards using the daily summary data, with Kinesis Data Analytics as a source for the real-time piece.
Lastly, you bring up a 2-second requirement for dashboards. This is definitely possible with QuickSight backed by Redshift or Athena, but it won't be met for arbitrarily complex / data-intensive queries. To meet it, you will need the engine to have enough horsepower to produce the data in question. Redshift with local data storage is likely the fastest (Redshift Spectrum with some data pruning done in S3 wins in some cases) and Athena is the weakest / slowest. But the power doesn't matter if the work is small, so your query workload will be a huge deciding factor. The fastest option is to load the needed data into QuickSight's own storage (SPICE), but this is another localized / summarized version of the data, so timeliness is again a factor (how often is it updated?).
Based on designing similar systems and a bunch of guesses as to what you need I'd recommend that you:
Fix your object size (Kinesis Data Firehose can be configured to buffer records into larger objects)
Partition your data by day
Set up a small Redshift cluster (4 x dc2.large) and use Spectrum to address the data in S3
Connect Quicksight to Redshift
Measure the performance (and cost) and compare to requirements (there will likely be gaps)
Adjust the solution (summary tables to S3, Athena, SPICE, etc.) to meet your goals
The alternative is to hire someone who has set up such systems before and have them review the requirements in detail and make a less "guess-based" recommendation.
I would look into Druid. Not an AWS offering, but easily runs on AWS, with good integration with S3 and Kinesis.
- Capable of reading from Kinesis at high speed and making the data available for querying right away. It can also flatten and transform the data as it reads it.
- Capable of doing rollups/aggregation/compaction during ingestion (and further reducing data asynchronously). From what you wrote, it seems to me that it could easily reduce the number of rows in the DB by a very large factor.
- Capable of fast queries, using standard SQL.
- Smart partitioning of the data, so that only the relevant dates are scanned.
The downside is that you will need to keep a cluster up and running for ingestion and for querying. It is pretty scalable, so you can start small.
On the upside, you're not juggling 10 different technologies (Athena/Glue/EMR/etc.).
You might want to consider contacting Imply, which can ease the deployment.
A usual approach a lot of companies take is to do the heavy lifting in Athena or BigQuery (or some other distributed SQL environment), aggregate intermediate results into multiple indexed and partitioned Postgres/MySQL/Redshift/ClickHouse tables, and then connect their APIs to read from those tables. Of course, this works fine, except that as the amount of intermediate aggregated data grows, the table indices grow with it, and problems like cumulative sums or sorting become less and less efficient.
With your problem in hand, I think you can get a lot of help from AWS Lambda. Lambda provides a very feasible serverless approach to large granular data problems (if used correctly). For instance, assume your pipeline partitions the incoming stream by YYYYMMDDHHMM and stores it under an S3 path that has a Lambda listening to it (as a trigger function); then your data ingest and aggregation become pretty much simultaneous processes. As soon as a minute is up, a new instance of the same Lambda function takes care of the data landing in partition YYYYMMDDHHMM+1. This way, you can run thousands of simultaneous processes, with a good bunch of Lambda functions doing the same thing in parallel. Of course, this is a rough picture, but I think it can help greatly; see the sketch below.
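A rough sketch of such a trigger function, assuming newline-delimited JSON heartbeats and hypothetical field names (`viewer_id`, `seconds_viewed`):

```python
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered as objects land under a YYYYMMDDHHMM prefix; folds that
    minute's heartbeats into per-viewer totals."""
    totals = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.splitlines():
            heartbeat = json.loads(line)
            viewer = heartbeat["viewer_id"]  # hypothetical field names
            totals[viewer] = totals.get(viewer, 0) + heartbeat["seconds_viewed"]
    # ... persist the per-minute aggregate somewhere queryable (S3, DynamoDB, ...)
```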
Most NoSQL solutions offer only eventual consistency, so given that DynamoDB replicates the data across three data centers, how is read-after-write consistency maintained?
What would be a generic approach to this kind of problem? I think it is interesting, since even in MySQL replication, data is replicated asynchronously.
I'll use MySQL to illustrate the answer, since you mentioned it, though, obviously, neither of us is implying that DynamoDB runs on MySQL.
In a single network with one MySQL master and any number of slaves, the answer seems extremely straightforward -- for eventual consistency, fetch the answer from a randomly-selected slave; for read-after-write consistency, always fetch the answer from the master.
even in MySQL replication data is replicated asynchronously
There's an important exception to that statement, and I suspect there's a good chance that it's closer to the reality of DynamoDB than any other alternative here: In a MySQL-compatible Galera cluster, replication among the masters is synchronous, because the masters collaborate on each transaction at commit-time and a transaction that can't be committed to all of the masters will also throw an error on the master where it originated. A cluster like this technically can operate with only 2 nodes, but should not have less than three, because when there is a split in the cluster, any node that finds itself alone or in a group smaller than half of the original cluster size will roll itself up into a harmless little ball and refuse to service queries, because it knows it's in an isolated minority and its data can no longer be trusted. So three is something of a magic number in a distributed environment like this, to avoid a catastrophic split-brain condition.
If we assume the "three geographically-distributed replicas" in DynamoDB are all "master" copies, they might operate with logic along the same lines as the synchronous masters you'd find with Galera, so the solution would be essentially the same, since that setup also allows any or all of the masters to have conventional subtended asynchronous slaves using MySQL native replication. The difference there is that you could fetch from any of the masters currently connected to the cluster if you wanted read-after-write consistency, since all of them are in sync; otherwise, fetch from a slave.
The third scenario I can think of would be analogous to three geographically-dispersed MySQL masters in a circular replication configuration, which, again, supports subtended slaves off of each master, but has the additional problems that the masters are not synchronous and there is no conflict resolution capability -- not at all viable for this application, but for purposes of discussion, the objective could still be achieved if each "object" had some kind of highly-precise timestamp. When read-after-write consistency is needed, the solution here might be for the system serving the response to poll all of the masters to find the newest version, not returning an answer until all masters had been polled, or to read from a slave for eventual consistency.
Essentially, if there's more than one "write master" then it would seem like the masters have no choice but to either collaborate at commit-time, or collaborate at consistent-read-time.
Interestingly, I think, in spite of some whining you can find in online opinion pieces about the disparity in pricing between the two read-consistency levels in DynamoDB, this analysis -- even as divorced from the reality of DynamoDB's internals as it is -- does seem to justify that discrepancy.
Eventually-consistent read replicas are essentially infinitely scalable (even with MySQL, where a master can easily serve several slaves, each of which can also easily serve several slaves of its own, each of which can serve several... ad infinitum) but read-after-write is not infinitely scalable, since by definition it would seem to require the involvement of a "more-authoritative" server, whatever that specifically means, thus justifying a higher price for reads where that level of consistency is required.
I'll tell you exactly how DynamoDB does this. No guessing.
In order for a write request to be acknowledged to the client, the write must be durable on two of the three storage nodes for that partition. One of the two storage nodes MUST be the leader node for that partition. The third storage node is probably updated as well, but on the off chance something happened, it may not be. DynamoDB will get that one updated as soon as it can.
When you request a strongly consistent read, that read comes from the leader storage node for the partition the item(s) are stored in.
I know I'm answering this question long after it was asked, but I thought I could contribute some helpful information...
In a distributed database the concept of a "master" is not particularly relevant anymore (at least for reads/writes). Each node should be able to perform reads and writes, so that read/write performance increases as the number of machines increases. If you want reads to be correct immediately after a write, the number of machines you write to plus the number of machines you read from must be greater than the total number of machines in the system.
Example: if you only write to 1 machine, then you must read from all 3 to ensure that your data is not stale. Or, if you write to 2 machines (in this case, a quorum), you can perform reads at quorum (2 machines) and guarantee that your data is recent.
NOTE: these assumptions change when a subset of nodes in the system crash.
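For reference, the healthy-cluster rule above is usually written as W + R > N: a read of R copies is guaranteed to overlap the latest write of W copies only when the two sets must intersect. A quick illustration:

```python
# W writes + R reads must overlap: read-after-write holds iff W + R > N.
N = 3  # total replicas (DynamoDB keeps three copies)
for W, R in [(1, 3), (2, 2), (3, 1), (1, 1)]:
    guaranteed = W + R > N
    print(f"W={W}, R={R}: read-after-write {'guaranteed' if guaranteed else 'NOT guaranteed'}")
```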