We have a DynamoDB Database that is storing machine sensor information in the "structure" of :
HashKey: MachineNumber (Number)
SortKey: EntryDate (String)
Columns: SensorType (String), SensorValue (Number)
The sensors generate information almost every 3 seconds and we're looking to measure a (near) real-time KPI to count how many machines in a region were down in the past hour for more than 10 minutes. A region can have close to 10000 machines so iterating through DynamoDB is taking almost 10+ minutes for a response. What is the best way to do this?
Describing the answer as discussed in comments on the question.
Performing a table scan on a very large table is expensive and should be avoided. DynamoDB Streams provides the ability to process records using your own custom code after they are inserted. This allows for aggregations or other computations to be performed asynchronously in near real time. The result can then be written or updated in a separate DynamoDB table.
You can run the code that processes the DynamoDB Stream messages on your own server (example: EC2), but it is likely easier to just utilize Lambda. Lambda lets you write Java or NodeJS code that will be run on AWS infrastructure that is fully managed so all you need to worry about is the code.
Related
I store sensor data (power and voltage measurements) coming from different devices in DynamoDB (partition key: deviceid and sort key: timestamp).
Every hour, I would like to access the data that has been stored in that time frame, do some calculations with it and store the results elsewhere.
My initial idea was to run a Lambda that would be triggered by a CloudWatch Rule, do the calculations and store the results in another DynamoDB table.
I saw a similar question and the answer suggested DynamoDB Streams instead. But aren´t Streams supposed to be triggered every time an item is updated/deleted/inserted?. I understand there are conditions to invoke the Lambda with the Streams service that could allow me to do so every hour, but I don´t think this is the most efficient way to get around the problem.
So my question is if there is an efficient way/service to accomplish this?
DynamoDB Streams can efficiently process this data using batch windows and tumbling windows. These are built into DynamoDB stream. Batch windows can ensure the lambda function is only invoked once every 5 minutes or after 6MB of data, and tumbling windows allow you to perform running calculations for up to 15 minutes.
We are building an web application to allow customers insight into their activity based on events currently streaming into ElasticSearch. A customer is an organisation sending messages to people.
A concern has been raised that a requirement to host this data for three years infers a very large amount of storage and high cost of implementation given Elasticsearch.
An alternative is to process each day's data into a report CSV stored in S3 and use something like Amazon Athena to perform the queries. Is Athena something that our application can send ad-hoc queries to in response to a web browser request? It is unlikely to generate a large volume of requests all the time, but I'm uncertain what the latency could be like.
Yes, Athena would be a possible solution to this use case – and done right it could also be fairly cheap.
Athena is not a low latency query engine, but for reporting purposes it's usually good enough. There's no way to say for sure without knowing more, but done right we're talking low single digit seconds.
You can approach this in different ways, either you do as you say and generate a CSV every day, store these for as long as you need, and run queries against them as needed. From your description it sounds like these CSVs would already be aggregates, and I assume they would be significantly less than a megabyte per customer per day. If you partition by customer and month you should be able to run queries for arbitrary time periods in seconds.
Another approach would be to store all your data on S3 and run queries on the full data set. As you stream data into ElasticSearch, stream it to S3 too. Depending on how you do that you probably need some ETL in the form of Lambda functions that partitions the data per customer and time (day or month depending on the volume). You can then run Athena queries on the full historical data set. The downside would be slower queries (double digit seconds for most queries, but I don't know your data volumes), but the upside would be full flexibility on what you can query.
With more details about the particulars of the use case I could help you with the details.
Athena is serverless. You can quickly query your data without having to set up and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor.
Amazon Athena automatically executes queries in parallel, so most results come back within seconds/mins.
This is my use case:
I have a JSON Api with 200k objects. The dataset looks a little something like this: date, bike model, production time in min. I use Lambda to read from a JSON Api and write in DynamoDB via http request. The Lambda function runs everyday and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, uuid or any other unique identifier (ex: date+bike model+created time) as pk is fine.
It seems your daily job for average value is some sort of data analytics job not really a transaction job. I would suggest to go with a service support data analytics such as Amazon Redshift. You should be able to add data to such database service using Dynamodb streams. Alternatively, you can stream data into s3 and use a service like Athena to get the daily average.
There is a simple database model that you could use for this task:
PartitionKey: a UUID or use any combination of fields that provide uniqueness.
SortKey: Production date, as a string, i.e. 2020-07-28
If you then create a secondary index which uses as PK the Production date and includes the production time, you can then query (not scan) the secondary index for a specific date and perform any calculations you need on production time. You can then provision the required read/write capacity on the secondary index and the table independently.
Regarding your third question, I don't see any real benefit of using DynamoDB for this task. Any RDS (i.e. MySQL), Redshift or even S3+Athena can easily handle such use case. If you require real time analytics, you could even consider AWS Kinesis.
I have data coming from multiple machines, I would like to aggregate it by user. I'm thinking of producing batches of 1000 "rows", or 10 seconds of data (whichever comes first), by user.
I do have some experience with AWS kinesis and lambdas, but in my experience we don't have so much control on how the aggregation is done. All machines would send the data by kinesis, with the user id as the partition key. Then AWS will call our lambda with small batches. This has been great for some other use cases but here if I receive 100 records I don't know what to do (I would like to "wait" to receive more or wait that 10 seconds elapse since the date of the first record).
Also I'm not sure how the aggregation "by user id" would work. So far, on a lambda, I would have split the records in the batch by user id, but then if I get called with a batch of 100 records, even though there is a partition key on the user id, there is no guarantee that those 100 records would be for 1 user. Maybe I will get 100 records from 100 different users, and there is no "aggregation" help at all.
Any idea if kinesis + lambda is suited for this? I did look at the documentation of AWS but I don't see my scenario. It looks like they also have a tool "Data Streams" but it's hard for me to understand if this would work for my case.
Thanks!
Your understanding is correct. AWS Lambda + Kinesis alone will not be sufficient alone for aggregation. AWS Lambda programming model is stateless, so you can only aggregate based on the batch of records received in that particular invocatio(GetRecords API) call. Furthermore, the batch size provided in the function does not gurantee that you will get that number of records. This is merely the maximum number of records which you can get(MaxRecords) per invocation.
What you need is some kind of windowing mechanism, either row-based or time-based. Kinesis Analytics would be the easiest and fastest to get on-boarded with this. You can either use SQL or Flink with Kinesis analytics. You can even have your output to AWS Lambda for post processing.
Other ways would be use a Spark streaming job (you can use AWS EMR) and use windowing in your application.
I have around 300 GBs of data on S3. Lets say the data look like:
## S3://Bucket/Country/Month/Day/1.csv
S3://Countries/Germany/06/01/1.csv
S3://Countries/Germany/06/01/2.csv
S3://Countries/Germany/06/01/3.csv
S3://Countries/Germany/06/02/1.csv
S3://Countries/Germany/06/02/2.csv
We are doing some complex aggregation on the data, and because some countries data is big and some countries data is small, the AWS EMR doesn't makes sense to use, as once the small countries are finished, the resources are being wasted, and the big countries keep running for long time. Therefore, we decided to use AWS Batch (Docker container) with Athena. One job works on one day of data per country.
Now there are roughly 1000 jobs which starts together and when they query Athena to read the data, containers failed because they reached Athena query limits.
Therefore, I would like to know what are the other possible ways to tackle this problem? Should I use Redshift cluster, load all the data there and all the containers query to Redshift cluster as they don't have query limitations. But it is expensive, and takes a lot of time to wramp up.
The other option would be to read data on EMR and use Hive or Presto on top of it to query the data, but again it will reach the query limitation.
It would be great if someone can give better options to tackle this problem.
As I understand, you simply send query to AWS Athena service and after all aggregation steps finish you simply retrieve resulting csv file from S3 bucket where Athena saves results, so you end up with 1000 files (one for each job). But the problem is number of concurrent Athena queries and not the total execution time.
Have you considered using Apache Airflow for orchestrating and scheduling your queries. I see airflow as an alternative to a combination of Lambda and Step Functions, but it is totally free. It is easy to setup on both local and remote machines, has reach CLI and GUI for task monitoring, abstracts away all scheduling and retrying logic. Airflow even has hooks to interact with AWS services. Hell, it even has a dedicated operator for sending queries to Athena, so sending queries is as easy as:
from airflow.models import DAG
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from datetime import datetime
with DAG(dag_id='simple_athena_query',
schedule_interval=None,
start_date=datetime(2019, 5, 21)) as dag:
run_query = AWSAthenaOperator(
task_id='run_query',
query='SELECT * FROM UNNEST(SEQUENCE(0, 100))',
output_location='s3://my-bucket/my-path/',
database='my_database'
)
I use it for similar type of daily/weekly tasks (processing data with CTAS statements) which exceed limitation on a number of concurrent queries.
There are plenty blog posts and documentation that can help you get started. For example:
Medium post: Automate executing AWS Athena queries and moving the results around S3 with Airflow.
Complete guide to installation of Airflow, link 1 and link 2
You can even setup integration with Slack for sending notification when you queries terminate either in success or fail state.
However, the main drawback I am facing is that only 4-5 queries are getting actually executed at the same time, whereas all others just idling.
One solution would be to not launch all jobs at the same time, but pace them to stay within the concurrency limits. I don't know if this is easy or hard with the tools you're using, but it's never going to work out well if you throw all the queries at Athena at the same time. Edit: it looks like you should be able to throttle jobs in Batch, see AWS batch - how to limit number of concurrent jobs (by default Athena allows 25 concurrent queries, so try 20 concurrent jobs to have a safety margin – but also add retry logic to the code that launches the job).
Another option would be to not do it as separate queries, but try to bake everything together into fewer, or even a single query – either by grouping on country and date, or by generating all queries and gluing them together with UNION ALL. If this is possible or not is hard to say without knowing more about the data and the query, though. You'll likely have to post-process the result anyway, and if you just sort by something meaningful it wouldn't be very hard to split the result into the necessary pieces after the query has run.
Using Redshift is probably not the solution, since it sounds like you're doing this only once per day, and you wouldn't use the cluster very much. It would Athena is a much better choice, you just have to handle the limits better.
With my limited understanding of your use case I think using Lambda and Step Functions would be a better way to go than Batch. With Step Functions you'd have one function that starts N number of queries (where N is equal to your concurrency limit, 25 if you haven't asked for it to be raised), and then a poll loop (check the examples for how to do this) that checks queries that have completed, and starts new queries to keep the number of running queries at the max. When all queries are run a final function can trigger whatever workflow you need to run after everything is done (or you can run that after each query).
The benefit of Lambda and Step Functions is that you don't pay for idle resources. With Batch, you will pay for resources that do nothing but wait for Athena to complete. Since Athena, in contrast to Redshift for example, has an asynchronous API you can run a Lambda function for 100ms to start queries, then 100ms every few seconds (or minutes) to check if any have completed, and then another 100ms or so to finish up. It's almost guaranteed to be less than the Lambda free tier.
As I know Redshift Spectrum and Athena cost same. You should not compare Redshift to Athena, they have different purpose. But first of all I would think about addressing you data skew issue. Since you mentioned AWS EMR I assume you use Spark. To deal with large and small partitions you need to repartition you dataset by months, or some other equally distributed value.Or you can use month and country for grouping. You got the idea.
You can use redshift spectrum for this purpose. Yes, it is a bit costly but it is scalable and very good for performing complex aggregations.