I have a scenario where, if part of a query matches an event, I want to fetch some other events from a datastore to test against the rest of the query,
e.g. "If JANE DOE buys from my store, did she buy anything else over the last 3 years?" sort of thing.
Does Flink, Storm or WSO2 provide support for such complex event processing?
Flink can do this, but it would require that you process all events starting from the earliest that you care about (e.g. 3 years ago), so that you can construct the state for each customer. Flink then lets you manage this state (typically with RocksDB) so that you wouldn't have to replay all the events in the face of system failures.
If you can't replay all of the history, then typically you'd put this into some other store (Cassandra/HBase, Elasticsearch, etc) with the scalability and performance characteristics you need, and then use Flink's async function support to query it when you receive a new event.
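Engine aside, the lookup-on-arrival pattern looks roughly like the sketch below (plain Python, not Flink API code); query_purchase_history is a placeholder for whatever query your datastore supports, and the field names and three-year window are assumptions.
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=3 * 365)  # "anything else over the last 3 years"

def query_purchase_history(customer_id, since):
    # Placeholder: run whatever query your datastore supports
    # (Cassandra, HBase, Elasticsearch, ...) and return that
    # customer's purchases with a timestamp >= since.
    raise NotImplementedError

def on_purchase_event(event):
    # event is assumed to carry a customer id; the stream engine would
    # call something like this for each incoming purchase event.
    since = datetime.utcnow() - LOOKBACK
    history = query_purchase_history(event["customer_id"], since)
    if history:
        # The rest of the rule matched: the customer bought something
        # else in the window, so emit/handle the complex event here.
        print(f'{event["customer_id"]} has {len(history)} earlier purchases')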
WSO2 Stream Processor lets you implement such functionality with its incremental analytics feature. To implement the scenario you've mentioned, you can feed the events that are triggered when a customer arrives into a construct called an 'aggregation'. As you keep feeding events into an aggregation, it summarizes the data over time and saves it in a configured persistence store, such as a database.
You can query this aggregation to get the state for a given period of time. For example, the following query fetches the name, total items bought and average transaction value for the year 2014-2015:
from CustomerSummaryRetrievalStream as b join CustomerAggregation as a
on a.name == b.name
within "2014-01-01 00:00:00 +05:30", "2015-01-01 00:00:00 +05:30"
per "years"
select a.name, a.total, a.avgTxValue
insert into CustomerSummaryStream;
I'm trying to establish project-level email alerts that fire when a certain level of query/job concurrency is reached, e.g. 5 concurrent queries. We have a flat-rate pricing model.
I also want a similar email notification when total slot usage exceeds a certain threshold, e.g. slot usage reaching 1000 slots.
As a next step I would like to throttle new incoming queries based on the above-mentioned thresholds. Meaning that if, for example, 5 queries are already actively running, the 6th one will be put on hold until one of the 5 started earlier has completed.
You may create an Alert Policy in which you can set your desired metric type (e.g. slots) and then configure your desired threshold.
When creating an Alert Policy you may also set the notification channel to email notifications, which is covered in the same documentation.
For the available metric types for SLOTS in BigQuery, you may refer to this Google Cloud Metrics for BigQuery documentation.
For your next step, you may write code (Python, Node.js, etc.) using the BigQuery API to count the queries that are actively running (through the job ID), and when the count hits 5, print "query queue is full" and wait for the number of running jobs to drop below 5 before running the next query. You may refer to the BigQuery Managing Jobs API documentation.
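As a rough Python sketch of that counting step (not a complete throttling solution), assuming the google-cloud-bigquery client library, a placeholder project id, and the threshold of 5 from the question:
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

def running_query_count():
    # Count jobs that are currently in the RUNNING state.
    return sum(1 for _ in client.list_jobs(state_filter="running"))

if running_query_count() >= 5:       # assumed concurrency threshold
    print("query queue is full")     # hold the next query until the count drops
else:
    client.query("SELECT 1")         # submit the next query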
I have a single-table design for my app. However, I have certain rows in the table that have important information that I plan to use to query different kinds of data. Let me explain. My app handles alarms triggered by users. When an alarm gets triggered I record a lot of info about that alert. My goal is to create GSIs so I can retrieve and sort all the info about the alarm that was triggered. Let me give you an example of a row in my table.
PK: ShipmentReceived
SK: AL#TR#2020-08-19T23:37:41.513Z
GSI1PK: AL#TR
GSI1SK: 2020-08-19T23:37:41.513Z
GSI2PK: AL#TR#LO
GSI2SK: Building1#WingA#Floor1#OfficeB#2020-08-19T23:37:41.513Z
GSI3PK: user#example.com
GSI3SK: 2020-08-19T23:37:41.513Z
GSI4PK: 1234567
GSI4SK: 2020-08-19T23:37:41.513Z
GSI5PK: AL#TR#HOW
GSI5SK: PC#KS
OtherProperties: Other values go in other columns
NOTE: AL#TR means: "Alarm Triggered" and AL#TR#LO means "Alarm triggered from location". AL#TR#HOW indicates how the alarm was triggered. 1234567 is a "device ID" used to trigger the alarm.
This kind of structure allows me to query for all sorts of interesting data. For example:
All of the ShipmentReceived alarms sorted by date
GSI1: I can get all of the alarms triggered at the company and sort them by date (That includes ShipmentReceived, PackageSent, etc)
GSI2: I can get all of the alarms triggered at a certain location and I can sort them by date.
GSI3: I can get all of the alarms triggered by a specific user and I can sort by date.
GSI4: I can get all of the alarms triggered by a specific device and I can sort them by date.
GSI5: Allows me to sort the alarms by method used to trigger them.
I am reading the DynamoDB documentation and I see that it says that it is not recommended to use indexes on items that are not queried often. A lot of these GSIs will not be queried often at all. Just very sporadically.
My question is: am I doing this wrong by creating 5 different GSIs in this case? Is there a better way to model this data? I thought about inserting multiple rows with related information instead of having everything in one row, but I do not know if that is a better approach. Any other ideas?
I'm on the DynamoDB team in Seattle, and this response is from one of my colleagues:
"Anytime you need to group or sort the same entities differently, you need to make a new GSI for that access pattern. When you have multiple entity types stored in the same table you can reuse the GSI (aka GSI overloading) for those access patterns on different entities. But in your case, all of the access patterns are about grouping and sorting alarm entities so each would need a different GSI.
"However, GSIs exist to speed up or make cheaper read requests with the trade-off being a higher write expense (to keep the GSIs updated). This makes sense in access patterns that have a high read:write ratio and where the response must come back quickly. But for read access patterns that are done infrequently and for which there isn't a low-latency requirement, it might be cheaper to simply do a Scan operation compared to the cost of having a GSI. For example, for a batch job that runs once a day or once a week it might be cheaper to scan the table once a day or once a week."
This is my use case:
I have a JSON API with 200k objects. The dataset looks a little something like this: date, bike model, production time in minutes. I use Lambda to read from the JSON API and write to DynamoDB via HTTP request. The Lambda function runs every day and updates DynamoDB with the most recent data.
I then retrieve the data by date since I want to calculate the average production time for each day and put it in a second table. An Alexa skill is connected to the second table and reads out the average value for each day.
First question: Since the same bike model is produced multiple times per day, using a composite primary key with date and bike model won't give me a unique key. Shall I create a UUID for the entries instead? Or is there a better solution?
Second question: For the calculation I would need to do a full table scan each time, which is very costly and advised against by many. How can I solve this problem without doing a full table scan?
Third question: Is it better to avoid DynamoDB altogether for my use case? Which AWS database is more suitable for my use case then?
Yes, a UUID or any other unique identifier (e.g. date + bike model + creation time) as the PK is fine.
It seems your daily job for the average value is some sort of data analytics job, not really a transactional job. I would suggest going with a service that supports data analytics, such as Amazon Redshift. You should be able to add data to such a database service using DynamoDB Streams. Alternatively, you can stream data into S3 and use a service like Athena to get the daily average.
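As an illustration of the S3 + Athena route, the daily average could come from a query like the sketch below; the database, table and column names and the results bucket are assumptions.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT production_date, AVG(production_time_min) AS avg_production_time
FROM bike_production          -- assumed external table over the S3 data
GROUP BY production_date
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "analytics"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://your-results-bucket/"},  # placeholder
)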
There is a simple database model that you could use for this task:
PartitionKey: a UUID or use any combination of fields that provide uniqueness.
SortKey: Production date, as a string, e.g. 2020-07-28
If you then create a secondary index that uses the production date as its PK and includes the production time, you can query (not scan) the secondary index for a specific date and perform any calculations you need on the production time. You can then provision the required read/write capacity on the secondary index and the table independently.
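A minimal boto3 sketch of that query-and-average step, assuming an index named ProductionDate-index whose partition key is the production date and which projects a ProductionTimeMin attribute (all names are assumptions):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("BikeProduction")  # assumed table name

def average_production_time(production_date):
    # Query (not Scan) the secondary index for a single production date,
    # following LastEvaluatedKey so all pages are read.
    items, start_key = [], None
    while True:
        kwargs = {
            "IndexName": "ProductionDate-index",  # assumed index name
            "KeyConditionExpression": Key("ProductionDate").eq(production_date),
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = table.query(**kwargs)
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
    if not items:
        return None
    return sum(item["ProductionTimeMin"] for item in items) / len(items)

print(average_production_time("2020-07-28"))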
Regarding your third question, I don't see any real benefit of using DynamoDB for this task. Any RDS engine (e.g. MySQL), Redshift or even S3+Athena can easily handle such a use case. If you require real-time analytics, you could even consider AWS Kinesis.
Basically, I have a web service that receives a small json payload (an event) a few times per minute, say 60. This event must be sent to an SQS queue only after 1 year has elapsed (it's ok to have it happen a few hours sooner or later, but the day of month should be exactly the same).
This means I'll have to store more than 31 million events somewhere before the first one should be sent to the SQS queue.
I thought about using SQS message timers, but they have a limit of only 15 minutes, and as pointed out by @Charlie Fish, it's weird to have an element lurking around on a queue for such a long time.
A better possibility could be to schedule a Lambda function using a cron expression for each event (I could end up with millions or billions of scheduled Lambda functions in a year, if I don't hit an AWS limit well before that).
Or I could store these events on DynamoDB or RDS.
What would be the recommended / most cost-effective way to handle this using AWS services? Scheduled lambda functions? DynamoDB? PostgreSQL on RDS? Or something entirely different?
And what if I have 31 billion events per year instead of 31 million?
I cannot afford to lose ANY of those events.
DynamoDB is a reasonable option, as is RDS; SQS for long-term storage is not a good choice. However, if you want to keep your costs down, I might suggest another approach: accumulate the events for a single 24-hour period (or a smaller interval if that is desirable), and write that set of data out as an S3 object instead of keeping it in DynamoDB. You could employ DynamoDB or RDS (or just about anything else) as a place to accumulate events for the day (or hour) before writing that data out to S3 as a single set of data for the interval.
Each S3 object could be named appropriately, either indicating the date/time it was created or the date/time it needs to be used, i.e. 20190317-1400 to indicate that this file needs to be used on March 17th, 2019 at 2 PM.
I would imagine a Lambda function, called by a CloudWatch event that is triggered every 60 minutes, that scans your S3 bucket looking for files that are due to be used, reads in the JSON data, puts the events into an SQS queue for further processing, and moves the processed S3 object to another 'already processed' bucket.
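A sketch of such a handler, assuming each object's key starts with the hour it is due (e.g. 20190317-1400/...), that each object holds a JSON array of events, and that the bucket and queue names are placeholders:
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

PENDING_BUCKET = "events-pending"      # placeholder bucket names
PROCESSED_BUCKET = "events-processed"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # placeholder

def handler(event, context):
    # Triggered hourly by a CloudWatch Events / EventBridge schedule.
    prefix = datetime.now(timezone.utc).strftime("%Y%m%d-%H00")
    listing = s3.list_objects_v2(Bucket=PENDING_BUCKET, Prefix=prefix)

    for obj in listing.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=PENDING_BUCKET, Key=key)["Body"].read()

        # One JSON array of accumulated events per object; forward each to SQS.
        for item in json.loads(body):
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))

        # Move the object to the 'already processed' bucket.
        s3.copy_object(Bucket=PROCESSED_BUCKET, Key=key,
                       CopySource={"Bucket": PENDING_BUCKET, "Key": key})
        s3.delete_object(Bucket=PENDING_BUCKET, Key=key)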
Your storage costs would be minimal (especially if you batch them up by day or hour), S3 has 11 9's of durability, and you can archive older events off to Glacier if you want to keep them around even after they are processed.
DynamoDB is a great product; it provides redundant storage and super-high performance, but I see nothing in your requirements that would warrant incurring that cost or requiring the performance of DynamoDB. And why keep millions of records in an 'always on' database when you know in advance that you don't need to use or see the records until a year from now?
I mean you could store some form of data in DynamoDB, and run a daily Lambda task to query for all the items that are more than a year old, remove those from DynamoDB and import them into SQS.
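A rough sketch of that daily task, assuming the items were written with their due date (arrival date plus one year) as the partition key so the job is a Query rather than a full Scan; the table, key and payload attribute names are assumptions, and the payload is assumed to be stored as a JSON string:
from datetime import date

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DelayedEvents")  # assumed table name
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # placeholder

def daily_job(event, context):
    due = date.today().isoformat()  # DueDate = arrival date + 1 year, set at write time
    page = table.query(KeyConditionExpression=Key("DueDate").eq(due))
    for item in page["Items"]:  # paginate with LastEvaluatedKey for large days
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=item["Payload"])
        table.delete_item(Key={"DueDate": due, "EventId": item["EventId"]})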
As you mentioned, SQS doesn't have this functionality built in. So you need to store the data using some other technology. DynamoDB seems like a reasonable choice based on what you have mentioned above.
Of course you also have to think about if doing a cron task once per day is sufficient for your task. Do you need it to be exactly after 1 year? Is it acceptable to have it be one year and a few days? Or one year and a few weeks? What is the window that is acceptable for importing into SQS?
Finally, the other question you have to think about is if SQS is even reasonable for your application. Having a queue that has a 1 year delay seems kinda strange. I could be wrong, but you might want to consider something besides SQS because SQS is meant for much more instantaneous tasks. See the examples on this page (Decouple live user requests from intensive background work: let users upload media while resizing or encoding it, Allocate tasks to multiple worker nodes: process a high number of credit card validation requests, etc.). None of those examples are really meant for a year of wait time before executing. At the end of the day it depends on your use case, but off the top of my head I can't think of a situation that makes sense for delaying entry into an SQS queue for a year. There seem to be much better ways to handle this, but again I don't know your specific use case.
EDIT: Another question is whether your data is consistent. Is the amount of data you need to store consistent? How about the format? What about the number of events per second? You mention that you don't want to lose any data, so definitely build in error handling and backup systems. But DynamoDB doesn't scale the best if one moment you store 5 items and the next moment you want to store 5 million items. If you set your capacity to account for 5 million, then it is fine. But the question is whether the amount of data and the frequency will be consistent or not.
We have a DynamoDB database that is storing machine sensor information with the following structure:
HashKey: MachineNumber (Number)
SortKey: EntryDate (String)
Columns: SensorType (String), SensorValue (Number)
The sensors generate information almost every 3 seconds, and we're looking to measure a (near) real-time KPI that counts how many machines in a region were down for more than 10 minutes in the past hour. A region can have close to 10,000 machines, so iterating through DynamoDB takes 10+ minutes to return a response. What is the best way to do this?
Describing the answer as discussed in comments on the question.
Performing a table scan on a very large table is expensive and should be avoided. DynamoDB Streams provides the ability to process records using your own custom code after they are inserted. This allows for aggregations or other computations to be performed asynchronously in near real time. The result can then be written or updated in a separate DynamoDB table.
You can run the code that processes the DynamoDB Stream messages on your own server (for example EC2), but it is likely easier to just use Lambda. Lambda lets you write Java or Node.js code that will run on fully managed AWS infrastructure, so all you need to worry about is the code.
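A minimal Python sketch of such a stream handler, which simply keeps the latest reading per machine in a separate summary table; the table and attribute names are assumptions, and the real downtime computation (down for more than 10 minutes in the past hour) would replace the simple update shown here.
from decimal import Decimal

import boto3

summary = boto3.resource("dynamodb").Table("MachineStatusSummary")  # assumed table

def handler(event, context):
    # Invoked by the DynamoDB Stream of the sensor table (Lambda trigger).
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        image = record["dynamodb"]["NewImage"]  # attribute-value map, e.g. {"N": "42"}

        # Keep a running per-machine aggregate instead of scanning later.
        summary.update_item(
            Key={"MachineNumber": int(image["MachineNumber"]["N"])},
            UpdateExpression="SET LastValue = :v, LastEntryDate = :d, SensorType = :t",
            ExpressionAttributeValues={
                ":v": Decimal(image["SensorValue"]["N"]),
                ":d": image["EntryDate"]["S"],
                ":t": image["SensorType"]["S"],
            },
        )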