What is the correct way to organize multiple recordings from a single device: AWS Kinesis video streams - amazon-kinesis-video-streams

What is the correct way to create a searchable archive of videos from a single Raspberry Pi-type device?
Should I create a single stream per device, then whenever that device begins a broadcast it adds to that stream?
I would then create a client that lists timestamps of those separate recordings on the stream? I have been trying to do this, but I have only gotten as far as ListFragments and GetClip, neither of which seems to do the job. What is the use case for working with fragments? I'd like to get portions of the stream separated by distinct timestamps. As in, if I have a recording from 2pm to 2:10pm, that would be a separate list item from a recording taken between 3pm and 3:10pm.
Or should I do a single stream per broadcast?
I would create a client to list the streams, then allow users to select between streams to view each video. This seems like an inefficient use of the platform, where if I have five 10-second recordings made by the same device over a few days, it creates five separate archived streams.
I realize there are implications related to data retention in here, but am also not sure how that would act if part of a stream expires, but another part does not.
I've been digging through the documentation to try to infer what best practices are related to this but haven't found anything directly answering it.
Thanks!

Hard to tell what your scenario really is. Some applications use sparsely populated streams per device and use ListFragments API and other means to understand the sessions within the stream.
This doesn't work well if you have very sparse streams and a large number of devices. In that case, some customers implement a "stream leasing" mechanism by which their backend service or some centralized entity keeps track of a pool of streams and leases them to the requestor, potentially adding new streams to the pool. The stream lease times are then stored in a database somewhere so the consumer-side application can do its business logic. The producer application can also "embed" certain information within the stream using the FragmentMetadata concept, which ultimately outputs MKV tags into the stream.
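For the sparse-stream-per-device approach, a minimal sketch of discovering recordings in a time window with boto3 might look like the following; the stream name and time range are hypothetical, and gaps between fragment producer timestamps are what let you split one stream into distinct recording sessions, whose (start, end) pairs you can then feed to GetClip.
```python
# Hypothetical stream name and time range; a boto3 sketch, not production code.
import datetime
import boto3

STREAM_NAME = "my-pi-camera"

kvs = boto3.client("kinesisvideo")

# ListFragments is served from the stream's archived-media endpoint.
endpoint = kvs.get_data_endpoint(
    StreamName=STREAM_NAME, APIName="LIST_FRAGMENTS"
)["DataEndpoint"]
archive = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)

selector = {
    "FragmentSelectorType": "PRODUCER_TIMESTAMP",
    "TimestampRange": {
        "StartTimestamp": datetime.datetime(2023, 6, 1, 14, 0),
        "EndTimestamp": datetime.datetime(2023, 6, 1, 15, 30),
    },
}

fragments, kwargs = [], {"StreamName": STREAM_NAME, "FragmentSelector": selector}
while True:
    resp = archive.list_fragments(**kwargs)
    fragments.extend(resp.get("Fragments", []))
    if "NextToken" not in resp:
        break
    kwargs["NextToken"] = resp["NextToken"]

# A gap between consecutive fragments that is much larger than one fragment's
# duration marks the boundary between two recording sessions.
fragments.sort(key=lambda f: f["ProducerTimestamp"])
```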
If you have any further scoped-down questions regarding the implementation, etc., don't hesitate to open GitHub issues against the particular KVS assets in question; that is the fastest way to get answers.

How to efficiently aggregate data in billions of individual records in AWS?

At a high / theoretical level I know exactly the type of architecture I want to build and how it would work, but I'm attempting to construct this as cheaply as possible using AWS services and my lack of familiarity with the offerings of AWS has me running in circles.
The Data
We run a video streaming platform. On busy nights we have about 100 simultaneous live streams going with upwards of 30,000 viewers. We expect this number to rise to 100,000 in the next few years. A live stream lasts, on average, 2 hours.
We send a heartbeat from our player every 10 seconds with information about the viewer -- how much data they've viewed, how much data they've buffered, what quality they're streaming, etc.
These heartbeats are sent directly to an AWS Kinesis endpoint.
Finally, we want to retain all past messages for at least 5 years (hopefully longer) so that we can look at historic analytics.
Some back of the envelope calculations suggest we will have 0.1 * 60 * 60 * 2 * 100000 * 365 * 5 = 131 billion heartbeat messages five years from now.
Our Old Pipeline
Our old system had a single Kinesis consumer. Aggregate data was stored in DynamoDB. Whenever a message arrived we would read the record from DynamoDB, update the record, then write the new record back. This read-update-write loop limited the speed at which we could process messages and made it so that each message coming in was dependent on the messages before it, so they could not be processed in parallel.
Part of the reason for this setup is that our message schema was not well designed from the outset. We send the timestamp at which the message was sent, but we do not send "amount of video watched since last heartbeat". As a result in order to compute the total viewer time we need to look up the last heartbeat message sent by this player, subtract the timestamps, and add that value. Similar issues exist with many other metrics.
Our New Pipeline
We've begun to run into scaling issues. During our peak hours analytics can be delayed by as much as four hours while waiting for a backlog of messages to be processed. If this backlog reaches 24 hours Kinesis will start deleting data. So we need to fix our pipeline to remove this dependency on past messages so we can process them in parallel.
The first part of this was updating the messages sent by our players. Our new specification includes only metrics that can be trivially summed with no subtraction. So we can just keep adding to the "time viewed" metric, for instance, without any regard to past messages.
The second part of this was ensuring that Kinesis never backs up. We dump the raw messages to S3 as quickly as they arrive with no processing (Kinesis Data Firehose) so that we can crunch analytics on them at our leisure.
Finally, we now want to actually extract information from these analytics as quickly as possible. This is where I've hit a snag.
The Questions We Want to Answer
As this is an analytics pipeline, our questions mostly revolve around filtering these messages and then aggregating fields for the remaining messages (possibly, in fact likely, with grouping). For instance (a sketch of one such query appears after this list):
How many Android users watched last night's stream in HD? (FILTER by stream and OS)
What's the average bandwidth usage among all users? (SUM and COUNT, with later division of the final aggregates which could be done on the dashboard side)
What percent of users last year were on any Apple device (iOS, tvOS, etc)? (COUNT, grouped by OS)
What's the average time spent buffering among Android users for streams in the past year? (a mix of all of the above)
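To make the shape of these queries concrete, here is a sketch of the last question run through Athena with boto3. The database ("analytics"), table ("heartbeats"), column names, the dt partition key, and the output bucket are all assumptions for illustration, not part of the actual setup.
```python
# Hypothetical database, table, columns, partition key and output bucket.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT AVG(buffer_seconds) AS avg_buffering_seconds
FROM   heartbeats
WHERE  os = 'android'
  AND  dt >= '2022-07-01'   -- partition filter: "the past year"
"""

resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]
# Poll athena.get_query_execution(QueryExecutionId=query_id) until it finishes,
# then read rows with athena.get_query_results(QueryExecutionId=query_id).
```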
Options
AWS Athena would allow us to query the data in S3 directly as if it were an ANSI SQL table. However, from what I've read about Athena, unless the data is properly formatted it can be incredibly slow. Some benchmarks I've seen show that processing 1.1 billion rows of CSV data can take up to 2 minutes. I'm looking at processing 100x that much data
AWS EMR and AWS Redshift sound like they are built for this purpose, but are complicated to set up and have a high base cost to run (requiring an EC2 cluster to remain active at all times). AWS Redshift also requires data be loaded into it, which sounds like it might be a very slow process, delaying our access to analytics
AWS Glue sounds like it may be able to take the raw messages as they arrive in S3 and convert them to Parquet files for more rapid querying via Athena
We could run a job to regularly batch messages to reduce the total number that must be processed. While a stream is live we'll receive one message every 10 seconds, but we really only care about the totals for a given viewer. This means that when a 2-hour stream concludes we can combine the 720 messages we've received from that player into a single "summary" message about the viewer's experience during the whole stream. This would massively reduce the amount of data we need to process, but exactly how and when to trigger this process isn't clear to me
The Ideal Architecture
This is a Big Data problem. The generic solution to Big Data problems is "don't take your data to your query, take your query to your data". If these messages were spread across 100 small storage nodes then each node could filter, sum, and count the subset of data they hold and pass these aggregates back to a central node which sums the sums and sums the counts. If each node is only operating on 1/100th of the data set then this kind of processing could theoretically be incredibly fast.
My Confusion
While I have a theoretical understanding of the "ideal" architecture, it's not clear to me if AWS works this way or how to construct a system that will function well like this.
S3 is a black box. It's not clear if Athena queries are run on individual nodes and aggregates are further reduced elsewhere, or if there's a system reading all of the data and aggregating it in a central location
Redshift requires the data be copied into a Redshift database. This doesn't sound fast, nor distributed
It's unclear to me how EMR works or if it will suit my purpose. Still researching
AWS Glue seems like it may need to be triggered by some event?
Parquet files seem to be like CSVs, where multiple records reside in a single file. Meanwhile I'm dumping one record per file. But perhaps there's a way to fix that? e.g. batching files every minute or every 5 minutes?
RDS or a similar service might be really good for this (indexing and whatnot) but would require a guaranteed schema (or necessitate migrating if our message schema changed) which is a concern. Migrating terabytes of data if we change our message schema sounds out of the question
Finally, along with wanting to get analytics results in as "real time" as possible (ideally we want to know within 1 minute when someone joins or leaves a stream), we want the dashboards to load quickly. Waiting 30 seconds to see the count of live viewers is horrendous. Dashboards should load in 2 seconds or less (ideally)
The plan is to use QuickSight to create dashboards (our old system had a hack-y Django app that read from our DynamoDB aggregates table, but I'd like to avoid creating more code for people to maintain)
I expect you are going to get a lot of different answers and opinions from the broad set of experts you have pinged with this. There is likely no single best answer to this as there are a lot of variables. Let me give you my best advice based on my experience in the field.
Kinesis to S3 is a good start and not moving data more than needed is the right philosophy.
You didn't mention Kinesis Data Analytics and this could be a solution for SOME of your needs. It is best for questions about what is happening in the data feed right now. The longer-timeframe questions are better suited to the tools you mention. If you aren't too interested in what is happening in the past 10 minutes (or so), it may be fine to omit it.
S3 organization will be key to performing any analytic directly on the data there. You mention parquet formatting which is good but partitioning is far more powerful. Organizing the S3 data into "days" or "hours" of data and setting up the partitioning based on this can greatly speed up any query that is limited in the amount of time that is needed (don't read what you don't need).
Important safety note on S3 - S3 is an object store and as such there is overhead for each object you reference. Having many small objects (10,000+) treated as a single set of data is going to be slow no matter what solution you go with. You need to fix this before you go forward with any solution. You see, it takes upwards of 0.5 sec to look up an object in S3, but if the file is small the transfer time is next to nothing. Now multiply 0.5 sec by all the objects you have and see how long it will take to read them. This is not a function of the downstream tool you choose but of the S3 organization you have. S3 objects as part of a Big Data solution should be at least 100MB in size to not suffer greatly from the object lookup time. The choice of Parquet or CSV files is moot without addressing object size and partitioning first.
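One place both the object-size and the partitioning advice can be applied is in the Firehose delivery stream itself: buffer aggressively and write under date-based prefixes. A minimal sketch follows; the stream name, role ARN and bucket ARN are hypothetical.
```python
# A boto3 sketch of a Firehose delivery stream that buffers up to 128 MB / 15
# minutes per object and writes under date-based prefixes, addressing both the
# small-object problem and day-level partitioning. All names/ARNs are hypothetical.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="heartbeats-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-heartbeats-bucket",
        # Buffer until either 128 MB or 900 seconds, whichever comes first.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        "CompressionFormat": "GZIP",
        # Date-based prefixes give you dt=YYYY-MM-DD "partitions" in S3.
        "Prefix": "heartbeats/dt=!{timestamp:yyyy-MM-dd}/",
        "ErrorOutputPrefix": "heartbeats-errors/!{firehose:error-output-type}/",
    },
)
```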
Athena is good for occasional queries especially if the date ranges are limited. Is this the query pattern you expect? As you say "move the compute to the data" but if you use Athena to do large cross-sectional analytics where a large percentage of the data needs to be used, you are just moving the data to Athena every time you execute this query. Don't stop thinking about data movement at the point it is stored - think about the data movements to do the analytics also.
So a big question is how much data is needed and how often to support your analytics workloads and BI functions? This is the end result you are looking for. If a high percentage of the data is needed frequently then a warehouse solution like Redshift with the data loaded to disk is the right answer. The data load time to Redshift is quite fast as it parallel loads the data from S3 (you see S3 is a cluster and Redshift is a cluster and parallel loads can be done). If loading all your data into Redshift is what you need then the load time is not your main concern - the cost is. Big powerful tool with a price tag to match. The new RA3 instance type bends this curve down significantly for large data size clusters so could be a possibility.
Another tool you haven't mentioned is Redshift Spectrum. This brings several powerful technologies together that could be important to you. First is the power of Redshift with the ability to choose smaller cluster sizes than would normally be used for your data size. S3 filtering and aggregation technology allows Spectrum to perform actions on the data in S3 (yes, initial compute actions of the query are performed inside of S3, potentially greatly reducing the data moved to Redshift). If your query patterns support this data reduction in S3 then the data movement will be small and the Redshift cluster can be small (cheap) too. This can be a powerful compromise point for IoT solutions like yours since complex data models and joining are not needed.
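A sketch of what the Spectrum setup looks like in practice, assuming the S3 data is already registered in a Glue catalog database; the cluster endpoint, credentials, catalog database name and IAM role are hypothetical (Redshift speaks the Postgres wire protocol, so psycopg2 works).
```python
# Hypothetical cluster endpoint, credentials, Glue catalog database and IAM role.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Expose the Glue catalog (and hence the S3 data) as an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'heartbeats_catalog'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# The partition filter (dt) and the aggregation are pushed down to the Spectrum
# layer, so only small aggregates move into the cluster.
cur.execute("""
    SELECT os, AVG(buffer_seconds) AS avg_buffering
    FROM   spectrum.heartbeats
    WHERE  dt BETWEEN '2023-01-01' AND '2023-01-07'
    GROUP  BY os
""")
print(cur.fetchall())
```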
You bring up Glue and conversion to parquet. These can be good to do but as I mentioned before partitioning of the data in S3 is usually far more powerful. The value of parquet will increase as the width of your data increases. Parquet is a columnar format so it is advantaged if only a subset of "columns" are needed. The downside is the conversion time/cost and the loss of easy human readability (which can be huge during debug).
EMR is another choice you mention, but I generally advise clients against going with EMR unless they need the flexibility it brings to the analytics and they have the skills to use it well. Without these, EMR tends to be an unneeded cost sink.
If this is really going to be a Big Data solution then RDS (and Aurora) are not good choices. They are designed for transactional workloads, not analytics. The data size and analytics will not fit well or be cost effective.
Another tool in the space is S3 Select. Not likely what you are looking for but something to remember exists and can be a tool in the toolbox.
Hybrid solutions are common in this space if there are variable needs based on some factor. A common one is "time of day" - no one is running extensive reports at 3am so the needed performance is much less. Another is user group - some groups need simple analytics while others need much more power. Another factor is timeliness of data - does everyone need "up to the second" information or is daily information sufficient? Trying to have one tool that does everything for everybody, all the time, is often a path to an expensive, oversized solution.
Since Redshift Spectrum and Athena can point at the same S3 data (well organized, since both will benefit), both tools can coexist on the same data. Also, because Redshift is ideal for sifting through huge mounds of data, it is ideal for producing summary tables and then writing them (in partitioned Parquet) to S3 for tools like Athena to use. All these cloud services can be run on schedules, and this includes Redshift and EMR (Athena is query on demand), so they don't need to run all the time. Redshift with Spectrum can run a few hours a day to perform deep analytics and summarize data for writing to S3. Your data scientists can also use Redshift for their hardcore work while Athena supports dashboards using the daily summary data and Kinesis Data Analytics as a source.
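A sketch of that summarize-and-write-back step: a scheduled UNLOAD that produces per-viewer daily summaries as partitioned Parquet in S3 for Athena or QuickSight to read. The table, columns, bucket and IAM role are hypothetical; the connection is the same style as the earlier Spectrum sketch.
```python
# Hypothetical tables, columns, bucket and IAM role.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Summarize raw heartbeats into one row per viewer per day and write the result
# back to S3 as Parquet, partitioned by dt, where Athena can query it cheaply.
cur.execute("""
    UNLOAD ('SELECT viewer_id, dt,
                    SUM(seconds_viewed) AS seconds_viewed,
                    SUM(buffer_seconds) AS buffer_seconds
             FROM   spectrum.heartbeats
             GROUP  BY viewer_id, dt')
    TO 's3://my-summary-bucket/viewer_daily/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum'
    FORMAT AS PARQUET
    PARTITION BY (dt)
""")
```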
Lastly you bring up a 2 sec requirement for dashboards. This is definitely possible with Quicksight backed by Redshift or Athena but won't be met for arbitrarily complex / data intensive queries. To meet this you will need the engine to have enough horsepower to produce the data in question. Redshift with local data storage is likely the fastest (Redshift Spectrum with some data pruning done in S3 wins in some cases) and Athena is the weakest / slowest. But the power doesn't matter if the work is small - your query workload will be a huge deciding factor. The fastest will be to load the needed data into Quicksight storage (SPICE), but this is another localized / summarized version of the data, so timeliness is again a factor (how often is it updated).
Based on designing similar systems and a bunch of guesses as to what you need I'd recommend that you:
Fix your object size (Kinesis Data Firehose can be configured to do this)
Partition your data by day
Set up a small Redshift cluster (4 x dc2.large) and use Spectrum to access the data in S3
Connect Quicksight to Redshift
Measure the performance (and cost) and compare to requirements (there will likely be gaps)
Adjust the solution (summary tables to S3, Athena, SPICE, etc.) to meet goals
The alternative is to hire someone who has set up such systems before and have them review the requirements in detail and make a less "guess-based" recommendation.
I would look into Druid. Not an AWS offering, but easily runs on AWS, with good integration with S3 and Kinesis.
Capable of reading from Kinesis at high speed and making the data available for querying right away. It can also flatten and transform the data as it reads it.
Capable of doing rollups/aggregation/compaction during ingestion (and further reduce data in an async manner). From what you wrote, it seems to me that it could easily reduce the number of rows in the DB by a very large factor.
Capable of fast queries, using standard SQL.
Smart partitioning of the data to scan only the relevant dates.
The down-side is that you will need to keep a cluster up and running for ingestion and for querying. It is pretty scalable, so you can start small.
On the up-side - you're not using 10 different technologies (Athena/Glue/EMR/etc.)
You might want to consider contacting Imply, which can ease the deployment.
A usual approach a lot of companies take is to do the heavy lifting in Athena or BigQuery (or some other distributed SQL environment) -> aggregate intermediate results into multiple indexed+partitioned Postgres/MySQL/Redshift/ClickHouse tables, and then connect their APIs to read from those tables. Of course, this works fine except for the fact that as the amount of intermediate aggregated data grows, table indices grow and problems like cumulative sums or sorting become less and less efficient.
With your problem in hand, I think you can get a lot of help from AWS Lambda. AWS Lambda provides a very feasible serverless approach towards solving large granular data problems (if used correctly). For instance, assume that your pipeline partitions the incoming stream by YYYYMMMDDHHMM and stores it into some S3 path which has a Lambda listening to it (as a trigger function); then your data ingest and aggregation become pretty much simultaneous processes. As soon as a minute is up, a new instance of the same Lambda function will be taking care of data landing in partition YYYYMMMDDHHMM+1. So, this way, you can run thousands of simultaneous processes with a good bunch of Lambda functions doing the same thing in parallel. Of course, this is a rough picture, but I think it can greatly help.
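A rough sketch of such a Lambda, wired as an S3 trigger on the raw prefix; the bucket layout, the one-JSON-record-per-line format and the field names are assumptions for illustration only.
```python
# Hypothetical bucket layout and message schema; a sketch of the per-partition
# aggregation Lambda, triggered by S3 object-created events.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]        # e.g. raw/dt=.../part-000.json

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        totals = {}
        for line in body.splitlines():
            msg = json.loads(line)                  # one heartbeat per line (assumed)
            viewer = msg["viewer_id"]
            totals[viewer] = totals.get(viewer, 0) + msg.get("seconds_viewed", 0)

        # Write the per-partition aggregate next to the raw data (hypothetical layout).
        s3.put_object(
            Bucket=bucket,
            Key=key.replace("raw/", "aggregated/") + ".summary.json",
            Body=json.dumps(totals).encode("utf-8"),
        )
```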

Should I do many smaller requests, or fewer but larger requests using S3 to pass data

I'm working on a project that requires data entries to be inserted into an RDS instance. We're using a serverless stack (cognito, api gateway, lambda, rds) to accomplish this. Our application requires a large amount of data to be read off of an embedded device, prior to insertion. That data must then be inserted immediately.
Based on our current setup, a single batch of data could be in excess of 60KB, but that's a worst case scenario.
Is there an accepted best practice or ideal way of sending/accessing this data this large in my lambda function? As of right now, I'm planning on shipping it off with my API request. I've seen s3 mentioned as an intermediary for large quantities of data, but I'm not sure if it's really necessary for something like this.
In my experience it depends on a number of factors. What communication channel are you using? What is the drop rate? Do you experience corrupt packets? What is your embedded device?
If you can send the data in one go with a 97% success rate, then I don't see a reason to split the data. If packets take a long time and connections can drop, then it's good to send multiple packets and resend the failed ones.
For the network, 60KB is a small amount of data. If you have a slow 2G embedded device then that's your bottleneck, and you need to experiment to find the most efficient way to get the data out of it. A single stream of data would probably be the most efficient.
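If payloads ever do outgrow what you want to push through API Gateway, the S3-intermediary option from the question is usually implemented with a presigned URL; a minimal sketch follows, with a hypothetical bucket and key.
```python
# Hypothetical bucket and key; the backend hands the device a presigned PUT URL,
# the device uploads directly to S3, and downstream processing can be triggered
# by an S3 event instead of carrying the payload through the API request.
import boto3

s3 = boto3.client("s3")

url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "device-uploads", "Key": "device-42/batch-0001.json"},
    ExpiresIn=300,  # the device has 5 minutes to PUT its batch
)
print(url)  # return this to the device; it uploads with a plain HTTP PUT
```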

Microservice for notifications

Most of us are familiar with the notifications that pop up at the top right of the FB page. After we ack them, we can still see them, ordered latest-first, etc.
I have a question on designing such a thing (I'm also trying to do some reading on the side about the architecture for this). So at a high level, let us say there are these microservices running in the system:
S1
S2
S3
So let us say I have services like that. I am thinking of creating a new service, say S4, that can receive messages from these services. I am less worried about how these services will talk to S4; there are ways like SQS, Kafka, etc.
My main Q is how can S4 do the following
Maintain these notifications in like what? PSQL?
Have a column with timestamp?
Do I need a timeseries DB for it?
How can I store them based on severity? Fatal, Info, critical?
How can someone ack a notification?
Below are some thoughts, but I don't think there is a one size fit all approach. It really depends on the rest of your architecture, domain requirements, scalability requirements, etc.
Maintain these notifications in like what? PSQL?
Do you really need to store the notifications in addition to your event queues? It may depend on whether your queue design allows just-in-time access based on topics and message content. Many queue systems allow you to store a message indefinitely/until processed (check out Apache Pulsar). In this case there is a backlog of unacknowledged messages that can be accessed and processed whenever ready (e.g. the client confirms the message was read). After acknowledgement you can archive or delete the events.
If you decide that doesn't work and you need an additional database, the usual suspects come into play: document or key-value stores, such as MongoDB, or relational DBs, such as CockroachDB.
Have a column with timestamp?
Most queue systems would have a time stamp recorded by default. Sometimes the queue order could be enough. Depending on your query requirements a schema-less format (e.g. json or blob) for the message content and some explicit identifiers for finding the messages (by user etc) would be required as a minimum in most cases.
Do I need a timeseries DB for it?
Possibly useful if you need to track a large number of values changing over time, e.g. IOT related sensor data such as a temperature measurement. If you just have a handful of facebook-style notifications per user that seems not like a good fit.
How can I store them based on severity? Fatal, Info, critical?
Here you could utilize separation by topics in event queues or when using a separate DB it depends again if you use a schema or schema-less DB.
How can someone ack a notification?
As previously mentioned, modern event queue systems have that built in, or if using a separate DB you'll have to add that to the schema (or use a schema-less DB). In either case you need to define or implement the behavior for message retention after acknowledgment, e.g. whether to delete or archive.
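If you do end up with a separate relational store, a minimal sketch of a schema that covers latest-first ordering, severity and acknowledgement might look like this; the table, column and index names are illustrative, not prescriptive.
```python
# A sketch assuming PostgreSQL as the separate store; names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=notifications user=app password=... host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS notifications (
            id              BIGSERIAL PRIMARY KEY,
            user_id         BIGINT      NOT NULL,
            source_service  TEXT        NOT NULL,          -- e.g. 'S1', 'S2', 'S3'
            severity        TEXT        NOT NULL
                            CHECK (severity IN ('info', 'critical', 'fatal')),
            payload         JSONB       NOT NULL,          -- schema-less message content
            created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
            acknowledged_at TIMESTAMPTZ                    -- NULL until the user acks
        );
        -- Latest-first listing per user is the hot path.
        CREATE INDEX IF NOT EXISTS idx_notifications_user_created
            ON notifications (user_id, created_at DESC);
    """)
    # Acknowledging a notification is then just a timestamp update:
    # UPDATE notifications SET acknowledged_at = now() WHERE id = %s AND user_id = %s;
```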

AWS Event-Sourcing implementation

I'm quite a newbie in microservices and Event-Sourcing and I was trying to figure out a way to deploy a whole system on AWS.
As far as I know there are two ways to implement an Event-Driven architecture:
Using AWS Kinesis Data Stream
Using AWS SNS + SQS
So my base strategy is that every command is converted to an event which is stored in DynamoDB and exploit DynamoDB Streams to notify other microservices about a new event. But how? Which of the previous two solutions should I use?
The first one has the advantages of:
Message ordering
At-least-once delivery
But the disadvantages are quite problematic:
No built-in autoscaling (you can achieve it using triggers)
No message visibility functionality (apparently, asking to confirm that)
No topic subscription
Very strict read transaction limits: you can improve this using multiple shards, but from what I read here, you must have a not-well-defined number of Lambdas with different invocation priorities and a not-well-defined strategy to avoid duplicate processing across multiple instances of the same microservice.
The second one has the advantages of:
Is completely managed
Very high TPS
Topic subscriptions
Message visibility functionality
Drawbacks:
SQS messages are best-effort ordering; I still have no idea what that means.
It says "A standard queue makes a best effort to preserve the order of messages, but more than one copy of a message might be delivered out of order".
Does it mean that, given n copies of a message, the first copy is delivered in order while the others are delivered unordered compared to the other messages' copies? Or could "more than one" be "all"?
A very big thanks for every kind of advice!
I'm quite a newbie in microservices and Event-Sourcing
Review Greg Young's talk Polyglot Data for more insight into what follows.
Sharing events across service boundaries has two basic approaches - a push model and a pull model. For subscribers that care about the ordering of events, a pull model is "simpler" to maintain.
The basic idea being that each subscriber tracks its own high water mark for how many events in a stream it has processed, and queries an ordered representation of the event list to get updates.
In AWS, you would normally get this representation by querying the authoritative service for the updated event list (the implementation of which could include paging). The service might provide the list of events by querying dynamodb directly, or by getting the most recent key from DynamoDB, and then looking up cached representations of the events in S3.
In this approach, the "events" that are being pushed out of the system are really just notifications, allowing the subscribers to reduce the latency between the write into Dynamo and their own read.
I would normally reach for SNS (fan-out) for broadcasting notifications. Consumers that need bookkeeping support for which notifications they have handled would use SQS. But the primary channel for communicating the ordered events is pull.
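A minimal sketch of that pull side against DynamoDB, assuming (hypothetically) an event table keyed by stream_id with a numeric sequence sort key: the SNS/SQS notification only says "something new exists", and the subscriber pulls everything past its own high-water mark, in order.
```python
# Hypothetical table name and key schema; a sketch of the pull-model consumer.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("order-events")

def fetch_new_events(stream_id: str, high_water_mark: int):
    """Return events with sequence > high_water_mark, oldest first."""
    resp = table.query(
        KeyConditionExpression=Key("stream_id").eq(stream_id)
        & Key("sequence").gt(high_water_mark),
        ScanIndexForward=True,   # ascending sequence = event order
    )
    return resp["Items"]

# The subscriber persists its own checkpoint after successfully handling a batch.
events = fetch_new_events("order-1234", high_water_mark=17)
new_checkpoint = max((e["sequence"] for e in events), default=17)
```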
I myself haven't looked hard at Kinesis - there's some general discussion in earlier questions -- but I think Kevin Sookocheff is onto something when he writes
...if you dig a little deeper you will find that Kinesis is well suited for a very particular use case, and if your application doesn’t fit this use case, Kinesis may be a lot more trouble than it’s worth.
Kinesis’ primary use case is collecting, storing and processing real-time continuous data streams. Data streams are data that are generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes).
Another thing: the fact that I'm accessing data from another
microservice stream is an anti-pattern, isn't it?
Well, part of the point of dividing a system into microservices is to reduce the coupling between the capabilities of the system. Accessing data across the microservice boundaries increases the coupling. So there's some tension there.
But basically if I'm using a pull model I need to read
data from other microservices' stream. Is it avoidable?
If you query the service you need for the information, rather than digging it out of the stream yourself, you reduce the coupling -- much like asking a service for data rather than reaching into an RDBMS and querying the tables yourself.
If you can avoid sharing the information between services at all, then you get even less coupling.
(Naive example: order fulfillment needs to know when an order has been paid for; so it needs a correlation id when the payment is made, but it doesn't need any of the other billing details.)

Kafka Storm HDFS/S3 data flow

It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.
I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency of Storm for the raw data storage.
Is this possible? Are you aware of any documentation/examples/implementations like this?
Also, does Kafka have good support for S3 storage?
I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?
Thanks -- I appreciate it!
Regarding Camus,
Yeah, a scheduler that launches the job should work.
What they use at LinkedIn is Azkaban, you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
Regarding Camus with S3, currently I don't think that is in place.
Regarding Kafka support for S3 storage, there are several Kafka S3 consumers you can easily plugin to get your data saved to S3. kafka-s3-storage is one of them.
There are many possible ways to feed Storm with translated data. The main question that is not clear to me is what dependency you wish to eliminate and what tasks you wish to keep Storm from doing.
If it is considered OK for Storm to receive XML or JSON, you could easily read from the original queue using two consumers. As each consumer controls the messages it reads, both could read the same messages. One consumer could insert the data into your storage and the other would translate the information and send it to Storm. There is no real complexity with the feasibility of this, but I believe this is not the ideal solution for the following reasons:
Maintainability - a consumer needs supervision. You would therefore need to supervise your running consumers. Depending on your deployment and the way you handle data types, this might be a non-trivial effort, especially when you already have Storm installed and therefore supervised.
Storm connectivity - you still need to figure out how to connect this data to Storm. Storm has a Kafka spout that I have used, and it works very well. But using the suggested architecture, this means an additional Kafka topic to place the translated messages on. This is not very efficient, as the spout could also read information directly from the original topic and translate it using a simple bolt.
The suggested way to handle this would be to form a topology, using a Kafka spout to read raw data, one bolt to send the raw data to storage, and another to translate it. But this solution depends on the reasons you wish to keep Storm out of the raw-data business.
Kafka actually retains events for a configurable period of time -- events are not purged immediately upon consumption like other message or queue systems. This allows you to have multiple consumers that can read from Kafka either at the beginning (per the configurable retention time) or from an offset.
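A small sketch of that point using kafka-python: two consumers with different group IDs keep independent offsets on the same topic, so the archive path and the Storm-facing path can each read all events at their own pace within the retention window (topic and broker names are hypothetical).
```python
# Hypothetical topic and broker; illustrates independent consumer groups only.
from kafka import KafkaConsumer

def make_consumer(group_id: str) -> KafkaConsumer:
    return KafkaConsumer(
        "raw-events",                       # hypothetical topic
        bootstrap_servers=["broker:9092"],  # hypothetical broker
        group_id=group_id,                  # distinct group => independent offsets
        auto_offset_reset="earliest",
    )

archiver = make_consumer("hdfs-s3-archiver")   # e.g. the Camus/batch-load path
analytics = make_consumer("storm-ingest")      # e.g. feeds the Storm topology

for msg in archiver:
    pass  # write msg.value to HDFS/S3 in batches (omitted in this sketch)
```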
For the use case described, you would use Camus to batch load events to hadoop, and Storm to read events off the same Kafka output. Just ensure both processes read new events before the configurable retention time expires.
Regarding Camus, ggupta1612 answered this aspect best
A scheduler that launches the job should work. What they use at LinkedIn is Azkaban, you can look at that too.
If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.