Dynamo streams on small tables consumed by multiple instances - amazon-web-services

I am using DynamoDB to store configuration for an application; this configuration is likely to be changed a few times a day and will be on the order of tens of rows. My application will be deployed to a number of EC2 instances. I will eventually write another application to allow management of the configuration; in the meantime, configuration is managed by making changes to the table directly in the AWS console.
I am trying to use DynamoDB streams to watch for changes to the configuration, and when the application receives records to process, it simply rereads the entire DynamoDB table.
This works locally and when deployed to one instance, but when I deploy it to three instances it never initializes the IRecordProcessor, and doesn't pick up any changes to the table.
I suspect this is because the table has only one shard, and the number of instances should not exceed the number of shards (at least for kinesis streams, I understand that kinesis and dynamo streams are actually different though).
I know how to split shards in Kinesis streams, but cannot seem to find a way to do this for DynamoDB streams. I read that, in fact, the number of shards in a DynamoDB stream is equal to the number of partitions in the DynamoDB table, and you can increase the number of partitions by increasing read/write capacity. I don't want to increase the throughput, as this would be costly.
Does the condition that the number of shards should be more than the number of instances also apply to DynamoDB streams? If so, is there another way to increase the number of shards, and if not, is there a known reason that DynamoDB streams on small tables fail when read by multiple instances?
Is there a better way to store and watch such configuration (ideally using AWS infrastructure)? I am going to investigate triggers.

I eventually solved this by adding the instance ID (EC2MetadataUtils.getInstanceId) to the stream name when setting up the KinesisClientLibConfiguration, so a new stream consumer application is set up for each instance. This does result in a separate DynamoDB table (the KCL lease table) being set up for each instance, and I now need to delete old tables when I restart the app on new instances.
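For reference, a minimal sketch of what that per-instance setup might look like, assuming KCL 1.x with the DynamoDB Streams Kinesis Adapter; the ConfigWatcherConfig class and the "config-watcher-" prefix are made up, and the instance ID goes into the application/worker name, which is also what names the KCL lease table:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.util.EC2MetadataUtils;

public class ConfigWatcherConfig {

    // Build a per-instance KCL configuration. Because the application name is
    // unique per EC2 instance, each instance gets its own KCL lease/checkpoint
    // table and therefore processes every shard of the stream independently.
    public static KinesisClientLibConfiguration forThisInstance(String streamArn) {
        String instanceId = EC2MetadataUtils.getInstanceId();   // e.g. "i-0abc123def456"
        String appName = "config-watcher-" + instanceId;        // illustrative prefix
        return new KinesisClientLibConfiguration(
                appName,                                        // also names the lease table
                streamArn,                                      // DynamoDB stream ARN (used with the streams adapter client)
                new DefaultAWSCredentialsProviderChain(),
                appName)                                        // worker id
            .withInitialPositionInStream(InitialPositionInStream.LATEST);
    }
}

Each instance then reads every shard of the stream on its own, at the cost of one lease table per instance - which is exactly the set of tables that needs cleaning up when instances are replaced.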
I also contacted AWS support, and received this response.

Related

Recommendation for near real time data sync between DynamoDb and S3/Redshift

I have a bunch of tables in DynamoDB (80 right now, but this can grow in the future) and I'm looking to sync data from these tables to either Redshift or S3 (using Glue on top of it to query with Athena) for running analytics queries.
There are also very frequent updates (to existing entries) and deletes in the DynamoDB tables, which I want to sync along with newly added entries.
I checked the write capacity units (WCU) consumed across all tables, and the rate comes out to around 35-40 WCU per second at peak times.
Solutions I considered:
Use Kinesis Firehose along with a Lambda (which reads updates from DDB streams) to push data to Redshift in small batches. (Issue: it cannot support updates and deletes and is only good for adding new entries, because it uses the Redshift COPY command under the hood to upload data to Redshift.)
Use a Lambda (which reads updates from DDB streams) to copy data to S3 directly as JSON, with each entry being a separate file. This can support updates and deletes if the S3 file path matches the primary key of the DynamoDB table; a sketch of this follows the list. (Issue: it will result in tons of small files in S3, which might not scale for querying with AWS Glue.)
Use a Lambda (which reads updates from DDB streams) to write data to Redshift directly as soon as an update happens. (Issue: too many small writes to Redshift can cause scaling issues, as Redshift is more suited to batch writes/updates.)
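A rough sketch of the second option, assuming a Java Lambda triggered by the DynamoDB stream; the bucket name, key layout, and the assumption of a single string hash key are all illustrative:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class DdbToS3Mirror implements RequestHandler<DynamodbEvent, Void> {

    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    private static final String BUCKET = "my-ddb-mirror";   // illustrative bucket name

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            // Derive the S3 key from the table name (taken from the stream ARN)
            // and the item's hash key, so updates overwrite and deletes remove
            // exactly one object per item.
            String tableName = record.getEventSourceARN().split("/")[1];
            String hashKey = record.getDynamodb().getKeys()
                    .values().iterator().next().getS();      // assumes a single string hash key
            String s3Key = tableName + "/" + hashKey + ".json";

            if ("REMOVE".equals(record.getEventName())) {
                S3.deleteObject(BUCKET, s3Key);               // propagate deletes
            } else {                                          // INSERT or MODIFY
                // A real implementation would convert the attribute map to proper JSON;
                // toString() is only a placeholder here.
                S3.putObject(BUCKET, s3Key, record.getDynamodb().getNewImage().toString());
            }
        }
        return null;
    }
}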

Lambda architecture on AWS: choose database for batch layer

We're building a Lambda architecture on the AWS stack. A lack of devops knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views --------+
                                                                    |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views -----+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results should be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert & random-read requirements. DynamoDB doesn't fit batch inserts - it's too expensive because of the throughput required for them. Athena is MPP and moreover has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer; I'm not sure if it's a good idea to perform batch inserts there as well.
Should we introduce the fourth storage solution or stay with existing?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3 / Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent query limits. That will also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of batch inserts into the ElastiCache that the streaming layer uses. Also, does it follow the Lambda architecture - keeping batch and speed layer views in the same data stores?
The second solution is bad because it adds a fourth database/storage system, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
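To make that concrete, here is a hedged sketch of batch writes and random reads against HBase on EMR using the standard Java client; the table name hourly_views, column family v, and row-key scheme are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HourlyViews {

    private static final byte[] CF = Bytes.toBytes("v");   // assumed column family

    // Batch layer: write one hour's worth of aggregated rows in bulk.
    public static void writeHour(Connection conn, Map<String, String> rowsById) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("hourly_views"))) {   // assumed table
            List<Put> puts = new ArrayList<>();
            for (Map.Entry<String, String> e : rowsById.entrySet()) {
                Put put = new Put(Bytes.toBytes(e.getKey()));                     // row key
                put.addColumn(CF, Bytes.toBytes("payload"), Bytes.toBytes(e.getValue()));
                puts.add(put);
            }
            table.put(puts);                                                      // batched insert
        }
    }

    // Serving layer: low-latency random read by row key.
    public static String read(Connection conn, String id) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("hourly_views"))) {
            Result result = table.get(new Get(Bytes.toBytes(id)));
            byte[] value = result.getValue(CF, Bytes.toBytes("payload"));
            return value == null ? null : Bytes.toString(value);
        }
    }

    public static Connection connect() throws IOException {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml on EMR
        return ConnectionFactory.createConnection(conf);
    }
}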

Kinesis Stream and Kinesis Firehose Updating Elasticsearch Indexes

We want to use a Kinesis stream and Firehose to update an AWS managed Elasticsearch cluster. We have hundreds of different indexes (corresponding to our DB shards) that need to be updated. When creating the Firehose it requires that I specify the specific index name I want updated. Does that mean I need to create a separate Firehose for each index in my cluster? Or is there a way to configure the Firehose so it knows what index to use based on the content of the data?
Also, we would have 20 or so separate producers that would send data to a Kinesis stream (each one of these producers would generate data for 10 different indexes). Would I also need a separate Kinesis stream for each producer?
Summary:
20 producers (EC2 instances) -> each producer sends data for 10 different indexes to a Kinesis stream -> the Kinesis stream then uses a Firehose to update a single cluster, which has 200 indexes in it.
Note: all of the indexes have the same mapping and name template, i.e. index_1, index_2...index_200
Edit: As we reindex the data we create new indexes along the lines of index_1-v2. Obviously we won't want to create a new firehose for each index version as they're being created. The new index name can be included in the JSON that's sent to the kinesis stream.
As you guessed, Firehose is the wrong solution for this problem, at least as stated. It is designed for situations where there's a 1:1 correspondence between stream (not producer!) and index. Things like clickstream data or log aggregation.
For any solution, you'll need to provide a mechanism to identify which index a record belongs to. You could do this by creating a separate Kinesis stream per message type (in which case you could use Firehose), but this would mean that your producers have to decide which stream to write each message to. That may cause unwanted complexity in your producers, and may also increase your costs unacceptably.
So, assuming that you want a single stream for all messages, you need a consumer application and some way to group those messages. You could include a message type (/ index name) in the record itself, or use the partition key for that purpose. The partition key makes for a somewhat easier implementation, as it guarantees that records for the same index will be stored on the same shard, but it means that your producers may be throttled.
For the consumer, you could use an always-on application that runs on EC2, or have the stream invoke a Lambda function.
Using Lambda is nice if you're using partition key to identify the message type, because each invocation only looks at a single shard (you may still have multiple partition keys in the invocation). On the downside, Lambda will poll the stream once per second, which may result in throttling if you have multiple stream consumers (with a stand-alone app you can control how often it polls the stream).
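To make that concrete, here is a rough sketch of a Lambda consumer that groups a batch by partition key (used here as the index name) before bulk-indexing; the actual Elasticsearch bulk call is left as a placeholder, since the client library isn't specified:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent.KinesisEventRecord;

public class IndexRouter implements RequestHandler<KinesisEvent, Void> {

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        // Group the batch by partition key, which here doubles as the target index name.
        Map<String, List<String>> docsByIndex = new HashMap<>();
        for (KinesisEventRecord record : event.getRecords()) {
            String indexName = record.getKinesis().getPartitionKey();   // e.g. "index_17-v2"
            String document = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData()).toString();
            docsByIndex.computeIfAbsent(indexName, k -> new ArrayList<>()).add(document);
        }

        // One bulk request per index; the Elasticsearch call is intentionally omitted
        // here (use whatever client your cluster supports, e.g. a _bulk POST).
        for (Map.Entry<String, List<String>> entry : docsByIndex.entrySet()) {
            context.getLogger().log("Would bulk-index " + entry.getValue().size()
                    + " documents into " + entry.getKey());
        }
        return null;
    }
}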

Partition Kinesis firehose S3 records by event time

Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date-based portion of the final S3 object keys is generated.
The only option available to you is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
This isn't an answer to the question; however, I would like to explain a little the idea behind storing records according to their arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream. By specifying this number, one can start reading the stream from a certain event.
Now back to the default Firehose-to-S3 setup... Since the capacity of a Kinesis stream is quite limited, most probably one needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream to S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream one needs these sequential numbers for checkpoints. And these numbers are the records' arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some bunches of events while reading the S3 (stream). So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by arrival time.
Then, you define another Athena table, say, events_by_event_time, which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them to events_by_event_time and now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
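To illustrate the repartitioning step, here is a sketch of kicking off that INSERT INTO from the AWS SDK for Java; the database name, results bucket, column names, and arrival_date predicate are all assumptions:

import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;

public class RepartitionJob {

    // Move newly arrived events into the event-time-partitioned table. A real job
    // would scope the SELECT to only the latest arrival-time partition(s) so each
    // run touches just the new data.
    public static String run() {
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();
        String sql =
            "INSERT INTO events_by_event_time "
          + "SELECT id, payload, event_time "                       // column list is illustrative;
          + "FROM events "                                          // the partition column goes last
          + "WHERE arrival_date = cast(current_date as varchar)";   // assumed arrival-time partition column

        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                .withQueryString(sql)
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase("analytics"))   // assumed database
                .withResultConfiguration(new ResultConfiguration()
                        .withOutputLocation("s3://my-athena-results/"));                            // assumed results bucket
        return athena.startQueryExecution(request).getQueryExecutionId();
    }
}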
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function which takes your records, processes them, changes the partition key and then sends them back to Firehose to be added. You would also have to change the Firehose to enable this partitioning and also define your custom partition key/prefix/suffix.
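A hedged sketch of such a transformation function, returning the record shape Firehose expects (recordId, result, base64 data, and metadata.partitionKeys for dynamic partitioning); the event_hour key, the extraction helper, and the use of plain maps for the response are assumptions:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent;
import com.amazonaws.services.lambda.runtime.events.KinesisFirehoseEvent.Record;

public class PartitionKeyTransform implements RequestHandler<KinesisFirehoseEvent, Map<String, Object>> {

    @Override
    public Map<String, Object> handleRequest(KinesisFirehoseEvent event, Context context) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Record record : event.getRecords()) {
            String json = StandardCharsets.UTF_8.decode(record.getData()).toString();

            // Pull the event time out of the record; a real implementation would
            // parse the JSON properly instead of this placeholder extraction.
            String eventHour = extractEventHour(json);

            Map<String, Object> result = new HashMap<>();
            result.put("recordId", record.getRecordId());
            result.put("result", "Ok");
            result.put("data", Base64.getEncoder()
                    .encodeToString(json.getBytes(StandardCharsets.UTF_8)));
            // Dynamic partitioning reads these keys and substitutes them into the
            // S3 prefix expression configured on the delivery stream.
            result.put("metadata", Map.of("partitionKeys", Map.of("event_hour", eventHour)));
            out.add(result);
        }
        Map<String, Object> response = new HashMap<>();
        response.put("records", out);
        return response;
    }

    private static String extractEventHour(String json) {
        // Placeholder: derive something like "2021/08/15/13" from the record's event-time field.
        return "unknown";
    }
}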

How does AWS Lambda parallel execution work with DynamoDB?

I went through this article, which says that the data records are organized into groups called shards, and these shards can be consumed and processed in parallel by Lambda functions.
I also found these slides from an AWS webinar where, on slide 22, you can also see that Lambda functions consume different shards in parallel.
However, I could not achieve parallel execution of a single function. I created a simple Lambda function that runs for a minute. Then I started to create tons of items in DynamoDB, expecting to get a lot of stream records. In spite of this, my function's invocations were started one after another.
What am I doing wrong?
Pre-Context:
How does DynamoDB store data?
DynamoDB uses partitions to store the table records. These partitions are abstracted from users and managed by the DynamoDB team. As data grows in the table, these partitions are further split internally.
What are these DynamoDB streams all about?
DynamoDB, as a database, provides a way for users to retrieve the ordered change logs (think of them as the transactional replay logs of a traditional database). These are vended as DynamoDB table streams.
How is data published to streams?
A stream has a concept of shards (which is somewhat similar to a partition). Shards, by definition, contain ordered events. In DynamoDB terminology, a stream shard will contain the data from a certain partition.
Cool! So what will happen if data grows in the table or frequent writes occur?
DynamoDB will keep persisting records, based on their HashKey/SortKey, in the associated partition until a threshold is breached (like table size and/or RCU/WCU counts). The exact values of these thresholds are not shared with us by DynamoDB, though we have some documentation giving rough estimates.
When this threshold is breached, DynamoDB splits the partition and re-hashes to distribute the data (somewhat) evenly across the partitions.
Since new partitions have been created, their data will be published to their own shards (each mapped to its partition).
Great, so what about Lambda? How does the parallel processing work then?
One Lambda function instance processes records from one and only one shard. Thus the number of shards present in the DynamoDB stream decides the number of Lambda instances running in parallel.
Roughly, you can think of it as: # of partitions = # of shards = # of Lambdas running in parallel.
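If you want to check this for your own table, you can list the stream's shards directly; a small sketch against the DynamoDB Streams API (pagination via LastEvaluatedShardId is omitted, and the stream ARN is passed in as an argument):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.DescribeStreamRequest;
import com.amazonaws.services.dynamodbv2.model.StreamDescription;

public class ShardCount {
    public static void main(String[] args) {
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.defaultClient();
        StreamDescription description = streams.describeStream(
                new DescribeStreamRequest().withStreamArn(args[0]))   // pass your stream ARN
            .getStreamDescription();

        // Open shards are the ones Lambda can poll in parallel; closed (parent)
        // shards have an ending sequence number and are drained first.
        long open = description.getShards().stream()
                .filter(s -> s.getSequenceNumberRange().getEndingSequenceNumber() == null)
                .count();
        System.out.println("Total shards: " + description.getShards().size()
                + ", open shards: " + open);
    }
}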
From the first article it is said:
Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard. This will ensure that the stream records are also processed in the correct order.
Yet, when working with Kinesis streams for example, you can achieve parallelism by having multiple shards, as the order in which records are processed is guaranteed only within a shard.
As a side note, it makes some sense to trigger Lambda with DynamoDB events in order.