I am trying to implement an event-driven architecture using Amazon Kinesis as the central event log of the platform. The idea is pretty much the same to the one presented by Nordstrom's with the Hello-Retail project.
I have done similar things with Apache Kafka before, but Kinesis seems to be a cost-effective alternative to Kafka and I decided to give it a shot. I am, however, facing some challenges related to event persistence and replaying. I have two questions:
Are you guys using Kinesis for such use-case OR do you recommend using it?
Since Kinesis is not able to retain the events forever (like Kafka does), how to handle replays from consumers?
I'm currently using a lambda function (Firehose is also an option) to persist all events to Amazon S3. Then, one could read past events from the storage and then start listening to new events coming from the stream. But I'm not happy with this solution. Consumers are not be able to use Kinesis' checkpoints (Kafka's consumer offsets). Plus, Java's KCL does not support the AFTER_SEQUENCE_NUMBER yet, which would be useful in such implementation.
First question. Yes I am using Kinesis Streams when I need to process the received log / event data before storing in S3. When I don't I use Kinesis Firehose.
Second question. Kinesis Streams can store data up to seven days. This is not forever, but should be enough time to process your events. Depending on the value of the events being processed ....
If I do not need to process the event stream before storing in S3, then I use Kinesis Firehose writing to S3. Now I do not have to worry about event failures, persistence, etc. I then process the data stored in S3 with the best tool. I use Amazon Athena often and Amazon Redshift too.
You don't mention how much data you are processing or how it is being processed. If it is large, multiple MB / sec or higher, then I would definitely use Kinesis Firehose. You have to manage performance with Kinesis Streams.
One issue that I have with Kinesis Streams is that I don't like the client libraries, so I prefer to write everything myself. Kinesis Firehose reduces coding for custom applications as you just store the data in S3 and then process afterwards.
I like to think of S3 as my big data lake. I prefer to throw everything into S3 without preprocessing and then use various tools to pull out the data that I need. By doing this I remove lots of points of failure that need to be managed.
Related
I am working on an application that will read and analyze the logs of payment transactions. I know I will use Kinesis Analytics as per my requirements, which takes the input from the Data Streams and Firehose. But I am having trouble deciding which input method should I use for my system. My requirements are:
It can tolerate latency, but Data shouldn't lose data.
Must record all the errors in DynamoDB or S3 buckets.
Which input stream is suitable for my use case?
Data Streams vs Firehose
Streams:
Kinesis data streams is highly customizable and best suited for developers building custom applications or streaming data for specialized needs.
Going to write custom code
Real time (200ms latency for classic, 70ms latency for enhanced fan-out)
You must manage scaling (shard splitting/merging)
Data storage for 1 to 7 days, replay capability, multi consumers
Use with Lambda to insert data in real-time to ElasticSearch
Firehose:
Firehose handles loading data streams directly into AWS products for processing.
Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Serverless data transformations with Lambda
Near real time (lowest buffer time is 1 minute)
Automated Scaling
No data storage
Kinesis Data Streams allows consumers to READ streaming data. And it gives you a plenty of options to do so. It is best suitable for use cases that require custom processing, choice of stream processing frameworks, and sub-second processing latency.
Data is reliably stored in streams up to 7 days and distributed across 3 Availability Zones.
Kinesis Firehose is used to LOAD streaming data to a target destination (S3, Elasticsearch, Splunk, etc). You can also transform streaming data (by using Lambda) before loading it to destination.
Data from failed attempts will be saved to S3.
So, if your goal is to only load data to Kinesis Data Analytics service with minimal or no pre-processing then try Kinesis Firehose first.
Please note, that you also would need to consider such aspects as cost, development efforts, scaling options, volume of the data when choosing a proper service.
Please take a look at the following AWS Solutions Implementation for reference:
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
https://aws.amazon.com/solutions/implementations/real-time-iot-device-monitoring-with-kinesis/
There are some key differences between Kinesis Stream (KS) and Firehose (FH):
KS is real time, while FH is near-real time.
KS requires manual scaling and setup of its provisioning (shards) , while FH is basically serverless.
KS records are immutable (they persist in stream for its retention period - default 24h), while records in FH are gone from FH the moment they are delivered to destination.
From what you wrote, I think FH should be considered first, as you are not concerned about non-real-time nature of FH, it is much easier to manage and setup, and you can specify S3 as a backup for failed or all messages:
Kinesis Data Firehose uses Amazon S3 to backup all or failed only data that it attempts to deliver to your chosen destination.
The S3 backup ensures you are not loosing records, if delivery or lambda processing fail. Subsequently, in my view, Firehose addresses your two points well.
You can use firehose to feed into analytics, but question is how firehose gets data? You can write your own code to feed data or use kinesis data steams. Firehose mainly is delivery system for stream data that can be written in to various destinations such as S3, Redshift or others with optional capability to perform data transformation.
Check this link https://www.slideshare.net/AmazonWebServices/abd217from-batch-to-streaming?from_action=save and see how your use case can benefit from the information.
More info: https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html
https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
If you are creating s3 files from the kinesis stream but you dont require cleaning of those s3 files then go with the firehose option. Also if you dont have any partitioning key requirement that makes many small s3 files then firehose is a good solution. If you are doing more cleaning up the FH files than you would have created those s3 files yourself then FH isnt a good option.
Also depends on what do you with those s3 files. You need to find out if you are saving any work/money because of using Firehose against the manual creation of S3 files. Remember you cant reorder the content of the s3 files.
I am new to Data Engeenering, and am picking up a project for personal growth.
I understand Kafka is used for data ingestion, but does it make sense to use it to ingest data from an API to AWS S3?
Is there another way to do the same, or what situation can be brought to use Kafka in such a situation?
The best way to ingest streaming data to S3 is Kinesis Data Firehose. Kinesis is a technology similar to Kafka, and Data Firehose is its specialised version for delivering data to S3 (or other destinations). It's fairly cheap and configurable, and much, much less hassle than Kafka if all you want to do is get data onto S3. Kafka is great, but it's not going to deliver data to S3 out of the box.
My use-case is as follows:
I have JSON data coming in which needs to be stored in S3 in parquet format. So far so good, I can create a schema in Glue and attach a "DataFormatConversionConfiguration" to my firehose stream. BUT the data is coming from different "topics". Each topic has a particular "schema". From my understanding I will have to create multiple firehose streams, as one stream can only have one schema. But I have thousands of such topics with very high volume high throughput data incoming. It does not look feasible to create so many firehose resources (https://docs.aws.amazon.com/firehose/latest/dev/limits.html)
How should I go about building my pipeline.
IMO you can:
ask for upgrade of your Firehose limit and do everything with 1 Firehose/stream + add Lambda transformation to convert the data into a common schema - IMO not cost-efficient but you should see with your load.
create a Lambda for each Kinesis data stream, convert each event to the schema managed by a single Firehose and at the end can send the events directly to your Firehose stream with Firehose API https://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecord.html (see "Q: How do I add data to my Amazon Kinesis Data Firehose delivery stream?" here https://aws.amazon.com/kinesis/data-firehose/faqs/) - but also, check the costs before because even though your Lambdas are invoked "on demand", you may have a lot of them invoked during long period of time.
use one of data processing frameworks (Apache Spark, Apache Flink, ...) and read your data from Kinesis in batches of 1 hour, starting every time when you terminated last time --> use available sinks to convert the data and write it in Parquet format. The frameworks use the notion of checkpointing and store the last processed offset in an external storage. Now, if you restart them every hour, they will start to read the data directly from the last seen entry. - It may be cost-efficient, especially if you consider using spot instances. On the other side, it requires more coding that 2 previous solutions and obviously may have higher latency.
Hope that helps. You could please give a feedback about chosen solution ?
I have a server which can only process 20 request at a time. When lots of request coming, I want to store the request data, in some queues. and read a set of request (i.e 20) and process them by batch. What would be the ideal way to that ? Using SQS, or kinesis. I'm totally confused.
SQS = Simple Queue Service is for queuing messages in a 1:1 (once the message is consumed, it is removed from the queue)
Kinesis = low latency, high volumetry data streaming ... typically for 1:N (many consumers of messages)
As Kinesis is also storing the data for a period of time, both are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming
big data. It provides ordering of records, as well as the ability to
read and/or replay records in the same order to multiple Amazon
Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers
all records for a given partition key to the same record processor,
making it easier to build multiple applications reading from the same
Amazon Kinesis data stream (for example, to perform counting,
aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly
scalable hosted queue for storing messages as they travel between
computers. Amazon SQS lets you easily move data between distributed
application components and helps you build applications in which
messages are processed independently (with message-level ack/fail
semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I
use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with
requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are
simpler when all records for a given key are routed to the same record
processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining
the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a
real-time dashboard and another that archives data to Amazon Redshift.
You want both applications to consume data from the same stream
concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that
runs a few hours behind the billing application. Because Amazon
Kinesis Data Streams stores data for up to 7 days, you can run the
audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are
similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track
the successful completion of each item independently. Amazon SQS
tracks the ack/fail, so the application does not have to maintain a
persistent checkpoint/cursor. Amazon SQS will delete acked messages
and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can
configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the
backlog is cleared. With Amazon Kinesis Data Streams, you can scale up
to a sufficient number of shards (note, however, that you'll need to
provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional
load spikes or the natural growth of your business. Because each
buffered request can be processed independently, Amazon SQS can scale
transparently to handle the load without any provisioning instructions
from you.
Question
I've read this and this and this articles. But they provide contradictory answers to the question: how to customize partitioning on ingesting data to S3 from Kinesis Stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, data will be processed with EMR Spark.
From time to time I have to handle historical bulk ingest into Kinesis Streams. The issue is that my Spark logic hardly depends on data partitioning and order of event handling. But Firehouse supports partitioning only by ingestion_time (into Kinesis Stream), not by any other custom field (I need by event_time).
For example, under Firehouse's partition 2018/12/05/12/some-file.gz I can get data for the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition data from Kinesis Steam with help of custom lambda. But this looks more complex and error-prone for me. Maybe because I'm not very familiar with AWS lambdas. Moreover, I'm not sure how well it will perform on bulk load. At this article it was said that Lambda option is much more expensive than Firehouse delivery.
Load data with Firehouse, then launch Spark EMR job to copy the data to another bucket with right partitioning. At least it sounds simpler for me (biased, I just starting with AWS Lambas). But it has the drawback of double-copy and additional spark Job.
At one hour I could have up to 1M rows that take up to 40 MB of memory (at compressed state). From Using AWS Lambda with Amazon Kinesis I know that Kinesis to Lambda event sourcing has a limitation of 10,000 records per batch. Would it be effective to process such volume of data with Lambda?
While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema, but is partitioned by event_time.
Now, you can make use of Athena's INSERT INTO capabilities to let you repartition data without needing to write Hadoop or a Spark job and you get the serverless scale-up of Athena for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company and go in-to more depth details of the trade-offs of using EMR or a streaming solution, but now you don't need to introduce anymore systems like Lambda or EMR.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena
you may use the kinesis stream, and create the partitions like you wants.
you create a producer, and in your consumer, create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/