Kinesis Data Firehose and Glue Race conditions - amazon-web-services

We are using Kinesis Data Firehose to write RDS CDC data to S3 buckets as raw JSON files. Our Firehose buffer hints are 128 MB and 60 seconds for creating the S3 files. We have a Glue job that monitors the S3 buckets and picks up these JSON files. Our question is whether we can run into a race condition between a JSON file that Kinesis Firehose is currently writing and the Glue job that reads it. I looked at the FAQ, but I could not find any pointers. Please let me know if the race condition is possible and any strategies that can mitigate it.
https://aws.amazon.com/kinesis/data-firehose/faqs/

If you are worried that your Glue job will start working with a file that Kinesis has only partially written, you should know that S3 operations are atomic, as explained in the Amazon S3 data consistency model:
Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it never returns corrupted or partial data.
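To make that guarantee concrete, here is a minimal boto3 sketch of the reading side (the bucket and prefix names are placeholders, not from the question): any key that a listing returns corresponds to a PUT that has already completed, so reading it yields the whole Firehose batch, never a fragment.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix where Firehose delivers the CDC files.
BUCKET = "my-cdc-bucket"
PREFIX = "raw/"

def list_delivered_files():
    """List objects under the Firehose delivery prefix.

    A key only appears in the listing once its PutObject has completed,
    so every key here refers to a fully written Firehose batch."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def read_file(key):
    # get_object either returns the complete object body or fails
    # (e.g. NoSuchKey); it never returns truncated or partial data.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return body.decode("utf-8")
```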

Related

How to replay in a stream data pushed to S3 from AWS Firehose?

There are plenty of examples of how data is stored by AWS Firehose to an S3 bucket and passed in parallel to some processing app.
But I can't find anything about good practices for replaying this data from the S3 bucket in case the processing app crashed and we need to supply it with historical data, which we have in S3 but which is no longer in Firehose.
I can think of replaying it with Firehose or Lambda, but:
Kinesis Firehose cannot consume from a bucket.
Lambda would need to deserialize the .parquet files to send them to Firehose or a Kinesis Data Stream. And I'm confused by this implicit deserialization, because Firehose serialized the data explicitly.
Or maybe there is some other way to put data back from S3 into a stream that I have completely missed?
EDIT: Moreover, if we run a Lambda to push records to a stream, it will probably have to run for more than 15 minutes. So another option is to run a script that does this on a separate EC2 instance. But these methods of extracting data from S3 look so much more complicated than storing it there with Firehose that it makes me think there should be an easier approach.
The problem that tripped me up was actually that I expected some more advanced serialization than just converting to JSON (Kafka supports Avro, for example).
Regarding replaying records from the S3 bucket: this part of the solution is significantly more complicated than the part needed for archiving records. While we can archive the stream with Firehose's out-of-the-box functionality, replaying it requires two Lambda functions and two streams.
Lambda 1 (pushes file names to the first stream)
Lambda 2 (triggered for every file name in the first stream, pushes records from those files to the second stream)
The first Lambda is triggered manually, scans through all the files in the S3 bucket, and writes their names to the first stream. The second Lambda is triggered by every event in the stream of file names, reads all the records in the named file, and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda; a minimal sketch of both functions is shown below.
This solution assumes that multiple files are generated per day and that every file contains multiple records.
This is similar to this solution, except that in my case the destination is Kinesis instead of the DynamoDB used in the article.
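A minimal boto3 sketch of the two functions, assuming placeholder bucket and stream names and newline-delimited JSON in the archive (a Parquet archive, as in the question, would additionally need a Parquet reader such as pyarrow in the second function):

```python
import base64
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# All names below are placeholders for this sketch.
ARCHIVE_BUCKET = "firehose-archive-bucket"
FILENAMES_STREAM = "replay-filenames"   # first stream: one record per file name
RECORDS_STREAM = "replay-records"       # second stream: the replayed records

def lambda_1_handler(event, context):
    """Triggered manually: scan the bucket and push every object key to the first stream."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=ARCHIVE_BUCKET):
        for obj in page.get("Contents", []):
            kinesis.put_record(
                StreamName=FILENAMES_STREAM,
                Data=obj["Key"].encode("utf-8"),
                PartitionKey=obj["Key"],
            )

def lambda_2_handler(event, context):
    """Triggered by the first stream: read each named file and replay its records."""
    for record in event["Records"]:
        key = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        body = s3.get_object(Bucket=ARCHIVE_BUCKET, Key=key)["Body"].read()
        # Assumes newline-delimited JSON as written by Firehose without format conversion.
        entries = [json.loads(line) for line in body.decode("utf-8").splitlines() if line]
        for i in range(0, len(entries), 500):  # put_records accepts at most 500 records per call
            kinesis.put_records(
                StreamName=RECORDS_STREAM,
                Records=[
                    {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": key}
                    for e in entries[i:i + 500]
                ],
            )
```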

Kinesis to S3 custom partitioning

Question
I've read this, this, and this article, but they give contradictory answers to the question: how do I customize partitioning when ingesting data into S3 from a Kinesis stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, data will be processed with EMR Spark.
From time to time I have to handle historical bulk ingest into Kinesis Streams. The issue is that my Spark logic depends heavily on data partitioning and the order of event handling. But Firehose supports partitioning only by ingestion time (into the Kinesis stream), not by any other custom field (I need event_time).
For example, under Firehose's partition 2018/12/05/12/some-file.gz I can find data spanning the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition the data from the Kinesis stream with the help of a custom Lambda. But this looks more complex and error-prone to me, maybe because I'm not very familiar with AWS Lambda. Moreover, I'm not sure how well it will perform on a bulk load. This article says the Lambda option is much more expensive than Firehose delivery.
Load the data with Firehose, then launch a Spark EMR job to copy the data to another bucket with the right partitioning. At least that sounds simpler to me (I'm biased, as I'm just starting with AWS Lambda). But it has the drawback of a double copy and an additional Spark job.
In one hour I could have up to 1M rows that take up to 40 MB (compressed). From Using AWS Lambda with Amazon Kinesis I know that the Kinesis-to-Lambda event source has a limit of 10,000 records per batch. Would it be effective to process such a volume of data with Lambda?
While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema, but is partitioned by event_time.
Now, you can make use of Athena's INSERT INTO capabilities to repartition the data without needing to write a Hadoop or Spark job, and you get Athena's serverless scaling for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company and go into more detail on the trade-offs of using EMR or a streaming solution in the post linked below, but the upshot is that you don't need to introduce any more systems like Lambda or EMR; a minimal sketch of the repartitioning query follows the link.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena
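As a rough sketch of what the repartitioning step could look like (the table names follow the example above, while the database, query output location, and partition filter column are placeholders that would need to match your actual table definitions), the INSERT INTO can be kicked off on a schedule via the Athena API:

```python
import boto3

athena = boto3.client("athena")

# Placeholders for this sketch.
DATABASE = "analytics"
OUTPUT_LOCATION = "s3://athena-query-results-bucket/repartition/"

# Assumes the source columns line up with the destination schema,
# including the event_time column the destination is partitioned by;
# "ingest_dt" is a hypothetical ingestion-date partition column.
REPARTITION_SQL = """
INSERT INTO data_by_event_time
SELECT *
FROM data_by_ingestion_time
WHERE ingest_dt = '2018/12/05'
"""

def repartition():
    """Start the Athena query that rewrites one ingestion partition
    into the event_time-partitioned table."""
    response = athena.start_query_execution(
        QueryString=REPARTITION_SQL,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    return response["QueryExecutionId"]
```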
You may use the Kinesis stream and create the partitions the way you want:
you create a producer, and in your consumer you create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/

Cannot Archive Data from AWS Kinesis to Glacier

I am working on a data processing application hosted as a web service on an EC2 instance; each second a small data file (less than 10 KB) in .csv format is generated.
Problem Statement:
Archive all the data files generated to Amazon Glacier.
My Approach:
As the data files are very small, I store them in AWS Kinesis, and after a few hours I flush the data to S3 (because I cannot find a direct way to put data from Kinesis into Glacier). Using S3 lifecycle management, at the end of the day I archive all the objects to Glacier.
My Questions:
Is there a way to transfer data to Glacier directly from Kinesis?
Is it possible to configure Kinesis to flush data to S3/Glacier at the end of the day? Is there any time or memory limit up to which Kinesis can hold data?
If Kinesis cannot transfer data to Glacier directly, is there a workaround, e.g. can I write a Lambda function that fetches data from Kinesis and archives it to Glacier?
Is it possible to merge all the .csv files at the Kinesis, S3, or Glacier level?
Is Kinesis suitable for my use case? Is there anything else I could use?
I would be grateful if someone could take the time to answer my questions and point me to some references. Please let me know if there is a flaw in my approach or if there is a better way to do this.
Thanks.
You can't directly put data from Kinesis into Glacier (unless you want to put the 10 KB files directly into Glacier).
You could consider Kinesis Data Firehose as a way of flushing 15-minute increments of data to S3.
You can definitely do that. Glacier allows direct uploads, so there's no need to upload to S3 first.
You could use Firehose to flush to S3, then transform and aggregate using Athena, and then transition the resulting file to Glacier. Or you could use Lambda directly and upload straight to Glacier.
Perhaps streaming the data into Firehose would make more sense. Depending on your exact needs, IoT Analytics might also be interesting.
Reading your question again and seeing that you use CSV files, I would highly recommend the Kinesis > S3 > Athena > transition-to-Glacier approach; a sketch of the lifecycle transition is below.
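For the last step, a minimal boto3 sketch of an S3 lifecycle rule that transitions the aggregated objects to Glacier (the bucket name, prefix, and one-day delay are placeholder assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix; the rule moves the aggregated objects
# written back to S3 into Glacier after one day.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-aggregated-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-daily-aggregates",
                "Filter": {"Prefix": "aggregated/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```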

Is there a way to put data into Kinesis Firehose from S3 bucket?

I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute), and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose, because Firehose would simply batch it up, store it in temporary S3 files and then load it into Redshift anyway.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, or every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a loosely coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose). A rough sketch of such a Lambda function is below.
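A rough sketch of the periodic Lambda, using placeholder queue, bucket, role, cluster, and table names, and using the Redshift Data API to issue the COPY (a direct database connection would work just as well):

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

# All names and ARNs below are placeholders for this sketch.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files-queue"
MANIFEST_BUCKET = "my-manifest-bucket"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"

def handler(event, context):
    """Periodic Lambda: drain SQS, build a COPY manifest, and trigger the Redshift load."""
    entries, handles = [], []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])
            for rec in body.get("Records", []):  # S3 event notification format
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
            handles.append(msg["ReceiptHandle"])

    if not entries:
        return

    # Write the manifest file that COPY will read.
    manifest_key = f"manifests/{uuid.uuid4()}.json"
    s3.put_object(
        Bucket=MANIFEST_BUCKET,
        Key=manifest_key,
        Body=json.dumps({"entries": entries}).encode("utf-8"),
    )

    # Kick off the load via the Redshift Data API; the table name and
    # format clause are placeholders and should match your actual data.
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="mydb",
        DbUser="loader",
        Sql=(
            f"COPY my_table FROM 's3://{MANIFEST_BUCKET}/{manifest_key}' "
            f"IAM_ROLE '{COPY_ROLE_ARN}' MANIFEST FORMAT AS JSON 'auto'"
        ),
    )

    # Delete the SQS messages only after the load has been submitted.
    for handle in handles:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
```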
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can also send data to other services such as Redshift and the Elasticsearch Service.
I don't know whether this will solve your problem, but you can use COPY to load the data from S3 into Redshift.
Hope it will help!

DynamoDB Streams to S3

I am using Data Pipeline (DP) for daily backups of DynamoDB; however, I would like to do incremental backups of the data that is missed between DP runs. To accomplish that, I would like to use DynamoDB Streams + Lambda + S3 to bring real-time DynamoDB updates to S3. I understand how DynamoDB Streams work; however, I am struggling with creating a Lambda function that writes to S3 and, say, rolls a file every hour.
Has anyone tried it?
It's an hour's job, dude. What you need to do is:
Enable the DynamoDB update stream and attach the AWS-provided Lambda function:
https://github.com/awslabs/lambda-streams-to-firehose
Enable a Firehose delivery stream and use the above function to stream the records out to Firehose (a simplified sketch of such a forwarding function is included after the steps).
Configure Firehose to dump the records to S3.
done.
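The linked awslabs function handles batching, retries, and transformation; as a much-simplified illustration of the idea, here is a sketch of a Lambda that forwards DynamoDB stream records to a Firehose delivery stream (the stream name is a placeholder, and Firehose's buffering hints control how often files are rolled into S3):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Placeholder delivery stream name; the Firehose stream is configured to
# buffer and dump records to S3 (rolling files by its buffering hints).
DELIVERY_STREAM = "dynamodb-updates-to-s3"

def handler(event, context):
    """Triggered by the DynamoDB stream: forward each change record to Firehose."""
    records = []
    for rec in event["Records"]:
        # NewImage is only present for INSERT/MODIFY events (and requires the
        # stream view type to include new images).
        image = rec["dynamodb"].get("NewImage")
        if image:
            records.append({"Data": (json.dumps(image) + "\n").encode("utf-8")})

    if records:
        # put_record_batch accepts up to 500 records per call.
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records,
        )
```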