Scrapy: write output into Amazon Kinesis Data Firehose - amazon-web-services

Instead of exporting my output (which is a .json file) to an S3 bucket, I would like to export it to Amazon Kinesis Data Firehose.
Is it possible to do this?
Where should I write the functions to handle this? I'm planning to use boto3.
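One possible approach (not from the thread itself) is to put the boto3 calls in a Scrapy item pipeline that forwards each scraped item to Firehose. A minimal sketch, assuming the delivery stream already exists; the stream name and project path are placeholders:

# pipelines.py: sketch of a Scrapy item pipeline that sends scraped items
# to a Kinesis Data Firehose delivery stream instead of a feed export.
import json

import boto3
from itemadapter import ItemAdapter


class FirehosePipeline:
    def open_spider(self, spider):
        # One client per crawl; region and credentials come from the usual boto3 config.
        self.firehose = boto3.client("firehose")

    def process_item(self, item, spider):
        # Firehose records are raw bytes; the trailing newline keeps the
        # objects Firehose writes downstream as line-delimited JSON.
        data = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.firehose.put_record(
            DeliveryStreamName="my-delivery-stream",  # placeholder stream name
            Record={"Data": data.encode("utf-8")},
        )
        return item

The pipeline would then be enabled in settings.py, e.g. ITEM_PIPELINES = {"myproject.pipelines.FirehosePipeline": 300}.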

Related

Writing to S3 via Kinesis Stream or Firehose

I have events that keep coming in which I need to put to S3. I am trying to evaluate whether I must use Kinesis Data Streams or Firehose. I also want to wait a few minutes before writing to S3 so that each object is fairly full.
Based on my reading of Kinesis Data Streams, I would have to create an analytics app which would then be used to invoke a Lambda, and then use the Lambda to write to S3. Or can Kinesis Data Streams write directly to Lambda somehow? I could not find anything indicating that.
Firehose is not charged by the hour (while Data Streams is). So is Firehose a better option for me?
Or can Kinesis Data Streams write directly to Lambda somehow?
Data Streams can't write directly to S3. Firehose, on the other hand, can do this:
delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, MongoDB, and New Relic.
What's more, Firehose allows you to buffer the records before writing them to S3. The writes can be triggered based on buffer size or time. In addition to that, you can process the records with a Lambda function before they are written to S3.
Thus, collectively it seems that Firehose is better suited to your use case than Data Streams.
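For reference, the buffering described above is configured on the delivery stream itself. A rough boto3 sketch of creating such a stream; the stream name, role ARN and bucket ARN are placeholders:

import boto3

firehose = boto3.client("firehose")

# Sketch: a DirectPut delivery stream that buffers records and flushes them to S3
# whenever 5 MB accumulate or 300 seconds pass, whichever comes first.
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-s3",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-events-bucket",  # placeholder
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    },
)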

Data Pipeline (DynamoDB to S3) - How to format S3 file?

I have a Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the S3 file for services like QuickSight, Athena and Forecast.
However, for my S3 file to work with these services, I need the file to be formatted as a CSV, like so:
date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047
But instead, my exported file looks like this:
{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}
How can I specify the format for my exported file in S3 so I can operate with services like QuickSight, Athena and Forecast? I'd preferably do the data transformation using Data Pipeline as well.
Athena can read JSON data.
You can also use DynamoDB Streams to stream the data to S3. Here is a link to a blog post with best practices and design patterns for streaming data from DynamoDB to S3 to be used with Athena.
You can use DynamoDB streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction.
Alternatively, you could use AWS Data Pipeline to write the data to an S3 bucket as you currently have it, then use a scheduled CloudWatch Events rule or an S3 event notification to run a Lambda function. The Lambda function can transform the file and store it in another S3 bucket for further processing.
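To illustrate that last option, a Lambda function triggered by an S3 event notification could rewrite the exported file into plain CSV roughly as sketched below; the output bucket name is a placeholder and the code assumes the line-per-item export format shown in the question:

import csv
import io
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # Triggered by an S3 event notification on the Data Pipeline export bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(["date", "journal", "id"])
        for line in body.splitlines():
            if not line.strip():
                continue
            item = json.loads(line)
            # The export wraps each value in a DynamoDB type marker, e.g. {"s": "..."}.
            writer.writerow([item["date"]["s"], item["journal"]["s"], item["id"]["s"]])

        # Store the CSV in another bucket for QuickSight/Athena/Forecast.
        s3.put_object(
            Bucket="my-transformed-bucket",  # placeholder
            Key=key.rsplit("/", 1)[-1] + ".csv",
            Body=out.getvalue().encode("utf-8"),
        )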

Writing S3 bucket csv to redshift using kinesis

I have 3 types of CSVs in my S3 bucket and want to flow them into their respective Redshift tables based on the CSV prefix. I am thinking of using Kinesis to stream data to Redshift, as a file will be dropped into S3 every 5 minutes. I am all new to AWS and not sure how to achieve this.
I have gone through the AWS documentation but am still not sure how to achieve this.
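The thread has no answer, but one common pattern (sketched below as an assumption, not taken from the thread) is an S3 event notification that invokes a Lambda function, which maps the key prefix to a table and runs a COPY through the Redshift Data API; the prefix-to-table mapping, cluster, database, user and IAM role are all placeholders:

import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical mapping from S3 key prefix to target Redshift table.
PREFIX_TO_TABLE = {"orders/": "orders", "customers/": "customers", "events/": "events"}


def handler(event, context):
    # Triggered by an S3 "object created" event for each dropped CSV.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        table = next(t for prefix, t in PREFIX_TO_TABLE.items() if key.startswith(prefix))
        redshift_data.execute_statement(
            ClusterIdentifier="my-cluster",  # placeholder
            Database="dev",                  # placeholder
            DbUser="awsuser",                # placeholder
            Sql=(
                f"COPY {table} FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "  # placeholder
                "FORMAT AS CSV IGNOREHEADER 1;"
            ),
        )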

Configure Firehose so it writes only one record per S3 object?

I'm using a Firehose delivery stream to write JSONs to S3. These JSONs represent calls. The stream will often receive a new version of a JSON that brings new info about the represented call.
I would like my Firehose to write each JSON record to a separate S3 object, not grouping them together as it seems to do by default. Each JSON would be written at an S3 key that identifies the call, so that when a new version of a JSON shows up, Firehose replaces the previous version in S3. Is this possible?
I see that I can configure the buffer size that triggers writing to S3, but can I explicitly configure my Firehose stream so that it writes exactly one S3 object per record?
There's no Redshift involved.
This is not possible with Amazon Kinesis Data Firehose. It is a simplified service that only has a few configuration options.
Instead, you could use Amazon Kinesis Data Streams:
Send data to the stream
Create an AWS Lambda function that will be triggered whenever data is received by the stream
Code the Lambda function to write the data to the appropriate Amazon S3 object
See: Using AWS Lambda with Amazon Kinesis - AWS Lambda
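A sketch of what that Lambda function could look like, assuming each record is a JSON call document with a call_id field to use as the S3 key (the bucket and field names are placeholders):

import base64
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    # Invoked with a batch of Kinesis records; write each record to its own S3 object.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        call = json.loads(payload)
        # Keying by call_id means a newer version of the same call overwrites
        # the previous object instead of being appended alongside it.
        s3.put_object(
            Bucket="my-calls-bucket",             # placeholder
            Key=f"calls/{call['call_id']}.json",  # placeholder field name
            Body=payload,
        )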

Is there a way to put data into Kinesis Firehose from S3 bucket?

I want to write streaming data from an S3 bucket into Redshift through Firehose, as the data is streaming in real time (600 files every minute) and I don't want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There are two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose because Kinesis will simply batch it up, store it in temporary S3 files and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (e.g. every hour, every 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file (see the sketch after this list)
This is a loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
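A rough sketch of that Lambda function, run on a schedule and using the Redshift Data API for the COPY; the queue URL, bucket, table, cluster and role names are placeholders, and the COPY options depend on the actual file format:

import json
import time

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"  # placeholder


def handler(event, context):
    # Drain the queue of "object created" notifications and collect the S3 URLs.
    entries = []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            notification = json.loads(msg["Body"])
            for rec in notification.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
            sqs.delete_message(QueueUrL := QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]) if False else sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if not entries:
        return

    # Write a manifest listing every new file, then load them all with a single COPY.
    manifest_key = f"manifests/{int(time.time())}.manifest"
    s3.put_object(
        Bucket="my-staging-bucket",  # placeholder
        Key=manifest_key,
        Body=json.dumps({"entries": entries}).encode("utf-8"),
    )
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="dev",                  # placeholder
        DbUser="awsuser",                # placeholder
        Sql=(
            f"COPY my_table FROM 's3://my-staging-bucket/{manifest_key}' "  # placeholder table
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "MANIFEST CSV;"
        ),
    )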
It's actually designed to do the opposite: Firehose sends incoming streaming data to Amazon S3, not from Amazon S3. Besides S3, it can send data to other services like Redshift and the Elasticsearch Service.
I don't know whether this will solve your problem, but you can use COPY to load data from S3 into Redshift.
Hope it helps!