Data Pipeline (DynamoDB to S3) - How to format S3 file? - amazon-web-services

I have a Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the S3 file for services like QuickSight, Athena and Forecast.
However, for my S3 file to work with these services, I need the file to be formatted in a csv like so:
date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047
But instead, my exported file looks like this:
{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}
How can I specify the format for my exported file in S3 so I can operate with services like QuickSight, Athena and Forecast? I'd preferably do the data transformation using Data Pipeline as well.

Athena can read JSON data.
You can also use DynamoDB streams to stream the data to S3. Here is a link to a blog post with best practice and design patterns for streaming data from DynamoDB to S3 to be used with Athena.
You can use DynamoDB streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction.
Alternatively you could use Amazon Data Pipeline to write the data to an S3 bucket as you currently have it. Then use a cloud watch event scheduled to run a lambda function, or an S3 event notification to run a lambda function. The lambda function can transform the file and store it in another S3 bucket for further processing.

Related

Is it possible to use bigquery in aws lambda?

I want to write a python function that send data to Bigquery every time a put event occurs in my S3 bucket but I'm new in AWS is it possible to integrate bigquery with a lambda function? or can someone give me another way to stream my dynamodb data to bigquery? Thank you my language is python
N.B: I used dynamostream firehose to send my data in S3 now I want to retrieve my data from s3 every time a put event occur to send it into bigquery.
There are already plenty of resources online about "how to trigger a lambda after a put object on a S3".
But here are a few links to get you set up:
You will need to set up an EventBridge (CloudWatch Events are legacy) to trigger your Lambda when some action happens on your S3 bucket:
https://aws.amazon.com/fr/blogs/compute/using-dynamic-amazon-s3-event-handling-with-amazon-eventbridge/
You can use the boto3 Python framework to write AWS Lambdas:
https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
You can check the BigQuery Python SDK by GCP to communicate with your BQ database: https://googleapis.dev/python/bigquery/latest/index.html

Incremental exports from Amazon DynamoDB to Amazon S3

We need to run an analysis of the data in Amazon DynamoDB. Since doing it in the DDB isn't an option due to DDB's limitations with analysis, based on the recommendations I am leaning towards DDB -?> S3 -> Athena.
It is a data-heavy application with data streaming from AWS IoT devices and is also a multi-tenant application. Now, to sync data from DDB to Amazon S3, it will be probably a couple of times a day. How do we set up incremental exports for this purpose?
There is an Athena connector to be able to query your data in DynamoDB table directly using SQL query.
https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-dynamodb.html
https://dev.to/jdonboch/finally-dynamodb-support-in-aws-quicksight-sort-of-2lbl
Another solution for this use case is you can write an AWS Step Functions workflow that when invoked, can read data from an Amazon DynamoDB table and then format the data to the way you want it and place the data into an Amazon S3 bucket (an example that shows a similar use case will be available soon):
This is the reverse (here the source is an Amazon S3 bucket and the target is an Amazon DynamoDB table) but you can build the Workflow so the target is an Amazon S3 bucket. Because it's a workflow, you can use a Lambda function that is scheduled to fire a few times a day based on a CRON expression. The job of this Lambda function is to invoke the workflow using the Step Functions API.

Writing S3 bucket csv to redshift using kinesis

I have 3 types of csv in my s3 bucket and want to flow them into respective redshift tables based on csv prefix . I am thinking to use Kinesis to stream data to redshift as file in s3 will be dropped every 5 min. I am all new to aws and not sure how to achieve this.
I have gone through aws documentation but not sure how to achieve this

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take sql data stored in a csv file in an s3 bucket and transfer the data to AWS Redshift and automate that process. Would writing etl scripts with lambda/glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from s3 to Redshift.
Tried using AWS Pipeline but that is not available in my region. I also tried to use the AWS documentation for Lambda and Glue but don't know where to find the exact solution to the problem
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (eg psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simply Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Is there a way to put data into Kinesis Firehose from S3 bucket?

I am want to write streaming data from S3 bucket into Redshift through Firehose as the data is streaming in real time (600 files every minute) and I dont want any form of data loss.
How to put data from S3 into Kinesis Firehose?
It appears that your situation is:
Files randomly appear in S3 from an SFTP server
You would like to load the data into Redshift
There's two basic ways you could do this:
Load the data directly from Amazon S3 into Amazon Redshift, or
Send the data through Amazon Kinesis Firehose
Frankly, there's little benefit in sending it via Kinesis Firehose because Kinesis will simply batch it up, store it into temporary S3 files and then load it into Redshift. Therefore, this would not be a beneficial approach.
Instead, I would recommend:
Configure an event on the Amazon S3 bucket to send a message to an Amazon SQS queue whenever a file is created
Configure Amazon CloudWatch Events to trigger an AWS Lambda function periodically (eg every hour, or 15 minutes, or whatever meets your business need)
The AWS Lambda function reads the messages from SQS and constructs a manifest file, then triggers Redshift to import the files listed in the manifest file
This is a simple, loosely-coupled solution that will be much simpler than the Firehose approach (which would require somehow reading each file and sending the contents to Firehose).
Its actually designed to do the opposite, Firehose sends incoming streaming data to Amazon S3 not from Amazon S3, and other than S3 it can send data to other services like Redshift and Elasticsearch Service.
I don't know whether this will solve your problem but you can use COPY from S3 to redshift.
Hope it will help!