I need to upgrade my application to support streaming data. My application has different kinds of data that are stored in different MySQL tables.
So, I want to create an AWS Kinesis Firehose delivery stream and an AWS Lambda function to receive, transform, and load my data into S3 as CSV.
All the information I have found by googling explains very well how to implement this, but only for storing the data in a single CSV file. I assume that with only one CSV file, Athena will interpret it as a single table.
I have not found any information on creating and storing multiple CSV files (which would represent tables in Athena) using Kinesis Firehose and an AWS Lambda function.
Should I create a new Kinesis Firehose delivery stream for each table I have in my MySQL database, or is there some way to store this data in different CSV files?
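For context, a Firehose data-transformation Lambda receives base64-encoded records and must return each one with a recordId, a result, and the re-encoded data. A minimal sketch of the JSON-to-CSV transform described above, assuming hypothetical field names (id, name, created_at):

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation Lambda: JSON record in, CSV line out."""
    output = []
    for record in event['records']:
        payload = json.loads(base64.b64decode(record['data']))
        # Hypothetical column names; replace with the fields of your MySQL rows.
        csv_line = ','.join(str(payload.get(col, '')) for col in ('id', 'name', 'created_at')) + '\n'
        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(csv_line.encode('utf-8')).decode('utf-8'),
        })
    return {'records': output}
```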
Related
I've found a tutorial here on how to get item-level changes from DynamoDB into S3 via Kinesis Firehose,
but how do I get these into a Redshift table? If an item is updated, a new record is created for it and posted to S3, so is there a tutorial or guidance on how to take these item-level changes and read them into a table?
Kinesis Firehose has multiple destinations that you can choose from. S3 is only one of them, and Redshift is another.
You can use a configuration like the following to set up Redshift as the destination.
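As a rough boto3 sketch (all ARNs, names, and credentials below are placeholders), you can create a delivery stream with a Redshift destination; Firehose stages the records in an intermediate S3 bucket and then issues the COPY command you configure into the Redshift table:

```python
import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='my-redshift-stream',
    RedshiftDestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
        'ClusterJDBCURL': 'jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb',
        'CopyCommand': {
            'DataTableName': 'public.events',
            'CopyOptions': "FORMAT AS JSON 'auto'",
        },
        'Username': 'firehose_user',
        'Password': 'REPLACE_ME',
        # Firehose stages the data here before issuing the COPY into Redshift.
        'S3Configuration': {
            'RoleARN': 'arn:aws:iam::123456789012:role/firehose-delivery-role',
            'BucketARN': 'arn:aws:s3:::my-staging-bucket',
            'Prefix': 'staging/',
        },
    },
)
```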
I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some columns, and persist the data in a DynamoDB table. While persisting each row in the DynamoDB table, I need to generate an ID that depends on the data in that row and store it in the DynamoDB table too. AWS Data Pipeline seems to let you create a job to import S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file, or to generate the ID mentioned above.
Is there any way I can achieve this requirement using AWS Data Pipeline? If not, what would be the best approach using AWS services?
We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).
We also use AWS Data Pipeline for this process.
We first trigger an EMR cluster from Data Pipeline, fetch the data from S3, transform it, and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline that triggers the EMR cluster once a day to perform the task.
This approach can incur additional costs too.
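To illustrate the kind of per-row logic you can run in that cluster (or in a Lambda), here is a rough Python/boto3 sketch with placeholder table, key, and column names: it encrypts selected columns with KMS and derives an ID before writing each row to DynamoDB.

```python
import base64
import csv
import hashlib
import io

import boto3

s3 = boto3.client('s3')
kms = boto3.client('kms')
table = boto3.resource('dynamodb').Table('target-table')  # placeholder table name

KMS_KEY_ID = 'alias/my-data-key'       # placeholder key
SENSITIVE_COLUMNS = {'ssn', 'email'}   # placeholder column names


def encrypt_value(value: str) -> str:
    """Encrypt a single cell with KMS and return it base64-encoded."""
    resp = kms.encrypt(KeyId=KMS_KEY_ID, Plaintext=value.encode('utf-8'))
    return base64.b64encode(resp['CiphertextBlob']).decode('utf-8')


def process_file(bucket: str, key: str) -> None:
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    for row in csv.DictReader(io.StringIO(body)):
        item = {
            col: encrypt_value(val) if col in SENSITIVE_COLUMNS else val
            for col, val in row.items()
        }
        # Derive a deterministic id from the row contents (example rule only).
        item['id'] = hashlib.sha256('|'.join(row.values()).encode('utf-8')).hexdigest()
        table.put_item(Item=item)
```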
This article - https://aws.amazon.com/blogs/database/how-to-perform-advanced-analytics-and-build-visualizations-of-your-amazon-dynamodb-data-by-using-amazon-athena/ - and, similarly, this article - https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/ - both copy the DynamoDB data somewhere else before querying it with Athena.
Why not use Athena to query DynamoDB directly?
First of all, Athena cannot query DynamoDB directly.
In order to do so, you need to make the data available in another location that AWS Glue can identify as a valid data source;
The most common are S3 and Kinesis (for performance and cost reasons), but there are other options, such as:
JDBC
Amazon RDS
MongoDB
Amazon DocumentDB
Kafka
(other options will be displayed according to the method you choose to map the data)
For DynamoDB, you must extract the data from the desired table before it can be used, or, as in the first example, use real-time streams.
Explaining each scenario:
First scenario: DynamoDB Streams is connected directly to Kinesis Firehose, which makes the data emitted in real time by the DynamoDB stream available in S3. This way Athena can use S3 as the source for the data.
Second scenario: a Glue crawler maps the data schema from DynamoDB and creates a table in your Data Catalog containing the schema map of the object properties. To extract the data itself, a Glue job points at that properties-map table and extracts the data to S3, creating another table in your Data Catalog, this time pointing to S3 and making it available for Athena to query.
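As a rough sketch of the second scenario's extraction step, a Glue job could read through the Data Catalog table that the crawler created for the DynamoDB table and write the rows to S3 as CSV (the database, table, and bucket names below are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read through the Data Catalog table the crawler created for the DynamoDB table.
source = glue_context.create_dynamic_frame.from_catalog(
    database='my_catalog_db',        # placeholder database
    table_name='my_dynamodb_table',  # placeholder catalog table
)

# Write the extracted rows to S3 as CSV so Athena can query them.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type='s3',
    connection_options={'path': 's3://my-analytics-bucket/dynamodb-export/'},
    format='csv',
)

job.commit()
```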
The DynamoDB data structure and storage are not optimized for the relational queries Athena expects; you can read more about this in the DynamoDB docs.
In AWS Glue jobs, there are two approaches to retrieving data from a database or S3: 1) using a crawler, or 2) using a direct connection to the database or S3.
So my question is: how is a crawler better than connecting directly to a database and retrieving the data?
AWS Glue crawlers do not retrieve the actual data. A crawler accesses your data stores, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically, which detects the arrival of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue crawlers help in building this metadata repository.
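For illustration, a crawler with a periodic schedule can be created with boto3 roughly like this (the names, role, and path are placeholders):

```python
import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='s3-sales-crawler',
    Role='arn:aws:iam::123456789012:role/glue-crawler-role',
    DatabaseName='analytics_catalog',
    Targets={'S3Targets': [{'Path': 's3://my-data-lake/sales/'}]},
    # Run every night at 02:00 UTC to pick up new data and schema changes.
    Schedule='cron(0 2 * * ? *)',
)

# The crawler can also be started on demand.
glue.start_crawler(Name='s3-sales-crawler')
```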
I have a Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the S3 file for services like QuickSight, Athena and Forecast.
However, for my S3 file to work with these services, I need the file to be formatted as a CSV like so:
date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047
But instead, my exported file looks like this:
{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}
How can I specify the format of the exported file in S3 so that it works with services like QuickSight, Athena, and Forecast? I would prefer to do the data transformation with Data Pipeline as well.
Athena can read JSON data.
You can also use DynamoDB Streams to stream the data to S3. Here is a link to a blog post with best practices and design patterns for streaming data from DynamoDB to S3 for use with Athena.
You can use DynamoDB Streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift, etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction.
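A minimal sketch of such a stream-triggered Lambda, assuming the attribute names from your example and a placeholder output bucket; note that DynamoDB Streams wraps every attribute in a type descriptor such as {"S": "value"}:

```python
import boto3

s3 = boto3.client('s3')
BUCKET = 'my-transformed-bucket'  # placeholder bucket


def lambda_handler(event, context):
    lines = []
    for record in event['Records']:
        if record['eventName'] == 'REMOVE':
            continue
        image = record['dynamodb']['NewImage']
        # Attribute names taken from the question's example; unwrap the "S" type descriptor.
        lines.append(','.join([
            image['date']['S'],
            image['journal']['S'],
            image['id']['S'],
        ]))
    if lines:
        key = f'changes/{context.aws_request_id}.csv'
        s3.put_object(Bucket=BUCKET, Key=key, Body='\n'.join(lines).encode('utf-8'))
```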
Alternatively, you could use AWS Data Pipeline to write the data to an S3 bucket as you currently do, and then use a scheduled CloudWatch Events rule or an S3 event notification to run a Lambda function. The Lambda function can transform the file and store it in another S3 bucket for further processing.
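For example, an S3-event-triggered Lambda could unwrap the DynamoDB-JSON lines produced by the export and write a plain CSV to another bucket (the output bucket name and column list below are assumptions based on your example):

```python
import csv
import io
import json
import urllib.parse

import boto3

s3 = boto3.client('s3')
OUTPUT_BUCKET = 'my-csv-bucket'       # placeholder bucket
COLUMNS = ['date', 'journal', 'id']   # columns from the question's example


def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(COLUMNS)
        for line in body.splitlines():
            if not line.strip():
                continue
            item = json.loads(line)
            # Each attribute is wrapped in a type descriptor such as {"s": "value"}.
            writer.writerow([next(iter(item[col].values())) for col in COLUMNS])

        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key + '.csv',
                      Body=out.getvalue().encode('utf-8'))
```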