Build s3 Datalake Using Dynamo DB data source

Build s3 Datalake Using Dynamo DB data source - amazon-web-services

i'am a data engineer using AWS, we want to build a data pipeline in order to visualise our Dynmaodb data on QuickSigth, as u know, it's not possible de connect directly dynamo to Quick...u have to pass by S3.
S3 Will be our datalake, the issue is that the date updates frequently (for exemple column named can change / costumer status can evolve..)
So i'am looking for a batch solution in order to always get the lastest data from dynamo on my s3 datalake and visualise it in quicksigth.
Thank u

You can access your tables at DynamoDB, in the console, and export data to S3 under the Streams and Exports tab. This blog post from AWS explains just what you need.
You could also try this approach with Athena instead of S3.

Related

How do I import a DynamoDB table exported to S3 as a new table?

I don't want to use data pipeline because it is too cumbersome. I also have a relatively small table so it would be heavy handed to use data pipeline for it- I could run a script locally to do the import because it's so small.
I used the fully managed Export to S3 feature to export a table to a bucket (in a different account): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.html
What are my options now for importing that to a new table in the other account?
If there isn't a managed feature for this, does AWS provide a canned script I can point at an S3 folder and give the name of the new table I want to create from it?

Amazon DyanamoDB now supports importing data from S3 buckets to new DynamoDB tables from this blog post.
The steps for importing data from S3 buckets can be found in their developer guide.

As of 18 August 2022, this feature is now built into DynamoDB and you need no other services or code.

Another AWS-blessed option is a cross-account DynamoDB table replication that uses Glue in the target account to import the S3 extract and Dynamo Streams for ongoing replication.

You may want to create a AWS Data Pipeline which already has recommended template for importing DynamoDB data from S3:
This is the closest you can get to a "managed feature" where you select the S3 prefix and the DynamoDB table.

Visualize DynamoDB data in AWS Quicksight

I am looking for an AWS-centric solution (avoiding 3rd party stuff if possible) for visualizing data that is in a very simple DynamoDB table.
We use AWS Quicksight for many other reports and dashboards for our clients so that is goal to have visualizations made available there.
I was very surprised to see that DynamoDB was not a supported source for Quicksight although many other things are like S3, Athena, Redshift, RDS, etc.
Does anyone have any experience for creating a solution for this?
I am thinking that I will just create a job that will dump the DynamoDB table to S3 every so often and then use the S3 or Athena integrations with Quicksight to read/display it. It would be nice to have a simple solution for more live data.

!!UPDATE!!
As of 2021, we can finally get Athena Data connectors to expose DynamoDB data in Quicksight without any custom scripts or duplicate data.
That being said, I would like the caveat this by saying just because it can be done, you may need to ask yourself if this is really a good solution for your workload. DynamoDB isn't the best for data warehousing use cases and performing large scans on tables can end up being slow/costly. If your dataset is very large and this is a real production use case, it would probably be best to still go with an ETL workflow and move the DynamoDB data to a more appropriate data store.
But.. if you are still interested in seeing DynamoDB data live QuickSight without any additional ETL processes to move/transform the data: I wrote a detailed blog post with step by step instructions but in general, here is the process:
Ensure you have an Athena Workgroup that uses the new Athena Engine version 2 and if not, create one
In Athena under data sources, create a new data source and select "Query a data source" and then "Amazon DynamoDB"
On the next part of the wizard, click the "Configure new AWS Lambda function" to deploy the prebuilt AthenaDynamoDBConnector.
Once the AthenaDynamoDBConnector is deployed, select the name of the function you deployed in the Data Source creation wizard in Athena, give your DynamoDB data a catalog name like "dynamodb" and click "Connect"
You now should be able to query DynamoDB data in Athena but there are a few more steps to get things working in QuickSight.
Go to the IAM console and find the QuickSight service role (i.e. aws-quicksight-service-role-v0).
Attach the AWS Managed "AWSLambdaRole" policy to the QuickSight role since QuickSight now needs the permissions to invoke your data connector.
Go to the QuickSight console and add a new Athena data source that uses the version 2 engine that you created in Step 1
You should now be able to create a data set with that Athena Engine version 2 workgroup data source and choose the Athena catalog name you gave the DynamoDB connector in Step 4.
Bingo bango, you should now be able to directly query or cache DynamoDB data in Quicksight without needing to create custom code or jobs that duplicate your data to another data source.
As of March 2020, Amazon is making available a beta feature called Athena DynamoDB Connector.
Unfortunately, it's only beta/preview and you can get it setup in Athena but I don't see a way to use these new Athena catalogs in Quicksight.
Hopefully once this feature is GA, it can be easily imported into Quicksight and I can update the answer with the good news.
Instructions on getting up a DynamoDB connector
There are many new data sources that AWS is making available in beta for autmoting the connections to Athena.
You can set these up via the console by:
Navigate to the "Data Sources" menu in the AWS Athena console.
Click the "Configure Data Source" button
Choose "Query a data source" radio button
Select "Amazon DynamoDB" option that appears
Click the "Configure new function" option
You'll need to specify a bucket to help put "spilled" data into and provide a name for the new DyanmoDB catalog.
Once the app is deployed from Step 5, select the Lambda name (the name of the catalog you entered in Step 5) in the Athena data source form from Step 4 and also provide that same catalog name.
Create the data connector
Now you can go to the Athena query editor, select the catalog you just created and see a list of all DyanmoDB tables for your region, under the default Athena database in the new catalog, that you can now query as part of Athena.

We want DynamoDB support in Quicksight!
The simplest way I could find is below:
1 - Create a Glue Crawler which takes DynamoDB table as a Data Source and writes documents to a Glue Table. (Let's say Table X)
2 - Create a Glue Job which takes 'Table X' as a data source and writes them into a S3 Bucket in parquet format. (Let's say s3://table-x-parquets)
3 - Create a Glue Crawler which takes 's3://table-x-parquets' as data source and creates a new Glue Table from it. (Let's say Table Y)
Now you can execute Athena queries in Table Y and also you can use it as Data Set in Quicksight.

I'd also like to see a native integration between DynamoDB and QuickSight, so I will be watching this thread as well.
But there is at least 1 option that's closer to what you want. You could enable Streams on your DynamoDB table and then set up a trigger to trigger a Lambda function when changes are made to DynamoDB.
Then you could only take action on specific DynamoDB events if you like ('Modify', 'Insert', 'Delete') and then dump the new/modified record to S3. That would be pretty close to real-time data, as it would trigger immediately upon update.
I did something similar in the past but instead of dumping data to S3 I was updating another DynamoDB table. It would be pretty simple to switch the example to S3 instead. See below.
const AWS = require('aws-sdk');
exports.handler = async (event, context, callback) => {
console.log("Event:", event);
const dynamo = new AWS.DynamoDB();
const customerResponse = await dynamo.scan({
TableName: 'Customers',
ProjectionExpression: 'CustomerId'
}).promise().catch(err => console.log(err));
console.log(customerResponse);
let customers = customerResponse.Items.map(item => item.CustomerId.S);
console.log(customers);
for(let i = 0; i < event.Records.length; i++)
{
if(event.Records[i].eventName === 'INSERT')
{
if(event.Records[i].dynamodb.NewImage)
{
console.log(event.Records[i].dynamodb.NewImage);
for(let j = 0; j < customers.length; j++)
{
await dynamo.putItem({
Item: {
...event.Records[i].dynamodb.NewImage,
CustomerId: { S: customers[j] }
},
TableName: 'Rules'
}).promise().catch(err => console.log(err));
}
}
}
}
}

Possible solutions are explained in other answers. Just wanted to discuss another point.
BI tools such as QuickSight are designed to be usually used on top of analytical data stores such as Redshift, S3 etc. DynamoDB is not a very suitable data storage for analytics purposes. Row by row operations such as "put" or "get" are very efficient. But bulk operations such as "scan" are expensive. If you are constantly doing scans during the day, your DynamoDB costs might grow fast.
A possible way is to cache the data in SPICE (QuickSight's in memory cache). But a better way is to unload the data into a better suited storage such as S3 or RedShift. Couple of solutions are given on other answers.

Would love to see DynamoDB integration with Quicksight. Using DynamoDB streams to dump to S3 doesn't work because DynamoDB streams send out events instead of updating records. Hence if you read from this S3 bucket you'll have two instances of the same item: one before update and one after update.
One solution that I see now is to dump data from DynamoDB to a S3 bucket periodically using data pipeline and use Athena and Quicksight on this s3 bucket.
Second solution is to use dynamo db stream to send data to elastic search using lambda function. Elastic search has a plug in called Kibana which has pretty cool visualizations. Obviously this is going to increase your cost because now you are storing your data in two places.
Also make sure that you transform your data such that each Elastic Search document has the most granular data according to your needs. As kibana visualizations will aggregate everything in one document.

AWS Glue ETL : transfer data to S3 Bucket

I wish to transfer data in a database like MySQL[RDS] to S3 using AWS Glue ETL.
I am having difficulty trying to do this the documentation is really not good.
I found this link here on stackoverflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
SO based on this link, it seems that Glue does not have an S3 bucket as a data Destination, it may have it as a data Source.
SO, i hope i am wrong on this.
BUT if one makes an ETL tool, one of the first basics on AWS is for it to tranfer data to and from an S3 bucket, the major form of storage on AWS.
So hope someone can help on this.

You can add a Glue connection to your RDS instance and then use the Spark ETL script to write the data to S3.
You'll have to first crawl the database table using Glue Crawler. This will create a table in the Data Catalog which can be used in the job to transfer the data to S3. If you do not wish to perform any transformation, you may directly use the UI steps for autogenerated ETL scripts.
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue

Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well incrementation copy to s3 from RDS.

Amazon S3 to Amazon Athena to Tableau

I am working on a project to get data from an Amazon S3 bucket into Tableau.
The data needs to reorganised and combined from multiple .CSV files. Is Amazon Athena capable of connecting from the S3 to Tableau directly and is it relatively easy/cheap? Or should I instead look at another software package to achieve this?
I am looking to visualise the data and provide a forecast based on observed trend (may need to incorporate functions to generate data to fit linear regression).

It appears that Tableau can query data from Amazon Athena.
See: Connect to your S3 data with the Amazon Athena connector in Tableau 10.3 | Tableau Software
Amazon Athena can query multiple CSV files in a given path (directory) and run SQL against the data. So, it sounds like this is a feasible solution for you.

Yes, you can integrate Athena with Tableau to query your data in S3. There are plenty resource online that describe how to do that, e.g. link 1, link 2, link 3. But obviously, tables that define meta information of your data have to be defined before hand.
Amazon Athena pricing is based on on the amount of data scanned by each query, i.e. 5$ per 1TB of data scanned. So it all comes down how much data you have and how it is structured, i.e. partitioning, bucketing file format etc. Here is a nice blog post that covers these aspects.
While you prototype a dashboard there is one thing to keep in mind. By deafult, each time you would change list of parameters, filters etc, Tableau would automatically send a request to AWS Athena to execute your query. Luckily, you can disable auto querying of the data source and do it manually.

Athena can't resolve CSV files from AWS DMS

I've DMS configured to continuously replicate data from MySQL RDS to S3. This creates two type of CSV files: a full load and change data capture (CDC). According to my tests, I have the following files:
testdb/addresses/LOAD001.csv.gz
testdb/addresses/20180405_205807186_csv.gz
After DMS is running properly, I trigger a AWS Glue Crawler to build the Data Catalog for the S3 Bucket that contains the MySQL Replication files, so the Athena users will be able to build queries in our S3 based Data Lake.
Unfortunately the crawlers are not building the correct table schema for the tables stored in S3.
For the example above It creates two tables for Athena:
addresses
20180405_205807186_csv_gz
The file 20180405_205807186_csv.gz contains a one line update, but the crawler is not capable of merging the two informations (taking the first load from LOAD001.csv.gz and making the updpate described in 20180405_205807186_csv.gz).
I also tried to create the table in the Athena console, as described in this blog post:https://aws.amazon.com/pt/blogs/database/using-aws-database-migration-service-and-amazon-athena-to-replicate-and-run-ad-hoc-queries-on-a-sql-server-database/.
But it does not yield the desired output.
From the blog post:
When you query data using Amazon Athena (later in this post), you
simply point the folder location to Athena, and the query results
include existing and new data inserts by combining data from both
files.
Am I missing something?

The AWS Glue crawler is not able to reconcile the different schemas in the initial LOAD csvs and incremental CDC csvs for each table. This blog post from AWS and its associated cloudformation templates demonstrate how to use AWS Glue jobs to process and combine these two type of DMS target outputs.

Athena will combine the files in am S3 if they are the same structure. The blog speaks to only inserts of new data in the cdc files. You'll have to build a process to merge the CDC files. Not what you wanted to hear, I'm sure.
From the blog post:
"When you query data using Amazon Athena (later in this post), due to the way AWS DMS adds a column indicating inserts, deletes and updates to the new file created as part of CDC replication, we will not be able to run the Athena query by combining data from both files (initial load and CDC files)."

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js