I am looking for an AWS-centric solution (avoiding 3rd party stuff if possible) for visualizing data that is in a very simple DynamoDB table.
We use AWS Quicksight for many other reports and dashboards for our clients, so the goal is to make these visualizations available there as well.
I was very surprised to see that DynamoDB was not a supported source for Quicksight although many other things are like S3, Athena, Redshift, RDS, etc.
Does anyone have any experience creating a solution for this?
I am thinking that I will just create a job that will dump the DynamoDB table to S3 every so often and then use the S3 or Athena integrations with Quicksight to read/display it. It would be nice to have a simple solution for more live data.
!!UPDATE!!
As of 2021, we can finally get Athena Data connectors to expose DynamoDB data in Quicksight without any custom scripts or duplicate data.
That being said, I would like to caveat this: just because it can be done, you should still ask yourself whether it is really a good solution for your workload. DynamoDB isn't the best fit for data warehousing use cases, and performing large scans on tables can end up being slow and costly. If your dataset is very large and this is a real production use case, it would probably still be best to go with an ETL workflow and move the DynamoDB data to a more appropriate data store.
But if you are still interested in seeing DynamoDB data live in QuickSight without any additional ETL processes to move/transform the data: I wrote a detailed blog post with step-by-step instructions, but in general, here is the process:
1 - Ensure you have an Athena Workgroup that uses the new Athena Engine version 2 and if not, create one
2 - In Athena under data sources, create a new data source and select "Query a data source" and then "Amazon DynamoDB"
3 - On the next part of the wizard, click the "Configure new AWS Lambda function" to deploy the prebuilt AthenaDynamoDBConnector.
4 - Once the AthenaDynamoDBConnector is deployed, select the name of the function you deployed in the Data Source creation wizard in Athena, give your DynamoDB data a catalog name like "dynamodb" and click "Connect"
You now should be able to query DynamoDB data in Athena but there are a few more steps to get things working in QuickSight.
5 - Go to the IAM console and find the QuickSight service role (e.g. aws-quicksight-service-role-v0).
6 - Attach the AWS Managed "AWSLambdaRole" policy to the QuickSight role since QuickSight now needs the permissions to invoke your data connector.
7 - Go to the QuickSight console and add a new Athena data source that uses the version 2 engine workgroup that you created in Step 1
8 - You should now be able to create a data set with that Athena Engine version 2 workgroup data source and choose the Athena catalog name you gave the DynamoDB connector in Step 4.
Bingo bango, you should now be able to directly query or cache DynamoDB data in Quicksight without needing to create custom code or jobs that duplicate your data to another data source.
As of March 2020, Amazon is making available a beta feature called Athena DynamoDB Connector.
Unfortunately, it's only in beta/preview. You can set it up in Athena, but I don't see a way to use these new Athena catalogs in Quicksight.
Hopefully once this feature is GA, it can be easily imported into Quicksight and I can update the answer with the good news.
Instructions on setting up a DynamoDB connector
There are many new data sources that AWS is making available in beta for automating the connections to Athena.
You can set these up via the console by:
Navigate to the "Data Sources" menu in the AWS Athena console.
Click the "Configure Data Source" button
Choose "Query a data source" radio button
Select "Amazon DynamoDB" option that appears
Click the "Configure new function" option
You'll need to specify a bucket to hold "spilled" data and provide a name for the new DynamoDB catalog.
Once the app is deployed from Step 5, select the Lambda name (the name of the catalog you entered in Step 5) in the Athena data source form from Step 4 and also provide that same catalog name.
Create the data connector
Now you can go to the Athena query editor and select the catalog you just created. Under the default Athena database in the new catalog, you will see a list of all DynamoDB tables for your region, which you can now query from Athena.
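Once the catalog shows up, a query only needs the fully qualified catalog.database.table name. Here is a minimal sketch in JavaScript of building such a query string. The catalog name "dynamodb" and the "default" database come from the steps above; the table name "customers" and the helper names are made up for illustration.

```javascript
// Hypothetical helper: quotes an identifier the way Athena's SQL dialect
// expects (double quotes, with embedded double quotes doubled).
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Hypothetical helper: builds a simple SELECT against a table exposed
// through the Athena DynamoDB connector catalog.
function buildAthenaQuery(catalog, database, table, limit = 10) {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError('limit must be a positive integer');
  }
  const target = [catalog, database, table].map(quoteIdent).join('.');
  return `SELECT * FROM ${target} LIMIT ${limit}`;
}

console.log(buildAthenaQuery('dynamodb', 'default', 'customers'));
```

Keeping a LIMIT on exploratory queries is a cheap safeguard, since every query through the connector reads from the live DynamoDB table.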
We want DynamoDB support in Quicksight!
The simplest way I could find is below:
1 - Create a Glue Crawler which takes DynamoDB table as a Data Source and writes documents to a Glue Table. (Let's say Table X)
2 - Create a Glue Job which takes 'Table X' as a data source and writes the data into an S3 bucket in Parquet format. (Let's say s3://table-x-parquets)
3 - Create a Glue Crawler which takes 's3://table-x-parquets' as a data source and creates a new Glue Table from it. (Let's say Table Y)
Now you can run Athena queries against Table Y, and you can also use it as a data set in Quicksight.
I'd also like to see a native integration between DynamoDB and QuickSight, so I will be watching this thread as well.
But there is at least 1 option that's closer to what you want. You could enable Streams on your DynamoDB table and then set up a trigger that invokes a Lambda function whenever changes are made to the table.
Then you could act only on specific DynamoDB events if you like (the stream's eventName is 'INSERT', 'MODIFY' or 'REMOVE') and dump the new/modified record to S3. That would be pretty close to real-time data, as the function is triggered right after each update.
I did something similar in the past, but instead of dumping data to S3 I was updating another DynamoDB table. It would be pretty simple to switch the example to S3 instead. See below.
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB();

// Triggered by a DynamoDB stream: copies every newly inserted item
// into the 'Rules' table once per customer.
exports.handler = async (event) => {
  console.log("Event:", JSON.stringify(event));

  // Fetch all customer IDs. A failed scan now fails the invocation instead
  // of crashing later on an undefined response. Note that a single scan call
  // returns at most 1 MB of data; follow LastEvaluatedKey for larger tables.
  const customerResponse = await dynamo.scan({
    TableName: 'Customers',
    ProjectionExpression: 'CustomerId'
  }).promise();
  const customers = customerResponse.Items.map(item => item.CustomerId.S);
  console.log(customers);

  for (const record of event.Records) {
    // Only react to inserts that carry the new item image.
    if (record.eventName !== 'INSERT' || !record.dynamodb.NewImage) {
      continue;
    }
    console.log(record.dynamodb.NewImage);

    for (const customerId of customers) {
      // NewImage is already in DynamoDB attribute-value format,
      // so it can be passed to putItem as-is.
      await dynamo.putItem({
        Item: {
          ...record.dynamodb.NewImage,
          CustomerId: { S: customerId }
        },
        TableName: 'Rules'
      }).promise().catch(err => console.log(err));
    }
  }
};
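Switching the target to S3 mostly means turning each stream record into an S3 write. What follows is a rough sketch, not the code I actually ran: the bucket name, the key layout and the helper names are made up, and the small unmarshall helper only handles the common attribute types (the SDK ships AWS.DynamoDB.Converter.unmarshall for the full set).

```javascript
// Minimal DynamoDB-JSON -> plain JS conversion (S, N, BOOL, NULL, L, M only).
function unmarshallValue(attr) {
  if ('S' in attr) return attr.S;
  if ('N' in attr) return Number(attr.N);
  if ('BOOL' in attr) return attr.BOOL;
  if ('NULL' in attr) return null;
  if ('L' in attr) return attr.L.map(unmarshallValue);
  if ('M' in attr) return unmarshall(attr.M);
  throw new Error('Unsupported attribute type: ' + Object.keys(attr).join(','));
}

function unmarshall(image) {
  const out = {};
  for (const [name, attr] of Object.entries(image)) {
    out[name] = unmarshallValue(attr);
  }
  return out;
}

// Derive a stable object key from the item's primary key, so a later
// MODIFY overwrites the earlier object instead of adding a second copy.
function objectKeyFor(record, prefix) {
  const keys = unmarshall(record.dynamodb.Keys);
  const parts = Object.keys(keys).sort()
    .map(name => encodeURIComponent(String(keys[name])));
  return `${prefix}/${parts.join('_')}.json`;
}

// Translate stream records into S3 operations (no AWS calls happen here).
function toS3Operations(event, bucket, prefix = 'items') {
  return event.Records.map(record => {
    const Key = objectKeyFor(record, prefix);
    if (record.eventName === 'REMOVE') {
      return { action: 'deleteObject', params: { Bucket: bucket, Key } };
    }
    // INSERT and MODIFY carry the full item in NewImage, provided the
    // stream view type includes new images.
    const Body = JSON.stringify(unmarshall(record.dynamodb.NewImage));
    return { action: 'putObject', params: { Bucket: bucket, Key, Body } };
  });
}
```

Inside the handler, each operation could then be executed with something like await s3[op.action](op.params).promise() on an AWS.S3 client.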
Possible solutions are explained in other answers. Just wanted to discuss another point.
BI tools such as QuickSight are usually designed to sit on top of analytical data stores such as Redshift or S3. DynamoDB is not a very suitable data store for analytics. Item-level operations such as "put" or "get" are very efficient, but bulk operations such as "scan" are expensive. If you are constantly doing scans during the day, your DynamoDB costs might grow fast.
One option is to cache the data in SPICE (QuickSight's in-memory cache). But a better one is to unload the data into better-suited storage such as S3 or Redshift. A couple of solutions are given in the other answers.
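To get a feel for why repeated scans add up, here is a back-of-the-envelope estimate. It relies on DynamoDB's read pricing model (one read unit covers a strongly consistent read of up to 4 KB, and an eventually consistent read costs half as much). The function name and the example table size are only for illustration, and real scans round per request rather than per table, so treat the result as an approximation.

```javascript
const READ_UNIT_BYTES = 4 * 1024;

// Rough number of read units consumed by one full table scan.
function estimateScanReadUnits(tableSizeBytes, { stronglyConsistent = false } = {}) {
  // A scan is billed on the data it reads, in 4 KB blocks, regardless of
  // how many items a filter expression throws away afterwards.
  const blocks = Math.ceil(tableSizeBytes / READ_UNIT_BYTES);
  return stronglyConsistent ? blocks : blocks / 2;
}

const oneGiB = 1024 ** 3;
const perScan = estimateScanReadUnits(oneGiB);
console.log(perScan);       // 131072 read units for one eventually consistent scan
console.log(perScan * 24);  // 3145728 if a dashboard refreshes hourly for a day
```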
Would love to see DynamoDB integration with Quicksight. Using DynamoDB streams to dump to S3 doesn't work because DynamoDB streams send out events instead of updating records. Hence if you read from this S3 bucket you'll have two instances of the same item: one before update and one after update.
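If the change events do land in S3 as-is, one way around the duplicates is to collapse them to the latest event per item when reading. A minimal sketch, assuming each stored event keeps the stream record's eventName, dynamodb.Keys, dynamodb.NewImage and dynamodb.SequenceNumber fields, and assuming sequence numbers for the same item increase over time. The function name is made up.

```javascript
// Collapse a list of stream records to the latest image of each item.
function latestStatePerItem(records) {
  const latest = new Map();
  for (const record of records) {
    // Assumes the key attributes are serialized in a consistent order.
    const id = JSON.stringify(record.dynamodb.Keys);
    // Sequence numbers are long digit strings, so compare them as BigInt.
    const seq = BigInt(record.dynamodb.SequenceNumber);
    const seen = latest.get(id);
    if (!seen || seq > seen.seq) {
      latest.set(id, { seq, record });
    }
  }
  // Items whose most recent event is a REMOVE no longer exist.
  return [...latest.values()]
    .filter(entry => entry.record.eventName !== 'REMOVE')
    .map(entry => entry.record.dynamodb.NewImage);
}
```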
One solution that I see now is to dump data from DynamoDB to an S3 bucket periodically using Data Pipeline and use Athena and Quicksight on this S3 bucket.
A second solution is to use a DynamoDB stream to send data to Elasticsearch using a Lambda function. Elasticsearch has a plugin called Kibana which has pretty cool visualizations. Obviously this is going to increase your cost because now you are storing your data in two places.
Also make sure that you transform your data such that each Elasticsearch document has the most granular data according to your needs, as Kibana visualizations will aggregate everything in one document.
I'm a data engineer working with AWS. We want to build a data pipeline to visualise our DynamoDB data in QuickSight. As you know, it's not possible to connect DynamoDB directly to QuickSight; you have to go through S3.
S3 will be our data lake. The issue is that the data is updated frequently (for example, column names can change and customer status can evolve).
So I'm looking for a batch solution that always gets the latest data from DynamoDB into my S3 data lake, so I can visualise it in QuickSight.
Thank you
You can access your tables at DynamoDB, in the console, and export data to S3 under the Streams and Exports tab. This blog post from AWS explains just what you need.
You could also try this approach with Athena instead of S3.
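The same export can also be started from code rather than the console. Below is a sketch of the request parameters for the SDK's exportTableToPointInTime call, which requires point-in-time recovery to be enabled on the table. The table ARN, bucket name, prefix layout and helper name are placeholders.

```javascript
// Hypothetical helper: builds parameters for a full-table export to S3.
// Each run writes under its own date-stamped prefix so batches stay separate.
function buildExportParams(tableArn, bucket, prefix, now = new Date()) {
  const day = now.toISOString().slice(0, 10); // e.g. "2024-01-15" (UTC)
  return {
    TableArn: tableArn,
    S3Bucket: bucket,
    S3Prefix: `${prefix}/${day}`,
    ExportFormat: 'DYNAMODB_JSON'
  };
}

const params = buildExportParams(
  'arn:aws:dynamodb:eu-west-1:123456789012:table/Customers', // placeholder ARN
  'my-datalake-bucket',                                      // placeholder bucket
  'exports/customers',
  new Date('2024-01-15T10:00:00Z')
);
console.log(params.S3Prefix); // exports/customers/2024-01-15
```

With the AWS SDK for JavaScript v2 this would be sent with new AWS.DynamoDB().exportTableToPointInTime(params).promise(), for example from a Lambda function on a daily schedule.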
We need to run an analysis of the data in Amazon DynamoDB. Since doing it in DDB itself isn't an option due to DDB's limitations with analysis, I am leaning, based on the recommendations, towards DDB -> S3 -> Athena.
It is a data-heavy application with data streaming from AWS IoT devices and is also a multi-tenant application. Now, to sync data from DDB to Amazon S3, it will be probably a couple of times a day. How do we set up incremental exports for this purpose?
There is an Athena connector that lets you query the data in your DynamoDB tables directly with SQL.
https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-dynamodb.html
https://dev.to/jdonboch/finally-dynamodb-support-in-aws-quicksight-sort-of-2lbl
Another solution for this use case is you can write an AWS Step Functions workflow that when invoked, can read data from an Amazon DynamoDB table and then format the data to the way you want it and place the data into an Amazon S3 bucket (an example that shows a similar use case will be available soon):
This is the reverse (here the source is an Amazon S3 bucket and the target is an Amazon DynamoDB table) but you can build the Workflow so the target is an Amazon S3 bucket. Because it's a workflow, you can use a Lambda function that is scheduled to fire a few times a day based on a CRON expression. The job of this Lambda function is to invoke the workflow using the Step Functions API.
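The scheduled Lambda function itself can be very small. Here is a sketch assuming the AWS SDK for JavaScript v2 StepFunctions client; the state machine ARN, the execution-name format, the input payload and the STATE_MACHINE_ARN environment variable are all placeholders.

```javascript
// Builds the parameters for starting one workflow execution.
function buildStartExecutionParams(stateMachineArn, now = new Date()) {
  return {
    stateMachineArn,
    // A timestamp-based name keeps executions distinguishable in the console.
    name: 'ddb-to-s3-' + now.toISOString().replace(/[^0-9]/g, ''),
    input: JSON.stringify({ startedAt: now.toISOString() })
  };
}

// Factory so the Step Functions client can be replaced by a fake in tests.
function createHandler(stepFunctionsClient, stateMachineArn) {
  return async () => {
    const params = buildStartExecutionParams(stateMachineArn);
    const result = await stepFunctionsClient.startExecution(params).promise();
    return result.executionArn;
  };
}

// In the Lambda module this could be wired up as:
//   const AWS = require('aws-sdk');
//   exports.handler = createHandler(new AWS.StepFunctions(), process.env.STATE_MACHINE_ARN);
```

The schedule itself (the CRON expression) lives on the rule that triggers the Lambda function, not in this code.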
I have a requirement to read a CSV batch file that was uploaded to an S3 bucket, encrypt the data in some columns and persist this data in a DynamoDB table. While persisting each row in the DynamoDB table, depending on the data in each row, I need to generate an ID and store that in the DynamoDB table too. It seems AWS Data Pipeline allows you to create a job that imports S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file and to generate the ID mentioned above.
Is there any way that I can achieve this requirement using AWS Data Pipeline? If not what would the best approach that I can follow using AWS services?
We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).
We also use AWS Data Pipeline for this process.
We first trigger an EMR cluster from Data Pipeline, where we fetch the data from S3, transform it and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline which triggers the EMR cluster once a day to perform the task.
This can incur additional costs too.
I am working on a project to get data from an Amazon S3 bucket into Tableau.
The data needs to be reorganised and combined from multiple .CSV files. Is Amazon Athena capable of connecting S3 to Tableau directly, and is it relatively easy/cheap? Or should I instead look at another software package to achieve this?
I am looking to visualise the data and provide a forecast based on observed trend (may need to incorporate functions to generate data to fit linear regression).
It appears that Tableau can query data from Amazon Athena.
See: Connect to your S3 data with the Amazon Athena connector in Tableau 10.3 | Tableau Software
Amazon Athena can query multiple CSV files in a given path (directory) and run SQL against the data. So, it sounds like this is a feasible solution for you.
Yes, you can integrate Athena with Tableau to query your data in S3. There are plenty of resources online that describe how to do that, e.g. link 1, link 2, link 3. But obviously, tables that define the meta information of your data have to be defined beforehand.
Amazon Athena pricing is based on the amount of data scanned by each query, i.e. $5 per 1 TB of data scanned. So it all comes down to how much data you have and how it is structured, i.e. partitioning, bucketing, file format etc. Here is a nice blog post that covers these aspects.
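As a quick illustration of that pricing model, here is a small estimate in JavaScript. The $5 per TB figure is the one quoted above (check current pricing for your region); treating a TB as 1024^4 bytes, the 200 GiB data size and the 10% Parquet figure are my assumptions.

```javascript
const BYTES_PER_TB = 1024 ** 4; // assumption: binary terabyte

// Estimated cost in USD of one query that scans the given number of bytes.
function estimateAthenaCostUsd(bytesScanned, pricePerTbUsd = 5) {
  return (bytesScanned / BYTES_PER_TB) * pricePerTbUsd;
}

const csvBytes = 200 * 1024 ** 3; // a query that scans 200 GiB of CSV files
console.log(estimateAthenaCostUsd(csvBytes).toFixed(2));       // "0.98"

// If a columnar, partitioned layout lets the same query read only 10% of it:
console.log(estimateAthenaCostUsd(csvBytes * 0.1).toFixed(2)); // "0.10"
```

This is why partitioning and file format matter so much: the cost follows the bytes scanned, not the size of the result.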
There is one thing to keep in mind while you prototype a dashboard. By default, each time you change the list of parameters, filters etc., Tableau automatically sends a request to AWS Athena to execute your query. Luckily, you can disable auto-querying of the data source and run the queries manually.
I see there are tons of examples and documentation on copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to kill the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
Dynamo DB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item
level changes in any DynamoDB table in a log for a duration of 24
hours. Using the Streams APIs, developers can query the updates,
receive the item level data before and after the changes, and use it
to build creative extensions to their applications built on top of
DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
The Redshift COPY command can only copy the entire DynamoDB table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - If you set up an EMR cluster then you can use Hive tables to execute queries on the DynamoDB data and move it to S3. Then that data can be easily moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, then the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated wherever the initial DynamoDB table is updated (add, update, delete). You then delete the tracked keys either all at once at the end of the backup process, or one by one right after each row is backed up.
If your DynamoDB table can have
Timestamps as an attribute or
A binary flag which conveys data freshness as attribute
then you can write a hive query to export only current day's data or fresh data to s3 and then 'KEEP_EXISTING' copy this incremental s3 data to Redshift.
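The answer above uses Hive. As an alternative illustration of the same timestamp idea (not the Hive approach itself), here are DynamoDB Scan parameters that only return items changed since a cut-off. The attribute name updatedAt (epoch milliseconds), the table name and the helper names are assumptions. Note that a filter only trims the result: the scan still reads, and is billed for, the whole table.

```javascript
// Epoch milliseconds for the start of the given date's UTC day.
function startOfUtcDay(date) {
  return Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
}

// Parameters for the low-level DynamoDB scan call, limited to fresh items.
function buildIncrementalScanParams(tableName, sinceEpochMs, timestampAttr = 'updatedAt') {
  return {
    TableName: tableName,
    // A placeholder name avoids clashes with DynamoDB reserved words.
    FilterExpression: '#ts >= :since',
    ExpressionAttributeNames: { '#ts': timestampAttr },
    ExpressionAttributeValues: { ':since': { N: String(sinceEpochMs) } }
  };
}

const since = startOfUtcDay(new Date('2024-01-15T10:30:00Z'));
console.log(JSON.stringify(buildIncrementalScanParams('Orders', since)));
```

The items returned by such a scan can then be written to S3 and loaded into Redshift as the incremental batch.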