How does QuickSight SPICE refresh the data? - amazon-web-services

I have a QuickSight dashboard pointed at an Athena table. Now I want to schedule SPICE to refresh every hour. As per the documentation, refreshing imports the data into SPICE again, so the data includes any changes since the last import.
If I have a 2 TB dataset in Athena and new data is added to Athena every hour, will QuickSight load 2 TB every hour to find the delta? If yes, it will increase the Athena cost. Does QuickSight query Athena to fetch the data?

As of the date of answering (11/11/2019), SPICE does in fact perform a full dataset reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL dataset and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2 TB dataset.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.

My data is in Parquet format. I guess QuickSight does not support a direct query on S3 Parquet data.
Yes, we need to use Athena to read the Parquet files.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do it; it will increase the Athena and S3 costs significantly.
Solution:
Collect the delta from your source.
Push it into S3 (unprocessed data).
Create a Lambda function to pre-process the data if needed (a sketch of this step follows this list).
Set up an S3 trigger for the Lambda.
Process the data in Lambda and convert it to Parquet format with gzip compression.
Push the data into S3 (processed data).
Remove the unprocessed data from S3, or set up an S3 lifecycle policy to manage it.
Also create a metadata table with the primary key and the required fields.
S3 and Athena do not support updating records, so each time you push data it is appended to the old data, and the entire dataset is scanned.
Both S3 and Athena follow a scan-first approach: even though you apply a filter, the entire dataset is scanned before the filter is applied.
Use the metadata table to remove the old entry and insert the new one.
Use partitions wherever possible to avoid scanning the entire dataset.
Once the data is available, configure the QuickSight data refresh to pull the data into SPICE.
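For illustration, here is a minimal sketch of the Lambda pre-processing step from the list above, assuming the function is triggered by s3:ObjectCreated events on the unprocessed prefix, that the AWS SDK for pandas (awswrangler) layer is attached, and that the raw files are CSV; the bucket names, prefixes, and the dt partition column are hypothetical.

```python
# Hypothetical sketch: convert newly uploaded raw CSV objects to
# gzip-compressed Parquet under a "processed/" prefix.
import urllib.parse

import awswrangler as wr

PROCESSED_PATH = "s3://my-analytics-bucket/processed/"  # placeholder bucket/prefix


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw delta file (CSV assumed here) into a DataFrame.
        df = wr.s3.read_csv(f"s3://{bucket}/{key}")

        # Any pre-processing / metadata-table bookkeeping would go here.

        # Write gzip-compressed Parquet, partitioned by date so Athena can
        # prune partitions instead of scanning the entire dataset.
        wr.s3.to_parquet(
            df=df,
            path=PROCESSED_PATH,
            dataset=True,
            compression="gzip",
            partition_cols=["dt"],  # assumes a "dt" (yyyy-mm-dd) column exists
        )
    return {"status": "ok"}
```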
Best practices:
Always go with SPICE (direct queries are expensive and have high latency).
Use incremental refresh wherever possible.
Always serve static, pre-processed data; do not process the data on each dashboard visit/refresh.
Increase your QuickSight SPICE data refresh frequency.
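If the console schedule is not flexible enough, a SPICE refresh can also be triggered programmatically with the QuickSight CreateIngestion API, for example from a scheduled Lambda or EventBridge rule. A minimal sketch, with the account ID and dataset ID as placeholders:

```python
# Hypothetical sketch: kick off a SPICE ingestion on your own schedule.
import time

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

AWS_ACCOUNT_ID = "123456789012"       # placeholder
DATASET_ID = "my-spice-dataset-id"    # placeholder


def refresh_spice_dataset():
    response = quicksight.create_ingestion(
        AwsAccountId=AWS_ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=f"scheduled-{int(time.time())}",  # must be unique per ingestion
    )
    return response["IngestionStatus"]


if __name__ == "__main__":
    print(refresh_spice_dataset())
```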

Related

AWS Athena tables for BI tools

I did ETL on our data and built simple aggregations on it in Athena. Our plan is to use our BI tool to access those tables from Athena. It works for now, but I'm worried that these tables are static, i.e. they only reflect the data as of when I last created the Athena table. When called, are Athena tables automatically run again? If not, how do I make them update automatically when called by our BI tool?
My only solution so far for overwriting the tables we have is running two different queries: one query to drop the table, and another to re-create it. Since they are two different queries, I'm not sure you can run them all at once (at least in Athena, you can't run them in one go).
Amazon Athena is a query engine, not a database.
When a query is sent to Amazon Athena, it looks at the location stored in the table's DDL. Athena then goes to the Amazon S3 location specified and scans the files for the requested data.
Therefore, every Athena query always reflects the data shown in the underlying Amazon S3 objects:
Want to add data to a table? Then store an additional object in that location.
Want to delete data from a table? Then delete the underlying object that contains that data.
There is no need to "drop a table, then re-create the table". The table will always reflect the current data stored in Amazon S3. In fact, the table doesn't actually exist -- rather, it is simply a definition of what the table should contain and where to find the data in S3.
The best use case for Athena is querying large quantities of rarely accessed data stored in Amazon S3. If the data is frequently accessed and updated, then a traditional database or data warehouse (e.g. Amazon Redshift) would be more appropriate.
Pointing a Business Intelligence tool to Athena is quite acceptable, but you need to have proper processes in place for updating the underlying data in Amazon S3.
I would also recommend storing the data in Snappy-compressed Parquet files, which will make Athena queries faster and cheaper (Athena is charged based upon the amount of data read from disk).
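As an illustration of that Parquet recommendation, an existing table can be converted to Snappy-compressed Parquet with an Athena CTAS statement. A rough sketch submitted through boto3; the database, table names, and S3 locations are placeholders:

```python
# Hypothetical sketch: one-off conversion of an existing Athena table to
# Snappy-compressed Parquet via a CTAS query.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

CTAS = """
CREATE TABLE my_db.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-analytics-bucket/events_parquet/'
) AS
SELECT * FROM my_db.events_raw
"""

response = athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```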

How to perform backfilling in a Redshift to BigQuery migration?

I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time range, in case any data is missing. But I don't see any backfilling option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in the light of your comments, I would proceed differently from what you describe. You reach the same goal, however :) .
Using your ETL pipeline, the first step would be to accumulate raw data in a data lake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your data sink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a folder timestamped by day (e.g. yyyymmdd), so that you can sort and consume your data along the time dimension.
Obviously, the data considered here is ahead in time of the data you already have in Redshift.
It may also have a different structure from the data you already put in Redshift, due to potential transformations you set in your initial pipeline.
In case you load raw data directly into Redshift, just export that data into the same S3 bucket under the name legacy/*. (In case it is transformed, you have to put a second S3 data sink in your pipeline with this intermediary transformation and keep the same S3 naming strategy.)
Let's take a break to understand what we have. We filled an S3 bucket with raw data that we can now replay at will for a specific day, using a cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, and then replay the following pipelines => the backfill you want.
Speaking of which, S3 would act as the data source for those following pipelines, which would apply the wanted transformations to the raw data from S3 and choose BigQuery (and potentially Redshift) as the data sink. Now please take into consideration the price of these operations. The streaming API in BQ is expensive, as high as $0.50 per GB, so use it only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the data sink of your ETL and transfer the data from there into BQ (note: put the data under the same folder naming pattern yyyymmdd to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance (triggering a Cloud Function on blob creation that puts the data into BQ).
Last but not least, backfilling should be done wisely, especially in BQ, where updates or inserts at the row level are not performant and are an open door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again, you can modify the data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). Doing so, we abide by the concept of a pure task, which is a guarantee of idempotency and non-duplication.
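A minimal sketch of that per-day overwrite, assuming the google-cloud-bigquery client, CSV files stored under gs://<bucket>/yyyymmdd/, and an already existing day-partitioned destination table; the project, dataset, table, and bucket names are placeholders:

```python
# Hypothetical sketch: replay one day's GCS folder into BigQuery, overwriting
# only the matching partition so backfills stay idempotent.
from google.cloud import bigquery

client = bigquery.Client()


def load_day(day: str) -> None:
    """day is the GCS folder name, e.g. '20200914'."""
    source_uri = f"gs://my-datalake/{day}/*.csv"  # placeholder bucket

    # Partition decorator: only the partition for this day is replaced.
    dataset_ref = bigquery.DatasetReference("my_project", "my_dataset")
    destination = bigquery.TableReference(dataset_ref, f"prices${day}")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    job = client.load_table_from_uri(source_uri, destination, job_config=job_config)
    job.result()  # wait for completion


if __name__ == "__main__":
    load_day("20200914")
```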
Please read this article to have more insights.
Note: If you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as the data sink for your first ETL. Choose GCS directly (ingress is free) and migrate your existing Redshift data into GCS, using S3 as an intermediary service and the Google transfer service from S3 to GCS.

Best way to archive DynamoDB records?

I have a table with about 6 million records and want to start archiving records. I have thought of creating a backup version of the same table and moving records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to S3.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live mechanism, and you can set up a stream of record deletions which will call an AWS Lambda function and put the data into S3. Check this detailed guide on how to set it up. Also, you can try out AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
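A minimal sketch of the Lambda that such a stream would invoke, assuming the table's stream is configured with the OLD_IMAGE (or NEW_AND_OLD_IMAGES) view type; the archive bucket name is a placeholder:

```python
# Hypothetical sketch: archive items that DynamoDB TTL expires. TTL deletions
# arrive on the stream as REMOVE events performed by the DynamoDB service.
import json

import boto3

s3 = boto3.client("s3")
ARCHIVE_BUCKET = "my-dynamodb-archive"  # placeholder


def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue

        # TTL deletions are performed by the service principal, not a user.
        identity = record.get("userIdentity", {})
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue

        old_image = record["dynamodb"]["OldImage"]
        key_attrs = record["dynamodb"]["Keys"]
        object_key = "archive/" + "-".join(
            str(next(iter(v.values()))) for v in key_attrs.values()
        ) + ".json"

        s3.put_object(
            Bucket=ARCHIVE_BUCKET,
            Key=object_key,
            Body=json.dumps(old_image).encode("utf-8"),
        )
    return {"archived": True}
```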
If you actively use full-scan operations over your DynamoDB table, then it's better to archive and remove the records you don't use. If you query records only by the primary key, then archiving is most probably not worth the effort. You can check your bill, but storing the first 25 GB in DynamoDB is free.

Are there any benefits to storing data in DynamoDB vs S3 for use with Redshift?

My particular scenario: expecting to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide whether I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported with Redshift, but I did notice that Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift. Redshift is again an independent, full-fledged database (a warehousing solution).
You can use Athena to query data directly from S3.
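For the Redshift path, loading the S3 objects is a plain COPY. A rough sketch submitted through the Redshift Data API; the cluster, database, IAM role, and bucket names are placeholders:

```python
# Hypothetical sketch: load JSON objects from S3 into a Redshift table with
# COPY, submitted through the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

COPY_SQL = """
COPY price_history
FROM 's3://my-price-data/2020/09/14/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="awsuser",
    Sql=COPY_SQL,
)
print(response["Id"])  # statement id; poll with describe_statement if needed
```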

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to kill the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time-ordered sequence of item-level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item-level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
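A minimal sketch of such a stream consumer, written as a Lambda handler (today Streams are usually wired to Lambda rather than polled through the low-level API): it stages only new or modified items to S3 as newline-delimited JSON so the delta can later be loaded into Redshift with COPY. The staging bucket name and the NEW_IMAGE stream view type are assumptions:

```python
# Hypothetical sketch: stage each batch of new/updated DynamoDB items to S3
# so the daily delta can be COPYed into Redshift.
import json
import time

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "my-dynamodb-delta-staging"  # placeholder


def handler(event, context):
    lines = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # NEW_IMAGE view type assumed on the stream.
        lines.append(json.dumps(record["dynamodb"]["NewImage"]))

    if lines:
        key = f"delta/{time.strftime('%Y%m%d')}/{context.aws_request_id}.json"
        s3.put_object(
            Bucket=STAGING_BUCKET,
            Key=key,
            Body="\n".join(lines).encode("utf-8"),
        )
    return {"staged": len(lines)}
```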
The Redshift COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - if you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move the results to S3. That data can then easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated wherever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete those keys, or you delete each one after its row is backed up (one by one).
If your DynamoDB table has
a timestamp as an attribute, or
a binary flag which conveys data freshness as an attribute,
then you can write a Hive query to export only the current day's (or otherwise fresh) data to S3, and then 'KEEP_EXISTING' copy this incremental S3 data to Redshift.
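If you maintain such a timestamp attribute, the same idea can also be sketched without Hive, for example with a filtered Scan that exports only today's items to S3 before the incremental COPY. The table, attribute, and bucket names below are hypothetical, and the timestamps are assumed to be ISO-8601 strings; note that a filtered Scan still reads (and bills for) the whole table:

```python
# Hypothetical sketch: export only items whose "updated_at" falls on the
# current day, then stage them in S3 for an incremental Redshift COPY.
import json
from datetime import date, datetime, time, timezone

import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")


def export_todays_items():
    start_of_day = datetime.combine(date.today(), time.min, tzinfo=timezone.utc)
    items, kwargs = [], {
        "TableName": "price_history",  # placeholder
        "FilterExpression": "updated_at >= :start",
        "ExpressionAttributeValues": {":start": {"S": start_of_day.isoformat()}},
    }

    # Paginate through the scan results.
    while True:
        page = dynamodb.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    s3.put_object(
        Bucket="my-incremental-exports",  # placeholder
        Key=f"dynamodb/{date.today():%Y%m%d}.json",
        Body="\n".join(json.dumps(i) for i in items).encode("utf-8"),
    )
    return len(items)
```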