AWS Athena tables for BI tools - amazon-web-services

I did ETL on our data and ran simple aggregations on it in Athena. Our plan is to use our BI tool to access those tables through Athena. It works for now, but I'm worried that these tables are static, i.e. they only reflect the data from when I last created the Athena table. When queried, are Athena tables automatically refreshed? If not, how do I make them update automatically when our BI tool calls them?
My only solution so far for overwriting our tables is to run two different queries: one to drop the table and another to re-create it. Since they are two separate queries, I'm not sure they can be run together (at least in Athena, you can't run them in one go).

Amazon Athena is a query engine, not a database.
When a query is sent to Amazon Athena, it looks at the location stored in the table's DDL. Athena then goes to the Amazon S3 location specified and scans the files for the requested data.
Therefore, every Athena query always reflects the data currently stored in the underlying Amazon S3 objects:
Want to add data to a table? Then store an additional object in that location.
Want to delete data from a table? Then delete the underlying object that contains that data.
There is no need to "drop a table, then re-create the table". The table will always reflect the current data stored in Amazon S3. In fact, the table doesn't actually exist -- rather, it is simply a definition of what the table should contain and where to find the data in S3.
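To make the "just a definition" point concrete, here is a minimal sketch of what an Athena table amounts to: a DDL statement naming columns and an S3 location. The table name, columns, and bucket below are hypothetical examples, and the SerDe shown is just one common choice for delimited text.

```python
# Sketch: an Athena "table" is only metadata - column definitions plus an
# S3 prefix that Athena scans at query time. Names here are hypothetical.

def external_table_ddl(table, columns, s3_location):
    """Render a CREATE EXTERNAL TABLE statement for Athena.

    `columns` is a list of (name, type) pairs; `s3_location` is the
    S3 prefix whose objects the table will reflect.
    """
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\n"
        f"LOCATION '{s3_location}';"
    )

ddl = external_table_ddl(
    "sales_agg",
    [("item_id", "string"), ("total", "double")],
    "s3://my-bucket/curated/sales_agg/",
)
print(ddl)
```

Dropping new objects into `s3://my-bucket/curated/sales_agg/` "adds rows" to this table with no DDL change at all, which is why the drop-and-recreate cycle is unnecessary.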
The best use-case for Athena is querying large quantities of rarely-accessed data stored in Amazon S3. If the data is often accessed and updated, then a traditional database or data warehouse (e.g. Amazon Redshift) would be more appropriate.
Pointing a Business Intelligence tool to Athena is quite acceptable, but you need to have proper processes in place for updating the underlying data in Amazon S3.
I would also recommend storing the data in Snappy-compressed Parquet files, which will make Athena queries faster and cheaper (Athena charges based upon the amount of data scanned from Amazon S3).
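A rough sketch of why the compression advice matters for cost. The $5-per-TB-scanned figure below is the commonly cited Athena price and is an assumption here, as is the 8x shrink from columnar Parquet with compression; check current pricing and measure your own data.

```python
# Sketch: Athena bills per byte scanned, so smaller files mean cheaper
# queries. The $5/TB price and the 8x compression ratio are assumptions.

PRICE_PER_TB = 5.00  # assumed Athena price per TB scanned

def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Estimate the cost of one Athena query from bytes scanned."""
    return bytes_scanned / (1024 ** 4) * price_per_tb

raw_json = 2 * 1024 ** 4      # 2 TB of raw JSON
parquet = raw_json // 8       # illustrative 8x shrink from Parquet + compression

print(round(query_cost_usd(raw_json), 2))  # → 10.0
print(round(query_cost_usd(parquet), 2))   # → 1.25
```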

Related

How does QuickSight SPICE refresh the data?

I have a QuickSight dashboard pointed at an Athena table. Now I want to schedule SPICE to refresh every hour. As per the documentation, refreshing imports the data into SPICE again, so the data includes any changes since the last import.
Say I have a 2 TB dataset in Athena and new data is added to it every hour. Will QuickSight load the full 2 TB every hour just to find the delta? If so, that will increase the Athena cost. Does QuickSight query Athena to fetch the data?
As of the date of answering (11/11/2019) SPICE does in fact perform a full data set reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL data set and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2TB data set.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.
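Pointing QuickSight at S3 directly is done through a manifest file that tells QuickSight where the objects live. A minimal sketch of building one is below; the bucket and prefix are hypothetical. Note that QuickSight S3 manifests cover delimited/JSON formats, which is relevant to the Parquet discussion that follows.

```python
# Sketch: a QuickSight S3 data source is defined by a JSON manifest
# listing URI prefixes and a file format. Bucket names are hypothetical.
import json

def s3_manifest(uri_prefixes, fmt="CSV"):
    """Build a QuickSight S3 data-source manifest document."""
    return {
        "fileLocations": [{"URIPrefixes": list(uri_prefixes)}],
        "globalUploadSettings": {"format": fmt},
    }

manifest = s3_manifest(["s3://my-bucket/exports/2019/"])
print(json.dumps(manifest, indent=2))
```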
My data is in Parquet format. I guess QuickSight does not support a direct query on S3 Parquet data.
Yes, we need to use Athena to read the parquet.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do it; it will increase the Athena and S3 costs significantly.
Solution:
Collect the delta from your source.
Push it into S3 (Unprocessed data)
Create a lambda function to pre-process the data (if needed)
Set up a trigger for the Lambda function.
Process the data in Lambda and convert it to Parquet format with gzip compression.
Push the data into S3 (Processed data)
Remove the unprocessed data from S3, or set up an S3 lifecycle rule to manage it.
Also create a metadata table with primary_key and required fields.
S3 and Athena do not support updating records, so each time you push data it will be appended to the old data, and the entire data set will be scanned.
Both S3 and Athena follow a scan-first approach, so even though you apply a filter, the entire data set is scanned before the filter is applied.
Use the metadata table to remove the old entry and insert the new entry.
Use partitions wherever possible to avoid scanning the entire data.
Once the data is available, configure the QuickSight data refresh to pull the data into SPICE.
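The metadata-table step above (remove the old entry, insert the new one) boils down to keeping only the latest record per primary key. A minimal sketch, with hypothetical field names (`id`, `updated_at`):

```python
# Sketch of the metadata-table dedup step: since S3/Athena data is
# append-only, collapse records to the newest one per primary key.
# Field names are hypothetical.

def latest_per_key(records, key="id", version="updated_at"):
    """Keep the most recent record for each primary key."""
    best = {}
    for rec in records:
        k = rec[key]
        if k not in best or rec[version] > best[k][version]:
            best[k] = rec
    return list(best.values())

rows = [
    {"id": 1, "updated_at": "2019-11-10", "price": 10},
    {"id": 1, "updated_at": "2019-11-11", "price": 12},  # newer entry wins
    {"id": 2, "updated_at": "2019-11-11", "price": 7},
]
print(latest_per_key(rows))
```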
Best practice:
Always go with SPICE (Direct queries are expensive and have high latency)
Use the incremental refresh wherever possible.
Always use static data, do not process the data for each dashboard visit/refresh.
Increase your QuickSight SPICE data refresh frequency.

What does data category contain in AWS glue?

I am working on crawling data into the Data Catalog via AWS Glue, but I am a bit confused about the database definition. From what I can find in the AWS docs, "A database in the AWS Glue Data Catalog is a container that holds tables. You use databases to organize your tables into separate categories." I wonder what exactly a database contains. Does it load all the data from the other data sources and create a catalog of it, or does it only contain the catalog? How do I know the size of the tables in a Glue database? And what type of database does it use (NoSQL, RDS)?
For example, I create a crawler to load data from S3 and create a catalog table in Glue. Does the Glue table include all the data from the S3 bucket? If I delete the S3 bucket, will it impact other Glue jobs that run against the catalog table created by the crawler?
If the catalog table only includes the data schema, how can I keep it up to date when my data source is modified?
The Catalog is just a metadata store. Its mission is to document the data that lives elsewhere and to expose that metadata to other tools, like Athena or EMR, so they can discover the data.
Data is not replicated into the catalog, but remains in the origin. If you remove the table from the catalog, the data in origin remains intact.
If you delete the origin data (as you described in your question), the other services will not have access to the data anymore, as it is deleted. If you run the crawler again, it should detect that the data is no longer there.
If you want to keep the crawler schema up to date, you can either schedule automatic runs of the crawler, or execute on demand whenever your data changes. When the crawler is run again it will update accordingly things like the number of records, partitions, or even changes in the schema. Please refer to the documentation to see the effect changes in the schema can have on your catalog.
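Both options can be sketched with boto3: either attach a cron schedule to the crawler or start it on demand. The crawler name below is hypothetical, and the AWS calls are deferred so nothing runs without credentials and real infrastructure.

```python
# Sketch: keeping the Glue catalog current by running the crawler on a
# schedule or on demand. The crawler name is a hypothetical example.

def hourly_cron():
    """Schedule expression (Glue's 6-field cron syntax) for an hourly crawl."""
    return "cron(0 * * * ? *)"

def refresh_catalog(crawler_name, schedule=None):
    import boto3  # deferred so the sketch loads without AWS configured
    glue = boto3.client("glue")
    if schedule:
        glue.update_crawler(Name=crawler_name, Schedule=schedule)  # recurring runs
    else:
        glue.start_crawler(Name=crawler_name)  # one on-demand run

if __name__ == "__main__":
    # refresh_catalog("my-s3-crawler", schedule=hourly_cron())  # needs AWS
    print(hourly_cron())
```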

Are there any benefits to storing data in DynamoDB vs S3 for use with Redshift?

My particular scenario: Expecting to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per a day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide if I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported with redshift, but I did notice Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift. Redshift is again an independent, full-fledged database (a data warehousing solution).
You can use Athena to query data directly from S3.

Creating Table As substitution

I am currently working with AWS Athena, and it does not support CREATE TABLE AS, which is fine, so I thought I would approach it by doing INSERT OVERWRITE DIRECTORY S3://PATH and then loading from S3, but apparently that doesn't seem to work either. How would I create a table from a query if both of these options are out the window?
Amazon Athena is read-only. It cannot be used to create tables in Amazon S3.
However, the output of an Amazon Athena query is stored in Amazon S3 and could be used as input for another query, though you'd have to know the path of the output.
Amazon Athena is ideal for individual queries against data stored in Amazon S3, but is not the best tool for ETL actions, which typically involve transforming data, storing it and then sequentially processing it again.
You don't have to use INSERT, just create an external table over the location of the previous query results
https://aws.amazon.com/premiumsupport/knowledge-center/athena-query-results/

Copying only new records from AWS DynamoDB to AWS Redshift

I see there is tons of examples and documentation to copy data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process everyday, so there is no need to kill the entire redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time-ordered sequence of item-level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item-level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
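A minimal sketch of consuming stream records follows, using the batch shape that Lambda delivers for DynamoDB streams today (at the time of the answer you would poll the Streams API directly). The attribute names are hypothetical.

```python
# Sketch: pull the post-change image out of DynamoDB stream records so
# new/updated rows can be forwarded to Redshift. Attribute names are
# hypothetical examples.

def new_rows(event):
    """Extract the NewImage from INSERT/MODIFY records, untyping values."""
    rows = []
    for record in event.get("Records", []):
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # DynamoDB wraps each value in a type tag, e.g. {"S": "a1"}
            rows.append({k: list(v.values())[0] for k, v in image.items()})
    return rows

sample = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"item_id": {"S": "a1"}, "price": {"N": "9.99"}}}},
    {"eventName": "REMOVE", "dynamodb": {}},  # deletions are skipped here
]}
print(new_rows(sample))  # → [{'item_id': 'a1', 'price': '9.99'}]
```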
The Redshift COPY command can only copy an entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - if you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move it to S3. That data can then easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that have changed since the last backup. This table has to be updated whenever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete the tracked keys, either all at once or one by one as each row is backed up.
If your DynamoDB table has either
a timestamp as an attribute, or
a binary flag which conveys data freshness as an attribute,
then you can write a Hive query to export only the current day's data (or only fresh data) to S3, and then 'KEEP_EXISTING' copy this incremental S3 data to Redshift.
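The timestamp approach above reduces to a simple predicate: select only rows written since the last export. A sketch with hypothetical attribute names; in practice the same filter would live in the Hive query's WHERE clause rather than in application code.

```python
# Sketch of the timestamp-based incremental export: keep only rows
# modified after the last successful backup. Attribute names and the
# cutoff value are hypothetical.

def fresh_rows(items, since, ts_attr="updated_at"):
    """Rows whose timestamp is strictly after the last export."""
    return [item for item in items if item[ts_attr] > since]

items = [
    {"item_id": "a", "updated_at": "2015-06-01"},
    {"item_id": "b", "updated_at": "2015-06-02"},
]
print(fresh_rows(items, since="2015-06-01"))  # → [{'item_id': 'b', ...}]
```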