I have a table with about 6 million records and want to start archiving records. I have thought of creating a backup version of the same table and moving the records across once they meet the criteria for being archived. However, I have been told that it is also possible to use Hive to copy this data to S3.
Could someone please explain why I would opt to copy the data into an S3 bucket rather than store it in another DynamoDB table?
DynamoDB has a time-to-live (TTL) mechanism, and you can enable a stream of record deletions that triggers an AWS Lambda function to put the expired data into S3. Check this detailed guide on how to set it up. Also, you can try out AWS Data Pipeline with an EMR cluster, which is a common way to set up one-time or periodic migrations.
If you actively use full scan operations over your DynamoDB table, then it's better to archive and remove the records you don't use. If you query the records only by the primary key, then archiving most probably isn't worth the effort. You can check your bill, but the first 25 GB stored in DynamoDB are free.
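A minimal sketch of that stream-to-S3 Lambda, assuming the table's stream is configured with OLD_IMAGE (or NEW_AND_OLD_IMAGES); the bucket name and key layout below are placeholders:

```python
import json
import os

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("ARCHIVE_BUCKET", "my-archive-bucket")  # hypothetical bucket name


def handler(event, context):
    """Triggered by the DynamoDB stream; archives items that TTL has expired."""
    for record in event.get("Records", []):
        # TTL deletions arrive as REMOVE events; OldImage holds the expired item.
        if record["eventName"] != "REMOVE":
            continue
        # TTL-driven deletions carry this service principal, which lets you
        # skip ordinary user deletes if you only want to archive expirations.
        principal = record.get("userIdentity", {}).get("principalId")
        if principal != "dynamodb.amazonaws.com":
            continue
        old_image = record["dynamodb"].get("OldImage")
        if not old_image:
            continue
        # The stream sequence number makes a convenient unique object key.
        key = f"archive/{record['dynamodb']['SequenceNumber']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(old_image).encode("utf-8"))
```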
I am using the BigQuery Data Transfer Service to migrate all data from Redshift to BigQuery.
After that, I want to perform backfilling for a specific time period if any data is missing. But I don't see any backfill option in the transfer job.
How can I achieve that in BigQuery?
Reading your question in light of your comments, I would proceed differently from what you describe. You reach the same goal, however :).
Using your ETL pipeline, the first step would be to accumulate raw data in a datalake.
Let's take a storage service like S3 to do so. For this ETL pipeline, S3 is your datasink.
Note that your pipeline does nothing more than take raw data from A and put it into S3. Also, the location in S3 should be under a folder timestamped per day (e.g. yyyymmdd) so that you can sort and consume your data along the time dimension.
Obviously, the considered data is ahead in time of the data you already have in Redshift.
It may also have a different structure from the data you already put into Redshift, due to the transformations you set in your initial pipeline.
In case you put raw data directly into Redshift, then just export that data into the same S3 bucket under the name legacy/*. (In case it is transformed, then you have to put a second S3 datasink in your pipeline with this intermediary transformation and keep the same S3 naming strategy.)
Let's take a break to understand what we have. We have filled an S3 bucket with raw data that we can now replay at will for a specific day, using cron or an orchestration tool such as Apache Airflow. Moreover, you can freely modify the content of each timestamped folder in case you missed data, then replay the following pipelines => the backfill you want.
Speaking of which, S3 would act as a data source for the following pipelines, which would apply the wanted transformations on the raw data from S3 and choose BigQuery (and potentially Redshift) as the datasink.

Now please take into consideration the price of these operations. The streaming API in BQ is expensive, as high as $0.50 per GB, so do that only if you need real-time results. If you can afford a latency of more than 5 minutes, a better strategy would be to set GCS as the datasink of your ETL and transfer the data from there into BQ (note: keep the same yyyymmdd file naming pattern to enable potential backfills). This transfer is free if the GCS bucket and the BQ dataset are in the same region. You could trigger the transfer with GCS events, for instance: triggering a Cloud Function on blob creation that puts the data into BQ.
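As a minimal sketch of that last idea (the table name my_project.my_dataset.raw_events is a placeholder, and the files are assumed to be newline-delimited JSON), a background Cloud Function fired on object creation can run a load job instead of using the streaming API:

```python
from google.cloud import bigquery

BQ_TABLE = "my_project.my_dataset.raw_events"  # hypothetical destination table


def gcs_to_bq(event, context):
    """Background Cloud Function fired on GCS object creation; loads the new blob into BigQuery."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    # A load job from GCS is free, unlike the streaming API.
    client.load_table_from_uri(uri, BQ_TABLE, job_config=job_config).result()
```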
Last but not least, backfilling should be done wisely, especially in BQ where updating or creating at the row level is not performant and is an open door to duplication. What you should consider is BigQuery partitioning, which you can set on a column that contains a timestamp, or on a hidden one if your data contains none. Which timestamp? Well, the one set in the GCS folder name!
Once again, you can modify the data in your GCS bucket per day and replay the transfer into BQ.
But each transfer for a given day must overwrite the partition the considered data belongs to (e.g. the data under 20200914 would overwrite the associated partition in BQ). We abide by the concept of a pure task doing so, which is a guarantee of idempotency and non-duplication.
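A hedged sketch of such an idempotent reload, with the same hypothetical table as above and assuming a day-partitioned table whose destination ID accepts the $yyyymmdd partition decorator (the same decorator the bq load tool uses); WRITE_TRUNCATE then replaces only that day's partition:

```python
from google.cloud import bigquery


def overwrite_partition(uri: str, day: str):
    """Reload one day's folder (e.g. gs://my-bucket/20200914/*) into the matching BQ partition."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        # Replace only the targeted partition so replaying a day stays idempotent.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    # "$yyyymmdd" is the partition decorator of a day-partitioned table.
    destination = f"my_project.my_dataset.raw_events${day}"  # hypothetical table
    client.load_table_from_uri(uri, destination, job_config=job_config).result()


# e.g. overwrite_partition("gs://my-bucket/20200914/*", "20200914")
```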
Please read this article to have more insights.
Note: If you intend to get rid of Redshift, you can choose to do it directly and forget about S3 as a datasink of your first ETL. Choose GCS directly (ingress is free) and migrate your already present Redshift data into GCS, using S3 as an intermediary service and the Google transfer service from S3 to GCS.
I have a QuickSight dashboard pointed at an Athena table. Now I want to schedule SPICE to refresh every hour. As per the documentation, refreshing imports the data into SPICE again, so the data includes any changes since the last import.
If I have a 2 TB dataset in Athena and new data is added to Athena every hour, will QuickSight load 2 TB every hour to find the delta? If yes, it will increase the Athena cost. Does QuickSight query Athena to fetch the data?
As of the date of answering (11/11/2019) SPICE does in fact perform a full data set reload (i.e. no delta calculation or incremental refresh). I was able to verify this by using a MySQL data set and watching the query log while the refresh was occurring.
The implication for your question is that you would be charged every hour for Athena to query the 2TB data set.
If you do not need the robust querying that Athena provides, I would recommend pointing QuickSight to the S3 data directly.
My data is in Parquet format. I guess QuickSight does not support a direct query on S3 Parquet data.
Yes, we need to use Athena to read the parquet.
When you say point QuickSight to S3 directly, do you mean without SPICE?
Don't do it, it will increase the Athena and S3 costs significantly.
Solution:
Collect the delta from your source.
Push it into S3 (Unprocessed data)
Create a Lambda function to pre-process the data (if needed).
Set up a trigger for the Lambda.
Process the data in the Lambda and convert it to Parquet format with gzip compression (a minimal sketch of this step follows after this list).
Push the data into S3 (Processed data)
Remove the unprocessed data from S3, or set up an S3 lifecycle policy to manage it.
Also create a metadata table with primary_key and required fields.
S3 & Athena do not support update records, so each time you push the data it will be appended to the old data, and the entire data will be scanned.
Both S3 and Athena follow the scan-first approach, so even though you are applying a filter it will scan the entire data before it applies the filter.
Use the metadata table to remove the old entry and insert the new entry.
Use partitions wherever possible to avoid scanning the entire data set.
Once the data is available, configure the QuickSight data refresh to pull the data into SPICE.
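As referenced above, the Lambda conversion step could look roughly like this. It is only a sketch, assuming the function is triggered by S3 object-created events on an unprocessed/ prefix and that pandas/pyarrow are packaged as a Lambda layer; the bucket prefixes and file layout are placeholders:

```python
import io
import json
import urllib.parse

import boto3
import pandas as pd  # pandas + pyarrow would need to ship in a Lambda layer

s3 = boto3.client("s3")
UNPROCESSED_PREFIX = "unprocessed/"  # hypothetical prefixes
PROCESSED_PREFIX = "processed/"


def handler(event, context):
    """S3-triggered Lambda: read a raw JSON-lines object, write it back as gzip-compressed Parquet."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(rec["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]
        df = pd.DataFrame(rows)

        buf = io.BytesIO()
        df.to_parquet(buf, compression="gzip", index=False)  # gzip codec inside the Parquet file

        out_key = key.replace(UNPROCESSED_PREFIX, PROCESSED_PREFIX, 1).rsplit(".", 1)[0] + ".parquet"
        s3.put_object(Bucket=bucket, Key=out_key, Body=buf.getvalue())
```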
Best practice:
Always go with SPICE (Direct queries are expensive and have high latency)
Use incremental refresh wherever possible.
Always use static data; do not process the data on each dashboard visit/refresh.
Increase your QuickSight SPICE data refresh frequency.
I am planning to store data in S3, on top of which SQL queries would later be executed. The S3 files would basically contain JSON records. I would be getting these records through DynamoDB Streams triggering AWS Lambda executions, so it is difficult to handle duplication at that layer, as AWS Lambda guarantees at-least-once delivery.
To avoid handling duplicate records in queries, I would like to ensure that the records being inserted are unique.
As far as I know, the only way to achieve uniqueness is to have a unique S3 key. If I were to opt for this approach, I would end up creating a couple of million S3 files per day, each consisting of a single JSON record.
Would creating so many files be a concern when executing Athena queries?
Any alternative approaches?
I think you would be better off handling the deduplication in Athena itself. For Athena, weeding out a few duplicates will be an easy job. Set up a view that groups by the unique property and uses ARBITRARY or MAX_BY (if you have something to order by to pick the latest) for the non-unique properties, and run your queries against this view to not have to worry about deduplication in each individual query.
You could also run a daily or weekly deduplication job using CTAS, depending on how fresh the data has to be (you can also do complex hybrids with pre-deduplicated historical data union'ed with on-the-fly-deduplicated data).
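For illustration only, a deduplication view of this kind could be created through the Athena API; the table, column, database, and bucket names below are hypothetical, and the view picks the latest payload per event_id with max_by:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names: an "events" table with a unique event_id, a payload column,
# and an ingested_at timestamp used to pick the latest version of each record.
CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW events_deduplicated AS
SELECT
    event_id,
    max_by(payload, ingested_at) AS payload
FROM events
GROUP BY event_id
"""

athena.start_query_execution(
    QueryString=CREATE_VIEW_SQL,
    QueryExecutionContext={"Database": "my_database"},                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical results bucket
)
```

If you have nothing to order by, ARBITRARY(payload) works the same way for the non-unique columns.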
When running a query, Athena lists the objects on S3, and this is not a parallelizable operation (except for partitioned tables, where it is parallelizable down to the grain of the partitioning), and S3 listings are limited to a page size of 1,000. You really don't want to run Athena queries against tables (or partitions) with more than 1,000 files.
Write to S3 via a Kinesis Firehose and then query that via Athena. The Firehose will group your records into a relatively small number of files, such that it will then be efficient to query them via Athena. Indeed, it will even organize them into a folder structure that is nicely partitioned by write timestamp.
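A minimal sketch of the producer side, assuming a pre-created delivery stream (the stream name is a placeholder); Firehose's own buffering size/interval settings then decide how records are batched into S3 objects:

```python
import json

import boto3

firehose = boto3.client("firehose")


def archive_record(item: dict):
    """Send one JSON record to a Firehose delivery stream that batches records into S3."""
    firehose.put_record(
        DeliveryStreamName="my-archive-stream",  # hypothetical, pre-created delivery stream
        Record={"Data": (json.dumps(item) + "\n").encode("utf-8")},  # newline-delimited JSON for Athena
    )
```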
I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances that data loss would happen?
There are multiple use-cases for DynamoDB table data copy elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or worse yet drop table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or you could load the data directly into Redshift in a more ELT style, so transforms happen within Redshift.
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions is to note that this is not a point-in-time backup: so any changes to the underlying table happening during the backup might be inconsistent.
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions other than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to care about are occasional SDK misbehavior and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if, by mistake, you or a colleague ended up deleting a table from the console?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the read capacity while the backup is running, so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.
I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking for an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to kill the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
The Redshift COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - if you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move it to S3. That data can then easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, then the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated wherever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete those entries, or delete each one right after you back up its row.
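A rough sketch of that dual write, with hypothetical table and key names; the daily job would read the tracking table, copy just those rows, and then delete the processed entries:

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("orders")                 # hypothetical table names
changed_keys = dynamodb.Table("orders_changed_keys")


def save_order(order: dict):
    """Write the item and record its key so the daily copy only moves changed rows."""
    main_table.put_item(Item=order)
    changed_keys.put_item(Item={
        "order_id": order["order_id"],   # same primary key as the main table
        "changed_at": int(time.time()),
    })
    # The backup job scans orders_changed_keys, copies those rows to Redshift,
    # and deletes each tracking entry once the row has been backed up.
```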
If your DynamoDB table has
a timestamp as an attribute, or
a binary flag which conveys data freshness as an attribute,
then you can write a Hive query to export only the current day's data (or only the fresh data) to S3, and then do a 'KEEP_EXISTING' copy of this incremental S3 data to Redshift.