Unloading & reloading data between S3 and Redshift with schema changes - amazon-web-services

I'm interested in setting up some automated jobs that will periodically export data from our Redshift instance and store it on S3, where ideally it will then be bubbled back up into Redshift via an external table running in Redshift Spectrum. One thing I'm not sure of how to best deal with is the case of certain tables I'm working with changing in schema over time.
I'm able to both UNLOAD data from Redshift to S3 without a problem, and I'm also able to set up an external table within Redshift and have that S3 data available for querying. However, I'm not sure how to best deal with cases where our tables will change columns over time. For example, in the case of certain event data we capture through Segment, traits that get added will result in a new column on the Redshift table that won't have existed in previous UNLOADs. In Redshift, the column value for data that came in before the column existed will just result in NULL values.
What are best way to deal deal with this gradual change in data structure over time? If I just update the new fields in our external table will Redshift be able to deal with the fact that these fields don't necessarily exist on the older UNLOADs, or do I need to go some other route?

Related

Options to export selective data from one dynamodb table to another table in same region

Need to move data from one dynamodb table to another table after doing a transformation
What is the best approach to do that
Do I need to write a script to read selective data from one table and put in another table
or Do I need to follow CSV export
You need to write a script to do so. However, you may wish to first export the data to S3 using DynamoDB's native function as it does not impact capacity on the table, ensuring you do not impact production traffic for example.
If your table is not serving production traffic or the size of the table is not too large then you can simply use Lambda functions to read your items, transform and then write to the new table.
If your table is large, you can use AWS Glue to achieve the same result in a distributed fashion.
Is this a live table that is used on prod?
If it is what I usually do is.
Enable Dynamo streams (if not already enabled)
Create a lambda function that has access to both tables
Place transformation logic in the lambda
Subscribe the lambda to the dynamo stream
Update all fields on the original table (like update a new field called 'migrate')
Now all elements will flow through the lambda and it can store them with transformation on the new table
You can now switch to the new table
Check if everything still works
Delete lambda, old table, and disable dynamo streams (if needed)
This approach is the only one I found that can guarantee 100% uptime during the migration.
If the table is not live then you can just export it to S3 and then import it into the new table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html

AWS ETL Design Help: How Can I Maintain a new S3 Bucket with just the most recent data updates?

I am fairly new to AWS and am a bit overwhelmed with the options for a task I have.
Have: I have a dimensional model in an S3 bucket (read only access), that has a folder structure and contains partitioned parquet files. This bucket will be updated daily (+40GB a day), with changes to both dim and fact tables. I need to get this data out of S3, but it's extremely inefficient to set up a boto3 connection and repeatedly pull the entire raw data and continuously check if the data has even been updated.
What I was thinking for a solution: To maintain updated tables in another S3 bucket that I create (likely Athena query outputs), where I can just pull in the updated changes, so that boto can just check if there is data in the new bucket and pull, reducing load.
Considerations:
I need some kind of event notification that triggers the Athena query. I was looking into Lambda or Cloudwatch, but unsure which is better or restraints.
For the fact tables, I need an Athena query that gets the most recent "Last Updated" timestamp from the updated data. And then updates the updated bucket tables to include all the raw data that is greater than the found timestamp.
FYI: I am working with partitioned data, and I am not sure if I can just work with the tables as partitions (part-0000dim-table-3.parquet) or if additional steps are required to work with partitions.
For the dim tables, I need to somehow scan the entire table for changes (dim tables are a combination of SCD 0,1,2)... unsure how best to do this. In the worst case, I could just point the boto3 connection to the raw dim tables whenever the fact tables update.
What AWS APIs, workflows, should I think about using?
I am unclear on the constraints that I could run into with either Lambda, Cloudwatch, Athena step functions, etc. and trying to learn as I go. I am also struggling on how to compare Athena query results across the two buckets.
Thank you very much & if there is any more information that would help, just let me know!!

Retention and archival policy on Hive data

We have an AWS EMR which includes a Hive backed by aurora metadata and data stored in s3. There are programs that create the database(s) and tables inside in Hive and populate data.
After a while, these databases are no longer needed (say after 1 year). We want to delete those hive databases automatically after a set period. The usual way is to set a cron job that runs every month or so, to find the databases from an internal metadata table that are older than 1 year, and programmatically fire the queries in Hive which deletes it. But this has some drawbacks like Manually created tables are not being covered.
Is there any hive built-in feature that does the above?
Hive is actually just a metadata store that defines how data should be interpreted. It does not manage any of the underlying data. (This is a major difference between hive and a conventional database. And why hive can use multiple file backends(hdfs&S3) in the same hive instance.)
I'm going to guess you are using an s3 bucket for you data so you likely want to look into expiring objects. This will do exactly what you want. Delete data after a period of time. This will not disrupt hive.
If you are using partitions you may wish to do some additional cleanup.
MSCK REPAIR TABLE will help maintain the partitions in hive but is really slow in S3 and periodically can timeout. YMMV.
It's better to drop partitions:
ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;
In Hive you can implement partitions retention (since Hive 3.1.0)
For example to drop partitions and their data after 7 days:
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
There is not a hive internal tool that removes 'databases' according to a "retention period" in hive.
You have been doing this for a while so you are likely well aware of the risks of deleting metadata older than a year.
There are several ways to define retention on data, but none that I'm aware to remove metadata.
Things you could look at:
You could add a trigger to Aurora to delete tables directly from the hive metadata. (Hive tables have values for create time and they're last access time) you could create some logic to work at that level.

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days.
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job's run aggregated results vs. the current runs aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous runs aggregated results S3 file and another table from the current runs aggregated results S3 file, run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use step functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise you overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old which are in S3. This is easily done with Redshift Spectrum or Athena as they can makes files (or sets of files) the source for a table. Then it is just running the queries.
In my opinion Glue is an ETL tool that can do high power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock you self out; If not, I would avoid.
As for data management, yes you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said if you organize you data in S3 in "month" folders then unneeded removing months of data can just be deleting these folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you say "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides wrt the compute elements which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again then the Redshift is the best option - The data is moved once to a place where the analytics need to run. However, Redshift is an expensive storage solution - it's not what it is meant to do. Redshift Spectrum is very interesting in that the initial aggregations of data is done in S3 and much reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described on area where you need a solution and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. This would change the optimization point.

Clear All Existing Entries In DynamoDB Table In AWS Data Pipeline

My goal is to take daily snapshots of an RDS table and put it in a DynamoDB table. The table should only contain data from a single day.
For this have a Data Pipeline set up to query a RDS table and publish the results into S3 in CSV format.
Then a HiveActivity imports this CSV into a DynamoDB table by creating external tables for the file and an existing DynamoDB table.
This works great, but older entries from the previous day still exist in the DynamoDB table. I want to do this within Data Pipeline if at all possible. I need to:
1) Find a way to clear the DynamoDB table, or at least drop/recreate it, or
2) Include an extra column of the snapshot date and find a way to clear out all older entries.
Any ideas on how I can do this?
You can use DynamoDb Time to Live(TTL) which allows you to set an expiration time after which items are auto deleted from the DynamoDb table. TTL is very useful for cases where data loses it's relevance after a specific time period and in your case it can be start time of next day.