AWS Redshift purge policy automation - amazon-web-services

AWS Redshift team recommend using TRUNCATE in order to clean up a large table.
I have a continuous EC2 service that keeps adding rows to a table. I would like to apply some purging mechanism, so that when the cluster is near full it will auto delete old rows (say using the index column).
Is there some best practice for doing that?
Do I need to write my own code to handle that? (if so is there already a Python script for that that I can use e.g. in a Lambda function?)

A common practice when dealing with continuous data is to create a separate table for each month, eg Sales-2018-01, Sales-2018-02.
Then create a VIEW that combines the tables:
CREATE VIEW sales AS
SELECT * FROM Sales-2018-01
UNION
SELECT * FROM Sales-2018-02
Then, create a new table each month and remove the oldest month from the View. This effectively gives a 12-month rolling view of the data.
The benefit is that data does not have to be deleted from tables (which would then require a VACUUM). Instead, the old table can simply be dropped, or kept around for historical reporting with a different View.
See: Using Time Series Tables - Amazon Redshift

Related

Getting incremental data from Amazon Aurora to Redshift via DMS using CDC

my company wants to build a Data warehouse in Redshift. We have an OLTP database running in Amazon Aurora and we are thinking of using the DMS (data migration service). I am trying to get my head around the capabilities of CDC (change data capture). The thing is that CDC (over DMS) replicates and stores changes (in our case in Redshift) and I was wondering if it is possible to select specific columns which I want to store (this should be possible to do with table mapping - include) and based on which I want to store? As far as I understand it, if any columns of a row are updated, then the replication is triggered, which could mean a replication that is useless (e.g. if somebody updates a column that I do not want to follow)
E.g. I have a table with leads, which has some 30 columns. Now I am interested in the DW purposes only in 5 columns and I want to get a new line to the redshift table only if any of those 5 columns changes (is updated)... like if the stage of lead is changed, I will get a new line. On the other hand, I am not interested in the column 'Salesmans_comment' so if the salesman updates a comment, I do not want to have a new line, because I am not interested in it...Cheers!
I have run through most of the available yt tutorials and read through the documentation, but I haven't found a clear answer...
Thanks

how to get AWS quicksight to show the old and new value of a particular column of a table (for comparison purposes)?

what I have seen so far is that the aws glue crawler creates the table based on the latest changes in the s3 files.
let's say crawler creates a table and then I upload a CSV with updated values in one column. the crawler is run again and it updates the table's column with the updated values. I want to be able to show a comparison of the old and new data in quick sight eventually, is this scenario possible?
for example,
right now my csv file is set up as details of one aws service, like RDS is the csv file name and the columns are account id, account name, what region is it in, etc etc
there was one column of percentage with a value 50%, it gets updated with 70%. would I be able to somehow get the old value as well to show in quicksight, to say like previously it was 50% and now its 70%
Maybe this scenerio is not even valid? because I want to be able to show like what account has what cost in xyz month and show how the cost is different in other months. If I make separate tables on each update of csv then there would be 1000+ tables at one point.
If I have understood your question correctly, you are aiming to track data over time. Above you suggest creating a table for each time series, why not instead maintain a record in a table for each time series, you can then create various Analysis over the data, comparing specific months or tracking month-by-month values.

Optimal Big Data solution for aggregating time-series data and storing results to DynamoDB

I am looking into different Big Data solutions and have not been able to find a clear answer or documentation on what might be the best approach and frameworks/services to use to address my Big Data use-case.
My Use-case:
I have a data producer that will be sending ~1-2 billion events to a
Kinesis Data Firehose delivery stream daily.
This data needs to be stored in some data lake / data warehouse, aggregated, and then
loaded into DynamoDB for our service to consume the aggregated data
in its business logic.
The DynamoDB table needs to be updated hourly. (hourly is not a hard requirement but we would like DynamoDB to be updated as soon as possible, at the longest intervals of daily updates if required)
The event schema is similar to: customerId, deviceId, countryCode, timestamp
The aggregated schema is similar to: customerId, deviceId, countryCode (the aggregation is on the customerId's/deviceId's MAX(countryCode) for each day over the last 29 days, and then the MAX(countryCode) overall over the last 29 days.
Only the CustomerIds/deviceIds that had their countryCode change from the last aggregation (from an hour ago) should be written to DynamoDB to keep required write capacity units low.
The raw data stored in the data lake / data warehouse needs to be deleted after 30 days.
My proposed solution:
Kinesis Data Firehose delivers the data to a Redshift staging table (by default using S3 as intermediate storage and then using the COPY command to load to Redshift)
An hourly Glue job that:
Drops the 30 day old time-series table and creates a new time-series table for today in Redshift if this is the first job run of a new day
Loads data from staging table to the appropriate time-series table
Creates a view on top of the last 29 days of time-series tables
Aggregates by customerId, deviceId, date, and MAX(CountryCode)
Then aggregates by customerId, deviceId, MAX(countryCode)
Writes the aggregated results to an S3 bucket
Checks the previous hourly Glue job's run aggregated results vs. the current runs aggregated results to find the customerIds/deviceIds that had their countryCode change
Writes the customerIds/deviceIds rows that had their countryCode change to DynamoDB
My questions:
Is Redshift the best storage choice here? I was also considering using S3 as storage and directly querying data from S3 using a Glue job, though I like the idea of a fully-managed data warehouse.
Since our data has a fixed retention period of 30 days, AWS documentation: https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-time-series-tables.html suggests to use time-series tables and running DROP TABLE on older data that needs to be deleted. Are there other approaches (outside of Redshift) that would make the data lifecycle management easier? Having the staging table, creating and loading into new time-series tables, dropping older time-series tables, updating the view to include the new time-series table and not the one that was dropped could be error prone.
What would be an optimal way to find the the rows (customerId/deviceId combinations) that had their countryCode change since the last aggregation? I was thinking the Glue job could create a table from the previous runs aggregated results S3 file and another table from the current runs aggregated results S3 file, run some variation of a FULL OUTER JOIN to find the rows that have different countryCodes. Is there a better approach here that I'm not aware of?
I am a newbie when it comes to Big Data and Big Data solutions so any and all input is appreciated!
tldr: Use step functions, not Glue. Use Redshift Spectrum with data in S3. Otherwise you overall structure looks on track.
You are on the right track IMHO but there are a few things that could be better. Redshift is great for sifting through tons of data and performing analytics on it. However I'm not sure you want to COPY the data into Redshift if all you are doing is building aggregates to be loaded into DDB. Do you have other analytic workloads being done that will justify storing the data in Redshift? Are there heavy transforms being done between the staging table and the time series event tables? If not you may want to make the time series tables external - read directly from S3 using Redshift Spectrum. This could be a big win as the initial data grouping and aggregating is done in the Spectrum layer in S3. This way the raw data doesn't have to be moved.
Next I would advise not using Glue unless you have a need (transform) that cannot easily be done elsewhere. I find Glue to require some expertise to get to do what you want and it sounds like you would just be using it for a data movement orchestrator. If this impression is correct you will be better off with a step function or even a data pipeline. (I've wasted way too much time trying to get Glue to do simple things. It's a powerful tool but make sure you'll get value from the time you will spend on it.)
If you are only using Redshift to do these aggregations and you go the Spectrum route above you will want to get as small a cluster as you can get away with. Redshift can be pricy and if you don't use its power, not cost effective. In this case you can run the cluster only as needed but Redshift boot up times are not fast and the smallest clusters are not expensive. So this is a possibility but only in the right circumstances. Depending on how difficult the aggregation is that you are doing you might want to look at Athena. If you are just running a few aggregating queries per hour then this could be the most cost effective approach.
Checking against the last hour's aggregations is just a matter of comparing the new aggregates against the old which are in S3. This is easily done with Redshift Spectrum or Athena as they can makes files (or sets of files) the source for a table. Then it is just running the queries.
In my opinion Glue is an ETL tool that can do high power transforms. It can do a lot of things but is not my first (or second) choice. It is touchy, requires a lot of configuration to do more than the basics, and requires expertise that many data groups don't have. If you are a Glue expert, knock you self out; If not, I would avoid.
As for data management, yes you don't want to be deleting tons of rows from the beginning of tables in Redshift. It creates a lot of data reorganization work. So storing your data in "month" tables and using a view is the right way to go in Redshift. Dropping tables doesn't create this housekeeping. That said if you organize you data in S3 in "month" folders then unneeded removing months of data can just be deleting these folders.
As for finding changing country codes this should be easy to do in SQL. Since you are comparing aggregate data to aggregate data this shouldn't be expensive either. Again Redshift Spectrum or Athena are tools that allow you to do this on S3 data.
As for being a big data newbie, not a worry, we all started there. The biggest difference from other areas is how important it is to move the data the fewest number of times. It sounds like you understand this when you say "Is Redshift the best storage choice here?". You seem to be recognizing the importance of where the data resides wrt the compute elements which is on target. If you need the horsepower of Redshift and will be accessing the data over and over again then the Redshift is the best option - The data is moved once to a place where the analytics need to run. However, Redshift is an expensive storage solution - it's not what it is meant to do. Redshift Spectrum is very interesting in that the initial aggregations of data is done in S3 and much reduced partial results are sent to Redshift for completion. S3 is a much cheaper storage solution and if your workload can be pattern-matched to Spectrum's capabilities this can be a clear winner.
I want to be clear that you have only described on area where you need a solution and I'm assuming that you don't have other needs for a Redshift cluster operating on the same data. This would change the optimization point.

Update AWS Athena data & table to rename columns

Today, I saw myself with a simple problem, renaming column of an Athena glue table from old to new name.
First thing, I search here and tried some solutions like this, this, and many others. Unfortunately, none works, so I decided to use my knowledge and imagination.
I'm posting this question with the intention of share, but also, with the intention to get how others did and maybe find out I reinvented the wheel. So please also share your way if you know how to do it.
My setup is, a Athena JSON table partitioned by day with valuable and enormous amount of data, the infrastructure is defined and updated through Cloudformation.
How to rename an Athena column and still keep the data?
Explaining without all the cloudformation infrastructure.
Imagine a table containing:
userId
score
otherColumns
eventDateUtc
dt_utc
Partitioned by dt_utc and stored using JSON format. Wee need to change the column score to deltaScore.
Keep in mind, although I haven't tested with others format/configurations, this should apply to any configuration supported by athena as we are going to use athena algorithm to do the job for us.
How to do
if you run the cloudformation migration first, you gonna "lose" access to the dropped column.
but you can simply rename the column back and the data appears.
Those are the steps required for rename a AWS Athena table:
Create a temporary table mapping the old column name to the new one:
This can be done by use of CREATE TABLE AS, read more in the aws docs
With this command, we use Athena engine to apply the transformation on the files of the original table for us and save at s3://bucket_name/A_folder/temp_table_rename/.
CREATE TABLE "temp_table_rename"
WITH(
format = 'JSON',
external_location = 's3://bucket_name/A_folder/temp_table_rename/',
partitioned_by = ARRAY['dt_utc']
)
AS
SELECT DISTINCT
userid,
score as deltascore,
otherColumns,
eventDateUtc,
"dt_utc"
FROM "my_database"."original_table"
Apply the database rename by running the cloudformation with the changes or on the way you have.
At this point, you can even drop the original_table, and create again using the right column name.
After rename, you will notice that the renamed column have no data.
Remove the data of the original table by deleting it's s3 source.
Copy the data from the temp table source to the original table source
I prefer to use a aws command as, there can be thousands of files to copy
aws s3 cp s3://bucket_name/A_folder/temp_table_rename/ s3://bucket_name/A_folder/original_table/ --recursive
Restore the index of the original table
MSCK REPAIR TABLE "my_database"."original_table"
done.
Final notes:
Using CREATE TABLE AS to do the transformation job, allow you to do much more than only renaming the column, for example split the data of a column into 2 new columns, or merge it to a single one.

How to update Athena tables in a zero-down-time manner?

I am deploying Athena external tables, and want to update their definition without downtime, is there a way?
The ways I thought about are:
Create a new table and rename the old and then rename the new to the old name, first, it involves a very small downtime, and renaming tables doesn't seem to be supported (neither altering the definition).
The other way is to drop the table and recreate it, which obviously involves downtime.
If you use the Glue UpdateTable API call you can change a table definition without first dropping the table. Athena uses the Glue Catalog APIs behind the scenes when you do things like CREATE TABLE … and DROP TABLE ….
Please note that if you make changes to partitioned tables you also need to update all partitions to match.
Another way that does not involve using Glue directly would be to create a view for each table and only use these views in queries. When you need to replace a table you create a new table with the new schema, then recreate the view (with CREATE OR REPLACE) to use the new table, then drop the old table. I haven't checked, but it would make sense if replacing a view used the UpdateTable API behind the scenes.