Keep existing files on loading redshift table with aws pipeline - amazon-web-services

Im am configuring AWS pipline to load a redshift table with data from JSON S3 file.
Im using RedshiftActivity and everything was good until i try to configure KEEP_EXISTING load method. I really do not want to truncate my table with each load but keep existing information and ADD new records.
Redshift activity seems to require PRIMARY KEY defined in the table in order towork (OK) ... now it's also requresting me to configure DISTRIBUTION KEY, but i am interested in EVEN distribution and it seems that DISTRIBUTION KEY cannot work aside with EVEN distribution style.
Can i simulate EVEN distribution using a distribution key?
Thanks.

I don't bother with primary key when creating tables in Redshift. For distkey, you want to pick a field whose values are randomly distributed, ideally.
In your case of incremental insertion, what I normally do is just use SQLActivity to copy the data from s3 to a staging table in Redshift. Then I perform the update/insert/dedup and whatever steps, depending on business logic. Finally I drop the staging table. Done.

Related

Options to export selective data from one dynamodb table to another table in same region

Need to move data from one dynamodb table to another table after doing a transformation
What is the best approach to do that
Do I need to write a script to read selective data from one table and put in another table
or Do I need to follow CSV export
You need to write a script to do so. However, you may wish to first export the data to S3 using DynamoDB's native function as it does not impact capacity on the table, ensuring you do not impact production traffic for example.
If your table is not serving production traffic or the size of the table is not too large then you can simply use Lambda functions to read your items, transform and then write to the new table.
If your table is large, you can use AWS Glue to achieve the same result in a distributed fashion.
Is this a live table that is used on prod?
If it is what I usually do is.
Enable Dynamo streams (if not already enabled)
Create a lambda function that has access to both tables
Place transformation logic in the lambda
Subscribe the lambda to the dynamo stream
Update all fields on the original table (like update a new field called 'migrate')
Now all elements will flow through the lambda and it can store them with transformation on the new table
You can now switch to the new table
Check if everything still works
Delete lambda, old table, and disable dynamo streams (if needed)
This approach is the only one I found that can guarantee 100% uptime during the migration.
If the table is not live then you can just export it to S3 and then import it into the new table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html

AWS ETL Design Help: How Can I Maintain a new S3 Bucket with just the most recent data updates?

I am fairly new to AWS and am a bit overwhelmed with the options for a task I have.
Have: I have a dimensional model in an S3 bucket (read only access), that has a folder structure and contains partitioned parquet files. This bucket will be updated daily (+40GB a day), with changes to both dim and fact tables. I need to get this data out of S3, but it's extremely inefficient to set up a boto3 connection and repeatedly pull the entire raw data and continuously check if the data has even been updated.
What I was thinking for a solution: To maintain updated tables in another S3 bucket that I create (likely Athena query outputs), where I can just pull in the updated changes, so that boto can just check if there is data in the new bucket and pull, reducing load.
Considerations:
I need some kind of event notification that triggers the Athena query. I was looking into Lambda or Cloudwatch, but unsure which is better or restraints.
For the fact tables, I need an Athena query that gets the most recent "Last Updated" timestamp from the updated data. And then updates the updated bucket tables to include all the raw data that is greater than the found timestamp.
FYI: I am working with partitioned data, and I am not sure if I can just work with the tables as partitions (part-0000dim-table-3.parquet) or if additional steps are required to work with partitions.
For the dim tables, I need to somehow scan the entire table for changes (dim tables are a combination of SCD 0,1,2)... unsure how best to do this. In the worst case, I could just point the boto3 connection to the raw dim tables whenever the fact tables update.
What AWS APIs, workflows, should I think about using?
I am unclear on the constraints that I could run into with either Lambda, Cloudwatch, Athena step functions, etc. and trying to learn as I go. I am also struggling on how to compare Athena query results across the two buckets.
Thank you very much & if there is any more information that would help, just let me know!!

Row level changes captured via AWS DMS

I am trying to migrate the database using AWS DMS. Source is Azure SQL server and destination is Redshift. Is there any way to know the rows updated or inserted? We dont have any audit columns in source database.
Redshift doesn’t track changes and you would need to have audit columns to do this at the user level. You may be able to deduce this from Redshift query history and save data input files but this will be solution dependent. Query history can be achieved in a couple of ways but both require some action. The first is to review the query logs but these are only saved for a few days. If you need to look back further than this you need a process to save these tables so the information isn’t lost. The other is to turn on Redshift logging to S3 but this would need to be turned on before you run queries on Redshift. There may be some logging from DMS that could be helpful but I think the bottom line answer is that row level change tracking is not something that is on in Redshift by default.

Strategy for Updating Schema/Data of Data Stored in AWS S3

At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quick set up for reporting off of raw data (stored in S3). The problem we've come against is what to do if we notice we need to somehow update the data that's already stored in S3. For example, we want to update values in a column that have a certain string to update that value.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think is to write a custom tool that iterates through an S3 bucket, loads a file, provides the transformation, and puts it back, overwriting the original. It seems there has to be a better way though.
Updates are not handled in a native way in a traditional hive-like warehousing solution, which I deem Athena to be. A common solution is a kind of engineering workaround where you do "insert overwrite" a partition (borrowing Hive syntax, possible in Presto and hopefully also possible in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
As this is a common problem, there are also some ready to use solutions to it, but I do not know whether which/whether they are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently required Hive runtime)
Data Lake (open sourced by Databricks; updates currently require Spark runtime)
Hudi (I know little about this one)

Unloading & reloading data between S3 and Redshift with schema changes

I'm interested in setting up some automated jobs that will periodically export data from our Redshift instance and store it on S3, where ideally it will then be bubbled back up into Redshift via an external table running in Redshift Spectrum. One thing I'm not sure of how to best deal with is the case of certain tables I'm working with changing in schema over time.
I'm able to both UNLOAD data from Redshift to S3 without a problem, and I'm also able to set up an external table within Redshift and have that S3 data available for querying. However, I'm not sure how to best deal with cases where our tables will change columns over time. For example, in the case of certain event data we capture through Segment, traits that get added will result in a new column on the Redshift table that won't have existed in previous UNLOADs. In Redshift, the column value for data that came in before the column existed will just result in NULL values.
What are best way to deal deal with this gradual change in data structure over time? If I just update the new fields in our external table will Redshift be able to deal with the fact that these fields don't necessarily exist on the older UNLOADs, or do I need to go some other route?