I was asked for a solution to migrate data from Redshift to an RDS instance.
The migration will involve only a few tables, and I'm trying to migrate them to a PostgreSQL instance.
Any clue where I could begin?
You have multiple options to do it. Which one suits you best depends on several factors: the size of the data, whether this is a one-time or recurring process, and so on.
The typical process looks like this:
1. Select and export the data.
2. Stage the exported data somewhere the load can read it (EC2, another machine, or S3).
3. Load the data.
Here are the options for each step:
For step 1, you have multiple options, but the best one is probably to use Redshift's UNLOAD command to export the data to S3.
For step 2, there are multiple options for where to stage the data (local disk, EC2, S3, ...), but I think the best option is S3, and that is already taken care of by the export in step 1.
For step 3, it depends on the RDS engine, but most databases support importing data from CSV, so use that; for PostgreSQL this is the COPY ... FROM command (or \copy from psql).
If it's Redshift to MySQL, you may want to refer to the steps I suggested in an earlier answer.
I hope this helps. Feel free to ask if you have specific additional questions.
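To make steps 1-3 concrete, here is a minimal sketch using psycopg2 and boto3; the IAM role, bucket, file, table names, and connection details are all placeholders:

import boto3
import psycopg2

# Step 1: export from Redshift to S3 with UNLOAD (role, bucket, and table are placeholders).
redshift = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                            port=5439, dbname="dev", user="admin", password="...")
redshift.autocommit = True
with redshift.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM my_schema.my_table')
        TO 's3://my-bucket/exports/my_table_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        CSV PARALLEL OFF ALLOWOVERWRITE;
    """)
redshift.close()

# Step 2: the data is now staged in S3; pull the single unloaded file down locally.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "exports/my_table_000", "/tmp/my_table.csv")

# Step 3: load the CSV into RDS PostgreSQL with COPY FROM STDIN.
postgres = psycopg2.connect(host="my-rds.xxxx.rds.amazonaws.com",
                            port=5432, dbname="mydb", user="admin", password="...")
with postgres, postgres.cursor() as cur, open("/tmp/my_table.csv") as f:
    cur.copy_expert("COPY my_schema.my_table FROM STDIN WITH (FORMAT csv)", f)
postgres.close()

For bigger tables, the aws_s3 extension on RDS PostgreSQL can import directly from S3, which avoids pulling the file through an intermediate machine.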
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema will have one table, with each company's data distinguished by attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL; would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running (which automatically downloads the data from another site and then uploads it to S3) could be an option, deleting NaN columns and adding a column for the site name, but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a Python script to edit the .csv to start with, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
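If it helps, a minimal sketch of that kind of edit with pandas and boto3 (the file names, the added site value, and the bucket are placeholders):

import boto3
import pandas as pd

# Load the raw CSV that the existing script downloaded (path is a placeholder).
df = pd.read_csv("company_a_raw.csv")

# Drop columns that are entirely NaN and add the distinguishing attribute.
df = df.dropna(axis=1, how="all")
df["site_name"] = "company_a_site_1"   # placeholder value

# Write the edited file and upload it to S3 for the Glue crawler to pick up.
df.to_csv("company_a_clean.csv", index=False)
boto3.client("s3").upload_file("company_a_clean.csv", "my-bucket", "clean/company_a_clean.csv")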
I am trying to migrate a database using AWS DMS. The source is Azure SQL Server and the destination is Redshift. Is there any way to know which rows were updated or inserted? We don't have any audit columns in the source database.
Redshift doesn't track changes, and you would need audit columns to do this at the user level. You may be able to deduce it from the Redshift query history and saved data input files, but that will be solution dependent. Query history can be retrieved in a couple of ways, but both require some action. The first is to review the query logs, but these are only kept for a few days; if you need to look back further than that, you need a process that saves these tables so the information isn't lost. The other is to turn on Redshift audit logging to S3, but this has to be enabled before you run the queries on Redshift. There may be some logging from DMS that could be helpful, but I think the bottom-line answer is that row-level change tracking is not something Redshift does by default.
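For example, pulling recent statements that touched a given table out of the query history might look something like this (a sketch using psycopg2; the table name and connection details are placeholders):

import psycopg2

# Connection details and the table name are placeholders.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
with conn.cursor() as cur:
    # stl_query only keeps a few days of history, so run (and save) this regularly.
    cur.execute("""
        SELECT starttime, TRIM(querytxt) AS querytxt
        FROM stl_query
        WHERE querytxt ILIKE '%my_target_table%'
        ORDER BY starttime DESC;
    """)
    for starttime, querytxt in cur.fetchall():
        print(starttime, querytxt[:120])
conn.close()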
I need to migrate all records (3 billion) from one MySQL Aurora table to 5 different tables in the same cluster.
A transformation of 2 columns also has to happen.
When we migrate, we need to convert XML to JSON, and the JSON will be stored in one of the destination tables.
We are looking for the best way to migrate this data from one MySQL table to the others. We are on AWS, so we have the flexibility to use any service that can help us achieve this.
So far this is what we have planned:
MySQL table ----DMS----> S3 ----Lambda to convert XML to JSON and create 5 types of files----> Lambda on file creation runs LOAD DATA LOCAL into 5 different MySQL tables.
One thing we would like to know is how we can handle the case where LOAD DATA LOCAL fails partway through. The Lambda will submit the LOAD DATA LOCAL query from S3 to MySQL, but how can we track in the Lambda whether LOAD DATA LOCAL succeeded or failed?
We cannot use any direct approach because we need to transform the data in between.
Is there a better way we can use here?
Can we use Data Pipeline in place of the Lambda function for LOAD DATA LOCAL?
Or can we use DMS to load the files from S3 into MySQL?
Please suggest the best way to do this, one that can handle the failure scenario. A rough sketch of the load step we have in mind is below.
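This is the kind of Lambda handler we mean (a sketch only; the bucket, file, table name, and connection details are placeholders, and the cluster parameter group would need to allow local_infile):

import boto3
import pymysql

def handler(event, context):
    # Pull the transformed file that triggered this invocation down to /tmp.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    local_path = "/tmp/chunk.csv"
    boto3.client("s3").download_file(bucket, key, local_path)

    # Connection details are placeholders; local_infile must be enabled client- and server-side.
    conn = pymysql.connect(host="my-aurora.cluster-xxxx.rds.amazonaws.com",
                           user="admin", password="...", database="mydb",
                           local_infile=True)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE destination_table_1 "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'",
                (local_path,))
            loaded = cur.rowcount  # rows actually loaded by this statement
        conn.commit()
        print("loaded rows:", loaded)
    except pymysql.MySQLError:
        # Any load error raises here; re-raise so the S3 event can be retried or sent to a DLQ.
        conn.rollback()
        raise
    finally:
        conn.close()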
What you are basically doing is an ETL process. I would advise you to look into either AWS EMR or AWS Glue. Since you don't seem to have that much experience, I would use Glue.
With Glue you can basically read from MySQL, do the transformation, and write back directly to MySQL. Also, since Glue runs Spark in the background, you can leverage its distributed computing, which will speed up your process compared with a single-threaded Lambda function.
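A minimal sketch of what such a Glue (PySpark) job could look like, assuming a Glue Data Catalog entry for the Aurora source table and an xmltodict module made available to the job; every database, table, column, and connection name below is a placeholder:

import json

import xmltodict  # assumed available to the job, e.g. via --additional-python-modules
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table through the Glue Data Catalog (database/table names are placeholders).
source = glue_context.create_dynamic_frame.from_catalog(
    database="aurora_db", table_name="source_table").toDF()

# Convert the XML payload column to JSON with a plain Python UDF (column names are placeholders).
xml_to_json = udf(lambda x: json.dumps(xmltodict.parse(x)) if x else None, StringType())
transformed = source.withColumn("payload_json", xml_to_json(source["payload_xml"]))

# Write one slice of the data back to one of the five destination tables over JDBC
# (connection options are placeholders; repeat with different filters/tables as needed).
transformed.write.format("jdbc").options(
    url="jdbc:mysql://my-cluster.cluster-xxxx.rds.amazonaws.com:3306/mydb",
    dbtable="destination_table_1",
    user="admin",
    password="...",
).mode("append").save()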
At my organization, we are using a stack of AWS S3, AWS Glue, and Athena to drive some reporting of internal metrics. In general, this stack is great for quickly setting up reporting off of raw data (stored in S3). The problem we've come up against is what to do if we notice we need to update data that's already stored in S3. For example, we may want to find values in a column that contain a certain string and update them.
Unlike a database, we can't just run a query to update all the existing data. I've tried to see if we can utilize Glue Jobs to accomplish this, but from my limited understanding, it doesn't seem like it's meant to do ETL from a bucket back to the same bucket.
The only thing I can think of is to write a custom tool that iterates through an S3 bucket, loads each file, applies the transformation, and puts it back, overwriting the original (roughly the sketch below). It seems like there has to be a better way, though.
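Something like this sketch is what I mean by a custom tool (the bucket, prefix, column index, and replacement rule are all placeholders):

import boto3
import csv
import io

s3 = boto3.client("s3")
bucket, prefix = "my-reporting-bucket", "raw/"   # placeholders

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Apply the column fix row by row (the replacement rule is a placeholder).
        rows = list(csv.reader(io.StringIO(body)))
        for row in rows:
            if len(row) > 2 and "old_value" in row[2]:
                row[2] = row[2].replace("old_value", "new_value")

        out = io.StringIO()
        csv.writer(out).writerows(rows)
        # Overwrite the original object in place.
        s3.put_object(Bucket=bucket, Key=key, Body=out.getvalue().encode("utf-8"))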
Updates are not handled in a native way in a traditional Hive-like warehousing solution, which I consider Athena to be. A common solution is an engineering workaround where you "insert overwrite" a partition (borrowing Hive syntax; this is possible in Presto and hopefully also in Athena, which is based on Presto).
Other solutions include creating new tables and atomically replacing a view, which users are supposed to query, instead of querying the underlying table(s) directly.
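As a rough sketch of that view-swap pattern, driven through boto3 and Athena CTAS (the database, tables, view, S3 locations, and the transformation itself are all placeholders):

import boto3
import time

athena = boto3.client("athena")

def run(sql):
    # Fire an Athena query and wait for it to finish (output location is a placeholder).
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "reporting"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

# 1. Build a corrected copy of the data with CTAS (the transformation is a placeholder).
run("""
    CREATE TABLE metrics_v2
    WITH (external_location = 's3://my-reporting-bucket/metrics_v2/', format = 'PARQUET') AS
    SELECT col_a, REPLACE(col_b, 'old_value', 'new_value') AS col_b, col_c
    FROM metrics_v1
""")

# 2. Atomically repoint the view that users query.
run("CREATE OR REPLACE VIEW metrics AS SELECT * FROM metrics_v2")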
As this is a common problem, there are also some ready-to-use solutions for it, but I do not know whether they are possible with Athena. They are certainly possible with Presto (Presto SQL):
Hive ACID transactional tables (updates currently require a Hive runtime)
Delta Lake (open sourced by Databricks; updates currently require a Spark runtime)
Hudi (I know little about this one)
In Redshift, there's an STL_QUERY table that stores queries that were run over the last 5 days. I'm trying to find a way to keep more than 5 days' worth of records. Here are some things that I've considered:
1. Is there a Redshift setting for this? It would appear not.
2. Could I use a trigger? Triggers are not available in Redshift, so this is a no-go.
3. Could I create an Amazon Data Pipeline job to periodically "scrape" the STL_QUERY table? I could, so this is an option. Unfortunately, I would have to give the pipeline an EC2 instance to run this work, and it seems like a waste to have an instance sitting around just to scrape this table once a day.
4. Could I use an Amazon Simple Workflow job to scrape the table? I could, but it suffers from the same issues as option 3.
Are there any other options or ideas that I'm missing? I would prefer an option that does not involve dedicating an EC2 instance, even if it means paying for an additional service (provided it's cheaper than the EC2 instance I would have used in its stead).
Keep it simple, do it all in Redshift.
First, use "CREATE TABLE … AS" to save all current history into a permanent table.
CREATE TABLE admin.query_history AS SELECT * FROM stl_query;
Second, schedule a job on a machine you control to run this every day, using psql to run it:
INSERT INTO admin.query_history SELECT * FROM stl_query WHERE query > (SELECT MAX(query) FROM admin.query_history);
Done. :)
Notes:
You need an 8.x version of psql if you haven't set this up yet.
Even if your job doesn't run for a few days stl_query keeps enough history that you'll be covered.
As per your comment, it might be safer to use starttime instead of query as the criterion.
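Putting the notes together, a minimal sketch of that daily job using psycopg2 instead of psql (connection details are placeholders, and starttime is used as the criterion):

import psycopg2

# Run this once a day (e.g. from cron); connection details are placeholders.
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="admin", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO admin.query_history
        SELECT * FROM stl_query
        WHERE starttime > (SELECT MAX(starttime) FROM admin.query_history);
    """)
conn.close()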