We have a couple of MySQL tables in RDS that are huge (over 700 GB) and that we'd like to migrate to a DynamoDB table. Can you suggest a strategy or a direction for doing this in a clean, parallelized way? Perhaps using EMR or AWS Data Pipeline.
You can use AWS Data Pipeline. There are two basic templates: one for moving RDS tables to S3 and a second for importing data from S3 into DynamoDB. You can create your own pipeline by combining both templates.
Regards
One thing to consider with such large data is whether DynamoDB is the best option.
If this is statistical data or otherwise "big data", check out Amazon Redshift, which might be better suited to your situation.
We have done similar work, and there is probably a better strategy for this: using AWS DMS and some prep tables within your source instance.
It involved two steps:
Create new tables within your source instance that exactly match the DynamoDB schema, for example by merging multiple tables into one.
Set up a DMS task with the prep tables as the source and DynamoDB as the target. Since the prep tables and the target schema now match, it should be pretty straightforward from this point.
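As a rough sketch of the second step, a DMS task takes a table-mapping JSON document whose selection rules say which source tables to migrate. The helper below builds one selection rule per prep table. The schema name "appdb" and the prep table names are hypothetical placeholders, and a DynamoDB target may also need object-mapping rules, which are left out here.

```python
import json


def build_table_mappings(schema_name, table_names):
    """Build a DMS table-mapping document with one selection rule per prep table."""
    rules = []
    for index, table_name in enumerate(table_names, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(index),
            "rule-name": f"include-{table_name}",
            "object-locator": {
                "schema-name": schema_name,
                "table-name": table_name,
            },
            "rule-action": "include",
        })
    return {"rules": rules}


if __name__ == "__main__":
    # "appdb" and both table names are assumptions made for illustration only.
    mappings = build_table_mappings("appdb", ["orders_prep", "customers_prep"])
    print(json.dumps(mappings, indent=2))
```

The resulting JSON is what you would paste into (or pass as) the table mappings of the DMS task that has the prep tables as its source.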
Right now we have an ETL that extracts info from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to some OLAP solution. The table is only read to do some calculations, whose results we store in our OLTP database.
Which service fits best here?
We are currently evaluating Redshift but have never used the service before. We also thought of a snowflake schema (some kind of fact table with dimensions) in RDS, because it is intended to store 10 GB to 100 GB, but we don't know how well this approach can scale.
IMHO you could do a PoC to see which service is more feasible for you. It really depends on how much data you have and what queries and load you plan to run.
Amazon Redshift is intended for OLAP on petabyte- or exabyte-scale data, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not OLTP, and it requires more static server overhead and extra skills to manage the deployment.
So without more numbers and use cases one cannot advise anything. The great thing about the cloud is that you can try things and see what fits you.
Amazon Redshift is really great when you only want to read data from the database. Under the hood, Redshift is a column-oriented database, which makes it more suitable for analytics. You can transfer all your existing data to Redshift using AWS DMS. AWS DMS is a service that basically needs the binary logs of your existing database, and it then transfers your data automatically; you don't have to do anything else. From my personal experience, Redshift is really great.
I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is:
1. Unload your data from PostgreSQL to a CSV file first, using something like psql
2. Then just copy the CSV to S3
But AWS gives you a script to do this, RDSToS3CopyActivity. See this link from AWS
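The two manual steps above can be looped over all tables instead of being written 50 times. The sketch below only generates the command strings (a psql \copy per table and a matching aws s3 cp); it does not run them. The schema, table, bucket, and prefix names are hypothetical, and in practice the table list could be read from information_schema.tables.

```python
def export_commands(schema, tables, bucket, prefix):
    """Return (psql_command, s3_copy_command) pairs, one pair per table."""
    commands = []
    for table in tables:
        csv_name = f"{table}.csv"
        # psql's \copy meta-command writes the CSV on the client machine.
        psql_cmd = f"\\copy {schema}.{table} TO '{csv_name}' WITH (FORMAT csv, HEADER)"
        # The AWS CLI then uploads that local file to the bucket.
        s3_cmd = f"aws s3 cp {csv_name} s3://{bucket}/{prefix}/{csv_name}"
        commands.append((psql_cmd, s3_cmd))
    return commands


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration.
    for psql_cmd, s3_cmd in export_commands("public", ["users", "orders"], "my-bucket", "exports"):
        print(psql_cmd)
        print(s3_cmd)
```

Names are interpolated directly into the commands, so this is only suitable for table names you control.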
Since you have a large number of tables, I would recommend using AWS Glue rather than AWS Data Pipeline. Glue is easily configurable, with crawlers and so on that give you flexibility over which columns to choose and how to define them. Moreover, the underlying jobs in AWS Glue are PySpark jobs, which scale really well and give you really good performance.
I have a use case in which I want to take data from DynamoDB and do some transformations on it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be something like the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda. But according to my understanding, it is triggered by events rather than by time-based scheduling; is that correct?
Yes, it is possible, but with EmrActivity rather than HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI, which creates the basic structure for you; you can then extend or customize it to suit your requirements.
To your next question, about Lambda: of course Lambda can be configured for event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for ETL operations, as Lambda functions are time-bound and typical ETL jobs run longer than Lambda's time limits.
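To illustrate the schedule-based triggering mentioned above: a scheduled rule in Amazon EventBridge (formerly CloudWatch Events) takes a schedule expression such as rate(1 day) or a six-field cron expression. The helper below is a hypothetical convenience function, not part of any AWS SDK, that builds a once-a-day cron expression; the resulting string is what you would supply as the rule's schedule expression.

```python
def daily_schedule_expression(hour_utc, minute=0):
    """Return an EventBridge cron expression that fires once a day at the given UTC time."""
    if not 0 <= hour_utc <= 23 or not 0 <= minute <= 59:
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Field order: minutes, hours, day-of-month, month, day-of-week, year.
    # One of day-of-month / day-of-week must be '?', so day-of-week is '?' here.
    return f"cron({minute} {hour_utc} * * ? *)"


if __name__ == "__main__":
    # Example: run the daily dump at 02:30 UTC (an arbitrary time picked for illustration).
    print(daily_schedule_expression(2, 30))
```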
AWS has offerings optimized specifically for ETL, namely AWS Data Pipeline and AWS Glue, and I would always recommend choosing one of the two. If your ETL involves data sources not managed within AWS compute and storage services, or any specialty use case that the two options above can't satisfy, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. Turns out we can dump the data to different s3 locations using Hive activity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities when your input source is a DynamoDB table is not a good idea, since Hive doesn't load any data into memory. It does all the computations on the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using Amazon Aurora, which is MySQL-compatible.
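The "export first, then query many times" advice can be sketched in plain Python: take one exported snapshot of the records and run all three transformations against that copy, so the live table is read only once. The field names (customer, day, amount) and the output file names are hypothetical, and the upload of each CSV to its own S3 location is left out.

```python
import csv
import io
from collections import defaultdict


def total_by(records, key):
    """Sum the 'amount' field grouped by the given key."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += float(record["amount"])
    return [{key: k, "total": v} for k, v in sorted(totals.items())]


def count_by(records, key):
    """Count records grouped by the given key."""
    counts = defaultdict(int)
    for record in records:
        counts[record[key]] += 1
    return [{key: k, "count": v} for k, v in sorted(counts.items())]


def to_csv(rows):
    """Render a list of dicts as CSV text, taking the header from the first row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_reports(records):
    """Run three transformations over one exported snapshot of the data."""
    return {
        "totals_by_customer.csv": to_csv(total_by(records, "customer")),
        "totals_by_day.csv": to_csv(total_by(records, "day")),
        "orders_by_customer.csv": to_csv(count_by(records, "customer")),
    }
```

Each value in the returned dictionary is the text of one CSV file, ready to be written to its own destination.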
I am evaluating Athena and Redshift Spectrum. Both serve the same purpose; Spectrum needs a Redshift cluster in place, whereas Athena is purely serverless. Athena uses Presto, and Spectrum uses Redshift's engine.
Are there any specific disadvantages to Athena or Redshift Spectrum?
Any limitations on using Athena or Spectrum?
I have used both across a few different use cases and conclude:
Advantages of Redshift Spectrum:
Allows creation of Redshift tables
Able to join Redshift tables with Redshift Spectrum tables efficiently
If you do not need those things, then you should consider Athena as well.
Athena differences from Redshift spectrum:
Billing. This is the major difference, and depending on your use case you may find one much cheaper than the other.
Performance. I found Athena slightly faster.
SQL syntax and features. Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
Connectivity. It's easy enough to connect to Athena using the API, JDBC, or ODBC, but many more products offer a "standard out of the box" connection to Redshift.
Also, for either solution, make sure you use the AWS Glue Data Catalog for metadata rather than Athena's own catalog, as there are fewer limitations.
This question has been up for quite a time, but still, I think I can contribute something to the discussion.
What is Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)
Pretty straightforward, right?
Then comes the question of what Redshift Spectrum is and why the Amazon folks made it when Athena was already pretty much a solution for external table queries.
So, the AWS folks wanted to create an extension to Redshift (which is pretty popular as a managed columnar datastore at this time) and give it the capability to talk to external tables (typically in S3). But they wanted to make life easier for Redshift users, mostly analytics people. Many analytics tools don't support Athena but do support Redshift at this time. But creating your Redshift cluster and storing data in it was a bottleneck. Also, Redshift isn't that horizontally scalable, and adding new machines takes some downtime. If you are a Redshift user, cheaper storage basically makes your life much easier.
I suggest you use Redshift spectrum in the following cases:
You are an existing Redshift user and you want to store more data in Redshift.
You want to move colder data to an external table but still, want to join with Redshift tables in some cases.
You unload your data with Spark, or you just want to import the data into Pandas or other tools for analysis.
And Athena can be useful when:
You are a new user and don't have a Redshift cluster. Access to Spectrum requires an active, running Redshift instance, so Redshift Spectrum is not an option without Redshift.
Spectrum is still a developing tool, and they are adding features such as transactions to make it more efficient.
BTW, Athena comes with a nice REST API, so go for it if you want that.
All this is to say that Redshift + Redshift Spectrum is indeed powerful and shows a lot of promise, but it still has a long way to go to mature.
If you are already using a Redshift database, then it would be wise to use Spectrum along with Redshift to get the required performance.
However, if you are just beginning to explore options, then you can consider Athena as the tool to go ahead with.
I had learned (from Adrian Cantrill's/LA's 2019 SA Pro course) that Redshift Spectrum would use one's own Redshift cluster to provide more consistent performance than is available from the shared capacity that AWS makes available to Athena queries. I appreciate this information might only be useful for the exam; I didn't find his argument convincing.
I wrote this answer because I wasn't satisfied with the leading answer's treatment of Athena outperforming Redshift Spectrum. The rest of that answer is good and I do not mean to directly copy any of that here (without references it hadn't registered with me when I wrote this).
I (again, based solely on my hands-off research) would choose Spectrum when the majority of my data is in S3, which would typically be for the larger data sets. The recent RA3 instances seem to overlap this niche though. So I say Spectrum is most suited to where we have long term Redshift clusters that, being OLAP nodes, have spare capacity to query S3.
Why would you use your own estate to perform the queries that Athena would do without such an investment from you? Caching, where it fits. And consistent performance, if I am to believe Adrian Cantrill more than Jon Scott. This made me suspect RA3 might be edging Spectrum out; that and the lack of decent literature on Spectrum. Why would Amazon offer a serverless product in Athena that outperforms Redshift Spectrum which is more expensive? This is how they are choosing to deprecate RRS. I can't believe Spectrum is deprecated so must offer this answer to contest this. Just look at https://aws.amazon.com/redshift/whats-new/.
I think the picture below (from https://d1.awsstatic.com/events/Summits/AMER2020/May13SummitOnline/Modernize_your_data_warehouse.pdf) is fairly clear that compute nodes are influential here, and perhaps contrary to #JonScott's valuable insights above.
One final big difference is that Athena is limited to IAM for authentication, as depicted in this re:Invent 2018 (ANT201-R1) slide:
One big limitation and differing factor is the ability to use structured data. Athena supports it for both JSON and Parquet file formats while Redshift Spectrum only accepts flat data.
Another is the availability of GIS functions that Athena has and also lambdas, which do come in handy sometimes.
Now if you ran a standalone new Postgres then that does everything and more, but as far as comparison between Redshift (and Spectrum) goes - it's a tool that has stopped evolving.
I'm trying to implement, I think, a very simple process, but I don't really know what's the best approach.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations, and load it into RDS MySQL, and I want this process to be replicable.
I thought that the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I've found the dataduct wrapper of Coursera, but after some research, it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should continue trying with aws data pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if it's simpler.
Then I've seen a video of AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get them into RDS so you can query them, there are other options that do not require the data to be moved from S3 to RDS in order to run SQL-like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
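Step 3 comes down to two SQL statements run against the Redshift cluster. As a hedged sketch, the helpers below only render those statements as text for a comma-delimited file layout. The schema, database, table, and column names, the role ARN, and the bucket path in the example are all hypothetical placeholders.

```python
def external_schema_sql(schema, catalog_db, iam_role_arn):
    """Render a CREATE EXTERNAL SCHEMA statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_csv_table_sql(schema, table, columns, s3_location):
    """Render a CREATE EXTERNAL TABLE statement for comma-delimited text files in S3."""
    column_list = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{column_list}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    # Every name, the ARN, and the S3 path here are made up for illustration.
    print(external_schema_sql("spectrum", "spectrumdb",
                              "arn:aws:iam::123456789012:role/MySpectrumRole"))
    print(external_csv_table_sql("spectrum", "sales",
                                 [("sale_id", "INTEGER"), ("amount", "DECIMAL(8,2)")],
                                 "s3://my-bucket/sales/"))
```

Values are interpolated directly into the SQL text, so only trusted names should be passed in. Once both statements have been run, step 4 is an ordinary SELECT against spectrum.sales.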
Or you can use Athena to query the data in S3 as well, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your csv data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations and then the tool will load the data into your MySQL database. For example there is Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.
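Whichever tool is chosen, the pull, transform, load flow described above comes down to streaming the file in batches rather than reading all 30 GB at once. The sketch below shows that pattern with Python's standard library. It is not Scriptella, and it uses sqlite3 as a stand-in for MySQL so that it runs anywhere; the column names, the transformation, and the payments table are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice


def batches(rows, size):
    """Yield lists of up to `size` rows so the whole file never sits in memory."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def transform(row):
    """Hypothetical transformation: trim the name and convert the amount to cents."""
    return (row["name"].strip(), int(round(float(row["amount"]) * 100)))


def load_csv(lines, connection, insert_sql, batch_size=1000):
    """Stream CSV lines, transform each row, insert in batches; return rows loaded."""
    reader = csv.DictReader(lines)
    cursor = connection.cursor()
    loaded = 0
    for batch in batches((transform(row) for row in reader), batch_size):
        cursor.executemany(insert_sql, batch)
        connection.commit()  # commit per batch so a failure loses at most one batch
        loaded += len(batch)
    return loaded


if __name__ == "__main__":
    # sqlite3 is an assumption standing in for MySQL; with a MySQL DB-API
    # driver the placeholders in the INSERT would be %s instead of ?.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (name TEXT, amount_cents INTEGER)")
    sample = io.StringIO("name,amount\n alice ,1.50\nbob,2.25\n")
    print(load_csv(sample, conn, "INSERT INTO payments VALUES (?, ?)", batch_size=1))
```

In a real run the `lines` argument would be a file object or a streamed S3 object body instead of the in-memory sample, and the insert statement is passed in so the same loader works with either placeholder style.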