I'm a beginner with AWS Glue and I'm trying to build jobs without writing code. Currently I'm trying to update rows in an S3 file and in a SQL Server table, but I have the impression that this is not possible. Could you please advise me on the best way to do it?
I'm using the Data Catalog (S3 or RDS) as the target.
Related
I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3. I am currently using AWS Glue to do the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want. However, when I add new data to the MySQL database and run the Glue job again, I get an empty file instead of the added rows. Any help would be much appreciated.
The bookmarking rules for JDBC sources are documented here. The important point to remember for JDBC sources is that the bookmark key values have to be in increasing or decreasing order, and Glue only processes new data since the last checkpoint.
Typically, either an autogenerated sequence number or a datetime column is used as the bookmark key.
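As a rough sketch, assuming a catalog table called orders with an updated_at column (both placeholders), the bookmark key can be set through additional_options in the Glue script:

```python
# Minimal sketch: read a JDBC source with an explicit bookmark key.
# Database, table and column names ("my_database", "orders", "updated_at") are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # required so bookmark state is tracked

orders = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="orders",
    transformation_ctx="orders_src",        # bookmark state is stored per transformation_ctx
    additional_options={
        "jobBookmarkKeys": ["updated_at"],  # values must be increasing (or decreasing)
        "jobBookmarkKeysSortOrder": "asc",
    },
)

job.commit()  # persists the new bookmark checkpoint
```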
For anybody who is still struggling with this (it drove me mad, because I thought my Spark code was wrong): disable bookmarking in the job details.
Why does AWS market Glue as an ETL tool? We need to code everything to pull data; there is no built-in functionality provided by Glue. Are there any benefits to using Glue instead of NiFi or some other ingestion tool?
Glue is a good ETL tool within AWS, especially for big data workloads. After all, it runs on Spark.
Glue does have the ability to produce some basic automated transformation code: move data from A to B, remap column names, and so on.
However, it's the flexibility to write custom code that really sets it apart. Using the Glue code editor, or the PyCharm IDE, you can script any transformations you need in PySpark and/or Scala.
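As a rough sketch of the kind of autogenerated "A to B with column remapping" code Glue produces (database, table and column names here are just placeholders):

```python
# Read a crawled table from the Data Catalog and remap/cast its columns.
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

src = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="source_table",
    transformation_ctx="src",
)

# Each mapping is (source column, source type, target column, target type)
mapped = ApplyMapping.apply(
    frame=src,
    mappings=[
        ("cust_id", "int", "customer_id", "long"),
        ("cust_name", "string", "customer_name", "string"),
    ],
    transformation_ctx="mapped",
)
```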
The benefits of Glue are really gained when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even AWS EMR, so you end up with a central point for your big data ecosystem.
One limitation of Glue I have found is writing large datasets to MS SQL Server (10 million rows and up). Glue uses JDBC drivers, and as of 2020 there is no Microsoft JDBC connection that takes advantage of bulk copy. So, effectively, you are writing an insert statement for each row, and performance can suffer once you get into the tens of millions of rows.
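For reference, a hedged sketch of what that JDBC write looks like in a Glue script (the connection, database and table names are placeholders):

```python
# Write a DynamicFrame to SQL Server through a Glue JDBC connection.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="staging_table"
)

# The write goes through the JDBC driver as plain INSERTs rather than a
# bulk copy, which is why very large writes slow down.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="my-sqlserver-connection",
    connection_options={"dbtable": "dbo.target_table", "database": "mydb"},
)
```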
I wish to transfer data from a database like MySQL [RDS] to S3 using AWS Glue ETL.
I am having difficulty doing this; the documentation is really not good.
I found this link on Stack Overflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
So based on that link, it seems that Glue cannot use an S3 bucket as a data destination, only as a data source.
I hope I am wrong about this.
But if one builds an ETL tool, one of the first basics on AWS is for it to transfer data to and from an S3 bucket, the major form of storage on AWS.
So I hope someone can help with this.
You can add a Glue connection to your RDS instance and then use a Spark ETL script to write the data to S3.
You'll first have to crawl the database table using a Glue Crawler. This creates a table in the Data Catalog which the job can use to transfer the data to S3. If you do not wish to perform any transformations, you can use the UI steps directly to autogenerate the ETL script.
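A minimal sketch of such a job, assuming placeholder names for the catalog database, table and target bucket:

```python
# Read the crawled RDS (MySQL) table from the Data Catalog and write it to S3.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Table created by the Glue Crawler over the RDS source
source = glueContext.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",
    table_name="mysql_orders",
    transformation_ctx="source",
)

# Write straight to S3; no transformation step is required
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-export-bucket/orders/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()
```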
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well as incremental copies from RDS to S3.
I am trying to copy all the tables from a schema (PostgreSQL, 50+ tables) to Amazon S3.
What is the best way to do this? I am able to create 50 different copy activities, but is there a simple way to copy all tables in a schema or write one pipeline and loop?
I think the old method is (see the sketch below):
1. Unload your data from PostgreSQL to a CSV file first, using something like psql
2. Then just copy the CSV to S3
But AWS gives you a script to do this, RDSToS3CopyActivity. See this link from AWS.
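A rough sketch of those two manual steps in Python (connection details, table and bucket names are placeholders; credentials are assumed to come from ~/.pgpass and the usual AWS credential chain):

```python
# Step 1: unload a table to CSV with psql; Step 2: upload the CSV to S3.
import subprocess
import boto3

# Unload the table to a local CSV using psql's \copy
subprocess.run(
    [
        "psql",
        "host=mydb.example.com dbname=mydb user=etl",
        "-c",
        r"\copy my_table TO '/tmp/my_table.csv' WITH CSV HEADER",
    ],
    check=True,
)

# Copy the CSV to S3
boto3.client("s3").upload_file(
    "/tmp/my_table.csv", "my-export-bucket", "exports/my_table.csv"
)
```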
Since you have a large number of tables, I would recommend AWS Glue over AWS Data Pipeline. Glue is easily configurable, with crawlers etc. that give you the flexibility to choose and define columns. Moreover, the underlying jobs in AWS Glue are PySpark jobs that scale really well, giving you very good performance.
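One way to avoid 50 separate copy activities, as a sketch: list the tables the crawler created in the catalog database from a single Glue job and export each one (database and bucket names are placeholders):

```python
# Iterate over every table in a Data Catalog database and write each to S3.
import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
glue = boto3.client("glue")

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="postgres_catalog_db"):
    for table in page["TableList"]:
        name = table["Name"]
        frame = glueContext.create_dynamic_frame.from_catalog(
            database="postgres_catalog_db", table_name=name
        )
        glueContext.write_dynamic_frame.from_options(
            frame=frame,
            connection_type="s3",
            connection_options={"path": f"s3://my-export-bucket/{name}/"},
            format="parquet",
        )
```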
I'm trying to implement what I think is a very simple process, but I don't really know the best approach.
I want to read a big CSV file (around 30 GB) from S3, make some transformations, and load it into RDS MySQL, and I want this process to be repeatable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that the project has been abandoned (the last commit was a year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know whether they would be any simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving the data from S3 into RDS in order to run SQL-like queries.
You can use Redshift spectrum to read and query information from S3 now.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
Or you can use Athena to query the data in S3 if Redshift is more horsepower than the job needs.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
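As a sketch, an ad-hoc Athena query can be kicked off from Python with boto3 (the database, table and results bucket names are placeholders):

```python
# Run a SQL query against the S3 data through Athena.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT * FROM my_csv_table LIMIT 10",
    QueryExecutionContext={"Database": "my_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion
```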
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open-source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and works with JDBC/ODBC-compliant drivers. With it you could create a script that performs your transformations and then loads the data into your MySQL database. And you would be using familiar SQL (I'm assuming you can already write SQL scripts), so there isn't a big learning curve.