What is the major drawback of the AWS Glue Data Catalog? I was asked this in an interview.
That could be answered in a number of ways depending on the wider context. For example:
- It's an AWS managed service, so using it locks you into the AWS ecosystem (instead of using a standalone Hive metastore, for example)
- It's limited to the data sources that the Glue Data Catalog supports
- It doesn't integrate with third-party authentication and authorisation tools like Apache Ranger (as far as I am aware)
Related
I've been wondering whether it is possible to keep versions of schemas in the GCP Data Catalog service. Alternatively, any advice on how you deal with Data Catalog entries when a schema changes (e.g. in Cloud SQL, a GCS fileset, or BigQuery), and how history could be handled if it is not supported by Google?
I tried investigating Data Catalog API calls and Cloud Logging after an entry is updated; however, there were no changes and no history.
I've found that functionality in AWS (the Glue Schema Registry: https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).
There is also an unanswered question in the GCP Community: https://www.googlecloudcommunity.com/gc/Data-Analytics/How-can-I-see-entry-history/m-p/425135#M338
Custom tools such as Liquibase (https://medium.com/google-cloud/version-control-of-bigquery-schema-changes-with-liquibase-ddc7092d6d1d) are not suitable in this case, as they are limited to BigQuery (not all GCP services).
I'm looking for any kind of versioning of Data Catalog entries (schemas in particular), history in logs, or similar.
Unfortunately, there is currently no versioning in Data Catalog. If an entry's schema changes in the ingested system, the tag attached to a removed column is lost. For simple versioning use cases you may consider using Terraform and storing the configuration in Cloud Source Repositories.
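As a lightweight workaround you could also snapshot entry schemas yourself and commit the snapshots to a repository, so that the git history becomes the version history. Below is a minimal sketch using the google-cloud-datacatalog Python client; the BigQuery table and the output path are hypothetical placeholders, and it only captures whatever the catalog reflects at the moment you run it.

```python
# Sketch: snapshot a Data Catalog entry's schema so it can be versioned in git
# (e.g. in Cloud Source Repositories). The linked resource is a placeholder.
from pathlib import Path
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Look up the catalog entry for an existing resource (here a BigQuery table).
request = datacatalog_v1.LookupEntryRequest(
    linked_resource=(
        "//bigquery.googleapis.com/projects/my-project"
        "/datasets/my_dataset/tables/my_table"
    )
)
entry = client.lookup_entry(request=request)

# Serialize the entry's schema and write it to a file tracked by git;
# every commit of this file is then one schema "version".
schema_json = datacatalog_v1.Schema.to_json(entry.schema)
Path("schemas").mkdir(exist_ok=True)
Path("schemas/my_dataset.my_table.json").write_text(schema_json)
```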
Why does AWS market Glue as an ETL tool? We need to code everything to pull data; there is no built-in functionality provided by Glue. Are there any benefits to using Glue instead of NiFi or some other ingestion tool?
Glue is a good ETL tool within AWS, especially for big data workloads, since under the hood it runs on Spark.
Glue does have the ability to generate some basic transformation code automatically, e.g. moving data from A to B and remapping column names.
However, it's the flexibility to write custom code that really sets it apart. Using the Glue script editor, or an IDE such as PyCharm, you can script any transformations you need using PySpark and/or Scala.
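For illustration, here is a minimal sketch of what such a job can look like: it reads a catalog table, remaps a couple of columns, and writes Parquet back to S3. The database, table, column, and bucket names are placeholders, not anything from the question.

```python
# Sketch of a Glue PySpark job: read from the Data Catalog, remap columns,
# write Parquet to S3. All names below are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Remap/rename columns and adjust types
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[
        ("order_id", "string", "id", "string"),
        ("order_total", "double", "total", "double"),
    ])

# Write the result to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")

job.commit()
```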
The benefits of Glue are really gained when it is used in conjunction with other AWS services. The Glue Data Catalog is shared with Athena and even Amazon EMR, so you end up with a central point of metadata for your big data ecosystem.
One limitation of Glue I have found is writing large datasets to MS SQL Server (10 million rows and up). Glue uses JDBC drivers and, as of 2020, there is not yet a Microsoft JDBC connection that takes advantage of bulk copy, so you are effectively issuing an insert statement per row. Performance can therefore suffer once you get into the tens of millions of rows.
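If you do have to write to SQL Server over JDBC, tuning the batch size and the number of parallel connections on the Spark writer can soften the per-row insert cost somewhat, though it will not match a true bulk copy. A rough sketch, continuing from the job above (URL, table, and credentials are placeholders):

```python
# Sketch: write a Spark DataFrame to SQL Server over JDBC with batching.
# The JDBC URL, table, and credentials below are placeholders.
df = mapped.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame

(df.repartition(8)                       # bounds the number of concurrent connections
   .write.format("jdbc")
   .option("url", "jdbc:sqlserver://myhost:1433;databaseName=sales")
   .option("dbtable", "dbo.orders")
   .option("user", "etl_user")
   .option("password", "********")
   .option("batchsize", 10000)           # rows sent per JDBC batch
   .mode("append")
   .save())
```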
Any suggested architecture?
1. For the first full load, using Kinesis, how do I automate it so that it creates different streams for different tables? (Is this the right way to do it?)
2. In case there is a new additional table, how do I create a new stream for it automatically?
3. How do I load data into Kinesis incrementally (whenever new data is populated)?
Any resources/architectures would definitely be helpful. I'm using Kinesis because multiple other downstream consumers might access this data in the future.
I recommend looking into the AWS Schema Conversion Tool (AWS SCT) and the AWS Database Migration Service (AWS DMS). DMS does not necessarily use Kinesis, but it is specifically designed for this use case.
Start with the walkthrough in this blog post: "How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS"
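To give a feel for how the full load plus ongoing replication is configured, here is a rough boto3 sketch of creating a DMS task. All ARNs and identifiers are placeholders, and the source endpoint, target endpoint, and replication instance would need to exist already (SCT handles the schema conversion separately).

```python
# Sketch: create a DMS task that performs a full load followed by ongoing
# change data capture (CDC). All ARNs and identifiers are placeholders.
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-sales-schema",
        "object-locator": {"schema-name": "SALES", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-redshift-full-load-and-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # initial full load + incremental changes
    TableMappings=json.dumps(table_mappings),
)
```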
I am confused about these two services. It looks like they are offering the same thing. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does that mean that AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.
I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations, and load it into RDS MySQL, and I want this process to be repeatable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found the dataduct wrapper from Coursera, but after some research it seems that the project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know whether they would be any simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving the data from S3 into RDS in order to run SQL-like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3 (steps 3 and 4 are sketched below)
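As a rough illustration of steps 3 and 4, the sketch below runs the external schema/table DDL and a query over the S3 files via psycopg2. The cluster endpoint, credentials, IAM role ARN, bucket, and column list are all placeholders.

```python
# Sketch: Redshift Spectrum external schema/table over CSV files in S3,
# then a query against them. All identifiers and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="********")
conn.autocommit = True  # external DDL cannot run inside a transaction block
cur = conn.cursor()

# Step 3: external schema backed by the data catalog, plus an external
# table pointing at the CSV files in S3.
cur.execute("""
    create external schema if not exists spectrum_schema
    from data catalog
    database 'spectrum_db'
    iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
    create external database if not exists;
""")
cur.execute("""
    create external table spectrum_schema.big_csv (
        id bigint,
        event_time varchar(32),
        amount decimal(12,2)
    )
    row format delimited fields terminated by ','
    stored as textfile
    location 's3://my-bucket/big-csv/';
""")

# Step 4: query the data in S3 directly, without loading it into Redshift.
cur.execute("select count(*), sum(amount) from spectrum_schema.big_csv;")
print(cur.fetchone())
```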
Or you can use Athena to query the data in S3 if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
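For example, a query can be kicked off from Python with boto3; the database, table, and results bucket below are placeholders, and the table would first have to be defined in the Glue Data Catalog (e.g. by a Glue crawler).

```python
# Sketch: run an Athena query over data in S3 from boto3 and print the rows.
# Database, table, and output bucket are placeholders.
import time
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM big_csv",
    QueryExecutionContext={"Database": "my_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    print(results["ResultSet"]["Rows"])
```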
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open-source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and has JDBC/ODBC-compliant drivers. With it you could create a script that performs your transformations and then loads the data into your MySQL database. And you would be using familiar SQL (I'm assuming you can already write SQL scripts), so there isn't a big learning curve.