Can I use Informatica EDC instead of the Glue catalog in AWS?
Is AWS Athena tightly coupled with the Glue catalog?
Did you check here: https://docs.aws.amazon.com/athena/latest/ug/glue-upgrade.html ?
It looks like you need to perform an AWS Glue upgrade and also add policies so that Athena can pull catalog information. There is also an FAQ here: https://docs.aws.amazon.com/athena/latest/ug/glue-faq.html. I have not worked on this scenario yet; I am working on Glue with Redshift.
The FAQ says the following:
Why do I need to add AWS Glue policies to Athena users?
Before you upgrade, Athena manages the data catalog, so Athena actions must be allowed for your users to perform queries. After you upgrade to the AWS Glue Data Catalog, Athena actions no longer apply to accessing the AWS Glue Data Catalog, so AWS Glue actions must be allowed for your users. Remember, the managed policy for Athena has already been updated to allow the required AWS Glue actions, so no action is required if you use the managed policy.
What happens if I don’t allow AWS Glue policies for Athena users?
If you upgrade to the AWS Glue Data Catalog and don't update a user's customer-managed or inline IAM policies, Athena queries fail because the user won't be allowed to perform actions in AWS Glue. For the specific actions to allow, see Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users.
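As a rough illustration of what "allowing AWS Glue actions" for an Athena user can look like, here is a minimal sketch that builds an inline IAM policy document. The action list below is an assumption chosen for read-only catalog access; the authoritative list is the one in the Step 2 page the FAQ points to, and `glue_read_policy` is a hypothetical helper name.

```python
import json

# Assumed, illustrative set of read-only Glue catalog actions.
# Check the "Step 2" documentation page for the actions Athena actually needs.
GLUE_READ_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:BatchGetPartition",
]


def glue_read_policy(resource="*"):
    """Return an IAM policy document allowing read access to the Glue Data Catalog."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(GLUE_READ_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into a customer-managed or inline policy.
    print(json.dumps(glue_read_policy(), indent=2))
```

Users on the Athena managed policy would not need this, since the FAQ states that policy already includes the required Glue actions.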
I've just started to use AWS, I have a question.
I'm trying to send data from DynamoDB (Account A) to Athena (Account B).
What I would like to do is transfer the tables stored in Account A's DynamoDB to another account, Account B, and update them every day.
Account B is going to run queries in Athena against the tables received from Account A.
Is there a way to do this?
Thanks a lot!
Note that you cannot transfer data to Athena, because Athena does not hold data itself.
If the query data source were in the same account, Athena Federated Query would be useful, but it does not seem to be supported cross-account yet.
So for your case, export the DynamoDB table to S3, as described in this AWS blog.
Then query the S3 data using Athena.
Amazon EventBridge can be used to schedule the export to run every day.
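A minimal sketch of the export step, assuming it runs as a Lambda function in Account A that is invoked by a daily EventBridge schedule. It uses the DynamoDB `export_table_to_point_in_time` API, which requires point-in-time recovery to be enabled on the table. The environment variable names, the S3 prefix layout, and the helper name `build_export_request` are all hypothetical; the bucket in Account B also needs a policy that lets Account A write to it.

```python
import os
from datetime import datetime, timezone


def build_export_request(table_arn, bucket, bucket_owner, day=None):
    """Build the arguments for DynamoDB export_table_to_point_in_time."""
    day = day or datetime.now(timezone.utc)
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        # Account ID that owns the target bucket (Account B in the question).
        "S3BucketOwner": bucket_owner,
        # One prefix per day so each export lands in its own folder (assumed layout).
        "S3Prefix": f"dynamodb-export/{day:%Y/%m/%d}",
        "ExportFormat": "DYNAMODB_JSON",
    }


def lambda_handler(event, context):
    """Entry point for the scheduled Lambda; starts one export per invocation."""
    import boto3  # imported here so the helper above stays testable without boto3

    request = build_export_request(
        table_arn=os.environ["TABLE_ARN"],          # hypothetical variable name
        bucket=os.environ["TARGET_BUCKET"],         # hypothetical variable name
        bucket_owner=os.environ["TARGET_ACCOUNT"],  # hypothetical variable name
    )
    response = boto3.client("dynamodb").export_table_to_point_in_time(**request)
    return response["ExportDescription"]["ExportArn"]
```

Account B would then define an Athena table (for example through a Glue crawler) over the exported prefix.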
I'm using AWS Lake Formation to manage the permissions needed to use Athena.
For one of the users I revoked all his permissions, so now he can't see the databases or tables in the Athena catalog, but when he runs a query directly from the editor, it still works.
He's not a Lake Formation data lake administrator, and he has full access to Athena.
I think it's because the Athena service has permissions via a service-linked role (created by Lake Formation): https://docs.aws.amazon.com/lake-formation/latest/dg/service-linked-roles.html
Since the user has access to Athena, his requests are being executed by the Athena service (which still has access).
Long-running Glue jobs that export data from Redshift to S3 fail due to S3ServiceException: The provided token has expired. Amazon describes using a custom role as a workaround (here), but they do not provide any example. Could somebody provide a CloudFormation snippet? What should the role look like? If I use a Glue job, should I add actions for DynamoDB or an EMR cluster to the role policy?
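For reference, the general shape of an IAM role that a Glue job can assume looks roughly like the sketch below, written as a Python dict that serializes to a CloudFormation JSON template. This is only an illustration of a Glue service role: the role name and the helper `glue_job_role_template` are hypothetical, S3 and Redshift permissions are left out, and whether such a role avoids the expired-token error depends on the workaround Amazon describes, which is not reproduced here.

```python
import json


def glue_job_role_template(role_name="ExampleGlueJobRole"):
    """Return a CloudFormation template (as a dict) defining a role for Glue jobs."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "GlueJobRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "RoleName": role_name,  # hypothetical name
                    # Trust policy: lets the Glue service assume this role.
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Principal": {"Service": "glue.amazonaws.com"},
                                "Action": "sts:AssumeRole",
                            }
                        ],
                    },
                    # AWS managed policy with the baseline Glue permissions.
                    "ManagedPolicyArns": [
                        "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
                    ],
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(glue_job_role_template(), indent=2))
```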
I have been setting up data lakes for clients in which we load data from on-premises or other sources into S3 (the data lake). We then create an AWS Glue catalog on this raw data to create schemas.
The next step is to use either EMR or AWS Glue for some data cleansing and load the transformed data into RDS / Redshift / S3 as the final target.
The jobs can be scheduled using Data Pipeline, Glue jobs, or AWS Lambda event triggers, depending on the use case / service used.
Analysts and other users are given the required data / S3 bucket access through IAM, for QuickSight visualizations, for querying the data with Athena, Drill, etc., or for ML applications in SageMaker.
My question is: how is AWS Lake Formation different from the traditional data lake described above?
Is it fair to say that AWS Lake Formation makes all the above services (S3, Glue catalog, the ETL code generator in Glue, the job scheduler, etc.) available in a single window, with some more advanced security for users / data (record / column level) that can be configured from within the Lake Formation console?
Is there anything else that makes Lake formation stand out from the traditional cloud based Data Lake?
Thanks
AWS Lake Formation is primarily a permission control layer, coupled with AWS Glue to provide a catalog with permissions control. Lake Formation relieves you of managing IAM permissions and instead provides its own grant-based, fine-grained permission control using simple database-style grants.
Lake Formation still has some challenges with regard to integration with some data services such as EMR (it requires additional IAM policies).
But overall, using Lake Formation with S3 and Glue ETL provides everything needed to build a data lake.
Lake Formation could still benefit from an improved UI and better data discovery.
You can use Lake Formation to implement a traditional-style data lake, or to make data lakes more modular and support them across multiple AWS accounts.
Your understanding is correct. Lake Formation is essentially just a permissions model over the Glue Catalog that allows close integration with the other AWS data lake tools: Athena, S3, Glue, EMR, etc. It also has some additional features such as Blueprints (for syncing data from an RDBMS to S3), Jobs (for ETL), and Crawlers (for data discovery).
Lake Formation allows easier permission management for "user" IAM roles in your environment by allowing them to be centrally managed through the Lake Formation UI and API. Instead of having to update individual IAM/bucket policies each time a role needs new access, Lake Formation allows you to onboard a single "service" IAM role that has bucket access and then grant database/table/column-level access to the user IAM roles that need it.
The user roles essentially assume the service role to perform their operations (it might not be an assume-role exactly, as this part is an AWS black box). So Lake Formation saves you from the hassle of having to manage permissions for all user IAM roles via a mess of IAM/bucket policies.
It also makes it somewhat easier to share data with cross-account resources if your setup requires it.
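To make the "DB-like grants" concrete, here is a minimal sketch of a column-level grant through the Lake Formation `grant_permissions` API. The helper names, the role ARN, and the database, table, and column names are hypothetical placeholders.

```python
def build_column_grant(principal_arn, database, table, columns, permissions=("SELECT",)):
    """Build the arguments for Lake Formation grant_permissions on specific columns."""
    return {
        # The IAM user or role receiving the grant.
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": list(permissions),
    }


def grant(request):
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported here so build_column_grant works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical example: let an analyst role read two columns of one table.
example_request = build_column_grant(
    "arn:aws:iam::111111111111:role/analyst",
    "sales_db",
    "orders",
    ["order_id", "order_date"],
)
```

Calling `grant(example_request)` would apply it; the equivalent with plain IAM would mean editing the role's policy and the bucket policy by hand.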
Is it possible to directly access AWS Glue Data Catalog of Account B via the Athena interface of Account A?
I was just trying to resolve this same issue in my own setup, but then stumbled across this bummer (the last bullet under Cross-Account Access Limitations on this page):
Cross-account access to the Data Catalog is not supported when using an AWS Glue crawler, Amazon Athena, or Amazon Redshift.
So it sounds like even with the cross-account access that is possible today, it won't carry over naturally to those services (including Athena, which the question asks about).
That said, I was able to set up cross-account access to the AWS Glue Data Catalog in a way that allowed me to use Account A to pull all relevant info about Data Catalog objects from Account B. I can update my answer to incorporate how far I got, if you want, but a hacky method that might solve this question would be to set up the cross-account access that is possible today then run a recurring Lambda function that replicates over all the relevant metadata in the Data Catalog from Account B to Account A so users in Account A can view that within Account A's AWS Glue Data Catalog. I'm not sure whether Athena specifically would work in that setup, as I know it requires PutObject access when it queries data in S3 (which could be solved via the appropriate S3 bucket policies, but that'd be another cross-account permissions thing to manage).
Let me know whether you'd like to see those details on what cross-account stuff I was able to get working.
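A minimal sketch of the replication idea described above, assuming a Glue client running in Account A that already has cross-account read access to Account B's catalog, and assuming the target database already exists in Account A. The function names are hypothetical, partitions are not copied, and the key whitelist reflects the fields `TableInput` accepts as far as I know, so verify it against the Glue API reference.

```python
# Fields that Glue's TableInput accepts; get_tables also returns read-only
# fields (CreateTime, DatabaseName, CatalogId, ...) that must be dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def to_table_input(table):
    """Strip a get_tables result down to the keys create_table accepts."""
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}


def replicate_database(glue, source_catalog_id, database):
    """Copy table definitions of one database from Account B into Account A.

    glue: a boto3 Glue client created in Account A.
    source_catalog_id: the AWS account ID of Account B.
    """
    copied = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(CatalogId=source_catalog_id, DatabaseName=database):
        for table in page["TableList"]:
            table_input = to_table_input(table)
            try:
                glue.create_table(DatabaseName=database, TableInput=table_input)
            except glue.exceptions.AlreadyExistsException:
                # Table was copied on an earlier run; refresh its definition.
                glue.update_table(DatabaseName=database, TableInput=table_input)
            copied.append(table["Name"])
    return copied
```

In a recurring Lambda, the handler would create the client with `boto3.client("glue")` and call `replicate_database` once per database to sync.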
AWS has started supporting this using Lambda; please see the link below:
https://aws.amazon.com/blogs/big-data/cross-account-aws-glue-data-catalog-access-with-amazon-athena/
Since May 2021 it has been possible to register a data catalog from a different account in Amazon Athena; see the User Guide.
Athena Query Engine v2 is required, though, and there are some other limitations.
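A minimal sketch of the registration step using the Athena `create_data_catalog` API, where a catalog of type `GLUE` points at the owning account through the `catalog-id` parameter. The catalog name, the account ID, and the helper names are hypothetical, and the cross-account Glue resource policy and IAM permissions described in the User Guide still have to be in place.

```python
def build_catalog_registration(name, owner_account_id):
    """Build the arguments for Athena create_data_catalog for a cross-account Glue catalog."""
    return {
        "Name": name,
        "Type": "GLUE",
        "Description": f"Glue Data Catalog of account {owner_account_id}",
        # catalog-id is the AWS account ID that owns the Glue Data Catalog.
        "Parameters": {"catalog-id": owner_account_id},
    }


def register(request):
    """Register the catalog in Athena (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3

    boto3.client("athena").create_data_catalog(**request)


# Hypothetical example: expose Account B's catalog in Account A as "account_b".
example_registration = build_catalog_registration("account_b", "222222222222")
```

Once registered, queries in Account A can reference tables as `"account_b"."database"."table"`.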