Terraform AWS Athena to use Glue catalog as db

I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.
I use
resource "aws_glue_catalog_database" "catalog_database" {
name = "${var.glue_db_name}"
}
resource "aws_glue_crawler" "datalake_crawler" {
database_name = "${var.glue_db_name}"
name = "${var.crawler_name}"
role = "${aws_iam_role.crawler_iam_role.name}"
description = "${var.crawler_description}"
table_prefix = "${var.table_prefix}"
schedule = "${var.schedule}"
s3_target {
path = "s3://${var.data_bucket_name[0]}"
}
s3_target {
path = "s3://${var.data_bucket_name[1]}"
}
}
to create a Glue DB and a crawler to crawl S3 buckets (here only two), but I don't know how to link the Athena query service to the Glue DB. The Terraform documentation for Athena doesn't appear to offer a way to connect Athena to a Glue catalog, only to an S3 bucket. Clearly, however, Athena can be integrated with Glue.
How can I terraform an Athena database to use my Glue catalog as its data source rather than an S3 bucket?

Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:
Crawler role and role policy:
- The assume_role_policy of the IAM role needs only Glue as principal
- The IAM role policy allows actions for Glue, S3, and logs
- The Glue actions and resources can probably be narrowed down to the ones really needed
- The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role" {
name = "analytics_glue_crawler_role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role_policy" "glue_crawler_role_policy" {
name = "analytics_glue_crawler_role_policy"
role = "${aws_iam_role.glue_crawler_role.id}"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:*",
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetBucketAcl",
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::analytics-product-data",
"arn:aws:s3:::analytics-product-data/*",
]
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:*:*:/aws-glue/*"
]
}
]
}
EOF
}
S3 Bucket, Glue Database and Crawler:
resource "aws_s3_bucket" "product_bucket" {
bucket = "analytics-product-data"
acl = "private"
}
resource "aws_glue_catalog_database" "analytics_db" {
name = "inventory-analytics-db"
}
resource "aws_glue_crawler" "product_crawler" {
database_name = "${aws_glue_catalog_database.analytics_db.name}"
name = "analytics-product-crawler"
role = "${aws_iam_role.glue_crawler_role.arn}"
schedule = "cron(0 0 * * ? *)"
configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"
schema_change_policy {
delete_behavior = "DELETE_FROM_DATABASE"
}
s3_target {
path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
}
}
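On the Athena side there is nothing to wire to the Glue catalog explicitly: Athena picks up Glue Data Catalog databases and tables automatically, so once the crawler has populated inventory-analytics-db its tables are queryable from Athena. The only Athena-side piece Terraform might manage is where query results are written, for example via a workgroup. A minimal sketch (the results bucket name here is just a placeholder, not part of the setup above):

# Hypothetical bucket for Athena query results (not the data the crawler scans)
resource "aws_s3_bucket" "athena_results" {
  bucket = "analytics-athena-query-results"
  acl    = "private"
}

resource "aws_athena_workgroup" "analytics" {
  name = "analytics"

  configuration {
    result_configuration {
      # Where Athena writes query output; the source data stays in the crawled bucket
      output_location = "s3://${aws_s3_bucket.athena_results.bucket}/output/"
    }
  }
}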

I had many things wrong in my Terraform code. To start with:
- The S3 bucket argument in the aws_athena_database code refers to the bucket for query output, not the data the table should be built from.
- I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.
- I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was
resource "aws_iam_role_policy_attachment" "crawler_attach" {
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
role = "${aws_iam_role.crawler_iam_role.name}"
}
After setting up a second policy that explicitly allowed all S3 access to all of the buckets I wanted to crawl, and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.
The second policy:
resource "aws_iam_policy" "crawler_bucket_policy" {
name = "crawler_bucket_policy"
path = "/"
description = "Gives crawler access to buckets"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1553807998309",
"Action": "*",
"Effect": "Allow",
"Resource": "*"
},
{
"Sid": "Stmt1553808056033",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket0"
},
{
"Sid": "Stmt1553808078743",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket1"
},
{
"Sid": "Stmt1553808099644",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket2"
},
{
"Sid": "Stmt1553808114975",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket3"
},
{
"Sid": "Stmt1553808128211",
"Action": "s3:*",
"Effect": "Allow",
"Resource": "arn:aws:s3:::bucket4"
}
]
}
EOF
}
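For completeness, the second policy also has to be attached to the crawler role; a minimal sketch of that attachment, assuming the resource names used above:

resource "aws_iam_role_policy_attachment" "crawler_bucket_attach" {
  role       = "${aws_iam_role.crawler_iam_role.name}"
  policy_arn = "${aws_iam_policy.crawler_bucket_policy.arn}"
}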
I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.
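One way to avoid the hardcoding (a sketch, assuming Terraform 0.12+ and a hypothetical variable holding the bucket list) is to build the policy with jsonencode and a for expression instead of a heredoc:

# Hypothetical list of buckets the crawler should be able to read
variable "crawler_buckets" {
  type    = list(string)
  default = ["bucket0", "bucket1", "bucket2", "bucket3", "bucket4"]
}

resource "aws_iam_policy" "crawler_bucket_policy" {
  name        = "crawler_bucket_policy"
  path        = "/"
  description = "Gives crawler access to buckets"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Sid values must be alphanumeric
        Sid    = "CrawlerS3Access"
        Effect = "Allow"
        Action = "s3:*"
        # One pair of ARNs (bucket and its objects) per bucket in the list
        Resource = flatten([
          for b in var.crawler_buckets : [
            "arn:aws:s3:::${b}",
            "arn:aws:s3:::${b}/*",
          ]
        ])
      }
    ]
  })
}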

Related

InsufficientS3BucketPolicyFault when enabling AWS Redshift audit logging through Terraform

Problem
I'm trying to enable audit logging on an AWS redshift cluster. I've been following the instructions provided by AWS here: https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html#db-auditing-enable-logging
Current Configuration
I've defined the relevant IAM role as follows
resource "aws_iam_role" "example-role" {
name = "example-role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "redshift.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
And have granted the following IAM permissions to the example-role role:
{
  "Sid": "AllowAccessForAuditLogging",
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetBucketAcl"
  ],
  "Resource": [
    "arn:aws:s3:::example-bucket",
    "arn:aws:s3:::example-bucket/*"
  ]
},
The relevant portion of the redshift cluster configuration is as follows:
resource "aws_redshift_cluster" "example-cluster-name" {
cluster_identifier = "example-cluster-name"
...
# redshift audit logging to S3
logging {
enable = true
bucket_name = "example-bucket-name"
}
master_username = var.master_username
iam_roles = [aws_iam_role.example-role.arn]
...
Error
terraform plan runs correctly, and produces the expected plan based on the above configuration. However, when running terraform apply the following error occurs:
Error: error enabling Redshift Cluster (example-cluster-name) logging: InsufficientS3BucketPolicyFault: Cannot read ACLs of bucket example-bucket-name. Please ensure that your IAM permissions are set up correctly.
Note: I've replaced all resource names and identifiers with example-* placeholders.
shimo's answer is correct; I'm just adding detail for someone like me.
Redshift has full access to S3, but you also need to add a bucket policy (the permission on the S3 side):
{
  "Sid": "Statement1",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::361669875840:user/logs"
  },
  "Action": [
    "s3:GetBucketAcl",
    "s3:PutObject"
  ],
  "Resource": [
    "arn:aws:s3:::<your-bucket>",
    "arn:aws:s3:::<your-bucket>/*"
  ]
}
- `361669875840` is the Redshift logging account ID that must match your region; check the list [here][1]
[1]: https://github.com/finos/compliant-financial-infrastructure/blob/main/aws/redshift/redshift_template_public.yml
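Since the rest of the setup is in Terraform, the bucket policy can be managed there too. A minimal sketch, assuming the bucket from the question and that 361669875840 really is the Redshift logging account for your region:

resource "aws_s3_bucket_policy" "redshift_audit_logging" {
  bucket = "example-bucket-name"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowRedshiftAuditLogging"
        Effect = "Allow"
        Principal = {
          # Region-specific Redshift logging account (see the link above)
          AWS = "arn:aws:iam::361669875840:user/logs"
        }
        Action = ["s3:GetBucketAcl", "s3:PutObject"]
        Resource = [
          "arn:aws:s3:::example-bucket-name",
          "arn:aws:s3:::example-bucket-name/*"
        ]
      }
    ]
  })
}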

Incorrect S3 bucket policy is detected for bucket in Boto3

I have been working on setting up CloudTrail for an IAM user using Boto but I have run into an error:
An error occurred (InsufficientS3BucketPolicyException) when calling the CreateTrail operation: Incorrect S3 bucket policy is detected for bucket: goodbucket
I am not sure what's wrong here. Saving the CloudTrail log is not a priority, but I will need the ResourceId to delete resources later on using Lambda functions.
import boto3
import sys
import json
import time

iam = boto3.client('iam')
sts = boto3.client('sts')
ec2 = boto3.resource('ec2')
cloudtrail = boto3.client('cloudtrail')

# Create an IAM user, a group, and add the user to the group
response = iam.create_user(
    UserName='GoodUser'
)

IDK = sts.get_caller_identity()
print(IDK['UserId'])

response = iam.create_group(
    GroupName='GoodGroup'
)

response = iam.add_user_to_group(
    GroupName='GoodGroup',
    UserName='GoodUser'
)

# Policy restricting EC2 usage to t2.micro instances in us-east-2
some_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": [
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:instance/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:network-interface/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:key-pair/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:security-group/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:subnet/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:volume/*",
                f"arn:aws:ec2:us-east-2:{IDK['Account']}:image/ami-0a91cd140a1fc148a"
            ],
            "Condition": {
                "ForAllValues:NumericLessThanEquals": {
                    "ec2:VolumeSize": "10"
                },
                "ForAllValues:StringEquals": {
                    "ec2:InstanceType": "t2.micro"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ec2:TerminateInstances",
                "ec2:StartInstances",
                "ec2:StopInstances"
            ],
            "Resource": f"arn:aws:ec2:us-east-2:{IDK['Account']}:instance/*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "ec2:InstanceType": "t2.micro"
                }
            }
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*",
                "ec2:GetConsole*",
                "cloudwatch:DescribeAlarms",
                "iam:ListInstanceProfiles",
                "cloudwatch:GetMetricStatistics",
                "ec2:DescribeKeyPairs",
                "ec2:CreateKeyPair"
            ],
            "Resource": "*",
            "Condition": {
                "DateGreaterThan": {
                    "aws:CurrentTime": "2020-12-10T05:00:00Z"
                },
                "DateLessThanEquals": {
                    "aws:CurrentTime": "2020-12-10T05:35:00Z"
                }
            }
        }
    ]
}

response = iam.create_policy(
    PolicyName='GoodPolicy',
    PolicyDocument=json.dumps(some_policy)
)
print(response)

IDK1 = iam.attach_group_policy(
    GroupName='GoodGroup',
    PolicyArn=f"arn:aws:iam::{IDK['Account']}:policy/GoodPolicy"
)

# Create the trail that should deliver logs to the pre-existing 'goodbucket'
logs = cloudtrail.create_trail(
    Name='GoodTrail',
    S3BucketName='goodbucket',
)
print(logs)
You are configuring AWS CloudTrail to write log files to an Amazon S3 bucket. To do this, the S3 bucket requires a Bucket Policy that grants permission to the CloudTrail service to write to the bucket.
From Amazon S3 Bucket Policy for CloudTrail - AWS CloudTrail:
If you want to create or modify an Amazon S3 bucket to receive the log files for an organization trail, you must further modify the bucket policy.
To deliver log files to an S3 bucket, CloudTrail must have the required permissions, and it cannot be configured as a Requester Pays bucket. CloudTrail automatically attaches the required permissions to a bucket when you create an Amazon S3 bucket as part of creating or updating a trail in the CloudTrail console.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "cloudtrail.amazonaws.com"},
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::myBucketName"
    },
    {
      "Effect": "Allow",
      "Principal": {"Service": "cloudtrail.amazonaws.com"},
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::myBucketName/[optional prefix]/AWSLogs/myAccountID/*",
      "Condition": {"StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}}
    }
  ]
}

How to get Sid in AWS S3 bucket policy in Terraform

I want to attach a policy to an S3 bucket resource.
My Terraform infra:
resource "aws_s3_bucket" "storage" {
bucket = "${var.service}-${local.stage}-storage"
acl = "public-read"
tags = {
Service = var.service
Stage = local.stage
}
cors_rule {
allowed_headers = [
"*"
]
allowed_methods = [
"GET",
"HEAD"
]
allowed_origins = [
"*"
]
max_age_seconds = 3000
}
}
This bucket is for static web file hosting, and I need the bucket policy to make it public.
My policy in Terraform:
resource "aws_s3_bucket_policy" "storage-policy" {
bucket = aws_s3_bucket.storage.id
policy = <<POLICY
{
"Version": "2012-10-17",
"Id": "????????",
"Statement": [
{
"Sid": "????????",
"Effect": "Allow",
"Principal": "*",
"Action": "*",
"Resource": "arn:aws:s3:::BUCKET-NAME/*"
}
]
}
POLICY
}
In this code I need to fill in the Id and Sid field values.
How can I get these?
Thanks.
The Id and Sid are whatever you want. For example:
resource "aws_s3_bucket_policy" "storage-policy" {
bucket = aws_s3_bucket.storage.id
policy = <<POLICY
{
"Version": "2012-10-17",
"Id": "my-bucket-polict",
"Statement": [
{
"Sid": "allow-all-access",
"Effect": "Allow",
"Principal": "*",
"Action": "*",
"Resource": "arn:aws:s3:::BUCKET-NAME/*"
}
]
}
POLICY
}
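If you'd rather not hand-write JSON in a heredoc, here is a sketch of the same policy built with jsonencode (Terraform 0.12+); the Id and Sid values are still just arbitrary labels:

resource "aws_s3_bucket_policy" "storage-policy" {
  bucket = aws_s3_bucket.storage.id

  policy = jsonencode({
    Version = "2012-10-17"
    Id      = "my-bucket-policy" # arbitrary label for the whole policy
    Statement = [
      {
        Sid       = "allow-all-access" # arbitrary label for this statement
        Effect    = "Allow"
        Principal = "*"
        Action    = "*"
        Resource  = "arn:aws:s3:::BUCKET-NAME/*"
      }
    ]
  })
}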

Access Denied when querying in Athena for data in S3 bucket in another AWS account

I want to use a Glue Crawler to crawl data from an S3 bucket. This S3 bucket is in another AWS account; let's call it Account A. My Glue Crawler is in Account B.
I have created a Role in Account B and called it AWSGlueServiceRole-Reporting
I have attached the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::AccountAbucketname"
      ]
    },
    {
      "Sid": "ObjectAccess",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::AccountABucketName/Foldername/*"
      ]
    }
  ]
}
I also attached the AWSGlueServiceRole managed policy.
In Account A that has the S3 bucket, I've attached the following bucket policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountB:role/AWSGlueServiceRoleReporting"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::AccountABucketName"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountB:role/AWSGlueServiceRoleReporting"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::AccountABucketName/FolderName/*"
    }
  ]
}
I'm able to run a Glue Crawler in Account B on this S3 bucket and it created Glue Tables. But when I try to query them in Athena, I get Access Denied.
Can anybody help me figure out how to query it in Athena?
When Amazon Athena queries run, they use the permissions of the user that is running the query.
Therefore, you will need to modify the Bucket Policy on the bucket in Account A to permit access by whoever is running the query in Amazon Athena:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::AccountB:role/AWSGlueServiceRoleReporting",
          "arn:aws:iam::AccountB:user/username"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::AccountABucketName"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::AccountB:role/AWSGlueServiceRoleReporting",
          "arn:aws:iam::AccountB:user/username"
        ]
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::AccountABucketName/FolderName/*"
    }
  ]
}
The user will also need sufficient S3 permissions (on their IAM User) to access that S3 bucket. (For example, having s3:ListBucket and s3:GetObject on S3 buckets. They likely already have this, but it is worth mentioning.)
This is different to AWS Glue, which uses an IAM Role. Athena does not accept an IAM Role for running queries.
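If that user-side permission is also managed with Terraform (as elsewhere on this page), a minimal sketch could look like the following; the user name and policy name are made up for illustration:

resource "aws_iam_user_policy" "athena_query_s3_access" {
  name = "athena-query-s3-access"
  user = "username" # hypothetical IAM user in Account B running the Athena queries

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:ListBucket", "s3:GetBucketLocation"]
        Resource = "arn:aws:s3:::AccountABucketName"
      },
      {
        Effect   = "Allow"
        Action   = "s3:GetObject"
        Resource = "arn:aws:s3:::AccountABucketName/FolderName/*"
      }
    ]
  })
}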

AWS Firehose cross region/account policy

I am trying to create Firehose streams that can receive data from different regions in Account A, through AWS Lambda, and output into a redshift table in Account B. To do this I created an IAM role on Account A:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
I gave it the following permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::b-bucket/*",
        "arn:aws:s3:::b-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "firehose:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "redshift:*"
      ],
      "Resource": "*"
    }
  ]
}
On Account B I created a role with this trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "11111111111"
        }
      }
    }
  ]
}
I gave that role the following access:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::b-bucket",
        "arn:aws:s3:::b-bucket/*",
        "arn:aws:s3:::b-account-logs",
        "arn:aws:s3:::b-account-logs/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "firehose:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "redshift:*",
      "Resource": "arn:aws:redshift:us-east-1:cluster:account-b-cluster*"
    }
  ]
}
I also edited the access policy on the S3 buckets to give access to my Account A role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::11111111111:role/AccountAXAccountBPolicy"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::b-bucket",
        "arn:aws:s3:::b-bucket/*"
      ]
    }
  ]
}
However, none of this works. When I try to create the stream in Account A, it does not list the buckets in Account B nor the Redshift cluster. Is there any way to make this work?
John's answer is semi-correct. I would recommend that the account owner of the Redshift cluster creates the Firehose stream. Creating it through the CLI requires you to supply the user name and password; having the cluster owner create the stream and share IAM role permissions on it is safer for security and in case of credential changes. Additionally, you cannot create a stream that accesses a database outside of its region, so have the delivery application access the correct stream and region.
Read on below to see how to create the cross-account stream.
In my case both accounts are accessible to me, and to lower the amount of changes and ease monitoring I created the stream on the Account A side.
The above permissions are right; however, you cannot create a Firehose stream from Account A to Account B through the AWS Console. You need to do it through the AWS CLI:
aws firehose create-delivery-stream --delivery-stream-name testFirehoseStreamToRedshift
--redshift-destination-configuration 'RoleARN="arn:aws:iam::11111111111:role/AccountAXAccountBRole", ClusterJDBCURL="jdbc:redshift://<cluster-url>:<cluster-port>/<>",
CopyCommand={DataTableName="<schema_name>.x_test",DataTableColumns="ID1,STRING_DATA1",CopyOptions="csv"},Username="<Cluster_User_name>",Password="<Cluster_Password>",S3Configuration={RoleARN="arn:aws:iam::11111111111:role/AccountAXAccountBRole",
BucketARN="arn:aws:s3:::b-bucket",Prefix="test/",CompressionFormat="UNCOMPRESSED"}'
You can test this by creating a test table on the other AWS Account:
create table test_schema.x_test
(
  ID1 INT8 NOT NULL,
  STRING_DATA1 VARCHAR(10) NOT NULL
)
distkey(ID1)
sortkey(ID1,STRING_DATA1);
You can send test data like this:
aws firehose put-record --delivery-stream-name testFirehoseStreamToRedshift --record '{"DATA":"1,\"ABCDEFGHIJ\""}'
This with the permissions configuration above should create the cross account access for you.
Documentation:
Create Stream - http://docs.aws.amazon.com/cli/latest/reference/firehose/create-delivery-stream.html
Put Record - http://docs.aws.amazon.com/cli/latest/reference/firehose/put-record.html
No.
Amazon Kinesis Firehose will only output to Amazon S3 buckets and Amazon Redshift clusters in the same region.
However, anything can send information to Kinesis Firehose by simply calling the appropriate endpoint. So, you could have applications in any AWS Account and in any Region (or anywhere on the Internet) send data to the Firehose and then have it stored in a bucket or cluster in the same region as the Firehose.