BQ - No permission to INFORMATION_SCHEMA.JOBS - google-cloud-platform

Issue description
I am trying to insert DML statistics from BigQuery system tables into BigQuery native tables for monitoring purposes, using an Airflow task.
For this, I am using the query below:
INSERT INTO
  `my-project-id.my_dataset.my_table_metrics`
  (table_name,
   row_count,
   inserted_row_count,
   updated_row_count,
   creation_time)
SELECT
  b.table_name,
  a.row_count AS row_count,
  b.inserted_row_count,
  b.updated_row_count,
  b.creation_time
FROM
  `my-project-id.my_dataset`.__TABLES__ a
JOIN (
  SELECT
    tables.table_id AS table_name,
    dml_statistics.inserted_row_count AS inserted_row_count,
    dml_statistics.updated_row_count AS updated_row_count,
    creation_time AS creation_time
  FROM
    `my-project-id`.`region-europe-west3`.INFORMATION_SCHEMA.JOBS,
    UNNEST(referenced_tables) AS tables
  WHERE
    DATE(creation_time) = current_date ) b
ON
  a.table_id = b.table_name
WHERE
  a.table_id = 'my_bq_table'
The query works in the BigQuery console but not via Airflow.
Error as per Airflow:
python_http_client.exceptions.UnauthorizedError: HTTP Error 401:
Unauthorized
[2022-10-21, 12:18:30 UTC] {standard_task_runner.py:93}
ERROR - Failed to execute job 313922 for task load_metrics (403 Access
Denied: Table
my-project-id:region-europe-west3.INFORMATION_SCHEMA.JOBS: User does
not have permission to query table
my-project-id:region-europe-west3.INFORMATION_SCHEMA.JOBS, or perhaps
it does not exist in location europe-west3.
How can I fix this access issue?
Appreciate your help.
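For context, the post does not show the Airflow task itself. Below is a minimal, hedged sketch of how such a task is commonly wired up with BigQueryInsertJobOperator; the DAG name, schedule, connection ID, and location are assumptions, not details from the post. The service account behind the gcp_conn_id is the identity that needs permission on INFORMATION_SCHEMA.JOBS.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

metrics_sql = """
INSERT INTO `my-project-id.my_dataset.my_table_metrics`
    (table_name, row_count, inserted_row_count, updated_row_count, creation_time)
SELECT ...  -- the full INSERT ... SELECT statement shown above goes here
"""

# Hypothetical DAG/task wiring; names and schedule are placeholders.
with DAG(
    dag_id="bq_table_metrics",
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_metrics = BigQueryInsertJobOperator(
        task_id="load_metrics",
        gcp_conn_id="google_cloud_default",  # the service account behind this connection runs the query
        location="europe-west3",             # matches the region qualifier used in the query
        configuration={
            "query": {
                "query": metrics_sql,
                "useLegacySql": False,
            }
        },
    )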

Related

BigQuery Table Exports

I am looking for the best pattern to execute a BigQuery query and export the result to a Cloud Storage bucket. I would like this to run whenever the BigQuery table is written to or modified.
Traditionally I would set up a Pub/Sub topic that is written to when the table is modified, which would trigger a GCP function responsible for executing the query and writing the result to a GCS bucket. I am just not confident there isn't a better (more straightforward) approach to this in GCP.
Thanks in advance.
I propose an approach based on Eventarc.
The goal is to launch a Cloud Function or Cloud Run action when data is inserted or updated in a BigQuery table. Example with Cloud Run:
SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed
gcloud eventarc triggers create ${SERVICE}-trigger \
--location ${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com
When a BigQuery job is executed, the Cloud Run action will be triggered.
Example of a Cloud Run action:
from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Gets the payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        # Only react when rows were inserted into the table we watch
        if ds == 'cloud_run_tmp' and \
                tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            query = create_agg()  # create_agg() runs the aggregation/export; not shown in the answer
            return "table created", 200
    except:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200
You can apply logic based on the dataset, table, or other elements present in the payload.
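The create_agg() helper referenced above is not shown in the answer. As a rough, hedged sketch of what that query-and-export step could look like with the google-cloud-bigquery client (the project, dataset, table, and bucket names are placeholders, not values from the original post):

from google.cloud import bigquery

def create_agg():
    # Hypothetical sketch: materialize an aggregation, then export it to Cloud Storage.
    client = bigquery.Client()

    # Write the aggregation result into a destination table.
    job_config = bigquery.QueryJobConfig(
        destination="my-project.cloud_run_tmp.agg_result",
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(
        "SELECT col, COUNT(*) AS n "
        "FROM `my-project.cloud_run_tmp.cloud_run_trigger` GROUP BY col",
        job_config=job_config,
    ).result()

    # Export the destination table to a bucket as CSV.
    client.extract_table(
        "my-project.cloud_run_tmp.agg_result",
        "gs://my-bucket/exports/agg_result-*.csv",
    ).result()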

Creating Connection for RedshiftDataOperator

So I went to the Airflow documentation for AWS Redshift; there are two operators that can execute SQL queries: RedshiftSQLOperator and RedshiftDataOperator. I already implemented my job using RedshiftSQLOperator, but I want to do it with RedshiftDataOperator instead, because I don't want to use a Postgres connection in RedshiftSQLOperator but the AWS API.
RedshiftDataOperator Documentation
I read this documentation; there is an aws_conn_id parameter. But when I try to use the same connection ID, there is an error.
[2023-01-11, 04:55:56 UTC] {base.py:68} INFO - Using connection ID 'redshift_default' for task execution.
[2023-01-11, 04:55:56 UTC] {base_aws.py:206} INFO - Credentials retrieved from login
[2023-01-11, 04:55:56 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 146, in execute
self.statement_id = self.execute_query()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 124, in execute_query
resp = self.hook.conn.execute_statement(**filter_values)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the ExecuteStatement operation: The security token included in the request is invalid.
From the task:
redshift_data_task = RedshiftDataOperator(
    task_id='redshift_data_task',
    database='rds',
    region='ap-southeast-1',
    aws_conn_id='redshift_default',
    sql="""
        call some_procedure();
    """
)
What should I fill in for the Airflow connection? In the documentation there is no example of the values I should provide. Thanks
Airflow RedshiftDataOperator Connection Required Value
Have you tried using the Amazon Redshift connection? There is both an option for authenticating using your Redshift credentials:
Connection ID: redshift_default
Connection Type: Amazon Redshift
Host: <your-redshift-endpoint> (for example, redshift-cluster-1.123456789.us-west-1.redshift.amazonaws.com)
Schema: <your-redshift-database> (for example, dev, test, prod, etc.)
Login: <your-redshift-username> (for example, awsuser)
Password: <your-redshift-password>
Port: <your-redshift-port> (for example, 5439)
(source)
and an option for using an IAM role (there is an example in the first link).
Disclaimer: I work at Astronomer :)
EDIT: Tested the following with Airflow 2.5.0 and Amazon provider 6.2.0:
Added the IP of my Airflow instance to the VPC security group with "All traffic" access.
Airflow Connection with the connection ID aws_default, connection type "Amazon Web Services", extra: { "aws_access_key_id": "<your-access-key-id>", "aws_secret_access_key": "<your-secret-access-key>", "region_name": "<your-region-name>" }. All other fields blank. I used a root key for my toy AWS account. If you use other credentials, you need to make sure the IAM role has access and the right permissions to the Redshift cluster (there is a list in the link above).
Operator code:
red = RedshiftDataOperator(
    task_id="red",
    database="dev",
    sql="SELECT * FROM dev.public.users LIMIT 5;",
    cluster_identifier="redshift-cluster-1",
    db_user="awsuser",
    aws_conn_id="aws_default"
)
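If the credentials themselves are in doubt, one way to sanity-check them outside Airflow is to call the Redshift Data API directly with boto3. The sketch below mirrors the operator above; the identifiers and region are assumptions, not values from the original post.

import time
import boto3

# Same credentials/identifiers as the operator above (all placeholders).
client = boto3.client("redshift-data", region_name="us-west-1")

resp = client.execute_statement(
    ClusterIdentifier="redshift-cluster-1",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT * FROM dev.public.users LIMIT 5;",
)

# Poll until the statement finishes; an UnrecognizedClientException here points
# at the credentials themselves rather than at Airflow.
while True:
    desc = client.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)
print(desc["Status"])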

AWS unable to get query result because of ResourceNotFoundException

I'm trying to run a CloudWatch Logs Insights query with boto3, but I'm getting a ResourceNotFoundException.
import boto3

if __name__ == "__main__":
    client = boto3.client('logs')
    response = client.start_query(
        logGroupName='/aws/lambda/My-Stack-Name-SE349DJ',
        startTime=123,
        endTime=123,
        queryString="fields @message",
        limit=1
    )
I ran the above code, and the error message is as follows.
botocore.errorfactory.ResourceNotFoundException: An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group '/aws/lambda/My-Stack-Name-SE349DJ' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
What I tested is as follows.
The log group exists. I tested it with Logs Insights in the AWS console, and I also tested after pasting the log group name exactly as it is.
I added backslashes to test whether '/' is the problem (e.g. '\/aws\/lambda\/My-Stack-Name-SE349DJ'), and an InvalidParameterException appears.
The AWS account has administrator access privileges to the log group.
I got the same error message when I tested with the AWS CLI.
An error occurred (ResourceNotFoundException) when calling the StartQuery operation: Log group 'XXXXXXXXXXXXXX' does not exist for account ID '11111111' (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: xxxxx-xxxx-xxx; Proxy: null)
How can I solve this problem?
Actually, the reason I'm trying this is that I need to get more than 500,000 records from the filtered log group, but 10,000 is the maximum per query. I think it's better to pull the data out by varying the start time and end time (a sketch of this time-window approach follows the answer below).
There is a high possibility that there is too much data in certain time ranges, so I think it would be better to run it with boto3 rather than directly. Is there an easy way to extract more than 500,000 records from the console, or some other method?
As @Marcin commented, it was because of the region configuration.
I added these lines before creating the AWS client.
from botocore.config import Config
...
my_config = Config(
    region_name='us-east-2',
)
...
client = boto3.client('logs', config=my_config)
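For the follow-up question about pulling more than 10,000 records, here is a hedged sketch of the time-window approach described in the question: run one Logs Insights query per window and concatenate the results. The log group, region, overall range, and window size are assumptions.

import time
from datetime import datetime, timedelta, timezone

import boto3
from botocore.config import Config

client = boto3.client("logs", config=Config(region_name="us-east-2"))

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)
window = timedelta(hours=1)

all_rows = []
cursor = start
while cursor < end:
    query_id = client.start_query(
        logGroupName="/aws/lambda/My-Stack-Name-SE349DJ",
        startTime=int(cursor.timestamp()),
        endTime=int(min(cursor + window, end).timestamp()),
        queryString="fields @timestamp, @message",
        limit=10000,  # per-query maximum for Logs Insights
    )["queryId"]

    # Wait for this window's query to complete before moving on.
    while True:
        result = client.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(2)

    all_rows.extend(result.get("results", []))
    cursor += window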

Change location of the glue table

I've created a Glue table (external) via Terraform, where I didn't set the location of the table.
The location of the table should be updated after the app runs. But when the app runs, it receives an exception:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Error: ',', ':', or ';' expected at position 291 from 'bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:bigint:string:string:smallint:smallint:smallint:decimal(12,2):decimal(12,2):decimal(12,2):bigint:string:bigint:string:timestamp:timestamp:bigint:bigint:bigint:bigint:bigint:string:string:decimal(12,2) :bigint:timestamp:string:bigint:decimal(12,2):string:bigint:bigint:timestamp:int' [0:bigint, 6::, 7:bigint, 13::, 14:bigint, 20::, 21:bigint, 27::, 28:bigint, 34::, 35:bigint, 41::, 42:bigint, 48::, 49:bigint, 55::, 56:bigint, 62::, 63:bigint, 69::, 70:bigint, 76::, 77:bigint, 83::, 84:bigint, 90::, 91:bigint, 97::, 98:string, 104::, 105:string, 111::, 112:smallint, 120::, 121:smallint, 129::, 130:smallint, 138::, 139:decimal, 146:(, 147:12, 149:,, 150:2, 151:), 152::, 153:decimal, 160:(, 161:12, 163:,, 164:2, 165:), 166::, 167:decimal, 174:(, 175:12, 177:,, 178:2, 179:), 180::, 181:bigint, 187::, 188:string, 194::, 195:bigint, 201::, 202:string, 208::, 209:timestamp, 218::, 219:timestamp, 228::, 229:bigint, 235::, 236:bigint, 242::, 243:bigint, 249::, 250:bigint, 256::, 257:bigint, 263::, 264:string, 270::, 271:string, 277::, 278:decimal, 285:(, 286:12, 288:,, 289:2, 290:), 291: , 292::, 293:bigint, 299::, 300:timestamp, 309::, 310:string, 316::, 317:bigint, 323::, 324:decimal, 331:(, 332:12, 334:,, 335:2, 336:), 337::, 338:string, 344::, 345:bigint, 351::, 352:bigint, 358::, 359:timestamp, 368::, 369:int]
This exception more or less lists the fields that were defined in Terraform.
From the AWS console I couldn't set the location after the table was created. When I connected to an AWS EMR cluster that uses the Glue metastore and tried to execute the same query, I got the same exception.
So I have several questions:
Does anybody know how to alter the empty location of an external Glue table?
The default location of a table should look like hive/warehouse/dbname.db/tablename. So what is the correct path in that case on EMR?
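No answer is recorded here, but for the first question one approach that is sometimes used is to patch the location through the Glue API instead of Hive DDL, which avoids the DDL parsing path that raises the error above. This is only a sketch under assumptions (the database, table, region, and bucket names are placeholders, and it is untested against this particular schema):

import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # region is a placeholder

# Fetch the current table definition.
table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]

# update_table rejects the read-only fields that get_table returns,
# so copy only updatable ones into a TableInput.
updatable = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}
table_input = {k: v for k, v in table.items() if k in updatable}

# Set the previously empty location.
table_input["StorageDescriptor"]["Location"] = "s3://my-bucket/warehouse/my_db.db/my_table/"

glue.update_table(DatabaseName="my_db", TableInput=table_input)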

AWS Batch - Access denied 403

I am using AWS Batch with ECS to perform a job which needs to send a request to Athena. I use Python boto3 to send the query and get the request status:
start_query_execution: works fine
get_query_execution: has an error!
When I try to get the query execution, I get the following error:
{'QueryExecution': {'QueryExecutionId': 'XXXX', 'Query': "SELECT * FROM my_table LIMIT 10 ", 'StatementType': 'DML', 'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/query_id.csv'}, 'QueryExecutionContext': {'Database': 'my_database'}, 'Status': {'State': 'FAILED', 'StateChangeReason': '**Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 4.**. ; S3 Extended Request ID: ....=)'
I have given all permissions to the container role (only to test):
s3:*
athena:*
glue:*
I face this problem only in the container in AWS Batch: with the same policy and code in a Lambda, it works!
Any help will be appreciated.
For the Athena output location, I have been using the Athena bucket/prefix name, not a file name.
A result set will be generated, and it will have its own ID:
'ResultConfiguration': {'OutputLocation': 's3://my_bucket_name/athena-results/'}
If you are not sure which bucket to use for query results, you can check it in the query console --> Settings.
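As a hedged sketch of that point (the bucket, database, and region below are placeholders): pass a bucket/prefix as the output location, let Athena name the result object itself, and read the final location back from get_query_execution.

import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

# Start the query with a bucket/prefix (not a file name) as the output location.
query_execution_id = athena.start_query_execution(
    QueryString="SELECT * FROM my_table LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my_bucket_name/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Athena writes the result as <prefix>/<QueryExecutionId>.csv
print(state, execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"])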