Dataflow needs bigquery.datasets.get permission for the underlying table in authorized view - google-cloud-platform

In a dataflow pipeline, I'm reading from a BigQuery Authorized View:
beam.io.Read(beam.io.BigQuerySource(query="SELECT col1 FROM proj2.dataset2.auth_view1", use_standard_sql=True))
This is the error I'm getting:
Error:
Message: Access Denied: Dataset proj1:dataset1: The user xxxxxx-compute@developer.gserviceaccount.com does not have bigquery.datasets.get permission for dataset proj1:dataset1.
proj1:dataset1 has the base table for the view auth_view1.
According to this issue in DataflowJavaSDK, Dataflow seems to directly execute a metadata query against the underlying table.
Is there a fix available for this issue in Apache Beam SDK?

In the Apache Beam Java SDK, explicitly setting the query location is also a solution, via the withQueryLocation option of BigQueryIO.
Setting the query location does not appear to be possible in the Python SDK yet.
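Until the query location can be set in the Python SDK, the usual workaround is to grant the Dataflow worker service account metadata access on the dataset that backs the view. As a small illustration (the regex and helper below are my own, matched to the error text quoted above), you can pull the dataset, permission, and service account out of the error message to see exactly which grant is missing:

```python
import re

# Hypothetical helper: extract the service account, permission, and dataset
# named in a BigQuery "Access Denied" message like the one quoted above.
def parse_access_denied(message):
    m = re.search(
        r"The user (\S+) does not have (\S+) permission for dataset (\S+?)\.?$",
        message,
    )
    if not m:
        return None
    account, permission, dataset = m.groups()
    return {"account": account, "permission": permission, "dataset": dataset}

error = (
    "Access Denied: Dataset proj1:dataset1: The user "
    "xxxxxx-compute@developer.gserviceaccount.com does not have "
    "bigquery.datasets.get permission for dataset proj1:dataset1."
)

info = parse_access_denied(error)
print(info["dataset"])     # proj1:dataset1
print(info["permission"])  # bigquery.datasets.get
```

The dataset it names (proj1:dataset1, not the view's dataset) is the one where the worker service account needs bigquery.datasets.get.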

Related

Schedule query failure in GCP with 'The caller does not have permission' error

So I created a Python script similar to the [BQ tutorial on scheduled queries][1]. The service account has been set using os.environ. When executing with BigQuery Admin and other similar permissions (Data User, Data Transfer Agent, Data Viewer, etc.), the scheduled-query creation fails with
status = StatusCode.PERMISSION_DENIED
details = "The caller does not have permission"
The lowest permission level it accepts is Project Owner. As this is a service account, I was hoping a lower permission level could be applied, e.g. BigQuery Admin, since all I need the service account to do is remotely create scheduled queries. Even the how-to guide says it should work. Can anyone suggest another combination of permissions that will allow this to work?
[1]: https://cloud.google.com/bigquery/docs/scheduling-queries#set_up_scheduled_queries
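For context, the tutorial linked above creates the schedule through the BigQuery Data Transfer API. A rough, stdlib-only sketch of the request shape it sends (field names follow the tutorial; the dataset, query, and display name are placeholders I made up):

```python
# Sketch of the transfer-config payload the tutorial builds (values hypothetical).
def scheduled_query_config(dataset_id, query, schedule="every 24 hours"):
    return {
        "destination_dataset_id": dataset_id,
        "display_name": "my scheduled query",  # arbitrary label
        "data_source_id": "scheduled_query",   # fixed id for scheduled queries
        "params": {
            "query": query,
            "write_disposition": "WRITE_TRUNCATE",
            "destination_table_name_template": "my_table",
        },
        "schedule": schedule,
    }

config = scheduled_query_config("my_dataset", "SELECT 1")
print(config["data_source_id"])  # scheduled_query
```

Whatever role combination you land on, it has to cover creating this transfer config, not just running queries, which is why query-only roles like Data Viewer are rejected.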

Preview not available. Permission denied error in Dataprep while importing GCS file

I am referring to this article on creating a Cloud Dataprep pipeline.
When following the step of importing the data while creating the flow, I am not able to read the data; it says access denied (see the screenshot above).
Reference Link : https://www.trifacta.com/blog/data-quality-monitoring-for-cloud-dataprep-pipelines/
I tried importing the JSON file and I am expecting the flow to read the table.

Error Bigquery/dataflow "Could not resolve table in Data Catalog"

I'm having troubles with a job I've set up on dataflow.
Here is the context, I created a dataset on bigquery using the following path
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US"
Now I want to run a job on dataflow and I enter the following command into the google shell
gcloud dataflow sql query 'SELECT country, DATE_TRUNC(ORDERDATE, MONTH),
sum(sales) FROM bi-training-gcp.sales.sales_data GROUP BY 1, 2' --job-name=dataflow-sql-sales-monthly --region=us-east1 --bigquery-dataset=sales --bigquery-table=monthly_sales
The console accepts the query and returns a confirmation message.
After that I go to the Dataflow dashboard. I can see a new job queued, but after about 5 minutes the job fails and I get the following error messages:
Error
2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error 2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's location is "US", I thought it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success.
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve it, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternatively, assign the roles/datacatalog. role to the Dataflow service account.
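One thing worth checking alongside the Data Catalog API: in standard SQL, a project id containing hyphens (like bi-training-gcp) may need to be wrapped in backticks before the reference resolves. A small sketch (stdlib only; the quoting rule here is my paraphrase of BigQuery's identifier rules, not an official check):

```python
import re

# Hypothetical helper: quote a table reference the way standard SQL expects
# when a part (like a hyphenated project id) is not a bare identifier.
def quote_table_ref(project, dataset, table):
    def quote(part):
        # \w+ covers letters, digits, and underscores; anything else
        # (e.g. '-') needs backticks.
        return part if re.fullmatch(r"\w+", part) else "`%s`" % part
    return ".".join(quote(p) for p in (project, dataset, table))

print(quote_table_ref("bi-training-gcp", "sales", "sales_data"))
# `bi-training-gcp`.sales.sales_data
```

If the unquoted form keeps failing to resolve, trying the backticked form in the SQL text is a quick way to rule out a parsing issue.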

GCP: Is it possible to have an access to a resource if don't have project access?

It is my first experience with Google Cloud Platform and I'm confused.
I've got an access to a resource:
xxx#gmail.com has granted you the following roles for resource resource_name(projects/project_name/datasets/ClientsExport/tables/resource_name) BigQuery Data Editor
But if I open BigQuery, I don't see project_name or resource_name; searching for resource_name also returns no results.
This is the only access I have in the project (I didn't receive any other access grants or emails).
Could you please help me with this? Should I get some additional access so that resource_name becomes available? Or is there another way to find the resource?
Thank you in advance!
The message says you have access to BigQuery data in a table. You can query it from your project; you are authorized to read it (and to write as well, because you are an editor).
However, the table isn't in your project, it's in another project, which is why you don't see it directly in the BigQuery console. In addition, you don't have the right to read the metadata (roles/bigquery.metadataViewer) on the dataset in the other project, so you also can't view the table schema in the console, but the bq CLI allows you to view it.
I had some discussions with the Google BigQuery team about this (because I hit the same issue at my company), and updates should happen by the end of the year (or early in 2022) to fix this "view" issue in the console.
It looks like you have IAM permission to access a specific resource in BigQuery but cannot access it from the GUI.
Some reasons you may not see access on your GUI:
You have permission to interact with BigQuery but don't have access to any of the data.
You aren't a member of the organization that owns the resources, and it has higher-level (org-level) permissions that prevent sharing resources outside the org.
Your access is restricted to the command line/app level. (If your account is a service account then this is likely the case.)
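Regarding the bq CLI route mentioned above: the invocation to dump the table's schema might look like the following (a sketch; the table path reuses the names from the access-grant message, and I'm only building the argument list here rather than shelling out):

```python
# Hypothetical helper: build the `bq show` argument list for a table in
# another project (bq uses the project:dataset.table form).
def bq_show_schema_cmd(project, dataset, table):
    return [
        "bq", "show",
        "--schema",
        "--format=prettyjson",
        "%s:%s.%s" % (project, dataset, table),
    ]

cmd = bq_show_schema_cmd("project_name", "ClientsExport", "resource_name")
print(" ".join(cmd))
# bq show --schema --format=prettyjson project_name:ClientsExport.resource_name
```

Run the printed command in Cloud Shell or anywhere the bq CLI is authenticated as your account; with only Data Editor on the table, the schema should still come back even though the console hides it.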

Can I use Athena View as a source for a AWS Glue Job?

I'm trying to use an Athena View as a data source to my AWS Glue Job. The error message I'm getting while trying to run the Glue job is about the classification of the view. What can I define it as?
Thank you
(screenshot of the error message)
You can, by using the Athena JDBC driver. This approach circumvents the catalog, since only Athena (and not Glue, as of 25-Jan-2019) can directly access views.
Download the driver and store the jar to an S3 bucket.
Specify the S3 path to the driver as a dependent jar in your job definition.
Load the data into a dynamic frame using the code below (using an IAM user with permission to run Athena queries).
from awsglue.dynamicframe import DynamicFrame
# ...
athena_view_dataframe = (
    glueContext.read.format("jdbc")
    .option("user", "[IAM user access key]")
    .option("password", "[IAM user secret access key]")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443")
    .option("dbtable", "my_database.my_athena_view")
    .option("S3OutputLocation", "s3://bucket/temp/folder")  # CSVs/metadata dumped here on load
    .load()
)

athena_view_datasource = DynamicFrame.fromDF(athena_view_dataframe, glueContext, "athena_view_source")
The driver docs (pdf) provide alternatives to IAM user auth (e.g. SAML, custom provider).
The main side effect to this approach is that loading causes the query results to be dumped in CSV format to the bucket specified with the S3OutputLocation key.
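If your Athena endpoint is in a region other than us-east-1, the JDBC URL changes accordingly; a tiny sketch of building it (the endpoint pattern is taken from the URL in the snippet above):

```python
# Build the Athena JDBC URL for a given AWS region
# (pattern from the example above).
def athena_jdbc_url(region):
    return "jdbc:awsathena://athena.%s.amazonaws.com:443" % region

print(athena_jdbc_url("us-east-1"))
# jdbc:awsathena://athena.us-east-1.amazonaws.com:443
```

Pass the result to the "url" option in place of the hard-coded string, and make sure the S3OutputLocation bucket is in a region the IAM user can write to.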
I don't believe you can create a Glue Connection to Athena via JDBC, because you can't specify an S3 path to the driver location.
Attribution: AWS support totally helped me get this working.