Service account execution of a batch Dataflow job - google-cloud-platform

I need to execute a Dataflow job using a service account. I'm following the very simple and basic WordCount example offered within the platform itself.
What is weird is the error I'm getting:
According to this, GCP requires the service account to have the Dataflow Worker role in order to execute my job. The weird part is that the error kept showing up even though I had already set the required permissions:
Can someone explain this strange behavior? Thanks so much.

To run a Dataflow job, a project must enable billing and the following Google Cloud Platform APIs:
Google Cloud Dataflow API
Compute Engine API (Google Compute Engine)
Google Cloud Logging API
Google Cloud Storage
Google Cloud Storage JSON API
BigQuery API
Google Cloud Pub/Sub
Google Cloud Datastore API
You should also have enough quota in the project for any one of the APIs you are using in the Dataflow job.
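If any of these are missing, one way to enable them in bulk is with gcloud; a sketch, assuming these service names still match the APIs listed above (verify against your project):

    # Enable the APIs a Dataflow job typically needs (service names may vary)
    gcloud services enable \
        dataflow.googleapis.com \
        compute.googleapis.com \
        logging.googleapis.com \
        storage-component.googleapis.com \
        storage-api.googleapis.com \
        bigquery.googleapis.com \
        pubsub.googleapis.com \
        datastore.googleapis.com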
I would suggest you create a fresh service account whose name has not been used before, and then grant roles/dataflow.worker to this new service account. Remember that Cloud IAM propagation typically takes fewer than 60 seconds, but can take up to 7 minutes, so please allow a few minutes between an IAM change and Dataflow job creation.
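A minimal sketch of that with gcloud, where dataflow-runner and MY_PROJECT_ID are placeholders:

    # Create a brand-new service account (the name must not have been used before)
    gcloud iam service-accounts create dataflow-runner \
        --display-name="Dataflow runner"
    # Grant it the Dataflow Worker role at the project level
    gcloud projects add-iam-policy-binding MY_PROJECT_ID \
        --member="serviceAccount:dataflow-runner@MY_PROJECT_ID.iam.gserviceaccount.com" \
        --role="roles/dataflow.worker"
    # Then wait a few minutes for IAM propagation before launching the job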
Another possible workaround is to remove the Dataflow Worker role and add it again. When a service account is deleted, its role bindings remain behind, pointing to the old account ID; that stale binding is not refreshed until you explicitly delete the role binding.
I encourage you to visit the Dataflow IAM roles documentation, which includes role descriptions and permissions.

Related

What is the difference between service account and service agent in GCP

Say I have this case where:
I have to run some tests with Dataflow.
Inside this Dataflow job I need to access a GCS bucket and save my output there.
I will need to run the Dataflow job with my own SA instead of the default SA.
I created a Google service account to run my Dataflow job, but after I enabled the Dataflow API, I ended up with two SAs in front of me:
the service account agent --> 123456789@dataflow.gserviceaccount.com
the Dataflow job runner service account --> dataflow-job-runner@MY-PROJECT-ID.iam.gserviceaccount.com
It got me really confused when I saw what the official documentation says:
Some Google Cloud services have Google-managed service accounts that allow the services to access your resources. These service accounts are sometimes known as service agents.
If I create a Dataflow job to run with the dataflow-job-runner@MY-PROJECT-ID.iam.gserviceaccount.com SA, I suppose I'd need to grant it roles/storage.objectAdmin.
The questions are:
Do I need to grant any permissions to the service account agent?
What does the service account agent actually do, and why does it need access to any resources?
Several Google Cloud services such as Cloud Dataflow require two sets of permissions.
The program that you write runs as a service account. You grant this service account the IAM roles required for the resources your program needs authorized access to, for example reading data from Cloud Storage or issuing queries to BigQuery.
The service agent applies to the service's runtime. For example, when you launch a job on Cloud Dataflow, Cloud Dataflow needs to launch VMs to run your program on. Your program is not launching the VMs; the service is. Therefore the service requires its own set of permissions. This is what the service agent is for.
By using two different service accounts, separation of privilege is achieved.
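To make the split concrete, here is a sketch of the runner-SA side, assuming hypothetical names (my_pipeline.py, the dataflow-job-runner SA) and using the pipeline's service account option (--service_account_email in Python, --serviceAccount in Java); the service agent's own permissions are normally granted automatically when the Dataflow API is enabled:

    # Grant the job runner SA access to the output bucket, as the question suggests
    gcloud projects add-iam-policy-binding MY-PROJECT-ID \
        --member="serviceAccount:dataflow-job-runner@MY-PROJECT-ID.iam.gserviceaccount.com" \
        --role="roles/storage.objectAdmin"
    # Launch the pipeline as that SA instead of the default one (Python option shown)
    python my_pipeline.py \
        --runner=DataflowRunner \
        --project=MY-PROJECT-ID \
        --service_account_email=dataflow-job-runner@MY-PROJECT-ID.iam.gserviceaccount.com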

How to programmatically add roles to the Cloud Build service account?

I am trying to use setIamPolicy for the Cloud Build service account @cloudbuild.gserviceaccount.com. I want to grant the App Engine Admin and Cloud Run Admin roles to the Cloud Build service account so that it can do automated releases on App Engine.
Somehow it throws a 404 when I pass the resource of the Cloud Build service account while getting the IAM policy. To confirm, I tried GET https://iam.googleapis.com/v1/{name=projects/*}/serviceAccounts in the API Explorer, and it also does not return the Google-managed service accounts. It seems it only returns the service accounts that were created in the project, not the Google-managed default accounts.
How can I set IAM Policy to grant these permissions to Cloud Build?
The general idea is to enable these permissions for both App Engine and Cloud Run.
Also, a common problem is not knowing that cron permissions are needed for App Engine and Cloud Build. For example, this article lists "Update cron schedules" as "No" for "App Engine Admin". Whether you need that or not depends on how your builds are done. If you end up needing it too, grant the "Cloud Scheduler Admin" role to your @cloudbuild.gserviceaccount.com account. You can apply the same logic to other permissions, and that chart might be useful for knowing what is needed depending on your setup.
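The 404 likely comes from trying to set the policy on the service account resource itself; roles like these are bindings on the project's IAM policy, so the call to make is setIamPolicy on the project (Cloud Resource Manager API), not on the service account. A hedged sketch of the equivalent gcloud flow, assuming the default @cloudbuild.gserviceaccount.com naming and the usual role IDs for the console names:

    # Resolve the project number used in the Cloud Build SA's email
    PROJECT_ID=$(gcloud config get-value project)
    PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
    CLOUD_BUILD_SA="${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"
    # Grant roles on the project's IAM policy, not on the service account resource
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
        --member="serviceAccount:${CLOUD_BUILD_SA}" --role="roles/appengine.appAdmin"
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
        --member="serviceAccount:${CLOUD_BUILD_SA}" --role="roles/run.admin"
    # Optional, if your builds also update cron schedules
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
        --member="serviceAccount:${CLOUD_BUILD_SA}" --role="roles/cloudscheduler.admin"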

Cloud Dataflow job reading from one Bigquery project and writing to another BigQuery project

I'm implementing a Cloud Dataflow job on GCP that needs to deal with 2 GCP projects.
Both the input and output are BigQuery partitioned tables.
The issue I'm facing now is that I must read data from project A and write it into project B.
I haven't seen anything related to cross-project service accounts, and I can't give Dataflow two different credential keys either, which is a bit annoying.
I don't know if someone else went through that kind of architecture or how you dealt with it.
I think you can accomplish this with the following steps:
Create a dedicated service account in the project running the Dataflow job.
Grant the service account the Dataflow Worker and BigQuery Job User roles. The service account might need additional roles based on the full resource needs of the Dataflow job.
In Project A, grant the service account the BigQuery Data Viewer role to either the entire project or to specific datasets.
In Project B, grant the service account the BigQuery Data Editor role to either the entire project or to specific datasets.
When you start the Dataflow job, override the service account pipeline option, supplying the new service account (see the sketch after this list).
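A sketch of those grants with gcloud; JOB_PROJECT, PROJECT_A, PROJECT_B, and the df-runner account name are placeholders:

    # In the job project: let the SA run Dataflow work and create BigQuery jobs
    gcloud projects add-iam-policy-binding JOB_PROJECT \
        --member="serviceAccount:df-runner@JOB_PROJECT.iam.gserviceaccount.com" \
        --role="roles/dataflow.worker"
    gcloud projects add-iam-policy-binding JOB_PROJECT \
        --member="serviceAccount:df-runner@JOB_PROJECT.iam.gserviceaccount.com" \
        --role="roles/bigquery.jobUser"
    # In project A (source): read access
    gcloud projects add-iam-policy-binding PROJECT_A \
        --member="serviceAccount:df-runner@JOB_PROJECT.iam.gserviceaccount.com" \
        --role="roles/bigquery.dataViewer"
    # In project B (sink): write access
    gcloud projects add-iam-policy-binding PROJECT_B \
        --member="serviceAccount:df-runner@JOB_PROJECT.iam.gserviceaccount.com" \
        --role="roles/bigquery.dataEditor"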
It is very simple: you need to give the required permissions/access to your service account in both projects.
So you need only one service account, which has the required access/permissions in both projects.
Hope it helps.

Dataprep doesn't work - Cloud Dataflow Service Agent

I made a mistake deleting the user service-[project number]@dataflow-service-producer-prod.iam.gserviceaccount.com in Service accounts; I should have deleted another user.
After that, Dataprep stopped running jobs.
I've checked all the guidelines about Dataflow and Dataprep: whether the API is enabled (yes, it is), and whether there is a proper service account (yes). But I don't know which roles to assign to these accounts.
I tried assigning the "Cloud Dataflow Service Agent" role to this account, but it doesn't appear for me.
I also tried assigning other roles, but that didn't work.
It all started when I deleted this account erroneously.
Does someone know how to solve this?
PS: My English is a work in progress; sorry for any mistakes.
If you accidentally deleted the Dataflow service account, disabling the Dataflow API and then re-enabling it will create the service account again automatically.
Disabling/enabling the API is not recommended, as associated resources will be impacted. You should instead undelete the default service account within the following 30 days. You will need its ACCOUNT_UNIQUE_ID, which can be found in the logs generated when it was deleted. Find details here.
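A sketch of that undelete path with gcloud; the audit log filter and the beta command reflect my understanding of the usual flow, so double-check against the linked details:

    # Locate the deletion event in the audit logs to recover the account's unique ID
    gcloud logging read \
        'protoPayload.methodName="google.iam.admin.v1.DeleteServiceAccount"' \
        --freshness=30d --format=json
    # Undelete using the numeric unique ID from the log entry (not the email)
    gcloud beta iam service-accounts undelete ACCOUNT_UNIQUE_ID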

Failed job in Cloud Dataflow: enable Dataflow API

I'm currently trying to use Dataflow with Pub/Sub but I'm getting this error:
Workflow failed. Causes: (6e74e8516c0638ca): There was a problem refreshing your credentials. Please check:
1. Dataflow API is enabled for your project.
2. There is a robot service account for your project:
service-[project number]@dataflow-service-producer-prod.iam.gserviceaccount.com should have access to your project. If this account does not appear in the permissions tab for your project, contact Dataflow support.
I tried looking in the API Manager to enable the Dataflow API, but I can't find Dataflow at all. I'm also not seeing the robot service account.
You can see whether the API is enabled by searching for dataflow within the API Manager (it should show whether it's enabled or not).
To find the appropriate robot account, search for dataflow-service-producer-prod.iam.gserviceaccount.com within the IAM page.
Finally, the quick start guide may be of use.
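If you prefer the command line, a quick check from the gcloud SDK:

    # Lists only enabled services; dataflow.googleapis.com should appear if the API is on
    gcloud services list --enabled | grep dataflow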
You can enable it from the console or just use the gcloud command.
Enable Dataflow API: gcloud services enable dataflow.googleapis.com
Disable Dataflow API: gcloud services disable dataflow.googleapis.com
Adding the Dataflow Worker role to the project's default Compute Engine service account solved the problem for me.
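For reference, a sketch of that grant, assuming the usual PROJECT_NUMBER-compute@developer.gserviceaccount.com naming for the default Compute Engine service account:

    # Resolve the project number, then bind the Dataflow Worker role to the default compute SA
    PROJECT_ID=$(gcloud config get-value project)
    PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
        --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
        --role="roles/dataflow.worker"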