What is the right way to pass credentials to Dataflow jobs?
Some of my Dataflow jobs need credentials to make REST calls and fetch/post processed data.
I am currently using environment variables to pass the credentials to the JVM, reading them into a Serializable object and passing that to the DoFn implementation's constructor. I am not sure this is the right approach, as a Serializable class should not contain sensitive information.
Another way I thought of is to store the credentials in GCS and retrieve them using a service account key file, but I was wondering whether my job should really be responsible for reading credentials from GCS.
Google Cloud Dataflow does not have native support for passing or storing secured secrets. However, you can use Cloud KMS and/or GCS, as you propose, to read a secret at runtime using your Dataflow service account credentials.
If you read the credential at runtime from a DoFn, you can use the DoFn.Setup lifecycle API to read the value once and cache it for the lifetime of the DoFn.
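As an illustration, here is a minimal sketch using the Beam Python SDK (the same idea applies on the JVM with a @Setup-annotated method); the bucket and object names are placeholders, and the secret is assumed to live in GCS where the Dataflow worker service account can read it:

import apache_beam as beam
from google.cloud import storage

class CallApiFn(beam.DoFn):
    def __init__(self, secret_bucket, secret_object):
        # Only non-sensitive configuration is serialized with the DoFn.
        self._secret_bucket = secret_bucket
        self._secret_object = secret_object
        self._api_key = None

    def setup(self):
        # Runs once per DoFn instance on the worker; the worker's service
        # account is used implicitly by the storage client.
        blob = storage.Client().bucket(self._secret_bucket).blob(self._secret_object)
        self._api_key = blob.download_as_bytes().decode("utf-8").strip()

    def process(self, element):
        # Use the cached self._api_key to call the REST API here.
        yield element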
You can learn about various options for secret management in Google Cloud here: Secret management with Cloud KMS.
Related
Currently, we use AWS IAM User permanent credentials to transfer customers' data from our company's internal AWS S3 buckets to customers' Google BigQuery tables following BigQuery Data Transfer Service documentation.
Using permanent credentials poses security risks for the data stored in AWS S3.
We would like to use AWS IAM Role temporary credentials instead, which would require BigQuery to support a session token in order to get authorized on the AWS side.
Is there a way for the BigQuery Data Transfer Service to use AWS IAM roles or temporary credentials to authorise against AWS and transfer data?
We considered the Omni framework (https://cloud.google.com/bigquery/docs/omni-aws-cross-cloud-transfer) to transfer data from S3 to BQ, however we faced several concerns/limitations:
The Omni framework targets data-analysis use cases rather than data transfer from external services, which raises the concern that its design may have drawbacks for data transfer at high scale.
The Omni framework currently supports only the AWS-US-EAST-1 region (we require support at least in AWS-US-WEST-2 and AWS-EU-CENTRAL-1 and the corresponding Google regions). This is not backward compatible with our current customers' setup for transferring data from internal S3 to customers' BQ.
Our current customers would need to sign up for the Omni service to properly migrate from the current transfer solution we use.
We considered a workaround of exporting data from S3 via staging in GCS (i.e. S3 -> GCS -> BQ), but this would also require a lot of effort from both the customers' and our company's side to migrate to the new solution.
Is there a way for the BigQuery Data Transfer Service to use AWS IAM roles or temporary credentials to authorise against AWS and transfer data?
No, unfortunately not.
The official Google BigQuery Data Transfer Service documentation only mentions AWS access keys throughout:
The access key ID and secret access key are used to access the Amazon S3 data on your behalf. As a best practice, create a unique access key ID and secret access key specifically for Amazon S3 transfers to give minimal access to the BigQuery Data Transfer Service. For information on managing your access keys, see the AWS general reference documentation.
The irony of the Google documentation is that while it refers to best practices and links to the official AWS docs, it doesn't actually endorse those best practices and ignores what AWS itself recommends:
We recommend that you use temporary access keys over long term access keys, as mentioned in the previous section.
Important
Unless there is no other option, we strongly recommend that you don't create long-term access keys for your (root) user. If a malicious user gains access to your (root) user access keys, they can completely take over your account.
You have a few options:
hook into both sides manually, i.e. link up the various SDKs and/or APIs yourself (see the sketch after this list)
find an alternative BigQuery-compatible service that supports temporary credentials
accept the risk of long-term access keys.
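As an illustration of the first option, here is a rough sketch under assumed resource names (the role ARN, bucket, object key, and table are placeholders): assume an AWS IAM role via STS to obtain temporary credentials, read the object from S3, and load it into BigQuery yourself:

import io

import boto3
from google.cloud import bigquery

# Obtain temporary credentials by assuming a role (placeholder ARN).
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/bq-transfer-role",
    RoleSessionName="bq-transfer",
)["Credentials"]

# Read the object from S3 with the temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
data = io.BytesIO(
    s3.get_object(Bucket="my-internal-bucket", Key="export/data.json")["Body"].read()
)

# Load the data into BigQuery using Google credentials.
bq = bigquery.Client()
bq.load_table_from_file(
    data,
    "my-project.my_dataset.my_table",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
    ),
).result()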
In conclusion, Google is at fault here for not following security best practices, and you, as a consumer, will have to bear the risk.
Currently we put the google cloud credentials in json file and set it via GOOGLE_APPLICATION_CREDENTIALS environment variable.
But we do not want to hardcode the private key in the JSON file for security reasons. We want to put the private key in Azure Key Vault (yes, we use Azure Key Vault). Is there a way to provide the credentials to GCP programmatically, so that I can read Azure Key Vault and supply the private key via code? I tried to use GoogleCredentials, Google's DefaultCredentialsProvider, and similar classes, but I could not find a proper example.
Note: the Google credential type is a service account credential.
Any help is much appreciated.
Store the service account JSON base64 encoded in Azure Key Vault as a string.
Read the string back from Key Vault and base64-decode it back to a JSON string.
Load the JSON string using one of the SDK member methods with names similar to from_service_account_info() (Python SDK).
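A minimal sketch of those three steps in Python, assuming a vault URL and secret name of your own (both are placeholders here), using the azure-keyvault-secrets and google-auth libraries:

import base64
import json

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from google.cloud import storage
from google.oauth2 import service_account

# Read the base64-encoded service account JSON back from Key Vault.
vault = SecretClient(
    vault_url="https://my-vault.vault.azure.net",
    credential=DefaultAzureCredential(),
)
encoded = vault.get_secret("gcp-service-account-json").value

# Decode it back to a JSON dict and build Google credentials from it.
info = json.loads(base64.b64decode(encoded))
credentials = service_account.Credentials.from_service_account_info(info)

# Pass the credentials explicitly to any Google client, e.g. Cloud Storage.
client = storage.Client(project=info["project_id"], credentials=credentials)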
John's answer is the correct one if you want to load a secret from Azure vault.
However, if you need a service account credential in a Google Cloud environment, you don't need a service account key file; you can use the service account automatically provided through the metadata server.
In almost all Google Cloud services, you can customize which service account to use. In the worst case, you use the default service account for that context and grant it the correct permissions.
Using service account key files isn't a good practice on Google Cloud. A key file is a long-lived credential that you have to rotate yourself, keep secret, and so on.
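If the code does run inside Google Cloud, a sketch of the keyless approach: Application Default Credentials resolve to the attached service account via the metadata server, so no key file is involved:

import google.auth
from google.cloud import storage

# Resolves to the attached service account via the metadata server when
# running on GCE, GKE, Cloud Run, Cloud Functions, Dataflow, etc.
credentials, project = google.auth.default()

# Client libraries do this implicitly, so the explicit call is optional.
client = storage.Client(project=project, credentials=credentials)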
I have a CLI tool that interacts with Google KMS. In order for it to work, I fetch the user credentials as a JSON file which is stored on disk. Now a new requirement came along. I need to make a web app out of this CLI tool. The web app will be protected via Google Cloud IAP. Question is, how do I run the CLI tool on behalf of the authenticated user?
You don't. Better to use a service account and assign the required role. That service account could still have domain-wide delegation of rights (i.e. the ability to impersonate any user, which is a known risk).
Running CLI tools from a web application could/should probably also be avoided. It might be better to convert the CLI tool into a Cloud Function and call it via an HTTP trigger from within the web application, so that access to the service account is limited as far as possible.
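For instance, a minimal sketch of such a Cloud Function (the key path and function name are placeholders), which calls KMS using only the function's attached service account:

import base64

import functions_framework
from google.cloud import kms

# Placeholder key path; in practice read it from an environment variable.
KEY_NAME = "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"

@functions_framework.http
def encrypt_handler(request):
    # The function runs as its attached service account, so no user
    # credentials or key files are needed to call KMS.
    plaintext = request.get_data()
    response = kms.KeyManagementServiceClient().encrypt(
        request={"name": KEY_NAME, "plaintext": plaintext}
    )
    return base64.b64encode(response.ciphertext).decode("utf-8")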
This might also be something to reconsider, security-wise:
I fetch the user credentials as a JSON file which is stored on disk.
Even if that might have been required before, with a service account it wouldn't be.
Within my Dataflow pipeline, I have a function that creates a Cloud Storage client. Instead of my VMs automatically using the default credentials, I would like to specify a key file.
I believe the way to do that is
client = storage.Client.from_service_account_json([path to local file])
However, I'm not sure where to put my json file so that my pipeline function has access to it. Where should I upload my json file?
Dataflow uses controller service accounts to create and manage resources when executing a pipeline.
If you want to create and use resources with fine-grained access and control, you can use a service account from your job's project as the user-managed controller service account.
Use the --serviceAccount option and specify your service account when you run your pipeline job:
--serviceAccount=my-service-account-name@my-project.iam.gserviceaccount.com
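If you are on the Beam Python SDK, the corresponding pipeline option is service_account_email; a minimal sketch with placeholder names:

from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions

options = PipelineOptions()
# The workers then run as this account instead of the default Compute Engine one.
options.view_as(GoogleCloudOptions).service_account_email = (
    "my-service-account-name@my-project.iam.gserviceaccount.com"
)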
I want to deploy a node application on a google cloud compute engine micro instance from a source control repo.
As part of this deployment I want to use KMS to store database credentials rather than having them in my source control. To get the credentials from KMS I need to authenticate on the instance with GCLOUD in the first place.
Is it safe to just install the GCloud CLI as part of a startup script and let the default service account handle the authentication? Then use this to pull in the decrypted details and save them to a file?
The docs walk through development examples, but I've not found anything about how this should work in production, especially as I obviously don't want to store the GCloud credentials in source control either.
Yes, this is exactly what we recommend: use the default service account to authenticate to KMS and decrypt a file with the credentials in it. You can store the resulting data in a file, but I usually either pipe it directly to the service that needs it or put it in tmpfs so it's only stored in RAM.
You can check the encrypted credentials file into your source repository, store it in Google Cloud Storage, or elsewhere. (You create the encrypted file by using a different account, such as your personal account or another service account, which has wrap but not unwrap access on the KMS key, to encrypt the credentials file.)
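A rough sketch of the runtime side of that flow in Python (the Node client libraries expose equivalent calls); the bucket, object, and key names are placeholders:

from google.cloud import kms, storage

KEY_NAME = "projects/my-project/locations/global/keyRings/app/cryptoKeys/db-creds"

# Fetch the encrypted credentials file with the instance's default service account.
ciphertext = (
    storage.Client()
    .bucket("my-config-bucket")
    .blob("db-credentials.enc")
    .download_as_bytes()
)

# Decrypt with KMS and keep the plaintext only in memory (or write it to tmpfs).
response = kms.KeyManagementServiceClient().decrypt(
    request={"name": KEY_NAME, "ciphertext": ciphertext}
)
db_credentials = response.plaintext.decode("utf-8")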
If you use this method, you have a clean line of control:
Your administrative user authentication gates the ability to run code as the trusted service account.
Only that service account can decrypt the credentials.
There is no need to store a secret in cleartext anywhere.
Thank you for using Google Cloud KMS!