How to set diskSourceImage in google data flow pipeline - google-cloud-platform

I've been trying to use custom-made images to run my Google Dataflow pipeline. Given the information from https://cloud.google.com/compute/docs/reference/latest/images I've tested the following code snippets:
DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
...
options.setDiskSourceImage("ubuntu-1504-vivid-v20150911");
options.setDiskSourceImage("projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");
options.setDiskSourceImage("https://www.googleapis.com/compute/beta/projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");
all of the above tries led to the following error in my pipeline:
(b9c7b66a676906f4): Unable to create VMs. Causes: (b9c7b66a67690aef): Error: Message: Invalid value for field 'resource.disks[0].initializeParams.sourceImage': '[edited]'. Must be the URL to a Compute resource of the correct type HTTP Code: 400

Using a custom disk image with Dataflow is not a viable option. The diskSourceImage flag is deprecated and will be removed in a future SDK release. The reason it is no longer supported is that the Dataflow service relies on versioned resources in the VM image, so Dataflow needs control of the VM image in order to upgrade it as necessary. If users supply their own custom images, we have no way of keeping them in sync with the requirements of the Dataflow service.
If your custom VM image is based on a Dataflow image, you would be able to execute jobs using that custom image until the next release of a Dataflow VM image, but there is no reasonable way to keep your custom images in sync with Dataflow's VM images and so keep this working.
If you would like to customize the VM image please let us know why (e.g. send us an email at dataflow-feedback#google.com) so we can either suggest an alternative solution or else consider supporting your use case in the future.

There's a subtle issue with the URL passed to setDiskSourceImage -- it uses 'beta' instead of the current 'v1' Compute Engine API version. If you try the following, it should work:
options.setDiskSourceImage("https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1504-vivid-v20150911");

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster with one of the outdated image versions (for example, 1.3.94-debian10) that contain the vulnerability in the Apache Log4j 2 utility. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (it is just for a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster, which makes sense, in order to protect the cluster.
I did some research and discovered that I will have to create a custom image with said version and generate the cluster from that. The thing is, I have tried to read the documentation and find some tutorial, but I still can't understand how to start or how to run the file generate_custom_image.py, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you
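For anyone hitting the same wall, here is a rough sketch of how the generate_custom_image.py script from the GoogleCloudDataproc/custom-images GitHub repository is typically invoked from a terminal; all names below (image, bucket, zone, customization script, project) are placeholders and the exact flag set may differ between versions of the script, so check its README:
# Clone the Dataproc custom-images tooling
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images

# A customization script is required even if it does nothing (placeholder)
echo '#!/bin/bash' > no-op.sh

# Build a custom image based on the outdated Dataproc version
python generate_custom_image.py \
  --image-name dataproc-1-3-94-custom \
  --dataproc-version 1.3.94-debian10 \
  --customization-script no-op.sh \
  --zone us-east1-b \
  --gcs-bucket gs://my-staging-bucket

# Then create the cluster from the resulting custom image
gcloud dataproc clusters create dataproc-cluster \
  --region=us-east1 \
  --image=projects/my-project/global/images/dataproc-1-3-94-custom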

Google Cloud Run service deployment, is it the best direction in my situation?

I have some experience with Google Cloud Functions (CF). I tried to deploy a CF recently with a Python app, but it uses an NLP model, so the 8GB memory limit is exceeded when the model is triggered. The function is triggered when a JSON file is uploaded to a bucket.
So, I plan to try Google Cloud Run but I have no experience with it. Also, I am not completely sure if it is the best course of action.
If it is, what is the best way of implementing it, given that the Cloud Run service will be triggered by a file uploaded to a bucket? In CF you can select the triggering event; in Cloud Run I didn't see anything like that. I could use some starting points, as I couldn't find my case in the GCP documentation.
Any help will be appreciated.
You can use at least these two approaches (a rough gcloud sketch of both follows below):
The legacy one: create a GCS notification to Pub/Sub, then create a push subscription and set the Cloud Run URL as the HTTP push destination.
A more recent way is to use Eventarc to invoke a Cloud Run endpoint directly from an event (it roughly creates the same thing, a Pub/Sub topic and a push subscription, but it's all configured for you).
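For example, a minimal sketch of both options; every name below (bucket, topic, service, project, service accounts, region) is a placeholder, and the flags should be double-checked against the current gcloud reference:
# Option 1 (legacy): GCS notification -> Pub/Sub -> push subscription to the Cloud Run URL
gsutil notification create -t my-topic -f json gs://my-bucket
gcloud pubsub subscriptions create my-push-sub \
  --topic=my-topic \
  --push-endpoint=https://my-service-abc123-uc.a.run.app/ \
  --push-auth-service-account=pubsub-invoker@my-project.iam.gserviceaccount.com

# Option 2 (Eventarc): route the "object finalized" Cloud Storage event straight to Cloud Run
gcloud eventarc triggers create gcs-to-run \
  --location=us-central1 \
  --destination-run-service=my-service \
  --destination-run-region=us-central1 \
  --event-filters="type=google.cloud.storage.object.v1.finalized" \
  --event-filters="bucket=my-bucket" \
  --service-account=eventarc-trigger@my-project.iam.gserviceaccount.com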
EDIT 1
When you use push notifications, you will receive a standard Pub/Sub message. The format is described in the documentation, both for the attributes and for the body content; keep in mind that the raw content is base64-encoded and you have to decode it to get the final format.
I personally have a Cloud Run service that logs the contents of any request, so that I can see in the logs all the data I need while developing. When there is a new message format, I point the push subscription at that Cloud Run endpoint and automatically capture the format.
For Eventarc, the format will be added to the UI soon (I have seen that feature in preview, but it's not yet available). The best solution is to log the content so you know what you receive and therefore what to do!
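To illustrate the decoding, a small sketch assuming the push request body has been dumped to a local file named push-body.json (a made-up name; the field names follow the standard Pub/Sub push format):
# The actual GCS notification payload sits in message.data, base64-encoded
jq -r '.message.data' push-body.json | base64 --decode | jq .

# The attributes (eventType, bucketId, objectId, ...) are readable without decoding
jq '.message.attributes' push-body.json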

Want to trigger Dataflow pipeline with Cloud Functions

I have a scenario where we want to trigger a Dataflow pipeline via a Cloud Function, and in the Dataflow pipeline we have to transform some data and insert it into BigQuery.
I have created our custom Dataflow pipeline that transforms the data and inserts it into BigQuery (following the standard way of installing Apache Beam and using the deployment command from Cloud Shell). The pipeline ran successfully, and the logs show up in Monitoring along with the DAG.
Now what I want to do is trigger the pipeline with a Cloud Function, and for that I found that I can:
(i) create a custom flex template of the pipeline
(ii) stage it in a Google Cloud Storage bucket
(iii) call it with the REST API from the Cloud Function
Is the approach outlined in these steps the recommended way of doing it, or should I try another one? I can't find any other option apart from classic templates.
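For reference, a rough sketch of the flex-template flow in steps (i)-(iii); project, bucket, image, region, and parameter names are placeholders, the SDK language is assumed to be Python, and the launch body is simplified (see the flexTemplates.launch REST reference for the full schema):
# (i) + (ii): build the flex template spec and stage it in a GCS bucket
gcloud dataflow flex-template build gs://my-bucket/templates/my-pipeline.json \
  --image-gcr-path=gcr.io/my-project/my-pipeline:latest \
  --sdk-language=PYTHON \
  --metadata-file=metadata.json

# (iii): launch it via the REST API
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://dataflow.googleapis.com/v1b3/projects/my-project/locations/us-central1/flexTemplates:launch" \
  -d '{
        "launchParameter": {
          "jobName": "my-pipeline-run",
          "containerSpecGcsPath": "gs://my-bucket/templates/my-pipeline.json",
          "parameters": {"output_table": "my_dataset.my_table"}
        }
      }'
From a Cloud Function the same launch request can be made with any HTTP client, using the function's service account token instead of gcloud.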

Google Built CentOS Image - Anyone have a download for this?

I've looked for this across the web a few times, and I feel like this hasn't been asked exactly, or I may just be getting bogged down with the wrong syntax. Hoping to get an easy answer here (yes, "you can't get this" is an acceptable answer).
The variations from the base CentOS image are listed here: Link to GCP
However, they don't actually provide a download for this image. I'm trying to get a local VM running in VMWare with this image.
I feel as though they'd provide this to their clients to make it easier to prepare for use of their product, but I'm not finding it anywhere.
If anyone could toss me a link to a pre-configured CentOS ISO with the minor changes, I'd definitely take that as an alternative. I'm just not confident in my skills with Linux enough to configure the firewall properly :)
GCP doesn't support exporting Google-provided images. However, it does support exporting custom images.
I don't have much hands-on experience with image exporting, but I think this works.
Create custom images
You can create custom images based on your GCE VM instance.
Go to Navigation menu -> Compute Engine -> Images page.
You can create a custom image from a disk or a snapshot on this page.
Select one and create a custom image.
Export your image
After creating the custom image successfully, go to the custom image's page and click "Export" at the top.
Select the export format and the GCS destination, then click Export.
Now you have an image file in Google Cloud Storage.
Download the image file and import it into your local VM software.
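The same flow is also available from the gcloud CLI; a sketch with placeholder names (the export step uses a Cloud Build job behind the scenes, so that API needs to be enabled):
# Create a custom image from an existing VM's boot disk
gcloud compute images create my-custom-image \
  --source-disk=my-instance \
  --source-disk-zone=us-central1-a

# Export the custom image to Cloud Storage in a format VMware can import
gcloud compute images export \
  --image=my-custom-image \
  --destination-uri=gs://my-bucket/my-custom-image.vmdk \
  --export-format=vmdk

# Download it locally
gsutil cp gs://my-bucket/my-custom-image.vmdk .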

Copy a GCR image from one project to another

I aim to copy a GCR image from one project to another as soon as the image lands in the Container Registry of the first project. I am aware of the gcloud container images add-tag command, but I am looking for a more automated option. Also, the second project, where the image has to be copied to, is protected by VPC-SC. Any leads will be appreciated...
I understand that you are looking for the best way to mirror GCR images between two projects. Currently, you can follow the workaround in this document to copy the container images for your use case. At the moment, the only way to move images between two registries is by pulling from one and pushing to the other, if you have the right permissions. There is a tool on GitHub, gcrane, that can automate this for you. However, for mirroring container images between two projects automatically, a feature request has already been submitted but there is no ETA.
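As a concrete example (project and image names are made up):
# Manual copy: pull from the source registry, retag, push to the target
docker pull gcr.io/source-project/my-image:latest
docker tag gcr.io/source-project/my-image:latest gcr.io/target-project/my-image:latest
docker push gcr.io/target-project/my-image:latest

# Or let gcrane copy it directly, without a local Docker daemon
gcrane cp gcr.io/source-project/my-image:latest gcr.io/target-project/my-image:latest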
According to the GCP documentation, if the project is protected by VPC-SC, Container Registry does not use the googleapis.com domain. To achieve this, Container Registry needs to be configured via private DNS or BIND to map to the restricted VIP separately from the other APIs.
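For illustration, a hedged sketch of that DNS setup using Cloud DNS; zone, network, and project names are placeholders, and 199.36.153.4/30 is the restricted.googleapis.com VIP range documented for VPC Service Controls (verify it against the current docs before using it):
# Private zone for gcr.io on the VPC-SC protected network
gcloud dns managed-zones create gcr-io \
  --dns-name="gcr.io." \
  --visibility=private \
  --networks=my-vpc \
  --description="Map gcr.io to the restricted VIP"

# Point gcr.io (and *.gcr.io) at the restricted VIP addresses
gcloud dns record-sets transaction start --zone=gcr-io
gcloud dns record-sets transaction add --zone=gcr-io \
  --name="gcr.io." --type=A --ttl=300 \
  199.36.153.4 199.36.153.5 199.36.153.6 199.36.153.7
gcloud dns record-sets transaction add --zone=gcr-io \
  --name="*.gcr.io." --type=CNAME --ttl=300 "gcr.io."
gcloud dns record-sets transaction execute --zone=gcr-io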
When a change is made to a container registry that you own, a Pub/Sub message can be published. You can use this Pub/Sub message as a trigger to perform work. My immediate thought would be to create a Cloud Function that is triggered by the arrival of such a message and then fires off a Cloud Build recipe. The Cloud Build job would perform a docker pull of your original image, then a tag rename, and then a docker push. It feels like this would be 100% automated and use components that are designed for CI/CD pipelines.
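As a rough sketch of those pieces (all project and image names are placeholders; the "gcr" topic name is the one the Pub/Sub notifications documentation below uses, and permissions between the two projects still have to be sorted out):
# Container Registry starts publishing change notifications once a
# Pub/Sub topic named "gcr" exists in the registry's project
gcloud pubsub topics create gcr --project=source-project

# Hypothetical Cloud Build config the Cloud Function could submit
cat > cloudbuild.yaml <<'EOF'
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['pull', 'gcr.io/source-project/my-image:latest']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['tag', 'gcr.io/source-project/my-image:latest', 'gcr.io/target-project/my-image:latest']
images:
  - 'gcr.io/target-project/my-image:latest'
EOF

gcloud builds submit --no-source --config=cloudbuild.yaml --project=target-project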
References:
Configuring Pub/Sub notifications
Cloud Build documentation