Network default is not accessible to Dataflow Service account - google-cloud-platform

Having issues starting a Dataflow job (2018-07-16_04_25_02-6605099454046602382) in a project without a local VPC network; I get this error:
Workflow failed. Causes: Network default is not accessible to Dataflow
Service account
There is a Shared VPC connected to the project with a network called default and a subnet default in us-central1; however, the service account used to run the Dataflow job doesn't seem to have access to it. I have given the dataflow-service-producer service account the Compute Network User role, without any noticeable effect. Any ideas on how I can proceed?

Using subnetworks in Cloud Dataflow requires specifying the subnetwork parameter when running the pipeline. However, for subnetworks located in a Shared VPC network, you must use the complete URL in the following format, as you mentioned:
https://www.googleapis.com/compute/v1/projects/<HOST_PROJECT>/regions/<REGION>/subnetworks/<SUBNETWORK>
Additionally, in these cases it is recommended to verify that you add the project's Dataflow service account to the Shared VPC host project's IAM table and grant it the "Compute Network User" role, in order to ensure that the service has the required access.
Finally, it seems that the official Google documentation for the subnetwork parameter is already available, with detailed information about this matter.

Using the --subnetwork option with the following (undocumented) fully qualified subnetwork format made the Dataflow job run, where {PROJECT} is the name of the project hosting the Shared VPC and {REGION} matches the region you run your Dataflow job in:
--subnetwork=https://www.googleapis.com/compute/alpha/projects/{PROJECT}/regions/{REGION}/subnetworks/{SUBNETWORK}
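For pipelines built with the Python SDK, here is a minimal sketch of passing that option programmatically; the project, region, and bucket names are placeholders, not values from the question:

# Minimal sketch: launching a Dataflow job on a Shared VPC subnetwork
# with the Apache Beam Python SDK. All identifiers below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-service-project",          # service project running the job
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    # Fully qualified subnetwork URL pointing at the Shared VPC host project.
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/"
        "my-host-project/regions/us-central1/subnetworks/default"
    ),
)

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello"]) | beam.Map(print)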

Related

Dataflow job fails without proper error when implemented in VPC

I am trying to run a Dataflow job, and to do this I am using the Dataflow template "Cloud Spanner to text file on Cloud Storage". My Dataflow job is on a Shared VPC, but Spanner is not a resource that lives on the VPC. The job fails, but there is no proper error message when it fails. I tried to clone the same job and run it on the default VPC, and then things seemed to work and the job was successful. Can someone help me understand what is going on and where I should look? Is there an issue with Dataflow communicating with Spanner? If so, is there a resource which could help fix this issue?
Please ensure the following are met:
The Shared VPC network that you select is an auto mode network.
You are a Service Project Admin with project-level permissions to the whole Shared VPC host project. This means that a Shared VPC Admin has granted you the Compute Network User role for the whole host project, so you are able to use all of its networks and subnetworks.
Ref - https://cloud.google.com/dataflow/docs/guides/specifying-networks#network_parameter
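As a hedged sketch of the role grant in the second point, using the Cloud Resource Manager API from Python; the host project ID and the Dataflow service-account address below are placeholders, not values from the question:

# Sketch only: grant the Dataflow service account the Compute Network User
# role on the Shared VPC host project. Both identifiers are placeholders.
from googleapiclient import discovery

HOST_PROJECT = "my-host-project"
DATAFLOW_SA = "service-123456789@dataflow-service-producer-prod.iam.gserviceaccount.com"

crm = discovery.build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource=HOST_PROJECT, body={}).execute()
policy.setdefault("bindings", []).append({
    "role": "roles/compute.networkUser",
    "members": [f"serviceAccount:{DATAFLOW_SA}"],
})
crm.projects().setIamPolicy(
    resource=HOST_PROJECT, body={"policy": policy}
).execute()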

Retrieve existing resource data using AWS Cloudformation

I need to retrieve existing data/properties of a given resource using an AWS CloudFormation template. Is this possible? If so, how can I do it?
Example 1:
Output: Security Group ID which allows traffic on port 22
Example 2:
Output: Instance ID which uses the default VPC
AWS CloudFormation is used to deploy infrastructure from a template in a repeatable manner. It cannot provide information on any resources created by any methods outside of CloudFormation.
Your requirements seem more relevant to AWS Config:
AWS Config provides a detailed view of the configuration of AWS resources in your AWS account. This includes how the resources are related to one another and how they were configured in the past so that you can see how the configurations and relationships change over time.
An AWS resource is an entity you can work with in AWS, such as an Amazon Elastic Compute Cloud (EC2) instance, an Amazon Elastic Block Store (EBS) volume, a security group, or an Amazon Virtual Private Cloud (VPC).
Using your examples, AWS Config can list EC2 instances and any resources that are connected to them, such as security groups and VPCs. You can easily click through these relationships and view the configurations. It is also possible to view how these configurations have changed over time, such as:
When an EC2 instance changed state (e.g. stopped, running)
When rules changed on security groups
Alternatively, you can simply make API calls to AWS services to obtain the current configuration of resources, such as calling DescribeInstances to obtain a list of Amazon EC2 instances and their configurations.
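If the goal is just the two example outputs, here is a hedged sketch using direct API calls via boto3; the region and the exact port check are assumptions for illustration:

# Sketch: answer the two example queries with EC2 API calls instead of
# CloudFormation. Region is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Example 1: security group IDs with an inbound rule on port 22.
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for perm in sg.get("IpPermissions", []):
        if perm.get("FromPort") == 22 and perm.get("ToPort") == 22:
            print("SSH-open security group:", sg["GroupId"])

# Example 2: instance IDs running in the default VPC.
default_vpcs = ec2.describe_vpcs(
    Filters=[{"Name": "isDefault", "Values": ["true"]}]
)["Vpcs"]
if default_vpcs:
    reservations = ec2.describe_instances(
        Filters=[{"Name": "vpc-id", "Values": [default_vpcs[0]["VpcId"]]}]
    )["Reservations"]
    for res in reservations:
        for inst in res["Instances"]:
            print("Instance in default VPC:", inst["InstanceId"])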

How to get AWS ECS Region

I have my microservice running in AWS ECS, and I want to tell which region this service is running in. Do they have a metadata service for me to get my microservice's region?
There are two ways to do this. The first is to use the metadata file. This feature is disabled by default, so you'll need to turn it on.
Run cat $ECS_CONTAINER_METADATA_FILE on Linux after enabling it to see the metadata; the environment variable stores the file location.
The second is to use the HTTP metadata endpoint. There are two potential endpoints here (version 2 and 3) depending on how the instance is launched, so check the docs.
In either case the region is not a specific property of the metadata, but it can be inferred from the ARN.
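A hedged sketch of both approaches in Python; the TaskARN field name and the /task path are assumptions based on the metadata documentation, and may differ between metadata versions:

# Sketch: infer the region from ECS metadata by parsing the task ARN.
import json
import os
import urllib.request

def region_from_arn(arn: str) -> str:
    # ARNs look like arn:aws:ecs:<region>:<account>:task/..., so the region
    # is the fourth colon-separated field.
    return arn.split(":")[3]

# Option 1: the container metadata file (requires ECS_ENABLE_CONTAINER_METADATA=true).
meta_file = os.environ.get("ECS_CONTAINER_METADATA_FILE")
if meta_file:
    with open(meta_file) as f:
        print(region_from_arn(json.load(f)["TaskARN"]))

# Option 2: the task metadata endpoint (v3), exposed via an environment variable.
meta_uri = os.environ.get("ECS_CONTAINER_METADATA_URI")
if meta_uri:
    with urllib.request.urlopen(meta_uri + "/task") as resp:
        print(region_from_arn(json.load(resp)["TaskARN"]))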

How does "type: LoadBalancer" creates external Load Balancer in Kubernetes?

How does Kubernetes know what external cloud provider it is running on?
Is there any specific service running in the master which finds out whether the Kubernetes cluster is running in AWS or Google Cloud?
Even if it is able to find out whether it is AWS or Google, where does it get the credentials to create the external AWS/Google load balancers? Do we have to configure the credentials somewhere so that it picks them up from there and creates the external load balancer?
When installing Kubernetes with a cloud provider, you must specify the --cloud-provider=aws flag on a variety of components:
kube-controller-manager - this is the component which interacts with the cloud API when cloud-specific requests are made. It runs "loops" which ensure that any cloud provider request is completed. So when you request a Service of type: LoadBalancer (as sketched below), the controller-manager is the thing that checks and ensures this was provisioned.
kube-apiserver - this simply ensures the cloud APIs are exposed, for example for persistent volumes.
kubelet - ensures that workloads are provisioned on nodes. This is especially the case for things like persistent storage (EBS volumes).
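For context, here is a minimal sketch of such a request using the official Kubernetes Python client; the service and app names are placeholders:

# Sketch: create a Service of type LoadBalancer, which the
# kube-controller-manager then reconciles into a cloud load balancer.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "my-app"},
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)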
Do we have to configure the credentials somewhere so that it picks it from there and creates the external load balancer?
All the above components should be able to query the required cloud provider APIs. Generally this is done using IAM roles which ensure the actual node itself has the permissions. If you take a look at the kops documentation, you'll see examples of the IAM roles assigned to masters and workers to give those nodes permissions to query and make API calls.
It should be noted that this model is changing shortly, to move all cloud provider logic into a dedicated cloud-controller-manager which will have to be pre-configured when installing the cluster.

Custom DNS resolver for Google Cloud Dataflow pipeline

I am trying to access Kafka and 3rd-party services (e.g., InfluxDB) running in GKE, from a Dataflow pipeline.
I have a DNS server for service discovery, also running in GKE. I also have a route in my network to access the GKE IP range from Dataflow instances, and this is working fine. I can manually nslookup from the Dataflow instances using my custom server without issues.
However, I cannot find a proper way to set up an additional DNS server when running my Dataflow pipeline. How could I achieve that, so that KafkaIO and similar sources/writers can resolve hostnames against my custom DNS?
sun.net.spi.nameservice.nameservers is tricky to use, because it must be set very early on, before the name service is statically instantiated. I would pass it with java -D, but Dataflow is going to run the code itself directly.
In addition, I would not want to just replace the system's resolvers but merely append a new one to the GCP project-specific resolvers that the instance comes pre-configured with.
Finally, I have not found any way to use a startup script like for a regular GCE instance with the Dataflow instances.
I can't think of a way today of specifying a custom DNS in a VM other than editing the /etc/resolv.conf file [1] in the box. I don't know if it is possible to share the default network. If it is, machines are available at hostName.c.[PROJECT_ID].internal, which may serve your purpose if hostName is stable [2].
[1] https://cloud.google.com/compute/docs/networking#internal_dns_and_resolvconf [2] https://cloud.google.com/compute/docs/networking
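As a hedged illustration of the /etc/resolv.conf idea above (not something guaranteed to work on every Dataflow worker image, and the nameserver address is a placeholder), one could try prepending the custom resolver from a DoFn's setup hook:

# Sketch only: prepend a custom nameserver on each worker before processing.
# Assumes the worker process may modify /etc/resolv.conf, which may not hold.
import apache_beam as beam

class AddCustomResolver(beam.DoFn):
    CUSTOM_NAMESERVER = "10.0.0.53"  # placeholder for the in-GKE DNS server IP

    def setup(self):
        entry = f"nameserver {self.CUSTOM_NAMESERVER}\n"
        try:
            with open("/etc/resolv.conf", "r+") as f:
                existing = f.read()
                if entry not in existing:
                    f.seek(0)
                    f.write(entry + existing)
        except OSError:
            # Fall back silently if the file is not writable on this image.
            pass

    def process(self, element):
        yield element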