I am creating a connection with a Google Service Account in my Google Cloud Composer that privilegies a DAG for a specific use case with deals with sensitive data, the point is that I want that connection to be exclusive for a certain DAG and no other could see or use it.
Is there a way of doing it?
Currently this is not possible in airflow, and even you cannot implement that using a custom backend secret or another solution, where the connection is not a context variable, and it's accessible from anywhere in airflow not only from a run context.
Infortunately the service account given to Cloud Composer in the creation of cluster, is for all DAGs of this cluster.
It can be too much, but maybe you can create another Cloud Composer cluster 2 (GKE autopilot), with the minimum sizing for machines, containing this DAG that treats sensitive data.
Then you can give a SA with the needed privileges to this cluster.
The disadvantage of this solution is you will have a higher cost, because you have a second cluster. It will increases the cost even if the machine sizes are low.
It is worth noting that Composer 2 with GKE autopilot is cheaper that classical GKE cluster.
Maybe another solution, if the rework is not too important, you can rewrite only your DAG treating sensitive data to Cloud Workflow.
Cloud Workflow is serverless and you can give it a dedicated service account.
Related
Now I am working with our company product developed with spring boot , angular and PostgreSQL technologies where front end angular is communicating with 138 back end ReST API end points. And these 138 end points are from 35 different spring boot project. And all these end points need to separately deploy for 5 different tenant. Actually end point working is same.But databases are different for different tenant. And we decided to go with AWS cloud. And we are looking for cost effective deployment method from AWS.
Our Current Development/Test strategy - Current we are developing application(final stage of development) and testing our application using our On-premise server. Here we are using 5 ubuntu machines. And we created kubernetes cluster with 2 master nodes and 3 worker nodes.And from our SVN repository and Jenkins server we implemented CI/CD pipeline deployment to this 5 machines.
Proposed Cloud Solution - Now we are thinking with to use either EKS deployment method or any of CodeDeploy/CodePipeline method to implement this big project.
So by considering cost and control over infrastructure management which solution is better for my product? Now I am not that much experienced as solution architect and still in cloud learning curve. So can any one suggest/guide me to think properly to achieve my goal please?
Company consideration
Control over infrastructure
Cost effective
Easy management of aws services for multi-tenant deployment
Data security ( Installing database on ec2/ RDS)
Management of load balances
Control over infrastructure
it would be better to manage it on Github, Gitlab, and or AWS code build, or cloud build.
indeed AWS code build, and repo is great tools but again consider the limitation of extra users it allows only 5 users if your team is very big you might have to pay to compare to managing projects at the Github & GitLab level.
Cost effective
EKS would be a good option compared to ECS or others as it has limitations of we can not run the Daemon set or Privilege PODs.
If you are looking for running everything On POD and auto-scalable with little less flexibility and don't want to manage much ECS also a good idea, but again you have to derive the capacity and compare both pricing ECS vs EKS.
Note : EKS will also charge the per hour charges $0.10 for each cluster + worker nodes. it's not just worker nodes like in on-prem we run.
Data security ( Installing database on ec2/ RDS)
RDS would be better as it's managed service compare to managing the EC2 and database performance and encryption etc.
it would be better to use RDS and EKS so the K8s service can connect to RDS easily on a private network.
RDS would be a cost-effective option considering the management of DB over EC2.
Management of load balances
NLB or ALB will take care of that you can use any of them as per the requirement with EKS.
Cloud front could be also a great option with cloud storage to serve static assets, which will reduce calls, improve performance and be cost-effective also.
The cloud workflow doesn't come with a scheduling feature. Apart from that, what are all the differences between these two services in terms of features? In which use case should we prefer the workflow over composer or vice versa?
There are some key differences to consider when choosing between the two solutions :
A Composer instance needs to be in a running state to trigger DAGs and you'll also need to size your Cloud Composer instance based on your usage, You do not need to do this in Cloud Workflows as it is a Serverless service and you pay for anytime a workflow is triggered
Another key difference is that Cloud Composer is really convenient for writing and orchestrating data pipelines because of it's internal scheduler and also because of the provided Operators, You can interact with any Data services inside of GCP.
However, Cloud Workflows interacts with Cloud Functions, wich is a task that Composer cannot do really well.
Both Composer and Workflows support orchestrating multiple services and can handle long running workflows. Despite there being some overlap in the capabilities of these products, each has differentiators that make them well suited to particular use cases.
Composer is most commonly used for orchestrating the transformation of data as part of ELT or data engineering. Workflows, in contrast, is focused on the orchestration of HTTP-based services built with Cloud Functions, Cloud Run, or external APIs.
Composer is designed for orchestrating batch workloads that can handle a delay of a few seconds between task executions. It wouldn’t be suitable if low latency was required in between tasks, whereas Workflows is designed for latency sensitive use cases.
While you don’t have to worry about maintaining Airflow deployments in Composer, you do need to specify how many workers you need for a given Composer environment. Workflows is completely serverless; there is no infrastructure to manage or scale.
For further information refer to this google blog article and this one.
A BigTable table can be backed up through GCP for up to 30 days.
(https://cloud.google.com/bigtable/docs/backups)
Is it possible to have a custom automatic backup policy?
i.e. trigger automatic backups every X days & keep up to 3 copies at a time.
As mentioned in the comment, the link provides a solution which involves the use of the following GCP Products:
Cloud Scheduler: trigger tasks with a cron-based schedule
Cloud Pub/Sub: pass the message request from Cloud Scheduler to Cloud Functions
Cloud Functions: initiate an operation for creating a Cloud Bigtable backup
Cloud Logging and Monitoring (optional).
Full guide can also be seen on GitHub.
This is a good solution since you have a certain requirement that should be done with client libraries, because Big Table doesn't have an API that sets 3 copies at a time.
For normal use cases however, such as triggering automatic backups every X days, there's another solution such as calling the backups.create directly by creating a Cloud Scheduler with HTTP similar to what's done in this answer.
Here is another thought on a solution:
Instead of using three GCP Products, if you are already using k8s or GKE you can replace all this functionality with a k8s CronJob. Put the BigTable API calls in a container and deploy it on a schedule using the CronJob.
In my opinion, it is a simpler solution if you are already using kubernetes.
We developed an application based on Google Cloud Platform, that uses Cloud Dataflow to write data to BigQuery.
I am now trying to setup this application on a new GCP project on another organization.
The problem
I am experiencing this issue:
Workflow failed. Causes: Unable to bring up enough workers: minimum 1, actual 0. Please check your quota and retry later, or please try in a different zone/region.
It happens on two dataflow templates:
1. One takes data from a Pub/Sub topic and writes to a Pub/Sub topic,
2. The other takes data from a Pub/Sub topic and writes to BigQuery.
Jobs are created from the Cloud Dataflow API. The templates are pretty standard, with 3 maximum workers and the THROUGHPUT_BASED autoscaling mode.
As suggested on similar questions, I checked the Compute engine quota, that are far from exceeded. I also changed the region, and the machine type; the problem still happens. Compute Engine and Dataflow APIs are enabled.
The question
As it works on projects on another organization, I believe that it comes from the GCP organization that have specific restrictions. Is it possible?
What other points should I check to make it work?
After multiple tests, we managed to make it work properly.
It was indeed not a problem with regions and machine types, though most of the related Stackoverflow threads suggest that you should start with that.
It was in fact because of a restriction on external IP addresses through a GCP Organization policy. As pointed in this question, standard configuration of Dataflow requires an external IP address.
We are moving from On-Premises environment to google cloud dataproc for spark jobs. I am able to build the cluster though and ssh to master node for job execution. I am not clear how to build the edge node where we can allow users to login and submit job. Is it going to be another gce vm? Any thoughts or best practices?
A new VM instance is a good option to map the EdgeNode role from other architectures:
You can execute your job from the Master node which you can make accessible through SSH.
You will need to find a balance between simplicity (SHH) or security (EdgeNode).
Please note that IAM can help to allow individual users to submit jobs by assigning Dataproc Editor role.
Don't forget the ability that Dataproc offers of creating ephemeral nodes. This means that you create a cluster, execute your job and delete your cluster.
Using ephemeral clusters will avoid unnecessary costs. Even, the script you create for that it can be executed from any machine that has the Google Cloud SDK installed, e.g. OnPrem servers or your PC.