A BigTable table can be backed up through GCP for up to 30 days.
(https://cloud.google.com/bigtable/docs/backups)
Is it possible to have a custom automatic backup policy?
i.e. trigger automatic backups every X days & keep up to 3 copies at a time.
As mentioned in the comment, the link provides a solution which involves the use of the following GCP Products:
Cloud Scheduler: trigger tasks with a cron-based schedule
Cloud Pub/Sub: pass the message request from Cloud Scheduler to Cloud Functions
Cloud Functions: initiate an operation for creating a Cloud Bigtable backup
Cloud Logging and Monitoring (optional).
Full guide can also be seen on GitHub.
This is a good solution for your case, because the requirement to keep only 3 copies at a time has to be implemented with the client libraries: Bigtable doesn't have an API setting that caps the number of backups.
For simpler use cases, such as only triggering automatic backups every X days, another option is to call backups.create directly from a Cloud Scheduler job with an HTTP target, similar to what's done in this answer.
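The "keep up to 3 copies" part of the question comes down to a pruning rule that the scheduled function would apply after each new backup. Here is a minimal sketch in Python, assuming the function has already listed the table's backups. The Backup record and its field names are hypothetical stand-ins for whatever the Bigtable client library returns, and no Bigtable API is called here:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    """Hypothetical stand-in for a backup entry returned by the client library."""
    backup_id: str
    start_time: datetime


def backups_to_delete(backups, keep=3):
    """Return the backups that fall outside the `keep` most recent ones."""
    newest_first = sorted(backups, key=lambda b: b.start_time, reverse=True)
    return newest_first[keep:]


existing = [
    Backup("backup-0101", datetime(2023, 1, 1)),
    Backup("backup-0108", datetime(2023, 1, 8)),
    Backup("backup-0115", datetime(2023, 1, 15)),
    Backup("backup-0122", datetime(2023, 1, 22)),
]
stale = backups_to_delete(existing, keep=3)
print([b.backup_id for b in stale])  # ['backup-0101']
```

The function would then delete each entry in the returned list through the client library, so that at most 3 backups remain after every run.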
Here is another thought on a solution:
Instead of using three GCP products, if you are already using k8s or GKE you can replace all of this functionality with a k8s CronJob. Put the Bigtable API calls in a container and run it on a schedule with the CronJob.
In my opinion, it is a simpler solution if you are already using Kubernetes.
I am creating a connection with a Google service account in my Cloud Composer environment that grants privileges to a DAG for a specific use case that deals with sensitive data. I want that connection to be exclusive to that DAG, so that no other DAG can see or use it.
Is there a way of doing it?
Currently this is not possible in Airflow. You cannot even implement it with a custom secrets backend or another workaround, because a connection is not a context variable: it is accessible from anywhere in Airflow, not only from a run context.
Unfortunately, the service account given to Cloud Composer when the cluster is created applies to all DAGs of that cluster.
It may be overkill, but you could create a second Cloud Composer 2 cluster (GKE Autopilot), with the minimum machine sizing, containing only the DAG that handles sensitive data.
Then you can give a service account with the needed privileges to this cluster.
The disadvantage of this solution is a higher cost, because you have a second cluster. It will increase the cost even if the machine sizes are small.
It is worth noting that Composer 2 with GKE Autopilot is cheaper than a classic GKE cluster.
Another option, if the rework is not too large, is to rewrite only the DAG that handles sensitive data as a Cloud Workflows workflow.
Cloud Workflows is serverless, and you can give it a dedicated service account.
I've created a job in Google Cloud Scheduler to download data from my demo app using HTTP GET, and it seemed to run successfully. My question is: where did it store that data? And how can I store it in Google Cloud Storage? Below is a screenshot of my job:
I'm new to Google Cloud and working with a free trial account. Please advise.
Cloud Scheduler does not process data. The data returned by your example request is discarded.
Write a Cloud Function scheduled by Cloud Scheduler. There are other services such as Firebase and Cloud Run that work well also.
You are trying to create a scheduler job that sends a GET request, and Cloud Scheduler will do exactly that, except for the part where data is stored. Since Cloud Scheduler is a managed cron service, it doesn't matter whether the URL returns data. Cloud Scheduler will call the endpoint on schedule and that's it. Data returned by the request will be discarded, as mentioned by John Hanley.
What you can do is to integrate scheduling with Cloud Functions, but to be clear, your app needs to do the following first:
Download from external link and save the object to /tmp within the function.
The rest of the file system is read-only, and /tmp is the only writable part. Any files saved in /tmp are held in the function's memory, so you should clear /tmp after every successful upload to GCS.
Upload the file to Cloud Storage (using Cloud Storage client library).
Once your app is capable of storing and uploading data, you can decide where to deploy it.
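The download-then-upload steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the google-cloud-storage library is installed in the function's environment, and the bucket and object names you pass in are your own. The upload step is injected as a callable so the download and cleanup logic can be exercised without any GCP access:

```python
import os
import tempfile
import urllib.request


def fetch_and_store(url, upload, tmp_dir="/tmp"):
    """Download `url` into tmp_dir, pass the file path to `upload`, then clean up."""
    fd, path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            out.write(resp.read())
        upload(path)
    finally:
        # /tmp is backed by the function's memory, so always free it.
        os.remove(path)


def gcs_uploader(bucket_name, blob_name):
    """Build an `upload` callable that writes the file to Cloud Storage."""
    def upload(path):
        # Imported lazily so the download logic can be tested without the library.
        from google.cloud import storage
        storage.Client().bucket(bucket_name).blob(blob_name).upload_from_filename(path)
    return upload
```

Inside the function's entry point you would call something like fetch_and_store("https://example.com/data.json", gcs_uploader("my-bucket", "data.json")), where the URL, bucket and object name are placeholders.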
First, you can deploy your code using the Firebase CLI and use scheduled functions. The advantage is that Pub/Sub and Cloud Scheduler are taken care of automatically and configuration is done in your code. The downside is that you are limited to the Node.js runtime, whereas GCP Cloud Functions supports many different programming languages. Learn more about scheduled functions.
Second, you can deploy through the gcloud CLI, but for this you need to set up Pub/Sub notifications and Cloud Scheduler yourself. You can read more about this by navigating to this link.
Cloud Workflows doesn't come with a scheduling feature. Apart from that, what are the differences between Cloud Workflows and Cloud Composer in terms of features? In which use cases should we prefer Workflows over Composer, or vice versa?
There are some key differences to consider when choosing between the two solutions:
A Composer instance needs to be in a running state to trigger DAGs, and you also need to size it based on your usage. You do not need to do this in Cloud Workflows: it is a serverless service, and you pay only when a workflow is triggered.
Another key difference is that Cloud Composer is really convenient for writing and orchestrating data pipelines, because of its internal scheduler and because of the provided operators, which let you interact with any data service inside GCP.
However, Cloud Workflows interacts well with Cloud Functions, which is something Composer does not do very well.
Both Composer and Workflows support orchestrating multiple services and can handle long running workflows. Despite there being some overlap in the capabilities of these products, each has differentiators that make them well suited to particular use cases.
Composer is most commonly used for orchestrating the transformation of data as part of ELT or data engineering. Workflows, in contrast, is focused on the orchestration of HTTP-based services built with Cloud Functions, Cloud Run, or external APIs.
Composer is designed for orchestrating batch workloads that can handle a delay of a few seconds between task executions. It wouldn’t be suitable if low latency was required in between tasks, whereas Workflows is designed for latency sensitive use cases.
While you don’t have to worry about maintaining Airflow deployments in Composer, you do need to specify how many workers you need for a given Composer environment. Workflows is completely serverless; there is no infrastructure to manage or scale.
For further information, refer to this Google blog article and this one.
We have a Python data pipeline that runs on our server. It grabs data from various sources, aggregates it and writes it to SQLite databases. The daily runtime is just 1 hour and network usage is maybe 100 MB at most. What are our options for migrating this to Google Cloud? We would like more reliable scheduling, a cloud database, better analytics options for the data (powerful dashboards and visualization) and easy development. Should we go serverless or use a server? Is the pricing free for such low usage?
For a lift-and-shift option, you can run your Python workload on Google Compute Engine, which is a virtual machine. To make the best use of Google Cloud, I suggest you:
Spin up a Compute Engine VM
Run your Python workload
Save your data to BigQuery
Shut down your VM
Schedule it using Cloud Scheduler
Here is a tutorial from Google on how to do it:
https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule
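For the "save your data to BigQuery" step, one possible shape is to read the rows your pipeline already writes to SQLite and stream them into a table. The sketch below is an assumption-laden illustration rather than the recommended way: it assumes the google-cloud-bigquery library is installed, that the destination table already exists with a matching schema, and that streaming inserts (which are billed) are acceptable for your volume. The table ID and query you pass in are your own:

```python
import sqlite3


def read_rows(db_path, query):
    """Run `query` against a SQLite file and return the rows as plain dicts."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        return [dict(row) for row in conn.execute(query)]
    finally:
        conn.close()


def stream_to_bigquery(rows, table_id):
    """Stream JSON-compatible rows into an existing BigQuery table."""
    # Imported lazily so read_rows can be used without the library installed.
    from google.cloud import bigquery
    errors = bigquery.Client().insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected some rows: {errors}")
```

A daily run could then end with stream_to_bigquery(read_rows("pipeline.db", "SELECT * FROM daily"), "my-project.my_dataset.daily"), where the file, query and table ID are placeholders.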
GCP on a shoestring budget:
Google gives you $300 to spend during the first 12 months, and some services give you free usage every month: https://cloud.google.com/free/docs/gcp-free-tier
For example:
BigQuery's free tier includes 1 TB of querying and 10 GB of storage per month.
Here's an excellent video on making the most of the GCP free tiers: https://www.youtube.com/watch?v=N2OG1w6bGFo&t=818s
Approach to migration:
When moving to cloud you typically choose from one of the following approaches:
1) Rehost (lift-and-shift) no modification to code or architecture
2) Replatform - with minor modifications to code
3) Refactor - with modifications to code and architecture
Obviously you'll get the most cloud benefits (e.g. performance and cost efficiency) with option 3, but it will take longer, whereas option 1 is quicker but brings the fewest benefits.
You can use Cloud Composer for scheduling, which is effectively a managed Apache Airflow service. It will allow you to manage batch, streaming and scheduled tasks.
Visualisation can be done through Google Data Studio, which can use BigQuery as a data source. Data Studio is free, but querying BigQuery is chargeable.
BigQuery for data-analytics.
For the database, you can migrate to managed Cloud SQL, which supports PostgreSQL and MySQL.
Serverless services are typically the most cost effective if you can accommodate them, which obviously falls under option 3) refactor.
There are several requirements to check before migrating, such as: are all your data sources reachable from a cloud platform?
As for storage and analytics, BigQuery is an amazing product and works very well with denormalized data. Can your data be denormalized? Does your job require transactional capabilities?
Does your data need to be queried from a website? BigQuery is powerful for analytics, but queries take about 1 s to warm up, which is not acceptable on a website. That is unlike Cloud SQL (MySQL or PostgreSQL), whose response time is in milliseconds but whose capacity is limited to a few TB (and getting good response times with TBs of data in Cloud SQL is a challenge!).
If it's only for dashboarding, you can use Data Studio. It's free, and you can cache your BigQuery data with BI Engine for more responsive dashboards.
If all of these requirements work for you, and if your data sources are publicly accessible on the internet (I mean no VPN is required to access them), I can propose a fully serverless solution. It is an unconventional use of a Google Cloud service, and it works well!
I wrote an article on a similar use case that you can take inspiration from. Cloud Build allows you to run CI pipelines, and you can use a custom builder: a container that you build yourself and that you can run on Cloud Build.
So, the steps are:
Package your current workflow in a container compatible with Cloud Build, and write your Cloud Build job (don't forget to set the right timeout value)
Create a Cloud Function, or a Cloud Run service if you prefer containers, that runs the Cloud Build job, optionally with some substitution variables to customize your run.
Set up a Cloud Scheduler job to trigger your Cloud Run service or Cloud Function every day
Apart from the BigQuery cost, this pattern costs nothing! You get 120 free minutes per day (per billing account) with Cloud Build, Cloud Scheduler is free (up to 3 jobs per billing account), and Cloud Functions/Cloud Run have a huge free tier (and here they only run for a few milliseconds).
Streaming to BigQuery is not free but it is affordable: half a cent per 100 MB!
Note: Cloud Run will one day offer long-running jobs. When that feature is released, you could reuse your Cloud Build container in Cloud Run. Until then, what I propose here is a workaround.
On Google Cloud Platform, BigQuery is a great serverless choice: you can start small and grow over time.
With partitioning, clustering and other improvements, we've been successfully using it behind a UI (4-8k queries per day), with most queries completing in under a second.
You can also get all your data seamlessly ingested from various sources, with millions of files per day, into one or many tables with BqTail.
I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc Cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of cron job on DataProc or some documentation on DataProc exclusively working together with Scheduler.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster while also handling errors) Dataproc's Workflow Templates API is a better choice than trying to orchestrate these yourself. A key advantage is Workflows are fire-and-forget and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it would be to use Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is simply to run a workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new service account in my project, called workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console. I created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
Note you can also copy the entire workflow content in the Body as JSON payload. The last part of the URL would become workflowTemplates:instantiateInline?alt=json
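For the question's "7th of every month" schedule, the same job could also be described as data instead of being clicked together in the console. The sketch below builds a job body in Python. It is only an assumption of how the settings above map onto the Cloud Scheduler REST Job resource (schedule, httpTarget, oauthToken), so check the field names against the API reference before sending it anywhere. The helper name dataproc_workflow_job is made up for this example:

```python
import base64
import json


def dataproc_workflow_job(project, region, template, service_account, schedule):
    """Build a Cloud Scheduler job body that instantiates a workflow template."""
    uri = (
        f"https://dataproc.googleapis.com/v1/projects/{project}"
        f"/regions/{region}/workflowTemplates/{template}:instantiate?alt=json"
    )
    return {
        "schedule": schedule,
        "httpTarget": {
            "uri": uri,
            "httpMethod": "POST",
            # Assumption: the REST API takes the request body as base64-encoded bytes.
            "body": base64.b64encode(b"{}").decode("ascii"),
            "oauthToken": {"serviceAccountEmail": service_account},
        },
    }


job = dataproc_workflow_job(
    project="example",
    region="global",
    template="terasort-example",
    service_account="workflow-starter@example.iam.gserviceaccount.com",
    schedule="0 0 7 * *",  # unix-cron: 00:00 on the 7th of every month
)
print(json.dumps(job, indent=2))
```

The cron fields are minute, hour, day of month, month and day of week, so "0 0 7 * *" fires once a month on the 7th.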
Check out this official doc that discusses other scheduling options.
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
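The receiving side of that Pub/Sub approach can be sketched as follows. This assumes a Pub/Sub-triggered background function with the (event, context) signature, where the message payload arrives base64-encoded under event["data"]. The start_workflow callable and the "template" key in the message are hypothetical: they stand in for whatever Dataproc call and message format you actually use:

```python
import base64
import json


def parse_pubsub_event(event):
    """Decode the JSON payload of a Pub/Sub-triggered background function event."""
    raw = base64.b64decode(event.get("data", "")).decode("utf-8")
    return json.loads(raw) if raw else {}


def make_handler(start_workflow):
    """Wrap `start_workflow` (hypothetical) as a background function entry point."""
    def handler(event, context):
        params = parse_pubsub_event(event)
        # Fall back to a default template when the message carries no payload.
        return start_workflow(params.get("template", "terasort-example"))
    return handler
```

Keeping the Dataproc call behind start_workflow means the message parsing can be tested locally with a fake event before anything is deployed.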