Can Cloud Dataflow streaming job scale to zero? - google-cloud-platform

I'm using Cloud Dataflow streaming pipelines to insert events received from Pub/Sub into a BigQuery dataset. I need a few of them to keep each job simple and easy to maintain.
My concern is the overall cost. The volume of data is not very high, and during some periods of the day there isn't any data at all (no messages on Pub/Sub).
I would like Dataflow to scale down to 0 workers until a new message is received, but it seems the minimum is 1 worker.
So the minimum price for each job per day would be 24 vCPU-hours, i.e. at least $50 a month per job (without the discount for monthly usage).
I plan to start and drain my jobs via the API a few times per day to avoid one full-time worker, but this doesn't seem like the right approach for a managed service like Dataflow.
Is there something I missed?

Dataflow can't scale to 0 workers, but your alternatives would be to use Cron or Cloud Functions to create a Dataflow streaming job whenever an event triggers it; for stopping the Dataflow job by itself, you can read the answers to this question.
You can find an example here for both cases (Cron and Cloud Functions). Note that Cloud Functions is no longer in Alpha; since July it has been in General Availability.
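As a rough illustration of the Cloud Functions approach, the sketch below launches a streaming job from the Google-provided Pub/Sub to BigQuery template whenever the function is triggered. The project, region, topic and table names are placeholders, not values from the question, and the template parameters are assumed from the public template docs.

```python
# Hypothetical Cloud Function (Python) that launches a streaming Dataflow job
# from the Google-provided Pub/Sub to BigQuery template when triggered.
# Project, bucket, topic and table names below are placeholders.
from googleapiclient.discovery import build


def launch_dataflow(event, context):
    dataflow = build("dataflow", "v1b3")
    dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "pubsub-to-bq-on-demand",
            "parameters": {
                "inputTopic": "projects/my-project/topics/events",
                "outputTableSpec": "my-project:my_dataset.events",
            },
        },
    ).execute()
```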

A streaming Dataflow job must always have at least one worker. If the volume of data is very low, perhaps batch jobs fit the use case better: using a scheduler or cron you can periodically start a batch job to drain the topic, and this will save on cost.

Related

Running multiple jobs in SageMaker

I was wondering if it is possible to run a large number of "jobs" (or "pipelines", or whatever the right term is) to execute some modelling tasks in parallel.
What I plan to do is an ETL process and EDA, and once the data is ready I would like to fire 2000 modelling jobs. We have 2000 products, and each job can start with its own data (SELECT * FROM DATA WHERE PROD_ID='xxxxxxxxx'). My idea is to run these training jobs in parallel (there is no dependency between them, so it makes sense to me).
First of all: 1) Can it be done in AWS SageMaker? 2) What would be the right approach? 3) Are there any special considerations I need to be aware of?
Thanks a lot in advance!
It's possible to run this on SageMaker, with SageMaker Pipelines orchestrating a SageMaker Processing job followed by a Training job. You can define PROD_ID as a String parameter to the SageMaker Pipeline, then run multiple pipeline executions concurrently (the default soft limit is 200 concurrent executions).
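As a rough sketch of the fan-out, assuming a pipeline you have already created with a `ParameterString` called `ProdId` (the pipeline name and parameter name below are made up for illustration):

```python
# Hypothetical fan-out: start one pipeline execution per product.
# "prod-training-pipeline" and "ProdId" are placeholder names.
import boto3

sm = boto3.client("sagemaker")
prod_ids = ["P0001", "P0002", "P0003"]  # ... up to your 2000 product IDs

for prod_id in prod_ids:
    sm.start_pipeline_execution(
        PipelineName="prod-training-pipeline",
        PipelineParameters=[{"Name": "ProdId", "Value": prod_id}],
    )
```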
As you have a very high number of jobs (2K) which you want to run in parallel, and perhaps want to optimize compute usage, you might also want to look at AWS Batch, which allows you to queue up tasks for a fleet of instances that starts containers to perform these jobs. AWS Batch also supports Spot Instances, which could reduce your instance cost by 70%-90%. Another advantage of AWS Batch is that jobs reuse the same running instance (only the container stops/starts), while in SageMaker there's a ~2 minute overhead to start the instance per job. Additionally, AWS Batch takes care of retries and allows you to chain all 2,000 jobs together and run a "finisher" job when all jobs have completed.
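If you go the AWS Batch route, the equivalent fan-out could look roughly like the sketch below; the job queue, job definition and environment variable are assumptions for illustration, not anything from the question.

```python
# Hypothetical AWS Batch fan-out: one job per product, parameterised through
# an environment variable that the container reads at startup.
import boto3

batch = boto3.client("batch")
prod_ids = ["P0001", "P0002", "P0003"]  # ... up to your 2000 product IDs

for prod_id in prod_ids:
    batch.submit_job(
        jobName=f"train-{prod_id}",
        jobQueue="training-queue",      # placeholder queue name
        jobDefinition="train-model:1",  # placeholder job definition
        containerOverrides={
            "environment": [{"name": "PROD_ID", "value": prod_id}]
        },
    )
```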
Limits increase - For either service, you'll need to increase your service quota limits. This can be done from the Service Quotas console for most services, or by contacting AWS support. Some services have hard limits.

AWS Glue Job Alerting on Long Run Time

I'm hoping to configure some form of alerting for AWS Glue jobs when they run longer than a configurable amount of time. These Glue jobs can be triggered at any time of day and usually take less than 2 hours to complete. However, if a job exceeds the 2-hour threshold, I want to get a notification for it (via SNS).
Usually I can configure run time alerting in CloudWatch Metrics, but I am struggling to do this for a Glue Job. The only metric I can see that could be useful is
glue.driver.aggregate.elapsedTime, but it doesn't appear to help. Any advice would be appreciated.
You could use the AWS SDK for that. You just need the job run ID and then call GetJobRun to get the execution time. Based on that you can then notify someone / some other service.
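A minimal sketch with Boto3, assuming something like a scheduled Lambda that already knows the job name and run ID, plus a pre-created SNS topic; the names, ARN and threshold below are placeholders:

```python
# Hypothetical check: fetch a Glue job run via get_job_run, compute how long it
# has been running, and notify an SNS topic if that exceeds a threshold.
# Job name, run ID and topic ARN are placeholders.
from datetime import datetime, timezone

import boto3

THRESHOLD_SECONDS = 2 * 60 * 60  # 2 hours

glue = boto3.client("glue")
sns = boto3.client("sns")

run = glue.get_job_run(JobName="my-glue-job", RunId="jr_0123456789abcdef")["JobRun"]
elapsed = (datetime.now(timezone.utc) - run["StartedOn"]).total_seconds()

if run["JobRunState"] == "RUNNING" and elapsed > THRESHOLD_SECONDS:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:glue-alerts",
        Subject="Glue job running longer than expected",
        Message=f"Job run {run['Id']} has been running for {int(elapsed)} seconds.",
    )
```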

Where is FlexRS type in dataflow part of batch or stream processing?

I have a question about the FlexRS type that I came across while looking at Dataflow on Google Cloud Platform. :)
As far as I know, Dataflow supports batch and stream processing, but I was curious to learn that there is also a FlexRS type.
From the documentation alone it was difficult to understand FlexRS, other than that it is cheaper than a typical workflow.
Can I ask you to explain FlexRS?
Thank you.
Dataflow is a fully managed batch and streaming analytics service that minimises latency and processing time through autoscaling.
Regarding FlexRS for Dataflow, you can use it in batch processing pipelines which are not time-critical, such as daily or weekly jobs that can be completed within a certain time window. With FlexRS, Dataflow uses a mixture of preemptible and regular workers to execute your job. It takes into account the availability of preemptible VMs, and you are charged according to the documentation.
On the other hand, FlexRS offers a discounted rate for CPU and memory pricing for batch processing. It can delay your Dataflow batch job within a 6-hour window to identify the best point in time to start the job, based on the availability of resources. When enabled, FlexRS selects preemptible VMs for 90% of workers in the worker pool by default.
Therefore, FlexRS is used only for non time-critical batch workloads.
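For reference, opting a batch pipeline into FlexRS is just a pipeline option; here is a minimal sketch with the Beam Python SDK, where the project, region, bucket and file paths are placeholders:

```python
# Minimal sketch: a batch pipeline submitted with FlexRS enabled via the
# flexrs_goal option. Project, region, bucket and paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",  # asks Dataflow to schedule this batch job as a FlexRS job
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```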

Should I use pub/sub

I am trying to write an ingestion application using GCP services. There could be around 1 TB of data each day, which can arrive in a streaming way (i.e., 100 GB each hour) or even all at once at a specific time.
While designing the ingestion application, I first thought it would be a good idea to write a simple Python script within a cron job to read files sequentially (or even with two or three threads) and then publish them as messages to Pub/Sub. Further, I would need a Dataflow job running at all times to read data from Pub/Sub and save it to BigQuery.
But I really want to know if I need Pub/Sub at all here. I know Dataflow can be very flexible, and I wanted to know whether I can ingest 1 TB of data directly from GCS to BigQuery as a batch job, or whether it is better done by a streaming job (via Pub/Sub) as described above. What are the pros and cons of each approach in terms of cost?
It seems like you don't need Pub/Sub at all.
There is already a Dataflow template for direct transfer of text files from Cloud Storage to BigQuery (in Beta, just like the Pub/Sub to BigQuery template), and in general, batch jobs are cheaper than streaming jobs (see Pricing Details).
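If you prefer writing your own batch pipeline rather than using the template, a direct GCS-to-BigQuery job is also straightforward. A rough sketch follows; the bucket, table, schema and parsing logic are all assumptions you would replace with your own.

```python
# Rough sketch of a batch pipeline loading CSV lines from GCS into BigQuery,
# skipping Pub/Sub entirely. Bucket, table, schema and parsing are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_line(line):
    # Placeholder parser: adapt to your real file format.
    user_id, amount = line.split(",")
    return {"user_id": user_id, "amount": float(amount)}


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/incoming/*.csv")
        | "Parse" >> beam.Map(parse_line)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="user_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```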

GCP Dataflow vCPU usage and pricing question

I submitted a GCP Dataflow pipeline to receive my data from GCP Pub/Sub, parse it and store it to GCP Datastore. It seems to work perfectly.
Over 21 days, I found the cost is $144.54 and the usage time is 2,094.72 hours. That means that after I submitted it, it is charged every second, even when it doesn't receive (process) any data from Pub/Sub.
Is this behavior normal? Or did I set the wrong parameters?
I thought CPU time would only be counted when data is received.
Is there any way to reduce the cost with the same working model (receive from Pub/Sub and store to Datastore)?
Cloud Dataflow service usage is billed in per-second increments, on a per-job basis. I guess your job used 4 n1-standard-1 workers, which used 4 vCPUs, giving an estimated 2,000 vCPU hours of resource usage. Therefore, this behavior is normal. To reduce the cost, you can use either autoscaling, to specify the maximum number of workers, or the pipeline options, to override the resource settings that are allocated to each worker. Depending on your needs, you could consider using Cloud Functions, which costs less, but keep its limits in mind.
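A rough sketch of those pipeline options in the Beam Python SDK; the worker cap and machine type below are arbitrary examples, not recommendations, and the project/bucket names are placeholders:

```python
# Hypothetical options capping the streaming job at a single small worker;
# all values are examples only.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    max_num_workers=1,             # cap autoscaling at one worker
    machine_type="n1-standard-1",  # smaller workers mean fewer vCPU hours billed
)
```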
Hope it helps.