How to do "live request batching" in gcloud - google-cloud-platform

Here is my situation:
I have a rather slow TensorFlow model that runs on GPU (2 to 3 seconds per prediction)
A prediction for a single 'entity' vs a prediction for 8 'entities' takes about the same time
This means I could be 8 times as efficient by simply combining multiple predictions in the same request
I have a service on AI platform serving requests to that model
The service works at low request rates but has trouble scaling up (anything over 4 QPS is too much to handle)
My question then is:
Is there a standard way / best practice for batching live client requests:
When receiving a request, wait a little bit for other requests
After a while, or when the number of requests reaches a set number, forward the requests in a single "batch" to another service.
If traffic is low, the delay will expire before the batch is full, but since traffic is low, that's not an issue
If traffic is high, the batch will fill before the delay expires, so clients will have to wait even less
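For concreteness, here is a minimal single-process sketch of the pattern I have in mind (BATCH_SIZE, MAX_WAIT_S and forward_batch are placeholder names, not anything I actually have running):

import threading

BATCH_SIZE = 8        # flush when this many requests are queued (made-up value)
MAX_WAIT_S = 0.5      # or after this much time has passed (made-up value)

_lock = threading.Lock()
_pending = []
_timer = None

def forward_batch(batch):
    # Hypothetical: send the whole batch to the prediction service in one request.
    pass

def _flush():
    global _pending, _timer
    with _lock:
        batch, _pending = _pending, []
        if _timer is not None:
            _timer.cancel()
            _timer = None
    if batch:
        forward_batch(batch)

def submit(request_payload):
    # Called once per incoming client request.
    global _timer
    flush_now = False
    with _lock:
        _pending.append(request_payload)
        if len(_pending) >= BATCH_SIZE:
            flush_now = True
        elif _timer is None:
            _timer = threading.Timer(MAX_WAIT_S, _flush)
            _timer.start()
    if flush_now:
        _flush()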
I have an almost-working solution with App Engine + Firebase (for hosting the shared 'queue'), but implementing the delay is giving me trouble (App Engine doesn't seem to like Python's threading.Timer).
I'd appreciate something that works with App Engine, but at this point I'm open to any suggestions (as long as they are applicable on Google Cloud).
Thanks!

The perfect (but not the cheapest) solution is to use Dataflow.
When a prediction request comes in, publish it to Pub/Sub.
Deploy a Dataflow pipeline in streaming mode, with fixed windows of X minutes and an additional, non-accumulating trigger that fires after Y events in the window.
When a window trigger fires (either on the message count or on the timer), do the batch processing.
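A rough sketch of that setup with the Beam Python SDK (the topic path, window length, event count and the prediction call are placeholders, not values from the question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

def run_batch_prediction(batch):
    # Hypothetical: call the model once with the whole batch of requests.
    pass

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadRequests" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/prediction-requests")   # placeholder
     | "Window" >> beam.WindowInto(
           window.FixedWindows(60),                       # X = 1 minute (assumption)
           trigger=trigger.Repeatedly(
               trigger.AfterAny(trigger.AfterCount(8),    # Y = 8 events (assumption)
                                trigger.AfterProcessingTime(60))),
           accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | "CollectBatch" >> beam.CombineGlobally(
           beam.combiners.ToListCombineFn()).without_defaults()
     | "Predict" >> beam.Map(run_batch_prediction))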
You can imagine other, simpler/cheaper designs.
Still publish the prediction requests to Pub/Sub.
You can schedule a Cloud Function or a Cloud Run service every X minutes to pull the Pub/Sub subscription and then trigger the batch job. But it runs at a fixed interval.
When you publish a message to Pub/Sub, you can also store state in Firestore, for example: increment a counter and record the date of the first message published to Pub/Sub.
If the number of messages is above your threshold, call your other process, which pulls the Pub/Sub subscription and runs the batch processing (as in #1 before). Reset the counter value and the message date value.
Set up a Cloud Scheduler job which checks, every minute, the first-message date in Firestore. If it is older than your time limit, call your other process, which pulls the Pub/Sub subscription and runs the batch processing (as in #1 before). Reset the counter value and the message date value.
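A minimal sketch of that counter logic (the collection/document names, threshold, topic and batch-trigger URL are all made up):

from datetime import datetime, timezone

import requests
from google.cloud import firestore, pubsub_v1

THRESHOLD = 8                                                   # assumption
TOPIC = "projects/my-project/topics/prediction-requests"        # placeholder
BATCH_TRIGGER_URL = "https://batch-runner-xyz.a.run.app/run"    # placeholder

publisher = pubsub_v1.PublisherClient()
db = firestore.Client()
state_ref = db.collection("batching").document("state")

@firestore.transactional
def bump_counter(transaction):
    snapshot = state_ref.get(transaction=transaction)
    data = snapshot.to_dict() or {}
    count = data.get("count", 0) + 1
    first_at = data.get("first_message_at") or datetime.now(timezone.utc)
    if count >= THRESHOLD:
        transaction.set(state_ref, {"count": 0, "first_message_at": None})
        return True   # threshold reached, trigger the batch now
    transaction.set(state_ref, {"count": count, "first_message_at": first_at})
    return False

def handle_prediction_request(payload: bytes):
    publisher.publish(TOPIC, payload).result()
    if bump_counter(db.transaction()):
        requests.post(BATCH_TRIGGER_URL)   # other process pulls the subscription

The Cloud Scheduler job would do the mirror-image check on the stored first-message date and call the same batch-trigger URL once that date is older than your time limit.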
Design #2 will generate a lot of Firestore reads and writes, but it will be cheaper than Dataflow.

Related

Cloud Run: 429: The request was aborted because there was no available instance

We (as a company) experience large spikes every day. We use the Pub/Sub -> Cloud Run combination.
The issue we experience is that when high traffic hits, Pub/Sub tries to push messages to Cloud Run all at the same time without any flow control. The result?
429: The request was aborted because there was no available instance.
Although this is marked as a warning, every 4xx HTTP response results in the message being redelivered.
Messages, therefore, come back to the queue and wait. If a message repeats this process and the instances are still taken, Cloud Run returns 429 again, and the message is sent back to the queue. This process repeats x times (depending on the value we set for Maximum delivery attempts). After that, the message goes to the dead-letter queue.
We want to avoid this and ideally don't get any 429, so the message won't travel back and forth, and it won't end up in the dead-letter subscription because it is not one of the application errors we want to keep there, but rather a warning caused by Pub/Sub not controlling the flow and coordinating with Cloud Run.
Neither Pub/Sub nor a push subscription (which Cloud Run requires) has any flow control feature.
Is there any way to control how many messages are sent to Cloud Run to avoid getting the 429 response? And also, why does Pub/Sub even try to deliver when it is obvious that Cloud Run has hit its instance limit? The best would be to keep the messages in a queue until instances free up.
Most of the answers would probably suggest increasing the limit of instances. We have already set it to 1000. This would not be scalable, because even if we set the limit to 1500 and a huge spike comes, we would pass the limit and get the 429 responses again.
The only option I can think of is some kind of flow control. So far, we have read about Cloud Tasks, but we are not sure if it can help us. Ideally, we don't want to introduce any new service, but if necessary, we will.
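For what it's worth, here is a rough sketch of how Cloud Tasks could sit between Pub/Sub and Cloud Run as the flow-control layer we are missing (project, location, queue name, URL and the limits are made up):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "europe-west1", "run-ingest")   # placeholders

def enqueue(message_body: bytes):
    # One task per Pub/Sub message; the queue, not Pub/Sub, now controls dispatch.
    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url="https://my-service-xyz.a.run.app/process",   # Cloud Run URL (placeholder)
            headers={"Content-Type": "application/json"},
            body=message_body,
        )
    )
    client.create_task(request={"parent": parent, "task": task})

# The flow control itself lives on the queue, for example:
#   gcloud tasks queues update run-ingest \
#       --max-dispatches-per-second=50 --max-concurrent-dispatches=200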
Thank you for all your tips and time! :)

First Message is Not Reaching the subscribed Actor

We have a messaging platform built on top of Akka (2.5) using Akka Cluster and Distributed PubSub. We currently have a cluster of 25 servers.
The scenario is as follows.
Actor1 created in Server1 subscribes to a topic Chat1.
Actor2 created in Server2 publishes a message over Chat1 (around 100 ms after the subscription).
Sometimes the first message is not received by Actor1, but subsequent messages always are.
We deduced that this happens because a subscription takes some time to register on all the nodes of the cluster. These are the actions we took to address this:
Decreased the gossip-interval from 1sec (default) to 50ms.
Added a delay of another 400 ms, thus giving the cluster 500 ms in total to register the subscription. This reduced the probability of the issue occurring, but it is still pretty frequent (around 1 in 6 times).
So, a few questions here:
Is it expected for PubSub to take more than 400 ms in a cluster of just 25 nodes (and that too in a private network of servers in the same data centre)?
Are there additional configurations in Akka which can help in tweaking the time taken for subscription propagation?
What are our options for monitoring the average time taken by PubSub for subscription propagation within the cluster? This would help in getting the right estimate of the delay to be introduced (if needed at all).
If the above-mentioned delay is expected, are there any workarounds which have been used by someone in the past to overcome this issue?

Throttle down GCP DataFlow?

I'm using the standard GCP-provided Storage/text file to Pub/Sub Dataflow template, but although I have set the number of worker nodes to 1, the throughput of processed messages is "too high" for downstream components.
A Cloud Function that runs on message events in Pub/Sub hits GCP quotas, and with Cloud Run I get a bunch of 500, 429 and 503 errors in the beginning (due to the steep burst rate).
Is there any way to control the processing rate of DataFlow? Need to get a softer/slower start so downstream components have time to scale up.
Anyone?
You can use stateful ParDos to achieve this, wherein you buffer events in batches and make an API call with all the keys at once. This is very nicely explained, with code snippets, here.
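A minimal sketch of that idea with a stateful DoFn in the Beam Python SDK (the batch size is an assumption, and the input has to be keyed):

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder, VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec, CombiningValueStateSpec

class BufferIntoBatches(beam.DoFn):
    # Buffers keyed elements and emits them in batches of MAX_BATCH.
    MAX_BATCH = 50   # assumption
    BUFFER = BagStateSpec("buffer", StrUtf8Coder())
    COUNT = CombiningValueStateSpec("count", VarIntCoder(), sum)

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                count=beam.DoFn.StateParam(COUNT)):
        key, value = element             # stateful DoFns require (key, value) input
        buffer.add(value)
        count.add(1)
        if count.read() >= self.MAX_BATCH:
            yield list(buffer.read())    # one batch -> one downstream API call
            buffer.clear()
            count.clear()

A production version would also register a timer to flush whatever is still buffered when the window closes.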

How to use Google Cloud PubSub and Run to handle resource-intensive long-running tasks?

I've got a Google Cloud PubSub topic which at times has thousands of messages and at times zero messages coming in. These messages represent tasks which can take upwards of an hour each. Preferably I'm able to use Cloud Run for this, as it scales really well to the demand: if a thousand messages get published, I want hundreds of Cloud Run instances to spin up. These Run instances get started by a push subscription.
The problem is that PubSub has a 600 second timeout for the acknowledgement. This means that in order to have Cloud Run process these messages, they have to finish within 600 seconds. If they do not, PubSub times them out and sends them again, causing the task to be restarted until the first run finally does acknowledge it (this causes the same task to be run many times). Cloud Run acknowledges the messages by returning a 2** HTTP status code. The documentation states:
When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.
So is it maybe possible to acknowledge a PubSub request through code and continue the processing, without having Google Cloud Run hand over the resources? Or is there a better solution I'm unaware of?
Because these processes are so code/resource-intensive, I feel Cloud Functions will not suffice. I've looked at https://cloud.google.com/solutions/using-cloud-pub-sub-long-running-tasks and https://cloud.google.com/blog/products/gcp/how-google-cloud-pubsub-supports-long-running-workloads. But these didn't answer my question.
I've looked at Google Cloud Tasks, which might be something? But the rest of the project has been built around PubSub/Run/Functions, so preferably I stick with that.
This project is written in Python.
So preferably I would like to write my Google Cloud Run tasks like this:
import logging

from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger(__name__)

@app.route('/', methods=['POST'])
def index():
    """Endpoint for Google Cloud PubSub messages"""
    pubsub_message = request.get_json()
    logger.info(f'Received PubSub pubsub_message {pubsub_message}')
    if message_incorrect(pubsub_message):
        return "Invalid request", 400  # use normal NACK handling
    # acknowledge message here without returning
    # ...
    # Do actual processing of the task here
    # ...
So how can or should I solve this, so that the resource-intensive tasks get properly scaled on demand (i.e. via a push PubSub subscription) and the tasks only get executed once?
Answers:
In short, what has been answered: Cloud Run and Cloud Functions are just not suited for this problem. There is no way to have them run tasks that take longer than 15 or 9 minutes respectively. The only solution is to switch over to another Google service, use a pull-style subscription, and lose out on the auto-scaling of Cloud Run/Functions.
Cloud Run on GKE can handle long processes, and offers more CPU and memory than are available on the managed platform. However, you have a GKE cluster always running, and you lose the "pay-as-you-use" benefit.
If you want to use this solution, don't link the Pub/Sub push subscription directly to your Cloud Run on GKE. Use Cloud Tasks with an HTTP task for this. The timeout is longer than Pub/Sub's (up to 24h instead of 10 min) and the retry policies are customizable.
Neither Cloud Functions nor Cloud Run is sufficient for arbitrarily long-running operations. Cloud Functions has a hard cap of 9 minutes per invocation, and Cloud Run caps at 60 minutes. If you need more time, you're going to have to delegate the work to another product, such as Google Compute Engine. It should be possible to kick off some Compute Engine work from one of the serverless products.
Given the limits of Pub/Sub acks, you'll probably have to find a way for a client to poll or listen to some resource to find out when the work is actually done. You could use a database for that, and Cloud Firestore lets you listen to documents to find out when they change. So you could use that to track the status of your long-running work.
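For example, a minimal sketch of a client listening to such a status document in Cloud Firestore (collection, document and field names are made up):

from google.cloud import firestore

db = firestore.Client()
job_ref = db.collection("jobs").document("job-123")   # placeholder

def on_job_update(snapshots, changes, read_time):
    for snapshot in snapshots:
        status = (snapshot.to_dict() or {}).get("status")
        print(f"job {snapshot.id} is now: {status}")

# The long-running worker (e.g. on Compute Engine) sets status="done" when it
# finishes; every listener is notified without polling.
watch = job_ref.on_snapshot(on_job_update)
# ... later: watch.unsubscribe()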

Estimate SQS processing time and load

I am going to use AWS SQS (regular queue, not FIFO) to process different client-side metrics.
I expect to have ~400 messages per second (worst case). My SQS messages will contain the S3 location of the file.
I created an application, which will listen to my SQS Queue, and process messages from it.
By process I mean:
read SQS message ->
take S3 location from that SQS message ->
call S3 client ->
Read that file ->
Add a few additional fields ->
Publish data from this file to AWS Kinesis Firehose.
A similar process will run for each SQS message in the queue. The S3 files are small, less than 0.5 KB.
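A rough per-message sketch of that pipeline with boto3 (the queue URL, message body layout, extra fields and Firehose stream name are all placeholders):

import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
firehose = boto3.client("firehose")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/metrics"   # placeholder

def process_batch():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])                       # contains the S3 location
        obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
        record = json.loads(obj["Body"].read())
        record["processed_by"] = "metrics-worker"            # add the extra fields
        firehose.put_record(DeliveryStreamName="client-metrics",
                            Record={"Data": json.dumps(record).encode()})
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])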
How can I calculate whether I will be able to process those 400 messages per second? How can I estimate whether my solution would handle a 5x increase in data?
Test it! Start with a small scale, and do the math to extrapolate from there. Make your test environment as close to what it will be in production as feasible.
On a single host and single thread, the math is simple:
1000 / AvgTotalTimeMillis = AvgMessagesPerSecond, or
1000 / AvgMessagesPerSecond = AvgTotalTimeMillis
How to approach testing this:
Start with a single thread and host, and generate some timing metrics for each step that you outlined, along with a total time.
Figure out your average/max/min time, and how many messages per second that translates to
400 messages per second on a single thread & host would be under 3ms per message. Hopefully this makes it obvious you need multiple threads/hosts.
Scale up!
Now that you know how much a single thread can handle, figure out how many threads a single host can effectively handle (you'll need to experiment). Consider batching messages where possible - SQS provides batch operations.
Use math to calculate how many hosts you need
If you need 5X that number, go up from there
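Putting illustrative numbers on the formula and scaling steps above (all values here are assumptions you would replace with measured ones):

import math

avg_total_time_ms = 40                        # measured per-message time, single thread
per_thread_rate = 1000 / avg_total_time_ms    # = 25 messages/second/thread
threads_per_host = 8                          # found experimentally
target_rate = 400 * 5                         # 400 msg/s with 5x headroom

hosts_needed = math.ceil(target_rate / (per_thread_rate * threads_per_host))
print(hosts_needed)                           # -> 10 hosts at these assumed numbers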
While you're doing this math, consider any limits of the systems you're using:
Review the throttling limits of SQS / S3 / Firehose / etc. If you plan to use Lambda to do the work instead of EC2, it has limits too. Make sure you're within those limits, and consider contacting AWS support if you are close to exceeding them.
A few other suggestions based on my experience:
Based on your workflow outline & details, using EC2 you can probably handle a decent number of threads per host
An m5.large should be more than enough - you can probably go smaller, as the performance bottleneck will likely be network I/O to fetch and send messages.
Consider using autoscaling to handle message spikes for when you need to increase throughput, though keep in mind autoscaling can take several minutes to kick in.
The only way to determine this is to create a test environment that mirrors your scenario.
If your solution is designed to handle messages in parallel, it should be possible to scale-up your system to handle virtually any workload.
A good architecture would be to use AWS Lambda functions to process the messages. Lambda defaults to 1000 concurrent executions. So, if a function takes 3 seconds to run, it would support about 333 messages per second consistently. You can request that the Lambda concurrency limit be increased to handle higher workloads.
If you are using Amazon EC2 instead of Lambda functions, then it would just be a matter of scaling-out and adding more EC2 instances with more workers to handle whatever workload you desired.