How would you retry a web service request every ten seconds, up to ten times, until it answers?
I've tried RecoverWithRetries and InitialDelay, but the first recovery immediately replays the web service call:
FromThirdOfContract().RecoverWithRetries(e =>
{
    return Source.FromTask(_third.GetThird(message.ContractIdLegacy))
        .InitialDelay(TimeSpan.FromSeconds(secondsbetween));
}, retry);
The first retry happens immediately instead of ten seconds later. In Akka, there's a RestartSource class; we don't have it in Akka.NET. Any ideas?
I finally settled on wrapping my source in Source.Lazily(). It works: the source is not evaluated before the initial delay has elapsed. I'm still listening for any other ideas, though.
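For reference, the RestartSource the question mentions looks roughly like this on the JVM side. This is a hedged Java sketch, assuming a recent Akka 2.5.x (where onFailuresWithBackoff and its maxRestarts overload exist) and a hypothetical callWebService() returning a CompletionStage:

import akka.NotUsed;
import akka.stream.javadsl.RestartSource;
import akka.stream.javadsl.Source;

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class RetryEveryTenSeconds {

    // Hypothetical stand-in for the real web service call.
    static CompletionStage<String> callWebService() {
        return CompletableFuture.completedFuture("answer");
    }

    // Restart only on failure, every 10 seconds (min == max backoff and
    // randomFactor 0 give a fixed interval with no jitter), giving up
    // after 10 attempts.
    public static Source<String, NotUsed> retried() {
        return RestartSource.onFailuresWithBackoff(
            Duration.ofSeconds(10),  // minBackoff
            Duration.ofSeconds(10),  // maxBackoff
            0.0,                     // randomFactor
            10,                      // maxRestarts
            () -> Source.fromCompletionStage(callWebService()));
    }
}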
We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 HTTP 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas assuming that the # of instances handed to the model would be (1000/# of replicas) but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing it to 4 in the Python code that creates the batch job:
from google.cloud import aiplatform

model_parameters = dict(batch_size=4)

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16, and that helped somewhat, but I'm not sure I understand how batches are distributed to the replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
"Is that timeout for a single instance request or a batch request? Also, is it in seconds?"
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), timeout refers to the RPC timeout. If we trace the code, we eventually end up in the generated GAPIC client, where timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
    to complete. Note that if ``retry`` is used, this timeout
    applies to each individual attempt and the overall time it
    takes for this method to complete may be longer. If
    unspecified, the default timeout in the client
    configuration is used. If ``None``, then the RPC method will
    not time out.
My suggestion is to stick with whatever works for your prediction model. If adding the timeout improves things, build on that, along with your initial solution of using a machine with a higher spec. You can also try a machine with more memory, such as the n1-highmem-* family.
I have created a new Cloud Function using Java 11 (Beta) Runtime to handle HTML form submission for my static site. It's a simple 3-field form (name, email, message). No file upload is involved. The function does 2 things primarily:
Creates a pull request with BitBucket
Sends email to me using SendGrid
NOTE: It also verifies recaptcha but I've disabled it for testing.
The function, when run on my local machine (base model 2019 MacBook Pro 13"), takes about 3 secs. I'm based in SE Asia. The same function, when deployed to Google Cloud us-central1, takes about 25 secs (8 times slower). I have almost the same code running in production as part of a Servlet on the GAE Java 8 runtime, also in the US Central region, for a few years. It takes about 2-3 secs, including recaptcha verification and sending the email. I'm trying to port it over to a Cloud Function, but the performance is about 10 times slower with the Cloud Function, even without recaptcha verification.
For comparison, the Cloud Function is running on a 256MB / 400MHz instance, whereas my GAE Java 8 runtime runs on an F1 (128MB / 600MHz) instance. The function uses only about 75MB of memory. The function is configured to accept unauthenticated requests.
I noticed that even a basic String concatenation like String c = a + b; takes a good 100 ms on the Cloud Function. I have timed the calls, and a simple concatenation of about 15 strings into one takes about 1.5-2.0 seconds.
Moreover, writing a small message (~1KB) to the HttpURLConnection output stream and reading the response back takes about 10 seconds (yes, seconds)!
/* Writing < 1KB to the output stream takes about 4-5 secs */
OutputStreamWriter wr = new OutputStreamWriter(con.getOutputStream());
wr.write(encodedParams);
wr.flush();
wr.close();

/* Reading the response also takes about 4-5 secs */
String responseMessage = con.getResponseMessage();
Similarly, the SendGrid code below takes another 10 secs to send the email. It takes about 1 sec on my local machine.
Email from = new Email(fromEmail, fromName);
Email to = new Email(toEmail, toName);
Email replyTo = new Email(replyToEmail, replyToName);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);
mail.setReplyTo(replyTo);

SendGrid sg = new SendGrid(SENDGRID_API_KEY);
Request sgRequest = new Request();
Response sgResponse = null;
try {
    sgRequest.setMethod(Method.POST);
    sgRequest.setEndpoint("mail/send");
    sgRequest.setBody(mail.build());
    sgResponse = sg.api(sgRequest);
} catch (IOException ex) {
    throw ex;
}
Something is obviously wrong with the Cloud Function. Since my original code runs on the GAE Java 8 runtime, it was very easy to port over to the Cloud Function with minor changes; otherwise I would have gone with the Node.js runtime. I'm also not seeing any of these performance issues when running the function on my local machine.
Can someone help me make sense of the slow performance issue?
What you're seeing is almost certainly due to the "cold start" cost associated with the creation of a new server instance to handle the request. This is an issue with all types of Cloud Functions, as described in the documentation:
Several of the recommendations in this document center around what is known as a cold start. Functions are stateless, and the execution environment is often initialized from scratch, which is called a cold start. Cold starts can take significant amounts of time to complete. It is best practice to avoid unnecessary cold starts, and to streamline the cold start process to whatever extent possible (for example, by avoiding unnecessary dependencies).
I would expect JVM languages to have an even longer cold start time, due to the amount of time it takes to initialize a JVM, in addition to the server instance itself.
Other than the advice above, there is very little one can do to effectively mitigate cold starts. Efforts to keep a function warm are not as effective as you might imagine. There is a lot of discussion about this on the internet if you wish to search.
Keep in mind that the Java runtime is also in beta, so you can expect improvements in the future. The same thing happened with the other runtimes.
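One thing that does help is to pay one-time initialization costs once per instance rather than once per request. Here is a minimal sketch, assuming the Java Functions Framework's HttpFunction API and a hypothetical SENDGRID_API_KEY environment variable:

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.sendgrid.SendGrid;

public class FormHandler implements HttpFunction {

    // Initialized once per server instance: warm invocations reuse the
    // client, so its construction cost is paid only during a cold start.
    private static final SendGrid SENDGRID =
        new SendGrid(System.getenv("SENDGRID_API_KEY"));

    @Override
    public void service(HttpRequest request, HttpResponse response) throws Exception {
        // ... build the mail and send it via the shared SENDGRID client ...
        response.getWriter().write("OK");
    }
}

This does not shorten the cold start itself, but it keeps expensive setup out of the request path for every warm invocation.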
Our system processes messages delivered from a messaging system. If no message is received after 10 seconds, an error should be raised (inactivity timeout).
I was thinking of using a ScheduledExecutorService (with 1 Thread). Each time a message is received, I cancel the previous timeout task and submit a new one:
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
Callable<Void> timeoutTask = new Callable<Void>() {...};
...
synchronized (lock) {
    timeout.cancel(false);
    timeout = scheduler.schedule(timeoutTask, 10, TimeUnit.SECONDS);
}
In the normal case, we process ~1000 messages/sec. Would this approach scale?
If you share the thread pool and keep the running time of timeoutTask low, it will most probably be fine. If you had one thread pool per task at ~1000 tasks/sec, that would not work.
If you are still worried, you can have a look at HashedWheelTimer from the Netty project. It is very efficient at scheduling timeouts. Note that you MUST share the HashedWheelTimer instance as well; see the sketch below.
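A minimal sketch of that pattern, assuming Netty 4's io.netty.util.HashedWheelTimer and a hypothetical onInactivity() callback that raises the error:

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class InactivityWatchdog {

    // One shared timer for the whole application; each HashedWheelTimer
    // owns a thread, so creating many of them defeats the purpose.
    private static final HashedWheelTimer TIMER = new HashedWheelTimer();

    private final AtomicReference<Timeout> pending = new AtomicReference<>();

    // Call on every received message: arm a new 10-second timeout
    // and cancel the previously pending one.
    public void onMessage() {
        Timeout next = TIMER.newTimeout(t -> onInactivity(), 10, TimeUnit.SECONDS);
        Timeout previous = pending.getAndSet(next);
        if (previous != null) {
            previous.cancel();
        }
    }

    private void onInactivity() {
        // raise the inactivity error here (hypothetical)
    }
}

The wheel trades precision (tick granularity is 100 ms by default) for cheap scheduling and cancellation, which is a good fit for coarse inactivity timeouts like this one.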
We are using Jetty 8.1 as an embedded HTTP server. Under overload conditions the server sometimes starts flooding the log file with these messages:
warn: java.util.concurrent.RejectedExecutionException
warn: Dispatched Failed! SCEP#76107610{l(...)<->r(...),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}...
The same message is repeated thousands of times, and the amount of logging appears to slow down the whole system. The messages themselves are fine; our request handler is just too slow to process the requests in time. But the huge number of repeated messages actually makes things worse and makes it harder for the system to recover from the overload.
So, my question is: is this a normal behaviour, or are we doing something wrong?
Here is how we set up the server:
Server server = new Server();
SelectChannelConnector connector = new SelectChannelConnector();
connector.setAcceptQueueSize( 10 );
server.setConnectors( new Connector[]{ connector } );
server.setThreadPool( new ExecutorThreadPool( 32, 32, 60, TimeUnit.SECONDS,
new ArrayBlockingQueue<Runnable>( 10 )));
The SelectChannelEndPoint is the origin of this log message.
To not see it, just set your named logger of org.eclipse.jetty.io.nio.SelectChannelEndPoint to LEVEL=OFF.
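For example, assuming Jetty's default StdErrLog backend is in use (if you route Jetty logging through java.util.logging or SLF4J, set the level in that framework instead):

// Assumption: Jetty's default StdErrLog is the logging backend.
// Set before the server starts; this silences only the one noisy logger.
System.setProperty("org.eclipse.jetty.io.nio.SelectChannelEndPoint.LEVEL", "OFF");

// Equivalent when Jetty logging is routed through java.util.logging:
java.util.logging.Logger
    .getLogger("org.eclipse.jetty.io.nio.SelectChannelEndPoint")
    .setLevel(java.util.logging.Level.OFF);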
Now as for why you see it, that is more interesting to the developers of Jetty. Can you detail what specific version of Jetty you are using and also what specific JVM you are using?
There is a line in the 3rd tutorial on Boost asio that shows how to renew a timer and yet prevent there from being drift. The line is the following:
t->expires_at(t->expires_at() + boost::posix_time::seconds(1));
Maybe it's just me, but I wasn't able to find documentation on the second usage of expires_at(), the one with no parameters. expires_at(x) sets the new expiration, cancelling any pending completion handlers. So presumably expires_at() returns the time of the last expiry? Then, by adding one second to it, any delay of, say, n ms in running the handler is in effect subtracted from the next expiry, since that time is already accounted for? And what happens if performing the handler takes longer than one second in this example? Does the timer fire immediately?
expires_at() with no arguments returns the time at which the timer is currently set to expire. So this moves the expiry to one second after the previous expiry, not one second from now, which is what prevents drift. If the handler takes longer than one second, the new expiry time is already in the past, and the next asynchronous wait completes immediately.
When you set the time with expires_at(x), the return value is the number of pending asynchronous waits that were cancelled. A return of 0 means nothing was cancelled, e.g. because the timer had already expired and its handlers had already been invoked or queued; a value greater than 0 tells you how many waits were cancelled.