Jetty's HTTP/2 client slow?

When I do simple GET requests over a high-speed internet connection (from within an AWS EC2 instance), the throughput seems really low. I've tried this with multiple servers.
Here's the code that I'm using:
HTTP2Client http2Client = new HTTP2Client();
http2Client.start();
SslContextFactory ssl = new SslContextFactory(true);
HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(http2Client), ssl);
client.start();
long start = System.currentTimeMillis();
for (int i = 0; i < 10000; i++) {
    System.out.println("i" + i);
    ContentResponse response = client.GET("https://http2.golang.org/");
    System.out.println(response.getStatus());
}
System.out.println(System.currentTimeMillis() - start);
The throughput I get is about 8 requests per second.
This seems to be pretty low (as compared to curl on the command line).
Is there anything that I'm missing? Any way to turbo-charge this?
EDIT: How do I get Jetty to use multiple streams?

That is not the right way to measure throughput.
Depending on where you are geographically, you will be dominated by the latency between your client and the server. If the latency is 125 ms, you can only make 8 requests/s.
For example, from my location, the ping to http2.golang.org is 136 ms.
Even if your latency is less, there is a good chance that the server is throttling you: I don't think that http2.golang.org will be happy to see you making 10k requests in a tight loop.
I'd be curious to know what the curl or nghttp latency is in the same test, but I guess it won't be much different (or probably worse, if they close the connection after each request).
This test in the Jetty test suite, which is not a proper benchmark either, shows around 500 requests/s on my laptop; without TLS, it goes to around 2500 requests/s.
I don't know exactly what you're trying to do, but your test does not tell you anything about the performance of Jetty's HttpClient.
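As for the EDIT about multiple streams: HTTP/2 multiplexes many streams over a single connection, so the way to go beyond the latency-bound 8 requests/s is to issue requests concurrently instead of serially. A minimal sketch of that idea (not from the original answer; it reuses the client setup from the question and Jetty's asynchronous send(listener) API):
int total = 10000;
CountDownLatch latch = new CountDownLatch(total);   // java.util.concurrent
AtomicInteger failures = new AtomicInteger();       // java.util.concurrent.atomic
for (int i = 0; i < total; i++) {
    // send(listener) is asynchronous, so many requests can be in flight at once,
    // each on its own HTTP/2 stream over the same connection
    client.newRequest("https://http2.golang.org/")
          .send(result -> {
              if (result.isFailed()) {
                  failures.incrementAndGet();
              }
              latch.countDown();
          });
}
latch.await();
Note that the server's SETTINGS_MAX_CONCURRENT_STREAMS value still caps how many streams it will accept at once, so you may want to throttle the loop (for example with a semaphore) rather than firing all 10,000 requests blindly.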

Related

Vertex AI 504 Errors in batch job - How to fix/troubleshoot

We have a Vertex AI model that takes a relatively long time to return a prediction.
When hitting the model endpoint with one instance, things work fine. But batch jobs of, say, 1000 instances end up with around 150 504 errors (upstream request timeout). (We actually need to send batches of 65K, but I'm troubleshooting with 1000.)
I tried increasing the number of replicas, assuming that the number of instances handed to the model would be (1000 / number of replicas), but that doesn't seem to be the case.
I then read that the default batch size is 64, so I tried decreasing the batch size to 4, like this, in the Python code that creates the batch job:
from google.cloud import aiplatform

model_parameters = dict(batch_size=4)

def run_batch_prediction_job(vertex_config):
    aiplatform.init(
        project=vertex_config.vertex_project, location=vertex_config.location
    )
    model = aiplatform.Model(vertex_config.model_resource_name)
    model_params = dict(batch_size=4)
    batch_params = dict(
        job_display_name=vertex_config.job_display_name,
        gcs_source=vertex_config.gcs_source,
        gcs_destination_prefix=vertex_config.gcs_destination,
        machine_type=vertex_config.machine_type,
        accelerator_count=vertex_config.accelerator_count,
        accelerator_type=vertex_config.accelerator_type,
        starting_replica_count=replica_count,
        max_replica_count=replica_count,
        sync=vertex_config.sync,
        model_parameters=model_params,
    )
    batch_prediction_job = model.batch_predict(**batch_params)
    batch_prediction_job.wait()
    return batch_prediction_job
I've also tried increasing the machine type to n1-highcpu-16 and that helped somewhat, but I'm not sure I understand how batches are sent to replicas.
Is there another way to decrease the number of instances sent to the model?
Or is there a way to increase the timeout?
Is there log output I can use to help figure this out?
Thanks
Answering your follow-up question above:
Is that timeout for a single instance request or a batch request? Also, is it in seconds?
This is a timeout for the batch job creation request.
The timeout is in seconds. According to create_batch_prediction_job(), the timeout refers to the RPC timeout. If we trace the code we end up in the underlying gapic client, where the timeout is properly described:
timeout (float): The amount of time in seconds to wait for the RPC
to complete. Note that if ``retry`` is used, this timeout
applies to each individual attempt and the overall time it
takes for this method to complete may be longer. If
unspecified, the default timeout in the client
configuration is used. If ``None``, then the RPC method will
not time out.
What I would suggest is to stick with whatever is working for your prediction model. If adding the timeout improves things, build on that along with your initial solution of using a machine with a higher spec. You can also try a machine with more memory, such as the n1-highmem-* family.
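To illustrate the timeout part only, here is a minimal, hypothetical sketch that goes through the lower-level aiplatform_v1 gapic client, where the timeout keyword quoted above can be passed explicitly; the project, model and GCS paths are placeholders, not values from the question:
from google.cloud import aiplatform_v1

client = aiplatform_v1.JobServiceClient(
    client_options={"api_endpoint": "us-central1-aiplatform.googleapis.com"}
)
job = aiplatform_v1.BatchPredictionJob(
    display_name="my-batch-job",  # placeholder
    model="projects/PROJECT/locations/us-central1/models/MODEL_ID",
    input_config=aiplatform_v1.BatchPredictionJob.InputConfig(
        instances_format="jsonl",
        gcs_source=aiplatform_v1.GcsSource(uris=["gs://my-bucket/input.jsonl"]),
    ),
    output_config=aiplatform_v1.BatchPredictionJob.OutputConfig(
        predictions_format="jsonl",
        gcs_destination=aiplatform_v1.GcsDestination(
            output_uri_prefix="gs://my-bucket/output/"
        ),
    ),
)
response = client.create_batch_prediction_job(
    parent="projects/PROJECT/locations/us-central1",
    batch_prediction_job=job,
    timeout=600.0,  # per-attempt RPC timeout in seconds, as described in the docstring above
)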

Google Cloud Functions Java 11 (Beta) Runtime - Performance Issue

I have created a new Cloud Function using Java 11 (Beta) Runtime to handle HTML form submission for my static site. It's a simple 3-field form (name, email, message). No file upload is involved. The function does 2 things primarily:
Creates a pull request with BitBucket
Sends email to me using SendGrid
NOTE: It also verifies recaptcha but I've disabled it for testing.
The function, when run on my local machine (base model 2019 MacBook Pro 13"), takes about 3 secs. I'm based in SE Asia. The same function, when deployed to Google Cloud us-central1, takes about 25 secs (8 times slower). I have had almost the same code running in production as part of a servlet on the GAE Java 8 runtime, also in the US Central region, for a few years. It takes about 2-3 secs including recaptcha verification and sending the email. I'm trying to port it over to a Cloud Function, but the performance is about 10 times slower with the Cloud Function, even without recaptcha verification.
For comparison, the Cloud Function is running on a 256MB / 400MHz instance, whereas my GAE Java 8 runtime runs on an F1 (128MB / 600MHz) instance. The function is using only about 75MB of memory. The function is configured to accept unauthenticated requests.
I noticed that even basic String concatenation like: String c = a + b; takes a good 100ms on the Cloud Function. I have timed the calls and a simple string concatenation of about 15 strings into one takes about 1.5-2.0 seconds.
Moreover, writing a small message (~ 1KB) to the HTTPUrlConnection output stream and reading the response back takes about 10 seconds (yes seconds)!
/* Writing < 1KB to output stream takes about 4-5 secs */
wr = new OutputStreamWriter(con.getOutputStream());
wr.write(encodedParams);
wr.flush();
wr.close();
/* Reading the response also takes about 4-5 secs */
String responseMessage = con.getResponseMessage();
Similarly, the SendGrid code below takes another 10 secs to send the email. It takes about 1 sec on my local machine.
Email from = new Email(fromEmail, fromName);
Email to = new Email(toEmail, toName);
Email replyTo = new Email(replyToEmail, replyToName);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);
mail.setReplyTo(replyTo);
SendGrid sg = new SendGrid(SENDGRID_API_KEY);
Request sgRequest = new Request();
Response sgResponse = null;
try {
    sgRequest.setMethod(Method.POST);
    sgRequest.setEndpoint("mail/send");
    sgRequest.setBody(mail.build());
    sgResponse = sg.api(sgRequest);
} catch (IOException ex) {
    throw ex;
}
Something is obviously wrong with the Cloud Function. Since my original code is running on GAE Java 8 runtime, it was very easy for me to port it over to the Cloud Function with minor changes. Otherwise I would have gone with NodeJS runtime. I'm also not seeing any of the performance issues when running this function on my local machine.
Can someone help me make sense of the slow performance issue?
What you're seeing is almost certainly due to the "cold start" cost associated with the creation of a new server instance to handle the request. This is an issue with all types of Cloud Functions, as described in the documentation:
Several of the recommendations in this document center around what is known as a cold start. Functions are stateless, and the execution environment is often initialized from scratch, which is called a cold start. Cold starts can take significant amounts of time to complete. It is best practice to avoid unnecessary cold starts, and to streamline the cold start process to whatever extent possible (for example, by avoiding unnecessary dependencies).
I would expect JVM languages to have an even longer cold start time due to the amount of time that it takes to initialize a JVM, in addition to the server instance itself.
Other than the advice above, there is very little one can do to effectively mitigate cold starts. Efforts to keep a function warm are not as effective as you might imagine. There is a lot of discussion about this on the internet if you wish to search.
Keep in mind that the Java runtime is also in beta, so you can expect improvements in the future. The same thing happened with the other runtimes.
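One concrete way to apply the "streamline the cold start" advice in Java is to build expensive objects once per instance (static or lazily initialized fields) so that only the first, cold request pays for them and warm requests reuse them. A rough, hypothetical sketch (class and field names are illustrative, not the asker's code):
import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;
import com.sendgrid.SendGrid;

public class ContactFormFunction implements HttpFunction {
    // Created once per instance: warm invocations reuse it instead of paying
    // the construction cost on every request.
    private static final SendGrid SENDGRID =
            new SendGrid(System.getenv("SENDGRID_API_KEY"));

    @Override
    public void service(HttpRequest request, HttpResponse response) throws Exception {
        // ... parse the form, create the BitBucket pull request, send the email
        // using the shared SENDGRID client ...
        response.getWriter().write("OK");
    }
}
This does not remove the cold start itself, but it keeps the per-request work on warm instances to a minimum.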

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from the total filesize / concurrency.
I trawled the source code and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined, by the way; this is just condensed for the example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(30)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
Works great, except a 200MB file takes 9 minutes to upload... with 30 concurrent connections? That seems suspicious to me, so I upped the concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether or not there's a concurrency maximum, what it is, and whether you can specify the size of the chunks or if the chunk size is calculated automatically. My goal is to get a 500MB file to transfer to AWS S3 within 5 minutes, so I need to optimize this wherever possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk sizes based on concurrent connections, but you can set those yourself if needed for whatever reason.
I set the chunk size to 10 MB and 20 concurrent connections, and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds, much better than the 3.5 to 4 minutes it was getting in the dev environments. Interesting, but them's the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(20)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
I may need to up that max-age cache value, but as of yet this works acceptably. The key was moving the processor code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine or how high-class the internet connection.
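For context on how the numbers above interact: with a 10 MB minimum part size, a 100 MB file is split into roughly 10 parts, so a concurrency of 20 already lets every part upload in parallel, and raising the concurrency further cannot help until the part count exceeds it. S3 also caps a multipart upload at 10,000 parts, which in practice bounds the useful concurrency as well.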
We can halt all operations and abort the upload at any point during the process. We can also set the concurrency and the minimum part size:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource('/path/to/large/file.mov')
    ->setBucket('mybucket')
    ->setKey('my-object-key')
    ->setConcurrency(3)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();

try {
    $uploader->upload();
    echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
    $uploader->abort();
    echo "Upload failed.\n";
}

Improving Akka Remote Throughput

We're thinking about using Akka for client-server communications, and trying to benchmark data transfer. Currently we're trying to send a million messages where each message is a case class with 8 string fields.
At this point we're struggling to get acceptable performance. We see about a 600KB/s transfer rate and idle CPUs on client and server, so something is going wrong. Maybe it's our Netty configuration.
This is our Akka config:
Server {
  akka {
    extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
    log-dead-letters = 10
    log-dead-letters-during-shutdown = on
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "instance.myserver.com"
        port = 2553
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }
    cluster {
      seed-nodes = ["akka.tcp://server@instance.myserver.com:2553"]
      roles = [master]
    }
    contrib.cluster.receptionist {
      name = receptionist
      role = "master"
      number-of-contacts = 3
      response-tunnel-receive-timeout = 300s
    }
  }
}
Client {
  akka {
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
    actor {
      provider = "akka.remote.RemoteActorRefProvider"
    }
    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "127.0.0.1"
        port = 0
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }
    cluster-client {
      initial-contacts = ["akka.tcp://server@instance.myserver.com:2553/user/receptionist"]
      establishing-get-contacts-interval = "10s"
      refresh-contacts-interval = "10s"
    }
  }
}
UPDATE:
In the end, notwithstanding the discussion about serialisation (see below), we just converted our payloads to byte arrays so that serialisation would not affect the tests. We found that on a Core i7 with 24GB RAM, using JeroMQ (i.e. ZeroMQ re-implemented in Java, so still not the fastest) we consistently saw about 200k msgs/sec or about 20MB/sec, whilst on raw Akka (i.e. without the ZeroMQ plugin) we saw about 10k msgs/sec or just under 1MB/sec. Trying Akka + ZeroMQ made the performance worse.
To make it easy to get started with remoting, Akka uses Java serialization for messages by default. This is not what you'd typically use in production, because it's not very fast and does not handle message versioning well.
What you should do is use Kryo or Protobuf for serialization, and you should be able to get much better numbers.
You can read about how to do that here; there are also a few links to available serializers at the bottom of the page: http://doc.akka.io/docs/akka/current/scala/serialization.html
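For illustration, a rough sketch of what such a binding can look like in application.conf, assuming the third-party akka-kryo-serialization extension (the serializer class name comes from that library; the message class is a placeholder, not from the original post):
akka {
  actor {
    serializers {
      # provided by the akka-kryo-serialization library (a separate dependency)
      kryo = "com.romix.akka.serialization.kryo.KryoSerializer"
    }
    serialization-bindings {
      # bind your own message types to the kryo serializer
      "com.example.BenchmarkMessage" = kryo
    }
  }
}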
OK, here's what we discovered:
Using our client-server model we were able to get fewer than 3000 msg/sec without doing anything, which was not really acceptable to us, but we were really confused about what was happening because we weren't able to max out the CPU.
Therefore we went back to the akka source code and found a benchmarking sample there:
sample.remote.benchmark.Receiver
sample.remote.benchmark.Sender
These are two classes that use Akka remote to ping-pong a bunch of messages between two JVMs on the same machine. Using this benchmark we were able to get around 10-15k msg/sec on a Core i7 with 24GB, using about 50% CPU. We found that adjusting the dispatchers and the threads allocated to Netty made a difference, but only marginally. Using asks instead of tells made it a bit slower, but not by much. Using a balancing pool of many actors to send and receive made it considerably slower.
For comparison we did a similar test using JeroMQ and we managed to get about 100k msg/sec using 100% CPU (same payload, same machine).
We also hacked the Sender and Receiver to use the Akka ZeroMQ extension to pipe messages directly from one JVM to another, bypassing Akka remote altogether. In this test we found that we could get fast sends and receives to start with, approx 100k msg/sec, but that performance quickly degraded. On further inspection the ZeroMQ threads and the Akka actor switching do not play nicely together. We probably could have tuned the performance somewhat by being smarter about how ZeroMQ and Akka interacted with each other, but at that point we decided it would be better to go with raw ZeroMQ.
Conclusion: Don't use akka if you care about fast serialisation of lots of data over the wire.

Why is Go's Martini less performant than Play Framework 2.2.x

I wrote two equal projects in Golang+Martini and Play Framework 2.2.x to compare their performance. Both have one action that renders a 10K HTML view. I tested them with ab -n 10000 -c 1000 and monitored the results via the ab output and htop. Both use production configs and compiled views. I'm puzzled by the results:
Play: ~17000 req/sec + constant 100% usage of all cores of my i7 = ~0.059 msec/req
Martini: ~4000 req/sec + constant 70% usage of all cores of my i7 = ~0.25 msec/req
...as I understand it, Martini is not bloated, so why is it 4.5 times slower? Any way to speed it up?
Update: Added benchmark results
Golang + Martini:
./wrk -c1000 -t10 -d10 http://localhost:9875/
Running 10s test @ http://localhost:9875/
10 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 241.70ms 164.61ms 1.16s 71.06%
Req/Sec 393.42 75.79 716.00 83.26%
38554 requests in 10.00s, 91.33MB read
Socket errors: connect 0, read 0, write 0, timeout 108
Requests/sec: 3854.79
Transfer/sec: 9.13MB
Play!Framework 2:
./wrk -c1000 -t10 -d10 http://localhost:9000/
Running 10s test @ http://localhost:9000/
10 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 32.99ms 37.75ms 965.76ms 85.95%
Req/Sec 2.91k 657.65 7.61k 76.64%
276501 requests in 10.00s, 1.39GB read
Socket errors: connect 0, read 0, write 0, timeout 230
Requests/sec: 27645.91
Transfer/sec: 142.14MB
Martini running with runtime.GOMAXPROCS(runtime.NumCPU())
I want to use Golang in production, but after this benchmark I don't know how I can make such a decision...
Any way to speed it up?
I made a simple Martini app that renders one HTML file, and it's quite fast:
✗ wrk -t10 -c1000 -d10s http://localhost:3000/
Running 10s test @ http://localhost:3000/
10 threads and 1000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 31.94ms 38.80ms 109.56ms 83.25%
Req/Sec 3.97k 3.63k 11.03k 49.28%
235155 requests in 10.01s, 31.17MB read
Socket errors: connect 774, read 0, write 0, timeout 3496
Requests/sec: 23497.82
Transfer/sec: 3.11MB
MacBook Pro i7.
This is the app: https://gist.github.com/zishe/9947025
It would help if you showed your code; perhaps you didn't disable logs or missed something.
But the 3496 timeouts seem bad.
@Kr0e, right! I figured out that the heavy use of reflection in Martini's DI makes it perform slowly.
I moved to gorilla mux, wrote some Martini-style helpers, and got the performance I wanted (see the sketch below).
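For illustration, a minimal gorilla/mux handler of the kind described, with no reflection-based injection on the hot path (hypothetical, not the asker's actual code):
package main

import (
    "net/http"

    "github.com/gorilla/mux"
)

func main() {
    r := mux.NewRouter()
    // Plain http.HandlerFunc handlers: no per-request reflection or DI.
    r.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
        w.Write([]byte("<html><body>hello</body></html>"))
    })
    http.ListenAndServe(":3000", r)
}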
@Cory LaNou: I can't accept your comment :) Now I agree with you, no framework in prod is a good idea. Thanks.
@user3353963: See my question: both use production configs and compiled views.
Add this code:
func init() {
    martini.Env = martini.Prod
}
Sorry, your code already does this.