Rust vs Go concurrent webserver, why is Rust slow here? - concurrency

I was benchmarking the multi-threaded webserver example from the Rust book, and for comparison I built something similar in Go and ran a benchmark using ApacheBench. Though it's a simple example, the difference was far too large: the Go web server doing the same work was 10 times faster. Since I was expecting Rust to be faster, or at least at the same level, I tried multiple revisions using futures and smol (though my goal was to compare implementations using only the standard library), but the results were almost the same. Can anyone here suggest changes to the Rust implementation to make it faster without using a huge thread count?
Here is the code I used: https://github.com/deepu105/concurrency-benchmarks
The tokio-http version is the slowest; the other three Rust versions give almost the same result.
Here are the benchmarks:
Rust (with 8 threads; with 100 threads the numbers are closer to Go's):
❯ ab -c 100 -n 1000 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /
Document Length: 176 bytes
Concurrency Level: 100
Time taken for tests: 26.027 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 195000 bytes
HTML transferred: 176000 bytes
Requests per second: 38.42 [#/sec] (mean)
Time per request: 2602.703 [ms] (mean)
Time per request: 26.027 [ms] (mean, across all concurrent requests)
Transfer rate: 7.32 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 2.9 1 16
Processing: 4 2304 1082.5 2001 5996
Waiting: 0 2303 1082.7 2001 5996
Total: 4 2307 1082.1 2002 5997
Percentage of the requests served within a certain time (ms)
50% 2002
66% 2008
75% 2018
80% 3984
90% 3997
95% 4002
98% 4005
99% 5983
100% 5997 (longest request)
Go:
ab -c 100 -n 1000 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /
Document Length: 174 bytes
Concurrency Level: 100
Time taken for tests: 2.102 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 291000 bytes
HTML transferred: 174000 bytes
Requests per second: 475.84 [#/sec] (mean)
Time per request: 210.156 [ms] (mean)
Time per request: 2.102 [ms] (mean, across all concurrent requests)
Transfer rate: 135.22 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 1.4 2 5
Processing: 0 203 599.8 3 2008
Waiting: 0 202 600.0 2 2008
Total: 0 205 599.8 5 2013
Percentage of the requests served within a certain time (ms)
50% 5
66% 7
75% 8
80% 8
90% 2000
95% 2003
98% 2005
99% 2010
100% 2013 (longest request)

I only compared your "rustws" and the Go version. In Go you have an unlimited number of goroutines (even though you limit them all to a single CPU core), while in rustws you create a thread pool with only 8 threads.
Since your request handlers sleep 2 seconds on every 10th request, each of the 8 threads can complete at most 10 requests per 2-second sleep cycle, which caps the rustws version at 8 × 10 / 2 = 40 requests per second; that is exactly what you are seeing in the ab results. Go does not suffer from this artificial bottleneck, so it shows you the maximum it can handle on a single CPU core.
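To make that ceiling concrete, here is a minimal sketch (not the code from the linked repo) of a Rust-book-style pool with 8 workers and the same "sleep 2 seconds on every 10th request" behaviour:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let listener = TcpListener::bind("127.0.0.1:8080").unwrap();
    let (tx, rx) = mpsc::channel::<(TcpStream, u64)>();
    let rx = Arc::new(Mutex::new(rx));

    // 8 workers: a handler that hits the 2 s sleep blocks a whole OS
    // thread, so the pool tops out at 8 * 10 requests / 2 s = 40 req/s.
    for _ in 0..8 {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            let (stream, count) = rx.lock().unwrap().recv().unwrap();
            handle_connection(stream, count);
        });
    }

    let mut count = 0u64;
    for stream in listener.incoming() {
        count += 1;
        tx.send((stream.unwrap(), count)).unwrap(); // queue the connection for the pool
    }
}

fn handle_connection(mut stream: TcpStream, count: u64) {
    let mut buffer = [0u8; 1024];
    let _ = stream.read(&mut buffer);
    if count % 10 == 0 {
        thread::sleep(Duration::from_secs(2)); // the artificial delay
    }
    let response = format!("HTTP/1.1 200 OK\r\n\r\n{}", "<h1>hello</h1>");
    let _ = stream.write_all(response.as_bytes());
}

Every sleeping handler parks a whole OS thread, so the only ways out are more threads or a non-blocking sleep; the async version below takes the second route.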

I was finally able to get similar results in Rust using the async-std library:
❯ ab -c 100 -n 1000 http://localhost:8080/
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /
Document Length: 176 bytes
Concurrency Level: 100
Time taken for tests: 2.094 seconds
Complete requests: 1000
Failed requests: 0
Total transferred: 195000 bytes
HTML transferred: 176000 bytes
Requests per second: 477.47 [#/sec] (mean)
Time per request: 209.439 [ms] (mean)
Time per request: 2.094 [ms] (mean, across all concurrent requests)
Transfer rate: 90.92 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 1.7 2 7
Processing: 0 202 599.7 2 2002
Waiting: 0 201 600.1 1 2002
Total: 0 205 599.7 5 2007
Percentage of the requests served within a certain time (ms)
50% 5
66% 6
75% 9
80% 9
90% 2000
95% 2003
98% 2004
99% 2006
100% 2007 (longest request)
Here is the implementation:

use async_std::net::TcpListener;
use async_std::net::TcpStream;
use async_std::prelude::*;
use async_std::task;
use std::fs;
use std::time::Duration;

#[async_std::main]
async fn main() {
    let mut count = 0;
    let listener = TcpListener::bind("127.0.0.1:8080").await.unwrap(); // set listen port
    loop {
        count += 1;
        let count_n = Box::new(count);
        let (stream, _) = listener.accept().await.unwrap();
        task::spawn(handle_connection(stream, count_n)); // spawn a new task to handle the connection
    }
}

async fn handle_connection(mut stream: TcpStream, count: Box<i64>) {
    // Read the first 1024 bytes of data from the stream
    let mut buffer = [0; 1024];
    stream.read(&mut buffer).await.unwrap();
    // add a 2 second delay to every 10th request
    if (*count % 10) == 0 {
        println!("Adding delay. Count: {}", count);
        task::sleep(Duration::from_secs(2)).await;
    }
    let contents = fs::read_to_string("hello.html").unwrap(); // read the html file
    let response = format!("HTTP/1.1 200 OK\r\n\r\n{}", contents);
    stream.write_all(response.as_bytes()).await.unwrap(); // write_all ensures the whole response is written
    stream.flush().await.unwrap();
}
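A note on dependencies: the #[async_std::main] attribute is only available when the async-std crate is built with its attributes feature, so the Cargo.toml for this sketch would need something like the following (assuming async-std 1.x):

[dependencies]
async-std = { version = "1", features = ["attributes"] }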

Related

Why is Celery Async Task working slower than Synchronous task?

I'm working on a Django application that uses Celery to run some tasks asynchronously. I tried to perform load testing and check response times using Apache Bench. From what I can figure out from the results, the response time is faster without Celery async tasks.
I'm using: Django 2.1.0, Celery 4.2.1, Redis (broker) 2.10.5, django-redis 4.9.0
Celery configuration in Django settings.py:
BROKER_URL = 'redis://127.0.0.1:6379/1'
CELERY_RESULT_BACKEND = 'django-db' # Using django_celery_results
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'Asia/Kolkata'
Following is my code (API exposed by my system):
class CustomerSearch(APIView):
    def post(self, request):
        request_dict = {}  # request parameters

        # Async block
        response = celery_search_customer_task.delay(request_dict)
        response = response.get()

        # Synchronous block (uncomment the following to make a synchronous call)
        # api_obj = ApiCall(request=request_dict)
        # response = api_obj.search_customer()  # this makes an API call to another system

        return Response(response)
And the celery task in tasks.py:
@app.task(bind=True)
def celery_search_customer_task(self, req_data={}):
    api_obj = ApiCall(request=req_data)
    response = api_obj.search_customer()  # this makes an API call to another system
    return response
Apache Bench command:
ab -p req_data.data -T application/x-www-form-urlencoded -l -r -n 10 -c 10 -k -H "Authorization: Token <my_token>" http://<my_host_name>/<api_end_point>/
Following is the result of ab:
Without celery Async Task
Concurrency Level: 10
Time taken for tests: 1.264 seconds
Complete requests: 10
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 3960 bytes
Total body sent: 3200
HTML transferred: 1760 bytes
Requests per second: 7.91 [#/sec] (mean)
Time per request: 1264.011 [ms] (mean)
Time per request: 126.401 [ms] (mean, across all concurrent requests)
Transfer rate: 3.06 [Kbytes/sec] received
2.47 kb/s sent
5.53 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 259 270 10.7 266 298
Processing: 875 928 36.9 955 967
Waiting: 875 926 35.3 950 962
Total: 1141 1198 43.4 1224 1263
Percentage of the requests served within a certain time (ms)
50% 1224
66% 1225
75% 1231
80% 1233
90% 1263
95% 1263
98% 1263
99% 1263
100% 1263 (longest request)
With celery Async Task
Concurrency Level: 10
Time taken for tests: 10.776 seconds
Complete requests: 10
Failed requests: 0
Keep-Alive requests: 0
Total transferred: 3960 bytes
Total body sent: 3200
HTML transferred: 1760 bytes
Requests per second: 0.93 [#/sec] (mean)
Time per request: 10775.688 [ms] (mean)
Time per request: 1077.569 [ms] (mean, across all concurrent requests)
Transfer rate: 0.36 [Kbytes/sec] received
0.29 kb/s sent
0.65 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 259 271 9.2 268 284
Processing: 1132 6128 4091.9 8976 10492
Waiting: 1132 6127 4091.3 8975 10491
Total: 1397 6399 4099.3 9244 10775
Percentage of the requests served within a certain time (ms)
50% 9244
66% 9252
75% 10188
80% 10196
90% 10775
95% 10775
98% 10775
99% 10775
100% 10775 (longest request)
Isn't celery async task supposed to make tasks work faster than synchronous tasks? What is it that I might be missing here?
Any help would be appreciated. Thanks.
I think there are multiple misconceptions in your question that should be answered.
Isn't celery async task supposed to make tasks work faster than synchronous tasks?
As @Yugandhar indicates in his answer, by using something like Celery you are adding additional overhead to your processing. Instead of the same process executing the code, you are actually doing the following:
The client sends a message to the broker.
A worker picks up the message and executes it.
The worker returns the response to the broker.
The client picks up the response and processes it.
As you can see, there is clearly additional overhead involved in using Celery compared to executing the code synchronously. Because of this, it is not necessarily true that an "async task is faster than a synchronous task".
The question, then, is why use asynchronous tasks at all? If they add overhead and might slow down the execution, what is the benefit? The benefit is that you don't need to await the response!
Let's take your ApiCall() as an example and say that the call itself takes 10 seconds to execute. Executing it synchronously means that you block everything else until the call is completed. If, for example, a form submission triggers this, the user has to wait 10 seconds for their browser to finish loading before they get their response. This is a pretty poor user experience.
By executing it asynchronously in the background, the call itself might take 10.01 seconds (slower due to the overhead), but instead of having to wait for those seconds to pass, you can (if you choose to) immediately return a response to the user and make the user experience much better.
Awaiting Results vs Callbacks
The problem with your code example is that the synchronous and the "asynchronous" code basically do the same thing. Both of them await the results in a blocking fashion, so you don't really get the benefits of executing the task asynchronously.
By using the .get() method, you tell the AsyncResult object to await the results. This means that it will block everything (just as if you executed it synchronously) until the Celery worker returns a response.
task.delay() # Async, don't await any response.
task.delay().get() # Blocks execution until response is returned.
Sometimes this is what you want, but in other cases you don't need to wait for the response; you can finish the HTTP request and instead use a callback to handle the result of the task you triggered.
Running code synchronously means straightforward blocking code on the main thread; Celery, on the other hand, works as a producer/consumer mechanism.
Celery forwards the task to a broker message queue such as RabbitMQ or Redis, which adds extra processing time, and depending on where your Celery workers run there may be added network latency if they are not running locally. Calling delay() returns a promise that can be used to monitor the status and get the result when it's ready.
So the architecture basically becomes:
web
broker
worker
result backend
Considering all this extra processing, a Celery task is slower than running the same code on the main thread.

AWS API request rate limits

Update: keep-alive wasn't set on the AWS client. My fix was:
var https = require('https');
var aws = require('aws-sdk');
aws.config.httpOptions.agent = new https.Agent({
    keepAlive: true
});
I finally managed to debug it by using the Node --prof flag and then node-tick-processor to analyze the output (it's a packaged version of a tool distributed in the Node/V8 source code). Most of the processing time was spent in SSL processing, and that's when I thought to check whether or not it was using keep-alive.
TL;DR: I'm getting throttled by AWS even though the number of requests is less than the configured DynamoDB throughput. Is there a request rate limit for all APIs?
I'm having a hard time finding documentation about the rate limiting of AWS APIs.
An application that I'm testing now is making about 80 requests per second to DynamoDB. This is a mix of PUTs and GETs. My DynamoDB table is configured with a throughput of: 250 reads / 250 writes. In the table CloudWatch metrics, the reads peak at 24 and the writes at 59 during the test period.
This is a sample of my response times. First, subsecond response times.
2015-10-07T15:28:55.422Z 200 in 20 milliseconds in request to dynamodb.us-east-1.amazonaws.com
2015-10-07T15:28:55.423Z 200 in 22 milliseconds in request to dynamodb.us-east-1.amazonaws.com
A lot longer, but fine...
2015-10-07T15:29:33.907Z 200 in 244 milliseconds in request to dynamodb.us-east-1.amazonaws.com
2015-10-07T15:29:33.910Z 200 in 186 milliseconds in request to dynamodb.us-east-1.amazonaws.com
The requests are piling up...
2015-10-07T15:32:41.103Z 200 in 1349 milliseconds in request to dynamodb.us-east-1.amazonaws.com
2015-10-07T15:32:41.104Z 200 in 1181 milliseconds in request to dynamodb.us-east-1.amazonaws.com
...no...
2015-10-07T15:41:09.425Z 200 in 6596 milliseconds in request to dynamodb.us-east-1.amazonaws.com
2015-10-07T15:41:09.428Z 200 in 5902 milliseconds in request to dynamodb.us-east-1.amazonaws.com
I went and got some tea...
2015-10-07T15:44:26.463Z 200 in 13900 milliseconds in request to dynamodb.us-east-1.amazonaws.com
2015-10-07T15:44:26.464Z 200 in 12912 milliseconds in request to dynamodb.us-east-1.amazonaws.com
Anyway, I stopped the test, but this is a Node.js application so a bunch of sockets were left open waiting for my requests to AWS to complete. I got response times > 60 seconds.
My DynamoDB throughput wasn't used much, so I assume that the limit is on API requests, but I can't find any information on it. What's interesting is that the 200 in the log entries is the response code from AWS, which I got by hacking a bit of the SDK. I thought AWS was supposed to return 429s -- all their SDKs implement exponential backoff.
Anyway -- I assumed that I could make as many requests to DynamoDB as the configured throughput allows. Is that right? ...or what?

High load on jetty

I'm running load tests on my MBP. The load is injected using gatling.
My web server is jetty 9.2.6
Under heavy load, the number of threads remains constant at 300, but the number of open sockets grows from 0 to 4000+, which produces "too many open files" errors at the OS level.
What does this mean?
Any ideas for improving the situation?
Here is the output of the Jetty statistics:
Statistics:
Statistics gathering started 643791ms ago
Requests:
Total requests: 56084
Active requests: 1
Max active requests: 195
Total requests time: 36775697
Mean request time: 655.7369791202325
Max request time: 12638
Request time standard deviation: 1028.5144674112403
Dispatches:
Total dispatched: 56084
Active dispatched: 1
Max active dispatched: 195
Total dispatched time: 36775697
Mean dispatched time: 655.7369791202325
Max dispatched time: 12638
Dispatched time standard deviation: 1028.5144648655212
Total requests suspended: 0
Total requests expired: 0
Total requests resumed: 0
Responses:
1xx responses: 0
2xx responses: 55644
3xx responses: 0
4xx responses: 0
5xx responses: 439
Bytes sent total: 281222714
Connections:
org.eclipse.jetty.server.ServerConnector#243883582
Protocols:http/1.1
Statistics gathering started 643784ms ago
Total connections: 8788
Current connections open: 1
Max concurrent connections open: 4847
Mean connection duration: 77316.87629452601
Max connection duration: 152694
Connection duration standard deviation: 36153.705226514794
Total messages in: 56083
Total messages out: 56083
Memory:
Heap memory usage: 1317618808 bytes
Non-heap memory usage: 127525912 bytes
Some advice:
Don't have the Client Load and the Server Load on the same machine (don't cheat and attempt to put the load on 2 different VMs on a single physical machine)
Use multiple client machines, not just 1 (when the Jetty developers test load characteristics, we use at least 10:1 ratio of client machines to server machines)
Don't test with loopback, virtual network interfaces, localhost, etc.. Use a real network interface.
Understand how your load client manages its HTTP version + connections (such as keep-alive or http/1.1 close), and make sure you read the response body content, close the response content / streams, and finally disconnect the connection.
Don't test with unrealistic load scenarios. Real-world usage of your server will be mostly HTTP/1.1 pipelined connections with multiple requests per physical connection: some on fast networks, some on slow networks, some even on unreliable networks (think mobile).
Raw speed, serving the same content on all-unique connections, is ultimately a fascinating number and can produce impressive results, but it is also completely pointless and proves nothing about how your application on Jetty will behave in real-world scenarios.
Finally, be sure you are testing load in realistic ways.

camel jetty benchmark testing for requests per second

I am building a high-load HTTP service that will consume thousands of messages per second and pass them to a messaging system like ActiveMQ.
I currently have a REST service (non-Camel, non-Jetty) that accepts POSTs from HTTP clients and returns a plain success response, and I could load test this using Apache ab.
We are also looking at camel-jetty as the input endpoint, since it has integration components for ActiveMQ and can be part of an ESB if required. Before I start building a camel-jetty to ActiveMQ route, I want to test the load that camel-jetty can support. What should my Jetty-only route look like?
I am thinking of the route:
from("jetty:http://0.0.0.0:8085/test").transform(constant("a"));
and use apache ab to test.
I am concerned whether this route reflects real camel-jetty capacity, since transform could add overhead. Or would it not?
Based on these tests I am planning to build the HTTP-to-MQ route with or without Camel.
The transform API will not add significant overhead. I just ran a test against your basic route:
ab -n 2000 -c 50 http://localhost:8085/test
and got the following...
Concurrency Level: 50
Time taken for tests: 0.459 seconds
Complete requests: 2000
Failed requests: 0
Write errors: 0
Non-2xx responses: 2010
Total transferred: 2916510 bytes
HTML transferred: 2566770 bytes
Requests per second: 4353.85 [#/sec] (mean)
Time per request: 11.484 [ms] (mean)
Time per request: 0.230 [ms] (mean, across all concurrent requests)
Transfer rate: 6200.21 [Kbytes/sec] received

Benchmarking EC2

I am running some quick tests to try to estimate hardware costs for a launch and for the future.
Specs
Ubuntu Natty 11.04 64-bit
Nginx 0.8.54
m1.large
I feel like I must be doing something wrong here. What I am trying to do is estimate how many simultaneous requests I can support before having to add an extra machine. I am using Django app servers, but right now I am just testing nginx serving the static index.html page.
Results:
$ ab -n 10000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking ec2-107-20-9-180.compute-1.amazonaws.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: nginx/0.8.54
Server Hostname: ec2-107-20-9-180.compute-1.amazonaws.com
Server Port: 80
Document Path: /
Document Length: 151 bytes
Concurrency Level: 1
Time taken for tests: 217.748 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 3620000 bytes
HTML transferred: 1510000 bytes
Requests per second: 45.92 [#/sec] (mean)
Time per request: 21.775 [ms] (mean)
Time per request: 21.775 [ms] (mean, across all concurrent requests)
Transfer rate: 16.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 9 11 10.3 10 971
Processing: 10 11 9.7 11 918
Waiting: 10 11 9.7 11 918
Total: 19 22 14.2 21 982
Percentage of the requests served within a certain time (ms)
50% 21
66% 21
75% 22
80% 22
90% 22
95% 23
98% 25
99% 35
100% 982 (longest request)
So before I even add a Django backend, the basic nginx setup can only support 45 requests/second?
This is horrible for an m1.large ... no?
What am I doing wrong?
You've only set the concurrency level to 1. I would recommend upping the concurrency (the -c flag for Apache Bench) if you want more realistic results, such as:
ab -c 10 -n 1000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
What Mark said about concurrency. Plus, I'd shell out a few bucks for a professional load testing service like loadstorm.com and hit the thing really hard that way: ramp up load until it breaks. Creating simulated traffic that is at all realistic (which is important for estimating server capacity) is not trivial, and these services help by loading resources, following links, and so on. You won't get very realistic numbers just loading one static page. Get something like the real app running and hit it with a whole lot of virtual browsers. You can't count on finding the limits of a well-configured server with just one machine generating traffic.