Why is Tokyo Tyrant so slow? - tokyo-tyrant

I have the following situation: a Tyrant server launched on a FreeBSD host, like this:
ttserver -uas -log /data/tyrant/1.log -sid 1 -thnum 8 -tout 5 /data/tyrant/data/1.tct
And I try to communicate with this server from Windows using Python and pyrant-0.3.5, like this:
import pyrant
import time

t = pyrant.Tyrant(host="192.168.0.220", port=1978)
tbegin = time.time()
for i in xrange(4000000):
    if i and ((i % 10000) == 0):
        print time.time() - tbegin
        tbegin = time.time()
    t[i] = {"text": "ruslan text", "value": i}
And I think the performance is very slow: about 5-6 seconds per 10,000 records. But if I run this code on the same machine as the server (ttserver), performance is good: about 0.5 sec per 10,000 records.
What can I do to work around this problem?

I know this may be too obvious, but did you measure the latency to the server? This could be the bottleneck.
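A quick way to check: time a handful of single writes from the Windows machine, since each assignment is at least one network round trip. Below is a minimal sketch using the same pyrant API as in the question (the key prefix is just an arbitrary name for the probe):

import pyrant
import time

t = pyrant.Tyrant(host="192.168.0.220", port=1978)

samples = 100
tbegin = time.time()
for i in xrange(samples):
    # each single put costs one request/response round trip to ttserver
    t["latency_probe_%d" % i] = {"text": "probe", "value": i}
print "avg per write: %.2f ms" % ((time.time() - tbegin) / samples * 1000)

If the average is around 0.5 ms, that alone accounts for roughly 5 seconds per 10,000 sequential writes, which matches the numbers you report.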

You can run tracert against the server host (192.168.0.220 in your case) and look at the latency from your local machine to the remote server; maybe that's the cause.

Related

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I need to make many computational iterations, say 100 or so, with a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times, where I plan to set N at least equal to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
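As an aside, the usual cause of that pickling error is a function that is not importable at module level (for example, one defined inside another function or built at runtime). A minimal sketch of the Pool approach with a module-level stand-in worker (the worker here is hypothetical, not your actual function):

from multiprocessing import Pool

def worker(x):
    # module-level functions are picklable, so Pool can send them to child processes
    return x * x

if __name__ == "__main__":
    pool = Pool(processes=4)
    print pool.map(worker, range(8))
    pool.close()
    pool.join()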
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np

np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def run_in_separate_process(func, *args, **kwds):
    numruns = 4
    result = [ None ] * numruns
    pread = [ None ] * numruns
    pwrite = [ None ] * numruns
    pid = [ None ] * numruns
    for i in range(numruns):
        pread[i], pwrite[i] = os.pipe()
        pid[i] = os.fork()
        if pid[i] > 0:
            pass                       # parent: just keep forking
        else:
            # child: compute, pickle the result down the pipe, and exit
            os.close(pread[i])
            result[i] = func(*args, **kwds)
            with os.fdopen(pwrite[i], 'wb') as f:
                cPickle.dump((0, result[i]), f, cPickle.HIGHEST_PROTOCOL)
            os._exit(0)
    #time.sleep(17)
    # parent: collect the pickled results from all children
    while True:
        for i in range(numruns):
            os.close(pwrite[i])
            with os.fdopen(pread[i], 'rb') as f:
                stat, res = cPickle.load(f)
            result[i] = res
            #os.waitpid(pid[i], 0)
        if not None in result: break
    return result

def main():
    print 'Running multifork.py.'
    print run_in_separate_process( function, 3 )

if __name__ == "__main__":
    main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect. And neither does uncommenting time.sleep() if 4 iterations are being run at once and if the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the While True loop is not itself a cause of any appreciable inefficiency (due to the processes all taking the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, with which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function to an auxiliary script, the script that defines the command that is the first argument of Popen(), via a file. I did however use the same While True loop. And the results? They were the same as with the simpler scheme that I am presenting here--- almost exactly.
Why don't you use joblib's Parallel feature?
#!/usr/bin/env python
from joblib import Parallel, delayed
import numpy as np

np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def main():
    print 'Running multifork.py.'
    print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))

if __name__ == "__main__":
    main()
It seems that you have some bottleneck in your computations.
In your example you pass your data over a pipe, which is not a very fast method. To avoid this performance problem you should use shared memory; this is how multiprocessing and joblib.Parallel work.
Also, remember that in the single-threaded case you don't have to serialize and deserialize the data, but in the multiprocess case you do.
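To get a feel for the serialization cost alone, you can time pickling a result of the same shape that function(3) returns (a 5000 x 601 float64 array, roughly 24 MB); this is only a rough sketch, and the array contents don't matter for the timing:

import cPickle
import time
import numpy as np

arr = np.zeros((5000, 601))   # same shape as the 'junk' array returned by function(3)
t0 = time.time()
data = cPickle.dumps({'junk': arr, 'shape': arr.shape}, cPickle.HIGHEST_PROTOCOL)
print 'pickled %.1f MB in %.3f s' % (len(data) / 1e6, time.time() - t0)

Each child pays this cost once when writing to the pipe, and the parent pays the matching cPickle.load cost when reading.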
Next, even if you see 8 cores, hyper-threading may be enabled, which splits each physical core into 2 hardware threads; so if you have 4 HW cores, the OS thinks there are 8 of them. There are a lot of advantages and disadvantages to using HT, but the main thing is that if you are going to load all your cores with computations for a long time, you should disable it.
For example, I have an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 2 HW cores and HT enabled, so in top I see 4 cores. The computation times for me are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is what my lscpu output looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note the Thread(s) per core line.
So in your example there is not so much computation as data transferring. Your application doesn't have time to gain the advantages of parallelism. If you have a long computation job (about 10 minutes), I think you will see them.
ADDITION:
I've taken a more detailed look at your function. I replaced the multiple executions with just a single execution of function(3) and ran it under the profiler:
$ /usr/bin/time -v python -m cProfile so.py
The output is quite long; you can view the full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program spends most of its time in the numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you run multiple instances of this program, you will see that the execution time of each individual instance increases considerably. I started 2 copies simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
    arr = (0, 1)
    for _ in xrange(x):
        arr = (arr[-1], sum(arr))
    return arr[0]
And replaced the concatenate line with fibo(10000). In this case the execution time of a single instance is 0:22.82, while two instances take almost the same time per instance (0:24.62).
Based on this I think numpy perhaps uses some shared resource which leads to a parallelization problem, or it could be a numpy- or scipy-specific issue.
And one last thing about the code: you can replace the block below:
for r in range(3):
    for i in range(200):
        arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
with this single line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
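If you want to convince yourself that the two versions are equivalent, here is a quick check under the question's shapes (it builds both the loop version and the vectorised version and compares them):

import numpy as np

loop_arr = np.zeros(5000).reshape(-1, 1)
for r in range(3):
    for i in range(200):
        loop_arr = np.concatenate((loop_arr, np.log(np.arange(2, 5002)).reshape(-1, 1)), axis=1)

vec_arr = np.concatenate(
    (np.zeros(5000).reshape(-1, 1),
     np.log(np.arange(2, 5002).repeat(3 * 200).reshape(-1, 3 * 200))),
    axis=1)

print loop_arr.shape, vec_arr.shape   # both (5000, 601)
print np.allclose(loop_arr, vec_arr)  # True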
I am providing an answer here so as to not leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it also doesn't suffer from the limitations of the While True loop of my example, which only works efficiently with jobs that each take about the same amount of time. I see that the joblib project has current support and if you don't mind dependency upon a third-party library it may be an excellent solution. I may adopt it.
But, with my test function the time(1) real, the running time using the time wrapper, is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped, I had diligently prepared my test code that I presented here so as to produce similar output and to take about as long to run as the code of my actual project, per single-run without parallelism (and my actual project is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project. I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing--- here's my tentative answer to my own question--- that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running even if, as in my case, the other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py test code file and then run the renamed file without deleting the joblib.py file (you'll get an import error).

Jetty's HTTP/2 client slow?

Doing simple GET requests over a high speed internet connection (within an AWS EC2 instance), the throughput seems to be really low. I've tried this with multiple servers.
Here's the code that I'm using:
HTTP2Client http2Client = new HTTP2Client();
http2Client.start();
SslContextFactory ssl = new SslContextFactory(true);
HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(http2Client), ssl);
client.start();

long start = System.currentTimeMillis();
for (int i = 0; i < 10000; i++) {
    System.out.println("i" + i);
    ContentResponse response = client.GET("https://http2.golang.org/");
    System.out.println(response.getStatus());
}
System.out.println(System.currentTimeMillis() - start);
The throughput I get is about 8 requests per second.
This seems to be pretty low (as compared to curl on the command line).
Is there anything that I'm missing? Any way to turbo-charge this?
EDIT: How do I get Jetty to use multiple streams?
That is not the right way to measure throughput.
Depending on where you are geographically, you will be dominated by the latency between your client and the server. If the latency is 125 ms, you can only make 8 requests/s.
For example, from my location, the ping to http2.golang.org is 136 ms.
Even if your latency is less, there are good chances that the server is throttling you: I don't think that http2.golang.org will be happy to see you making 10k requests in a tight loop.
I'd be curious to know what the curl or nghttp latency is in the same test, but I guess it won't be much different (or probably worse if they close the connection after each request).
This test in the Jetty test suite, which is not a proper benchmark either, shows around 500 requests/s on my laptop; without TLS, it goes to around 2,500 requests/s.
I don't know exactly what you're trying to do, but your test does not tell you anything about the performance of Jetty's HttpClient.

Simulate a real UDP application and measure traffic load on OMNeT++

So I am experimenting with OMNeT++ and I have created a DataCenter with a Fat Tree topology. Now I want to observe how a UDP application works in conditions that are as close to real as possible. I have used the INET Framework and implemented the VideoStream client-server application.
So my question is that:
My network seems to work perfectly, and I don't want it to. I want to measure the data rate received by the client and compare it to the server's broadcast data rate. But even though I put a lot of traffic into the network (several UDP and TCP applications), the data rate received is exactly the same as the data rate broadcast, and I am guessing that in real conditions this traffic load would vary in such highly dynamic environments. So how can I achieve the right conditions in OMNeT++ to simulate real communication (maybe with packet losses, queuing time, etc.) so I can measure these traffic loads?
Here is the .ini file used:
[Config UDPStreamMultiple]
**.Pod[2].racks[1].servers[1].vms[2].numUdpApps = 1
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].typename = "UDPVideoStreamSvr"
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].localPort = 1000
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].sendInterval = 1s
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].packetLen = 20480B
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].videoSize = 512000B
**.Pod[3].racks[0].servers[0].vms[0].numUdpApps = 1
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].typename = "UDPVideoStreamSvr"
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].localPort = 1000
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].sendInterval = 1s
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].packetLen = 2048B
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].videoSize = 51200B
**.Pod[0].racks[0].servers[0].vms[0].numUdpApps = 1
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].serverAddress = "20.0.0.47"
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].serverPort = 1000
**.Pod[1].racks[0].servers[0].vms[1].numUdpApps = 1
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].serverPort = 1000
**.Pod[2].racks[0].servers[0].vms[1].numUdpApps = 1
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].serverPort = 1000
**.Pod[2].racks[1].servers[0].vms[1].numUdpApps = 1
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].serverPort = 1000
Thanks in advance.
So after a lot of research, thinking and tests I managed to achieve my desired goal. What I did is this:
Because the whole topology of the data center is built to reduce the load on the routers and balance the traffic, I created a smaller network to take the necessary measurements: just 3 simple nodes (StandardHost) from the INET Framework and a simple UDP application that goes from node A to node B through midHost (e.g. nodeA ---> midHost ---> nodeB). In the .ini file there should be a couple of lines like:
**.ppp[*].queueType = "DropTailQueue"
**.ppp[*].queue.frameCapacity = 50
**.ppp[*].numOutputHooks = 1
**.ppp[*].outputHook[*].typename = "ThruputMeter"
These settings manage the links between the nodes and can be scaled to your needs (adjust the frame capacity, maybe, or the queue type). By making this small network you can easily customize it and take the metrics you need. I hope this helps anyone who wants to do the same thing get a feel for how to do it.

Improving Akka Remote Throughput

We're thinking about using Akka for client server communications, and trying to benchmark data transfer. Currently we're trying to send a million messages where each message is a case class with 8 string fields.
At this point we're struggling to get acceptable performance. We see about 600KB/s transfer rate and idle CPUs on client and server, so something is going wrong. Maybe it's our netty configuration.
This is our akka config
Server {
  akka {
    extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
    log-dead-letters = 10
    log-dead-letters-during-shutdown = on

    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }

    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "instance.myserver.com"
        port = 2553
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }

    cluster {
      seed-nodes = ["akka.tcp://server@instance.myserver.com:2553"]
      roles = [master]
    }

    contrib.cluster.receptionist {
      name = receptionist
      role = "master"
      number-of-contacts = 3
      response-tunnel-receive-timeout = 300s
    }
  }
}

Client {
  akka {
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"

    actor {
      provider = "akka.remote.RemoteActorRefProvider"
    }

    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "127.0.0.1"
        port = 0
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }

    cluster-client {
      initial-contacts = ["akka.tcp://server@instance.myserver.com:2553/user/receptionist"]
      establishing-get-contacts-interval = "10s"
      refresh-contacts-interval = "10s"
    }
  }
}
UPDATE:
In the end, notwithstanding the discussion about serialisation (see below), we just converted our payloads to byte arrays so that serialisation would not affect the tests. We found that on a Core i7 with 24 GB RAM, using JeroMQ (i.e. ZeroMQ re-implemented in Java, so still not the fastest) we consistently saw about 200k msgs/sec or about 20 MB/sec, whilst on raw Akka (i.e. without the ZeroMQ plugin) we saw about 10k msgs/sec or just under 1 MB/sec. Trying Akka + ZeroMQ made the performance worse.
To make it easy to get started with remoting, Akka uses Java serialization for message serialization; this is not what you'd typically use in production because it's not very fast and does not handle message versioning well.
What you should do is use Kryo or protobuf for serialization, and you should be able to get much better numbers.
You can read about how to do that here; there are also a few links to available serializers at the bottom of the page: http://doc.akka.io/docs/akka/current/scala/serialization.html
Ok here's what we discovered:
Using our client-server model we were able to get fewer than about 3,000 msg/sec without doing anything, which was not really acceptable to us, but we were really confused about what was happening because we weren't able to max out the CPU.
Therefore we went back to the akka source code and found a benchmarking sample there:
sample.remote.benchmark.Receiver
sample.remote.benchmark.Sender
These are two classes that use Akka remote to ping-pong a bunch of messages between two JVMs on the same machine. Using this benchmark we were able to get around 10-15k msg/sec on a Core i7 with 24 GB, using about 50% CPU. We found that adjusting the dispatchers and the threads allocated to Netty made a difference, but only marginally. Using asks instead of tells made it a bit slower, but not by much. Using a balancing pool of many actors to send and receive made it considerably slower.
For comparison we did a similar test using JeroMQ, and we managed to get about 100k msg/sec using 100% CPU (same payload, same machine).
We also hacked the Sender and Receiver to use the Akka ZeroMQ extension to pipe messages directly from one JVM to another, bypassing Akka remote altogether. In this test we found that we could get fast sends and receives to start with, approx. 100k msg/sec, but performance quickly degraded. On further inspection, the ZeroMQ threads and the Akka actor switching do not play nicely together. We probably could have tuned the performance somewhat by being smarter about how ZeroMQ and Akka interacted with each other, but at that point we decided it would be better to go with raw ZeroMQ.
Conclusion: don't use Akka if you care about fast serialisation of lots of data over the wire.

Why is Go's Martini less performant than Play Framework 2.2.x

I wrote two equivalent projects in Golang+Martini and Play Framework 2.2.x to compare their performance. Both have one action that renders a 10K HTML view. I tested them with ab -n 10000 -c 1000 and monitored the results via the ab output and htop. Both use production configs and compiled views. I wonder about the results:
Play: ~17000 req/sec + constant 100% usage of all cores of my i7 = ~0.059 msec/req
Martini: ~4000 req/sec + constant 70% usage of all cores of my i7 = ~0.25 msec/req
...as I understand it, Martini is not bloated, so why is it 4.5 times slower? Any way to speed it up?
Update: Added benchmark results
Golang + Martini:
./wrk -c1000 -t10 -d10 http://localhost:9875/
Running 10s test @ http://localhost:9875/
  10 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   241.70ms  164.61ms   1.16s    71.06%
    Req/Sec   393.42     75.79    716.00    83.26%
  38554 requests in 10.00s, 91.33MB read
  Socket errors: connect 0, read 0, write 0, timeout 108
Requests/sec:   3854.79
Transfer/sec:      9.13MB
Play!Framework 2:
./wrk -c1000 -t10 -d10 http://localhost:9000/
Running 10s test @ http://localhost:9000/
  10 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.99ms   37.75ms  965.76ms   85.95%
    Req/Sec     2.91k   657.65     7.61k     76.64%
  276501 requests in 10.00s, 1.39GB read
  Socket errors: connect 0, read 0, write 0, timeout 230
Requests/sec:  27645.91
Transfer/sec:    142.14MB
Martini is running with runtime.GOMAXPROCS(runtime.NumCPU()).
I want to use Go in production, but after this benchmark I don't know how I can make such a decision...
Any way to speedup?
I made a simple Martini app rendering one HTML file and it's quite fast:
✗ wrk -t10 -c1000 -d10s http://localhost:3000/
Running 10s test @ http://localhost:3000/
  10 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    31.94ms   38.80ms  109.56ms   83.25%
    Req/Sec     3.97k     3.63k    11.03k    49.28%
  235155 requests in 10.01s, 31.17MB read
  Socket errors: connect 774, read 0, write 0, timeout 3496
Requests/sec:  23497.82
Transfer/sec:      3.11MB
MacBook Pro i7.
This is the app: https://gist.github.com/zishe/9947025
It would help if you showed your code; perhaps you didn't disable logs or missed something.
But timeout 3496 seems bad.
@Kr0e, right! I figured out that the heavy use of reflection in Martini's DI makes it slow.
I moved to gorilla/mux and wrote some martini-style helpers and got the performance I wanted.
@Cory LaNou: I can't accept your comment as an answer :) Now I agree with you, no framework in prod is a good idea. Thanks.
@user3353963: See my question: both use production configs and compiled views.
add this code:
func init() {
    martini.Env = martini.Prod
}
Sorry, your code already does that.