Improving Akka Remote Throughput - akka

We're considering Akka for client-server communication and are trying to benchmark data transfer. Currently we're sending a million messages, where each message is a case class with 8 string fields.
At this point we're struggling to get acceptable performance. We see a transfer rate of about 600 KB/s and idle CPUs on both client and server, so something is going wrong. Maybe it's our Netty configuration.
This is our Akka config:
Server {
  akka {
    extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"]
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
    log-dead-letters = 10
    log-dead-letters-during-shutdown = on
    actor {
      provider = "akka.cluster.ClusterActorRefProvider"
    }
    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "instance.myserver.com"
        port = 2553
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }
    cluster {
      seed-nodes = ["akka.tcp://server@instance.myserver.com:2553"]
      roles = [master]
    }
    contrib.cluster.receptionist {
      name = receptionist
      role = "master"
      number-of-contacts = 3
      response-tunnel-receive-timeout = 300s
    }
  }
}
Client {
  akka {
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    loglevel = "DEBUG"
    logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
    actor {
      provider = "akka.remote.RemoteActorRefProvider"
    }
    remote {
      enabled-transports = ["akka.remote.netty.tcp"]
      netty.tcp {
        hostname = "127.0.0.1"
        port = 0
        maximum-frame-size = 1000 MiB
        send-buffer-size = 2000 MiB
        receive-buffer-size = 2000 MiB
      }
    }
    cluster-client {
      initial-contacts = ["akka.tcp://server@instance.myserver.com:2553/user/receptionist"]
      establishing-get-contacts-interval = "10s"
      refresh-contacts-interval = "10s"
    }
  }
}
UPDATE:
In the end, notwithstanding the discussion about serialisation (see below), we just converted our payloads to byte arrays so that serialisation would not affect the tests. We found that on a Core i7 with 24 GB RAM, using JeroMQ (i.e. ZeroMQ re-implemented in Java, so still not the fastest) we consistently saw about 200k msgs/sec, or about 20 MB/sec, while raw Akka (i.e. without the ZeroMQ plugin) gave about 10k msgs/sec, or just under 1 MB/sec. Trying Akka + ZeroMQ made the performance worse.
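For reference, taking the serializer out of the equation can be as simple as flattening the message fields into a byte array by hand. A minimal sketch in Java (the field layout here is illustrative, not the poster's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PayloadCodec {
    // Flatten the 8 string fields into one byte array so the wire test
    // measures transport overhead, not serializer overhead.
    public static byte[] encode(String[] fields) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (String f : fields) out.writeUTF(f); // length-prefixed UTF-8
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen with in-memory streams
        }
    }

    public static String[] decode(byte[] bytes, int n) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String[] fields = new String[n];
            for (int i = 0; i < n; i++) fields[i] = in.readUTF();
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```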

To make it easy to get started with remoting, Akka uses Java serialization for messages by default. This is not what you'd typically use in production, because it's not very fast and does not handle message versioning well.
What you should do is use Kryo or Protobuf for serialization; then you should be able to get much better numbers.
You can read about how here; there are also a few links to available serializers at the bottom of the page: http://doc.akka.io/docs/akka/current/scala/serialization.html
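As a sketch, wiring in an alternative serializer is just configuration. The serializer and message class names below are placeholders for whatever Kryo or Protobuf binding you pick:

```
akka.actor {
  serializers {
    # hypothetical Kryo-based serializer implementation
    kryo = "com.example.serialization.KryoSerializer"
  }
  serialization-bindings {
    # route your message types to it instead of Java serialization
    "com.example.messages.MyMessage" = kryo
  }
}
```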

OK, here's what we discovered:
Using our client-server model we got fewer than 3,000 msg/sec without doing anything, which was not really acceptable to us, but we were confused about what was happening because we weren't able to max out the CPU.
Therefore we went back to the akka source code and found a benchmarking sample there:
sample.remote.benchmark.Receiver
sample.remote.benchmark.Sender
These are two classes that use Akka remoting to ping-pong a bunch of messages between two JVMs on the same machine. Using this benchmark we got around 10-15k msg/sec on a Core i7 with 24 GB, using about 50% CPU. We found that adjusting the dispatchers and the threads allocated to Netty made a difference, but only marginally. Using asks instead of tells made it a bit slower, but not by much. Using a balancing pool of many actors to send and receive made it considerably slower.
For comparison, we did a similar test using JeroMQ and managed to get about 100k msg/sec using 100% CPU (same payload, same machine).
We also hacked the Sender and Receiver to use the Akka ZeroMQ extension to pipe messages directly from one JVM to another, bypassing Akka remoting altogether. In this test we found we could get fast sends and receives to start with, approx 100k msg/sec, but performance quickly degraded. On further inspection, the ZeroMQ threads and the Akka actor switching do not play nicely together. We could probably have tuned the performance somewhat by being smarter about how ZeroMQ and Akka interacted with each other, but at that point we decided it would be better to go with raw ZeroMQ.
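The dispatcher and Netty thread tuning mentioned above is done along these lines; the values are purely illustrative, and the exact keys vary by Akka version (check the akka-remote reference.conf for yours):

```
akka.remote.netty.tcp {
  # worker pools backing Netty's I/O threads
  server-socket-worker-pool {
    pool-size-min = 2
    pool-size-factor = 1.0
    pool-size-max = 8
  }
  client-socket-worker-pool {
    pool-size-min = 2
    pool-size-factor = 1.0
    pool-size-max = 8
  }
}
```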
Conclusion: Don't use akka if you care about fast serialisation of lots of data over the wire.

Related

Google Cloud Functions Java 11 (Beta) Runtime - Performance Issue

I have created a new Cloud Function using Java 11 (Beta) Runtime to handle HTML form submission for my static site. It's a simple 3-field form (name, email, message). No file upload is involved. The function does 2 things primarily:
Creates a pull request with BitBucket
Sends email to me using SendGrid
NOTE: It also verifies recaptcha but I've disabled it for testing.
The function, when run on my local machine (base model 2019 MacBook Pro 13"), takes about 3 secs. I'm based in SE Asia. The same function, when deployed to Google Cloud us-central1, takes about 25 secs (8 times slower). I have almost the same code running in production as part of a Servlet on the GAE Java 8 runtime, also in the US Central region, for a few years. It takes about 2-3 secs including recaptcha verification and sending the email. I'm trying to port it over to a Cloud Function, but the performance is about 10 times slower with the Cloud Function even without recaptcha verification.
For comparison, the Cloud Function is running on a 256 MB / 400 MHz instance, whereas my GAE Java 8 runtime runs on an F1 (128 MB / 600 MHz) instance. The function is using only about 75 MB of memory. The function is configured to accept unauthenticated requests.
I noticed that even basic String concatenation like: String c = a + b; takes a good 100ms on the Cloud Function. I have timed the calls and a simple string concatenation of about 15 strings into one takes about 1.5-2.0 seconds.
Moreover, writing a small message (~ 1KB) to the HTTPUrlConnection output stream and reading the response back takes about 10 seconds (yes seconds)!
/* Writing < 1KB to output stream takes about 4-5 secs */
wr = new OutputStreamWriter(con.getOutputStream());
wr.write(encodedParams);
wr.flush();
wr.close();
/* Reading the response also takes about 4-5 secs */
String responseMessage = con.getResponseMessage();
Similarly, the SendGrid code below takes another 10 secs to send the email. It takes about 1 sec on my local machine.
Email from = new Email(fromEmail, fromName);
Email to = new Email(toEmail, toName);
Email replyTo = new Email(replyToEmail, replyToName);
Content content = new Content("text/html", body);
Mail mail = new Mail(from, subject, to, content);
mail.setReplyTo(replyTo);
SendGrid sg = new SendGrid(SENDGRID_API_KEY);
Request sgRequest = new Request();
Response sgResponse = null;
try {
    sgRequest.setMethod(Method.POST);
    sgRequest.setEndpoint("mail/send");
    sgRequest.setBody(mail.build());
    sgResponse = sg.api(sgRequest);
} catch (IOException ex) {
    throw ex;
}
Something is obviously wrong with the Cloud Function. Since my original code is running on GAE Java 8 runtime, it was very easy for me to port it over to the Cloud Function with minor changes. Otherwise I would have gone with NodeJS runtime. I'm also not seeing any of the performance issues when running this function on my local machine.
Can someone help me make sense of the slow performance issue?
What you're seeing is almost certainly due to the "cold start" cost associated with the creation of a new server instance to handle the request. This is an issue with all types of Cloud Functions, as described in the documentation:
Several of the recommendations in this document center around what is known as a cold start. Functions are stateless, and the execution environment is often initialized from scratch, which is called a cold start. Cold starts can take significant amounts of time to complete. It is best practice to avoid unnecessary cold starts, and to streamline the cold start process to whatever extent possible (for example, by avoiding unnecessary dependencies).
I would expect JVM languages to have an even longer cold start time due to the amount of time that it takes to initialize a JVM, in addition to the server instance itself.
Other than the advice above, there is very little one can do to effectively mitigate cold starts. Efforts to keep a function warm are not as effective as you might imagine. There is a lot of discussion about this on the internet if you wish to search.
Keep in mind that the Java runtime is also in beta, so you can expect improvements in the future. The same thing happened with the other runtimes.
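One concrete way to streamline the cold start, per the documentation's advice, is to build expensive objects (API clients, templates) once in static state so warm invocations reuse them. A stdlib-only sketch of the pattern (the handler and "heavy client" are stand-ins, not real Cloud Functions APIs):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WarmStateDemo {
    // Counts how many times the expensive setup actually runs.
    static final AtomicInteger initCount = new AtomicInteger();

    // Heavy objects go in static fields: they are built once per server
    // instance (during cold start), then reused while the instance is warm.
    static final String heavyClient = buildHeavyClient();

    static String buildHeavyClient() {
        initCount.incrementAndGet();
        return "client-ready"; // stand-in for e.g. new SendGrid(API_KEY)
    }

    // Stand-in for the HTTP entry point: only per-request work happens here.
    public static String handleRequest(String name) {
        return heavyClient + ":hello " + name;
    }
}
```

Two "invocations" of the handler share one initialization, which is the whole point.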

rtsp-sink(RTSPStreamer) DirectShow.Net filter based RTSP server causes high latency in complete network and even slowdown the network speed by 90%

Ref: I have an RTSP media server that sends video via TCP streaming. I am using an rtsp-sink (RTSPStreamer) DirectShow.Net filter-based RTSP server developed in C++, whereas the wrapper app is developed in C#.
The problem I am facing is that the moment the RTSP server starts streaming, it affects the system-level internet connection and drops the connection speed by 90 percent.
I wanted to get your input on how that would be possible (if at all), because it impacts the system-level internet connection, not just the app.
For example: my normal internet connection speed is 25 Mbps. It suddenly drops to 2 Mbps whenever RTSP streaming is started in the app's server tab.
Sometimes it even disables the internet connection on the system (computer) where the app is running.
I'm asking you because I consider you an expert, so please bear with me on this (maybe wild) question, and thanks ahead.
...of all the things I've lost, I miss my mind the most.
Code Snippet of RTSPSender.CPP
//////////////////////////////////////////////////////
// CStreamingServer
//////////////////////////////////////////////////////
UsageEnvironment* CStreamingServer::s_pUsageEnvironment = NULL;
CHAR CStreamingServer::s_szDefaultBroadCastIP[] = "239.255.42.42";
//////////////////////////////////////////////////////
CStreamingServer::CStreamingServer(HANDLE hQuit)
    : BasicTaskScheduler(10000)
    , m_hQuit(hQuit)
    , m_Streams(NAME("Streams"))
    , m_bStarting(FALSE)
    , m_bSessionReady(FALSE)
    , m_pszURL(NULL)
    , m_rtBufferingTime(UNITS * 2)
{
    s_pUsageEnvironment = BasicUsageEnvironment::createNew(*this);
    rtspServer = NULL;
    strcpy_s(m_szAddress, "");
    strcpy_s(m_szStreamName, "stream");
    strcpy_s(m_szInfo, "media");
    strcpy_s(m_szDescription, "Session streamed by \"RTSP Streamer DirectShow Filter\"");
    m_nRTPPort = 6666;
    m_nRTSPPort = 8554;
    m_nTTL = 1;
    m_bIsSSM = FALSE;
}
Edit: Wireshark logs captured at the time RTSP streaming started.

Jetty's HTTP/2 client slow?

Doing simple GET requests over a high-speed internet connection (within an AWS EC2 instance), the throughput seems to be really low. I've tried this with multiple servers.
Here's the code that I'm using:
HTTP2Client http2Client = new HTTP2Client();
http2Client.start();
SslContextFactory ssl = new SslContextFactory(true);
HttpClient client = new HttpClient(new HttpClientTransportOverHTTP2(http2Client), ssl);
client.start();

long start = System.currentTimeMillis();
for (int i = 0; i < 10000; i++) {
    System.out.println("i" + i);
    ContentResponse response = client.GET("https://http2.golang.org/");
    System.out.println(response.getStatus());
}
System.out.println(System.currentTimeMillis() - start);
The throughput I get is about 8 requests per second.
This seems to be pretty low (as compared to curl on the command line).
Is there anything that I'm missing? Any way to turbo-charge this?
EDIT: How do I get Jetty to use multiple streams?
That is not the right way to measure throughput.
Depending on where you are geographically, you will be dominated by the latency between your client and the server. If the latency is 125 ms, you can only make 8 requests/s.
For example, from my location, the ping to http2.golang.org is 136 ms.
Even if your latency is less, there are good chances that the server is throttling you: I don't think that http2.golang.org will be happy to see you making 10k requests in a tight loop.
I'd be curious to know what the curl or nghttp latency is in the same test, but I guess it won't be much different (or probably worse, if they close the connection after each request).
This test in the Jetty test suite, which is not a proper benchmark either, shows around 500 requests/s on my laptop; without TLS, it goes to around 2500 requests/s.
I don't know exactly what you're trying to do, but your test does not tell you anything about the performance of Jetty's HttpClient.
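To illustrate why latency, not the client, bounds a serial loop, here is a stdlib-only simulation; it does no real networking, `fakeRequest` just sleeps for a 125 ms round trip. Issuing requests concurrently, which HTTP/2 stream multiplexing allows over a single connection, multiplies throughput:

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatencyBoundDemo {
    static final long LATENCY_MS = 125; // simulated round-trip time

    // Stand-in for one GET request paying a 125 ms round trip; no real I/O.
    static String fakeRequest() {
        try {
            Thread.sleep(LATENCY_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "200";
    }

    // Serial loop: throughput is capped at roughly 1000 / latency, i.e. 8 req/s.
    static double serialThroughput(int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) fakeRequest();
        return n / ((System.nanoTime() - start) / 1e9);
    }

    // Concurrent requests (what multiplexed streams enable): the latency
    // is paid once per batch, not once per request.
    static double concurrentThroughput(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        long start = System.nanoTime();
        try {
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);
            for (int i = 0; i < n; i++) cs.submit(LatencyBoundDemo::fakeRequest);
            for (int i = 0; i < n; i++) cs.take().get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return n / ((System.nanoTime() - start) / 1e9);
    }
}
```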

Simulate a real UDP application and measure traffic load on OMNeT++

So I am experimenting with OMNeT++ and I have created a DataCenter with Fat Tree topology. Now I want to observe how a UDP application works in (as-real-as it gets) conditions. I have used the INET Framework and implemented the VideoStream Client-Server application.
So my question is that:
My network seems to work perfectly, which is exactly the problem. I want to measure the datarate received by the client and compare it to the server's broadcast datarate. But even though I put a lot of traffic on the network (several UDP and TCP applications), the datarate received is exactly the same as the datarate broadcast, and I am guessing that in real conditions this traffic load is expected to vary in such highly dynamic environments. So how can I achieve the right conditions in OMNeT++ to simulate a real communication (perhaps with packet losses, queuing time, etc.) so I can measure these traffic loads?
This is the .ini file used:
[Config UDPStreamMultiple]
**.Pod[2].racks[1].servers[1].vms[2].numUdpApps = 1
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].typename = "UDPVideoStreamSvr"
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].localPort = 1000
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].sendInterval = 1s
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].packetLen = 20480B
**.Pod[2].racks[1].servers[1].vms[2].udpApp[0].videoSize = 512000B
**.Pod[3].racks[0].servers[0].vms[0].numUdpApps = 1
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].typename = "UDPVideoStreamSvr"
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].localPort = 1000
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].sendInterval = 1s
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].packetLen = 2048B
**.Pod[3].racks[0].servers[0].vms[0].udpApp[0].videoSize = 51200B
**.Pod[0].racks[0].servers[0].vms[0].numUdpApps = 1
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].serverAddress = "20.0.0.47"
**.Pod[0].racks[0].servers[0].vms[0].udpApp[0].serverPort = 1000
**.Pod[1].racks[0].servers[0].vms[1].numUdpApps = 1
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[1].racks[0].servers[0].vms[1].udpApp[0].serverPort = 1000
**.Pod[2].racks[0].servers[0].vms[1].numUdpApps = 1
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[2].racks[0].servers[0].vms[1].udpApp[0].serverPort = 1000
**.Pod[2].racks[1].servers[0].vms[1].numUdpApps = 1
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].typename = "UDPVideoStreamCli"
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].serverAddress = "20.0.0.49"
**.Pod[2].racks[1].servers[0].vms[1].udpApp[0].serverPort = 1000
Thanks in advance.
So after a lot of research, thinking and tests, I managed to achieve my goal. What I did is this:
Because the whole topology of the datacenter is built to reduce the load on the routers and balance the traffic, I created a smaller network to take the necessary measurements: just three simple nodes (StandardHost) from the INET Framework, and a simple UDP application that goes from node A to node B through midHost (e.g. nodeA ---> midHost ---> nodeB). In the .ini file there should be a couple of commands like:
**.ppp[*].queueType = "DropTailQueue"
**.ppp[*].queue.frameCapacity = 50
**.ppp[*].numOutputHooks = 1
**.ppp[*].outputHook[*].typename = "ThruputMeter"
These commands manage the links between the nodes and can be scaled to one's needs (adjust the frame capacity, perhaps, or the queue type). By making this small network you can easily customize it and take the metrics you need. I hope this helps anyone who wants to do the same get a feel for how to do it.

Akka durable mailbox results in dead letters

First, I am not an Akka newbie (I've been using it for 2+ years). I have a high-throughput (many millions of msgs/min/node) application that does heavy network I/O. An initial actor (backed by a RandomRouter) receives messages and distributes them to the appropriate child actor for processing:
private val distributionRouter = system.actorOf(Props(new DistributionActor)
  .withDispatcher("distributor-dispatcher")
  .withRouter(RandomRouter(distrib.distributionActors)), "distributionActor")
The application has been highly tuned and has excellent performance. I'd like to make it more fault tolerant by using a durable mailbox in front of DistributionActor. Here's the relevant config (only change is the addition of the file-based mailbox):
akka.actor.mailbox.file-based {
  directory-path = "./.akka_mb"
  max-items = 2147483647
  # attempting to add an item after the queue reaches this size (in bytes) will fail
  max-size = 2147483647 bytes
  # attempting to add an item larger than this size (in bytes) will fail
  max-item-size = 2147483647 bytes
  # maximum expiration time for this queue (seconds)
  max-age = 3600s
  # maximum journal size before the journal should be rotated
  max-journal-size = 16 MiB
  # maximum size of a queue before it drops into read-behind mode
  max-memory-size = 128 MiB
  # maximum overflow (multiplier) of a journal file before we re-create it
  max-journal-overflow = 10
  # absolute maximum size of a journal file until we rebuild it, no matter what
  max-journal-size-absolute = 9223372036854775807 bytes
  # whether to drop older items (instead of newer) when the queue is full
  discard-old-when-full = on
  # whether to keep a journal file at all
  keep-journal = on
  # whether to sync the journal after each transaction
  sync-journal = off
  # circuit breaker configuration
  circuit-breaker {
    # maximum number of failures before opening breaker
    max-failures = 3
    # duration of time beyond which a call is assumed to be timed out and considered a failure
    call-timeout = 3 seconds
    # duration of time to wait until attempting to reset the breaker, during which all calls fail fast
    reset-timeout = 30 seconds
  }
}
distributor-dispatcher {
  executor = "thread-pool-executor"
  type = Dispatcher
  thread-pool-executor {
    core-pool-size-min = 20
    core-pool-size-max = 20
    max-pool-size-min = 20
  }
  throughput = 100
  mailbox-type = akka.actor.mailbox.FileBasedMailboxType
}
As soon as I introduced this I noticed lots of dropped messages. When I profile it via the Typesafe Console, I see a bunch of dead letters (like 100k or so per 1M). My mailbox files are only 12MB per actor, so they're not even close to the limit. I also set up a dead letters listener to count the dead letters so I could run it outside the profiler (maybe an instrumentation issue, I thought?). Same results.
Any idea what may be causing the dead letters?
I am using Akka 2.0.4 on Scala 2.9.2.
Update:
I have noticed that the dead letters seem to have been bound for a couple of child actors owned by DistributionActor. I don't understand why changing the parent's mailbox has any effect on this, but this is definitely the behavior.