Efficiency in sending UDP packets to the same address - c++

I am reworking some of the infrastructure in an existing application that sends UDP data packets to 1...N addresses (often multicast). Currently there are, let's say, T transmitter objects, and in some cases all of the transmitters are sending to the same address.
So to simplify and provide an example case, let's say there are 3 transmitter objects and they all need to send to a single specific address. My question is: which is more efficient?
Option 1) Put a mutex around a single socket and have all the transmitters (T) share the same socket.
T----\
T----->Socket
T----/
Option 2) Use three separate sockets, all sending to the same location.
T----->Socket 1
T----->Socket 2
T----->Socket 3
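To make the two options concrete, here is a minimal sketch of the two send paths (my own naming; BSD-style sendto(), which on Winsock differs only in the SOCKET type and the buffer-pointer type):

    // Option 1: T transmitters share one socket, serialized by a mutex.
    // Option 2: each transmitter owns its own socket; no user-level lock.
    #include <cstddef>
    #include <mutex>
    #include <netinet/in.h>
    #include <sys/socket.h>

    struct SharedSocketTx {
        int sock;                 // one UDP socket shared by all transmitters
        std::mutex lock;          // serializes the sendto() calls

        void send(const void* data, size_t len, const sockaddr_in& dest) {
            std::lock_guard<std::mutex> guard(lock);
            sendto(sock, data, len, 0,
                   reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
        }
    };

    struct OwnSocketTx {
        int sock;                 // this transmitter's private UDP socket

        void send(const void* data, size_t len, const sockaddr_in& dest) {
            // No application-level lock; any serialization happens inside the stack.
            sendto(sock, data, len, 0,
                   reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
        }
    };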
I suspect that with the second option, under the hood, the OS or the NIC puts a mutex around the final transmit so in the big picture, Option 2 is probably not a whole lot different than Option 1.
I will probably set up an experiment on my development PC next week, but there's no way I can test all the potential computer configurations that users might install on. I also realize there are different implementations - Windows vs Linux, different NIC chipset manufacturers, etc, but I'm wondering if anyone might have some past experience or architectural knowledge that could shed light on an advantage of one option over the other.
Thanks!

After running some benchmarks on a Windows 10 computer, I have an "answer" that at least gives me a rough idea of what to expect. I can't be 100% sure that every system will behave the same way, but most of the servers I run use Intel NICs and Windows 10, and my typical packet sizes are around 1200 bytes, so the answer at least makes me comfortable that it's correct for my particular scenario. I decided to post the results here in case anyone else can make use of the experiment.
I built a simple command-line app that would first spawn T transmitter threads all using a single socket with a mutex around it. Immediately after, it would run another test with the same number of transmitters, but this time each transmitter would have its own socket so no mutex was needed (although I'm sure there was a locking mechanism at some lower level). Each transmitter blasts out packets as fast as possible.
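A driver along these lines would reproduce that setup (this is my own reconstruction, not the original code; the names and the templated send callback are illustrative):

    #include <chrono>
    #include <thread>
    #include <vector>

    // Spawn T transmitter threads, each blasting its share of the packets,
    // and measure wall-clock time for the whole run.
    template <typename SendFn>
    double runTest(int numTransmitters, long long totalPackets, SendFn sendPacket) {
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < numTransmitters; ++t) {
            threads.emplace_back([&, t] {
                const long long share = totalPackets / numTransmitters;
                for (long long i = 0; i < share; ++i)
                    sendPacket(t);   // either the mutex-guarded or per-socket path
            });
        }
        for (auto& th : threads) th.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    }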
This is the test setup I used:
2,700,000 packets at 1200 bytes each.
Release mode, 64 bit.
i7-3930K CPU, Intel Gigabit CT PCIE adapter.
And here are the results:
1 Transmitter : SharedSocket = 28.2650 sec : 1 Socket = 28.2073 sec.
3 Transmitters : SharedSocket = 28.4485 sec : MultipleSockets = 27.5190 sec.
6 Transmitters : SharedSocket = 28.7414 sec : MultipleSockets = 27.3485 sec.
12 Transmitters : SharedSocket = 27.9463 sec : MultipleSockets = 27.3479 sec.
As expected, the test with only one thread had almost the same time for both. However, in the cases with 3, 6, and 12 transmitters, there is roughly a 3% performance improvement from using one socket per thread instead of sharing a socket. It's not a massive difference, but if you're trying to squeeze every last ounce out of your system, it could be a useful statistic. My particular application is for transmitting a massive amount of video.
Just as a sanity check... here is a screenshot of the Task Manager's network page on the server side. You can see a throughput increase about halfway through the test, which coincides with the switch to the second, multiple-socket test. I included a screencap of the client computer as well (it was a Windows 7 box).

Related

Response time of an application unintuitively correlated with number of input triggers in a given time period

I have a multi-threaded C++ network application which listens for UDP packets as input - this data then hops through the application, processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, let's say 5/sec, the total response time (in-to-out) is slow (let's say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response times in the fast case because the 1st response is always slow. Just to make sure - the application is doing exactly the same amount of work in both the slow and fast cases.
Things that have been tried to investigate this -
It's a dual Xeon box running a Linux 2.6 kernel - disabled Turbo Boost, made sure the processors are in C0 states.
Eliminated network causes; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm" - no effect. (The application's worker threads are busy-waiting and pinned to cores.)
Perf points indicate that everything gets slower - which basically means that the processors are slowing down when not under continuous load - but nothing else (i7z/turbostat) suggests that, or I am reading them incorrectly.
Does someone have any color on what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that more data in the buffer fills up a TCP packet, which then gets sent. Without much data, the TCP packet waits for the ACK from the other side, which is, as we all know, delayed.
Solution - learn to make sure Nagle's program killer is disabled the first thing after you create any send-capable TCP socket.
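For what it's worth, disabling Nagle is a one-liner with setsockopt(); a minimal sketch for a BSD-style socket (on Winsock the option value is passed as a const char* and the descriptor is a SOCKET, otherwise identical):

    #include <netinet/in.h>
    #include <netinet/tcp.h>   // TCP_NODELAY
    #include <sys/socket.h>

    // Disable Nagle's algorithm so small writes go out immediately instead of
    // being coalesced while the stack waits for ACKs from the peer.
    void disableNagle(int sock) {
        int flag = 1;
        setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
    }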

How to figure out why UDP is only accepting packets at a relatively slow rate?

I'm using Interix on Windows XP to make my C++ Linux application easier to port to Windows XP. My application sends and receives packets over a socket to and from a nearby machine running Linux. When sending, I'm only getting throughput of around 180 KB/sec, and when receiving I'm getting around 525 KB/sec. The same code running on Linux gets closer to 2,500 KB/sec.
When I attempt to send at a higher rate than 180 KB/sec, packets get dropped to bring the rate back down to about that level.
I feel like I should be able to get better throughput on sending than 180 KB/sec but am not sure how to go about determining what is the cause of the dropped packets.
How might I go about investigating this slowness in the hopes of improving throughput?
--Some More History--
To reach the above numbers, I have already improved the throughput a bit by doing the following (these made no difference on Linux, but helped throughput on Interix):
I changed SO_RCVBUF and SO_SNDBUF from 256 KB to 25 MB; this improved throughput by about 20%.
I ran an optimized build instead of debug; this improved throughput by about 15%.
I turned off all logging messages going to stdout and a log file; this doubled throughput.
So it would seem that CPU is a limiting factor on Interix, but not on Linux. Further, I am running on a virtual machine hosted in a hypervisor. The Windows XP guest is given 2 cores and 2 GB of memory.
I notice that the profiler shows the CPU on the two cores never exceeding 50% utilization on average. This occurs even when I have two instances of my application running; it still hovers around 50% on both cores. Perhaps my application, which is multi-threaded with a dedicated thread to read from the UDP socket and a dedicated thread to write to the UDP socket (only one is active at any given time), is not being scheduled well on Interix, and thus my packets are being dropped?
In answering your question, I am making the following assumptions based on your description of the problem:
(1) You are using the exact same program on Linux when achieving the throughput of 2,500 KB/sec, other than the socket library, which is, of course, going to be different between Windows and Linux. If this assumption is correct, we probably shouldn't have to worry about other pieces of your code affecting the throughput.
(2) When using Linux to achieve 2,500 KB/sec throughput, the node is in the exact same location in the network. If this assumption is correct, we don't have to worry about network issues affecting your throughput.
Given these two assumptions, I would say that you likely have a problem in your socket settings on the Windows side. I would suggest checking the size of the send-buffer first. The size of the send-buffer is 8192 bytes by default. If you increase this, you should see an increase in throughput. Use setsockopt() to change this. Here is the usage manual: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx
EDIT: It looks like I misread your post by going through it too quickly the first time. I just noticed you're using Interix, which means you're probably not using a different socket library. Nevertheless, I suggest checking the send buffer size first.
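As an illustration of that suggestion, enlarging the buffers looks roughly like this (Winsock-flavoured sketch; under Interix the POSIX version is the same apart from using an int descriptor and a void* option value, and the 25 MB figure just mirrors what the question experimented with):

    #include <winsock2.h>   // setsockopt, SOCKET, SO_SNDBUF / SO_RCVBUF

    // Enlarge the kernel send/receive buffers on a socket.
    // The sizes here are illustrative; tune them for your traffic.
    void enlargeBuffers(SOCKET s) {
        int bufBytes = 25 * 1024 * 1024;
        setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   reinterpret_cast<const char*>(&bufBytes), sizeof(bufBytes));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                   reinterpret_cast<const char*>(&bufBytes), sizeof(bufBytes));
    }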

UDP packets not sent on time

I am working on a C++ application that can be qualified as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to other destinations within a 10 ms period. However, sometimes after packet reception the retransmission exceeds the 10 ms limit and can reach 100 ms. These off-limit delays are random.
The application receives, on the same Ethernet interface but on a different port, another kind of packet (up to 200 packets of nearly 100 bytes each second). I am not sure that this latter flow is disrupting the other one, because these delay peaks are too scarce (2 packets among 10,000 packets).
What can be the causes of these sporadic delays? And how to solve them?
P.S. My application is running on Linux 2.6.18-238.el5PAE. Delays are measured between the reception of the packet and the successful completion of the transmission.
10 ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO (some reading). It can be done in code via sched_setscheduler() or from the command line via chrt. Adjust the priority as well while you're at it.
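A minimal sketch of switching the process to SCHED_FIFO (requires root or CAP_SYS_NICE; the priority value is illustrative):

    #include <sched.h>
    #include <cstdio>

    // Switch the calling process to the SCHED_FIFO real-time policy.
    bool makeRealtime(int priority /* 1..99, e.g. 50 */) {
        sched_param param{};
        param.sched_priority = priority;
        if (sched_setscheduler(0 /* this process */, SCHED_FIFO, &param) != 0) {
            std::perror("sched_setscheduler");
            return false;
        }
        return true;
    }

From the shell, chrt -f 50 ./your_app (a hypothetical binary name) does the same for an already-built program.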
Make sure your code doesn't consume CPU more than it has to, or it will affect entire system performance.
You may also need RT_PREEMPT patch.
Overall, the task of generating precisely scheduled Ethernet traffic on Linux is not an easy one. E.g. see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.

Conceptual question on (a tool like) LoadRunner

I'm using LoadRunner to stress-test a J2EE application.
I have got: 1 MySQL DB server, and 1 JBoss App server. Each is a 16-core (1.8GHz) / 8GB RAM box.
Connection Pooling: The DB server is using max_connections = 100 in my.cnf. The App Server too is using min-pool-size and max-pool-size = 100 in mysql-ds.xml and mysql-ro-ds.xml.
I'm simulating a load of 100 virtual users from a 'regular', single-core PC. This is a 1.8GHz / 1GB RAM box.
The application is deployed and being used on a 100 Mbps ethernet LAN.
I'm using rendezvous points in sections of my stress-testing script to simulate real-world parallel (and not concurrent) use.
Question:
The CPU utilization on this load-generating PC never reaches 100%, and memory too, I believe, is available. So I could try adding more virtual users on this PC. But before I do that, I would like to know 1 or 2 fundamentals about concurrency/parallelism and hardware:
With only a single-core load generator such as this one, can I really simulate a parallel load of 100 users (with each user operating from a dedicated PC in real life)? My possibly incorrect understanding is that 100 threads on a single-core PC will run concurrently (interleaved, that is) but not in parallel... which means I cannot really simulate a real-world load of 100 parallel users (on 100 PCs) from just one single-core PC! Is that correct?
Network bandwidth limitations on user parallelism: Even assuming I had a 100-core load-generating PC (or, alternatively, 100 single-core PCs sitting on my LAN), won't the way Ethernet works permit only concurrency, and not parallelism, of users on the Ethernet wire connecting the load-generating PC to the server? In fact, it seems this issue (the absence of user parallelism) would persist even in real-world application usage (with 1 PC per user), since the user requests reaching the app server on a multi-core box can only arrive interleaved. That is, the only time the multi-core server could process user requests in parallel would be if each user had her own dedicated physical-layer connection to the server!
Assuming parallelism is not achievable (due to the above 'issues') and only the next best thing, concurrency, is possible, how would I go about selecting the hardware and network specification for my simulation? For example: (a) How powerful should my load-generating PCs be? (b) How many virtual users should I create per PC? (c) Does each PC on the LAN have to be connected to the server via a switch (to avoid the broadcast traffic which would occur if a hub were used instead of a switch)?
Thanks in advance,
/HS
Not only are you using Ethernet; assuming you're writing web services, you're talking over HTTP(S), which sits atop TCP sockets - a reliable, ordered protocol with the built-in round trips inherent to reliable protocols. Sockets sit on top of IP; if your IP packets don't line up with your Ethernet frames, you'll never fully utilize your network. Even if you were using UDP, had shaped your datagrams to fit your Ethernet frames, and had 100 load generators and 100 1 Gbit Ethernet cards on your server, they'd still be operating on interrupts and you'd have time multiplexing a little bit further down the stack.
Each level here can be thought of in terms of transactions, but it doesn't make sense to think at every level at once. If you're writing a SOAP application that operates at level 7 of the OSI model, then this is your domain. As far as you're concerned your transactions are SOAP HTTP(S) requests, they are parallel and take varying amounts of time to complete.
Now, to actually get around to answering your question: it depends on your test scripts, the amount of memory they use, even the speed your application responds. 200 or more virtual users should be okay, but finding your bottlenecks is a matter of scientific inquiry. Do the experiments, find them, widen them, repeat until you're happy. Gather system metrics from your load generators and system under test and compare with OS provider recommendations, look at the difference between a dying system and a working system, look for graphs that reach a plateau and so on.
It sounds to me like you're overthinking this a bit. Your servers are fast and new, and are more than suited to handle lots of clients. Your bottleneck (if you have one) is either going to be your application itself or your 100 Mbit network.
1./2. You're testing the server, not the client. In this case, all the client is doing is sending and receiving data - there's no overhead for client processing (rendering HTML, decoding images, executing JavaScript, and whatever else it may be). A recent single-core machine can easily saturate a gigabit link; a 100 Mbit pipe should be cake.
Also - the processors in newer/fancier Ethernet cards offload a lot of work from the CPU, so you shouldn't necessarily expect a CPU hit.
3. Don't use a hub. There's a reason you can buy a 100 Mbit hub for $5 on craigslist.
Without a better understanding of your application it's tough to answer some of this, but generally speaking you are correct that to achieve a "true" stress test of your server it would be ideal to have 100 cores (using a target of 100 concurrent users), i.e. 100 PCs. Various issues, though, will probably show this as a no-brainer.
I have a communication engine I built a couple of years back (.NET / C#) that uses asynchronous sockets - we needed the fastest speeds possible, so we had to forgo adding any additional layers on top of the socket, like HTTP or other higher abstractions. Running on a quad-core 3.0 GHz computer with 4 GB of RAM, that server easily handles the traffic of ~2,200 concurrent connections. There's a Gb switch and all the PCs have Gb NICs. Even with all PCs communicating at the same time, it's rare to see processor loads > 30% on that server. I assume this is because of all the latency that is inherent in the "total system."
We have a new requirement to support 50,000 concurrent users that I'm currently implementing. The server has dual quad core 2.8GHz processors, a 64-bit OS, and 12GB of RAM. Our modeling shows this computer is more than enough to handle the 50K users.
Issues like the network latency I mentioned (don't forget the CAT 3 vs. CAT 5 vs. CAT 6 issue), database connections, types of data being stored and mean record sizes, referential issues, backplane and bus speeds, hard drive speeds and size, etc., play as much a role as anything in slowing down a platform "in total." My guess is that you could have 500, 750, 1,000, or even more users on your system.
The goal in the past was to never leave a thread blocked for too long ... the new goal is to keep all the cores busy.
I have another application that downloads and analyzes the content of ~7,800 URLs daily. Running on a dual quad-core 3.0 GHz machine (Windows 7 Ultimate, 64-bit edition) with 24 GB of RAM, that process used to take ~28 minutes to complete. By simply switching the loop to a Parallel.ForEach() the entire process now takes < 5 minutes. The processor load we've seen is always less than 20%, with maximum network loading of only 14% (CAT 5 on a Gb NIC through a standard Gb dumb hub and a T-1 line).
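The same pattern expressed in C++17 terms (not the poster's .NET code; downloadAndAnalyze is a placeholder name) is a parallel algorithm over the URL list:

    #include <algorithm>
    #include <execution>
    #include <string>
    #include <vector>

    // Stand-in for the real download-and-analyze step (hypothetical name).
    void downloadAndAnalyze(const std::string& /*url*/) { /* ... */ }

    // C++17 analogue of the Parallel.ForEach change described above: let the
    // parallel execution policy spread the per-URL work across the cores.
    void processAll(const std::vector<std::string>& urls) {
        std::for_each(std::execution::par, urls.begin(), urls.end(),
                      [](const std::string& url) { downloadAndAnalyze(url); });
    }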
Keeping all the cores busy makes a huge difference, especially in applications that spend a lot of time waiting on IO.
As you are representing users, disregard the rendezvous unless you have either an engineering requirement to maintain simultaneous behavior, or your agents are processes rather than human users and these agents are governed by a clock tick. Humans are chaotic computing units with variant arrival and departure windows based upon how quickly one can or cannot read, type, converse with friends, etc. A great book on the subject of population behavior is "Chaos" by James Gleick.
The odds of your 100 decoupled users being highly synchronous in their behavior on an instant-by-instant basis under observable conditions are zero. The odds of concurrent activity within a defined time window, however, such as 100 users logging in within 10 minutes after 9:00 am on a business morning, can be quite high.
As a side note, a resume with rendezvous emphasized on it is the #1 marker for a person with poor tool understanding and a poor performance-test process. This comes from a folio of over 1,500 interviews conducted over the past 15 years (I started as a Mercury employee on April 1, 1996).
James Pulley
Moderator
-SQAForums WinRunner, LoadRunner
-YahooGroups LoadRunner, Advanced-LoadRunner
-GoogleGroups lr-LoadRunner
-Linkedin LoadRunner (owner), LoadrunnerByTheHour (owner)
Mercury Alum (1996-2000)
CTO, Newcoe Performance Engineering

Programs run in 2 seconds on my machine but 15 seconds on others

I have two programs written in C++ that use Winsock. They both accept TCP connections, and one sends data while the other receives data. They are compiled in Visual Studio 2008. I also have a program written in C# that connects to both C++ programs and forwards the packets it receives from one to the other. In the process it counts and displays the number of packets forwarded. The elapsed time from the first to the most recent packet is also displayed.
The C++ program that sends packets simply loops 1000 times sending the exact same data. When I run all three apps on my development machine (using loopback or actual IP) the packets get run through the entire system in around 2 seconds. When I run all three on any other PC in our lab it always takes between 15 and 16 seconds. Each PC has different processors and amounts of memory but all of them run Windows XP Professional. My development PC actually has an older AMD Athlon with half as much memory as one of the machines that takes longer to perform this task. I have watched the CPU time graph in Task Manager on my machine and one other and neither of them is using a significant amount of the processor (i.e. more than 10%) while these programs run.
Does anyone have any ideas? I can only think to install Visual Studio on a target machine to see if it has something to do with that.
Problem Solved ====================================================
I first installed Visual Studio to see if that had any effect and it didn't. Then I tested the programs on my new development PC and it ran just as fast as my old one. Running the programs on a Vista laptop yielded 15 second times again.
I printed timestamps on either side of certain instructions in the server program to see which was taking the longest, and I found that the delay was being caused by a Sleep() call of 1 millisecond. Apparently on my old and new systems the Sleep(1) was effectively being ignored, because I would have anywhere from 10 to >20 packets being sent in the same millisecond. Occasionally I would have a break in execution of around 15 or 16 milliseconds, which led to the time of around 2 seconds for 1000 packets. On the systems that took around 15 seconds to run through 1000 packets, I would have a 15 or 16 millisecond gap between sending each packet.
I commented out the Sleep() method call and now the packets get sent immediately. Thanks for the help.
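For anyone hitting the same thing, a quick way to check what Sleep(1) actually costs on a given box is to time it directly (a small standalone sketch, not part of the poster's code; the ~15.6 ms figure is the typical default Windows timer period):

    #include <windows.h>
    #include <chrono>
    #include <cstdio>

    // Time a few Sleep(1) calls; with the default ~15.6 ms system timer the
    // answer is usually "about 15 ms", not 1 ms, unless something has raised
    // the timer resolution (e.g. another app calling timeBeginPeriod).
    int main() {
        for (int i = 0; i < 5; ++i) {
            const auto t0 = std::chrono::steady_clock::now();
            Sleep(1);
            const auto t1 = std::chrono::steady_clock::now();
            std::printf("Sleep(1) took %.2f ms\n",
                        std::chrono::duration<double, std::milli>(t1 - t0).count());
        }
        return 0;
    }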
You should profile your application in the good, 2-second case and in the 15-second lab case and see where they differ. The difference could be due to any number of problems (disk, antivirus, network) - without any data backing it up, we'd just be shooting in the dark.
If you don't have access to a profiler, you can add timing instrumentation to various phases of your program to see which phase is taking longer.
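A lightweight way to add that instrumentation is a small scoped timer, for example (a generic sketch, not tied to the poster's code):

    #include <chrono>
    #include <cstdio>

    // Prints how long its enclosing scope took when it is destroyed; drop one
    // into each phase of the program you want to measure.
    struct ScopedTimer {
        explicit ScopedTimer(const char* label)
            : label_(label), start_(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            const double ms = std::chrono::duration<double, std::milli>(
                std::chrono::steady_clock::now() - start_).count();
            std::printf("%s: %.3f ms\n", label_, ms);
        }
        const char* label_;
        std::chrono::steady_clock::time_point start_;
    };

    // Usage: { ScopedTimer t("send phase"); /* ... code being timed ... */ }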
You could try checking the Winsock performance-tuning registry settings - it may be that the installation of some game or utility has tweaked those on your PC.