I have a controller design application where I get data from 3 USB devices (seen as virtual COM ports under Windows), process it, and then send the action to a 4th USB device (also a virtual COM port). I need to process that data as soon as I receive it, with minimum latency.
I decided to use C++ on Windows Embedded and do the serial communication using the .NET System::IO::Ports classes and the DataReceived event. I tested my code's performance using only one USB device: the device sends one byte to the computer and the computer sends it back. I measured the round-trip time and it was totally non-deterministic, sometimes 2 ms and sometimes 20 ms.
Note: the process priority is set to be realtime.
Is there a better way to get deterministic latency where a maximum delay is guaranteed? Maybe another API suitable for real-time serial communication on Windows Embedded?
Thanks in advance
On Windows, you cannot get deterministic latency; no maximum delay can be guaranteed, because it is not a real-time OS.
If you need guarantees that are mandatory for your project, you can either move the real-time part onto a smaller device such as an Arduino (open source; some users have built a real-time OS on it) or a Beck chip (commercial, delivered with a real-time mini-OS), or you can install a real-time Linux, which provides bounded delays.
I have a multi-threaded C++ network application which listens for UDP packets as input; this data then hops through the application, is processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, let's say 5/sec, the total response time (in-to-out) is slow (let's say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response time in the fast case because the first response is always slow. Just to make sure: the application is doing exactly the same amount of work in both the slow and the fast case.
Things that have been tried to investigate this:
It's a dual-Xeon box running a Linux 2.6 kernel; Turbo Boost is disabled and the processors are confirmed to be in the C0 state.
Eliminated network causes; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm", with no effect. (The application's worker threads are busy-waiting and pinned to cores.)
perf data points indicate that EVERYTHING gets slower, which basically means the processors are slowing down when not under continuous load, but nothing else (i7z/turbostat) suggests that, or I am reading them incorrectly.
Does someone have any color on what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that more data in the buffer fills up a TCP packet, which then gets sent immediately. Without much data, the TCP packet sits waiting for the ACK from the other side, which is, as we all know, delayed.
Solution: make it a habit to disable Nagle's algorithm (the TCP_NODELAY option) as the first thing after you create any send-capable TCP socket.
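A minimal sketch of what that looks like with BSD-style sockets (the option name is the same on Winsock; error handling kept to a bare minimum):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>   // TCP_NODELAY

    // Disable Nagle's algorithm so small writes are sent immediately
    // instead of being coalesced while waiting for an ACK.
    bool disable_nagle(int sock)
    {
        int flag = 1;
        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                          reinterpret_cast<const char*>(&flag), sizeof(flag)) == 0;
    }

Call it right after the socket is created (or accepted), before the first send.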
I'm using Interix on Windows XP so that my C++ Linux application can be more readily ported to Windows XP. My application sends and receives packets over a socket to and from a nearby machine running Linux. When sending, I'm only getting throughput of around 180 KB/sec, and when receiving I'm getting around 525 KB/sec. The same code running on Linux gets closer to 2,500 KB/sec.
When I attempt to send at a higher rate than 180 KB/sec, packets get dropped to bring the rate back down to about that level.
I feel like I should be able to get better throughput on sending than 180 KB/sec but am not sure how to go about determining what is the cause of the dropped packets.
How might I go about investigating this slowness in the hopes of improving throughput?
--Some More History--
To reach the above numbers, I have already improved the throughput a bit by doing the following (which made no difference on Linux, but helped throughput on Interix):
I changed SO_RCVBUF and SO_SNDBUF from 256 KB to 25 MB; this improved throughput by about 20%.
I ran an optimized build instead of a debug build; this improved throughput by about 15%.
I turned off all logging messages going to stdout and a log file; this doubled throughput.
So it would seem that CPU is a limiting factor on Interix, but not on Linux. Further, I am running on a virtual machine hosted in a hypervisor. The Windows XP guest is given 2 cores and 2 GB of memory.
I notice that the profiler shows the CPU on the two cores never exceeding 50% utilization on average. This occurs even when I have two instances of my application running; it still hovers around 50% on both cores. Perhaps my application, which is multi-threaded with a dedicated thread to read from the UDP socket and a dedicated thread to write to the UDP socket (only one is active at any given time), is not being scheduled well on Interix and thus my packets are being dropped?
In answering your question, I am making the following assumptions based on your description of the problem:
(1) You are using the exact same program in Linux when achieving the throughput of 2,500 KB/sec, other than the socket library, which is of course, going to be different between Windows and Linux. If this assumption is correct, we probably shouldn't have to worry about other pieces of your code affecting the throughput.
(2) When using Linux to achieve 2,500 KB/sec throughput, the node is in the exact same location in the network. If this assumption is correct, we don't have to worry about network issues affecting your throughput.
Given these two assumptions, I would say that you likely have a problem in your socket settings on the Windows side. I would suggest checking the size of the send-buffer first. The size of the send-buffer is 8192 bytes by default. If you increase this, you should see an increase in throughput. Use setsockopt() to change this. Here is the usage manual: http://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx
EDIT: It looks like I misread your post going through it too quickly the first time. I just noticed you're using Interix, which means you're probably not using a different socket library. Nevertheless, I suggest checking the send buffer size first.
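A minimal sketch of adjusting the send buffer with setsockopt(); the call works the same way with Interix/BSD sockets and Winsock, and the 4 MB figure below is just an illustrative value, not a recommendation:

    #include <sys/types.h>
    #include <sys/socket.h>

    // Enlarge the socket send buffer; the OS may round or cap the value,
    // so read it back with getsockopt() to see what was actually granted.
    bool set_send_buffer(int sock, int bytes /* e.g. 4 * 1024 * 1024 */)
    {
        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                       reinterpret_cast<const char*>(&bytes), sizeof(bytes)) != 0)
            return false;

        int granted = 0;
        socklen_t len = sizeof(granted);
        getsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   reinterpret_cast<char*>(&granted), &len);
        return granted >= bytes;
    }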
I have to write a C++ application that reads from the serial port byte by byte. This is an important requirement, because the application receives messages over a radio link using Modbus, and the end of a transmission is defined by a silent interval 3.5 character times long, so I MUST be able to get the message byte by byte. The current system uses DOS to do this, relying on hardware interrupts. We wish to move this software to Linux as the OS, but we lack expertise in this area.
I have tried a number of things: polling with non-blocking reads, using select() with very short timeout values, setting the read buffer of the serial port to one byte, and even using a signal handler on SIGIO, but none of these provides quite what I require. My boss informs me that the DOS application we currently run uses hardware interrupts to get notified when there is something available to read from the serial port, and that the hardware is accessible directly.
Is there any way that I can get this functionality from a user-space Linux application? Could I do this if I wrote a custom driver (despite never having done this before and having close to zero knowledge of how the kernel works)? I have heard that Linux is a very popular OS for hardware control and embedded devices, so I am guessing that this kind of thing must be possible somehow, but I have spent literally weeks on this so far and still have no concrete idea of how best to proceed.
I'm not quite sure how reading byte-by-byte helps you with fractional-character reception, unless there is information encoded in the duration of the intervals between characters, so that you need to know the timing of when each one is received.
At any rate, I do suspect you are going to need to make custom modifications to the serial-port kernel driver; that's really not all that bad as projects go, and you will learn a lot. You will probably also need to change the configuration of the UART "chip" (really just a tiny corner of some larger device) to make it interrupt after only a single byte (i.e., emulate a 16450) instead of when its typically 16-byte buffer (emulating a 16550) is partway full. The code of the DOS program might actually be a help there. An alternative, if the baud rate is not too fast, would be to poll the hardware in the kernel or a real-time extension (or, if it is really slow, as it might be on an HF radio link, maybe even in user space).
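Before committing to driver changes, it may also be worth trying the stock driver's low-latency flag from user space. This is only a sketch under the assumption that your port is a standard /dev/ttyS* device; whether the flag actually changes the FIFO/timeout behaviour depends on the specific UART driver:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/serial.h>   // struct serial_struct, ASYNC_LOW_LATENCY

    // Ask the serial driver to push received bytes up to user space as soon
    // as possible instead of waiting for the FIFO/timeout; driver support varies.
    bool request_low_latency(int fd)
    {
        serial_struct ss{};
        if (ioctl(fd, TIOCGSERIAL, &ss) != 0)
            return false;
        ss.flags |= ASYNC_LOW_LATENCY;
        return ioctl(fd, TIOCSSERIAL, &ss) == 0;
    }

    int main()
    {
        int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);  // device name is an example
        if (fd >= 0) {
            request_low_latency(fd);
            // ... then configure termios with VMIN = 1, VTIME = 0 and read() byte by byte ...
            close(fd);
        }
    }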
If I'm right about needing to know the timing of character reception, another option would be to offload the reception to a microcontroller with dual UARTs (or, even better, one UART and one USB interface). You could then have the micro watch the serial stream and output to the PC (either on the other serial port at a much faster baud rate, or over USB) little packets of data that each include one received character and a timestamp, or even have it decode the protocol for you. The nice thing about this is that it gets you operating-system independence and works on legacy-free machines (byte-by-byte access is probably going to fail with an off-the-shelf USB-serial dongle). You can probably even build it from a cheap eval board rather than having to manufacture any custom hardware.
I'm using LoadRunner to stress-test a J2EE application.
I have 1 MySQL DB server and 1 JBoss app server. Each is a 16-core (1.8 GHz) / 8 GB RAM box.
Connection Pooling: The DB server is using max_connections = 100 in my.cnf. The App Server too is using min-pool-size and max-pool-size = 100 in mysql-ds.xml and mysql-ro-ds.xml.
I'm simulating a load of 100 virtual users from a 'regular', single-core PC. This is a 1.8GHz / 1GB RAM box.
The application is deployed and being used on a 100 Mbps ethernet LAN.
I'm using rendezvous points in sections of my stress-testing script to simulate real-world parallel (and not concurrent) use.
Question:
The CPU utilization on this load-generating PC never reaches 100% and memory too, I believe, is available. So, I could try adding more virtual users on this PC. But before I do that, I would like to know 1 or 2 fundamentals about concurrency/parallelism and hardware:
With only a single-core load generator like this one, can I really simulate a parallel load of 100 users (with each user operating from a dedicated PC in real life)? My possibly incorrect understanding is that 100 threads on a single-core PC will run concurrently (that is, interleaved) but not in parallel... which means I cannot really simulate a real-world load of 100 parallel users (on 100 PCs) from just one single-core PC! Is that correct?
Network bandwidth limitations on user parallelism: even assuming I had a 100-core load-generating PC (or, alternatively, 100 single-core PCs sitting on my LAN), won't the way Ethernet works permit only concurrency, and not parallelism, of users on the wire connecting the load-generating PC to the server? In fact, it seems this issue (the absence of user parallelism) would persist even in real-world application usage (with 1 PC per user), since the user requests reaching the app server on a multi-core box can only arrive interleaved. That is, the only time the multi-core server could process user requests in parallel would be if each user had her own dedicated physical-layer connection between her PC and the server!
Assuming parallelism is not achievable (due to the above 'issues') and only the next best thing, concurrency, is possible, how would I go about selecting the hardware and network specification for my simulation? For example: (a) How powerful should my load-generating PCs be? (b) How many virtual users should I create on each of these PCs? (c) Does each PC on the LAN have to be connected to the server via a switch (to avoid the broadcast traffic that would occur if a hub were used instead of a switch)?
Thanks in advance,
/HS
Not only are you using Ethernet; assuming you're writing web services, you're talking over HTTP(S), which sits atop TCP sockets, a reliable, ordered protocol with the built-in round trips inherent to reliable protocols. Sockets sit on top of IP; if your IP packets don't line up with your Ethernet frames you'll never fully utilize your network. Even if you were using UDP, had shaped your datagrams to fit your Ethernet frames, and had 100 load generators and 100 1 Gbit Ethernet cards on your server, they'd still be operating on interrupts and you'd have time multiplexing a little bit further down the stack.
Each level here can be thought of in terms of transactions, but it doesn't make sense to think at every level at once. If you're writing a SOAP application that operates at layer 7 of the OSI model, then this is your domain. As far as you're concerned, your transactions are SOAP HTTP(S) requests; they are parallel and take varying amounts of time to complete.
Now, to actually get around to answering your question: it depends on your test scripts, the amount of memory they use, even the speed at which your application responds. 200 or more virtual users should be okay, but finding your bottlenecks is a matter of scientific inquiry. Do the experiments, find the bottlenecks, widen them, and repeat until you're happy. Gather system metrics from your load generators and the system under test, compare them with OS-provider recommendations, look at the difference between a dying system and a working system, look for graphs that reach a plateau, and so on.
It sounds to me like you're overthinking this a bit. Your servers are fast and new, and are more than suited to handle lots of clients. Your bottleneck (if you have one) is either going to be your application itself or your 100 Mbps network.
1./2. You're testing the server, not the client. In this case, all the client is doing is sending and receiving data; there's no overhead for client processing (rendering HTML, decoding images, executing JavaScript, and whatever else it may be). A recent single-core machine can easily saturate a gigabit link; a 100 Mbit pipe should be cake.
Also, the processors in newer/fancier Ethernet cards offload a lot of work from the CPU, so you shouldn't necessarily expect a CPU hit.
3. Don't use a hub. There's a reason you can buy a 100 Mbps hub for $5 on Craigslist.
Without a better understanding of your application it's tough to answer some of this, but generally speaking you are correct that to achieve a "true" stress test of your server it would be ideal to have 100 cores (with a target of 100 concurrent users), i.e. 100 PCs. Various practical issues, though, will probably show this to be unnecessary.
I have a communication engine I built a couple of years back (.NET / C#) that uses asynchronous sockets; we needed the fastest speeds possible, so we had to forgo adding any additional layers on top of the socket, like HTTP or other higher abstractions. Running on a quad-core 3.0 GHz computer with 4 GB of RAM, that server easily handles the traffic of ~2,200 concurrent connections. There's a Gb switch and all the PCs have Gb NICs. Even with all PCs communicating at the same time, it's rare to see processor loads above 30% on that server. I assume this is because of all the latency that is inherent in the "total system."
We have a new requirement to support 50,000 concurrent users that I'm currently implementing. The server has dual quad core 2.8GHz processors, a 64-bit OS, and 12GB of RAM. Our modeling shows this computer is more than enough to handle the 50K users.
Issues like the network latency I mentioned (don't forget the CAT 3 vs. CAT 5 vs. CAT 6 issue), database connections, the types of data being stored and mean record sizes, referential issues, backplane and bus speeds, hard drive speeds and sizes, etc., play as much a role as anything in slowing down a platform "in total." My guess would be that you could have 500, 750, 1,000, or even more users on your system.
The goal in the past was to never leave a thread blocked for too long ... the new goal is to keep all the cores busy.
I have another application that downloads and analyzes the content of ~7,800 URLs daily. Running on a dual quad-core 3.0 GHz machine (Windows 7 Ultimate, 64-bit edition) with 24 GB of RAM, that process used to take ~28 minutes to complete. By simply switching the loop to a Parallel.ForEach(), the entire process now takes less than 5 minutes. The processor load we've seen is always less than 20%, with a maximum network load of only 14% (CAT 5 on a Gb NIC through a standard Gb dumb hub and a T-1 line).
Keeping all the cores busy makes a huge difference, especially on applications that spend a lot of time waiting on I/O.
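For readers working in C++ rather than .NET, a rough analogue of the Parallel.ForEach() pattern described above is the C++17 parallel algorithms; this is only a sketch, and DownloadAndAnalyze() is a hypothetical stand-in for the per-URL work:

    #include <algorithm>
    #include <execution>
    #include <string>
    #include <vector>

    // Hypothetical per-URL work; stands in for the download-and-analyze step.
    void DownloadAndAnalyze(const std::string& url);

    void ProcessAll(const std::vector<std::string>& urls)
    {
        // std::execution::par lets the runtime spread iterations across cores,
        // much like Parallel.ForEach() does for the loop described above.
        std::for_each(std::execution::par, urls.begin(), urls.end(),
                      [](const std::string& url) { DownloadAndAnalyze(url); });
    }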
As you are representing users, disregard the rendezvous unless you have an engineering requirement to maintain simultaneous behavior, or your agents are processes rather than human users and are governed by a clock tick. Humans are chaotic computing units with variable arrival and departure windows based upon how quickly one can or cannot read, type, converse with friends, etc. A great book on the subject of population behavior is "Chaos" by James Gleick.
The odds of your 100 decoupled users being highly synchronous in their behavior on an instantaneous basis under observable conditions are zero. The odds of concurrent activity within a defined time window, however, such as 100 users logging in within 10 minutes after 9:00 am on a business morning, can be quite high.
As a side note, a resume with rendezvous emphasized on it is the #1 marker for a person with poor tool understanding and a poor performance-test process. This comes from a folio of over 1,500 interviews conducted over the past 15 years (I started as a Mercury employee on April 1, 1996).
James Pulley
Moderator
-SQAForums WinRunner, LoadRunner
-YahooGroups LoadRunner, Advanced-LoadRunner
-GoogleGroups lr-LoadRunner
-Linkedin LoadRunner (owner), LoadrunnerByTheHour (owner)
Mercury Alum (1996-2000)
CTO, Newcoe Performance Engineering