I have opened this new thread after having tried lots of things.
My application (C++ on VS2010) has to grab an image, process it, and send the result through UDP. The problem is the frequency: 200 times per second. The camera records images into a double buffer at 200 Hz, so I have to process each image in less than 5 milliseconds. The application works 99.999% of the time, but I think that Win7 Pro takes away my real-time priority, so in 1 of 100,000 cases something goes wrong.
Reading the MSDN forums and similar sources, I can only use:
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS); to get real-time priority for the process, provided it has been launched with administrator privileges;
SetThreadPriority(HANDLE, THREAD_PRIORITY_ABOVE_NORMAL); or THREAD_PRIORITY_HIGHEST or THREAD_PRIORITY_TIME_CRITICAL.
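For reference, a minimal sketch of how I apply those calls (error handling omitted; as far as I know, without administrator privileges Windows silently downgrades a REALTIME request to HIGH):

#include <windows.h>

// Minimal sketch of the calls from the question (error handling omitted).
// Note: without administrator privileges, Windows silently downgrades a
// REALTIME_PRIORITY_CLASS request to HIGH_PRIORITY_CLASS.
void RaisePriorities(HANDLE workerThread)
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);

    // The same call would be repeated for each handle returned by _beginthreadex.
    SetThreadPriority(workerThread, THREAD_PRIORITY_TIME_CRITICAL);
}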
Now, I have 5 threads started by me (_beginthreadex) and several threads started inside the compiled DLL of the camera. I think that if I set Time Critical priority on all 5 of my threads, none of them will have a higher priority than the others.
So I have two questions:
Can I work at 200 Hz without lags from Windows?
Do you have any suggestions for my thread settings?
Thanks!!
Bye bye
Paolo
Oh I would use more than two buffers for this. A pool of 200 image objects seems like a better bet.
How much latency can you afford, overall? It's always the same story with video streaming - you can have consistent, pause-free operation, or low latency, but not both.
How big is the video image buffer queue on the client side?
Edit:
'I must always send a UDP datagram every 5 ms' :((
OK, so you have an image output queue with a UDP send thread on a 5 ms loop, yes? The queue must never empty. It does indeed sound like the image processing is the bottleneck.
Do you have a pool of threads (at least one per core) doing the processing?
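As a rough illustration of what I mean (C++11 used for brevity; SendDatagram and the queue type are placeholders, not your actual code): the processing threads push finished results into a queue, and a dedicated sender pops one every 5 ms on a fixed cadence.

#include <chrono>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical UDP send wrapper; stands in for your actual socket code.
void SendDatagram(const std::vector<char>& packet);

std::deque<std::vector<char>> outQueue;   // filled by the processing threads
std::mutex                    queueMutex;

// Dedicated sender: one datagram every 5 ms, no drift.
void SenderLoop()
{
    using namespace std::chrono;
    auto next = steady_clock::now();
    for (;;)
    {
        next += milliseconds(5);
        std::vector<char> packet;
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            if (!outQueue.empty())        // the queue must never run dry
            {
                packet = outQueue.front();
                outQueue.pop_front();
            }
        }
        if (!packet.empty())
            SendDatagram(packet);
        std::this_thread::sleep_until(next);
    }
}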
Related
I would like to sync audio playback and recording on two different computers from a common trigger signal using a C/C++ program. The expected delay should not exceed 1 ms.
Retrieving the signal and then starting a program is not really an issue; that delay is quite insignificant (a few microseconds).
For the moment, I'm stuck at an average delay (between the beginning of playback and the beginning of recording) of about 20 ms, and the deviation is quite large (5 to 10 ms).
Both computers are running Linux, and I'm using aplay and arecord from alsa-utils (started directly from code using the system() command).
Does someone have a good idea or experience with decreasing or controlling the latency between the two audio interfaces?
In my opinion, there should be a way to initialize both interfaces (rate, output format, ...) and, for the playback device, preload the data into the audio buffer and then start playing when the signal is received.
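Roughly what I have in mind, sketched against alsa-lib instead of aplay (trigger handling and error checks omitted; the parameters are only examples, and WaitForTrigger is a placeholder):

#include <alsa/asoundlib.h>
#include <climits>

// Hypothetical: blocks until the common trigger signal arrives.
void WaitForTrigger();

// Sketch only: open and configure the playback device, pre-fill its buffer,
// then start the stream explicitly when the trigger fires. The parameters
// (S16_LE, 2 channels, 48 kHz, 500 ms buffer) are examples, not a recommendation.
void PreloadAndArm(const short* samples, snd_pcm_uframes_t frames)
{
    snd_pcm_t* pcm = NULL;
    snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0);
    snd_pcm_set_params(pcm, SND_PCM_FORMAT_S16_LE,
                       SND_PCM_ACCESS_RW_INTERLEAVED, 2, 48000, 1, 500000);

    // Raise the start threshold so that writing data does not auto-start playback.
    snd_pcm_sw_params_t* sw;
    snd_pcm_sw_params_alloca(&sw);
    snd_pcm_sw_params_current(pcm, sw);
    snd_pcm_sw_params_set_start_threshold(pcm, sw, LONG_MAX);
    snd_pcm_sw_params(pcm, sw);

    snd_pcm_writei(pcm, samples, frames);   // preload the data into the buffer

    WaitForTrigger();
    snd_pcm_start(pcm);                     // playback starts here
}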
Thanks
This is a tough one, but also technically very interesting. The best approach I can think of at the moment would be a round-trip-time (RTT) approach, given that you can control the delay of the audio devices to the extent required. You emit a signal from the first system to the second system, to which the second system replies. The second system starts recording after a predefined amount of time (maybe 100 ms, but that depends on the expected latency). When the first system has received the response, it can determine the round-trip time. It then starts playback after the predefined delay minus half the round-trip time, assuming that the forward path takes the same amount of time as the return path. The accuracy that can be achieved depends on the systems you are using for signalling.
EMIT SIGNAL ON SYSTEM 1
RECEIVE SIGNAL ON SYSTEM 2
EMIT SIGNAL ON SYSTEM 2
RECEIVE SIGNAL ON SYSTEM 1
DETERMINE ROUND-TRIP-TIME
START ON SYSTEM 2 AFTER X ms
START ON SYSTEM 1 AFTER (X-RTT/2) ms
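As a rough sketch of the arithmetic on system 1 (std::chrono for the timestamps; ExchangeSignal and StartPlayback are placeholders for your signalling channel and audio start):

#include <chrono>
#include <thread>

// Hypothetical helpers: ExchangeSignal() sends the trigger to system 2 and
// returns once its reply arrives; StartPlayback() starts the audio output.
void ExchangeSignal();
void StartPlayback();

void StartSynchronised(int predefinedDelayMs /* the "X" above */)
{
    using namespace std::chrono;

    auto t0 = steady_clock::now();
    ExchangeSignal();                        // system 2 will start X ms after the ping
    auto rtt = steady_clock::now() - t0;     // measured round-trip time

    // Assume the forward path takes half the round trip: start X - RTT/2
    // after the ping left, so both systems begin at the same moment.
    std::this_thread::sleep_until(t0 + milliseconds(predefinedDelayMs) - rtt / 2);
    StartPlayback();
}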
I am reworking some of the infrastructure in an existing application that sends UDP data packets to 1...N addresses (often multicast). Currently there are, let's say, T transmitter objects, and in some cases all of the transmitters are sending to the same address.
So to simplify and provide an example case, let's say there are 3 transmitter objects and they all need to send to a single specific address. My question is: which is more efficient?
Option 1) Put a mutex around a single socket and have all the transmitters (T) share the same socket.
T----\
T----->Socket
T----/
Option 2) Use three separate sockets, all sending to the same location.
T----->Socket 1
T----->Socket 2
T----->Socket 3
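To make Option 1 concrete, this is roughly the pattern I mean (Winsock flavour; socket setup, WSAStartup, and error handling omitted; the names are just illustrative):

#include <winsock2.h>
#include <mutex>

#pragma comment(lib, "ws2_32.lib")

// Option 1, roughly: every transmitter thread funnels through one socket,
// serialised by a mutex. Option 2 simply gives each thread its own SOCKET
// and drops the lock.
SOCKET      g_sharedSocket;   // created and bound elsewhere
std::mutex  g_socketMutex;
sockaddr_in g_destination;    // the common (possibly multicast) address

void TransmitShared(const char* data, int len)
{
    std::lock_guard<std::mutex> lock(g_socketMutex);
    sendto(g_sharedSocket, data, len, 0,
           reinterpret_cast<const sockaddr*>(&g_destination),
           sizeof(g_destination));
}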
I suspect that with the second option, under the hood, the OS or the NIC serializes the final transmit anyway, so in the big picture Option 2 is probably not a whole lot different from Option 1.
I will probably set up an experiment on my development PC next week, but there's no way I can test all the potential computer configurations that users might install on. I also realize there are different implementations - Windows vs Linux, different NIC chipset manufacturers, etc, but I'm wondering if anyone might have some past experience or architectural knowledge that could shed light on an advantage of one option over the other.
Thanks!
After running some benchmarks on a Windows 10 computer, I have an "answer" that at least gives me a rough idea of what to expect. I can't be 100% sure that every system will behave the same way, but most of the servers I run use Intel NICs and Windows 10, and my typical packet sizes are around 1200 bytes, so the answer at least makes me comfortable that it's correct for my particular scenario. I decided to post the results here in case anyone else can make use of the experiment.
I built a simple command-line app that would first spawn T transmitter threads, all using a single socket with a mutex around it. Immediately after, it would run another test with the same number of transmitters, but this time each transmitter had its own socket so no mutex was needed (although I'm sure there was a locking mechanism at some lower level). Each transmitter blasts out packets as fast as possible.
This is the test setup I used:
2,700,000 packets at 1200 bytes each.
Release mode, 64 bit.
i7-3930K CPU, Intel Gigabit CT PCIE adapter.
And here are the results
1 Transmitter : SharedSocket = 28.2650 sec : 1 Socket = 28.2073 sec.
3 Transmitters : SharedSocket = 28.4485 sec : MultipleSockets = 27.5190 sec.
6 Transmitters : SharedSocket = 28.7414 sec : MultipleSockets = 27.3485 sec.
12 Transmitters : SharedSocket = 27.9463 sec : MultipleSockets = 27.3479 sec.
As expected, the test with only one thread had almost the same time for both. However, in the cases with 3, 6, and 12 transmitters, there is roughly a 3% performance improvement from using one socket per thread instead of sharing a socket. It's not a massive difference, but if you're trying to squeeze every last ounce out of your system, it could be a useful statistic. My particular application transmits a massive amount of video.
Just as a sanity check, here is a screenshot of the Task Manager's network page on the server side. You can see a throughput increase about halfway through the test, which coincides with the switch to the second, multiple-socket test. I included a screen capture of the client computer as well (it was a Windows 7 box).
I have a multi-threaded C++ network application which listens for UDP packets as input. This data then hops through the application, is processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, let's say 5/sec, the total response time (in to out) is slow (let's say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response time in the fast case because the first response is always slow. Just to make sure: the application is doing exactly the same amount of work in both the slow and fast cases.
Things that have been tried to investigate this -
It's a dual-Xeon box running a Linux 2.6 kernel - disabled Turbo Boost, made sure the processors are in C0 states.
Eliminated network causes; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm" - no effect. (The application's worker threads are busy-waiting and pinned to cores.)
perf data points indicate that EVERYTHING gets slower - which would basically mean the processors are slowing down when not under continuous load - but nothing else suggests that (i7z/turbostat), or I am reading them incorrectly.
Does anyone have any color on what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that more data in the buffer fills up a TCP packet, which then gets sent. Without much data, the TCP stack waits for the ACK from the other side, which is, as we all know, delayed.
Solution - learn to make sure Nagle's program killer is disabled as the first thing after you create any send-capable TCP socket.
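Concretely, that means setting TCP_NODELAY right after the socket is created, e.g. (error handling omitted):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a freshly created TCP socket so that small
// responses go out immediately instead of waiting for the previous ACK.
void DisableNagle(int sock)
{
    int flag = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}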
New description of the problem:
I am currently running our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a double buffer. Every few seconds, this loop freezes for 200 ms. I did several tests, but none of them let me figure out what the software is waiting for. Since the software is rather complex, and the test environment could also interfere with it, I need a tool/technique to find out what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors, and a second thread saves the data to disc in big blocks. The data is collected in a double buffer, which typically contains 100,000 bytes per item and collects up to 300 items per second. One buffer is used for writing in the data collection thread, and one buffer is used for reading the data and saving it to disc in the second thread. Once all the data has been read, the buffers are switched. This switch seems to be a major performance problem: each time the buffers switch, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switch is much faster, taking nearly no time at all. (Test PC: Windows 7 64-bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz).)
My guess is that the performance problem is linked to the data being exchanged between cores. Only when the threads happen to run on the same core is the exchange much faster. I thought about setting the thread affinity mask to force both threads to run on the same core, but this also means that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer switch before it can access the new data.
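For the affinity experiment I have in mind, it would be something like this (the thread handles are placeholders; this is only a test, not the intended solution):

#include <windows.h>

// Hypothetical test only: pin both threads to core 0 to see whether the
// 200 ms stall disappears when no cross-core transfer is involved.
void PinThreadsToCore0(HANDLE collectionThread, HANDLE savingThread)
{
    SetThreadAffinityMask(collectionThread, 1);  // bit 0 = logical core 0
    SetThreadAffinityMask(savingThread,     1);
}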
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, this value is checked to know which vector should be used. Switching the buffers just means toggling the bool value; of course, during the toggle all reading and writing is blocked by a mutex. I don't think that this mutex could possibly block for 200 ms. By the way, the 200 ms are very reproducible for each switch event.
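A stripped-down sketch of the pattern (not the real class, just to show the locking):

#include <mutex>
#include <vector>

// Simplified sketch of the double buffer described above: the collection
// thread appends to the write buffer, the saving thread drains the other
// one, and Switch() just toggles the index under the mutex.
class DoubleBuffer
{
public:
    DoubleBuffer() : m_writeIndex(0) {}

    void Push(const std::vector<char>& item)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_buffers[m_writeIndex].push_back(item);
    }

    // Called by the saving thread once it has finished reading its buffer.
    void Switch()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_writeIndex = 1 - m_writeIndex;
    }

private:
    std::vector<std::vector<char>> m_buffers[2];
    int                            m_writeIndex;
    std::mutex                     m_mutex;
};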
Locking and releasing a mutex just to switch one bool variable will not take 200 ms.
The main problem is probably that the two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically, it occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism, you have two threads waiting for each other to finish their part of the work, which has a similar effect to a single-threaded approach.
For further reading, I recommend this article, which describes lock contention in more detail.
Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (then Intel's VTune is the natural choice). From my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool; you don't need VS installed on your test machine, you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling - just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it has been available since VS2010.
Good luck
After discussing the problem in chat, it turned out that the Windows Performance Analyzer is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link in the chat: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant threads. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified: since it saves data to disc, it is the one with file access (look at Memory->Hard Faults).
Check out Computation->CPU Usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. It is best to display the columns in this order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait (µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
I then checked each line whose NewThreadId column referred to the thread with the delays. After a short search, I found a thread that showed frequent waits of about 100 ms, which always showed up in pairs. In the ReadyingThreadID column of those lines, I could read the ID of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our specific case, this led me to the assumption that the serial port I/O commands could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch-data command sent via the virtual serial port pair sometimes gets lost. This might be linked to the very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data time out after their 100 ms timeout.
I have 2 projects. One is built with C++ Builder (without MFC), and the other one is VC++ MFC 11.
When I create a thread and run a loop -- let's say this loop increments a progress bar position from 1 to 100 -- using Sleep(10), it works, of course, for both C++ Builder and VC++ MFC.
Now, Sleep(10) should wait 10 milliseconds. OK. But the problem only goes away if I have Windows Media Player, Winamp, or anything else open that produces sound. If I close all media players, Winamp, and other sound programs, my threads get slower than 10 milliseconds.
Each iteration then takes something like 50-100 ms. If I play any music, it works normally, as I expect.
I have no idea why this is happening. I first thought I had made a mistake in the MFC app, but then why does the C++ Builder project also slow down?
And yes, I am positively sure it is sound related, because I even reformatted my Windows installation and disabled everything; in the end I discovered the sound issue.
Does my code need something?
Update:
Now, following the code, I found that I used Sleep(1) in certain places to wait 1 millisecond. The reason is that I move an object from left to right; if I remove this sleep, the movement isn't visible because it is too fast. So I have to use Sleep(1). With Sleep(1), if audio is on then it works; if audio is off then it is very slow.
for (int i = 0; i <= 500; i++) {
    // Move the static control one pixel to the right per iteration
    // (SWP_NOSIZE | SWP_NOZORDER: reposition only, keep size and z-order).
    theDialog->staticText->SetWindowPos(NULL, i, 20, 0, 0, SWP_NOSIZE | SWP_NOZORDER);
    Sleep(1);
}
So, suggestions regarding this are really appreciated. What should I do?
I know this is the wrong way to do it; I should use something proper and valid instead. But what exactly? Which function or class will help me move static text from one position to another smoothly?
Also, changing the thread priority has not helped.
Update 2:
Update 1 is an another question :)
Sleep(10) will (as we know) wait for approximately 10 milliseconds. If there is a higher-priority thread which needs to run at that moment, the thread wakeup may be delayed. Multimedia threads probably run at real-time or high priority, so when you play sound, your thread's wakeup gets delayed.
Refer to Jeffrey Richter's comment in Programming Applications for Microsoft Windows (4th Ed.), section "Sleeping" in Chapter 7:
"The system makes the thread not schedulable for approximately the number of milliseconds specified. That's right—if you tell the system you want to sleep for 100 milliseconds, you will sleep approximately that long but possibly several seconds or minutes more. Remember that Windows is not a real-time operating system. Your thread will probably wake up at the right time, but whether it does depends on what else is going on in the system."
Also, as per the MSDN page Multimedia Class Scheduler Service (Windows):
MMCSS ensures that time-sensitive processing receives prioritized access to CPU resources.
As per the above documentation, you can also control the percentage of CPU resources that will be guaranteed to low-priority tasks, through a registry key
Sleep(10) waits for at least 10 milliseconds. You have to write code to check how long you actually waited and if it's more than 10 milliseconds, handle that sanely in your code. Windows is not a real time operating system.
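For example, something along these lines (illustrative only):

#include <windows.h>

// Measure how long Sleep() actually took so the caller can compensate.
// QueryPerformanceCounter gives sub-millisecond resolution on Windows.
double SleepAndMeasureMs(DWORD requestedMs)
{
    LARGE_INTEGER freq, before, after;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&before);

    Sleep(requestedMs);

    QueryPerformanceCounter(&after);
    return (after.QuadPart - before.QuadPart) * 1000.0 / freq.QuadPart;
}

If the returned value is, say, 15 ms instead of 10, advance your animation by the measured elapsed time rather than by a fixed step per iteration.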
The minimum resolution of Sleep() timing is set system-wide with timeBeginPeriod() and timeEndPeriod(). For example, calling timeBeginPeriod(1) sets the minimum resolution to 1 ms. It may be that the audio programs are setting the resolution to 1 ms and restoring it to something greater than 10 ms when they are done. I had a problem with a program that used Sleep(1) that only worked correctly while the XE2 IDE was running; otherwise it would sleep for 12 ms. I solved the problem by calling timeBeginPeriod(1) directly at the beginning of my program.
See: http://msdn.microsoft.com/en-us/library/windows/desktop/dd757624%28v=vs.85%29.aspx
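For what it's worth, a minimal sketch of the fix that worked for me (restore the period with timeEndPeriod() when you are done):

#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    // Ask for 1 ms timer resolution for the lifetime of the program so that
    // Sleep(1) actually sleeps close to 1 ms instead of a full timer tick.
    timeBeginPeriod(1);

    // ... application code using Sleep(1) / Sleep(10) ...

    timeEndPeriod(1);   // restore the previous system-wide resolution
    return 0;
}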