Windows Audio / WaveInAddBuffer() blocks

Windows Audio / WaveInAddBuffer() blocks - c++

My application records audio samples from a microphone connected to my PC. So I chose the Windows WaveInXXX API to do the job.
After reading the documentation I decided to avoid using the callback mechanism with WaveInProc to save me the hassle synchronizing the threads. The whole application is pretty big and I thought this would make debugging simpler. When the application requests a block of samples, I just iterate over my buffer queue, take one out, copy the data, unprepare it, prepare it and add it back to the buffer queue. Basic program structure looks like this, I hope it makes the basic program flow clear:
WaveInOpen()
WaveInStart()
FunctionAddingPreparedBuffersToTheQueue()
while(someConditionThatEventuallyBecomesFalse)
if(NextBufferInQueueIsMarkedDone)
GetDataFromBuffer()
UnpreparePrepareHeaderAndAddBuffer()
else
WaitForAShortTime()
WaveInStop()
WaveInClose()
Now the problem appears: After some time (and I am unable to reproduce the exact condition), WaveInAddBuffer() causes a deadlock although it's in the same thread as all the rest. The header for the buffer that shall be added when the deadlock happens is prepared and dwFlags == WHDR_PREPARED == 2.
Any ideas what could cause this deadlock?

I have not seen such a problem, but a guess might be something like fragmentation related to all the unprepare/prepare cycles. They are not necessary. You can do the prepare once for each buffer and then unprepare when finished recording. (Prepare locks the buffer into physical memory.)

Related

Idendify the reason for a 200 ms freezing in a time critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could interfere too with the software, I need a tool/technique to test what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, it happens once in a while, that the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU #3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is, that the performance problem is linked to the data being exchanged between cores. Only if the threads run on the same core by chance, the exchange would be much faster. I thought about setting the thread affinity mask in a way to force both threads to run on the same core, but this also means, that I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer to switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.

Locking and releasing a mutex just to switch one bool variable will not take 200ms.
Main problem is probably that two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically this occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead parallelism you have two thread waiting for each other to finish their part of work, having similar effect as in single threaded approach.
For further reading I recommend this article for a read, which describes lock contention with more detailed level.

Since you are running on windows maybe you use visual studio? if yes I would resort to VS profiler which is quite good (IMHO) in such cases, once you don't need to check data/instruction caches (then the Intel's vTune is a natural choice). From my experience VS is good enough to catch contention problems as well as CPU bottlenecks. you can run it directly from VS or as standalone tool. you don't need the VS installed on your test machine you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
then you can open the file and inspect it in visual studio. if you don't see anything making your CPU busy switch to contention profiling - just change the "start" argument from "SAMPLE" to "CONCURRENCY"
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\, AFAIR it is available from VS2010
Good luck

After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified. Since it saves data to disc, it is the one that with file access. (Look at Memory->Hard Faults)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port io command could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch data command via the virtual serial port pair gets sometimes lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received and the first as well as the second attempt to receive the data timed out with their 100 ms timeout time.

Is FindFirstChangeNotification API doing any disk access? [duplicate]

I've used FileSystemWatcher in the past. However, I am hoping someone can explain how it actually is working behind the scenes.
I plan to utilize it in an application I am making and it would monitor about 5 drives and maybe 300,000 files.
Does the FileSystemWatcher actually do "Checking" on the drive - as in, will it be causing wear/tear on the drive? Also does it impact hard drive ability to "sleep"
This is where I do not understand how it works - if it is like scanning the drives on a timer etc... or if its waiting for some type of notification from the OS before it does anything.
I just do not want to implement something that is going to cause extra reads on a drive and keep the drive from sleeping.

Nothing like that. The file system driver simply monitors the normal file operations requested by other programs that run on the machine against the filters you've selected. If there's a match then it adds an entry to an internal buffer that records the operation and the filename. Which completes the driver request and gets an event to run in your program. You'll get the details of the operation passed to you from that buffer.
So nothing actually happens the operations themselves, there is no extra disk activity at all. It is all just software that runs. The overhead is minimal, nothing slows down noticeably.

The short answer is no. The FileSystemWatcher calls the ReadDirectoryChangesW API passing it an asynchronous flag. Basically, Windows will store data in an allocated buffer when changes to a directory occur. This function returns the data in that buffer and the FileSystemWatcher converts it into nice notifications for you.

Multithreading a File Map into an Array of Buffers

I'm trying to work with nasty large xml and text documents: ~40GBs.
I'm using Visual Studio 2012 on Windows 7.
I'm going to use 'Xerces' to snag the header/'footer tag' from the xmls.
I want to map an area of the file, say.. 60-120MBs.
Split the Map into (3 * processors/cores) equal parts. Setting each part as a buffer and loading the buffers into an array.
Then using (#processors/cores) while statments in new threads, I will synchronously count characters/lines/xml cycles while chewing through the the buffer array. When one buffer is completed the the process will jump to the next 'available' buffer and the completed buffer will be dropped out of memory. At the end I will add the total results into a project log.
Afterwards, I will reference the log, Split the files by character count/size(Or other option) to the nearest line or cycle and drop in the header and 'footer tag' to all the splits.
I'm doing this so I can import massive data to a MySQL server over a network with multiple computers.
My Question is, how do I create the buffer array and the file map with new threads?
Can I use :
win CreateFile
win CreateFileMapping
win MapViewOfFile
with standard ifstream operations and char buffers or should I opt something else?
Futher clarification:
My thinking is that if I can have the hard drive streaming the file into memory from one place and in one direction that I can use the full processing power of the machine to chew through seperate but equal buffers.
~Flavor: It's kind of like being a Shepard trying to scoop food out from one huge bin with 3-6 Large buckets with only two arms for X sheep that need to stay inside the fenced area. But they all move at the speed of light.
A few ideas or pointers might help me along here.
Any thoughts are Most Welcome. Thanks.
while(getline(my_file, myStr))
{
characterCount += myStr.length();
lineCount++;
if(my_file.eof()){
break;
}
}
This was the only code at run time for the test. 2hours, 30+min. 45-50% total processor for the program running it on a dual core 1.6Mhz laptop with 2GB RAM. Most of the RAM loaded right now is 600+MB from ~50 tabs open in firefox, Visual Studio at 60MB, then etcs.
IMPORTANT: During the test, the program running the code, which is only a window, and a dialog box, seemed to dump it's own working and private set of ram, down to like 300K ish, and didn't respond for the length of the test. I need to make another thread for the while statement I'm sure. But this means that NONE of the file was read into a buffer. The CPU was struggling for the entire run to keep up with the tinyest effort from the hard drive.
P.S. Further proof of CPU bottlenecking. It might take me 20min to transfer than entire file to another computer over my wireless network. Which includes the read process and a socket catch to write process on the other computer.
UPDATE
I used this adorable little thing to go from the previous test time to about 15-20min which is in line with what Mats Petersson was saying.
while (my_file.read( &bufferOne[0], bufferOne.size() ))
{
int cc = my_file.gcount();
for (int i = 0; i < cc; i++)
{
if (bufferOne[i] == '\n')
lineCount++;
characterCount++;
}
currentPercent = characterCount/onePercent;
SendMessage(GetDlgItem(hDlg, IDC_GENPROGRESS), PBM_SETPOS, currentPercent, 0);
}
Granted this is a single loop and it actually behaved much more appropriately than the previous test. This test was ~800% faster than the tight loop shown above this one with Getline. I set the buffer for this loop at 20MB. I jacked this code from: SOF - Fastest Example
BUT...
I would like to point out that while polling the process in resource mon and task manager, it clearly showed the first core at 75-90% usage, the second fluxuately 25-50% (Pretty standard for some minor background stuff that I have open), and the hard drive at.. wait for it... 50%. Some 100% disk time spikes but also some lows at 25%. All of which basically means that Splitting the buffer processing between two different threads could very well be a benefit. It will use all the system resources but.. that's what I want. I'll update later today when I have the working prototype.
MAJOR UPDATE:
Finally finished my project after a bunch of learning. No File Map needed. Only a bunch of vector char's. I have successfully built a dynamically executing file stream line and character counter.
The good news, went from the previous 10-15min marker to ~3-4min on a 5.8GB file, BOOYA!~

Very short answer: Yes, you can use those functions.
For reading data, it's likely the most efficient method to map the file content into memory, since it saves having to copy the memory into a buffer in the application, just read it straight into the place it's supposed to go. So, no problem as long as you have enough address space available - 64-bit machines should certainly have plenty, in a 32-bit system it may be more of a scarce resource - but for sections of a few hundred MB, it shouldn't be a huge issue.
However, using multiple threads, I'm not at all convinced. I have a fair idea that reading more than one part of a very large file will be counter productive. This will increase the amount of head movement on the disk, which is a large portion of transfer rate. You can count on some 50-100MB/s transfer rates for "ordinary" systems. If the system has some sort of raid controller or some such, maybe around double that - very exotic raid controllers may achieve three times.
So reading 40GB will take somewhere in the order of 3-15 minutes.
The CPU is probably not going to be very busy, and running multiple threads is quite likely to worsen the overall performance of the system.
You may want to keep a thread for reading and one for writing, and only actually write out the data once you have a sufficient amount of it, again, to avoid unnecessary moves of the read/write head on the disk(s).

Loading large multi-sample audio files into memory for playback - how to avoid temporary freezing

I am writing an application needs to use large audio multi-samples, usually around 50 mb in size. One file contains approximately 80 individual short sound recordings, which can get played back by my application at any time. For this reason all the audio data gets loaded into memory for quick access.
However, when loading one of these files, it can take many seconds to put into memory, meaning my program if temporarily frozen. What is a good way to avoid this happening? It must be compatible with Windows and OS X. It freezes at this : myMultiSampleClass->open(); which has to do a lot of dynamic memory allocation and reading from the file using ifstream.
I have thought of two possible options:
Open the file and load it into memory in another thread so my application process does not freeze. I have looked into the Boost library to do this but need to do quite a lot of reading before I am ready to implement. All I would need to do is call the open() function in the thread then destroy the thread afterwards.
Come up with a scheme to make sure I don't load the entire file into memory at any one time, I just load on the fly so to speak. The problem is any sample could be triggered at any time. I know some other software has this kind of system in place but I'm not sure how it works. It depends a lot on individual computer specifications, it could work great on my computer but someone with a slow HDD/Memory could get very bad results. One idea I had was to load x samples of each audio recording into memory, then if I need to play, begin playback of the samples that already exist whilst loading the rest of the audio into memory.
Any ideas or criticisms? Thanks in advance :-)

Use a memory mapped file. Loading time is initially "instant", and the overhead of I/O will be spread over time.

I like solution 1 as a first attempt -- simple & to the point.
If you are under Windows, you can do asynchronous file operations -- what they call OVERLAPPED -- to tell the OS to load a file & let you know when it's ready.

i think the best solution is to load a small chunk or single sample of wave data at a time during playback using asynchronous I/O (as John Dibling mentioned) to a fixed size of playback buffer.
the strategy will be fill the playback buffer first then play (this will add small amount of delay but guarantees continuous playback), while playing the buffer, you can re-fill another playback buffer on different thread (overlapped), at least you need to have two playback buffer, one for playing and one for refill in the background, then switch it in real-time
later you can set how large the playback buffer size based on client PC performance (it will be trade off between memory size and processing power, fastest CPU will require smaller buffer thus lower delay).

You might want to consider a producer-consumer approach. This basically involved reading the sound data into a buffer using one thread, and streaming the data from the buffer to your sound card using another thread.
The data reader is the producer, and streaming the data to the sound card is the consumer. You need high-water and low-water marks so that, if the buffer gets full, the producer stops reading, and if the buffer gets low, the producer starts reading again.
A C++ Producer-Consumer Concurrency Template Library
http://www.bayimage.com/code/pcpaper.html
EDIT: I should add that this sort of thing is tricky. If you are building a sample player, the load on the system varies continuously as a function of which keys are being played, how many sounds are playing at once, how long the duration of each sound is, whether the sustain pedal is being pressed, and other factors such as hard disk speed and buffering, and amount of processor horsepower available. Some programming optimizations that you eventually employ will not be obvious at first glance.

Architectural Suggestions in a Linux App

I've done quite a bit of programming on Windows but now I have to write my first Linux app.
I need to talk to a hardware device using UDP. I have to send 60 packets a second with a size of 40 bytes. If I send less than 60 packets within 1 second, bad things will happen.
The data for the packets may take a while to generate. But if the data isn't ready to send out on the wire, it's ok to send the same data that was sent out last time.
The computer is a command-line only setup and will only run this program.
I don't know much about Linux so I was hoping to get a general idea how you might set up an app to meet these requirements.
I was hoping for an answer like:
Make 2 threads, one for sending packets and the other for the calculations.
But I'm not sure it's that simple (maybe it is). Maybe it would be more reliable to make some sort of daemon that just sent out packets from shared memory or something and then have another app do the calculations? If it is some multiple process solution, what communication mechanism would you recommend?
Is there some way I can give my app more priority than normal or something similar?
PS: The more bulletproof the better!

I've done a similar project: a simple software on an embedded Linux computer, sending out CAN messages at a regular speed.
I would go for the two threads approach. Give the sending thread a slightly higher priority, and make it send out the same data block once again if the other thread is slow in computing those blocks.
60 UDP packets per second is pretty relaxed on most systems (including embedded ones), so I would not spend much sweat on optimizing the sharing of the data between the threads and the sending of the packets.
In fact, I would say: keep it simple! I you really are the only app in the system, and you have reasonable control over that system, you have nothing to gain from a complex IPC scheme and other tricks. Keeping it simple will help you produce better code with less defects and in less time, which actually means more time for testing.

Two threads as you've suggested would work. If you have a pipe() between them, then your calculating thread can provide packets as they are generated, while your comms thread uses select() to see if there is any new data. If not, then it just sends the last one from it's cache.
I may have over simplified the issue a little...

The suggestion to use a pair of threads sounds like it will do the trick, as long as the burden of performing the calculations is not too great.
Instead of using the pipe() as suggested by Cogsy, I would be inclined to use a mutex to lock a chunk of memory that you use to contain the output of your calculation thread - using it as a transfer area between the threads.
When your calculation thread is ready to output to the buffer it would grab the mutex, write to the transfer buffer and release the mutex.
When your transmit thread was ready to send a packet it would "try" to lock the mutex.
If it gets the lock, take a copy of the transfer buffer and send it.
If it doesn't get the lock, send the last copy.
You can control the priority of your process by using "nice" and specifying a negative adjustment figure to give it higher priority. Note that you will need to do this as superuser (either as root, or using 'sudo') to be able to specify negative values.
edit: Forgot to add - this is a good tutorial on pthreads on linux. Also describes the use of mutexes.

I didn't quite understand how hard is your 60 packets / sec requirement. Does a burst of 60 packets per second fill the requirement? Or is a sharp 1/60 second interval between each packet required?
This might go a bit out of topic, but another important issue is how you configure the Linux box. I would myself use a real-time Linux kernel and disable all unneeded services. Other wise there is a real risk that your application misses a packet at some time, regardless of what architecture you choose.
Any way, two threads should work well.

I posted this answer to illustrate a quite different approach to the "obvious" one, in the hope that someone discovers it to be exactly what they need. I didn't expect it to be selected as the best answer! Treat this solution with caution, because there are potential dangers and concurrency issues...
You can use the setitimer() system call to have a SIGALRM (alarm signal) sent to your program after a specified number of milliseconds. Signals are asynchronous events (a bit like messages) that interrupt the executing program to let a signal handler run.
A set of default signal handlers are installed by the OS when your program begins, but you can install a custom signal handler using sigaction().
So all you need is a single thread; use global variables so that the signal handler can access the necessary information and send off a new packet or repeat the last packet as appropriate.
Here's an example for your benefit:
#include <stdio.h>
#include <signal.h>
#include <sys/time.h>
int ticker = 0;
void timerTick(int dummy)
{
printf("The value of ticker is: %d\n", ticker);
}
int main()
{
int i;
struct sigaction action;
struct itimerval time;
//Here is where we specify the SIGALRM handler
action.sa_handler = &timerTick;
sigemptyset(&action.sa_mask);
action.sa_flags = 0;
//Register the handler for SIGALRM
sigaction(SIGALRM, &action, NULL);
time.it_interval.tv_sec = 1; //Timing interval in seconds
time.it_interval.tv_usec = 000000; //and microseconds
time.it_value.tv_sec = 0; //Initial timer value in seconds
time.it_value.tv_usec = 1; //and microseconds
//Set off the timer
setitimer(ITIMER_REAL, &time, NULL);
//Be busy
while(1)
for(ticker = 0; ticker < 1000; ticker++)
for(i = 0; i < 60000000; i++)
;
}

Two threads would work, you will need to make sure you lock your shared data structure through so the sending thread doesn't see it half way through an update.
60 per second doesn't sound too tricky.
If you are really concerned about scheduling, set the sending thread's scheduling policy to SCHED_FIFO and mlockall() its memory. That way, nothing will be able to stop it sending a packet (they could still go out late though if other things are being sent on the wire at the same time)
There has to be some tolerance of the device - 60 packets per second is fine, but what is the device's tolerance? 20 per second? If the device will fail if it doesn't receive one, I'd send them at three times the rate it requires.

I would stay away from threads and use processes and (maybe) signals and files. Since you say "bad things" may happen if you don't send, you need to avoid lock ups and race conditions. And that is easier to do with separate processes and data saved to files.
Something along the line of one process saving data to a file, then renaming it and starting anew. And the other process picking up the current file and sending its contents once per second.
Unlike Windows, you can copy (move) over the file while it's open.

Follow long-time Unix best practices: keep it simple and modular, decouple the actions, and let the OS do as much work for you as possible.
Many of the answers here are on the right track, but I think they can be even simpler:
Use two separate processes, one to create the data and write it to stdout, and one to read data from stdin and send it. Let the basic I/O libraries handle the data stream buffering between processes, and let the OS deal with the thread management.
Build the basic sender first using a timer loop and a buffer of bogus data and get it sending to the device at the right frequency.
Next make the sender read data from stdin - you can redirect data from a file, e.g. "sender < textdata"
Build the data producer next and pipe its output to the sender, e.g. "producer | sender".
Now you have the ability to create new producers as necessary without messing with the sender side. This answer assumes one-way communication.
Keeping the answer as simple as possible will get you more success, especially if you aren't very fluent in Linux/Unix based systems yet. This is a great opportunity to learn a new system, but don't over-do it. It is easy to jump to complex answers when the tools are available, but why use a bulldozer when a simple trowel is plenty. Mutex, semaphores, shared memory, etc, are all useful and available, but add complexity that you may not really need.

I agree with the the two thread approach. I would also have two static buffers and a shared enum. The sending thread should have this logic.
loop
wait for timer
grab mutex
check enum {0, 1}
send buffer 0 or 1 based on enum
release mutex
end loop
The other thread would have this logic:
loop
check enum
choose buffer 1 or 0 based on enum (opposite of other thread)
generate data
grab mutex
flip enum
release mutex
end loop
This way the sender always has a valid buffer for the entire time it is sending data. Only the generator thread can change the buffer pointer and it can only do that if a send is not in progress. Additionally, the enum flip should never take so many cycles as to delay the higher priority sender thread for very long.

Thanks everyone, I will be using everyones advice. I wish I could select more answers than 1!
For those that are curious. I dont have source for the device, its a propietary locked down system. I havent done enough testing to see how picky the 60 packets a second is yet. Thats all their limited docs say is "60 packets a second". Due to the nature of the device though, bursts of packets will be a bad thing. I think I will be able to get away with sending more than 60 a second to make up for the occasional missed packets..

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js