Implementing a real-time frequency spectrum for a beginner - C++

I want to develop an application that takes audio (.wav) as input and displays its frequency spectrum in real time as it plays. From what I have read on the subject, this requires a Fourier transform of the waveform. Can someone suggest where I should start? References and books would be welcome. I want to learn the details of implementing a real-time frequency spectrum rather than the GUI development, which I am quite familiar with (in C# and in C++).

There are already many libraries to do FFTs for you. No reason to reinvent the wheel. DirectX has an implementation but it might only be in the most recent version. Here's an open source C library for it.
If you want to understand the math behind it, here's a simple explanation and here's a complicated explanation.
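To make the math a bit more concrete, here is a minimal sketch of what a spectrum computation does for one frame of audio. It is a naive O(N^2) DFT with a Hann window, not an FFT and not what any of the libraries mentioned above do internally; the function name `magnitudeSpectrum` is made up for illustration. For real-time use you would swap the inner loops for a proper FFT library.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Returns the magnitudes of DFT bins 0..N/2 for one frame of samples.
// Bin k corresponds to the frequency k * sampleRate / N.
// Naive O(N^2) DFT for illustration only; an FFT gives the same result faster.
std::vector<double> magnitudeSpectrum(const std::vector<double>& frame) {
    const std::size_t n = frame.size();
    const double pi = std::acos(-1.0);
    std::vector<double> mags;
    if (n == 0) return mags;
    mags.reserve(n / 2 + 1);
    for (std::size_t k = 0; k <= n / 2; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            // Hann window: tapers the frame edges to reduce spectral leakage.
            const double w = (n > 1)
                ? 0.5 * (1.0 - std::cos(2.0 * pi * static_cast<double>(t) / static_cast<double>(n - 1)))
                : 1.0;
            const double angle = -2.0 * pi * static_cast<double>(k) * static_cast<double>(t)
                                 / static_cast<double>(n);
            sum += w * frame[t] * std::polar(1.0, angle);
        }
        mags.push_back(std::abs(sum));
    }
    return mags;
}
```

For a display you would call this on successive (usually overlapping) frames of the decoded audio and draw one bar per bin, often on a logarithmic scale.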

You should begin by opening the WAV file, extracting the audio stream and decoding it. There are third-party libraries to help with this.
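If you would rather see what such a library does, here is a rough sketch of reading the header yourself. It assumes an uncompressed PCM file in the usual RIFF/WAVE layout and works on bytes already loaded into memory; the names `WavInfo` and `parseWav` are made up for this example, and anything other than plain PCM is rejected rather than handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct WavInfo {
    std::uint16_t channels = 0;
    std::uint32_t sampleRate = 0;
    std::uint16_t bitsPerSample = 0;
    std::size_t dataOffset = 0;   // byte offset of the first sample
    std::uint32_t dataSize = 0;   // size of the sample data in bytes
};

// WAV fields are little-endian.
static std::uint16_t readLE16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
static std::uint32_t readLE32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walks the RIFF chunks and returns the format plus the location of the samples.
// Returns std::nullopt for anything that is not plain PCM in a RIFF/WAVE container.
std::optional<WavInfo> parseWav(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 12 || std::memcmp(bytes.data(), "RIFF", 4) != 0 ||
        std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) {
        return std::nullopt;
    }
    WavInfo info;
    bool haveFmt = false;
    std::size_t pos = 12;
    while (pos + 8 <= bytes.size()) {
        const std::uint8_t* chunk = bytes.data() + pos;
        const std::uint32_t size = readLE32(chunk + 4);
        if (std::memcmp(chunk, "fmt ", 4) == 0 && size >= 16 && pos + 24 <= bytes.size()) {
            if (readLE16(chunk + 8) != 1) return std::nullopt;  // format tag 1 == PCM
            info.channels = readLE16(chunk + 10);
            info.sampleRate = readLE32(chunk + 12);
            info.bitsPerSample = readLE16(chunk + 22);
            haveFmt = true;
        } else if (std::memcmp(chunk, "data", 4) == 0) {
            if (!haveFmt) return std::nullopt;
            info.dataOffset = pos + 8;
            info.dataSize = size;
            return info;
        }
        // Chunks are padded to an even number of bytes.
        pos += 8 + static_cast<std::size_t>(size) + (size & 1u);
    }
    return std::nullopt;
}
```

Once you have `dataOffset` and the format, decoding 16-bit PCM is just reading little-endian signed integers and scaling them to floating point.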

Take a look at FFTW.
As far as books go, the classic textbook on signal processing is Oppenheim and Schafer's Digital Signal Processing. It's college level, but it is quite thorough. You do need some knowledge of calculus in places.

One should understand a bit of the theory before going off and implementing an application to display something. Here are some free online resources on digital signal processing, which is the basis for understanding FFTs and frequency spectrums, and maybe how not to misuse them.
http://www.dspguide.com/pdfbook.htm
http://www.bores.com/courses/intro/index.htm
http://ccrma.stanford.edu/courses/320/Welcome.html
http://yehar.com/blog/?p=121/

Related

detect different sounds/sources in audio recording

I need some advice on an idea I've had for a uni project.
I was wondering if it's possible to split an audio file into different "streams" from different audio sources.
For example, split the audio file into: engine noise, train noise, voices, different sounds that are not there all the time, etc.
I wouldn't necessarily need to do this from a programming language (although that would be ideal); doing it manually with sound processing software like Sound Forge would work as well. I need to know if this is possible first, though. I know nothing about sound processing.
After the first stage is complete (separating the sounds), I want to determine whether one of the processed sounds exists in another audio recording. The purpose would be sound detection. For an (ideal) example, take the car engine sound, match it against another file, and determine whether that audio is a recording of a car's engine. It doesn't need to be THAT precise; I guess detecting a sound that is not constant, like a honk, would be alright as well.
I will do the programming part; I just need some pointers on what to look for (software, math, etc.). As I am no sound expert, this would be a really interesting project, if it's possible.
Thanks.
This problem of splitting sounds based on source is known in research as (Audio) Source Separation or Audio Signal Separation. If there is no more information about the sound sources or how they have been mixed, it is a Blind Source Separation problem. There are hundreds of papers on these topics.
However for the purpose of sound detection, it is not typically necessary to separate sounds at the audio level. Very often one can (and will) do detection on features computed on the mixed signal. Search literature for Acoustic Event Detection and Acoustic Event Classification.
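As a rough idea of what "features computed on the mixed signal" can look like, here is a sketch that computes two very simple per-frame features, RMS energy and zero-crossing rate. These two are my own choice for illustration (the literature typically uses richer spectral features), and `FrameFeatures` and `extractFeatures` are made-up names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct FrameFeatures {
    double rms;               // loudness of the frame
    double zeroCrossingRate;  // fraction of adjacent sample pairs that change sign
};

// Splits the signal into non-overlapping frames and computes two simple
// features per frame. A classifier would be trained on vectors like these.
std::vector<FrameFeatures> extractFeatures(const std::vector<double>& samples,
                                           std::size_t frameSize) {
    std::vector<FrameFeatures> out;
    if (frameSize < 2) return out;
    for (std::size_t start = 0; start + frameSize <= samples.size(); start += frameSize) {
        double energy = 0.0;
        std::size_t crossings = 0;
        for (std::size_t i = 0; i < frameSize; ++i) {
            const double x = samples[start + i];
            energy += x * x;
            if (i > 0 && (x >= 0.0) != (samples[start + i - 1] >= 0.0)) ++crossings;
        }
        out.push_back({std::sqrt(energy / static_cast<double>(frameSize)),
                       static_cast<double>(crossings) / static_cast<double>(frameSize - 1)});
    }
    return out;
}
```

A sudden jump in RMS with a high zero-crossing rate would, for example, be one crude cue for a short noisy event such as a honk.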
For an introduction to the subject, check out a book like Computational Analysis of Sound Scenes and Events.
It's extremely difficult to do automated source separation from a single audio stream. Your brain is uncannily good at this task, and it also benefits from a stereo signal.
For instance, voice is full of signals that aren't there all the time. Car noise has components that are quite stationary, but gear changes are outliers.
Unfortunately, there are no simple answers.
Correlate reference signals against the audio stream. Correlation can be done efficiently using FFTs. The output of the correlation calculation can be thresholded and 'debounced' in time for signal identification.
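A minimal sketch of that correlate, threshold and debounce pipeline is below. To keep it short it computes the normalized correlation directly in the time domain, which is O(N*M); the FFT-based version mentioned above gives the same scores faster. The function names and the hold-off style of debouncing are my own assumptions, not a standard API.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized cross-correlation of a reference clip against every position
// in the stream. Scores lie in [-1, 1]; 1 means a perfect (scaled) match.
std::vector<double> normalizedCorrelation(const std::vector<double>& stream,
                                          const std::vector<double>& ref) {
    std::vector<double> scores;
    const std::size_t m = ref.size();
    if (m == 0 || stream.size() < m) return scores;
    double refEnergy = 0.0;
    for (double r : ref) refEnergy += r * r;
    scores.reserve(stream.size() - m + 1);
    for (std::size_t lag = 0; lag + m <= stream.size(); ++lag) {
        double dot = 0.0, winEnergy = 0.0;
        for (std::size_t i = 0; i < m; ++i) {
            dot += stream[lag + i] * ref[i];
            winEnergy += stream[lag + i] * stream[lag + i];
        }
        const double denom = std::sqrt(refEnergy * winEnergy);
        scores.push_back(denom > 0.0 ? dot / denom : 0.0);
    }
    return scores;
}

// Thresholds the scores and debounces in time: once a score crosses the
// threshold, the best position within the next `holdoff` samples is reported
// and everything else inside that window is ignored.
std::vector<std::size_t> detect(const std::vector<double>& scores,
                                double threshold, std::size_t holdoff) {
    std::vector<std::size_t> hits;
    const std::size_t window = std::max<std::size_t>(holdoff, 1);
    std::size_t i = 0;
    while (i < scores.size()) {
        if (scores[i] >= threshold) {
            const std::size_t end = std::min(scores.size(), i + window);
            std::size_t best = i;
            for (std::size_t j = i + 1; j < end; ++j) {
                if (scores[j] > scores[best]) best = j;
            }
            hits.push_back(best);
            i = end;
        } else {
            ++i;
        }
    }
    return hits;
}
```

The threshold and hold-off length have to be tuned by experiment; real recordings will never match a reference as cleanly as a synthetic test does.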

Qt and sound processing

I need to increase the tempo of a voice in a sound file. (This effect should speed up playback but leave the original pitch.) Is there any way to do this with C++ and the Qt media library? Thanks.
Any links are welcome.
You may want to try SoundTouch.
Qt has hardly any sound processing capabilities, so nothing as advanced as what the original poster wants exists in it.
Qt cannot do what you want. Moreover, pitch-shifting is a computationally intensive task involving fast Fourier transforms and then some trickery to counteract the phase shifting. Couple FFTW3 with this excellent guide, and you can do it.

Audio Subtitle Transcription - C++

I'm on a project that among other video related tasks should eventually be capable of extracting the audio of a video and apply some kind of speech recognition to it and get a transcribed text of what's said on the video. Ideally it should output some kind of subtitle format so that the text is linked to a certain point on the video.
I was thinking of using the Microsoft Speech API (aka SAPI). But from what I could see it is rather difficult to use. The very few examples that I found for speech recognition (most are for Text-To-Speech, which is much easier) didn't perform very well (they don't recognize a thing). For example this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files that are supposed to define the words that the recognizer is waiting for but since I haven't trained the Windows Speech Recognition thoroughly I think that might be adulterating the results.
So my question is... what's the best tool for something like this? Could you provide both paid and free options? As far as I can tell, the best "free" option (as it comes with Windows) is SAPI; the rest are probably paid, but if they are really good they might be worth it. Also, if you have any good tutorials for using SAPI (or another API) in a context similar to this, that would be great.
On the whole this is a big ask!
The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what does each voice sound like). This might be possible in some cases, such as a TV series if you wanted to churn through hours of speech -separated for each character- to train it. There's a lot of work there though. For something like a film there's probably no hope of training a recogniser unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make your life difficult, pretty much spanning huge swathes of current speech technology research:
-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)
-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?
-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters?
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
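To illustrate the timecode side of such a lookup table, here is a small sketch that turns a frame index into an HH:MM:SS:FF string for a given whole-number frame rate. It assumes non-drop-frame counting, so it fits 24 fps cinema and 25 fps PAL directly but does not handle NTSC's 29.97 fps drop-frame scheme; `frameToTimecode` is a made-up name.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Converts a frame index to a non-drop-frame timecode "HH:MM:SS:FF".
// fps must be a whole number of frames per second (e.g. 24 or 25).
std::string frameToTimecode(std::uint64_t frame, unsigned fps) {
    if (fps == 0) return std::string();
    const std::uint64_t totalSeconds = frame / fps;
    const unsigned ff = static_cast<unsigned>(frame % fps);
    const unsigned ss = static_cast<unsigned>(totalSeconds % 60);
    const unsigned mm = static_cast<unsigned>((totalSeconds / 60) % 60);
    const unsigned long long hh = static_cast<unsigned long long>(totalSeconds / 3600);
    char buf[48];
    std::snprintf(buf, sizeof buf, "%02llu:%02u:%02u:%02u", hh, mm, ss, ff);
    return std::string(buf);
}
```

The same frame index maps to different timecodes at different frame rates, which is why the table has to record which timing format the video uses.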
--
There's a bunch of proprietary speech recognition systems out there. If you want the best you'll probably want to license a solution off one of the big boys like Nuance. If you want to keep things free the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.
--
The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon NaturallySpeaking are amazingly good, and Dragon has a SAPI interface for developers. But it's not a simple problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
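For the curious, perplexity has a simple definition: it is the exponential of the average negative log-probability the language model assigns to each word of a text. A small sketch, assuming you already have the per-word probabilities from some model (the function name is made up):

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Perplexity = exp( -(1/N) * sum_i ln p_i ), where p_i is the probability the
// language model assigned to the i-th word. Lower means the model is less
// "surprised", i.e. it has fewer plausible choices at each step.
double perplexity(const std::vector<double>& wordProbabilities) {
    if (wordProbabilities.empty()) return 0.0;
    double logSum = 0.0;
    for (double p : wordProbabilities) {
        if (p <= 0.0) return std::numeric_limits<double>::infinity();
        logSum += std::log(p);
    }
    return std::exp(-logSum / static_cast<double>(wordProbabilities.size()));
}
```

A model that always hesitates between four equally likely words has a perplexity of 4; a good language model for a narrow topic pushes that number down, which is what makes recognition easier.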
It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon NaturallySpeaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech engine or any other ASR engine that supports SAPI. IBM would be another vendor to look at, but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For (1), if you had a way of separating each actor onto a separate track, that would be ideal. But there's no reliable way of separating speakers automatically in a way that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (there is some free software out there for that) and separate based on that, but this is a sophisticated and error-prone task. The best thing would be hand editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
For music (2) you'd either have to hope for the best or try to filter it out. Speech is more bandlimited than music, so you could try a bandpass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs, but I would guess 100 Hz to 2-3 kHz would keep the speech intelligible.
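Here is one way such a bandpass could be sketched: a second-order high-pass at the low cutoff followed by a second-order low-pass at the high cutoff, using the widely circulated "Audio EQ Cookbook" biquad formulas with a Butterworth Q. The filter order, the names `Biquad`, `makeButterworth` and `bandPassSpeech`, and the default cutoffs are my assumptions for illustration; the cutoffs must stay below half the sample rate.

```cpp
#include <cmath>
#include <vector>

// One second-order IIR section (direct form I), coefficients normalized so a0 == 1.
struct Biquad {
    double b0 = 1.0, b1 = 0.0, b2 = 0.0, a1 = 0.0, a2 = 0.0;
    double inPrev1 = 0.0, inPrev2 = 0.0, outPrev1 = 0.0, outPrev2 = 0.0;

    double process(double x) {
        const double y = b0 * x + b1 * inPrev1 + b2 * inPrev2 - a1 * outPrev1 - a2 * outPrev2;
        inPrev2 = inPrev1; inPrev1 = x;
        outPrev2 = outPrev1; outPrev1 = y;
        return y;
    }
};

// Second-order Butterworth low-pass or high-pass (Q = 1/sqrt(2)).
Biquad makeButterworth(double cutoffHz, double sampleRate, bool highPass) {
    const double pi = std::acos(-1.0);
    const double w0 = 2.0 * pi * cutoffHz / sampleRate;
    const double alpha = std::sin(w0) / (2.0 * std::sqrt(0.5));
    const double c = std::cos(w0);
    const double a0 = 1.0 + alpha;
    Biquad f;
    if (highPass) {
        f.b0 = (1.0 + c) / 2.0 / a0;
        f.b1 = -(1.0 + c) / a0;
    } else {
        f.b0 = (1.0 - c) / 2.0 / a0;
        f.b1 = (1.0 - c) / a0;
    }
    f.b2 = f.b0;
    f.a1 = -2.0 * c / a0;
    f.a2 = (1.0 - alpha) / a0;
    return f;
}

// Attenuates everything outside [lowHz, highHz] by cascading the two sections.
std::vector<double> bandPassSpeech(const std::vector<double>& input, double sampleRate,
                                   double lowHz = 100.0, double highHz = 3000.0) {
    Biquad hp = makeButterworth(lowHz, sampleRate, true);
    Biquad lp = makeButterworth(highHz, sampleRate, false);
    std::vector<double> out;
    out.reserve(input.size());
    for (double x : input) out.push_back(lp.process(hp.process(x)));
    return out;
}
```

Second-order sections roll off gently (about 12 dB per octave), so music that overlaps the voice band will still get through; this only removes what lies clearly outside it.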
For (3), there's no solution. The ASR engine should return confidence scores so at best I would say if you can tag low scores, you could then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.

3d max integration with c++, Cal3D where to start?

Okay, I'm making a game using C++ (for the engine) and OpenGL. I've had lots of trouble using the Cal3D library to import my 3ds Max models into my C++ project.
As a matter of fact, I don't know where to even start. I can't find any decent guide, and their documentation is really poor. I've been searching and trying things for over a month, but I still don't understand the file structure it uses.
I really need some help. Are there any other libraries? Any decent guide I can use? I'm stuck.
Thanks a lot.
Rather than write your own exporter, consider using one of the built-in exporters for FBX, COLLADA, Crosswalk (.XSI), the Quake/Doom3 .MD3/.MD4 format, or even OBJ. It'll be much easier to parse the resulting file format on your end than to write and maintain a brand-new exporter.
Max is a complete pain for any kind of scripting or plugin. I'd suggest using Maya instead if at all possible. You'll get better results for animation and rigging, too. I know it's not a direct answer to your question, but part of the problem is that the info for stuff like this is not easy to come by.

How to get started with game programming using VC++,C++,DirectX quickly?

Hi, I am working in VC++, I am quite interested in game programming, and I have a few queries.
1) What must one know before starting game programming?
2) Can anybody give me info on resources such as tutorials, links, etc. that would help me start as fast as possible?
3) Also, can you give me info on some good books on game programming?
Any help would be greatly appreciated.
Before you start programming you must have a good understanding of the language, how to program and how to structure and test your code. Oh, and a huge amount of either patience or free time. On the maths front, Vectors, Matrices and Quaternions are the main things I found I needed.
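To give a flavour of the maths involved, here is a small sketch of rotating a vector with a quaternion, which is the kind of thing you end up writing (or at least needing to understand) early on. The types and function names are made up for the example; in practice you would probably use the maths types of whatever engine or library you pick.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };
struct Quat { double w, x, y, z; };

// Builds a unit quaternion representing a rotation of `radians` about `axis`.
Quat fromAxisAngle(const Vec3& axis, double radians) {
    const double len = std::sqrt(axis.x * axis.x + axis.y * axis.y + axis.z * axis.z);
    if (len == 0.0) return {1.0, 0.0, 0.0, 0.0};  // no axis: identity rotation
    const double s = std::sin(radians / 2.0) / len;
    return {std::cos(radians / 2.0), axis.x * s, axis.y * s, axis.z * s};
}

// Hamilton product a * b.
Quat multiply(const Quat& a, const Quat& b) {
    return {a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
            a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
            a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
            a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w};
}

// Rotates v by the unit quaternion q using q * v * conjugate(q).
Vec3 rotate(const Quat& q, const Vec3& v) {
    const Quat p{0.0, v.x, v.y, v.z};
    const Quat conj{q.w, -q.x, -q.y, -q.z};
    const Quat r = multiply(multiply(q, p), conj);
    return {r.x, r.y, r.z};
}
```

For example, rotating the x axis by 90 degrees about the z axis with these functions gives the y axis.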
The other thing that often goes overlooked when a programmer starts writing a game is someone to create the assets. Preferably someone specialized in it.
You mention DirectX, which is not actually a fast way to go as you have to build everything from square one, which means a lot more maths, performance testing and overall handwork. I would suggest at least a rendering engine like Ogre3D. There are plenty of tutorials and a very good community.
There is a good post here on why you should write games not engines.
The main reason you would want to use DirectX is to enhance your understanding of the lower levels, all the things an engine is abstracting for you. While I think this is a good thing to do, I wouldn't want to do it for a major or first project.
The main site I used for help was gamedev.net, although I also found some interesting articles on Gamasutra.
It takes time and requires a lot of patience. And a playable game is more than just working C++ code.
gamedev.net.
First, download Visual C# Express Edition, and then download XNA Game Studio 3.1.
After that, check out the XNA Creators Club - that has lots of help to get you up and running quickly.
Are you 100% dedicated to C++? If not, I would recommend starting with XNA/C# instead. DirectX will force you to spend a lot of time up front learning API calls before you ever get something on the screen. XNA will allow you to start coding your game very quickly while getting immediate feedback while you program.
If you are committed to C++, I would recommend Beginning Game Programming by Jonathan Harbour. He starts with an easy to understand framework that won't take long to pick up. Remember that to use DirectX you will have to learn win32, and low level DirectX code.
For tutorials, try googling "c++ beginning game programming tutorial". Gamedev.net will be another invaluable resource. Go to the "For Beginners" forum and look through the stickies.
As for what you must know, it depends on your aspirations and your choice of tools. As a beginner, you will want to start small and in 2D or text games. To get a Pong game going in XNA, you only need to have basic C# skills and basic collision detection. To get a Pong game going in DirectX, you will need to understand win32 code, and a ton of device calls. To do a console text game, you only need to know basic C++ and maybe some basic gameflow techniques.
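The "basic collision detection" a Pong clone needs really is basic. Here is a sketch in C++ using axis-aligned rectangles; the `Rect` and `Ball` types and the simple "flip the horizontal velocity" response are my own simplifications, not part of XNA or DirectX.

```cpp
struct Rect { float x, y, w, h; };  // top-left corner plus size

// Two axis-aligned rectangles overlap when they overlap on both axes.
bool overlaps(const Rect& a, const Rect& b) {
    return a.x < b.x + b.w && b.x < a.x + a.w &&
           a.y < b.y + b.h && b.y < a.y + a.h;
}

struct Ball {
    Rect box;
    float vx, vy;  // velocity in units per second
};

// Advances the ball by dt seconds and bounces it off the paddle.
// A real game would also push the ball out of the paddle so it cannot stick.
void stepBall(Ball& ball, const Rect& paddle, float dt) {
    ball.box.x += ball.vx * dt;
    ball.box.y += ball.vy * dt;
    if (overlaps(ball.box, paddle)) ball.vx = -ball.vx;
}
```

Call `stepBall` once per frame for each paddle, add a check against the top and bottom walls, and you have most of the game logic.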
If using DirectX is not a fixed requirement, you should consider OpenGL, and use a library like SFML or Allegro to handle all the basic stuff.
http://www.talula.demon.co.uk/allegro/
http://www.sfml-dev.org/