Detect different sounds/sources in an audio recording - C++

I need some advice on an idea that I've had for a university project.
I was wondering if it's possible to split an audio file into different "streams" from different audio sources.
For example, split the audio file into: engine noise, train noise, voices, different sounds that are not there all the time, etc.
I wouldn't necessarily need to do this from a programming language (although that would be ideal); doing it manually with sound-processing software like Sound Forge would work too. I need to know whether this is possible at all first, though. I know nothing about sound processing.
After the first stage is complete (separating the sounds), I want to determine whether one of the processed sounds exists in another audio recording. The purpose would be sound detection. For (an ideal) example, take the car engine sound, match it against another file, and determine whether that audio is a recording of a car's engine or not. It doesn't need to be THAT precise; I guess detecting a sound that is not constant, like a honk, would be alright as well.
I will do the programming part; I just need some pointers on what to look for (software, math, etc.). As I am no sound expert, this would be a really interesting project, if it's possible.
Thanks.

This problem of splitting sounds based on source is known in research as (Audio) Source Separation or Audio Signal Separation. If there is no more information about the sound sources or how they have been mixed, it is a Blind Source Separation problem. There are hundreds of papers on these topics.
However, for the purpose of sound detection it is typically not necessary to separate the sounds at the audio level. Very often one can (and will) do detection on features computed from the mixed signal. Search the literature for Acoustic Event Detection and Acoustic Event Classification.
For an introduction to the subject, check out a book like Computational Analysis of Sound Scenes and Events.
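To make "features computed from the mixed signal" concrete, here is a minimal, self-contained C++ sketch of two classic frame-level features, short-time energy and zero-crossing rate. It is only a starting point; real event detectors usually use richer features such as MFCCs, and the frame and hop sizes below are common defaults, not values prescribed by the book or the question.

    // Two simple frame-level features often used as a first step in
    // acoustic event detection: short-time energy and zero-crossing rate.
    #include <cstddef>
    #include <vector>

    struct FrameFeatures { float energy; float zcr; };

    std::vector<FrameFeatures> extractFeatures(const std::vector<float>& samples,
                                               size_t frameLen = 1024,
                                               size_t hop = 512)
    {
        std::vector<FrameFeatures> feats;
        for (size_t start = 0; start + frameLen <= samples.size(); start += hop) {
            float energy = 0.0f;
            size_t crossings = 0;
            for (size_t i = 0; i < frameLen; ++i) {
                float s = samples[start + i];
                energy += s * s;
                // Count sign changes between consecutive samples.
                if (i > 0 && (s >= 0.0f) != (samples[start + i - 1] >= 0.0f))
                    ++crossings;
            }
            feats.push_back({ energy / frameLen,
                              (float)crossings / (frameLen - 1) });
        }
        return feats;
    }

A classifier (even a simple threshold) operating on sequences of such features can often detect events like a honk without ever separating the sources.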

It's extremely difficult to do automated source separation from a single audio stream. Your brain is uncannily good at this task, and it also benefits from a stereo signal.
For instance, voice is full of signals that aren't there all the time. Car noise has components that are quite stationary, but gear changes are outliers.
Unfortunately, there are no simple answers.

Correlate reference signals against the audio stream. Correlation can be done efficiently using FFTs. The output of the correlation calculation can be thresholded and 'debounced' in time for signal identification.
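As a concrete illustration of the FFT approach, here is a minimal sketch of cross-correlation using FFTW (the function name, the single-precision choice, and FFTW itself are my choices; any FFT library works). Zero-padding both signals to a common length makes the FFT's circular correlation equal to the linear one; you would then threshold the output peaks and debounce them in time, as described above.

    // FFT-based cross-correlation: IFFT( FFT(signal) * conj(FFT(reference)) ).
    #include <fftw3.h>
    #include <algorithm>
    #include <complex>
    #include <cstddef>
    #include <vector>

    // Returns corr[k] = sum_n signal[n + k] * reference[n] for k = 0..N-1
    // (negative lags wrap around to the end of the output).
    std::vector<float> crossCorrelate(const std::vector<float>& signal,
                                      const std::vector<float>& reference)
    {
        const size_t n = signal.size() + reference.size() - 1;  // zero-padded length
        std::vector<float> a(n, 0.0f), b(n, 0.0f), out(n);
        std::copy(signal.begin(), signal.end(), a.begin());
        std::copy(reference.begin(), reference.end(), b.begin());

        const size_t nc = n / 2 + 1;                 // length of real-to-complex output
        std::vector<std::complex<float>> A(nc), B(nc);

        fftwf_plan pa = fftwf_plan_dft_r2c_1d((int)n, a.data(),
                            reinterpret_cast<fftwf_complex*>(A.data()), FFTW_ESTIMATE);
        fftwf_plan pb = fftwf_plan_dft_r2c_1d((int)n, b.data(),
                            reinterpret_cast<fftwf_complex*>(B.data()), FFTW_ESTIMATE);
        fftwf_execute(pa);
        fftwf_execute(pb);

        // Multiply by the conjugate of the reference spectrum.
        for (size_t i = 0; i < nc; ++i)
            A[i] *= std::conj(B[i]);

        fftwf_plan pi = fftwf_plan_dft_c2r_1d((int)n,
                            reinterpret_cast<fftwf_complex*>(A.data()),
                            out.data(), FFTW_ESTIMATE);
        fftwf_execute(pi);

        for (float& v : out) v /= (float)n;          // undo FFTW's scaling

        fftwf_destroy_plan(pa);
        fftwf_destroy_plan(pb);
        fftwf_destroy_plan(pi);
        return out;
    }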

Related

Recognition of an animal in pictures

I am facing a challenging problem. In the courtyard of the company I work for, there is a camera trap which takes a photo of every movement. In some of these pictures there are different kinds of animals (mostly dark gray mice) that cause damage to our cable system. My idea is to use some application that could recognize whether there is a gray mouse in the picture or not, ideally in real time. So far we have developed a solution that sends an alarm for every movement, but most of the alarms are false. Could you give me some pointers on possible ways to solve this problem?
In technical parlance, what you describe above is often called event detection. I know of no ready-made approach that solves all of this at once, but with a little bit of programming you should be all set, even if you don't want to code any computer-vision algorithms yourself.
The high-level pipeline would be:
Making sure that your video is of sufficient quality. Gray mice sound kind of tough, plus the pictures are probably taken at night, so you should have sufficient infrared lighting etc. But if a human can tell from the picture whether an alarm is false or true, you should be fine.
Deploying motion detection and taking snapshot images at the time of movement. It seems like you have this part already worked out, great! Detailing your setup could benefit others. You may also need to crop only the area in motion from the image; are you doing that? (A sketch of this step appears further down this answer.)
Building an archive of images, including your decision of whether they are false or true alarms (labels, in machine learning parlance). Try to gather at least a few tens of example images for both cases, and make them representative of real-world variations (do you have the problem during daytime as well? Is there snowfall in your region?).
Classifying the images taken from the video-stream snapshots to check whether it's a false alarm or contains bad critters eating cables. This sounds tough, but machine learning is advancing by leaps and bounds; you can either:
deploy your own neural network built in a framework like Caffe or TensorFlow (but you will likely need a lot of examples, at least tens of thousands I'd say)
use an image classification API that recognizes general objects, like Clarifai or Imagga - if you are lucky, it will notice that the snapshots show a mouse or a squirrel (do squirrels chew on cables?), but it is likely that on a specialized task like this one, these engines will get pretty confused!
use a custom image-classification API service, which is typically even more powerful than rolling your own neural network, since it can use a lot of tricks to sort out these images even if you give it just a small number of examples for each image category (false/true alarm here); vize.it is a perfect example of that (can anyone contribute more such services?).
The real-time aspect is a bit open-ended, as neural networks take some time to process an image. You also need to account for data transfer etc. when using a public API, and if you roll out your own, you will need to spend a lot of effort to get low latency, since the frameworks are by default optimized for throughput (batch prediction). Generally, if you are happy with ~1 s latency and have a good internet uplink, you should be fine with any service.
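For the motion-detection-and-cropping step mentioned above, here is a minimal sketch assuming OpenCV is available; the camera index, shadow threshold, minimum blob area, and output file names are placeholder choices, not details from the question.

    // Background subtraction + crop of the moving region, using OpenCV.
    #include <opencv2/opencv.hpp>
    #include <string>
    #include <vector>

    int main()
    {
        cv::VideoCapture cap(0);                        // camera index is a placeholder
        auto bg = cv::createBackgroundSubtractorMOG2(); // marks shadows as gray (127)
        cv::Mat frame, mask;
        int snapshotId = 0;

        while (cap.read(frame)) {
            bg->apply(frame, mask);
            // Keep only confident foreground (255), dropping shadow pixels.
            cv::threshold(mask, mask, 200, 255, cv::THRESH_BINARY);

            std::vector<std::vector<cv::Point>> contours;
            cv::findContours(mask, contours, cv::RETR_EXTERNAL,
                             cv::CHAIN_APPROX_SIMPLE);
            for (const auto& c : contours) {
                if (cv::contourArea(c) < 500) continue; // ignore tiny blobs/noise
                cv::Rect box = cv::boundingRect(c);
                // In practice you would rate-limit these writes.
                cv::imwrite("snapshot_" + std::to_string(snapshotId++) + ".png",
                            frame(box));                // crop just the moving region
            }
        }
    }

The cropped snapshots are then what you would feed to the classifier in step 4.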
Disclaimer: I'm one of the co-creators of vize.it.
How about getting a cat?
Also, you could train your own custom classifier using the IBM Watson Visual Recognition service (demo: https://visual-recognition-demo.mybluemix.net/train ). It's free to try, and you just need to supply example images for the different categories you want to identify. Overall, Petr's answer is excellent.

Audio Subtitle Transcription - C++

I'm on a project that, among other video-related tasks, should eventually be capable of extracting the audio of a video and applying some kind of speech recognition to it, to get a transcribed text of what's said in the video. Ideally it should output some kind of subtitle format so that the text is linked to a certain point in the video.
I was thinking of using the Microsoft Speech API (aka SAPI). But from what I could see, it is rather difficult to use. The very few examples that I found for speech recognition (most are for text-to-speech, which is much easier) didn't perform very well (they don't recognize a thing). For example, this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files that are supposed to define the words the recognizer is waiting for, but since I haven't trained the Windows Speech Recognition thoroughly, I think that might be skewing the results.
So my question is... what's the best tool for something like this? Could you provide both paid and free options? I believe the best "free" option (as it comes with Windows) is SAPI; all the rest would be paid, but if they are really good it might be worth it. Also, if you have any good tutorials for using SAPI (or another API) in a context similar to this, that would be great.
On the whole this is a big ask!
The issue with any speech-recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what each voice sounds like). This might be possible in some cases, such as a TV series, if you wanted to churn through hours of speech (separated for each character) to train it. There's a lot of work there, though. For something like a film, there's probably no hope of training a recognizer unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make your life difficult, pretty much spanning huge swathes of current speech-technology research:
-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)
-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?
-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters?
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
--
There are a bunch of proprietary speech-recognition systems out there. If you want the best, you'll probably want to license a solution from one of the big boys like Nuance. If you want to keep things free, the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.
--
The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W, there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon NaturallySpeaking are amazingly good, and that has a SAPI interface for developers. But it's not such a simple problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
It would be interesting, though, to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon NaturallySpeaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech engine or any other ASR engine that supports SAPI. IBM would be another vendor to look at, but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For 1), if you had a way of separating each actor onto a separate track, that would be ideal. But there's no reliable way of separating speakers automatically that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (there's some free software out there for that) and separate based on that, but this is a sophisticated and error-prone task. The best thing would be hand-editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
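For what it's worth, a naive time-domain pitch estimator based on autocorrelation looks like the sketch below. The search range and peak threshold are illustrative guesses, and real systems use more robust methods (YIN, for example), so treat this only as a demonstration of the idea.

    // Naive autocorrelation pitch estimator for one mono frame.
    #include <cstddef>
    #include <vector>

    // Returns the estimated pitch in Hz, or 0 if nothing convincing is found.
    float estimatePitch(const std::vector<float>& frame, float sampleRate,
                        float minHz = 70.0f, float maxHz = 400.0f)
    {
        size_t minLag = (size_t)(sampleRate / maxHz);
        size_t maxLag = (size_t)(sampleRate / minHz);
        if (maxLag >= frame.size() || minLag == 0) return 0.0f;

        float bestVal = 0.0f;
        size_t bestLag = 0;
        for (size_t lag = minLag; lag <= maxLag; ++lag) {
            float sum = 0.0f;
            for (size_t i = 0; i + lag < frame.size(); ++i)
                sum += frame[i] * frame[i + lag];   // correlate with shifted self
            if (sum > bestVal) { bestVal = sum; bestLag = lag; }
        }

        // Require the peak to stand out against the zero-lag energy.
        float energy = 0.0f;
        for (float s : frame) energy += s * s;
        if (bestLag == 0 || bestVal < 0.3f * energy) return 0.0f;
        return sampleRate / (float)bestLag;
    }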
For music (2), you'd either have to hope for the best or try to filter it out. Speech is more band-limited than music, so you could try a bandpass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs, but I would guess 100 Hz to 2-3 kHz would keep the speech intelligible.
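A minimal sketch of such a filter, using the standard biquad bandpass from the RBJ Audio EQ Cookbook. The centre frequency and Q below are illustrative only, and a single biquad rolls off slowly, so in practice you might cascade a high-pass and a low-pass for sharper cutoffs.

    // Biquad bandpass (RBJ audio EQ cookbook, constant peak gain).
    #include <cmath>

    struct Biquad {
        float b0, b1, b2, a1, a2;               // normalized coefficients
        float x1 = 0, x2 = 0, y1 = 0, y2 = 0;   // filter state

        // Bandpass centred at f0 with quality factor Q.
        static Biquad bandpass(float f0, float Q, float sampleRate) {
            const float kPi = 3.14159265358979f;
            float w0 = 2.0f * kPi * f0 / sampleRate;
            float alpha = std::sin(w0) / (2.0f * Q);
            float a0 = 1.0f + alpha;
            Biquad f;
            f.b0 =  alpha / a0;
            f.b1 =  0.0f;
            f.b2 = -alpha / a0;
            f.a1 = -2.0f * std::cos(w0) / a0;
            f.a2 = (1.0f - alpha) / a0;
            return f;
        }

        // Direct form I, one sample at a time.
        float process(float x) {
            float y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x;
            y2 = y1; y1 = y;
            return y;
        }
    };

    // Usage sketch: a low Q gives a broad band around the centre frequency.
    //   Biquad f = Biquad::bandpass(800.0f, 0.5f, 44100.0f);
    //   for (float& s : samples) s = f.process(s);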
For (3), there's no solution. The ASR engine should return confidence scores so at best I would say if you can tag low scores, you could then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.

Implementing Real Time frequency spectrum for a beginner

I want to develop an application that takes audio (.wav) as input and displays its frequency spectrum in real time. From what I have read on the subject, this requires a Fourier transform of the waves. Can someone suggest where I should start? Possible references and books? I want to learn the details of implementing a real-time frequency spectrum rather than the development of the GUI, which I am quite familiar with (in C# and in C++).
There are already many libraries to do FFTs for you; no reason to reinvent the wheel. DirectX has an implementation, but it might only be in the most recent version. Here's an open-source C library for it.
If you want to understand the math behind it, here's a simple explanation and here's a complicated explanation.
You should begin by opening the wav file, extracting the audio stream, and decoding it. There are third-party libraries to help with this.
Take a look at FFTW.
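As a concrete starting point, here is a sketch of computing the magnitude spectrum of one block of samples with FFTW. The function name, the Hann window, and the single-precision choice are mine, not anything FFTW mandates; for a live display you would call this repeatedly on successive (usually overlapping) blocks.

    // Magnitude spectrum of one block of samples, using FFTW.
    #include <fftw3.h>
    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    std::vector<float> magnitudeSpectrum(std::vector<float> block)
    {
        const size_t n = block.size();
        const float kPi = 3.14159265358979f;

        // Hann window to reduce spectral leakage.
        for (size_t i = 0; i < n; ++i)
            block[i] *= 0.5f * (1.0f - std::cos(2.0f * kPi * i / (n - 1)));

        std::vector<std::complex<float>> spec(n / 2 + 1);
        fftwf_plan plan = fftwf_plan_dft_r2c_1d((int)n, block.data(),
                              reinterpret_cast<fftwf_complex*>(spec.data()),
                              FFTW_ESTIMATE);
        fftwf_execute(plan);
        fftwf_destroy_plan(plan);

        std::vector<float> mags(n / 2 + 1);
        for (size_t i = 0; i < mags.size(); ++i)
            mags[i] = std::abs(spec[i]);
        return mags;   // bin i corresponds to frequency i * sampleRate / n
    }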
As far as books go, the classic textbook on signal processing is Oppenheim and Schafer's Digital Signal Processing. It's college-level, but it is quite thorough. You do need some knowledge of calculus in places.
One should understand a bit of the theory before going off and implementing an application to display something. Here are some free online resources on digital signal processing, which is the basis for understanding FFTs and frequency spectra, and maybe how not to misuse them.
http://www.dspguide.com/pdfbook.htm
http://www.bores.com/courses/intro/index.htm
http://ccrma.stanford.edu/courses/320/Welcome.html
http://yehar.com/blog/?p=121/

Receive mic input and process

I am writing a small program in C++ that receives mic input and does some simple live audio processing. I have been looking around, and the only things I have found that work on Linux are PortAudio, QAudioInput, and FMOD.
I am trying to stay away from any super low level programming and use a minimal amount of lines.
Which one of these would fit my needs best?
Check out JUCE. JUCE builds on many platforms and does a lot more than just audio, but it was made with audio programmers in mind. Look at the JUCE demo application and then just chop up the source code from the audio demo to suit your needs. The API documentation is really good, too, and the abstraction from the low-level stuff is good.
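Since PortAudio is on your shortlist, here is roughly what the minimal capture loop looks like with it. Error checking is omitted, and the sample rate, buffer size, and the trivial "peak level" processing are placeholders for whatever you actually need.

    // Minimal PortAudio capture: open the default mic, process in a callback.
    #include <portaudio.h>
    #include <cstdio>

    static int onAudio(const void* input, void* /*output*/,
                       unsigned long frameCount,
                       const PaStreamCallbackTimeInfo*, PaStreamCallbackFlags,
                       void* /*userData*/)
    {
        const float* in = static_cast<const float*>(input);
        float peak = 0.0f;
        for (unsigned long i = 0; i < frameCount; ++i)
            if (in[i] > peak) peak = in[i];      // trivial "processing": peak level
        std::printf("peak: %f\n", peak);         // fine for a demo, see note below
        return paContinue;
    }

    int main()
    {
        Pa_Initialize();
        PaStream* stream = nullptr;
        Pa_OpenDefaultStream(&stream, 1 /*mono in*/, 0 /*no output*/,
                             paFloat32, 44100.0, 256, onAudio, nullptr);
        Pa_StartStream(stream);
        Pa_Sleep(5000);                          // capture for five seconds
        Pa_StopStream(stream);
        Pa_CloseStream(stream);
        Pa_Terminate();
    }

One design note: the callback runs on a real-time audio thread, so printing there is acceptable for a quick demo, but real processing code should hand samples off to another thread (for example via a lock-free ring buffer).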

Video Game Bots? [closed]

Something I've always wondered, especially since it inspired me to start programming when I was a kid, is how video game bots work. I'm sure there are a lot of different methods, but what about automation for MMORPGs? Or even FPS-type bots?
I'm talking about player-made automation bots.
To 'bot' a game, you need to be able to do two things programmatically: detect what's going on in the game, and provide input to the game.
Detecting what's going on in the game tends to be the harder of the two. A few methods for doing this are:
Screen-Scraping This technique captures the image on the screen and parses it, looking for things like enemies, player status, power-ups, game messages, time clocks, etc. This tends to be a particularly difficult method. OCR techniques can be used to process text, but if the text is written on top of the game world (instead of on a UI element with a solid background), the ever-changing backdrop can make it difficult to get accurate and consistent results. Finding non-text objects on the screen can be even more difficult, especially in 3D worlds, because of the many different positions and orientations in which a single object may exist. (A template-matching sketch appears after this list.)
Audio Cues In some games, actions and events are accompanied by unique sound effects. It is possible to detect these events by monitoring the audio output of the game and matching it against a recording of the associated sound effect. Some games allow the player to provide their own sound effects for events, which allows the use of sound effects that are designed to be easy to listen for and filter out.
Memory Monitoring If the internal workings of the game are well understood, then you can monitor the state of a game by inspecting the game's memory space. Some cheat tools for console systems (such as the Game Genie) use this method. By detecting what memory the game updates, it is possible to detect what the game is doing. Some games randomize the memory locations they use each time they are launched in an attempt to foil this vulnerability.
Packet Analysis With appropriate drivers, you can intercept the game's data packets as they are sent to or retrieved from your network card (for games played online). Analysis of these packets can reveal what your game client is communicating to the server, which usually revolves around player/enemy actions.
Game Scripting Some games have a built-in scripting interface. If available, this is usually the easiest method because it is something the game software is designed to do (the previous methods would all typically count as "hacks"). Some scripts must be run in-game (through a console or through an add-on system) and some can be run by external programs that communicate with the game via a published API.
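As promised under Screen-Scraping above, here is a minimal template-matching sketch, assuming OpenCV. The file names and the 0.8 confidence threshold are placeholders, and plain template matching only works when the target appears at a fixed scale and orientation, which is why it suits 2D UI elements better than 3D objects.

    // Find a known sprite in a screenshot via template matching (OpenCV).
    #include <opencv2/opencv.hpp>
    #include <cstdio>

    int main()
    {
        cv::Mat screen = cv::imread("screenshot.png");   // captured frame (placeholder)
        cv::Mat sprite = cv::imread("enemy_sprite.png"); // reference image (placeholder)
        if (screen.empty() || sprite.empty()) return 1;  // images not found

        cv::Mat result;
        cv::matchTemplate(screen, sprite, result, cv::TM_CCOEFF_NORMED);

        double maxVal; cv::Point maxLoc;
        cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);

        if (maxVal > 0.8)    // confidence threshold is an illustrative guess
            std::printf("found at (%d, %d), score %.2f\n",
                        maxLoc.x, maxLoc.y, maxVal);
    }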
Generating input events back into the game is typically the easier task. Some methods include:
Memory "Poking" Similar to the memory monitoring section above, memory poking is the act of writing data directly into the game's memory space. This is the method used by the Game Genie for applying its cheat codes. Given the complexity of modern games, this is a very difficult task and can potentially crash the entire game.
Input Emulation "Fake" keyboard or mouse signals can be generated in lieu of direct human interaction. This can be done in software using tools such as AutoIt. Hardware hacks can also be used, such as devices that connect to the computer's USB or PS/2 port and appear to the system to be a keyboard, but instead generate fake keypress events based on signals received from the computer (for instance, over a serial port). These methods can be harder for games to detect.
Game Scripting As mentioned above, some games provide built-in methods for controlling it programmatically, and taking advantage of those tools is usually the easiest (but perhaps not the most powerful) technique.
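For the software route to input emulation mentioned above, here is a minimal Windows sketch using the Win32 SendInput API; the three-second delay and the choice of the space key are arbitrary.

    // Synthesize a key press/release with the Win32 SendInput API.
    #include <windows.h>

    void tapKey(WORD vk)
    {
        INPUT inputs[2] = {};
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = vk;                   // key down
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = vk;
        inputs[1].ki.dwFlags = KEYEVENTF_KEYUP;  // key up
        SendInput(2, inputs, sizeof(INPUT));
    }

    int main()
    {
        Sleep(3000);        // time to bring the game window into focus
        tapKey(VK_SPACE);
    }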
Note that running a 'bot' in a game is usually a violation of the game's Terms of Use and can get you suspended, banned, or worse. In some jurisdictions, this may carry criminal penalties. This is another plus for using a game's built-in scripting capabilities: if it's designed to be a part of the game software, then the game publisher is most likely not going to prohibit you from using it.
Once I wrote a simple MMORPG bot myself. I used AutoHotkey.
It provides lots of methods to simulate user input; one of them will work. It's tedious to program a working one in C++ by yourself (or look into AutoHotkey's source).
It can directly search the screen for pixel patterns, even game screens (DirectX)
So what I did was search the screen for the name of an enemy (stored as a picture with the game's font), and the script clicks a few pixels below it to attack. It also tracks the health bar and pots (drinks a potion) if it is too low.
Very trivial. But I know of a WoW bot that was also made using AutoHotkey. And I see lots of other people had the same idea (mine was not for WoW, but probably illegal, too).
More advanced techniques do not capture the screen but directly read the game's memory. You have to do a lot of reverse engineering to make this work. And it stops working when the game is updated.
How does an individual person go about their day to day?
This is sort of the problem that AIs in games solve.
What do you want your entity to do? Code your entity to do that. If you want your monster to chase the player's avatar, the monster just needs to face the avatar and then move toward it. When that monster gets within a suitable distance, it can choose to bite the player avatar, and this choice can be as simple as AmICloseEnough(monster, player); or more complex or even random.
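A toy version of that chase-and-bite logic might look like the sketch below; the entity type, range, and speed constants are made up for illustration, and bite() stands in for whatever action your game defines.

    // Toy chase-and-bite monster update, called once per game tick.
    #include <cmath>

    struct Entity { float x, y; };

    const float kBiteRange = 1.5f;   // illustrative values
    const float kSpeed     = 0.1f;

    void updateMonster(Entity& monster, const Entity& player)
    {
        float dx = player.x - monster.x;
        float dy = player.y - monster.y;
        float dist = std::sqrt(dx * dx + dy * dy);

        if (dist <= kBiteRange) {
            // bite(monster, player);           // hypothetical game action
        } else if (dist > 0.0f) {
            monster.x += kSpeed * dx / dist;    // face the avatar, step toward it
            monster.y += kSpeed * dy / dist;
        }
    }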
Bots in an FPS are tricky to get right because it's easy to make them perfect but not so easy to make them fun. E.g. they always know exactly where the player is (gPlayer.GetPosition()) so it's easy to shoot the player in the head every time. It takes a bit of "art" to make the bot move like a human would.
For FPS-style bots, you could take a look at the Unreal Development Kit. As I understand it, this has got a lot of the actual game source code.
http://udn.epicgames.com/Three/DevelopmentKitHome.html
bta gave a very good reply. I just wanted to add that the different methods are susceptible to different means of detection by the gaming company. Hacking into the game client via memory monitoring or packet analysis is generally more easily detectable. I generally don't recommend it, since you can get caught very easily.
Screen-scraping used with input emulation is generally the safest way to bot a game and not get caught. Many people (myself included) have been doing it for years with no problems.
In addition, as a step between detecting what's going on in the game and providing input, some games require extensive calculation before you can decide what kind of input to provide. For example, there was a game where I had to calculate the number of ships to send when attacking the enemy, based on the number of ships I had, the type of ships, and who and what kind of enemy it was. The calculation is generally the "easy" part, since you can do it in almost any programming language.
It's called AI (artificial intelligence) and really isn't that hard to replicate: a set of rules and commands in the programming language of your game will do the trick. For example, an FPS bot could work by getting the coordinates of your player's body and setting the enemy bot's gun to aim at that coordinate, shooting when within a certain range.