Getting the amplitude (or RMS value) of an audio signal captured in C++ with the waveIn library? - c++

I am working on a very basic robotics project and wish to implement voice recognition in it.
I know it's a complex thing, but I wish to do it for only 3 or 4 commands (or words).
I know that using waveIn I can record audio, but I wish to do real-time amplitude analysis on the audio signal. How can that be done? The wave will be input as 8-bit, mono.
I have thought of dividing the signal into sets of some specific duration, further dividing them into smaller subsets, getting the average RMS value over each subset, summing them up, and then seeing how different they are from the values of the actual stored signal. If the error is below an accepted value for all (or most) of the sets, then print the word.
How can this be implemented?
Any other suggestions would also be great.
Thanks in advance.
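A minimal sketch of the framing-and-RMS idea described above, assuming the capture buffer holds unsigned 8-bit mono samples (which is what waveIn delivers for an 8-bit format, with silence at 128):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Compute one RMS value per fixed-size frame of unsigned 8-bit mono samples.
    // 8-bit waveIn/WAV data is unsigned with silence at 128, so re-centre first.
    std::vector<double> frameRms(const std::vector<uint8_t>& samples, size_t frameLen)
    {
        std::vector<double> rms;
        for (size_t start = 0; start + frameLen <= samples.size(); start += frameLen) {
            double sumSquares = 0.0;
            for (size_t i = 0; i < frameLen; ++i) {
                double s = (samples[start + i] - 128) / 128.0; // map to roughly [-1, 1)
                sumSquares += s * s;
            }
            rms.push_back(std::sqrt(sumSquares / frameLen));
        }
        return rms;
    }

    // Example: at an 8000 Hz sample rate, frameLen = 80 gives one RMS value per 10 ms.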

There is no simple way to recognize words, because they are basically sequences of phonemes which can vary in time and frequency.
Classical isolated-word recognition systems use MFCCs (cepstral coefficients) computed from the signal as input data, and try to recognize patterns using HMMs (hidden Markov models) or the DTW (dynamic time warping) algorithm.
You will also need a silence-detection module if you don't want a record button.
For instance, the University of Edinburgh's toolkit provides some of these tools (with good documentation).
If you don't want to build it from scratch, or want a source of inspiration, here is an (old but free) implementation of such a system (which uses its own toolkit), with a full explanation and practical examples of how it works.
That system is an LVCSR (large-vocabulary continuous speech recognition) system, and you only need a subset of it. If someone knows of an open-source reduced-vocabulary system (like a simple IVR), it would be welcome.
If you want to build a basic system of your own, I recommend using MFCC and DTW:
For each target word to model:
record several instances of the word
compute delta-MFCCs at a regular interval (e.g. every 10 ms) through the word to form a model
When you want to recognize a signal:
compute the delta-MFCCs of this signal
use DTW to compare these delta-MFCCs to each modeled word's delta-MFCCs
output the word that fits best, using a threshold to drop garbage (see the DTW sketch below)
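To make the DTW step concrete, here is a minimal sketch, assuming each word model (and the input) has already been converted into a sequence of feature vectors such as the delta-MFCC frames described above; a lower cost means a better match:

    #include <algorithm>
    #include <cmath>
    #include <limits>
    #include <vector>

    using Frame = std::vector<double>; // one feature vector (e.g. delta-MFCCs)

    // Euclidean distance between two feature frames of equal dimension.
    double frameDist(const Frame& a, const Frame& b) {
        double d = 0.0;
        for (size_t k = 0; k < a.size(); ++k) d += (a[k] - b[k]) * (a[k] - b[k]);
        return std::sqrt(d);
    }

    // Classic O(n*m) dynamic time warping cost between two frame sequences.
    double dtwCost(const std::vector<Frame>& x, const std::vector<Frame>& y) {
        const size_t n = x.size(), m = y.size();
        const double INF = std::numeric_limits<double>::infinity();
        std::vector<std::vector<double>> D(n + 1, std::vector<double>(m + 1, INF));
        D[0][0] = 0.0;
        for (size_t i = 1; i <= n; ++i)
            for (size_t j = 1; j <= m; ++j)
                D[i][j] = frameDist(x[i - 1], y[j - 1]) +
                          std::min({D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]});
        return D[n][m] / double(n + m); // length-normalized cost
    }

Recognition then computes dtwCost(input, model) for every word model, picks the smallest, and rejects the input if even the best cost exceeds a tuned threshold.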

If you just want to recognize a few commands, there are many commercial and free products you can use. See Need text to speech and speech recognition tools for Linux or What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition? or Speech Recognition on iPhone. The answers to these questions link to many available products and tools. Speech recognition and understanding of a list of commands is a very common problem, solved commercially. Many of the voice-automated phone systems you call use this type of technology, and the same technology is available to developers.
From watching these questions for a few months, I've seen most developer choices break down like this:
Windows folks - use the System.Speech features of .NET or Microsoft.Speech and install the free recognizers Microsoft provides. Windows 7 includes a full speech engine; others are downloadable for free. There is a C++ API to the same engines, known as SAPI. See http://msdn.microsoft.com/en-us/magazine/cc163663.aspx or http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx
Linux folks - Sphinx seems to have a good following. See http://cmusphinx.sourceforge.net/ and http://cmusphinx.sourceforge.net/wiki/
Commercial products - Nuance, Loquendo, AT&T, others
Online service - Nuance, Yapme, others
Of course this may also be helpful - http://en.wikipedia.org/wiki/List_of_speech_recognition_software
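If you go the SAPI route from C++, a minimal shared-recognizer command loop looks roughly like the sketch below. This is a hedged sketch, not a drop-in program: error handling is trimmed, and commands.xml stands in for a SAPI XML grammar listing your commands.

    #include <windows.h>
    #include <atlbase.h>    // CComPtr
    #include <sapi.h>
    #include <sphelper.h>   // CSpEvent, CSpDynamicString (ships with the SDK)
    #include <iostream>

    int main() {
        if (FAILED(::CoInitialize(NULL))) return 1;
        {
            CComPtr<ISpRecognizer> recognizer;
            CComPtr<ISpRecoContext> context;
            CComPtr<ISpRecoGrammar> grammar;
            recognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
            recognizer->CreateRecoContext(&context);
            context->SetNotifyWin32Event();  // deliver events via a Win32 event handle
            context->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));
            context->CreateGrammar(1, &grammar);
            grammar->LoadCmdFromFile(L"commands.xml", SPLO_STATIC); // your grammar file
            grammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);         // activate all rules

            if (context->WaitForNotifyEvent(INFINITE) == S_OK) {    // block for one event
                CSpEvent evt;
                if (evt.GetFrom(context) == S_OK && evt.eEventId == SPEI_RECOGNITION) {
                    CSpDynamicString text;
                    evt.RecoResult()->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                              TRUE, &text, NULL);
                    std::wcout << L"Heard: " << (WCHAR*)text << std::endl;
                }
            }
        }
        ::CoUninitialize();
        return 0;
    }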

Related

Specific topics on Tensorflow for CNN

I have a mini project for my new TensorFlow course this semester, with random topics. Since I have some background in Convolutional Neural Networks, I intend to use one for my project. My computer can only run the CPU version of TensorFlow.
However, as a newbie, I realize that there are a lot of topics, such as MNIST, CIFAR-10, etc., so I don't know which topic to pick. I only have two weeks left. It would be great if the topic is not too complicated but also not too easy, because that matches my intermediate level.
In your experience, could you give me some advice about the specific topic I should do for my project?
Moreover, it would be better if the topic let me provide my own data to test my trained model, because my professor said that it is a plus point toward getting an A grade on the project.
Thanks in advance.
I think that to answer this question you need to properly evaluate the marking criteria for your project. However, I can give you a brief overview of what you've just mentioned.
MNIST: MNIST is an optical character recognition task for the individual digits 0-9 in 28x28-pixel images. It is considered the "Hello World" of CNNs. It's pretty basic and might be too simplistic for your requirements; that's hard to gauge without more information. Nonetheless, it will run pretty quickly with CPU TensorFlow, and the online tutorial is pretty good.
CIFAR-10: CIFAR-10 is a bigger, harder dataset of objects and vehicles. The images are 32x32-pixel colour, so individual image processing isn't too bad, but the dataset is large and takes a long time to train; your CPU might struggle with it. You could try training on a reduced dataset, but I don't know how that would go. Again, it depends on your course requirements.
Flowers-Poets: There is the TensorFlow for Poets re-training example; even if re-training itself is not suitable for your course, you could use its flowers dataset to build your own model.
Build-your-own-model: You could use tf.Layers to build your own network and experiment with it. tf.Layers is pretty easy to use. Alternatively you could look at the new Estimators API that will automate a lot of the training processes for you. There are a number of tutorials (of varying quality) on the Tensorflow website.
I hope that helps give you a run-down of what's out there. Other datasets to look at are PASCAL VOC and ImageNet (however, they are huge!). Models to experiment with might include VGG-16 and AlexNet.

Audio Subtitle Transcription - C++

I'm on a project that, among other video-related tasks, should eventually be capable of extracting the audio of a video, applying some kind of speech recognition to it, and getting a transcribed text of what's said in the video. Ideally it should output some kind of subtitle format, so that the text is linked to a certain point in the video.
I was thinking of using the Microsoft Speech API (aka SAPI), but from what I could see it is rather difficult to use. The very few examples I found for speech recognition (most are for text-to-speech, which is much easier) didn't perform very well (they didn't recognize a thing). For example, this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files, which are supposed to define the words the recognizer is waiting for, but since I haven't trained Windows Speech Recognition thoroughly, I think that might be skewing the results.
So my question is... what's the best tool for something like this? Could you provide both paid and free options? I believe the best "free" option (as it comes with Windows) is SAPI; all the rest should be paid, but if they are really good it might be worth it. Also, if you have any good tutorials for using SAPI (or another API) in a context similar to this, that would be great.
On the whole this is a big ask!
The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what each voice sounds like). This might be possible in some cases, such as a TV series, if you wanted to churn through hours of speech (separated for each character) to train it. There's a lot of work there, though. For something like a film, there's probably no hope of training a recognizer unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make your life difficult, pretty much spanning huge swathes of current speech technology research:
-> Multiple speakers -> "speaker identification" (can you tell characters apart? Also, subtitles normally use different coloured text for different speakers)
-> Multiple simultaneous speakers -> the "cocktail party problem" (can you separate the two voice components and transcribe both?)
-> Background noise -> can you pick the speech out from the soundtrack/foley/exploding helicopters?
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont, which is specifically designed for legibility in these circumstances, and using a lookup table to cross-reference images against video timecode (remembering that NTSC/PAL/cinema use different timing formats).
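As a small illustration of the lookup-table idea, here is a sketch that formats a frame number as an HH:MM:SS:FF timecode for an integer frame rate; NTSC's 29.97 fps drop-frame timecode needs extra correction logic that is not shown:

    #include <cstdio>
    #include <string>

    // Format a frame number as HH:MM:SS:FF for an integer frame rate,
    // e.g. 25 fps (PAL) or 24 fps (cinema).
    std::string toTimecode(long frame, int fps) {
        const int  ff = static_cast<int>(frame % fps);
        const long totalSeconds = frame / fps;
        const int  ss = static_cast<int>(totalSeconds % 60);
        const int  mm = static_cast<int>((totalSeconds / 60) % 60);
        const int  hh = static_cast<int>(totalSeconds / 3600);
        char buf[16];
        std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d:%02d", hh, mm, ss, ff);
        return buf;
    }

    // The lookup table itself could then be a sorted list of
    // (startFrame, endFrame, imagePath) entries searched by frame number.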
--
There's a bunch of proprietary speech recognition systems out there. If you want the best, you'll probably want to license a solution from one of the big boys like Nuance. If you want to keep things free, RWTH Aachen and CMU have put some solutions together. I have no idea how good they are or how well suited they might be to the problem.
--
The only solution I can think of that is similar to what you're aiming at is the subtitling you can get on news channels here in the UK: "live closed captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained; I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed: dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to drop a lot of content to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W, there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very-large-vocabulary task. Products like Dragon NaturallySpeaking are amazingly good, and that has a SAPI interface for developers. But it's not such a simple problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon Naturally Speaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech or any other ASR engine that supports SAPI. IBM would be another vendor to look at but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For 1), if you had a way of separating each actor onto a separate track, that would be ideal. But there's no reliable way of separating speakers automatically that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (there's some free software out there for that) and separate based on that, but this is a sophisticated and error-prone task. The best option would be hand-editing the speakers apart, but at that point you might as well just manually transcribe the speech! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
For music (2), you'd either have to hope for the best or try to filter it out. Speech is more band-limited than music, so you could try a band-pass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs, but I would guess 100 Hz to 2-3 kHz would keep the speech intelligible.
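A minimal sketch of such a band-pass filter, using the well-known RBJ audio-EQ-cookbook biquad; the centre frequency and Q below are rough values chosen to approximate the 100 Hz to 3 kHz band, not tuned numbers:

    #include <cmath>
    #include <vector>

    // Band-pass biquad from the RBJ audio-EQ cookbook (constant 0 dB peak gain).
    struct Biquad {
        double b0, b1, b2, a1, a2;        // coefficients normalized so a0 == 1
        double x1 = 0, x2 = 0, y1 = 0, y2 = 0;

        static Biquad bandPass(double sampleRate, double centreHz, double q) {
            const double kPi   = 3.14159265358979323846;
            const double w0    = 2.0 * kPi * centreHz / sampleRate;
            const double alpha = std::sin(w0) / (2.0 * q);
            const double a0    = 1.0 + alpha;
            Biquad f;
            f.b0 =  alpha / a0;
            f.b1 =  0.0;
            f.b2 = -alpha / a0;
            f.a1 = -2.0 * std::cos(w0) / a0;
            f.a2 = (1.0 - alpha) / a0;
            return f;
        }

        double process(double x) {        // direct form I
            const double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x;
            y2 = y1; y1 = y;
            return y;
        }
    };

    // Centre ~550 Hz with Q ~0.19 spans roughly the 100 Hz - 3 kHz band.
    void filterSpeechBand(std::vector<double>& samples, double sampleRate) {
        Biquad f = Biquad::bandPass(sampleRate, 550.0, 0.19);
        for (double& s : samples) s = f.process(s);
    }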
For (3), there's no solution. The ASR engine should return confidence scores, so at best I would say that if you can tag low scores, you could then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.

An example of an embedded project for a single person

I've been trying to wrap my head around embedded. Since I will be self-taught in this specific niche, I realize it will be harder to get a job in the field, so I'm hoping to add a completed project to my resume to prove to potential employers that I've done it and can do it again for them.
Can someone suggest a project that I can undertake as a single person and actually be able to finish, but at the same time not so simple that it doesn't prove anything? Something reasonable that I can aim for.
If you can substantiate your example with a project you worked on yourself, and mention how many people were involved, and how long it took to finish it, that would also help me gauge the difficulty of projects I see in general and rule out the ones that are probably too big for my capacity. It's very difficult to gauge the amount of work a project needs from my position.
You should take a look at the arduino. To quote their site:
Arduino is an open-source electronics prototyping platform based on flexible, easy-to-use hardware and software. It's intended for artists, designers, hobbyists, and anyone interested in creating interactive objects or environments.
There is a really handy playground listing a bunch of personal projects on the arduino, any one of which might fulfil your need to do some embedded development. You can also trawl around the internet (e.g. instructables) to find many other interesting arduino applications -- I particularly like the one building a fancy control system for an espresso machine, and, of course, there is the mandatory fart detecting chair that tweets its findings.
Being an arduino experimenter myself, I can attest to the simplicity and power of this device -- and the great fun you will have playing with it. If you want to get started quickly, I can recommend buying the starter kit from the very helpful people at oomlout.
Are you looking specifically at embedded software development, or are you interested in circuit board design as well?
If it's just software, then I would suggest getting hold of an ARM development board (possibly the Philips LPC range; SparkFun has some nice ones) that you can program via a bootloader over USB, and start hacking. Get one with a display and an Ethernet port and you can build up to making some sort of network-attached sensor (temperature, water level, object counter, etc.). Start out small (turn on an LED from a button) and work your way up.
If you're also into the electronics side of things, I'd suggest something like an MP3 (or WAV) player, and maybe stick to the AVR or PIC 8-bit microcontrollers (the AVR is used on the Arduino), as these are a little easier to deal with than ARM. Here you could start with a USB-powered device that streams WAV files from a PC serial port out to a pair of headphones, and build up to a battery-powered board feeding data to an MP3 decoder IC from an SD card.
Some things you may want to learn & demonstrate:
An understanding of the bounds of working with limited resources, including memory management (dynamic and/or static), resource management (locks, semaphores, mutexes), multiple tasks (interrupts), and appropriate data structures
Ability to interface with other devices/ICs over various interconnects (analog & digital IO, serial bus (RS232, I2C, SPI))
Ability to sanely structure a program and segment the various modules without producing 'spaghetti' code
Ability to use source and integrate 3rd party libraries where appropriate (think FAT filesystem, or TCP/IP stack)
Misc Tips:
read and understand the datasheets (yes all of them)
code and test on the desktop where possible, but understand that there are differences and bugs will still creep through (this is where it helps to be using a tool-chain that is common with the desktop - GCC is good, but the tools are generally CLI)
use assert a lot - you can flash the line number of a failed assert using a single LED - this is invaluable (a sketch follows below)
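A minimal sketch of that assert trick; led_on, led_off, and delay_ms are hypothetical board-support routines standing in for whatever your toolchain provides:

    // led_on, led_off and delay_ms are hypothetical board-support routines;
    // swap in your toolchain's GPIO/timer calls.
    extern void led_on(void);
    extern void led_off(void);
    extern void delay_ms(unsigned ms);

    static void blink_forever(unsigned count) {
        for (;;) {
            for (unsigned i = 0; i < count; ++i) {
                led_on();  delay_ms(200);
                led_off(); delay_ms(200);
            }
            delay_ms(1500);                 // gap, then repeat the burst
        }
    }

    // On failure, loop forever blinking __LINE__ pulses so the failing line
    // can be read straight off the LED.
    #define ASSERT(cond) do { if (!(cond)) blink_forever(__LINE__); } while (0)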
Most of all have fun - it still makes me smile when you first get a new component working (display, motor, sensor). Embedded makes the world go round :)

Writing a program which uses voice recogniton... where should I start?

I'm a design student currently dabbling with Arduino code (based on C/C++) and Flash AS3. What I want to do is write a program with a voice-control input.
So, program prompts user to spell a word. The user spells out the word. The program recognizes if this is right, adds one to a score if it's correct, and corrects the user if it's wrong. So I'm seeing a big list of words, each with an audio file of the word being read out, with the voice recognition part checking to see if the reply matches the input.
Ideally, I'd like to be able to interface this with an Arduino microcontroller, so that a physical output with a motor could also be triggered in response.
Thing is, I'm not sure if I can make this program in Flash, in Processing (associated with Arduino), or if I need another tool for making C programs. I guess I need to download a good voice recognition program, but how can I interface it with anything else? Also, I'm on a Mac (not sure if this makes a difference).
I apologize for my cluelessness, any hints would be great!
-Susan
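Whatever environment ends up being used, the program logic described above is a simple loop. Here is a minimal sketch in C++ for concreteness, with the recognition step stubbed out; recognizeSpelling is a hypothetical placeholder for whichever speech library is eventually chosen:

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical placeholder: a real version would call a speech recognition
    // library and return the letters it heard (e.g. "c a t" collapsed to "cat").
    std::string recognizeSpelling() {
        std::string spelled;
        std::getline(std::cin, spelled);    // stand-in: typed instead of spoken
        return spelled;
    }

    int main() {
        const std::vector<std::string> words = {"cat", "house", "robot"};
        int score = 0;
        for (const std::string& word : words) {
            std::cout << "Spell the word: " << word << "\n"; // or play its audio file
            if (recognizeSpelling() == word) {
                ++score;                    // correct: bump the score
                std::cout << "Correct!\n";  // an Arduino motor could be signalled here
            } else {
                std::cout << "Not quite; it is spelled " << word << "\n";
            }
        }
        std::cout << "Score: " << score << "/" << words.size() << "\n";
        return 0;
    }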
What you need is most likely not a speech recognition program; you are looking for a speech recognition library. You're probably not that familiar with programming yet, so the term may be unfamiliar. Basically, a library is an intermediate step between source code and a whole program.
In your case, you are really asking for a library that (1) does voice recognition and (2) works with Adobe Flash. Unfortunately, I can't find one with Google. Furthermore, I've found people who've tried, and their experiments (while falling short of what you need) are described by others as impressive. That suggests the technology isn't there yet.
It is probably easier to move the voice recognition to the Arduino. Searching for "voice recognition Arduino" turns up a lot of good hits in Google.

How to implement speech recognition and text-to-speech in C++?

I want to know about the various techniques for doing speech recognition and text-to-speech conversion.
Also, please let me know about any resources on it, like links, tutorials, e-books, etc.
Which is the most efficient technique to achieve it?
I'm going to answer the part about speech recognition (since I don't know much about text-to-speech):
This book, "Statistical Methods for Speech Recognition" is a classic that explains the mathematical foundations of statistical speech recognition, written by the founder of that area, Frederick Jelinek.
The most important concept you have to know is hidden Markov models. People have been using them in speech recognition for decades. A more recent approach uses conditional random fields; see the paper (PDF) and the associated software toolkit, SCARF.
It is fairly hard to write your own speech recognizer. It's an active research area with several scientific conferences, e.g. ASRU, Interspeech, ICASSP.
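For a feel of what sits inside an HMM-based recognizer, here is a minimal sketch of the forward algorithm, which scores how well an observation sequence fits one word model; it assumes the emission likelihoods are precomputed per frame, and a real system would work in log space with scaling to avoid underflow:

    #include <vector>

    // Forward algorithm: P(observation sequence | one word's HMM).
    // A[i][j] : transition probability from state i to state j
    // B[i][t] : emission likelihood of the t-th observed frame in state i
    //           (assumed precomputed, e.g. from Gaussian mixtures over MFCCs)
    // pi[i]   : initial state distribution
    double forwardProbability(const std::vector<std::vector<double>>& A,
                              const std::vector<std::vector<double>>& B,
                              const std::vector<double>& pi) {
        const size_t N = A.size();      // number of states
        const size_t T = B[0].size();   // number of frames
        std::vector<double> alpha(N);
        for (size_t i = 0; i < N; ++i) alpha[i] = pi[i] * B[i][0];
        for (size_t t = 1; t < T; ++t) {
            std::vector<double> next(N, 0.0);
            for (size_t j = 0; j < N; ++j) {
                double sum = 0.0;
                for (size_t i = 0; i < N; ++i) sum += alpha[i] * A[i][j];
                next[j] = sum * B[j][t];
            }
            alpha.swap(next);
        }
        double p = 0.0;
        for (double a : alpha) p += a;  // sum over all final states
        return p;
    }

    // Isolated-word recognition scores each word's model this way and picks the max.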
Both are very wide areas.
About recognition: in this schema you will find how to build a basic automatic speech recognition system. It isn't by any means close to the state of the art, but it is achievable and it works. If you want to do something more advanced, read about cepstral coefficients and hidden Markov models. Have a look at HTK; it is a widely used toolkit for hidden Markov models.
About text to speech: I'd have a look at Festival.
There are multiple Sphinx variants. The main active ones are Pocketsphinx and Sphinx4.
Sphinx4 is written in Java. It is better for desktop and web applications.
Pocketsphinx is written in C. It is better for embedded devices. There are iphone/android apps that use it.
Sounds like you want pocketsphinx. Try out this tutorial:
http://www.speech.cs.cmu.edu/sphinx/tutorial.html
A better place to ask pocketsphinx/sphinx4 questions is CMU's SourceForge forum.
Also, you should provide more info, like what you intend to make.
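If you do go with pocketsphinx, a minimal file-decoding sketch (callable from C++) looks roughly like this; the model paths are placeholders for your installation, and the exact function signatures vary slightly between pocketsphinx releases (this follows the 5prealpha-era C API):

    #include <pocketsphinx.h>
    #include <cstdio>

    int main() {
        // Model/dictionary paths are placeholders for your installation.
        cmd_ln_t* config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm",  "model/en-us/en-us",
            "-lm",   "model/en-us/en-us.lm.bin",
            "-dict", "model/en-us/cmudict-en-us.dict",
            NULL);
        ps_decoder_t* ps = ps_init(config);

        FILE* fh = std::fopen("utterance.raw", "rb"); // 16-bit, 16 kHz, mono PCM
        int16 buf[512];
        ps_start_utt(ps);
        size_t n;
        while ((n = std::fread(buf, sizeof(int16), 512, fh)) > 0)
            ps_process_raw(ps, buf, n, FALSE, FALSE); // feed raw PCM to the decoder
        ps_end_utt(ps);

        const char* hyp = ps_get_hyp(ps, NULL);       // best-scoring hypothesis
        std::printf("hypothesis: %s\n", hyp ? hyp : "(none)");

        std::fclose(fh);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }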
As for books, the bible of speech recognition is "Spoken Language Processing"
Since you mentioned MS -
You should just look at the Microsoft Speech site. It contains many resources for dealing with speech, including TTS and speech recognition.
If you're looking for some actual code, check out Sphinx, an open-source speech recognition project from CMU. It's not written in C++, but if you're interested in algorithms, it implements a bunch of stuff you can learn from. (I'd like to echo #dehmann's point, too: read up on hidden Markov models.)
If you are curious about what to do with your fancy speech recognition you should read:
Voice Interaction Design by Randy Allen Harris
It provides some great advice about when to use Voice and how to use it in an application.