Writing a program which uses voice recogniton... where should I start? - c++

I'm a design student currently dabbling with Arduino code (based on c/c++) and flash AS3. What I want to do is to be able to write a program with a voice control input.
So, program prompts user to spell a word. The user spells out the word. The program recognizes if this is right, adds one to a score if it's correct, and corrects the user if it's wrong. So I'm seeing a big list of words, each with an audio file of the word being read out, with the voice recognition part checking to see if the reply matches the input.
Ideally i'd like to be able to interface this with an Arduino microcontroller so that a physical output with a motor could be achieved in reaction also.
Thing is i'm not sure if I can make this program in flash, in Processing (associated with arduino) or if I need another C program-making-program. I guess I need to download a good voice recognizing program, but how can I interface this with anything else? Also, I'm on a mac. (not sure if this makes a difference)
I apologize for my cluelessness, any hints would be great!
-Susan

What you need is most likely not a speech recognition program. You are looking for a speech recognition library. You're probably not that familiar with programming yet, so the term may be unfamiliar. Basically, a library is an intermediate step between source code and a whole program.
In your case, you are really asking for a library that (1) does vocie recognition and (2) works with Adobe Flash. Unfortunately, I can't find one of them with Google. Furthermore, I've found people who've tried, and their experiments (while short of what you need) are described by others as impressive. That suggests the technology isn't there yet.
It is probably easier to move the voice recognition to the Arduino. "Voice recognition Arduino" provides a lot of good hits in Google.

Related

How can I write a microcontroller for an RGB keyboard?

I have a Modecom Volcano Hammer RGB keyboard, and I'd like to make a custom led controlling program.
I'm a programmer, but my main languages are C#, PHP and JavaScript. Unfortunately, I'm not so experienced with hardware programming.
How can I start? I read tons of pages, I made several google search, but I didn't find the answer. So I'd like to control my keyboard's LEDs, probably it's petty simple, but I don't know either the keywords.
Yes, I know, I have to ask the manufacturer, but it's not an option, because they don't answer.
So, any suggestions?

Audio Subtitle Transcription - C++

I'm on a project that among other video related tasks should eventually be capable of extracting the audio of a video and apply some kind of speech recognition to it and get a transcribed text of what's said on the video. Ideally it should output some kind of subtitle format so that the text is linked to a certain point on the video.
I was thinking of using the Microsoft Speech API (aka SAPI). But from what I could see it is rather difficult to use. The very few examples that I found for speech recognition (most are for Text-To-Speech which mush easier) didn't perform very well (they don't recognize a thing). For example this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx
Some examples use something called grammar files that are supposed to define the words that the recognizer is waiting for but since I haven't trained the Windows Speech Recognition thoroughly I think that might be adulterating the results.
So my question is... what's the best tool for something like this? Could you provide both paid and free options? Well the best "free" (as it comes with Windows) option I believe it's SAPI, all the rest should be paid but if they are really good it might be worth it. Also if you have any good tutorials for using SAPI (or other API) on a context similar to this it would be great.
On the whole this is a big ask!
The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what does each voice sound like). This might be possible in some cases, such as a TV series if you wanted to churn through hours of speech -separated for each character- to train it. There's a lot of work there though. For something like a film there's probably no hope of training a recogniser unless you can get hold of the actors.
Most film and TV production companies just hire media companies to transcribe the subtitles based on either direct transcription using a human operator, or converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.
In video you have a plethora of things that make you life difficult, pretty much spanning huge swathes of current speech technology research:
-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)
-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?
-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters.
The speech algorithm will need to be extremely robust as different characters can have different gender/accents/emotion. From what I understand of the current state of recognition you might be able to get a single speaker after some training, but asking a single program to nail all of them might be tough!
--
There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
--
There's a bunch of proprietary speech recognition systems out there. If you want the best you'll probably want to license a solution off one of the big boys like Nuance. If you want to keep things free the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.
--
The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.
Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.
--
As much as I hate citing the big W there's a goldmine of useful links here!
Good luck :)
This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon Naturally Speaking are amazingly good and that has a SAPI interface for developers. But it's not so simple of a problem.
Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon Naturally Speaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech or any other ASR engine that supports SAPI. IBM would be another vendor to look at but I don't think you will do much better than Dragon.
But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:
1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.
For 1) if you had a way of separating each actor on a separate track that would be ideal. But there's no reliable way of separating speakers automatically in a way that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (some free software out there for that) and separate based on that, but this is a sophisticated and error prone task.) The best thing would be hand editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
For music (2) you'd either have to hope for the best or try to filter it out. Speech is more bandlimited than music so you could try a bandpass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs but I would guess 100Hz to 2-3KHz would keep the speech intelligible.
For (3), there's no solution. The ASR engine should return confidence scores so at best I would say if you can tag low scores, you could then go back and manually transcribe those bits of speech.
(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.
Hope this helps.

Create a C++ program that processes incoming calls on a phone-line (land-line)

For some time now I've been tossing around what I think is am awesome idea: I want to write essentially a C++ phone server to handle all of my incoming calls on a land-line. I'll have a white-list (yay never having to worry about telemarketers ever again!), a black-list, and will be able to access my phone using my gaming headset, allowing me to make/answer calls while I'm gaming or whatever. In the future I'd also like to hook it up to a gui and make it have pop-ups and other cool features.
The problem is, I have no idea where to start. I'm familiar enough with C++, but have no idea how to go about doing anything with a phone-line. I can plug a phone-line into my computer, but I have no idea how to get my program to be able to use that connection. There's WinSock2 for being able to use my ethernet connection, is there something similar I'd be able to use to use the phone line? As it's using the same ethernet jack, I wonder if it's even possible to use WinSock2 to use the phone-line?
I saw this post, which wasn't particularly helpful: stackoverflow link , which points out Dual-tone multi-frequency signaling. I stumbled across this site: link, but isn't really going to help me get started.
So I was wondering, is there some sort of library out there that would allow me to tap into a phone-line that's connected to my computer? Is there a standard somewhere out there concerning phone-lines and what the different combinations of tone's mean? Can anyone here help get me started? I realize it's somewhat of a big undertaking, so any push in the right direction would be greatly appreciated. Thanks.
[Update:]
I found this question, which is a step in the right direction, but I'm not sure yet if it helps me (I need to go to bed, and will take a look at it in the morning). I did see mention of a Microsoft Telephony API though, I'll try doing more research on that tomorrow.
If working with MS products is not an absolute necessity, you might also consider taking a shot at Asterisk. This is an open-source PBX (in software) that allows development on Linux, Windows (emulated) and Mac. At the company where I work, we use it for implementing small-scale exchanges, about a 100 lines or so. It also interfaces well with VoIP and allows a whole host of protocols. I have developed scripts and programs in C++ that work on voice packets in real-time, and so far, my experience has been good. As for your stated use-case of blocking telemarketers etc., this would be a very good fit. Check out further details here.
After doing more research, having one link lead to another link, and coming up with new search terms, I stumbled across this site that looks like it could kick me off using the Windows Telephony API in C++: link. This link includes open source c++ samples showing how to do the basics of what this question asks, I'll just have to test to see if they actually still work.
This is only the beginning of my research, so I'll keep you posted on any other findings. If anyone else is knowledgeable in this area, please still feel free to drop me information on what I want to accomplish.

Sound output through M-Audio ProFire 610

I got an assignment at work to create a system which will be able to direct sound to different output channels of our sound card. We are using M-Audio ProFire 610, which has 8 channel output and connects through FireWire. We are also using a Mac Mini as our host server and I'm gonna be working in Xcode.
This is the diagram of what I am building:
diagram http://img121.imageshack.us/img121/7865/diagramy.png
At first I thought that Java will be enough for this project, however later on I discovered that Java is not able to push sound to other than default output channels of the sound card so I decided to switch to C++. The problem is that I am a web developer and I don't have any experience in this language whatsoever - that is why I am looking for help from more experienced developers.
I found a Core Audio Primer for ios4 but not sure how much of it I can use for my project. I find it a bit confusing, too.
What steps should I take to complete this assignment? What frameworks should I use? Any code examples? I am looking for any help, hints, tips - well anything that will help me complete this project.
If you're just looking for audio pass-through, you might want to look at something that's already been built, like Jack which creates a software audio device that looks and works just like a real one (you can set it as default output for your app) and then allows you to route each channel anywhere you want (including to other applications).
If you want/need to make your own, definitely go with C++, for which there are many many tutorials (I learned from cplusplus.com). CoreAudio is the low-level C/C++ interface as Justin mentioned, but it's really hard to learn and use. A much simpler API is provided by PortAudio, for which I've worked a bit on the Mac implementation. Look at the tutorials there, make something similar for default input and output, and then to do the channel mapping use PaMacCore_SetupChannelMap, which is described here. You'll need to call it twice, once for the input stream and once for the output stream. Join the mailing list for PortAudio if you need more advice! Good luck!
the primary APIs are at CoreAudio/AudioHardware.h
most of the samples/supporting code provided by apple is in C++. however, the APIs are totally C (don't know if that helps you or not).
you'll want to access the Hardware Abstraction Layer (aka HAL), more details in this doc:
http://developer.apple.com/documentation/MusicAudio/Conceptual/CoreAudioOverview/CoreAudioOverview.pdf
for (a rather significant amount of) additional samples/usage, see $DEVELOPER_DIR/Extras/CoreAudio/

Getting the amplitude(or rms voltage) of audio signal captured in C++ by wavin lib.?

I am working on a very basic robotics project, and wish to implement voice recognition in it.
i know its a complex thing but i wish to do it for only 3 or 4 commands(or words).
i know that using wavin i can record audio. but i wish to do real-time amplitude analysis on the audio signal, how can that be done, the wave will be inputed as 8-bit, mono.
i have thought of divinding the signal into a set of some specific time, further diving it into smaller subsets, getting the average rms value over the subset and then summing them up and then see how much different they are from the actual stored signal.If the error is below accepted value for all(or most) of the sets, then print the word.
How can this be implemented?
if you can provide me any other suggestion also, it would be great.
Thanks, in advance.
There is no simple way to recognize words, because they are basically a sequence of phonemes which can vary in time and frequency.
Classical isolated word recognition systems use signal MFCC (cepstral coefficients) as input data, and try to recognize patterns using HMM (hidden markov models) or DTW (dynamic time warping) algorithms.
You will also need a silence detection module if you don't want a record button.
For instance Edimburgh University toolkit provides some of these tools (with good documentation).
If you don't want to build it "from scratch" or have a source of inspiration, here is an (old but free) implementation of such a system (which uses its own toolkit) with a full explanation and practical examples on how it works.
This system is a LVCSR (Large-Vocabulary Continuous Speech Recognition) and you only need a subset of it. If someone know an open source reduced vocabulary system (like a simple IVR) it would be welcome.
If you want to make a basic system from your own, I recommend you to use MFCC and DTW:
For each target word to modelize:
record some instances of the word
compute some (eg each 10ms) delta-MFCC through the word to have a model
When you want to recognize a signal:
compute some delta-MFCC of this signal
use DTW to compare these delta-MFCC to each modelized word's delta-MFCC
output the word that fits the best (use a threshold to drop garbage)
If you just want to recognize a few commands, there are many commercial and free products you can use. See Need text to speech and speech recognition tools for Linux or What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition? or Speech Recognition on iPhone. The answers to these questions link to many available products and tools. Speech recognition and understanding of a list of commands is a very common problem solved commercially. Many of the voice automated phone systems you call uses this type of technology. The same technology is available for developers.
From watching these questions for few months, I've seen most developer choices break down like this:
Windows folks - use the System.Speech features of .Net or Microsoft.Speech and install the free recognizers Microsoft provides. Windows 7 includes a full speech engine. Others are downloadable for free. There is a C++ API to the same engines known as SAPI. See at http://msdn.microsoft.com/en-us/magazine/cc163663.aspx. or http://msdn.microsoft.com/en-us/library/ms723627(v=vs.85).aspx
Linux folks - Sphinx seems to have a good following. See http://cmusphinx.sourceforge.net/ and http://cmusphinx.sourceforge.net/wiki/
Commercial products - Nuance, Loquendo, AT&T, others
Online service - Nuance, Yapme, others
Of course this may also be helpful - http://en.wikipedia.org/wiki/List_of_speech_recognition_software