Speech recognition for a small vocabulary (about 20 words) - C++

I am currently working on a project for my university. The task is to write a speech recognition system that will run on a phone in the background, waiting for a few commands (like "call 0 123 ...").
It's a 2-month project, so it does not have to be very accurate. The amount of acceptable noise can be small and words will be separated by moments of silence.
I am currently at the point of loading a sample word encoded in raw 16-bit PCM format, splitting it into chunks (about 50 per second) and running an FFT on each chunk to get its frequency spectrum.
Things to solve are:
1) going through the longer recording and splitting it into words,
2) finding the best match for each word.
1) I was thinking about just checking chunk after chunk, and if I encounter a few chunks that have higher amplitudes at human-voice frequencies, assuming that a word has started (a rough sketch of this idea follows below). In any case, I am looking for resources that may help with this.
2) This one seems a little bit tougher. Is it necessary to use HMMs for a system like this, or are there simpler methods given that the vocabulary is so small (20 words)?
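For 1), here is a minimal sketch of the chunk-energy idea (in Python/NumPy for brevity, since the real code is C++; the 300-3400 Hz band and the threshold are placeholder values that would still need tuning):
import numpy as np

def voiced_chunks(samples, rate=16000, chunks_per_sec=50, threshold=0.02):
    """Flag chunks whose average magnitude in the rough human-voice band is high."""
    chunk = rate // chunks_per_sec
    flags = []
    for i in range(0, len(samples) - chunk, chunk):
        seg = samples[i:i + chunk].astype(float) / 32768.0   # normalize 16-bit PCM
        spectrum = np.abs(np.fft.rfft(seg))
        freqs = np.fft.rfftfreq(chunk, d=1.0 / rate)
        band = spectrum[(freqs >= 300) & (freqs <= 3400)]    # crude voice band
        flags.append(band.mean() > threshold)                # True if the chunk looks voiced
    return flags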
Edit:
The point of the project is writing the system on my own, so I cannot use ready-made libraries like Sphinx or HTK.
Regards,
Karol

If anybody has the same question in the future, look for two main keywords:
MFCC - Mel-frequency cepstral coefficients, used to calculate a series of coefficients for each word template
DTW - dynamic time warping, used to match a captured word against the templates
A good enough description of DTW can be found on Wikipedia.
This approach was good enough to reach around 80% accuracy on a 20-word dictionary and give a good demo during the class.
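For illustration, a compact Python/NumPy sketch of the DTW matching step (the project itself was C++; MFCC extraction is assumed to be done already, and templates is a hypothetical dict mapping each word label to its MFCC matrix):
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between two MFCC sequences (one row per frame)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],         # insertion
                                 cost[i, j - 1],         # deletion
                                 cost[i - 1, j - 1])     # match
    return cost[n, m]

def recognize(word_mfcc, templates):
    """Return the template label with the smallest DTW distance to the captured word."""
    return min(templates, key=lambda label: dtw_distance(word_mfcc, templates[label]))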

To recognize commands on the phone you can use PocketSphinx. A tutorial covering speech recognition applications on Android is available on the CMUSphinx website.

Related

Why is Word2Vec's most_similar() function giving senseless results after training?

I am running the gensim word2vec code on a corpus of resumes (stopwords removed) to identify similar context words in the corpus for a list of pre-defined keywords.
Despite several iterations with the input parameters, stopword removal, etc., the similar context words do not make sense at all (in terms of distance or context).
E.g. correlation and matrix occur in the same window several times, yet matrix does not appear in the most_similar results for correlation.
Following are the details of the system and code:
gensim 2.3.0, running on Python 2.7 (Anaconda)
Training resumes: 55,418 sentences
Average words per sentence: 3-4 words (post stopword removal)
Code:
import multiprocessing
import gensim

# sentences: list of tokenized resume sentences (loaded elsewhere)
wordvec_min_count=int()
size = 50
window=10
min_count=5
iter=50
sample=0.001
workers=multiprocessing.cpu_count()
sg=1
bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)
model=gensim.models.Word2Vec(sentences = trigram[sentences], size=size, alpha=0.005, window=window, min_count=min_count,max_vocab_size=None,sample=sample, seed=1, workers=workers, min_alpha=0.0001, sg=sg, hs=1, negative=0, cbow_mean=1,iter=iter)
model.wv.most_similar('correlation')
Out[20]:
[(u'rankings', 0.5009744167327881),
(u'salesmen', 0.4948525130748749),
(u'hackathon', 0.47931140661239624),
(u'sachin', 0.46358123421669006),
(u'surveys', 0.4472047984600067),
(u'anova', 0.44710394740104675),
(u'bass', 0.4449636936187744),
(u'goethe', 0.4413239061832428),
(u'sold', 0.43735259771347046),
(u'exceptional', 0.4313117265701294)]
I am lost as to why the results are so random. Is there any way to check the accuracy of word2vec?
Also, is there an alternative to word2vec for the most_similar() function? I read about GloVe but was not able to install the package.
Any information in this regard would be helpful.
Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you see incremental progress taking time over the expected number of texts, over the expected number of iterations.)
You may be hitting this open bug issue in Phrases, where requesting the phrase-promotion (as with trigram[sentences]) only offers a single-iteration object, rather than the multiply-iterable collection that Word2Vec needs.
Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support single-iteration, only the vocabulary will be discovered – training will end instantly, and the model will appear untrained.
As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that – which will also save some redundant processing.
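A minimal sketch of that workaround, assuming the phrased corpus fits in memory (the variable names are reused from the question's code):
# Materialize the phrase-combined corpus into a plain list so Word2Vec can
# iterate over it repeatedly: once for the vocabulary scan, then iter more times.
phrased_sentences = [trigram[bigram[s]] for s in sentences]

model = gensim.models.Word2Vec(
    phrased_sentences, size=size, window=window, min_count=min_count,
    sample=sample, workers=workers, sg=sg, iter=iter)
model.wv.most_similar('correlation')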

word2vec guessing word embeddings

Can word2vec be used for guessing words from just context?
Having trained the model on a large data set (e.g. Google News), how can I use word2vec to predict a similar word from only the context, e.g. with input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
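For example, a rough sketch with gensim (the toy corpus here is made up purely so the snippet runs; predict_output_word() requires a model trained with negative sampling):
from gensim.models import Word2Vec

# Toy corpus only to make the snippet runnable; use your own sentences in practice.
corpus = [["kasparov", "dominated", "chess", "for", "years"],
          ["top", "players", "compete", "in", "st", "louis"]] * 200

model = Word2Vec(corpus, size=50, window=5, min_count=1, negative=5, iter=5)
# Ranked guesses for the missing word, given surrounding context words.
print(model.predict_output_word(["dominated", "chess", "years", "compete", "players"], topn=5))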
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories to be able to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee words-of-a-class will necessarily all be nearer each other than other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping at such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision" but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using Theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features for them and saving the features in a text file. I have 20 speech files (10 each) and I am generating a text file for each word, so 20 text files that contain the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well but have not found a really good explanation of the concept.
Any simpler way of using LSTMs would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.
Theano is too low-level; you might try Keras instead, as described in the tutorial. You can run the tutorial "as is" to understand how things go.
Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every data frame in the sequence you need to assign an output label.
Keras supports two types of RNN layers - layers returning sequences and layers returning single values. You can experiment with both; in code you just use return_sequences=True or return_sequences=False.
To train with sequences you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the inputs and output labels into arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every frame is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and the word ID for the final frame.
To train with just labels you need to place the inputs and output labels into arrays, and the output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is one-hot encoded (np_utils.to_categorical) to turn the labels into vectors instead of plain numbers.
Then you create the network architecture. You can have 13 floats for the input and a vector for the output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate model quality on a held-out set after training.
You will have a problem with convergence since you have just 20 examples. You need way more examples, preferably thousands, to train an LSTM; with so little data you will only be able to use very small models.
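Putting the above together, a minimal Keras sketch of the label-per-sequence variant (Keras 2-era parameter names; random arrays stand in for the real data, with shapes taken from the question's 13x56 MFCC matrices, and the layer sizes are only a starting point):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils import np_utils

# Stand-in data: 20 utterances, 56 frames each, 13 MFCCs per frame.
# In practice, load each 13x56 text file and transpose it to shape (56, 13).
X = np.random.rand(20, 56, 13)
labels = np.array([0] * 10 + [1] * 10)      # 0 = "Hello", 1 = "Goodbye"
Y = np_utils.to_categorical(labels, 2)      # one-hot word labels

model = Sequential()
model.add(LSTM(16, input_shape=(56, 13)))   # small LSTM returning one value per sequence
model.add(Dense(2, activation='softmax'))   # one output unit per word
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y, epochs=50, batch_size=4, validation_split=0.2)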

Artificial Neural Network with large inputs & outputs

I've been following Dave Miller's ANN C++ Tutorial, and I've been having some problems getting it to function as expected.
You can view the code I'm working with here. It's an XCode project, but includes the main.cpp and data set file.
Previously, this program would only give outputs between -1 and 1, presumably due to the use of the tanh function. I've manipulated the data so I can feed in much larger inputs and get valid outputs: I simply multiply the input values by 0.0001 and the output values by 10000.
The training data I'm using is the included CSV file. The last column is the expected output, the rest are inputs. Am I using the wrong mathematical function for this data?
Would you say that this is actually learning? This whole thing has stressed me out so much; I understand the theory behind ANNs but just can't implement one from scratch myself.
The net recent average error definitely gets smaller and smaller, which to me would say it is learning.
I'm sorry if I haven't explained myself very well; I'm very new to ANNs and this whole thing is very confusing to me. My university lecturers are useless when it comes to the practical side; they only teach us the theory.
I've been playing around with the eta and alpha values, along with the number of hidden layers.
You explained yourself quite well. If the net recent average error is getting lower and lower, it probably means that the network is actually learning, but here is my suggestion for how to be completely sure.
Take your CSV file and split it into two files: one should be about 10% of all the data and the other should contain the remaining 90%.
Start with an untrained network, run your 10% file through the net, and for each line save the difference between the actual output and the expected result.
Then train the network only with the 90% of the CSV file, and finally run the first 10% file through the net again, comparing the differences you had on the first run with the latest ones.
You should find that the new results are much closer to the expected values than the first time, and this would be the final proof that your network is learning.
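A minimal sketch of that check in Python (the Net class here is a hypothetical stand-in for your C++ network; only the split-train-compare procedure matters):
import random

# Hypothetical stand-in for the C++ net: train() fits something trivial and
# predict() produces an output. Replace it with the real network.
class Net:
    def __init__(self):
        self.w = 0.0
    def train(self, rows):
        ratios = [r[-1] / (sum(r[:-1]) or 1.0) for r in rows]
        self.w = sum(ratios) / len(ratios)
    def predict(self, inputs):
        return self.w * sum(inputs)

# Synthetic "CSV": 4 input columns plus a target that is 2x their sum.
rows = [r + [2.0 * sum(r)] for r in
        ([random.random() for _ in range(4)] for _ in range(1000))]

random.shuffle(rows)
split = len(rows) // 10
test_rows, train_rows = rows[:split], rows[split:]   # ~10% held out, ~90% for training

def mean_abs_error(net, rows):
    return sum(abs(net.predict(r[:-1]) - r[-1]) for r in rows) / len(rows)

net = Net()
before = mean_abs_error(net, test_rows)   # error of the untrained net on the 10%
net.train(train_rows)                     # train only on the 90%
after = mean_abs_error(net, test_rows)    # error on the same 10% after training
print(before, after)                      # after should be clearly smaller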
Does this make any sense? If not, please share some code or send me a link to the exercise you are running and I will try to explain it in code.

Is it possible to see if two MP3 files are the same song by analyzing the files' bytes?

This is to be done in C++ or C....
I know we can read the MP3s' metadata, but that information can be changed by anyone, can't it?
So is there a way to analyze a file's contents and compare it against another file and determine if it is in fact the same song?
Edit:
Lots of interesting things coming out that I hadn't thought of. Not at all a good idea to attempt this.
It's possible, but very hard.
Even the same original recording may well be encoded differently by different MP3 encoders or the same encoder with different settings... leading to different results when the MP3 is then decoded. You'd need to work out an aural model to "understand" how big the differences are, and make a judgement.
Then there's the matter of different recordings. If I sing "Once in Royal David's City" and Aled Jones sings it, are those the same song? What if there are two different versions of a song where one has slightly modified lyrics? The key could be different, it could be in a different vocal range - all kinds of things.
How different can two songs be but still count as "the same song"? Once you've decided that, then there's the small matter of implementing it ;)
If I really had to do this, my first attempt would be to take a Fourier transform of both songs and compare the histograms. You can use FFTW (http://www.fftw.org/) to take the Fourier transform, and then compare the histograms by summing the squares of the differences at each frequency. If the resultant sum is greater than some threshold (which you must determine by experimentation) then the songs are deemed to be different, otherwise they are the same.
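As a rough illustration of that idea (sketched in Python/NumPy rather than FFTW, and assuming both MP3s have already been decoded to equal-rate mono sample arrays; the threshold is a placeholder to be determined by experimentation):
import numpy as np

def spectrum_histogram(samples, n_bins=1024):
    """Magnitude spectrum of a mono signal, averaged over fixed-size frames."""
    frames = len(samples) // n_bins
    spec = np.zeros(n_bins // 2 + 1)
    for i in range(frames):
        spec += np.abs(np.fft.rfft(samples[i * n_bins:(i + 1) * n_bins]))
    return spec / max(frames, 1)

def same_song(samples_a, samples_b, threshold=1e6):
    # Sum of squared differences between the two averaged spectra.
    diff = spectrum_histogram(samples_a) - spectrum_histogram(samples_b)
    return float(np.sum(diff ** 2)) < threshold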
No, it's not so simple.
You can check whether they contain the same encoded data, BUT:
it could be a different bitrate;
it could be the same song, just 1/100th of a second off.
In both cases the bytes would not match.
Basically, if a solution looks too simple to be true, it often is.
If you mean "same song" in the iTunes sense of "same recording", it would be possible to compares two audio files, but not by byte-by-byte comparison of an encoded file since even for the same format there are variables such as data rate and compression that are selected at time of encoding.
Also each encoding of the same recording may include different lead-in/lead-out timings, different amplitude and equalisation, and may have come from differing original sources (vinyl, CD, original master etc.). So you need a comparison method that takes all these variables into account, and even then you will end up with a 'likelihood' of a match rather than a definitive match.
If you genuinely mean "same song", i.e. any recording by any artist of the same composition and lyrics, then you are unlikely to get a high statistical correlation in most cases, since pitch, tempo, range and instrumental arrangement will be very different.
In the "same recording" scenario, relatively simple signal processing and statistical techniques could be applied; in the "same song" scenario, AI techniques would need to be deployed, and even then I suspect the results would be poor.
If you want to compare MP3 files that originated from the same MP3 encoding but have been tagged with metadata differently, it would be straightforward to just compare the actual audio data. Since they originated from the same encoding, a byte-by-byte comparison works, and you would not even have to compare every byte; sampling just a few should be sufficient to get a key that would be statistically almost impossible to find in another song.
If the files have been produced by different encoders, you would have to extract some "fuzzy" feature keys from the data and compare those keys. In a hurry I would probably construct an algorithm like this (a rough code sketch follows the steps):
Decode audio to pulse-code modulation (wave) in a standard bit rate.
Find a fixed number of feature starting points using some dynamic location algorithm. For example find top 10 highest wave peaks ordered from beginning of wave or simply spread evenly across the wave (it would be a good idea to fix the first and last position dynamically though, since different encodings might not start and end at exactly the same point). An improvement would be to select feature points at positions in the wave that are not likely to be too repetitive.
Extract a set of one-dimensional feature key scalars from the feature points. For example, for each feature normalize the following n-sample values and count the number of zero-crossings, peak to average ratio, mean zero-crossing distance, signal-energy. The goal is to extract robust features that are relatively unique, while still characteristic even if some noise and distortion is added to the signal. This can obviously be improved almost infinitely.
Compare the extracted feature keys of the two files using some accuracy measurement (e.g. 9 out of 10 feature extractions must match at least 99% on 4 out of 5 of their extracted feature keys).
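For instance, a minimal NumPy sketch of steps 2-3 (the window size, number of feature points, and the particular scalars here are arbitrary illustrative choices, not a tested recipe):
import numpy as np

def feature_keys(pcm, n_points=10, window=2048):
    """Pick feature points spread evenly across the decoded wave, then extract
    a few robust scalars from the normalized samples around each point."""
    positions = np.linspace(window, len(pcm) - window, n_points, dtype=int)
    keys = []
    for p in positions:
        seg = pcm[p:p + window].astype(float)
        seg = seg / (np.max(np.abs(seg)) or 1.0)              # normalize amplitude
        zero_crossings = np.count_nonzero(np.diff(np.sign(seg)))
        peak_to_avg = np.max(np.abs(seg)) / (np.mean(np.abs(seg)) or 1.0)
        energy = float(np.sum(seg ** 2))
        keys.append((zero_crossings, peak_to_avg, energy))
    return keys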
The benefit of a feature extraction approach is that you can build a database of features for all your mp3-files and for a single file ask the question: What other media files have exactly or almost exactly the same feature as this one. The feature lookup could be implemented very efficiently with R*-trees or similar, which could be used to give you a fast distance measurement between the n-dimensional feature sets.
The above technique is essentially a variant of what is used in image search algorithms such as SIFT, which is probably the basis of such applications as Photosynth and Google Goggles. In image searching you filter the image for good candidate points for relatively unique features (such as corners of shapes), then you normalize the area around that feature to get normalized color, intensity, scale and direction of features. Finally you extract the features and search an n-dimensional database of features of other images and verify that found features in other images are geometrically positioned in the same pattern as in your search image. The technique for searching audio would be the same, only simpler, since audio is one-dimensional.
Use the open source EchoPrint library to create a signature of the two audio files, and compare them with each other.
The library is very easy to use, and has clear examples on how to create the signatures.
http://echoprint.me/
You can even query their database with the signature and find matching song metadata (such as title, artist, etc).
I think the Fast Fourier Transform (FFT) approach hinted at by jstanley is pretty good for most use cases; in particular, it works for verifying that the two are the same release / same recording by the same artist / same bitrate / audio quality.
To be more explicit, sox and spek (via command line and GUI, respectively) can do this pretty painlessly.
Spek is pretty foolproof -- just open the software and point it to the two audio files in question.
sox can generate spectrograms (FFTs) from the command line like so:
sox "$file" -n spectrogram -o "$outfile".
The result from either is two images; if they look basically identical, then for almost all intents and purposes the two songs are equivalent.
For example, I wanted to test if these two files:
Soundtrack to an imaginary film mixtape 2011.mp3
DJRUM - Sountrack to an imaginary film mixtape 2011 (for mary-anne hobbs).mp3
were the same. diff reported a difference in the binary files (perhaps due to metadata differences or minor encoding differences), but a quick glance at their spectrograms resolved it.