I have a use case where I want to cross-compare two sets of images to find the most similar pairs.
However, the sets are quite big, and for performance reasons I don't want to open and close images all the time.
So my idea is:
std::map<int, Magick::Image> set1;
for (...) { set1[...] = Magick::Image(...); }

std::map<int, int> best;
for (...) {
    Magick::Image set2 = Magick::Image(...);
    // Compare set2 with every image in set1
    ...
    best[...] = ...;   // key of the most similar image in set1
}
Obviously I don't need to store all of set 2, since I work image by image.
But in any case, set1 alone is already so big that storing 32-bit images is too much. For reference: 15,000 images at 300x300 pixels ≈ 5 GB.
I thought about reducing the memory by downsampling the images to monochrome (it does not affect my use case). But how do I do it? Even if I extract a single color channel, ImageMagick still treats the new image as 32 bits, even though it is just one channel.
My final approach has been to write my own parser that reads the image color by color, converts it, and builds a bit vector; comparison is then an XOR plus a bit count. That works (using only about 170 MB).
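For illustration, this is roughly the idea behind the bit-packing comparison, as a Python/numpy sketch of the concept only (not my actual C++ parser; the threshold of 128 is arbitrary):
import numpy as np
from PIL import Image

def to_bitvector(path, threshold=128):
    # Load as 8-bit grayscale, binarize to 1 bit per pixel, pack 8 pixels per byte.
    gray = np.asarray(Image.open(path).convert("L"))
    return np.packbits(gray >= threshold)            # ~11 KB for a 300x300 image

def hamming(a, b):
    # XOR the packed bytes and count the differing bits.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())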
However, this approach is not flexible. What if I want to use 2 bits, or 8 bits, at some point? Is there any way to do this using ImageMagick's own classes and just call compare()?
Thanks!
I have a couple of suggestions - maybe something will give you an idea!
Suggestion 1
Maybe you could use a perceptual hash. Rather than holding all your images in memory, you calculate a hash for each image, one at a time, and then compare the distances between the hashes.
Some perceptual hashes are invariant to image scale (or you can scale all images to the same size before hashing), and most are invariant to image format.
Here is an article by Dr Neal Krawetz... Perceptual Hashing.
ImageMagick can also do Perceptual Hashing and is callable from PHP - see here.
I also wrote some code some time back for this sort of thing... code.
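If Python is an option for the comparison step, a minimal sketch might look like this (assuming the Pillow and imagehash packages and made-up file names, purely for illustration):
from PIL import Image
import imagehash

# phash() returns a 64-bit perceptual hash; subtracting two hashes gives the
# Hamming distance, so a smaller number means more similar images.
h1 = imagehash.phash(Image.open("set1/image1.png"))
h2 = imagehash.phash(Image.open("set2/image7.png"))
print(h1 - h2)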
Suggestion 2
I understand that ImageMagick version 7 is imminent - no idea who could tell you more - and that it supports true single-channel, grayscale images, as well as multi-spectral images with up to 32 channels. I believe it can also act as a server, holding images in memory for subsequent use. Maybe that can help.
Suggestion 3
Maybe you can get some mileage out of GNU Parallel - it can keep all your CPU cores busy in parallel and also distribute work across a number of servers using ssh. There are plenty of tutorials and examples out there, but just to demonstrate comparing each item of a named set of images (a,b,c,d) with each of a numbered set of images (1,2), you could do this:
parallel -k echo {#} compare {1} {2} ::: a b c d ::: 1 2
Output
1 compare a 1
2 compare a 2
3 compare b 1
4 compare b 2
5 compare c 1
6 compare c 2
7 compare d 1
8 compare d 2
Obviously I have put echo in there so you can see the commands generated, but you can remove that and actually run compare.
So, your code might look more like this:
#!/bin/bash
# Create a bash function that GNU Parallel can call to compare two images
comparethem() {
   result=$(convert -metric rmse "$1" "$2" -compare -format "%[distortion]" info:)
   echo Job:$3 $1 vs $2 $result
}
export -f comparethem
# Next line effectively uses all cores in parallel to compare pairs of images
parallel comparethem {1} {2} {#} ::: set1/*.png ::: set2/*.png
Output
Job:3 set1/s1i1.png vs set2/s2i3.png 0.410088
Job:4 set1/s1i1.png vs set2/s2i4.png 0.408234
Job:6 set1/s1i2.png vs set2/s2i2.png 0.406902
Job:7 set1/s1i2.png vs set2/s2i3.png 0.408173
Job:8 set1/s1i2.png vs set2/s2i4.png 0.407242
Job:5 set1/s1i2.png vs set2/s2i1.png 0.408123
Job:2 set1/s1i1.png vs set2/s2i2.png 0.408835
Job:1 set1/s1i1.png vs set2/s2i1.png 0.408979
Job:9 set1/s1i3.png vs set2/s2i1.png 0.409011
Job:10 set1/s1i3.png vs set2/s2i2.png 0.407391
Job:11 set1/s1i3.png vs set2/s2i3.png 0.408614
Job:12 set1/s1i3.png vs set2/s2i4.png 0.408228
Suggestion 4
I wrote an answer a while back about using REDIS to cache images - that can also work in a distributed fashion amongst a small pool of servers. That answer is here.
Suggestion 5
You may find that you can get better performance by converting the second set of images to Magick Pixel Cache format so that they can be DMA'ed into memory rather than needing to be decoded and decompressed each time. So you would do this:
convert image.png image.mpc
which gives you these two files that ImageMagick can read really quickly:
-rw-r--r-- 1 mark staff 856 16 Jan 12:13 image.mpc
-rw------- 1 mark staff 80000 16 Jan 12:13 image.cache
Note that I am not suggesting you permanently store your images in MPC format as it is unique to ImageMagick and can change between releases. I am suggesting you generate a copy in that format just before you do your analysis runs each time.
Related
I convert my image data to the Caffe DB format (LevelDB, LMDB) using C++; as an example, I use this code for ImageNet.
Does the data need to be shuffled? Can I write all my positives and then all my negatives to the DB, like 00000000111111111, or does the data need to be shuffled so that the labels look like 010101010110101011010?
How does Caffe sample data from the DB? Is it true that it uses a random subset of all the data with size = batch_size?
Should you shuffle the samples? Think about the learning process if you don't shuffle: Caffe sees only 0 samples - what do you expect the algorithm to deduce? Simply predict 0 all the time and everything is fine. If you have plenty of 0s before you hit the first 1, Caffe will become very confident in always predicting 0, and it will be very difficult to move the model away from this point.
On the other hand, if it constantly sees a mix of 0s and 1s, it learns meaningful features for separating the examples from the beginning.
Bottom line: it is very advantageous to shuffle the training samples, especially when using SGD-based approaches.
AFAIK, Caffe does not randomly sample batch_size samples, but rather goes over the input DB sequentially, batch_size samples at a time.
TL;DR
shuffle.
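A minimal sketch of shuffling samples and labels together before writing them to the DB (the arrays below are toy stand-ins; in practice X and y would hold your images and labels):
import numpy as np

X = np.arange(10).reshape(10, 1)        # toy samples
y = np.array([0] * 5 + [1] * 5)         # all negatives first, then all positives

perm = np.random.permutation(len(y))    # one random order, applied to both arrays
X, y = X[perm], y[perm]                 # now the 0s and 1s are interleaved
# ...then write X/y to leveldb/lmdb in this shuffled order.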
I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features and saving them in a text file. I have 20 speech files (10 of each word) and I am generating a text file for each recording, so 20 text files that contain the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well but not found really good understanding of the concept.
Any simpler way of using LSTMs would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.
Theano is too low-level; you might try Keras instead, as described in the tutorial. You can run the tutorial "as is" to understand how things work.
Then you need to prepare a dataset. You need to turn your data into sequences of data frames, and for every frame in a sequence you need to assign an output label.
Keras supports two types of RNN layers - layers returning sequences and layers returning single values. You can experiment with both; in code you just use return_sequences=True or return_sequences=False.
To train with sequences you can assign a dummy label to all frames except the last one, where you assign the label of the word you want to recognize. You need to place the inputs and output labels into arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and the word ID for the final frame.
To train with just labels, you also place the inputs and output labels into arrays, but the output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is one-hot encoded (np_utils.to_categorical) to turn the labels into vectors instead of just numbers.
Then you create the network architecture. You can have 13 floats as input and a vector as output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate model quality on a held-out set after training.
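As a rough sketch only (assuming Keras, fixed-length sequences of 56 frames of 13 MFCCs each, two classes, and random toy data in place of real features), the code might look like:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import np_utils

X = np.random.rand(20, 56, 13)                            # (samples, timesteps, features)
y = np_utils.to_categorical(np.random.randint(0, 2, 20))  # one-hot word IDs

model = Sequential()
model.add(LSTM(32, input_shape=(56, 13)))                 # return_sequences=False: one output per utterance
model.add(Dense(2, activation='softmax'))                 # one probability per word
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, nb_epoch=20, batch_size=4)                # "epochs=" in newer Keras versions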
You will have a problem with convergence since you have just 20 examples. You need far more examples, preferably thousands, to train an LSTM; with so few, you will only be able to use very small models.
I am currently working on a project for my university. The task is to write a speech recognition system that will run on a phone in the background, waiting for a few commands (like "call 0 123 ...").
It's a 2-month project, so it does not have to be very accurate. The amount of noise can be assumed to be small, and words will be separated by moments of silence.
I am currently at the point of loading a sample word encoded in raw 16-bit PCM format, splitting it into chunks (about 50 per second), and running an FFT on each chunk in order to get the frequency spectrum.
Things to solve are:
1) Going through the longer recording and splitting it into words.
2) Finding the best match for a word.
1) I was thinking about just checking chunk after chunk and, if I encounter a few chunks that have higher amplitudes in the human voice frequencies, assuming that a word has started (see the sketch after this list). In any case, I am looking for resources that may help with this.
2) This one seems a little bit tougher. Is it necessary to use HMMs for a system like this, or are there simpler methods given that the vocabulary is so small (20 words)?
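For point 1, this is roughly the kind of check I have in mind (a toy Python sketch; the threshold and the requirement of 3 consecutive loud chunks are arbitrary values I would still have to tune):
import numpy as np

def find_word_starts(chunks, threshold=1e3, min_run=3):
    # chunks: list of FFT magnitude spectra, one per ~20 ms chunk of audio.
    energies = [np.sum(np.abs(c) ** 2) for c in chunks]   # energy per chunk
    starts, run = [], 0
    for i, e in enumerate(energies):
        run = run + 1 if e > threshold else 0
        if run == min_run:                    # several loud chunks in a row
            starts.append(i - min_run + 1)    # assume a word started here
    return starts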
Edit:
The point of the project is writing the system on my own, so I cannot use ready-made libraries like Sphinx or HTK.
Regards,
Karol
If anybody has the same question in the future, look for two main keywords:
MFCC - Mel-frequency cepstral coefficients, used to calculate a series of coefficients for each word template
DTW - dynamic time warping, used to match the captured word against the templates
A good enough description of DTW can be found on Wikipedia.
This approach was good enough to get around 80% accuracy on a 20-word dictionary and to give a good demo during the class.
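For reference, a compact DTW distance between two MFCC sequences can be written in plain numpy like this (Euclidean frame distance, no path constraints; transpose the 13x56 matrices so that each row is one frame):
import numpy as np

def dtw_distance(a, b):
    # a, b: arrays of shape (n_frames, 13), one MFCC vector per row.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])    # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j],             # insertion
                                 D[i, j - 1],             # deletion
                                 D[i - 1, j - 1])         # match
    return D[n, m]

The captured word then gets the label of the template with the smallest distance.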
To recognize commands on the phone you can use PocketSphinx. A tutorial that covers speech recognition applications on Android is available on the CMUSphinx website.
(Stata/MP 13.1)
I am working with a set of massive datasets that take an extremely long time to load. I am currently looping through all the datasets to load each of them.
Is it possible to just tell Stata to load the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to read the entire dataset? If I instead load the entire dataset and then just keep the first 5 observations, the process takes an extremely long time.
Here are two work-arounds I have already tried
use in 1/5 using mydata : I think this is more efficient than just loading the data and then keeping the observations you want in a different line, but I think it still reads in the entire data set.
First load all the datasets, then save copies that contain only the first 5 observations, and then just use the copies: this is cumbersome, as I have a lot of different files. I would much prefer a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire dataset.
I'd say using in is the natural way to do this in Stata, but testing shows
you are correct: it really makes no "big" difference, given the size of the dataset. An example (with 148,000,000 observations) is:
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising, since in seems really efficient in other contexts.
I would contact Stata technical support (and/or search around, including www.statalist.com), if only to ask why in isn't much faster
(independently of you finding some other strategy to handle this problem).
It's worth using, of course, but it is not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (use a loop), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work.
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there was high variability in the speed of using
use mybigdata if runiform() <= 0.0001
each time I ran it. When I posted that the above was a solution, that particular run just happened to be a faster instance. However, now that I am repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata dataset without having to read in the entire dataset, especially when the dataset is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each dataset). Below is the output from the first four describe, short statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent
There are two parameters when using RBF kernels with support vector machines: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, γ) pair so that the classifier can accurately predict unknown data (i.e., testing data).
weka.classifiers.meta.GridSearch is a meta-classifier for tuning a pair of parameters. It seems, however, that it takes ages to finish (when the dataset is rather large). What would you suggest doing in order to bring down the time required to accomplish this task?
According to A User's Guide to Support Vector Machines:
C: the soft-margin constant. A smaller value of C allows points close to the boundary to be ignored, which increases the margin.
γ > 0 is a parameter that controls the width of the Gaussian.
Hastie et al.'s SVMPath explores the entire regularization path for C and requires roughly the same computational cost as training a single SVM model. From their paper:
Our R function SvmPath computes all 632 steps in the mixture example (n+ = n− = 100, radial kernel, γ = 1) in 1.44(0.02) secs on a pentium 4, 2Ghz linux machine; the svm function (using the optimized code libsvm, from the R library e1071) takes 9.28(0.06) seconds to compute the solution at 10 points along the path. Hence it takes our procedure about 50% more time to compute the entire path, than it costs libsvm to compute a typical single solution.
They released a GPLed implementation of the algorithm in R that you can download from CRAN here.
Using SVMPath should allow you to find a good C value for any given γ quickly. However, you would still need to do separate training runs for different γ values; even so, this should be much faster than doing separate runs for each (C, γ) pair.
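Not SVMPath itself, but as a rough sketch of that strategy (fix γ, then search C over a coarse logarithmic grid) - assuming scikit-learn in Python is an acceptable stand-in for Weka, and using toy data purely for illustration:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # toy data

for gamma in 2.0 ** np.arange(-15, 4, 4):            # coarse grid over gamma
    results = []
    for C in 2.0 ** np.arange(-5, 16, 4):            # coarse grid over C for this gamma
        acc = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma), X, y, cv=3).mean()
        results.append((acc, C))
    best_acc, best_C = max(results)                  # refine around the best (C, gamma) afterwards
    print("gamma=%g  best C=%g  CV accuracy=%.3f" % (gamma, best_C, best_acc))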