Attributes of complex numbers in Weka - weka

I created a dataset includes complex numbers (samples of complex signals). The dataset has 80 instances and 1024 attributes, and I need to classify these signals into two classes via Weka. However, the Weka does not deal with complex numbers.
I am just wondering how this can be done?
I tried to change each complex sample into amplitude part {sqrt((real^2)+(imag^2)) and phase {arctan(imag/real)}, but I am confused how to link each amplitude with its corresponding phase when I create the arff file.

Weka has no notion of imaginary numbers, just real-valued ones. You will have to treat the imaginary/real part (or amplitude/phase) as separate attributes. And hope that algorithms will learn a relationship between them.
Of course, you can always engineer additional features to help the learning process, e.g., in which quadrant these such a number is located.

Related

What can be inferred from two datasets using k-means or k-nn

I am wondering what you could infer using data mining from two large datasets that have similar properties. Say you have two datasets containing detailed information about schools in a country and each dataset belongs to a school stage for a particular year. What sorts of things can you do with these datasets using data mining?
I know how to use and apply the algorithms in pandas but I am having problems with getting the motivation behind the k-means especially.
I know you use the k-means to put the unlabeled data into clusters based on number of factors from the dataset and based on the property values of each data element, they are being placed in one of the clusters created. But then what do you do with these clusters? How can you use them for analysing the data? I read that it can even be used for cleaning the data or relating two datasets to each other, but I'm just having hard time to imagine how would you go about to do these things.
Any help is well appreciated. Thanks..
You could do many things with those datasets including:
See which students from a lower stage are more likely to be in which group (successful, unsuccessful etc.) when they reach a higher stage based on some factors
See what factors are affecting the success of the students at different stages (assuming the datasets contain this information)
You can do many different comparisons based on different factors
..and many more. The issue is that it is not really possible to say what can be inferred from your datasets without seeing what information they contain. My suggestion is you should look carefully in two datasets and see if they have some columns that are common and pick the ones that are interest you the most.

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code however I want to build my own.
I have a small set of "Hello" & "Goodbye" recordings that I am using. I preprocess these by extracting the MFCC features for them and saving these features in a text file. I have 20 speech files(10 each) and I am generating a text file for each word, so 20 text files that contains the MFCC features. Each file is a 13x56 matrix.
My problem now is: How do I use this text file to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well but not found really good understanding of the concept.
Any simpler way using LSTM's would also be welcome.
There are many existing implementation for example Tensorflow Implementation, Kaldi-focused implementation with all the scripts, it is better to check them first.
Theano is too low-level, you might try with keras instead, as described in tutorial. You can run tutorial "as is" to understand how things goes.
Then, you need to prepare a dataset. You need to turn your data into sequences of data frames and for every data frame in sequence you need to assign an output label.
Keras supports two types of RNNs - layers returning sequences and layers returning simple values. You can experiment with both, in code you just use return_sequences=True or return_sequences=False
To train with sequences you can assign dummy label for all frames except the last one where you can assign the label of the word you want to recognize. You need to place input and output labels to arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number - 0 for intermediate frames and word ID for final frame.
To train with just labels you need to place input and output labels to arrays and output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that output is vectorized (np_utils.to_categorical) to turn it to vectors instead of just numbers.
Then you create network architecture. You can have 13 floats for input, a vector for output. In the middle you might have one fully connected layer followed by one lstm layer. Do not use too big layers, start with small ones.
Then you feed this dataset into model.fit and it trains you the model. You can estimate model quality on heldout set after training.
You will have a problem with convergence since you have just 20 examples. You need way more examples, preferably thousands to train LSTM, you will only be able to use very small models.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.

Face Recognition Using Backpropagation Neural Network?

I'm very new in image processing and my first assignment is to make a working program which can recognize faces and their names.
Until now, I successfully make a project to detect, crop the detected image, make it to sobel and translate it to array of float.
But, I'm very confused how to implement the Backpropagation MLP to learn the image so it can recognize the correct name for the detected face.
It's a great honor for all experts in stackoverflow to give me some examples how to implement the Image array to be learned with the backpropagation.
It is standard machine learning algorithm. You have a number of arrays of floats (instances in ML or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically in ANN, elements of your array (i.e. features) are inputs of the network and labels (names) are its outputs.
If you are looking for theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need ready implementation, read this question.
You haven't specified what are elements of your arrays. If you use just pixels of original image, this should work, but not very well. If you need production level system (though still with the use of ANN), try to extract more high level features (e.g. Haar-like features, that OpenCV uses itself).
Have you tried writing your feature vectors to an arff file and to feed them to weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
As I understood so far, I suspect the features and the classifier you have chosen not to work.
To your original question: Have you made any attempts to implement a neural network on your own? If so, where you got stuck? Note, that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP. Specifically input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers. The input layer at the bottom, the output layer on the top, hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you go and connect each of your float to a single input node and feed the feature vectors to your network. For your backpropagation you need to supply an error signal that you specify for the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name). Make them for example return 1 in case of a match and 0 else. You could very well use one output node and let it return n different values for the names. Probably it would even be best to use n completely different perceptrons, i.e. one for each name, to avoid some side-effects (catastrophic interference).
Note, that the output of each node is a number, not a name. Therefore you need to use some sort of thresholds, to get a number-name relation.
Also note, that you need a lot of training data to train a large network (i.e. to obey the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even hidden layers.
Further note, that you may need to do a lot of evaluation (i.e. cross validation) to find the optimal configuration (number of layers, number of nodes per layer), or to find even any working configuration.
Good luck, any way!

Is it possible to see if two MP3 files are the same song by analyzing the files' bytes?

This is to be done in C++ or C....
I know we can read the MP3s' meta data, but that information can be changed by anyone, can't it?
So is there a way to analyze a file's contents and compare it against another file and determine if it is in fact the same song?
edit
Lots of interesting things coming out that I hadn't thought of. Not at all a good idea to attempt this.
It's possible, but very hard.
Even the same original recording may well be encoded differently by different MP3 encoders or the same encoder with different settings... leading to different results when the MP3 is then decoded. You'd need to work out an aural model to "understand" how big the differences are, and make a judgement.
Then there's the matter of different recordings. If I sing "Once in Royal David's City" and Aled Jones sings it, are those the same song? What if there are two different versions of a song where one has slightly modified lyrics? The key could be different, it could be in a different vocal range - all kinds of things.
How different can two songs be but still count as "the same song"? Once you've decided that, then there's the small matter of implementing it ;)
If I really had to do this, my first attempt would be to take a Fourier transform of both songs and compare the histograms. You can use FFTW (http://www.fftw.org/) to take the Fourier transform, and then compare the histograms by summing the squares of the differences at each frequency. If the resultant sum is greater than some threshold (which you must determine by experimentation) then the songs are deemed to be different, otherwise they are the same.
No. Not SO simple.
You can check they contain the same encoded data, BUT:
Could be a different bitrate
Could be the same song, just a 1/100ths of a second off
In both cases the bytes would not match.
Basically, if a solution looks too simple to be true, it often is.
If you mean "same song" in the iTunes sense of "same recording", it would be possible to compares two audio files, but not by byte-by-byte comparison of an encoded file since even for the same format there are variables such as data rate and compression that are selected at time of encoding.
Also each encoding of the same recording may include different lead-in/lead-out timings, different amplitude and equalisation, and may have come from differing original sources (vinyl, CD, original master etc.). So you need a comparison method that takes all these variables into account, and even then you will end up with a 'likelihood' of a match rather than a definitive match.
If you genuinely mean "same song", i.e. any recording by any artist of the same composition and lyrics, then you are unlikely to get a high statistical correlation in most cases since pitch, tempo, range, instrumental arrangement will be very different.
In the "same recording" scenario, relatively simple signal processing and statistical techniques could be applied, in the "same song" scenario, AI techniques would need to be deployed, and even then the results I suspect would be poor.
If you want to compare MP3 files that originated from the same MP3, but have tagged with metadata differently, it would be straight forward to just compare the actual audio data. Since it originated from the same MP3 encoding, you should be able to do a byte by byte comparison. You would have to compare all byte. It should be sufficient to sample just a few to get a unique key that would be statistically almost impossible to find in another song.
If the files have been produced by different encoders, you would have to extract some "fuzzy" feature keys from the data and compare those keys. In a hurry I would probably construct an algorithm like this:
Decode audio to pulse-code modulation (wave) in a standard bit rate.
Find a fixed number of feature starting points using some dynamic location algorithm. For example find top 10 highest wave peaks ordered from beginning of wave or simply spread evenly across the wave (it would be a good idea to fix the first and last position dynamically though, since different encodings might not start and end at exactly the same point). An improvement would be to select feature points at positions in the wave that are not likely to be too repetitive.
Extract a set of one-dimensional feature key scalars from the feature points. For example, for each feature normalize the following n-sample values and count the number of zero-crossings, peak to average ratio, mean zero-crossing distance, signal-energy. The goal is to extract robust features that are relatively unique, while still characteristic even if some noise and distortion is added to the signal. This can obviously be improved almost infinitely.
Compare the extracted feature keys of the two files using some accuracy measurement (f.eks. 9 out of 10 feature extractions must match at least 99% on 4 out of 5 of their extracted feature keys).
The benefit of a feature extraction approach is that you can build a database of features for all your mp3-files and for a single file ask the question: What other media files have exactly or almost exactly the same feature as this one. The feature lookup could be implemented very efficiently with R*-trees or similar, which could be used to give you a fast distance measurement between the n-dimensional feature sets.
The above technique is essentially a variant of what is used in image search algorithms such as SIFT, which is probably the base of such application as Photosynth and Google Goggles. In image searching you filter the image for good candidate points for relatively unique features (such as corners of shapes), then you normalize the area around that feature to get normalized color, intensity, scale and direction of features. Finally you extract the features and search an n-dimensional database of features of other images and verify that found features in other images are geometrically positioned in the same pattern as in your search image. The technique for searching audio would be the same, only simpler, since audio is one dimensional.
Use the open source EchoPrint library to create a signature of the two audio files, and compare them with each other.
The library is very easy to use, and has clear examples on how to create the signatures.
http://echoprint.me/
You can even query their database with the signature and find matching song metadata (such as title, artist, etc).
I think the Fast Fourier-Transform (FFT) approach hinted by jstanley is pretty good for most use cases; in particular, it works for verifying that the two are the same release/ same recording by the same artist/ same bitrate / audio quality.
To be more explicit, sox and spek (via command line and GUI, respectively) can do this pretty painlessly.
Spek is pretty foolproof -- just open the software and point it to the two audio files in question.
sox can generate spectograms (FFTs) from the command line line so:
sox "$file" -n spectrogram -o "$outfile".
The result from either are two images; if they look basically identical, then for almost all intents and purposes, the two songs will be equivalent.
For example, I wanted to test if these two files:
Soundtrack to an imaginary film mixtape 2011.mp3
DJRUM - Sountrack to an imaginary film mixtape 2011 (for mary-anne hobbs).mp3
were the same. diff reported a difference in the binary files (perhaps due to metadata differences or minor encoding differences), but a quick glance at their spectrograms resolved it: