I loaded my input data in CSV format into the Weka software and then applied the wavelet filter to the data. The output of this step was Haar1, Haar2, Haar3, and Haar4. What is the meaning of these outputs? Does it mean that a wavelet of level 4 is considered? What is the default wavelet level in Weka?
Six parameters were considered as modeling inputs and one parameter as the modeling output. From the filtering section of the Weka software, the Haar wavelet was applied to the data. After applying this filter, the 5 input parameters were converted into 8 Haar attributes. What is the meaning of these Haar attributes and how should they be interpreted? Where can I find the wavelet level in the Weka software?
Wavelets require the number of columns in the data to be a power of 2; otherwise the filter pads it to the next power. In your case, the data contains 5 columns and got padded to 8.
The Haar wavelet transform is then applied, generating new attributes of the form HaarX with X ranging from 1 to 8 (link to Weka source code).
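If it helps, here is a minimal sketch in Python (with made-up input values) of what the filter does conceptually, using an ordered Haar transform and zero-padding; the exact padding value and normalisation that Weka uses may differ:

    import numpy as np

    def haar_transform(signal):
        """Full ordered Haar wavelet transform of a 1-D signal whose
        length is a power of two."""
        coeffs = signal.astype(float).copy()
        n = len(coeffs)
        while n > 1:
            half = n // 2
            avg = (coeffs[0:n:2] + coeffs[1:n:2]) / np.sqrt(2)   # approximation
            diff = (coeffs[0:n:2] - coeffs[1:n:2]) / np.sqrt(2)  # detail
            coeffs[:half], coeffs[half:n] = avg, diff
            n = half
        return coeffs

    # One instance with 5 numeric attributes ...
    instance = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
    # ... padded with zeros to the next power of two (8) ...
    padded = np.pad(instance, (0, 8 - len(instance)))
    # ... and transformed: the 8 coefficients become the attributes Haar1..Haar8.
    print(haar_transform(padded))

In other words, each HaarX attribute is one coefficient of the transform applied to the (padded) instance.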
In the post A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way, it says
A ConvNet is able to successfully capture the Spatial and Temporal dependencies in an image through the application of relevant filters. The architecture performs a better fitting to the image dataset due to the reduction in the number of parameters involved and reusability of weights.
I don't see how it reduces parameters and reuses weights. Could anyone give an example?
Consider the filter (or kernel) in the image below having 9 pixels and the image having 49 pixels.
In a fully connected layer, we'll have 9*49 = 441 weights.
In a CNN, by contrast, this same filter keeps moving (convolving) over the entire image. All pixel values in the image are multiplied by those same 9 filter values (hence we say the weights are reused). So we need just 9 weights per filter instead of 441 in the FC layer.
The job of a filter is to identify features (such as texture, lines, etc.), which could be anywhere in an image. So we want to reuse this same filter over the entire image.
We can calculate the parameters for the Convolution layer using the formula: ((width_of_Kernel * height_of_Kernel * input_channel)+1) * output_channel
Here we can see that the kernel size, the number of input channels, and the number of output channels determine the number of parameters. By altering them, we can reduce the parameter count, which also reduces the model size; a quick calculation is shown below.
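To make the numbers concrete, here is a quick calculation (the function names are just for illustration, and the 9-pixel filter is assumed to be 3x3) comparing the filter from the example above with a fully connected layer over the same 49 pixels:

    def conv_params(kernel_w, kernel_h, input_channels, output_channels):
        # ((width_of_Kernel * height_of_Kernel * input_channel) + 1 bias) * output_channel
        return ((kernel_w * kernel_h * input_channels) + 1) * output_channels

    def fc_params(in_features, out_features):
        # every input connects to every output, plus one bias per output
        return (in_features + 1) * out_features

    # One 3x3 filter over a single-channel image, one output channel:
    print(conv_params(3, 3, 1, 1))   # 10  -> 9 shared weights + 1 bias
    # A dense layer connecting all 49 pixels to 9 outputs, for comparison:
    print(fc_params(49, 9))          # 450 -> the 441 weights above + 9 biases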
Reading the YOLOv1 paper, it is mentioned[1] that the first part of the network, that is, the convolutional layers, is first trained at an input resolution of 224x224 on the ImageNet dataset. After that, the model is converted to perform detection, in which the input resolution is increased from 224x224 to 448x448. I am wondering how this conversion can be done: if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448, which means that the convolutional layers trained on the ImageNet dataset cannot be reused for detection.
What am I missing here?
[1]: At the end of section "2.2 Training"
if the input of the network is at first 224x224, then the number of parameters should differ from that of 448x448
This is your misunderstanding.
The convolution operation has no constraints on the size of the input and thus on the size of the output. When you train a CNN that has fully connected layers at the end for classification, you're constraining the input to be of a fixed size, because the number of inputs that an FC layer can accept is fixed.
But if you remove the classification head from the network and only use the trained weights of the CNN as a feature extractor, you'll notice that, given an input of any dimension (>= the dimension the network has been trained on), the output will be a set of feature maps whose spatial extent increases as the spatial extent of the input increases.
Hence, in YOLO, the network is initially trained to perform classification at a resolution of 224x224. In this way, the weights of the convolution operations plus the weights of the FC layers at the end learn to extract and classify meaningful features.
After this first training, the FC layers are thrown away and only the feature extraction part is kept. In this way, you can use a good feature extractor, which has already learned to extract meaningful features, in a convolutional fashion (i.e., producing not a feature vector but a feature map as output, which can be post-processed as YOLO does).
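A small sketch with PyTorch (assumed here purely for illustration; the layer sizes are made up and are not YOLO's real backbone) shows the same convolutional weights working at both resolutions:

    import torch
    import torch.nn as nn

    # A toy convolutional backbone (not YOLO's actual architecture).
    backbone = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    )
    n_params = sum(p.numel() for p in backbone.parameters())

    small = torch.randn(1, 3, 224, 224)
    large = torch.randn(1, 3, 448, 448)

    print(backbone(small).shape)  # torch.Size([1, 32, 56, 56])
    print(backbone(large).shape)  # torch.Size([1, 32, 112, 112])
    print(n_params)               # identical parameter count for both inputs

The convolutional weights are the same in both calls; only the spatial extent of the output feature map changes.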
As I understand it, Word2Vec builds a word dictionary (or vocabulary) based on a training corpus, and outputs a K-dim vector for each word in the dictionary. My question is, what exactly is the source of those K-Dim vectors? I'm assuming each vector is either a row or column in one of the weight matrices between the input and hidden layer, or the hidden and output layer. However, I haven't been able to find any sources to back this up, and I'm not literate enough in programming languages to examine the source code and figure it out myself. Any clarifying remarks on this topic would be greatly appreciated!
what exactly is the source of those K-Dim vectors? I'm assuming each vector is either a row or column in one of the weight matrices between the input and hidden layer, or the hidden and output layer.
In the word2vec models (CBOW, skip-gram), the output is a feature matrix of words. This matrix is the first weight matrix, between the input layer and the projection layer (the word2vec model has no hidden layer and no activation function in it), because when we train a word on its context (in the CBOW model), we update this weight matrix. (The second matrix, between the projection and output layers, is also updated; however, we do not use it.)
In the first matrix, the rows correspond to the vocabulary words and the columns to the features of each word (the K dimensions).
If you want more information, have a look at this tutorial:
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
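As a concrete illustration (assuming gensim 4.x; syn1neg is the name of the second matrix when negative sampling is used), you can check that a word's vector is literally a row of that first matrix:

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]
    model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

    # The K-dimensional vector of a word is a row of the first weight matrix:
    row = model.wv.key_to_index["cat"]
    print((model.wv["cat"] == model.wv.vectors[row]).all())  # True

    # The second matrix (projection -> output) also exists, but is usually discarded:
    print(model.syn1neg.shape)  # (vocabulary_size, 10)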
word2vec uses machine learning to obtain word representations. It predicts a word using its context (CBOW) or vice versa (skip-gram).
In machine learning, you have a loss function that represents the error your model makes. This error depends on the model's parameters.
Training a model means minimizing the error with respect to the model's parameters.
In word2vec, these embedding matrices are the model's parameters that are updated during training. I hope this helps you to understand where they come from. Indeed, they are initialized randomly at first and then changed during the training process.
You can take a look at this picture from this paper:
The W matrix that maps the input one-hot word representations to the k-dimensional vectors, and the W' matrix that maps a k-dimensional representation to the output, are both model parameters that we optimize during training.
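To make the role of W and W' concrete, here is a bare-bones sketch of one skip-gram training step using a plain softmax (real implementations use hierarchical softmax or negative sampling, but the idea is the same; all names and sizes here are made up):

    import numpy as np

    V, K = 1000, 100                   # vocabulary size, embedding dimension
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(V, K))       # input -> projection: the word vectors
    W_out = rng.normal(scale=0.01, size=(K, V))   # projection -> output (W')
    lr = 0.025

    def train_pair(center, context):
        """One gradient step pushing W[center] to predict the context word."""
        global W_out
        h = W[center]                          # projection layer = one row of W
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                   # softmax over the vocabulary
        err = probs.copy()
        err[context] -= 1.0                    # gradient of the cross-entropy loss
        grad_h = W_out @ err                   # gradient w.r.t. the projection
        W_out -= lr * np.outer(h, err)         # update W'
        W[center] -= lr * grad_h               # update the word vector itself

    train_pair(center=3, context=17)
    print(W[3][:5])   # rows of W are the embeddings that word2vec finally outputs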
My objective is to detect text in an image and recognize it.
I have managed to detect characters using the stroke width transform.
What should I do to recognize them?
To the best of my knowledge, I thought of training the SVM with my dataset of letter images in different fonts, by detecting feature points and extracting feature vectors from each and every image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I will extract the SIFT feature vector for this character, and I thought of feeding this into the SVM prediction function.
I don't know how to recognize characters using the SVM. I am confused! Please help me and correct me wherever I went wrong conceptually.
I followed this tutorial for the recognition part. Is this tutorial applicable to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate them with a label (sometimes called a class label, and typically an integer) which you will perhaps map to a textual description (e.g., the number 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return an integer. Remember earlier, when preparing your training data, you associated an integer with each label? This is the label predicted by the SVM for this sample. You can then map this label to a character. E.g., if you have associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
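Putting steps 2 and 3 together, here is a minimal sketch using OpenCV's Python bindings (the array names, sizes, and random data are placeholders; substitute your real BOW vectors and labels):

    import cv2
    import numpy as np

    # Placeholder training data: one BOW feature vector per row (float32 is required)
    # and one integer class label per row.
    train_vectors = np.random.rand(100, 64).astype(np.float32)
    train_labels = np.random.randint(0, 26, size=(100, 1)).astype(np.int32)

    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_LINEAR)   # start simple: only C to tune
    svm.setC(1.0)
    svm.train(train_vectors, cv2.ml.ROW_SAMPLE, train_labels)

    # Predict the label of one unseen sample.
    test_vector = np.random.rand(1, 64).astype(np.float32)
    _, result = svm.predict(test_vector)
    predicted_label = int(result[0, 0])

    # Map the integer label back to a character, e.g. 0 -> 'a', 1 -> 'b', ...
    label_to_char = {i: chr(ord('a') + i) for i in range(26)}
    print(label_to_char[predicted_label])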
Additional Notes
You can check out OpenCV's SVM tutorial for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier with few parameters to tune; a decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. E.g., if you are predicting characters found in road signs, which come with different fonts, lighting conditions, and pose differences, then using training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is an issue of domain adaptation in machine learning.
I am having some issues that I am hoping you will be able to clarify. I have taught myself a video encoding process similar to MPEG-2. The process is as follows:
Split an RGBA image into 4 separate channel-data memory blocks, i.e. an array of all R values, a separate array of all G values, etc.
Take one such array and grab a block of 8x8 pixel data, then transform it using the Discrete Cosine Transform (DCT).
Quantize this 8x8 block using a pre-calculated quantization matrix.
Zigzag encode the output of the quantization step. So I should get a trail of consecutive numbers.
Run Length Encode (RLE) the output from the zigzag algorithm.
Huffman code the data after the RLE stage, substituting values from a pre-computed Huffman table.
Go back to step 2 and repeat until all of the channel's data has been encoded.
Go back to step 2 and repeat for each channel.
My first question is: do I need to convert the RGBA values to YUV+A (YCbCr+A) values for the process to work, or can it continue using RGBA? I ask because the RGBA->YUVA conversion is a heavy workload that I would like to avoid if possible.
Next question: should the RLE store runs for just 0s, or can that be extended to all the values in the array? See the examples below:
440000000111 == [2,4][7,0][3,1] // RLE for all values
or
440000000111 == 44[7,0]111 // RLE for 0's only
The final question is: what would a single symbol be in the Huffman stage? Would a symbol to be replaced be a single value like 2 or 4, or would a symbol be the run-level pair [2,4], for example?
Thanks for taking the time to read and help me out here. I have read many papers and watched many YouTube videos, which have aided my understanding of the individual algorithms but not of how they all link together to form the encoding process in code.
(This seems more like JPEG than MPEG-2; video formats are more about compressing differences between frames rather than just compressing images.)
If you work in RGB rather than YUV, you're probably not going to get the same compression ratio and/or quality, but you can do that if you want. Colour-space conversion is hardly a heavy workload compared to the rest of the algorithm.
Typically in this sort of application you RLE the zeros, because that is the element you get a lot of repetitions of (and hopefully also a good number at the end of each block, which can be replaced with a single marker value), whereas other coefficients are not so repetitive. But if you expect repetitions of other values, I guess YMMV.
And yes, you can encode the RLE pairs as single symbols in the Huffman encoding.
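For example, a zero-run RLE over one zigzagged block could emit (run-of-zeros, value) pairs, and those pairs then become the symbols fed to the Huffman coder. A rough sketch, not any particular standard's exact scheme, reusing the digits from the example above as coefficients:

    from collections import Counter

    def rle_zero_runs(coeffs):
        """Turn a zigzagged coefficient list into (zero_run, value) pairs,
        with a (0, 0) end-of-block marker covering the trailing zeros."""
        pairs, run = [], 0
        last = max((i for i, c in enumerate(coeffs) if c != 0), default=-1)
        for c in coeffs[:last + 1]:
            if c == 0:
                run += 1
            else:
                pairs.append((run, c))
                run = 0
        pairs.append((0, 0))   # end-of-block marker
        return pairs

    block = [4, 4, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
    pairs = rle_zero_runs(block)
    print(pairs)   # [(0, 4), (0, 4), (7, 1), (0, 1), (0, 1), (0, 0)]

    # Each (run, value) pair is one Huffman symbol; its frequency across many
    # blocks determines how short its code ends up being.
    print(Counter(pairs))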
1) Yes you'll want to convert to YUV... to achieve higher compression ratios, you need to take advantage of the human eye's ability to "overlook" significant loss in color. Typically, you'll keep your Y plane the same resolution (presumably the A plane as well), but downsample the U and V planes by 2x2. E.g. if you're doing 640x480, the Y is 640x480 and the U and V planes are 320x240. Also, you might choose different quantization for the U/V planes. The cost for this conversion is small compared to DCT or DFT.
2) You don't have to RLE it, you could just Huffman Code it directly.
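As a small illustration of the 2x2 chroma downsampling mentioned in point 1 (a sketch using simple averaging; real encoders may use other subsampling filters):

    import numpy as np

    def downsample_2x2(plane):
        """Average each 2x2 block, halving both dimensions (4:2:0-style subsampling)."""
        h, w = plane.shape
        return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    # A 640x480 frame (stored as a 480x640 array): Y (and A) stay full size,
    # while the U and V planes shrink to 320x240.
    u_plane = np.random.rand(480, 640)
    print(downsample_2x2(u_plane).shape)   # (240, 320)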