Encoding of categorical variables to reduce the effect of erroneous labels - pca

I have a structured dataset containing nominal categorical variables encoded as labels; say one feature takes labels from 1 to 20. Some of the labels in that feature could simply be errors that should not be present in the dataset, and which ones are erroneous is not known in advance.
I wonder if there is an encoding method for that feature such that the effect of erroneous labels is mitigated. One-hot encoding the labels before an ML task creates a dimension for each label, which could give noisy labels a disproportionate effect on the dataset.
With feature hashing, it is not easy to determine the number of output features for each variable, so I don't think it would be reasonable to proceed with it.
Would a compression method such as PCA, applied to the (sparse) one-hot matrix, work well in this case? The labels would then be represented by continuous values, although this could lose information for the correct labels as well as the noisy ones. But in the end the noisy labels would not each take up a dimension in the dataset, which is better.
I also believe that applying Fourier compression to one-hot matrices, treating them as black-and-white images, would be overcomplicated and meaningless for a tabular feature in which frequencies do not matter.
What approach should I follow?
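To make the one-hot-then-PCA idea above concrete, here is a minimal sketch using only NumPy. The data are made up: labels 19 and 20 stand in for rare erroneous labels, and the choice of 10 components is arbitrary. It only illustrates the mechanics and is not evidence that the approach works well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels = 20

# Made-up data: labels 1..18 are common, labels 19 and 20 act as rare "errors".
probs = np.r_[np.full(18, 0.99 / 18), [0.005, 0.005]]
labels = rng.choice(np.arange(1, n_labels + 1), size=2000, p=probs)

# One-hot encode: row i has a 1 in column labels[i] - 1.
one_hot = np.eye(n_labels)[labels - 1]

# PCA via SVD of the column-centered one-hot matrix.
centered = one_hot - one_hot.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

n_components = 10  # arbitrary choice for illustration
compressed = centered @ vt[:n_components].T

# Variance of each one-hot column; rare labels have small variance,
# so they contribute little to the leading components.
column_variance = centered.var(axis=0)
print(compressed.shape)
print(column_variance.round(4))
```

A one-hot column for a label with frequency p has variance p(1 - p), so rare labels carry little variance and tend to be pushed into the trailing components that get discarded.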

Related

Difference in hand color between pretraining dataset and fine-tuning dataset?

I have a pose estimation model pretrained on a dataset in which hands appear in their natural color. I want to fine-tune that model on a dataset of surgeons' hands during surgery. Those hands are in surgical gloves, so the images look somewhat different from images of bare hands.
[image: example from the pretraining dataset]
[image: example from the fine-tuning dataset]
Does this difference in hand colors affect the model performance?
If I can make images of those surgical hands more like normal hands, will I get better performance?
Well, it depends on what your pre-trained model has learned to capture from the pre-training (initial) dataset. Suppose your model has many feature maps and your pre-training dataset has little variation in skin color (which leads to overfitting). In that case, your model has likely "taken the path of least resistance" and learned feature maps that rely on the color space as a means of feature extraction (which might not generalize well when colors differ).
The more your pre-training dataset matches or overlaps with your target dataset, the better the effects of transfer learning will be. So yes, there is a very high chance that making your target dataset (surgical hands) look more similar to your pre-training dataset will positively impact your model's performance. Moreover, I would conjecture that introducing some color variation (e.g., Color Jitter augmentation) in your pre-training dataset could also help your model generalize to your target dataset.
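As a rough illustration of what color jitter augmentation does, here is a simplified NumPy sketch. It is a stand-in for library augmentations, not a reimplementation of any of them; the function name `color_jitter`, the jitter ranges, and the random test image are all assumptions made for the example.

```python
import numpy as np

def color_jitter(image, rng, brightness=0.2, contrast=0.2, saturation=0.2):
    """Randomly perturb brightness, contrast and saturation of an
    H x W x 3 uint8 image (simplified sketch, hypothetical helper)."""
    img = image.astype(np.float32) / 255.0

    # Brightness: scale all pixel values by a random factor.
    img = img * rng.uniform(1 - brightness, 1 + brightness)

    # Contrast: stretch or shrink values around the global mean intensity.
    mean = img.mean()
    img = (img - mean) * rng.uniform(1 - contrast, 1 + contrast) + mean

    # Saturation: stretch or shrink each pixel around its own gray value.
    gray = img.mean(axis=2, keepdims=True)
    img = (img - gray) * rng.uniform(1 - saturation, 1 + saturation) + gray

    return (np.clip(img, 0.0, 1.0) * 255).round().astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # placeholder image
jittered = color_jitter(image, rng)
print(jittered.shape, jittered.dtype)
```

Applied with fresh random factors on every training sample, this kind of perturbation discourages the model from keying on one exact color distribution.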

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
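For illustration, one transformation commonly used in compositional data analysis is the centered log-ratio (CLR). The sketch below applies it and then runs PCA via SVD, using only NumPy. The diet matrix is made up, and the pseudocount used to handle zero proportions is an assumption that you would need to justify for real data.

```python
import numpy as np

def clr(proportions, pseudocount=1e-6):
    """Centered log-ratio transform of each row (rows are compositions)."""
    p = np.asarray(proportions, dtype=float) + pseudocount  # avoid log(0)
    p = p / p.sum(axis=1, keepdims=True)                    # re-close to sum 1
    log_p = np.log(p)
    return log_p - log_p.mean(axis=1, keepdims=True)

def pca_scores(x, n_components=2):
    """PCA via SVD of the column-centered matrix."""
    centered = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return u[:, :n_components] * s[:n_components], explained[:n_components]

# Made-up diet proportions: rows are species, columns are food items.
diet = np.array([
    [0.70, 0.20, 0.10, 0.00],
    [0.65, 0.25, 0.05, 0.05],
    [0.10, 0.30, 0.50, 0.10],
    [0.15, 0.25, 0.45, 0.15],
    [0.40, 0.30, 0.20, 0.10],
])

clr_data = clr(diet)
scores, explained = pca_scores(clr_data)
print(scores.round(3))
print(explained.round(3))
```

Each CLR-transformed row sums to zero, so the transformed data lie in a linear subspace where ordinary PCA is meaningful.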
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the simplex in $\mathbb{R}^k$ (a set with only $k-1$ free dimensions). By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, \dots, k$
$\sum_{i=1}^k p_i = 1$
One way around this is if there's a one-to-one mapping between the simplex and an unconstrained real space such as $\mathbb{R}^{k-1}$. If so, you could map your proportions to that space, do PCA there, then map the PCA vectors back to the simplex.
But I'm not sure the simplex is a self-contained linear space. If you add two elements of the simplex, you don't get an element of the simplex :/
A better approach, I think, is clustering, eg with Gaussian mixtures, or spectral clustering. This is related to PCA. But a nice property of clustering is you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will be within the simplex space, and any mixture of them will be, too.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both the components and their weights to be nonnegative. It's very useful for inferring structure in nonnegative data, like proportions. But NMF does not give you a basis for the simplex space.
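A minimal sketch of NMF, assuming the classic multiplicative update rules for the Frobenius-norm objective and using only NumPy. The diet matrix, the rank of 2, and the iteration count are made up for illustration; a library implementation would be preferable in practice.

```python
import numpy as np

def nmf(x, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor a nonnegative matrix x (n x m) as w @ h with w, h >= 0,
    using multiplicative updates (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n, m = x.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so signs never flip.
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Made-up diet proportions: rows are species, columns are food items.
diet = np.array([
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.30, 0.60],
    [0.15, 0.25, 0.60],
    [0.40, 0.30, 0.30],
])

w, h = nmf(diet, rank=2)
reconstruction_error = np.linalg.norm(diet - w @ h)
print(h.round(3))                   # nonnegative "diet patterns"
print(round(reconstruction_error, 4))
```

The rows of `h` can be read as nonnegative diet patterns and the rows of `w` as how strongly each species expresses them, though neither is forced to sum to 1.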

Understanding the loss function in the YOLO v1 research paper

I'm not able to understand the following piece of text from the YOLO v1 research paper:
"We use sum-squared error because it is easy to optimize,
however it does not perfectly align with our goal of
maximizing average precision. It weights localization error
equally with classification error which may not be ideal.
Also, in every image many grid cells do not contain any
object. This pushes the “confidence” scores of those cells
towards zero, often overpowering the gradient from cells
that do contain objects. This can lead to model instability,
causing training to diverge early on.
To remedy this, we increase the loss from bounding box
coordinate predictions and decrease the loss from confidence
predictions for boxes that don’t contain objects. We
use two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ to accomplish this. We
set $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = .5$"
What does "overpowering" mean in the first paragraph? And why would we decrease the loss from confidence predictions (shouldn't it already be low, especially for boxes that don't contain any object?) and increase the loss from bounding box predictions?
Some cells contain objects and some do not. The model is often very confident that an empty cell contains no object (confidence around zero), and since most cells are empty, the gradient from those cells ends up much greater than the gradient from the cells that do contain objects but are predicted with only moderate confidence (i.e., around 0.7-0.8). That is the sense in which it overpowers them.
So we want the confidence scores of empty cells to count for less, because they are not very "fair". To implement this, the weight on the coordinate predictions is made greater than the weight on the confidence predictions for boxes that contain no object.
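A toy calculation may help show the imbalance. This is a simplified sketch with made-up numbers, not the full YOLO loss: it assumes one box per cell on a 7 x 7 grid, a target confidence of 1 for cells with an object (the paper uses the IOU with the ground truth), and arbitrary predicted confidences of 0.7 and 0.2.

```python
import numpy as np

LAMBDA_NOOBJ = 0.5  # value quoted from the paper

def confidence_terms(pred_conf, target_conf, obj_mask):
    """Sum-squared confidence error, split into object and no-object cells."""
    sq_err = (pred_conf - target_conf) ** 2
    return sq_err[obj_mask].sum(), sq_err[~obj_mask].sum()

S = 7                                    # grid size (49 cells)
obj_mask = np.zeros((S, S), dtype=bool)
obj_mask[2, 3] = obj_mask[4, 1] = True   # only 2 cells contain an object

target_conf = obj_mask.astype(float)      # simplified targets: 1 or 0
pred_conf = np.where(obj_mask, 0.7, 0.2)  # made-up predictions

obj_term, noobj_term = confidence_terms(pred_conf, target_conf, obj_mask)
weighted_noobj = LAMBDA_NOOBJ * noobj_term

print(obj_term, noobj_term, weighted_noobj)
```

Here the 47 empty cells contribute about 1.88 in total against about 0.18 from the 2 object cells, even though each empty cell has a smaller individual error. Scaling the no-object term by 0.5 reduces it to about 0.94, which narrows the gap without removing it.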

t-SNE Choosing the Number of Dimensions

I am using t-SNE for exploratory data analysis. I am using this instead of PCA because PCA is linear and t-SNE is non-linear.
It's really straightforward to determine how many dimensions are required to capture the necessary variance with PCA.
How do I know how many dimensions are required for my data using t-SNE?
I have read a popular and very informative article on t-SNE, but it doesn't discuss dimensionality.
https://distill.pub/2016/misread-tsne/

How to train an SVM for classifying images of the English alphabet?

My objective is to detect text in an image and recognize it.
I have managed to detect characters using the stroke width transform.
What should I do to recognize them?
As far as I know, I could train an SVM on my dataset of letters in different fonts (images) by detecting feature points and extracting feature vectors from each image. (I have used SIFT feature vectors and built the dictionary using k-means clustering.)
Once I have detected a character, I would extract the SIFT feature vector for it and feed that into the SVM prediction function.
I don't know how to do recognition using an SVM. I am confused! Please help me and correct me wherever I have gone wrong conceptually.
I followed this tutorial for the recognition part. Is this tutorial applicable to recognizing characters?
http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
SVM is a supervised classifier. To use it, you will need to have training data that is of the type of objects you are trying to recognize.
Step 1 - Prepare training data
The training data consists of pairs of feature vectors and their corresponding class labels. In your case, it appears that you have extracted a SIFT-based "bag-of-words" (BOW) feature vector for the characters you detected. So, for your training data, you will need to find many examples of the different characters, extract this feature vector for each of them, and associate each one with a label (sometimes called a class label, and typically an integer), which you can then map to a textual description (e.g., the number 0 could be mapped to the character 'a', and so on).
Step 2 - Training the classifier
The SVM classifier takes in as input an array/Mat of feature vectors (one per row) and their associated labels. Tune the parameters of the SVM (i.e., the regularization parameter C, and if applicable, any other parameters for kernels) on a separate validation set.
Step 3 - Predict for unseen data
At test time, given a sample that was not seen by the SVM during training, you compute a feature vector (your SIFT-based BOW vector) for the sample. Pass this feature vector to the SVM's predict function, and it will return an integer. Remember that when preparing your training data, you associated an integer with each label? This integer is the label predicted by the SVM for this sample. You can then map this label to a character. E.g., if you have associated 0 with 'a', 1 with 'b', etc., you can use a vector/hashmap to map the integer to its textual counterpart.
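The three steps above can be sketched end to end as follows. This is a minimal illustration that swaps in scikit-learn's `LinearSVC` rather than OpenCV's SVM, and it uses synthetic vectors in place of real SIFT-based BOW histograms. The class prototypes, vocabulary size, candidate values of C, and the `int_to_char` mapping are all made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, n_per_class, vocab_size = 3, 60, 50
int_to_char = {0: "a", 1: "b", 2: "c"}  # integer label -> character

# Step 1: training data. Synthetic stand-ins for BOW histograms:
# each class is a noisy copy of its own random prototype vector.
prototypes = rng.random((n_classes, vocab_size))
features = np.vstack([
    prototypes[c] + 0.1 * rng.standard_normal((n_per_class, vocab_size))
    for c in range(n_classes)
])
labels = np.repeat(np.arange(n_classes), n_per_class)

X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Step 2: train, tuning C on the separate validation set.
best_acc, best_model = -1.0, None
for c_value in (0.01, 0.1, 1.0, 10.0):
    model = LinearSVC(C=c_value, max_iter=10000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_acc, best_model = acc, model

# Step 3: predict for an unseen sample and map the integer to a character.
new_sample = prototypes[1] + 0.1 * rng.standard_normal(vocab_size)
predicted_int = int(best_model.predict(new_sample.reshape(1, -1))[0])
predicted_char = int_to_char[predicted_int]
print(best_acc, predicted_char)
```

With real data, the rows of `features` would be the BOW vectors computed from your character images, and the feature extraction at test time must match the one used for training exactly.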
Additional Notes
You can check out OpenCV's SVM tutorial for details.
NOTE: Often, for beginners, the hardest part (after getting the data) is tuning the classifier. My advice is to first try a simple classifier that has few parameters to tune. A decent choice is the linear SVM, which only requires you to adjust one parameter, C. Once you manage to get somewhat decent results (which gives some assurance that the rest of your code is working), you can move on to more "sophisticated" classifiers.
Lastly, the training data and the feature vectors you extract are very important. The training data must be "similar" to the test data you are trying to predict. E.g., if you are predicting characters found on road signs, which come with different fonts, lighting conditions, and pose differences, then using training data consisting of characters taken from, say, a newspaper/book archive may not give you good results. This is an issue of domain adaptation in machine learning.