When to filter data during dimensionality reduction of image data?

I am extracting numerical data from biological images (phenotypic profiling of fluorescently labelled cells) to eventually be able to identify data clusters of cells that are phenotypically similar.
I record images on a microscope and extract data from an imaging plate that contains untreated cells (negative control), cells with a "strong" phenotype (positive control), and several treatments. The data are organized in 2D, with rows as cells and columns as the information extracted from those cells.
My workflow is roughly as follows:
Plate-wise normalization of data
Elimination of features (= columns) that carry redundant information, contain too many NAs, show little variance, or are not replicated between different experiments
PCA
tSNE for visualization
Cluster analysis
If I'm interested in only a subset of the data (the controls plus, say, treatments 1 and 2 out of 10), when should I filter it?
Currently I filter before normalization, but I'm afraid that will affect the behaviour and results of PCA/tSNE. Should I run the entire analysis on all the data and filter only before the tSNE visualization?
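A minimal sketch of the two orderings with scikit-learn (the scaler, component counts, and placeholder data are illustrative assumptions, not my actual pipeline):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Placeholder for the cells-by-features matrix of a whole plate.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))
    treatment = rng.integers(0, 10, size=1000)       # 10 treatments per plate
    keep = np.isin(treatment, [0, 1, 2])             # controls + treatments 1 and 2

    # Option A: normalize and fit PCA on ALL cells, filter only before tSNE.
    # The principal components reflect the variance of the full plate.
    scores_all = PCA(n_components=20).fit_transform(
        StandardScaler().fit_transform(X))
    embedding_a = TSNE(n_components=2).fit_transform(scores_all[keep])

    # Option B: filter first, then normalize and fit PCA on the subset only.
    # The components are now driven by the variance within the subset alone.
    scores_sub = PCA(n_components=20).fit_transform(
        StandardScaler().fit_transform(X[keep]))
    embedding_b = TSNE(n_components=2).fit_transform(scores_sub)

The two embeddings differ because the PCA axes (and hence the distances tSNE sees) depend on which cells were used to fit them.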

Related

Feature Selection and PCA in Machine Learning

I have a dataset with around 15 numeric columns and two categorical columns which are a "State" column and an "Income" column with six buckets representing each different income range. Do I need to encode the "Income" column if it contains integers 1-6 representing each income range? In addition, what type of encoder should I use for the "state" column and does anyone have any good resources on this?
In addition, does one typically perform feature selection (wrapper and filter methods such as Pearson's correlation and Recursive Feature Elimination) before PCA? What is a typical correlation threshold when using a method like Pearson's? And what is the ideal number of dimensions or explained variance ratio to use when running PCA? I'm confused about whether to use one of them or both. Thank you.
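As an illustration, here is one possible ordering in scikit-learn, under the assumption that the 1-6 income buckets are ordinal (usable as-is) and the state column is nominal (one-hot encoded); the file name, the 0.9 correlation cutoff, and the 95% explained-variance target are common rules of thumb chosen for the sketch, not fixed standards:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.decomposition import PCA

    df = pd.read_csv("data.csv")                     # hypothetical file name

    # "Income" is ordinal: the integers 1-6 already encode the order, keep as-is.
    # "State" is nominal: one-hot encode it (sklearn >= 1.2; older versions
    # use sparse=False instead of sparse_output=False).
    state_ohe = OneHotEncoder(sparse_output=False).fit_transform(df[["State"]])

    numeric = df.drop(columns=["State"]).select_dtypes("number")

    # Filter step: drop one of each feature pair with |Pearson r| > 0.9.
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
    numeric = numeric.drop(columns=to_drop)

    X = np.hstack([StandardScaler().fit_transform(numeric), state_ohe])

    # PCA: a float n_components keeps enough components to explain
    # that fraction of the variance.
    pca = PCA(n_components=0.95)
    scores = pca.fit_transform(X)
    print(scores.shape, pca.explained_variance_ratio_.sum())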

Is it possible to set the number of rows in a Power BI Small Multiples grid to a variable?

I want a visual comparison of stacked histograms using the same x-axis scale. In this situation, I don't always have the same number of histograms for each dataset.
I would like to do a small multiples grid in Power BI that is N rows by 1 column; where N is the number of instruments that are collecting data for the client site selected in a Slicer. Some clients may be collecting data with one instrument. Other clients may use 3 or more instruments.
Is there a way to set the number of rows to a variable in a small multiples grid or would it be possible to embed a chart in a matrix?
The default is a 2×2 grid of small multiples, but you can adjust the number of rows and columns to up to 6×6. Any multiples that don’t fit on that grid will load in as you scroll down.
You can adjust the style and position of the small multiple titles in the Small multiple title card, and you can change the dimensions of the grid in the Grid layout card.

How to manage custom number formatting in Power BI?

How can I do custom number formatting in a Power BI visual?
I don't want to show every value in millions. I want thousands for the 1-day value, millions for the 1-week value, and an appropriate unit for the 1-year value.
Power BI charts follow the principles of good data visualisation. That includes a scale that is relevant to the data with labels that relate to the scale.
In the visualisation, the differences between the values below 1M are not discernible. The 0M labels support that approach, although they don't look great. But that happens when a chart contains both very large and very small values. Power BI supports only one display unit per visual, and you selected Millions.
You may want to consider a different visual for the data. Not everything needs to be shown as a chart. If you want to show the exact numbers, a simple table might be a better approach: in a sorted list of numbers, the digits of a number act very much like a horizontal bar.
Or split the chart in two and show one chart for values above 1M and another for values below 1M.
Or use Thousands as display units instead of Millions.

FaceRecognition: must all persons have the same number of images?

I want to know whether it is important to have the same number of images per person (e.g. 10 images/person) when training faces with the Eigen/Fisher/LBPH FaceRecognizer, or whether the counts can differ (person1: 10 images, person2: 20 images, ...).
For Eigen/Fisherfaces, the safest answer is that the number of images per class should be balanced. Missing just a few images in one class may be OK, but a class with an order of magnitude more images than all the others will definitely cause a problem. The tolerable imbalance is, I guess, specific to each task.
At the end of the day, each of the mentioned algorithms boils down to finding the training image closest to the query image. Eigen/Fisherfaces are trained on the whole dataset, computing the directions along which the dataset images vary the most. Over- or underrepresenting a class results in an imbalanced model that behaves inadequately towards that class.
Conversely, LBPH is not trained on the whole dataset. It analyzes each training image independently and compares the query image to each of them separately. Dataset comprehensiveness and representativeness therefore matter more here than image count.
OpenCV's documentation has an introduction to the inner workings of these algorithms.
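A minimal sketch with OpenCV's contrib face module (opencv-contrib-python) showing LBPH trained on unequal image counts per person; the directory layout, file names, and image size are made-up assumptions:

    import cv2
    import numpy as np

    # LBPH tolerates different image counts per person: it stores one histogram
    # per training image and matches the query against each of them.
    recognizer = cv2.face.LBPHFaceRecognizer_create()

    # Hypothetical grayscale face crops, all resized to the same shape.
    images, labels = [], []
    for label, n_images in [(0, 10), (1, 20)]:       # person0: 10, person1: 20
        for i in range(n_images):
            img = cv2.imread(f"faces/{label}/{i}.png", cv2.IMREAD_GRAYSCALE)
            images.append(cv2.resize(img, (100, 100)))
            labels.append(label)

    recognizer.train(images, np.array(labels))

    query = cv2.resize(cv2.imread("query.png", cv2.IMREAD_GRAYSCALE), (100, 100))
    label, distance = recognizer.predict(query)      # nearest stored histogram
    print(label, distance)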

Stratified sampling in WEKA

How can I split a data set into a training set and a test set of 75% and 25% of the original data, respectively, using stratified sampling, so that the proportional class sizes are preserved in both new sets? I am trying to do this with WEKA.
The "RemovePercentage" filter does not do it in a stratified manner, and the "StratifiedRemoveFolds" filter does not work with percentages.
I would appreciate any help or suggestion.
So, as a workaround, I split the data set into two using StratifiedRemoveFolds with the number of folds set to 2, yielding a 50%/50% split. Then I split one of the folds in two with the same method, yielding two 25% subsets of the original data set. Finally, I merged one of the 25% subsets with the remaining 50%, which gave me the 75%/25% stratified split I was after.
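For comparison, outside WEKA the same 75%/25% stratified split is a one-liner in scikit-learn; this is shown purely as an alternative, with placeholder data standing in for the ARFF contents:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the ARFF file.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 5))
    y = np.repeat([0, 1, 2], [200, 120, 80])         # imbalanced classes

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.25,      # 25% test, 75% train
        stratify=y,          # preserve the class proportions in both sets
        random_state=42)     # reproducible split

    # Each class keeps its original proportion in both sets.
    print(np.bincount(y_train), np.bincount(y_test))  # [150 90 60] [50 30 20]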