How can I split a data set into a training and test set of sizes 75% and 25% of the original data set, respectively using stratified sampling in order to preserve the proportional class sizes in these new sets. I am trying to do this with WEKA.
The "RemovePercentage" filter helps does not do it in a stratified manner, and the "StratifiedRemoveFolds" filter does not do this using percentages.
I would appreciate any help or suggestion.
So, as a work around, I split the data set into two using stratifiedRemoveFolds. in this case my number of folds was 2, yielding a 50%-50% data set. Then, I split one of the folds into two using the same method, yielding a 25%-25% subset of the original data set. Then I merged one of the 25% data sets to the left over 50% yielding a 75%-25% stratified split - which was my target.
Related
I have a dataset with around 15 numeric columns and two categorical columns which are a "State" column and an "Income" column with six buckets representing each different income range. Do I need to encode the "Income" column if it contains integers 1-6 representing each income range? In addition, what type of encoder should I use for the "state" column and does anyone have any good resources on this?
In addition, does one typically perform feature selection (wrapper and filter methods such as Pearson's and Recursive Feature Elimination) before PCA? What is the typical correlation threshold when using a method like Pearson's? And what is the ideal number of dimensions or explained variance ratio one should use when running PCA. I'm confused if you use one of them or both. Thank you.
I am extracting numerical data from biological images (phenotypic profiling of fluorescently labelled cells) to eventually be able to identify data clusters of cells that are phenotypically similar.
I record images on a microcope, extract data from an imaging plate that contains untreated cells (negative control), cells with a "strong" phenotype (as positive control), as well as several treatments. The data is organized in 2D with rows as cells and columns with information extracted from these cells.
My workflow is roughly as follows:
Plate-wise normalization of data
Elimination of features (= columns) that have reduntant information, contain too many NAs, show little variance or are not replicated betwen different experiments
PCA
tSNE for visualization
Cluster analysis
If I'm interested in only a subset of data (controls and let's say treatment 1 and 2, out of 10), when should I filter the data?
Currently, I filter before normalization, but I'm afraid that will impact behaviour and results of PCA/tSNE. Should I do the entire analysis with all the data and filter only before tSNE visualization?
I have used to below hyper parameters to train the model.
rcf.set_hyperparameters(
num_samples_per_tree=200,
num_trees=250,
feature_dim=1,
eval_metrics =["accuracy", "precision_recall_fscore"])
is there any best way to choose the num_samples_per_tree and num_trees parameters.
what are the best numbers for both num_samples_per_tree and num_trees.
There are natural interpretations for these two hyper-parameters that can help you determine good starting approximations for HPO:
num_samples_per_tree -- the reciprocal of this value approximates the density of anomalies in your data set/stream. For example, if you set this to 200 then the assumption is that approximately 0.5% of the data is anomalous. Try exploring your dataset to make an educated estimate.
num_trees -- the more trees in your RCF model the less noise in scores. That is, if more trees are reporting that the input inference point is an anomaly then the point is much more likely to be an anomaly than if few trees suggest so.
The total number of points sampled from the input dataset is equal to num_samples_per_tree * num_trees. You should make sure that the input training set is at least this size.
(Disclosure - I helped create SageMaker Random Cut Forest)
With respect to semantic segmentation, it seems to me that there are multiple ways for the final pixel-wise labeling, such as
softmax, sigmoid, logistic regression or other classical classification methods.
However, for softmax approach, we need to ensure the output map resulting from the network architecture has multiple channels. The number of channels matches the number of classes. For instance, if we are talking two-classes problem, masks and un-masks, then we will use two channels. Is this right?
Moreover, each channel in the output map can be treated as a probability map for a given class. Is this understanding right?
Yes to both questions. The goal of the softmax function is to transform the scores into probabilities so that you can maximize the probability of the true label.
I have a list of dates I'd like to cluster into 3 clusters. Now, I can see hints that I should be looking at k-means, but all the examples I've found so far are related to coordinates, in other words, pairs of list items.
I want to take this list of dates and append them to three separate lists indicating whether they were before, during or after a certain event. I don't have the time for this event, but that's why I'm guessing it by breaking the date/times into three groups.
Can anyone please help with a simple example on how to use something like numpy or scipy to do this?
k-means is exclusively for coordinates. And more precisely: for continuous and linear values.
The reason is the mean functions. Many people overlook the role of the mean for k-means (despite it being in the name...)
On non-numerical data, how do you compute the mean?
There exist some variants for binary or categorial data. IIRC there is k-modes, for example, and there is k-medoids (PAM, partitioning around medoids).
It's unclear to me what you want to achieve overall... your data seems to be 1-dimensional, so you may want to look at the many questions here about 1-dimensional data (as the data can be sorted, it can be processed much more efficiently than multidimensional data).
In general, even if you projected your data into unix time (seconds since 1.1.1970), k-means will likely only return mediocre results for you. The reason is that it will try to make the three intervals have the same length.
Do you have any reason to suspect that "before", "during" and "after" have the same duration? If not, don't use k-means.
You may however want to have a look at KDE; and plot the estimated density. Once you have understood the role of density for your task, you can start looking at appropriate algorithms (e.g. take the derivative of your density estimation, and look for the largest increase / decrease, or estimate an "average" level, and look for the longest above-average interval).
Here are some workaround methods that may not be the best answer but should help.
You can plot the dates as converted durations from a starting date (such as one week)
and convert the dates to number representations for time in minutes or hours from the starting point.
These would all graph along an x-axis but Kmeans should still be possible and clustering still visible on a graph.
Here are more examples of numpy:Python k-means algorithm