Bootstrapping community matrix in R - database-replication

I am new to bootstrapping analyses and I could not find much about bootstrapping community matrices. So my question is:
I have a community matrix (samples as rows and species as columns) composed of hundreds of species and about 10,000 rows. The matrix also has a "groups" column, but the number of samples (rows) in each group differs. For example, group "A" has 5 samples, "B" has 4 samples and "C" has 5 samples. I need to bootstrap each group 1000 times, taking only 4 random samples in each replication. For each replication, I need to calculate the mean value of each column (species). How can I bootstrap this matrix in R, by groups, and generate a new matrix (1000 rows for each group) with the mean of each replication?
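One way to approach this, sketched below under the assumption that the matrix is stored as a data frame "comm" with a "groups" column and one numeric column per species (these names are placeholders, not from the post): split by group, resample 4 rows in each replication, and collect the column means.

    # Minimal sketch; "comm" is a data.frame with a "groups" column and one
    # numeric column per species -- all names here are placeholders.
    set.seed(1)
    n_boot <- 1000  # replications per group
    n_sub  <- 4     # samples drawn in each replication

    boot_means <- do.call(rbind, lapply(split(comm, comm$groups), function(g) {
      species <- g[, setdiff(names(g), "groups"), drop = FALSE]
      reps <- t(replicate(n_boot, {
        # a classic bootstrap resamples with replacement; set replace = FALSE
        # to draw 4 distinct samples per replication instead
        idx <- sample(nrow(species), n_sub, replace = TRUE)
        colMeans(species[idx, , drop = FALSE])
      }))
      data.frame(group = g$groups[1], reps, row.names = NULL)
    }))
    # boot_means has 1000 rows per group: a "group" column plus one column
    # of replicate means per species.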

Related

When to filter data during dimensionality reduction of image data?

I am extracting numerical data from biological images (phenotypic profiling of fluorescently labelled cells) to eventually be able to identify data clusters of cells that are phenotypically similar.
I record images on a microscope and extract data from an imaging plate that contains untreated cells (negative control), cells with a "strong" phenotype (as a positive control), as well as several treatments. The data is organized in 2D with rows as cells and columns with information extracted from these cells.
My workflow is roughly as follows:
Plate-wise normalization of data
Elimination of features (= columns) that have redundant information, contain too many NAs, show little variance or are not replicated between different experiments
PCA
tSNE for visualization
Cluster analysis
If I'm interested in only a subset of the data (the controls and, say, treatments 1 and 2 out of 10), when should I filter the data?
Currently I filter before normalization, but I'm afraid that will affect the behaviour and results of PCA/tSNE. Should I do the entire analysis with all the data and filter only before the tSNE visualization?
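For reference, a rough sketch of the workflow listed above in R (the Rtsne package and all object names are assumptions for illustration, not taken from the post):

    # "dat" is a numeric matrix (rows = cells, columns = features) and
    # "plate" is a factor of plate IDs -- both placeholders.
    library(Rtsne)

    # 1. Plate-wise normalization (here: z-scores within each plate)
    norm <- dat
    for (p in unique(plate)) {
      norm[plate == p, ] <- scale(dat[plate == p, ])
    }

    # 2. Feature elimination: drop columns with NAs or near-zero variance
    keep <- colSums(is.na(norm)) == 0 & apply(norm, 2, var) > 1e-8
    norm <- norm[, keep, drop = FALSE]

    # 3. PCA
    pca <- prcomp(norm)

    # 4. tSNE on the leading PCs for visualization, 5. cluster analysis
    emb <- Rtsne(pca$x[, 1:20], pca = FALSE)$Y
    cl  <- kmeans(pca$x[, 1:20], centers = 5)$cluster

Filtering to a subset of treatments would then just be a row subset applied at whichever step is chosen.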

How to assess the variance explained by a principal component in a subset of data?

I have conducted a PCA (in Matlab) on a set of thousands of points of spatial data. I have also calculated the variance explained across the full dataset by each principal component (i.e. PC or eigenvector) by dividing its eigenvalue by the sum of all eigenvalues. As an example, PC 15 accounts for 2% of the variance in the entire dataset; however, there is a subset of points in this dataset for which I suspect PC 15 accounts for a much higher % of their variance (e.g. 80%).
My question is this: is there a way to calculate the variance explained by a given PC, from my existing analysis, for only a subset of points (e.g. 1,000 points from the full dataset of 500k+)? I know that I could run another PCA on just the subset, but for my purposes I need to continue using the PCs from my original analysis. Any ideas for how to do this would be very helpful.
Thanks!
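One way to get this from the existing decomposition, sketched here in R for concreteness (the original analysis is in Matlab; "X" and "idx" are placeholder names): centre the subset using the full-data means, project it onto the original eigenvectors, and compare the variance of the subset's scores on PC 15 with the subset's total variance. Because the rotation is orthonormal, the score variances summed over all PCs equal the subset's total variance.

    # "X" is the full data matrix, "idx" indexes the subset -- placeholders.
    pca <- prcomp(X)                        # the original full-data PCA

    # Project the subset onto the *original* eigenvectors, so the PCs stay
    # fixed rather than being re-estimated on the subset
    sub_centred <- sweep(X[idx, , drop = FALSE], 2, pca$center)
    scores      <- sub_centred %*% pca$rotation

    # Fraction of the subset's variance that falls along PC 15
    var_pc15 <- var(scores[, 15]) / sum(apply(scores, 2, var))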

Is it possible to set the number of rows in a Power BI Small Multiples grid to a variable?

I want a visual comparison of stacked histograms using the same x-axis scale. In this situation, I don't always have the same number of histograms for each dataset.
I would like to do a small multiples grid in Power BI that is N rows by 1 column; where N is the number of instruments that are collecting data for the client site selected in a Slicer. Some clients may be collecting data with one instrument. Other clients may use 3 or more instruments.
Is there a way to set the number of rows to a variable in a small multiples grid or would it be possible to embed a chart in a matrix?
The default is a 2×2 grid of small multiples, but you can adjust the number of rows and columns to up to 6×6. Any multiples that don’t fit on that grid will load in as you scroll down.
You can adjust the style and position of the small multiple titles in the Small multiple title card, and you can change the dimensions of the grid in the Grid layout card.

Weighted Average Calculations across various combinations using Cube.js

We have a question on designing a schema and handling an analytics requirement for our product, and would appreciate your advice. We are just getting started with Cube.js. Here is our requirement: we have data where (for simplicity, I will use an example) there are multiple attribute columns plus one "value" column and one "weight" column. We need to calculate weighted averages across all combinations of the attribute columns, using the value and weight columns.
e.g. group by Column 1 and take the weighted average (value weighted by the Weight column),
or group by Columns 1 and 2 and take the weighted average, and so on.
There can be many such combinations, and we have at least 8 to 12 attribute columns.
Wondering how best to model this?
It will probably be most convenient to create one cube with several predefined segments, or alternatively you can create a separate cube for each attribute.
It depends on your data.
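For reference, the quantity each of these groupings needs is sum(value × weight) / sum(weight); a quick sketch of that aggregation in R, with hypothetical column names "value", "weight", "col1" and "col2":

    # "df", "value", "weight", "col1", "col2" are hypothetical names.
    df$vw <- df$value * df$weight
    agg   <- aggregate(cbind(vw, weight) ~ col1 + col2, data = df, FUN = sum)
    agg$weighted_avg <- agg$vw / agg$weight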

Data distribution in Redshift for a star schema model?

I have a big fact table (2 billion rows) and 19 dimensions (the product dimension is big at 450 million rows, another two dimensions are around 100 million rows each, and the rest are small dimension tables).
Can someone help me with the data distribution strategy for this scenario?