I have a big fact table with 2 billion rows and 19 dimensions (the product dimension is big at 450 million rows, two other dimensions are around 100 million rows each, and the rest are small dimension tables).
Can someone help me with a data distribution strategy for this scenario?
I am extracting numerical data from biological images (phenotypic profiling of fluorescently labelled cells) to eventually be able to identify data clusters of cells that are phenotypically similar.
I record images on a microscope and extract data from an imaging plate that contains untreated cells (negative control), cells with a "strong" phenotype (positive control), as well as several treatments. The data is organized in 2D, with rows as cells and columns holding the information extracted from those cells.
My workflow is roughly as follows:
Plate-wise normalization of data
Elimination of features (= columns) that carry redundant information, contain too many NAs, show little variance, or are not replicated between different experiments
PCA
tSNE for visualization
Cluster analysis
If I'm interested in only a subset of the data (the controls plus, say, treatments 1 and 2 out of 10), when should I filter the data?
Currently, I filter before normalization, but I'm afraid that will affect the behaviour and results of PCA/tSNE. Should I do the entire analysis with all the data and filter only before the tSNE visualization?
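For reference, a minimal sketch of such a pipeline in R is shown below, with the subset filter kept as a single movable line. All object and column names (cells, plate, treatment) are hypothetical, and the normalization and feature-elimination rules are only placeholders for the ones described above.

```r
# Minimal sketch; 'cells' is a hypothetical data.frame with one row per cell,
# feature columns, and metadata columns "plate" and "treatment".
library(Rtsne)  # CRAN package providing the Rtsne() t-SNE implementation

feature_cols <- setdiff(names(cells), c("plate", "treatment"))

# 1) Plate-wise normalization (here a simple per-plate z-score)
normed <- do.call(rbind, lapply(split(cells, cells$plate), function(p) {
  p[feature_cols] <- scale(p[feature_cols])
  p
}))

# 2) Feature elimination: drop features with many NAs or near-zero variance
keep <- vapply(normed[feature_cols], function(x) {
  isTRUE(mean(is.na(x)) < 0.05 && var(x, na.rm = TRUE) > 1e-6)
}, logical(1))
normed <- normed[complete.cases(normed[, feature_cols[keep]]), ]

# Optional subset filter (controls plus treatments 1 and 2); moving this line
# earlier or later in the pipeline is exactly what the question is about
# normed <- normed[normed$treatment %in% c("neg_ctrl", "pos_ctrl", "t1", "t2"), ]

# 3) PCA, 4) tSNE on the leading components, 5) cluster analysis
pca  <- prcomp(normed[, feature_cols[keep]])
npc  <- min(20, ncol(pca$x))
tsne <- Rtsne(pca$x[, 1:npc], dims = 2, perplexity = 30, check_duplicates = FALSE)
cl   <- kmeans(pca$x[, 1:npc], centers = 5)
plot(tsne$Y, col = cl$cluster, pch = 16, cex = 0.3)
```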
I am new to bootstrapping analyses and I could not find much about bootstrapping a community matrix. So my question is:
I have a community matrix (samples as rows and species as columns) composed of hundreds of species and about 10,000 rows. I also have a column "groups" in this matrix, but the number of samples (rows) in each group differs. For example, group "A" has 5 samples, "B" has 4 samples and "C" has 5 samples. I need to bootstrap each group 1000 times, taking only 4 random samples in each replicate. For each replicate, I need to calculate the mean value of each column (species). How can I bootstrap this matrix, in R, by groups, and generate a new matrix (1000 rows per group) with the means from each replicate?
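A minimal sketch in base R is below, assuming the matrix is a data.frame called comm containing the species columns plus the "groups" column (the names are hypothetical); it draws 4 rows per replicate with replacement and stores one row of species means per replicate.

```r
# Minimal sketch in base R; 'comm' is the community data.frame with the
# species columns plus a "groups" column (names are hypothetical).
set.seed(1)

species_cols <- setdiff(names(comm), "groups")
n_boot       <- 1000
sample_size  <- 4   # 4 random rows per replicate, as described

boot_means <- do.call(rbind, lapply(split(comm, comm$groups), function(g) {
  reps <- t(replicate(n_boot, {
    idx <- sample(nrow(g), sample_size, replace = TRUE)  # replace = FALSE to subsample instead
    colMeans(g[idx, species_cols])
  }))
  data.frame(group = g$groups[1], reps, row.names = NULL)
}))
# 'boot_means' now has 1000 rows per group and one column per species mean
```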
I want a visual comparison of stacked histograms using the same x-axis scale. In this situation, I don't always have the same number of histograms for each dataset.
I would like to do a small multiples grid in Power BI that is N rows by 1 column; where N is the number of instruments that are collecting data for the client site selected in a Slicer. Some clients may be collecting data with one instrument. Other clients may use 3 or more instruments.
Is there a way to set the number of rows to a variable in a small multiples grid or would it be possible to embed a chart in a matrix?
The default is a 2×2 grid of small multiples, but you can adjust the number of rows and columns to up to 6×6. Any multiples that don’t fit on that grid will load in as you scroll down.
You can adjust the style and position of the small multiple titles in the Small multiple title card:
And you can change the dimensions of the grid in the Grid layout card:
We have a question on designing a schema and handling an analytics requirement for our product and would appreciate your advice. We are just getting started with Cube.js. Here is our requirement: we have data (for simplicity, I will use an example) with multiple attribute columns plus one "value" and one "weight" column. We need to calculate weighted averages, using the value and weight columns, across all combinations of the attribute columns.
e.g. group by Column 1 and compute the weighted average (value weighted by the weight column),
or group by Columns 1 and 2 and compute the weighted average, and so on.
There can be many such combinations, and we have at least 8 to 12 attribute columns like that.
Wondering how best to model this?
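To make the requirement concrete, here is the calculation in plain R with made-up column names (an illustration of the math only, not a Cube.js schema). Because the weighted average is a ratio of two plain sums, sum(value * weight) / sum(weight), it aggregates correctly at any grouping level, which is a common way to model it as two additive measures plus a ratio.

```r
# Illustration only; 'df', 'col1', 'col2', 'value', 'weight' are made-up names.
df <- data.frame(
  col1   = c("a", "a", "b", "b"),
  col2   = c("x", "y", "x", "y"),
  value  = c(10, 20, 30, 40),
  weight = c(1, 2, 3, 4)
)

# Weighted average per group = sum(value * weight) / sum(weight)
wavg_by <- function(data, by) {
  parts <- lapply(split(data, data[by]), function(g) {
    cbind(g[1, by, drop = FALSE],
          wavg = sum(g$value * g$weight) / sum(g$weight))
  })
  do.call(rbind, parts)
}

wavg_by(df, "col1")            # group by one attribute
wavg_by(df, c("col1", "col2")) # group by two attributes, etc.
```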
It will probably be most convenient for you to create one cube with several predefined segments, or alternatively you can create a separate cube for each attribute.
It depends on your data.
We have very sparse data that we are attempting to plot with Google Charts. There are 16 different vectors and each has about 12,000 points. The points are times, and all of the times are different. My reading of the API is that I need to create a row where each element corresponds to a different vector. So that's a set of 192,000 rows, where the first element in each row is the time and all of the other elements are null except for the one that has data there, for a total of 3,072,000 elements. When we give this to Google Charts, the browser dies.
The problem is that our array is sparse, so arrayToDataTable doesn't work for us either.
My question: is there a more efficient way to do this? Can I plot each data value independently, rather than all at the same time?
It turns out that the answer to this question is to do server-side data reduction in the form of binning. The individual rows each have their own timestamp, but because we are displaying this in a graph with a width of at most 2000 pixels, it makes sense to bin the data on the server into 2000 rows, each with 16 columns. The total array is then 32,000 elements, which is well within the limits of the browser.
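The server-side language isn't stated, so purely as an illustration of the binning step, a sketch in R with made-up names might look like this: one bin per output pixel, averaging whatever points fall into each bin.

```r
# Sketch of the binning step only (names are made up; the server language is
# not specified in the question). 'points' has one row per measurement with
# columns: time (numeric), series (an id from 1 to 16), and value.
n_bins <- 2000

bin_edges  <- seq(min(points$time), max(points$time), length.out = n_bins + 1)
points$bin <- cut(points$time, breaks = bin_edges, include.lowest = TRUE, labels = FALSE)

# One row per bin, one column per series, averaging the points in each bin
binned    <- tapply(points$value, list(points$bin, points$series), mean)
binned_df <- data.frame(time = bin_edges[as.integer(rownames(binned))], binned)
# 'binned_df' has at most 2000 rows and 17 columns (time + 16 series)
```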