What exactly is a measure in a star schema in data warehouse design?

A star schema consists of dimension and fact tables.
Fact tables contain foreign keys to each dimension and, in addition to that, they contain "measures". What exactly constitutes a measure?
Is it the stored result of some aggregate function?

Basically yes.
If you had a simple grid:
Salary       January  February  March     April  May    June
             (-------- Q1 ----------)     (-------- Q2 --------)
Me           1100     1100      1100      1100   1500   1500
Colleague1   2000     2000      2000      0      0      0
Time is a hierarchical dimension with two levels (both shown). The other dimension shown is EmployeeID. A further dimension (not shown) could be in the PointOfView (e.g. Budget/Actual).
The Amount (1100, for example) is the measure, and it constitutes your facts (the non-identifying parts of the facts). The dimensions define consolidation functions for each measure on the various levels, e.g. Amount(Q1) == SUM(Amount(January..March)). Note that the consolidation behaves differently depending on the measure (e.g. an income tax % will not be summed but consolidated in some other way; deciding exactly how is the art of OLAP cube design).
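For illustration only, here is a minimal sketch of that consolidation rule in Python/pandas, using the grid above as a toy fact table; the column names and the month-to-quarter mapping are just assumptions for this example:

import pandas as pd

# Toy fact table matching the grid above: one row per (employee, month).
facts = pd.DataFrame({
    "employee": ["Me"] * 6 + ["Colleague1"] * 6,
    "month":    ["Jan", "Feb", "Mar", "Apr", "May", "Jun"] * 2,
    "amount":   [1100, 1100, 1100, 1100, 1500, 1500,
                 2000, 2000, 2000, 0, 0, 0],
})

# The time hierarchy: month -> quarter (the consolidation level).
quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
           "Apr": "Q2", "May": "Q2", "Jun": "Q2"}
facts["quarter"] = facts["month"].map(quarter)

# Amount(Q1) == SUM(Amount(January..March)), per employee.
print(facts.groupby(["employee", "quarter"])["amount"].sum())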
(Trivia: you can have calculated measures that use MDX to query, e.g., the deviation of Amount compared to the preceding quarter, the average salary across the whole quarter, etc.; it will be pretty clear that, again, the consolidation formulas require thought.)
At this point you will start to see that designing the consolidation rules depends on the order in which the rules are calculated: if the formula for 'salary deviation %' is evaluated FIRST and then consolidated, you need to average it; however, if the raw Salary measure is consolidated (summed) to the Q1/Q2 level first, then the derived measure can be calculated there just as it was at the lowest level.
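A tiny worked example of why the order matters, with made-up salary and tax figures (nothing here comes from the grid above):

months = [(1000.0, 200.0), (3000.0, 900.0)]   # (salary, tax) per month -- made-up figures

# Derive the tax rate per month FIRST, then consolidate by averaging:
avg_of_rates = sum(tax / salary for salary, tax in months) / len(months)           # (20% + 30%) / 2 = 25%

# Consolidate (sum) the raw measures first, then derive the rate at the quarter level:
rate_of_sums = sum(tax for _, tax in months) / sum(salary for salary, _ in months)  # 1100 / 4000 = 27.5%

print(avg_of_rates, rate_of_sums)   # 0.25 vs 0.275 -- the two orders disagree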
Now things become more fun when deciding how to store the cube. Basically two ways exist:
precalculate all cells (including all consolidations in all scenarios)
calculate on the fly
It won't surprise anyone that most OLAP engines have converged on hybrid methods (HOLAP), where significant parts of frequently accessed consolidation levels are pre-calculated and stored, and other parts are calculated on the fly.
Some will store the underlying data in a standard RDBMS (ROLAP), others won't (MOLAP). The engines focused on high performance tend to keep all data in precalculated cubes (only resorting to 'many small sub-cubes' for very sparse dimensions).
Well, anyway, this was a bit of a rant; I liked rambling about what I once learned when doing data warehousing and OLAP.

Fact and measure are synonyms afaik. Facts are data: sales, production, deliveries, etc. Dimensions are information tied to the fact (time, location, department).

Measures are one of two kinds of things:
Measurements: numbers with units. Dollars, weights, volumes, sizes, etc.
Aggregates: sums (or sometimes averages) of data. They might be data in the warehouse: pre-computed aggregates kept for performance reasons. Or they might stand in for data that can't be acquired (or isn't needed) because it is too detailed or too high-volume.
The most important thing about a fact table is that the non-key measures are actual measurements with units.

If it were an adjacency tree model, it would be the title field or any other field that contains the data.

Related

How to choose the right number of dimensions in UMAP?

I want to try UMAP on my high-dimensional dataset as a preprocessing step (not for data visualization) in order to decrease the number of features, but how can I choose (if there is a method) the right number of dimensions onto which to map the original data? For example, in PCA you can select the number of components that explain a fixed % of the variance.
There is no good way to do this comparable to the explicit measure given by PCA. As a rule of thumb, however, you will get significantly diminishing returns for an embedding dimension larger than the n_neighbors value. With that in mind, and since you actually have a downstream task, it makes the most sense to build a pipeline to the downstream task evaluation and look at cross validation over the number of UMAP dimensions.
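A rough sketch of that pipeline idea, assuming the umap-learn and scikit-learn packages; the synthetic data and the logistic-regression classifier are only stand-ins for your real dataset and downstream task:

import umap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data: replace with your own features X and downstream labels y.
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

pipe = Pipeline([
    ("umap", umap.UMAP(n_neighbors=15, random_state=0)),  # returns diminish above n_neighbors dimensions
    ("clf", LogisticRegression(max_iter=1000)),           # stand-in downstream task
])

# Cross-validate over the embedding dimension instead of guessing it.
grid = GridSearchCV(pipe, {"umap__n_components": [2, 5, 10, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)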

ELKI: Running LOF with varying k

Can I run LOF with varying k through ELKI so that it is easy to compare which k is the best?
Normally you choose a k, and then you can see the ROC AUC, for example. I want to pick out the best k for the data set, so I need to compare multiple runs. Is there an easier way to do that than manually changing the value of k and re-running? I want, for example, to compare all k = 1..100.
Thanks
The greedy ensemble example shows how to run outlier detection methods for a whole range of k at once, efficiently (by computing the nearest neighbors only once, it is a lot faster!), using the ComputeKNNOutlierScores application included with ELKI.
The application EvaluatePrecomputedOutlierScores can be used to bulk-evaluate these results with multiple measures.
This is what we used for the publication
G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent and M. E. Houle
On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study
Data Mining and Knowledge Discovery 30(4): 891-927, 2016, DOI: 10.1007/s10618-015-0444-8
On the supplementary material website, you can look up the best results for many standard data sets, as well as download the raw results.
But beware that outlier detection quality results tend to be inconclusive. On one data set, one method performs best, on another data set another method. There is no clear winner, because data sets are very diverse.

Scheduling - Spread out assigned event times evenly

I am trying to schedule a certain number of events in the week according to certain constraints, and would like to spread out these events as evenly as possible throughout the week.
If I add the standard deviation of the intervals between events to the objective function, then CPLEX can minimise it.
I am struggling to define the standard deviation of the intervals in terms of CPLEX expressions, mainly because the events don't have to be in any particular sequence, and I don't know which event is prior to any other one.
I feel sure this must be a solved problem, but I have not been able to find help in IBM's cplex documentation or on the internet.
Scheduling Uniformly Spaced Events
Here are a few possible ideas for you to try:
Let t0, t1, t2, t3 ... tn be the event times. (These are variables chosen by the model.)
Let d1 = t1 - t0, d2 = t2 - t1, ..., dn = tn - t(n-1).
Goal: We want all these d's to be roughly equal, which would have the effect of roughly evenly spacing out the t's.
Options
Option 1: Put a cost on the deviation from ideal
Let us take one example. Let's say that you want to schedule 10 events in a week (168 hours). With no other constraint except
equal spacing, we could have the first event start at time t=0 and the last one at t=168. The others would then be 168/(10-1) ≈ 18.7 hours apart. Let's call this d_ideal.
We don't want any d to be much less than d_ideal (18.7) or much greater than it.
That is, in the objective, add Cost_dev * abs(d_ideal - dj) for each gap dj.
(You have to create two variables for each d, d+ and d-, to handle the absolute values in the objective function.)
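Here is a rough sketch of Option 1, using the PuLP library instead of CPLEX (the same d+/d- trick carries over directly to docplex or OPL); the cost weight and the fixed event ordering are assumptions for the illustration, and your real hard constraints would be added on top:

import pulp

n, horizon = 10, 168.0          # 10 events in a 168-hour week
d_ideal = horizon / (n - 1)     # ~18.7 hours between consecutive events

m = pulp.LpProblem("spread_events", pulp.LpMinimize)
t = [pulp.LpVariable(f"t{i}", 0, horizon) for i in range(n)]
dev_pos = [pulp.LpVariable(f"devp{i}", 0) for i in range(n - 1)]
dev_neg = [pulp.LpVariable(f"devn{i}", 0) for i in range(n - 1)]

for i in range(n - 1):
    m += t[i + 1] >= t[i]                                          # assume a fixed event order
    m += (t[i + 1] - t[i]) - d_ideal == dev_pos[i] - dev_neg[i]    # split |d_i - d_ideal| into d+ and d-

cost_dev = 1.0
m += cost_dev * pulp.lpSum(dev_pos + dev_neg)   # objective: total deviation from d_ideal
m.solve()
print([pulp.value(ti) for ti in t])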
Option 1a
In the method above, all deviations are priced the same, so the model doesn't care whether it has one deviation of 3 hours or two deviations of 1.5 hours each. The way to handle that is to add step-wise costs: a small cost for small deviations and a very high cost for large deviations. (You make them piece-wise linear so that the formulation stays an LP/IP.)
Option 2: Max-min
This builds on your idea of minimizing the standard deviation of the d's. We want to maximize each d (increase the inter-event separation), but we also hugely punish (big cost) the particular d value that is the greatest. In plain English: we don't want any single d to get too large.
This is the min-max idea: minimize the maximum d value, but also maximize the individual d's.
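Continuing the PuLP sketch above (same m, t and n), the min-max piece could be expressed roughly like this; the big_cost weight is an arbitrary choice:

# Bound every gap by d_max and reward a large minimum gap d_min.
d_max = pulp.LpVariable("d_max", 0)
d_min = pulp.LpVariable("d_min", 0)
for i in range(n - 1):
    m += t[i + 1] - t[i] <= d_max
    m += t[i + 1] - t[i] >= d_min
big_cost = 100.0
m.setObjective(big_cost * d_max - d_min)   # punish the largest gap, encourage the smallest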
Option 3: Two LPs: Solve first, then move the events around in a second LP
One drawback of layering on more and more of these side constraints is that the formulation becomes complicated.
To address this, I have seen two (or more) passes used. You solve the base LP first, assign events and then in another
LP, you address the issue of uniformly distributing times.
The goal of the second LP is to move the events around, without breaking any hard constraints.
Option 3a: Choose one of many "copies"
To achieve this, we use the following idea:
We allow multiple possible time slots for an event, and make the model select one.
The Event e1 (currently assigned to time t1) is copied into (say) 3 other possible slots.
e11 + e12 + e13 + e14 = 1
The second model can choose to move the event to a "better" time slot, or leave it be. (The old solution
is always feasible.)
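For the "choose one copy" constraint, a fragment in the same PuLP style might look like this (the four candidate slot times are made up):

import pulp

# Candidate slots for event e1: its current time plus three alternatives (made-up times).
slots = {"e11": 10.0, "e12": 14.0, "e13": 18.0, "e14": 22.0}
pick = {name: pulp.LpVariable(name, cat="Binary") for name in slots}

m2 = pulp.LpProblem("reassign_e1", pulp.LpMinimize)
m2 += pulp.lpSum(pick.values()) == 1   # e11 + e12 + e13 + e14 = 1: exactly one copy is kept
chosen_time = pulp.lpSum(time * pick[name] for name, time in slots.items())
# ... the second LP's objective and hard constraints would then be written in terms of chosen_time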
The reason you are not seeing much about this in the CPLEX manuals is that these are all formulation ideas. If you search for job or event scheduling using LPs, you will come across a few PDFs that might be helpful.

High-dimensional clustering from aggregates of observations

I have fallen into this weird high-dimensional clustering problem. Here is an analogy to explain it.
Imagine that 2^10 people enter a forest, and we want to know how many bird species live there.
These birds differ from each other in, say, 128 dimensions, and all dimensions are binary. That is: either a bird has a large beak or a small beak, either it has a blue wing or it doesn't, etc. (Each bird species can be represented by 128 bits.)
My problem is that when the observers come out of the forest, we only have the aggregates of their observations:
"I saw 8 birds, 3 had blue beaks (5 didn't), 4 had blue wings (4 didn't), 1 had a large beak (7 didn't), etc". They do not report on the individual characteristics of their observations, but only on the aggregates of their observations.
There are two additional constraints:
i) all species are observed at least once;
ii) The number of species is small (~2^5).
Of course, we can compile the aggregate of their aggregates (of 3000 observations, 357 birds had large beaks, etc..). But what about the clusters?
So the questions are:
How can we find out how many species live there?
How can we find out the characteristics of each species?
Since 2^128 = 340282366920938463463374607431768211456, you would need a pretty high sample size to draw valid conclusions. Every bird observed could easily be unique.
If x is an aggregate observation of a set of birds by one person, then you can approximate it by the matrix product Dz, where D is a matrix whose columns represent the characteristics of the individual bird species and z is a vector of the counts of each species.
If you assume that only a small number of birds are observed, then this acts as a constraint on the magnitude of z.
This problem is very similar to the sparse dictionary learning problem.
Here are a couple of links that both describe sparse dictionary learning (and related problems) and provide software to solve it: http://spams-devel.gforge.inria.fr/ and http://www.ux.uis.no/~karlsk/dle/index.html
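As a rough illustration of that connection, here is a sketch with scikit-learn's DictionaryLearning on synthetic data shaped like the bird problem; note that scikit-learn writes the factorization as X ≈ code · dictionary, so the species signatures come out as rows of components_ rather than columns of D, and all sizes here are assumptions:

import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
species = rng.integers(0, 2, size=(32, 128))   # ~2^5 unknown species, 128 binary traits each
counts = rng.poisson(1.0, size=(1024, 32))     # how many of each species each of 2^10 people saw
X = counts @ species                           # each row is one person's aggregate report (x ~ Dz)

# Learn a dictionary whose atoms should recover the species signatures;
# positive_code reflects that the counts z are non-negative.
dl = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars",
                        positive_code=True, random_state=0)
codes = dl.fit_transform(X)   # estimated counts z per person
atoms = dl.components_        # estimated species signatures (one per row)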

How to select an unlike number in an array in C++?

I'm using C++ to write a ROOT script for some task. At some point I have an array of doubles in which most values are quite similar and one or two are different. I want to average all the numbers except those sore thumbs. How should I approach it? As an example, let's consider:
x = [2.3, 2.4, 2.11, 10.5, 1.9, 2.2, 11.2, 2.1]
I want to somehow average all the numbers except 10.5 and 11.2, the dissimilar ones. This algorithm is going to be repeated several thousand times and the array of doubles has 2000 entries, so optimization (while maintaining readability) is desired. Thanks SO!
Check out:
http://tinypic.com/r/111p0ya/3
The "dissimilar" numbers of the y-values of the pulse.
The point of this to determine the ground value for the waveform. I am comparing the most negative value to the ground and hoped to get a better method for grounding than to average the first N points in the sample.
Given that you are using ROOT you might consider looking at the TSpectrum classes which have support for extracting backgrounds from under an unspecified number of peaks...
I have never used them with so much baseline noise, but they ought to be robust.
BTW: what is the source of this data? The peak looks like a particle detector pulse, but the high level of background jitter suggests that you could really improve things with some fairly minor adjustments to the DAQ hardware, which might be better than trying to solve a difficult software problem.
Finally, unless you are restricted to some very primitive hardware (in which case why and how are you running ROOT?), if you only have a couple thousand such spectra you can afford a pretty slow algorithm. Or is that 2000 spectra per event and a high event rate?
If you can, maintain a sorted list; then you can easily chop off the head and the tail of the list each time you work out the average.
This is much like removing outliers based on the median (i.e. you're going to need two passes over the data: one to find the median, which is almost as slow as sorting for floating-point data, and another to calculate the average), but it requires less overhead when working out the average, at the cost of maintaining a sorted list. Which one is fastest will depend entirely on your circumstances. It may be, of course, that what you really want is the median anyway!
If you had discrete data (say, bytes: 256 possible values), you could use 256 histogram bins, with a single pass over your data counting the values that go into each bin; then it's really easy to find the median, approximate the mean, remove outliers, and so on. This would be my preferred option if you can afford to lose some of the precision in your data, followed by maintaining a sorted list, if that is appropriate for your data.
A quick way might be to take the median, and then take the average of the numbers not too far off from the median.
"Not too far off" being dependent on your project.
A good rule of thumb for determining likely outliers is to calculate the Interquartile Range (IQR), and then any values that are 1.5*IQR away from the nearest quartile are outliers.
This is the basic method many statistics systems (like R) use to automatically detect outliers.
Any method that is statistically rigorous and a good way to approach it (Dark Eru, Daniel White) will be too computationally intensive to repeat that often, and I think I've found a workaround that will allow later correction (meaning, leave it un-grounded).
Thanks for the suggestions. I'll look into them if I have time and want to see if their gain is worth the slowdown.
Here's a quick and dirty method that I've used before (works well if there are very few outliers at the beginning, and you don't have very complicated conditions for what constitutes an outlier)
The algorithm is O(N). The only really expensive part is the division.
The real advantage here is that you can have it up and running in a couple minutes.
#include <vector>
#include <cstddef>

// Running average that skips points deviating more than percentDeviation
// (relative) from the current average; assumes the values are positive.
double robustAverage(const std::vector<double>& data, double percentDeviation = 0.3)
{
    double sumX = data[0];    // initialize sum and average with the first point
    double avgX = data[0];
    int count = 1;
    for (std::size_t i = 1; i < data.size(); ++i) {
        const double x = data[i];
        if (x > avgX - avgX * percentDeviation &&
            x < avgX + avgX * percentDeviation) {
            ++count;
            sumX += x;
            avgX = sumX / count;
        }
    }
    return avgX;
}